To split audio recordings into short utterances, I found two solutions:
However, I am struggling to find an efficient solution because the recordings may contain noise or music. Moreover, I have neither a trained model nor an annotated corpus to build a new model.
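For context, the simplest approach that needs no model and no annotated data is plain energy thresholding: cut wherever the signal stays quiet for long enough. Below is a minimal sketch of that baseline, not necessarily either of the two solutions mentioned above. The function name `split_on_silence`, the frame size, the threshold, and the minimum silence length are all illustrative assumptions, and samples are assumed to be floats in [-1, 1].

```python
import math

def split_on_silence(samples, sample_rate, frame_ms=20,
                     threshold=0.02, min_silence_ms=200):
    """Return (start, end) sample indices of regions whose frame RMS
    stays at or above `threshold` (hypothetical helper, baseline only)."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    min_silent_frames = max(1, int(min_silence_ms / frame_ms))
    n_frames = len(samples) // frame_len  # trailing partial frame is ignored

    segments = []
    start = None      # sample index where the current loud region began
    silent_run = 0    # consecutive quiet frames seen inside a region

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= threshold:
            if start is None:
                start = i * frame_len
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # Close the region right after its last loud frame.
                end = (i - silent_run + 1) * frame_len
                segments.append((start, end))
                start = None
                silent_run = 0

    if start is not None:
        # Region still open at the end: trim any trailing quiet frames.
        segments.append((start, (n_frames - silent_run) * frame_len))
    return segments

# Synthetic example: silence, loud, silence, loud, short silence (1 kHz rate).
signal = [0.0] * 200 + [0.5] * 400 + [0.0] * 400 + [0.5] * 300 + [0.0] * 100
print(split_on_silence(signal, sample_rate=1000))
```

Because this looks only at loudness, background music or steady noise sits above the threshold just like speech does, which is exactly why it does not work well on my recordings.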
Any advice would be greatly appreciated.
Thanks in advance.