To split audio recordings into short utterances, I found two approaches:

  • identification of speech regions using acoustic properties of the speech signal, such as energy
  • classification of regions (speech/non-speech) using statistical approaches such as neural networks (NNs)

However, I am struggling to find an efficient solution because the recordings may contain noise or music. Moreover, I have neither a trained model nor an annotated corpus with which to build a new model.
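
For concreteness, the energy-based approach (the first bullet) boils down to something like the rough sketch below. It uses NumPy only. The function name `energy_vad`, the frame sizes, and the thresholds are placeholder choices of mine, not tuned values, and the signal is assumed to be mono float samples in [-1, 1].

```python
import numpy as np

def energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10,
               threshold_db=-35.0, floor_db=-60.0, min_speech_ms=200):
    """Return (start_s, end_s) segments with high short-time energy.

    All parameter defaults are illustrative assumptions, not tuned values.
    """
    signal = np.asarray(signal, dtype=np.float64)
    frame = int(sample_rate * frame_ms / 1000)   # frame length in samples
    hop = int(sample_rate * hop_ms / 1000)       # hop length in samples
    if len(signal) < frame:
        return []

    # Mean-square energy of each frame, converted to dB.
    n_frames = 1 + (len(signal) - frame) // hop
    energies = np.array([
        np.mean(signal[i * hop:i * hop + frame] ** 2) for i in range(n_frames)
    ])
    energy_db = 10.0 * np.log10(energies + 1e-12)

    # A frame is "active" if it is within threshold_db of the loudest frame
    # and also above an absolute floor (so pure silence yields nothing).
    active = (energy_db > energy_db.max() + threshold_db) & (energy_db > floor_db)

    # Group consecutive active frames into [start, end) frame ranges.
    runs, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, n_frames))

    # Drop runs shorter than min_speech_ms and convert frames to seconds.
    min_frames = min_speech_ms / hop_ms
    return [
        (s * hop / sample_rate, ((e - 1) * hop + frame) / sample_rate)
        for s, e in runs if (e - s) >= min_frames
    ]
```

This is fine on clean recordings, but music and steady background noise also carry energy above the threshold, so they get marked as speech. That is exactly where I am stuck.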

Any advice would be greatly appreciated.

Thanks in advance.
