Hello, as far as speech is concerned, you can try Turboscribe.ai or Clideo.com. Both offer freemium options. With the former, you can upload an audio or video file, or provide a link to a YouTube video. With the latter, you must upload a video file.
The success of metallurgists, oil exploration engineers, etc., in extracting their precious products results from accurate knowledge of the nature of the products themselves. But ask a well-informed research scientist what determines a musical tone or any speech sound, and the best response is a guess. How, then, could I recommend the best method for extracting what no one knows from where it does not exist? Our knowledge of speech and music as prescribed by psychoacoustic procedures is fundamentally flawed and practically futile. I recommend that we start all over again from scratch, as there exists no valid method for extracting functional features of music and speech. Here is my evidence:
https://doi.org/10.18103/mra.v11i12.4828 My other relevant articles are cited in its references; they show that music, speech and hearing research amounts to material and human resources being flushed down the drain.
For speech (phonetic content): higher-order MFCCs would be ideal.
For music: both higher-order MFCCs and lower-order MFCCs (coefficients 1-4) would probably be helpful. Lower-order MFCCs are useful for pitch and spectral energy distribution.
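To make "lower-order" and "higher-order" concrete, here is a minimal sketch of MFCC extraction for a single frame, written in plain Python so it runs without extra packages. The function names, the synthetic test tone, and the parameter choices (8 kHz sample rate, 256-sample frame, 26 mel filters, 20 coefficients) are all assumptions made for illustration, not something taken from the cited paper. In practice you would more likely use an existing implementation such as `librosa.feature.mfcc`.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of a Hamming-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2)
    mel_points = [mel_max * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising slope
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope, peak of 1 at center
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank


def mfcc(frame, sample_rate, n_filters=26, n_coeffs=20):
    """MFCCs of one frame: log mel-filterbank energies followed by a DCT-II."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Small floor avoids log(0) if a filter happens to capture no energy.
    log_energies = [math.log(max(sum(w * p for w, p in zip(filt, spectrum)), 1e-10))
                    for filt in bank]
    return [sum(e * math.cos(math.pi * n * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energies))
            for n in range(n_coeffs)]


# Synthetic test frame (an assumption): 200 Hz tone with 19 decaying harmonics.
sample_rate = 8000
frame = [sum(math.sin(2 * math.pi * 200 * h * t / sample_rate) / h
             for h in range(1, 20))
         for t in range(256)]

coeffs = mfcc(frame, sample_rate)
lower = coeffs[1:5]    # coefficients 1-4, the "lower-order" group mentioned above
higher = coeffs[5:]    # remaining "higher-order" coefficients
print("lower-order :", [round(c, 2) for c in lower])
print("higher-order:", [round(c, 2) for c in higher])
```

Coefficient 0 is left out of both groups here because it mostly tracks overall frame energy: scaling the signal changes coefficient 0 but leaves the others unchanged. Where exactly to draw the line between "lower" and "higher" order is a design choice; the split at coefficient 4 simply follows the range given above.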
F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro, and K. R. Scherer, “On the Acoustics of Emotion in Audio: What Speech, Music and Sound have in Common,” Frontiers in Emotion Science, vol. 4, Article 292, pp. 1–12, May 2013.