Hello everyone, I'm working on audio analysis for emotion classification. I'm using Parselmouth (a Praat integration for Python) to extract features, and I'm a beginner who is not well versed in audio analysis. After reading many papers and forum threads, I see that MFCCs are commonly used for this task. I've also come across some other features (jitter, shimmer, HNR, F0, zero-crossing rate): are they used for this kind of work too?
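Here is the minimal extraction sketch I have so far with Parselmouth. The pitch range (75–500 Hz) and the jitter/shimmer parameters are just Praat's usual defaults, which I haven't tuned for my recordings:

```python
import numpy as np
import parselmouth
from parselmouth.praat import call

def extract_features(path):
    snd = parselmouth.Sound(path)

    # F0 contour in Hz; unvoiced frames come back as 0 and are dropped here
    pitch = snd.to_pitch(pitch_floor=75, pitch_ceiling=500)
    f0 = pitch.selected_array['frequency']
    f0 = f0[f0 > 0]  # NOTE: an all-unvoiced clip would need special handling

    # Jitter and shimmer require a PointProcess (glottal pulse marks)
    point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)
    jitter = call(point_process, "Get jitter (local)",
                  0, 0, 0.0001, 0.02, 1.3)
    shimmer = call([snd, point_process], "Get shimmer (local)",
                   0, 0, 0.0001, 0.02, 1.3, 1.6)

    # Harmonics-to-noise ratio (dB), averaged over the whole file
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)

    # Zero-crossing rate computed from the raw samples (not a Praat feature)
    samples = snd.values[0]
    zcr = np.mean(np.abs(np.diff(np.sign(samples)))) / 2

    # MFCCs: mean of each coefficient (plus energy c0) as a simple summary
    mfcc = snd.to_mfcc(number_of_coefficients=13).to_array()
    mfcc_means = mfcc.mean(axis=1)

    return np.concatenate(
        ([f0.mean(), f0.std(), jitter, shimmer, hnr, zcr], mfcc_means))
```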

What do I need to do with the audio files (preprocessing) before extracting the MFCCs and these other features?
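For context, this is the preprocessing I currently have in mind, using librosa for loading: mono conversion, resampling to a fixed rate, trimming leading/trailing silence, and peak normalization. I'm not sure these are the right steps, hence the question:

```python
import librosa
import numpy as np
import parselmouth

def load_clean(path, sr=16000):
    y, _ = librosa.load(path, sr=sr, mono=True)  # resample + downmix to mono
    y, _ = librosa.effects.trim(y, top_db=30)    # trim leading/trailing silence
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak                             # peak-normalize to [-1, 1]
    # Hand the cleaned signal to parselmouth for feature extraction
    return parselmouth.Sound(y.astype('float64'), sampling_frequency=sr)
```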

Once I have these features, I need to predict the emotion with machine learning.
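Concretely, I was planning something like the following with scikit-learn, where `X` is the `(n_clips, n_features)` matrix from the extraction step and `y` holds the emotion labels (e.g. from a labeled corpus such as RAVDESS):

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale features, then classify with an RBF-kernel SVM (one common baseline)
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))

# 5-fold cross-validated accuracy as a first sanity check
scores = cross_val_score(clf, X, y, cv=5)
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```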

It'll involve:

- The algorithm must be able to make predictions in real time, or near real time.

- Taking into account the sex and the neutral (baseline) voice of each person, for example by centering and scaling the model's variables so that only their deviations from the speaker's own mean are considered. That mean would be updated as the sequential analysis progresses: first computed over 0 to 1 second, then over 0 to 2 seconds, and so on (a sketch of this follows below the list).
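For the second point, this is the running normalization I have in mind: one normalizer per speaker, updated with each new one-second window of features (Welford's online mean/variance algorithm), so that each window is z-scored against the statistics accumulated so far for that speaker:

```python
import numpy as np

class RunningNormalizer:
    """Per-speaker running z-score over a growing sequence of feature windows."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = np.zeros(n_features)
        self.m2 = np.zeros(n_features)   # running sum of squared deviations

    def update(self, x):
        # Welford's online update: incorporate one new feature vector
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        # z-score against the statistics accumulated so far
        std = np.sqrt(self.m2 / max(self.n, 1))
        std = np.where(std > 0, std, 1.0)  # avoid division by zero early on
        return (x - self.mean) / std

# Usage: one normalizer per speaker; for each new 1 s window,
# update the statistics, then normalize before classifying.
# (n_features=20 is hypothetical; match your feature vector length.)
# norm = RunningNormalizer(n_features=20)
# norm.update(window_features)
# z = norm.normalize(window_features)
```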

Any help and suggestions on best practices are welcome.

Thanks
