I am new to machine learning and am currently doing research on speech emotion recognition (SER) using deep learning. I found that recent papers mostly use CNNs, and that only a few use RNNs for SER. I also found that most approaches use MFCCs as input features.

My questions are:

- Is it true that CNNs have been shown to outperform RNNs in SER?

- If yes, what limitations do RNNs have compared with CNNs?

- Also, what are the limitations of the existing CNN approaches in SER?

- Why are MFCCs the most widely used features in SER? Do MFCCs have any limitations?

Any help or guidance would be appreciated.
