I am new to machine learning and am currently doing research on speech emotion recognition (SER) using deep learning. I found that most of the recent literature uses CNNs, and that only a few papers use RNNs for SER. I also found that most approaches use MFCCs as input features.
My questions are:
- Is it true that CNNs have been shown to outperform RNNs in SER?
- If yes, what limitations do RNNs have compared with CNNs?
- What are the limitations of the existing CNN approaches in SER?
- Why are MFCCs the most commonly used features in SER? Do MFCCs have any limitations?
Any help or guidance would be appreciated.