Has CNN taken over RNN in Speech Emotion Recognition? If yes, why?

12 June 2020

I am new to machine learning and am currently doing research on speech emotion recognition (SER) using deep learning. I found that recent literature mostly uses CNNs, and that there are only a few papers on SER using RNNs. I also found that most approaches use MFCCs.

My questions are:

- Is it true that CNNs have been shown to outperform RNNs in SER?

- If yes, what limitations do RNNs have compared with CNNs?

- Also, what are the limitations of the existing CNN approaches in SER?

- Why is MFCC used the most in SER? Does MFCC have any limitations?

Any help or guidance would be appreciated.

Md Sahidullah

Answers to some of your queries are as follows:

- It depends on the network configuration, the way the training examples are created, and the datasets. The lack of systematic benchmarking of existing methods, however, creates confusion. Several studies show that LSTMs outperform CNNs for speech emotion recognition; in particular, an LSTM with an attention mechanism helps to boost emotion recognition performance. Other studies report the opposite, and there a CNN seems to be the better choice. You can do a quick survey of the latest papers published in the last couple of INTERSPEECH, ICASSP, and ASRU conferences.

- MFCC is the default choice for most speech processing tasks, including speech emotion recognition. However, MFCC is not optimal, as it lacks prosodic and long-term information. That is why MFCCs are often augmented with pitch (more specifically, log F0) and/or shifted delta coefficients. This additional information helps to boost emotion recognition performance. MFCC also lacks phase information, but the role of phase in emotion recognition has not been investigated much. The parameters for MFCC computation, such as the number of filters and the frequency scale, are chosen experimentally and depend on the dataset and the back-end classifier.
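As a rough illustration of the augmentation described above, here is a minimal NumPy sketch that appends log F0 and shifted delta coefficients (SDC) to a matrix of MFCCs. The function names, the default SDC parameters (d=1, P=3, k=7), the edge-frame padding, and the choice of 0 as the log F0 value for unvoiced frames are all assumptions made for this example, not something prescribed in the answer. The MFCCs and the F0 track are assumed to be computed elsewhere, one row per frame.

```python
import numpy as np

def shifted_delta_coefficients(cepstra, d=1, p=3, k=7):
    """Stack k delta blocks per frame: c[t + i*p + d] - c[t + i*p - d], i = 0..k-1.

    cepstra: array of shape (num_frames, num_coeffs).
    Returns an array of shape (num_frames, num_coeffs * k).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames = cepstra.shape[0]
    # Assumption: repeat the first/last frame so every shifted index is valid.
    pad_after = (k - 1) * p + d
    padded = np.pad(cepstra, ((d, pad_after), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        centre = i * p + d  # position of frame (t + i*p) in `padded` for t = 0
        ahead = padded[centre + d : centre + d + num_frames]
        behind = padded[centre - d : centre - d + num_frames]
        blocks.append(ahead - behind)
    return np.concatenate(blocks, axis=1)

def augment_mfcc(mfcc, f0_hz, d=1, p=3, k=7):
    """Return [MFCC | log F0 | SDC] per frame.

    f0_hz: one F0 value per frame in Hz, with 0 marking unvoiced frames.
    """
    mfcc = np.asarray(mfcc, dtype=float)
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0
    # Assumption: unvoiced frames get log F0 = 0; other conventions exist.
    log_f0 = np.zeros_like(f0_hz)
    log_f0[voiced] = np.log(f0_hz[voiced])
    sdc = shifted_delta_coefficients(mfcc, d=d, p=p, k=k)
    return np.concatenate([mfcc, log_f0[:, None], sdc], axis=1)
```

With 13 MFCCs per frame and the defaults above, each frame ends up with 13 + 1 + 13 × 7 = 105 features. The SDC block is what brings in longer-term context, since it spans (k − 1)·P + 2d + 1 frames around each time step.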

Shahab Pasha

- The answer is yes. I believe it has been shown both practically and theoretically that a CNN is the more suitable network for applications that use time-frequency representations such as spectrograms. If the signals are analyzed purely in the time domain or the frequency domain, an RNN would be handier.

- An RNN is more efficient on vector inputs than on matrices. I can't really see any reason to choose an RNN over a CNN when the input is a matrix.

- As with any other deep network, optimizing the parameters and the network configuration (convolution layers) is the most challenging task when using CNNs.
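To make the vector-versus-matrix distinction concrete, here is a small NumPy sketch of the two ways a network can consume a time-frequency feature matrix: a recurrent layer reads it one frame vector at a time, while a convolutional layer slides a 2-D kernel over it as if it were an image. The random feature matrix, the layer sizes, and the helper names are placeholders chosen for this illustration; real models would use a deep learning framework and trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for a log-mel spectrogram: 100 frames x 40 frequency bands.
features = rng.standard_normal((100, 40))

def rnn_last_state(frames, w_in, w_rec):
    """Plain (Elman) recurrence: consume one frame vector per time step."""
    h = np.zeros(w_rec.shape[0])
    for x in frames:  # x has shape (num_bands,)
        h = np.tanh(w_in @ x + w_rec @ h)
    return h

def conv2d_valid(image, kernel):
    """2-D cross-correlation (what deep learning libraries call convolution)
    without padding: the kernel sees time and frequency neighbours at once."""
    windows = np.lib.stride_tricks.sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# RNN view: a sequence of 100 vectors, each of length 40.
w_in = rng.standard_normal((16, 40)) * 0.1
w_rec = rng.standard_normal((16, 16)) * 0.1
summary = rnn_last_state(features, w_in, w_rec)

# CNN view: one 100 x 40 single-channel "image".
kernel = rng.standard_normal((3, 3))
feature_map = conv2d_valid(features, kernel)

print(summary.shape)      # (16,)
print(feature_map.shape)  # (98, 38)
```

The recurrent layer only ever mixes information along the time axis, whereas each output of the 3 × 3 kernel depends on a local patch in both time and frequency.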

Reema Ahmad

RNNs are more effective for SER, while CNNs are more effective for image and video.

The MFCC is used for SER because it relates to the human
