I want to implement the Deep Retinal Convolution Neural Network for speech emotion recognition described in this paper: https://arxiv.org/ftp/arxiv/papers/1707/1707.09917.pdf. The authors report 99% accuracy on the IEMOCAP and EMO-DB databases.
What I understood from the paper is that I first have to convert the voice recordings into spectrograms using the Data Augmentation Algorithm Based on Retinal Imaging Principle (DAARIP), and then feed these into a DCNN.
I am having a hard time breaking this approach down into simple steps.
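To make the question concrete, here is a minimal sketch of one possible reading of the pipeline, in Python with NumPy only. It rests on assumptions, not on the paper's exact method: it assumes DAARIP amounts to rescaling each spectrogram to several sizes (as if the image were viewed at different distances) and then bringing every copy back to one fixed input size. The function names, the STFT settings, the scale factors and the 64x64 input size are all hypothetical, and the DCNN itself is left out.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude spectrogram from a plain Hann-windowed STFT.

    Returns an array of shape (frame_len // 2 + 1, n_frames).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude).T


def rescale(image, factor):
    """Nearest-neighbour resize of a 2-D array by a scale factor."""
    h, w = image.shape
    new_h = max(1, int(round(h * factor)))
    new_w = max(1, int(round(w * factor)))
    rows = np.minimum((np.arange(new_h) / factor).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / factor).astype(int), w - 1)
    return image[np.ix_(rows, cols)]


def augment_by_scale(spec, factors=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """ASSUMPTION: stand-in for DAARIP, one rescaled copy per factor."""
    return [rescale(spec, f) for f in factors]


def to_fixed_size(image, size=(64, 64)):
    """Nearest-neighbour resample to the fixed size a CNN would expect."""
    h, w = image.shape
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return image[np.ix_(rows, cols)]


def prepare_batch(signal):
    """Spectrogram -> scaled copies -> fixed-size batch (n_copies, 64, 64)."""
    spec = log_spectrogram(signal)
    return np.stack([to_fixed_size(img) for img in augment_by_scale(spec)])
```

Calling `prepare_batch` on a one-second, 16 kHz waveform would give a batch of five 64x64 images from a single utterance, which could then be passed to whatever DCNN is used. Whether this matches what the authors actually do in DAARIP is exactly the part I am unsure about.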