Feeding a spectrogram or STFT plot of audio to a 2D CNN is one approach to audio classification. Are there other, similar image-like representations?
In our work, the system learns the mel spectrogram representation itself. In fact, it is a deep neural network that, among other things, automatically learns the mel spectrogram layer, emphasizing certain frequencies over others.
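To make the idea of a learned mel spectrogram layer more concrete, here is a minimal NumPy sketch. It is not the implementation from our work, only an illustration under some assumptions: the layer is modeled as a weight matrix initialized from a standard triangular mel filterbank, and a training loop (not shown) would then adjust those weights. The names `LearnableMelLayer`, `mel_filterbank` and `power_spectrogram`, and all parameter values, are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def power_spectrogram(signal, n_fft, hop):
    """Hann-windowed STFT power, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop:t * hop + n_fft] * window
                       for t in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

class LearnableMelLayer:
    """Hypothetical layer: weights start as mel filters and could be trained."""

    def __init__(self, n_mels=40, n_fft=512, sample_rate=16000):
        # Assumption: initialize from the fixed mel filterbank, then let
        # an optimizer update self.weights (training is not shown here).
        self.weights = mel_filterbank(n_mels, n_fft, sample_rate)

    def forward(self, power_spec):
        # Clamp to keep filters non-negative, then log-compress.
        w = np.maximum(self.weights, 0.0)
        return np.log(power_spec @ w.T + 1e-6)

# Usage: one second of a 440 Hz tone turned into an image-like array.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
spec = power_spectrogram(tone, n_fft=512, hop=160)
layer = LearnableMelLayer(n_mels=40, n_fft=512, sample_rate=sr)
image = layer.forward(spec)  # rows are time frames, columns are mel bands
```

The resulting `image` array can be fed to a 2D CNN like any other spectrogram. Because the filterbank is just a matrix, gradients from the classifier could flow into it, which is one plausible way a network ends up focusing on certain frequencies more than others.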