If your question concerns how sound is perceived, then the features should reflect that: psychoacoustic measures such as loudness, sharpness and roughness may be useful.
It may also be worth looking at the application of scattering networks to the classification of audio/music signals [Andén et al. https://www.di.ens.fr/~mallat/papiers/IEEESignalAndenLostanlen.pdf]. The paper also bridges the gap between time-frequency analysis (e.g. MFCCs, mel-spectrograms) and deep neural networks.
I have experience with the Wigner-Ville distribution (WVD), cepstral analysis, the short-time Fourier transform (STFT) and the wavelet transform (WT), each used as input to a convolutional neural network (CNN). Some literature argues that the WVD is superior to other time-frequency transforms for classifying non-stationary signals with CNNs. In our application (classification of ultrasound backscatter), however, WVD, STFT and WT gave comparable results, while cepstral analysis was inferior.
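To make the comparison concrete, here is a minimal NumPy sketch of two of the representations mentioned above, the STFT magnitude and the discrete WVD, computed as 2-D arrays that could be fed to a CNN as images. This is not the pipeline from our application: the function names, the window length, the hop size and the test tone are all assumptions chosen for illustration, and the WVD here is the plain, unsmoothed version, which shows cross-term interference on multi-component signals.

```python
import numpy as np

def stft_magnitude(x, n_fft=128, hop=32):
    """Magnitude STFT with a Hann window; shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T

def analytic_signal(x):
    """Analytic signal via the FFT (zero out the negative frequencies)."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1 : n // 2] = 2.0
    else:
        gain[1 : (n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def wigner_ville(x):
    """Unsmoothed discrete WVD of a real signal; shape (n, n) = (freq, time).

    Frequency bin k corresponds to k / (2 * n) cycles per sample, because
    the lag product z[t + tau] * conj(z[t - tau]) doubles the phase rate.
    """
    z = analytic_signal(x)
    n = len(z)
    wvd = np.zeros((n, n))
    for t in range(n):
        max_lag = min(t, n - 1 - t)          # stay inside the signal
        lags = np.arange(-max_lag, max_lag + 1)
        kernel = np.zeros(n, dtype=complex)
        kernel[lags % n] = z[t + lags] * np.conj(z[t - lags])
        # The kernel is Hermitian in the lag, so its FFT is real.
        wvd[:, t] = np.fft.fft(kernel).real
    return wvd

# Hypothetical test signal: a pure tone at 0.125 cycles per sample.
t = np.arange(256)
x = np.cos(2 * np.pi * 0.125 * t)

S = stft_magnitude(x)   # 2-D array, frequency x frames
W = wigner_ville(x)     # 2-D array, frequency x time
print(S.shape, W.shape)
```

Before passing either array to a CNN you would typically log-scale or otherwise normalise it, and for real data a smoothed (pseudo) WVD is usually a more practical choice than the raw one sketched here.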