To generate a spectrogram from a compressed audio file, one needs to decompress the file, perform an STFT to get a spectrogram, and then optionally transform it into a mel-spectrogram, MFCCs, etc. Any variant seems to work, since the performance doesn't differ much between them. The spectrogram is then used as input to a convolutional neural network.
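For reference, the decompress-then-STFT pipeline I mean looks roughly like this (a minimal sketch using `scipy.signal.stft`; a synthetic sine wave stands in for the decoded audio, since the decoding step itself depends on the codec):

```python
import numpy as np
from scipy.signal import stft

# Stand-in for the decompressed waveform (normally the output of a decoder).
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)  # 1 second of a 440 Hz tone

# STFT -> complex time-frequency matrix -> magnitude spectrogram.
f, frames, Z = stft(x, fs=sr, nperseg=1024)
spec = np.abs(Z)  # shape: (freq_bins, time_frames)

# `spec` (optionally mel-warped / log-scaled first) is what feeds the CNN.
print(spec.shape)
```

The mel/MFCC step would just be an extra matrix multiply (mel filter bank) and optional DCT applied to `spec` before it reaches the network.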
IIRC the Ogg Vorbis file format stores the filter bank coefficients as MDCT coefficients.
Can we skip the decompression and STFT steps and just use the MDCT coefficients somehow?