Audio codecs such as MP3 and AAC usually process the audio data frame by frame. I am curious whether it would be feasible to process an entire clip as a single frame instead.
The commonly accepted view is that frame-based processing retains time resolution, whereas single-frame processing has none. This is analogous to comparing the STFT with a single DFT taken over the whole signal.
But why do we need time resolution during compression? For a given audio clip, a single-frame FFT has extremely fine frequency resolution (a very large number of points) and no time resolution. Even so, we can still identify tonal and non-tonal components, compute masking curves, generate quantization indices, and so on. Any modification to a frequency bin is then reflected across the whole time axis of the decoded audio, wherever that frequency occurs.
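To make the last point concrete, here is a minimal sketch of how a change to frequency bins spreads in time. It is only a toy, not how MP3 or AAC work: the test signal (silence followed by a tone burst), the plain uniform quantizer, the step size, the non-overlapping 64-sample frames, and the helper names `fft`, `quantize` and `codec` are all assumptions chosen for illustration. There is no psychoacoustic model. The sketch quantizes the spectrum once over the whole clip and once frame by frame, then measures how much error lands in the part of the clip that was originally silent.

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two), unnormalized."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def quantize(spec, step):
    """Uniform quantizer on real and imaginary parts (toy stand-in for a codec)."""
    return [complex(round(c.real / step) * step, round(c.imag / step) * step)
            for c in spec]

def codec(block, step):
    """Transform, quantize, inverse transform one block."""
    rec = fft(quantize(fft(block), step), inverse=True)
    return [v.real / len(block) for v in rec]

# Hypothetical test clip: 128 samples of silence, then a 128-sample tone burst.
N, ONSET, FRAME, STEP = 256, 128, 64, 4.0
x = [0.0] * ONSET + [math.sin(2 * math.pi * 0.123 * n) for n in range(ONSET, N)]

# One frame covering the whole clip versus four short frames.
single = codec(x, STEP)
framed = []
for start in range(0, N, FRAME):
    framed.extend(codec(x[start:start + FRAME], STEP))

# Error energy inside the originally silent first half.
pre_single = sum((a - b) ** 2 for a, b in zip(single[:ONSET], x[:ONSET]))
pre_framed = sum((a - b) ** 2 for a, b in zip(framed[:ONSET], x[:ONSET]))

print("error energy before the onset, single frame:", pre_single)
print("error energy before the onset, short frames:", pre_framed)
```

In this toy setup the short frames that contain only silence are reconstructed exactly, while the single-frame version puts quantization error into the silent region as well, because each modified bin corresponds to a sinusoid spanning the entire clip.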
I personally do not see any problem with the single-frame compression described above. The only difficulty I can imagine is implementing such a large transform (FFT or DCT) in hardware. But the computational complexity of the FFT is O(n log n), which grows only slightly faster than linearly in n, so given how quickly computing power has improved I do not see this as a big problem.
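For a rough sense of scale, the O(n log n) cost can be compared per sample. This is a back-of-the-envelope sketch only: the 44.1 kHz sample rate, the 5-minute clip, the 1024-sample frame and the assumption of non-overlapping frames are illustrative choices, and constant factors, memory and latency are ignored.

```python
import math

SAMPLE_RATE = 44100   # assumed sample rate in Hz
CLIP_SECONDS = 300    # assumed clip length (5 minutes)
FRAME = 1024          # assumed frame length for the framed case

n_clip = SAMPLE_RATE * CLIP_SECONDS

# n * log2(n) operations spread over n samples is about log2(n) per sample.
per_sample_framed = math.log2(FRAME)
per_sample_single = math.log2(n_clip)

print("samples in clip:", n_clip)
print("log2 cost per sample, framed:", per_sample_framed)
print("log2 cost per sample, single frame:", round(per_sample_single, 2))
print("ratio:", round(per_sample_single / per_sample_framed, 2))
```

Under these assumptions the per-sample cost of one huge transform is only a small constant multiple of the framed cost, which is consistent with the view that raw operation count is not the main obstacle.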
Please point out any mistakes in the reasoning above.