Hello,

I would like to apply a self-attention mechanism to a multichannel audio spectrogram, i.e. a 3D tensor. In the original Transformer paper, self-attention is applied to vectors (embedded words) within a temporal sequence. On my multichannel spectrogram, I would like to apply self-attention over both the temporal and frequency axes, so that the vectors being attended to run along the channel axis.

The tf.keras.layers.MultiHeadAttention layer has an attention_axes parameter that looks relevant to my problem: for an input of shape (batch_size, nFrames, nFreqBins, nDim) I could set it to (1, 2) and hope attention is applied over the time and frequency dimensions. However, I don't understand how it works, since it differs from the original Transformer paper, and I can't find any paper addressing self-attention over several dimensions in this manner.
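For reference, here is the kind of minimal sketch I have in mind (the shapes are made up, and I am assuming the axes are counted on the full input tensor, with batch at axis 0 and the feature dimension last; the attention-score shape is my reading of the docs, so correct me if it's wrong):

```python
import tensorflow as tf

batch_size, n_frames, n_freq_bins, n_dim = 2, 20, 16, 8

# Dummy multichannel spectrogram: (batch, time, frequency, features)
x = tf.random.normal((batch_size, n_frames, n_freq_bins, n_dim))

# Attention applied jointly over the time and frequency axes (1 and 2);
# axis 0 is the batch and the last axis holds the feature vectors.
mha = tf.keras.layers.MultiHeadAttention(
    num_heads=4, key_dim=16, attention_axes=(1, 2)
)

y, scores = mha(x, x, return_attention_scores=True)
print(y.shape)       # (2, 20, 16, 8)  -- same shape as the input
print(scores.shape)  # (2, 4, 20, 16, 20, 16) -- per head, one score for
                     # every (frame, bin) query/key pair
```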

The source code doesn't help either: the algorithm is split into several sub-modules that are not self-explanatory to me.
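For what it's worth, from skimming the einsum-based implementation, my current guess (unverified, so please correct me) is that attention_axes=(1, 2) is equivalent to flattening time and frequency into one long sequence and running standard 1D self-attention over it, since the projections only act on the last (feature) axis and the softmax is taken jointly over all key attention axes. The sketch below tests that guess by copying the weights between the two configurations; all sizes are arbitrary:

```python
import numpy as np
import tensorflow as tf

batch, T, F, D = 2, 10, 8, 16
x = tf.random.normal((batch, T, F, D))

# Same layer twice: once attending jointly over (time, freq),
# once as plain 1D attention over a flattened sequence.
mha_2d = tf.keras.layers.MultiHeadAttention(
    num_heads=4, key_dim=8, attention_axes=(1, 2)
)
mha_flat = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=8)

y_2d = mha_2d(x, x)                         # (batch, T, F, D)

x_flat = tf.reshape(x, (batch, T * F, D))   # position (t, f) -> t*F + f
_ = mha_flat(x_flat, x_flat)                # build, then copy weights --
mha_flat.set_weights(mha_2d.get_weights())  # kernel shapes match since the
                                            # projections ignore attn axes
y_flat = mha_flat(x_flat, x_flat)           # (batch, T*F, D)

# If my reading is right, this prints True (up to float32 error)
print(np.allclose(y_2d.numpy().reshape(batch, T * F, D),
                  y_flat.numpy(), atol=1e-5))
```

If that equivalence holds, then attention_axes is essentially a convenience that preserves the tensor layout while scoring every (frame, bin) pair against every other, which grows quadratically in nFrames * nFreqBins. But I'd like confirmation that this is actually what the layer computes.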

Any insights would be much appreciated!

Thanks a lot
