1) Can self-attention be reframed as a dynamic, instance-specific feature selection mechanism, and how might this perspective inform the development of more interpretable or efficient Transformer models? (See the first sketch after this list.)
2) Is it feasible to integrate multi-hop affinity propagation (as in Inf-FS) directly into attention mechanisms to capture deeper token interactions within a single layer, and what would be the theoretical or computational trade-offs? (See the second sketch after this list.)
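To make the first question concrete, the following is a minimal NumPy sketch of reading a single self-attention head as an instance-specific selector over its tokens. The projection matrices, the scaled-dot-product affinity, and in particular the use of column-means of the attention matrix as per-token selection scores are illustrative assumptions, not a mechanism proposed in this paper.

```python
# Sketch: one attention head read as a dynamic, instance-specific selector.
# All shapes, projections, and the column-mean scoring rule are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_selection_scores(X, Wq, Wk):
    """Return the row-stochastic affinity matrix A and per-token scores.

    X  : (n, d) token embeddings of ONE input instance
    Wq : (d, d_k) query projection (hypothetical, for illustration)
    Wk : (d, d_k) key projection   (hypothetical, for illustration)
    """
    Q, K = X @ Wq, X @ Wk
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (n, n) pairwise affinities
    # Column-mean of A: how much attention each token receives on average,
    # read here as an instance-specific selection score for that token.
    scores = A.mean(axis=0)
    return A, scores

# Toy usage: rank the tokens of one instance by received attention.
rng = np.random.default_rng(0)
n, d, d_k = 6, 16, 8
X = rng.normal(size=(n, d))
Wq, Wk = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
A, scores = attention_selection_scores(X, Wq, Wk)
ranking = np.argsort(scores)[::-1]  # tokens ordered by selection score
```

Because the scores are recomputed from each instance's own affinity matrix, the "selected" tokens change per input, which is what distinguishes this reading from static feature selection.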
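For the second question, a hedged sketch of folding Inf-FS-style multi-hop affinity propagation into a single attention layer is given below: the one-hop attention matrix A is replaced by the geometric series S = sum_{l>=1} alpha^l A^l, evaluated in closed form as (I - alpha*A)^{-1} - I. The mixing weight `alpha`, the row re-normalization, and the value projection are illustrative choices, not a construction from the paper.

```python
# Sketch: a single attention layer whose mixing matrix aggregates paths of
# every length (multi-hop affinity propagation, in the spirit of Inf-FS).
# `alpha`, the re-normalization, and the projections are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_hop_attention(X, Wq, Wk, Wv, alpha=0.5):
    """Mix values with all-hop affinities instead of one-hop attention.

    Since A is row-stochastic, any 0 < alpha < 1 makes the series
    sum_{l>=1} (alpha*A)^l converge to (I - alpha*A)^{-1} - I.
    """
    n = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)   # one-hop affinities
    S = np.linalg.inv(np.eye(n) - alpha * A) - np.eye(n)   # all-hop affinities
    S = S / S.sum(axis=-1, keepdims=True)                  # re-normalize rows
    return S @ V                                           # mix values with S
```

The n x n inverse costs O(n^3), versus O(n^2 d) for standard attention, which is one concrete face of the computational trade-off raised in the question; a truncated power series would trade exactness for a lower cost.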