1) Can self-attention be reframed as a dynamic, instance-specific feature selection mechanism, and how might this perspective inform the development of more interpretable or efficient Transformer models?
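A minimal sketch of this reading (the function name, toy dimensions, and random projections are assumptions for illustration, not a reference implementation): each row of the softmax attention matrix is a relevance distribution over the input tokens for that query, i.e., a soft, input-dependent feature-selection mask rather than a fixed one.

```python
import numpy as np

def soft_feature_selection_view(X, Wq, Wk):
    """Interpret single-head self-attention weights as per-query,
    instance-specific soft feature-selection scores over the sequence."""
    Q, K = X @ Wq, X @ Wk                        # project tokens to queries/keys
    scores = Q @ K.T / np.sqrt(Wq.shape[1])      # scaled pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # softmax: each row sums to 1
    # Row i of A is a relevance distribution over tokens for query i:
    # a dynamic "feature selector" that changes with the input instance.
    return A

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 16, 8
X = rng.standard_normal((n_tokens, d_model))
Wq = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
Wk = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
A = soft_feature_selection_view(X, Wq, Wk)
print("top-2 'selected' tokens per query:", np.argsort(-A, axis=1)[:, :2])
```

Under this view, interpretability amounts to inspecting which tokens each query "selects", and efficiency ideas (sparsifying or pruning attention) become feature-selection decisions.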

2) Is it feasible to integrate multi-hop affinity propagation (as in Inf-FS) directly into attention mechanisms to capture deeper token interactions within a single layer, and what would be the theoretical or computational trade-offs?
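One way to make the trade-off concrete (a hypothetical sketch, not the Inf-FS authors' formulation): treat the row-stochastic attention matrix as a token-affinity graph and aggregate walks of every length with the geometric series $\sum_{l \ge 1} \alpha^l A^l = (I - \alpha A)^{-1} - I$, which converges for $\alpha < 1$ because a row-stochastic $A$ has spectral radius 1. The price is an extra $O(n^3)$ inverse (or truncated series) per head, so a single layer sees multi-hop interactions at a higher per-layer cost.

```python
import numpy as np

def multi_hop_attention(A, V, alpha=0.5):
    """Mix values with multi-hop affinities S = sum_{l>=1} alpha^l A^l,
    computed in closed form as (I - alpha*A)^{-1} - I (Inf-FS-style series).
    Assumes A is row-stochastic (softmax output), so alpha < 1 converges."""
    n = A.shape[0]
    S = np.linalg.inv(np.eye(n) - alpha * A) - np.eye(n)  # paths of all lengths
    S /= S.sum(axis=1, keepdims=True)                     # renormalize rows
    return S @ V   # one layer now aggregates multi-hop token interactions

rng = np.random.default_rng(1)
n_tokens, d_v = 6, 8
logits = rng.standard_normal((n_tokens, n_tokens))
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)          # single-hop attention matrix
V = rng.standard_normal((n_tokens, d_v))
out_one_hop = A @ V                        # standard attention output
out_multi_hop = multi_hop_attention(A, V)  # multi-hop variant
print(out_one_hop.shape, out_multi_hop.shape)
```

The closed-form inverse trades the cheap one-hop mixing of standard attention for dense long-range propagation in a single layer; a truncated power series would give an intermediate cost/depth trade-off.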
