We've proposed Feature Selection Gates (FSG) — a lightweight, plug-in module that injects differentiable attention gates into Vision Transformers (ViT). It learns instance-specific token relevance and routes gradients accordingly, leading to sparser, more focused, and often more interpretable attention flows.
📄 Papers:
💻 Code (easy to integrate into ViT): https://github.com/cosmoimd/feature-selection-gates
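Conceptually, the gate is easy to sketch. Below is a minimal, illustrative PyTorch version of the general idea, not the repo's exact implementation: a small MLP scores each token, a sigmoid maps scores to soft gates in (0, 1), and tokens are rescaled by their gates, so relevance is learned end-to-end with the backbone. The module name, scorer shape, and hidden width are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureSelectionGate(nn.Module):
    """Illustrative sketch of an instance-specific token gate for a ViT.

    A small MLP scores each token; a sigmoid maps scores to soft gates
    in (0, 1) that rescale the tokens. Because the gate is differentiable,
    gradients flow through it, letting the model learn which tokens matter
    for each input. (Hypothetical structure, not the paper's exact code.)
    """

    def __init__(self, dim: int, hidden_dim: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        gates = torch.sigmoid(self.scorer(tokens))  # (batch, num_tokens, 1)
        return tokens * gates                       # soft token selection


# Usage: slot the gate inside a ViT block, e.g. after the attention sub-layer.
x = torch.randn(2, 197, 768)                 # ViT-B/16: CLS token + 196 patches
x_gated = FeatureSelectionGate(dim=768)(x)   # same shape, per-instance rescaling
print(x_gated.shape)                         # torch.Size([2, 197, 768])
```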
1) Can Feature Selection Gates (FSG) be generalized as a token relevance mechanism across tasks like object detection and action recognition, or in architectures like RT-DETR, especially where attention efficiency, interpretability, or data constraints matter?
2) FSG acts as a learnable filtering mechanism on attention weights. Could this paradigm yield a new class of attention regularizers or gradient routers for Transformers more broadly? A rough sketch of what I mean is below.
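For concreteness, here is a hypothetical variant of that idea, again not the paper's exact formulation: per-key gates rescale the post-softmax attention map, renormalization keeps each row a distribution, and an L1 penalty on the gates acts as a simple sparsity regularizer. All names and the penalty weight are placeholders.

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Hypothetical sketch: learnable filtering on attention weights.

    Per-key, per-head gates rescale the post-softmax attention map; rows
    are renormalized so they remain distributions. An L1 term on the gates
    is returned as a simple sparsity regularizer.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.gate_scorer = nn.Linear(dim, num_heads)  # one gate per token, per head

    def forward(self, x: torch.Tensor):
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)                   # (B, H, N, N)

        gates = torch.sigmoid(self.gate_scorer(x))    # (B, N, H)
        gates = gates.permute(0, 2, 1).unsqueeze(2)   # (B, H, 1, N): gate per key
        attn = attn * gates
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)  # renormalize

        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        sparsity = gates.abs().mean()                 # L1 push toward sparse focus
        return self.proj(out), sparsity


# Usage: add the returned penalty to the task loss with a small weight.
x = torch.randn(2, 197, 768)
out, reg = GatedAttention(dim=768)(x)
# loss = task_loss + 1e-4 * reg   # penalty weight is a guess, tune per task
```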
Looking forward to insights on use cases beyond medical imaging. Has anyone tried similar approaches in general vision tasks or Transformers beyond ViT?