We've proposed Feature Selection Gates (FSG) — a lightweight, plug-in module that injects differentiable attention gates into Vision Transformers (ViT). It learns instance-specific token relevance and routes gradients accordingly, leading to sparser, more focused, and often more interpretable attention flows.

📄 Papers:

  • Feature Selection Gates with Gradient Routing for Endoscopic Image Computing https://www.researchgate.net/publication/384576386_Feature_Selection_Gates_with_Gradient_Routing_for_Endoscopic_Image_Computing
  • Hard-Attention Gates with Gradient Routing for Endoscopic Image Computing https://www.researchgate.net/publication/382065314_Hard-Attention_Gates_with_Gradient_Routing_for_Endoscopic_Image_Computing

💻 Code (easy to integrate into ViT): https://github.com/cosmoimd/feature-selection-gates
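For readers who want the gist before opening the repo, the core idea can be sketched as a small module that scores each token and rescales embeddings by their learned relevance. This is a hedged sketch, not the authors' exact implementation; the module name, scorer design, and placement inside the ViT block are assumptions.

```python
# Minimal sketch of an FSG-style token gate (assumes PyTorch and the
# standard ViT token layout: batch x num_tokens x dim). Not the official
# implementation -- see the linked repo for the real one.
import torch
import torch.nn as nn

class FeatureSelectionGate(nn.Module):
    """Learns an instance-specific relevance score per token and
    attenuates low-relevance tokens before attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one relevance logit per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        gates = torch.sigmoid(self.scorer(x))  # (batch, num_tokens, 1), values in (0, 1)
        return x * gates                       # gated tokens, same shape as x

gate = FeatureSelectionGate(dim=64)
tokens = torch.randn(2, 16, 64)
out = gate(tokens)
print(out.shape)  # torch.Size([2, 16, 64])
```

Because the gate is differentiable, it trains end-to-end with the backbone and doubles as a per-token relevance map for interpretability.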

1) Can Feature Selection Gates (FSG) be generalized as a token-relevance mechanism across domains such as object detection, action recognition, or RT-DETR pipelines, especially where attention efficiency, interpretability, or data constraints matter?

FSG acts as a learnable filtering mechanism on attention weights. Could this paradigm offer a new class of attention regularizers or gradient routers that:

  • enhance data efficiency,
  • reduce overhead in dense token maps (e.g. videos, long sequences),
  • or guide attention toward semantically aligned regions (e.g. in detection or temporal reasoning)?
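On the "gradient router" angle: one common way to get a hard keep/drop decision while keeping the gate trainable is a straight-through estimator. The sketch below is an illustration of that general technique, not necessarily the routing scheme used in the papers; the threshold and function name are assumptions.

```python
# Hedged sketch: hard token gating with a straight-through estimator.
# Forward pass uses a binary mask; backward pass routes gradients
# through the soft sigmoid so the gate logits stay trainable.
import torch

def hard_gate_st(logits: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    soft = torch.sigmoid(logits)            # differentiable surrogate
    hard = (logits > threshold).float()     # binary keep/drop mask
    # Value equals `hard`; gradient flows through `soft`.
    return hard + (soft - soft.detach())

logits = torch.tensor([[-2.0, 0.5, 3.0]], requires_grad=True)
mask = hard_gate_st(logits)
# forward value: tensor([[0., 1., 1.]])
mask.sum().backward()
# logits.grad is populated despite the hard forward pass
```

In dense settings (video, long sequences), a hard mask like this could also be used to drop tokens outright and cut attention cost, which is where the efficiency question above becomes concrete.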

Looking forward to insights on use cases beyond medical imaging. Has anyone tried similar approaches in general vision tasks or Transformers beyond ViT?
