1) Can self-attention be reframed as a dynamic, instance-specific feature selection mechanism, and how might this perspective inform the development of more interpretable or efficient Transformer models? (See the first sketch after this list.)
2) Is it feasible to integrate multi-hop affinity propagation (as in Inf-FS) directly into attention mechanisms to capture deeper token interactions within a single layer, and what would be the theoretical or computational trade-offs? (See the second sketch after this list.)
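To make the first question concrete, the following is a minimal NumPy sketch of reading a single self-attention head as an instance-specific selector over its tokens. The projection matrices, the scaled-dot-product affinity, and in particular the use of column-means of the attention matrix as per-token selection scores are illustrative assumptions, not a mechanism proposed in this paper.

```python
# Sketch: one attention head read as a dynamic, instance-specific selector.
# All shapes, projections, and the column-mean scoring rule are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_selection_scores(X, Wq, Wk):
    """Return the row-stochastic affinity matrix A and per-token scores.

    X  : (n, d) token embeddings of ONE input instance
    Wq : (d, d_k) query projection (hypothetical, for illustration)
    Wk : (d, d_k) key projection   (hypothetical, for illustration)
    """
    Q, K = X @ Wq, X @ Wk
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (n, n) pairwise affinities
    # Column-mean of A: how much attention each token receives on average,
    # read here as an instance-specific selection score for that token.
    scores = A.mean(axis=0)
    return A, scores

# Toy usage: rank the tokens of one instance by received attention.
rng = np.random.default_rng(0)
n, d, d_k = 6, 16, 8
X = rng.normal(size=(n, d))
Wq, Wk = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
A, scores = attention_selection_scores(X, Wq, Wk)
ranking = np.argsort(scores)[::-1]  # tokens ordered by selection score
```

Because the scores are recomputed from each instance's own affinity matrix, the "selected" tokens change per input, which is what distinguishes this reading from static feature selection.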
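For the second question, a hedged sketch of folding Inf-FS-style multi-hop affinity propagation into a single attention layer is given below: the one-hop attention matrix A is replaced by the geometric series S = sum_{l>=1} alpha^l A^l, evaluated in closed form as (I - alpha*A)^{-1} - I. The mixing weight `alpha`, the row re-normalization, and the value projection are illustrative choices, not a construction from the paper.

```python
# Sketch: a single attention layer whose mixing matrix aggregates paths of
# every length (multi-hop affinity propagation, in the spirit of Inf-FS).
# `alpha`, the re-normalization, and the projections are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_hop_attention(X, Wq, Wk, Wv, alpha=0.5):
    """Mix values with all-hop affinities instead of one-hop attention.

    Since A is row-stochastic, any 0 < alpha < 1 makes the series
    sum_{l>=1} (alpha*A)^l converge to (I - alpha*A)^{-1} - I.
    """
    n = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)   # one-hop affinities
    S = np.linalg.inv(np.eye(n) - alpha * A) - np.eye(n)   # all-hop affinities
    S = S / S.sum(axis=-1, keepdims=True)                  # re-normalize rows
    return S @ V                                           # mix values with S
```

The n x n inverse costs O(n^3), versus O(n^2 d) for standard attention, which is one concrete face of the computational trade-off raised in the question; a truncated power series would trade exactness for a lower cost.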