I recently had a chat with one of my best friends, a great machine learning scientist working at a very big company (Spire, Luxembourg). Our friendship goes way back: we did our Master's at the same university, and I learned a lot from him. Long story short, we were talking about Transformers, and I realized that my friend believes that what makes Transformers good is not the attention; it is the skip connections. He said the authors wanted to make it 'fancy', so of course they couldn't just say it was the skip connections, since those were invented a long time ago. Hence, it was the attention block that got all the interest and the spotlight. My friend believes attention is good, but not really needed in Transformers, as it does the same job the skip connections already do.
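To make the debate concrete, here is a minimal sketch of a standard post-norm Transformer encoder block in PyTorch, roughly following the "Attention Is All You Need" layout. Notice that each sublayer (attention and feed-forward) is wrapped in a skip connection: the input `x` is added back to the sublayer's output, so information can flow around the attention block entirely. The hyperparameters and shapes below are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """A minimal post-norm encoder block: attention and feed-forward
    sublayers, each wrapped in a skip (residual) connection."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention sublayer: the `x + ...` term is the skip connection,
        # routing the input around the attention computation.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward sublayer, again wrapped in a skip connection.
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

x = torch.randn(2, 16, 512)  # (batch, sequence length, d_model)
print(TransformerBlock()(x).shape)  # torch.Size([2, 16, 512])
```

If you deleted the attention sublayer, the skip connections would still carry the input through the stack untouched, which is exactly the intuition behind my friend's claim.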