The transformer architecture, initially conceived for natural language processing [5], has rapidly become the dominant paradigm across diverse domains, including computer vision, speech recognition, and multimodal learning [9, 13]. This success stems from its ability to model long-range dependencies through self-attention, yielding state-of-the-art performance on a wide array of tasks [1, 14]. However, the computational cost, memory footprint, and training instability of transformers have spurred a wave of research on optimizing and adapting the architecture for efficient deployment and broader applicability. This literature review surveys recent advances in post-transformer architectures, organized around four themes: architectural modifications for improved training and performance, post-training quantization for reduced resource consumption, hardware acceleration, and the application of transformers to novel tasks and domains.

Architectural Innovations for Enhanced Training and Performance

A central line of investigation refines the core transformer block itself. The placement of layer normalization (LN), a component crucial for stabilizing training and accelerating convergence, has been a focal point of debate. Two primary variants exist: Post-Layer-Normalization (Post-LN), where LN is applied after the residual addition, and Pre-Layer-Normalization (Pre-LN), where LN is applied to the sublayer input inside the residual branch [1]. Pre-LN generally yields more stable training, particularly for deep networks, but can limit the diversity of layer representations and thus model capacity, whereas Post-LN is prone to vanishing gradients that make deep models difficult to train [1, 6].
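To make the two placements concrete, the following minimal PyTorch sketch (an illustration, not code from any cited paper; the module structure and hyperparameter names d_model, n_heads, and d_ff are assumptions) contrasts a Post-LN block, which normalizes after the residual addition, with a Pre-LN block, which normalizes the sublayer input inside the residual branch.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm is applied after the residual addition."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])   # residual first, then normalize
        x = self.ln2(x + self.ffn(x))
        return x

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied to the sublayer input, inside the residual branch."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]             # normalize first, then add the residual
        x = x + self.ffn(self.ln2(x))
        return x
```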

To address these limitations, researchers have proposed novel hybrid approaches. ResiDual [1] introduces a Pre-Post-LN (PPLN) strategy, integrating connections from both Post-LN and Pre-LN. Theoretical analysis and empirical experiments demonstrate that ResiDual mitigates the gradient vanishing issue and maintains diverse model representations, outperforming both Pre-LN and Post-LN on various machine translation benchmarks [1]. Another approach, B2T Connection [6], focuses on modifying Post-LN to improve training stability without sacrificing performance. The authors identify the LN in Post-LN as a primary source of the vanishing gradient problem and propose a method to preserve larger gradient norms in higher layers during back-propagation [6]. HybridNorm [11] adopts a similar philosophy, combining QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. This design aims to leverage the benefits of both Pre-Norm and Post-Norm, leading to improved training stability and performance, particularly in large language models (LLMs) [11]. Peri-LN [22] places layer normalization peripherally around sublayers, achieving a balance in variance growth and gradient flow, leading to convergence stability in large-scale Transformer training [22].
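The dual-residual idea can be sketched as two parallel streams per block: one updated with Post-LN normalization and one that accumulates the raw, un-normalized residual updates. The sketch below is a rough illustration under that reading; the exact update and merge rules used by ResiDual are assumptions here and may differ from the paper's formulation.

```python
import torch
import torch.nn as nn

class DualResidualBlock(nn.Module):
    """Illustrative Pre-Post-LN style block: a Post-LN stream plus a
    Pre-LN-like stream that accumulates un-normalized residual updates."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer            # e.g. an attention or feed-forward module
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x_post, x_dual):
        update = self.sublayer(x_post)      # sublayer output computed on the Post-LN stream
        x_post = self.ln(x_post + update)   # Post-LN update: normalize after the residual
        x_dual = x_dual + update            # dual stream: accumulate un-normalized updates
        return x_post, x_dual

# After the last block the two streams would be merged (e.g. x_post + LayerNorm(x_dual)),
# so gradients also flow through the un-normalized dual stream.
ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = DualResidualBlock(64, ffn)
x = torch.randn(2, 10, 64)
x_post, x_dual = block(x, torch.zeros_like(x))
```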

Beyond layer normalization, other architectural modifications aim to improve efficiency and performance. AlgoFormer [13] proposes a transformer framework with algorithmic structures, incorporating a pre-transformer for task preprocessing, a looped transformer for iterative optimization, and a post-transformer for producing the desired results [13]. This design leverages prior knowledge of tasks and underlying algorithmic structures, enabling efficient performance in specific tasks [13]. SiamixFormer [25] introduces a fully-transformer Siamese network with temporal fusion for building and change detection in bi-temporal remote sensing images [25]. The model uses pre- and post-disaster images as input, with temporal transformers for feature fusion, outperforming state-of-the-art methods on relevant datasets [25].
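Structurally, the AlgoFormer design can be pictured as a three-stage pipeline: a pre-transformer, a weight-shared looped transformer applied iteratively, and a post-transformer. The sketch below conveys only this structure; the block implementations and the loop count (n_loops) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AlgoFormerSketch(nn.Module):
    """Structural sketch of a pre- / looped / post-transformer pipeline."""
    def __init__(self, pre_block, loop_block, post_block, n_loops=4):
        super().__init__()
        self.pre_block = pre_block      # task preprocessing
        self.loop_block = loop_block    # weight-shared block applied iteratively
        self.post_block = post_block    # maps the refined state to the final output
        self.n_loops = n_loops

    def forward(self, x):
        h = self.pre_block(x)
        for _ in range(self.n_loops):   # iterative refinement, mimicking algorithmic steps
            h = self.loop_block(h)
        return self.post_block(h)

# Toy usage with linear layers standing in for transformer blocks.
model = AlgoFormerSketch(nn.Linear(32, 32), nn.Linear(32, 32), nn.Linear(32, 8), n_loops=3)
y = model(torch.randn(4, 32))
```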

Post-Training Quantization for Resource Efficiency

The computational demands and memory requirements of transformers, especially large models, pose significant challenges for deployment on resource-constrained devices. Post-training quantization (PTQ) emerges as a promising solution, enabling reduced storage and computational costs by representing model weights and activations with lower precision [2, 3, 14, 15, 18]. However, the unique characteristics of transformer architectures, such as high dynamic activation ranges and the presence of structured outliers, complicate PTQ [15].
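As a point of reference, the sketch below shows the simplest form of PTQ: symmetric min-max quantization of a pretrained layer's weights with no retraining. It is a baseline illustration only, not one of the methods discussed next, which add calibration schemes and quantizer designs matched to transformer statistics.

```python
import torch

def quantize_per_tensor(w, n_bits=8):
    """Uniform, symmetric post-training quantization of a tensor.
    Returns the integer representation and the scale needed for dequantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax                        # min-max calibration on the tensor itself
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w_int.to(torch.int8), scale

def dequantize(w_int, scale):
    return w_int.float() * scale                        # approximate reconstruction

# Quantize a pretrained linear layer's weights without any retraining.
layer = torch.nn.Linear(768, 3072)
w_int, scale = quantize_per_tensor(layer.weight.data)
layer.weight.data = dequantize(w_int, scale)            # simulate int8 storage/compute
```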

Several studies have developed PTQ methods tailored specifically to transformers. AIQViT [2] introduces an architecture-informed low-rank compensation mechanism and a dynamic focusing quantizer to address, respectively, the information loss incurred by weight quantization and the unbalanced distribution of post-Softmax activations [2]. NoisyQuant [3] proposes a quantizer-agnostic enhancement that adds a fixed uniform noisy bias to the values being quantized, significantly reducing quantization error under provable conditions [3]. AdaLog [18] introduces a non-uniform quantizer with an adaptive logarithm base to accommodate the power-law-like distribution of activations while remaining hardware-friendly [18]. APQ-ViT [24] presents a unified Bottom-elimination Blockwise Calibration scheme and a Matthew-effect Preserving Quantization for Softmax to improve accuracy in low-bit-width settings [24]. Q-HyViT [12] addresses the difficulties of quantizing efficient hybrid vision transformers, proposing solutions for highly dynamic activation ranges, zero-point overflow, diverse normalization schemes, and limited parameter counts [12]. Collectively, these methods show that effective PTQ is not merely a matter of lowering bit-width; it requires quantizers and calibration schemes matched to the distributional peculiarities of transformers.
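To give a flavor of these activation-side fixes, the sketch below mimics the noisy-bias idea of NoisyQuant [3]: a fixed uniform noise tensor is added before rounding and its known contribution is removed after dequantization. The noise range, per-tensor scaling, and the way the correction is applied are simplifications and assumptions for illustration, not the paper's exact formulation (which fuses bias-correction terms into adjacent layers).

```python
import torch

def noisy_quantize(x, noise, n_bits=8):
    """Quantize activations after adding a fixed uniform noise tensor,
    then remove the (known) noise after dequantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = (x + noise).abs().max() / qmax
    x_int = torch.clamp(torch.round((x + noise) / scale), -qmax - 1, qmax)
    return x_int * scale - noise                 # dequantize, then subtract the known bias

# The noise is sampled once per layer at calibration time and reused for every input,
# so its correction can in principle be folded into adjacent linear layers at deployment.
step = 0.1                                       # assumed quantization step, for illustration only
noise = torch.empty(768).uniform_(-step / 2, step / 2)
act = torch.randn(4, 768)
act_hat = noisy_quantize(act, noise)
```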

Other work examines the challenges of efficient transformer quantization more broadly. One study shows that transformers pose unique quantization difficulties, notably high dynamic activation ranges that are hard to represent in low-bit fixed-point formats, and presents three remedies based on post-training quantization and quantization-aware training [15]. Another study applies tensor-train decomposition to compress transformer-based vision-language models while preserving accuracy, focusing on embedding-layer compression and partial tensorization of the network through an algorithmic approach [26].
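The gist of embedding-layer compression can be conveyed with a rank-factorized lookup table. The sketch below is a low-rank stand-in with an assumed rank hyperparameter, not the full tensor-train parameterization of [26], which would split the reshaped table into several small cores.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Low-rank stand-in for tensor-train embedding compression: the V x D table is
    replaced by two factors of rank r; a full tensor-train would use several small cores."""
    def __init__(self, vocab_size, d_model, rank=32):
        super().__init__()
        self.a = nn.Embedding(vocab_size, rank)         # V x r lookup
        self.b = nn.Linear(rank, d_model, bias=False)   # r x D projection

    def forward(self, token_ids):
        return self.b(self.a(token_ids))                # reconstructs embeddings on the fly

# ~(V*r + r*D) parameters instead of V*D: a 30k x 768 table shrinks from ~23M to ~1M.
emb = FactorizedEmbedding(vocab_size=30000, d_model=768, rank=32)
vectors = emb(torch.tensor([[1, 5, 42]]))
```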

Applications of Transformers in Diverse Domains

The versatility of the transformer architecture has led to its adoption across a wide range of applications, often requiring task-specific adaptations. This section highlights examples from various domains, showcasing the adaptability of the transformer framework.

In natural language processing, transformers remain the workhorse for a broad range of tasks. One study adapts a neural machine translation (NMT) architecture to automatic post-editing (APE), implementing it in a custom transformer model and exploring joint training of the APE task with a de-noising encoder [5]. Another work proposes an end-to-end set transformer for user-level classification of depression and gambling disorder [4]. The architecture processes the set of social media posts written by a particular individual, exploiting interactions between posts and eliminating label noise at the post level; it is interpretable with modern feature attribution methods and enables automatic dataset creation by identifying discriminating posts in a user's text set [4]. Further NLP advances include multilingual models for detecting check-worthy social media posts [17] and for hostility detection in Hindi posts [16].
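For the user-level set model of [4], the essential mechanism is that post representations attend to one another and are then pooled into a single user representation. The sketch below illustrates that mechanism with a generic encoder layer and attention pooling; the layer choices, dimensions, and learned pooling query are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class PostSetClassifier(nn.Module):
    """Posts attend to one another, then a learned query pools them into one
    user-level representation that feeds a classification head."""
    def __init__(self, d_model=256, n_heads=4, n_classes=2):
        super().__init__()
        self.interact = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.pool_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.pool_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, post_embs):                        # (batch, n_posts, d_model)
        h = self.interact(post_embs)                     # interactions between posts
        q = self.pool_query.expand(h.size(0), -1, -1)
        pooled, attn = self.pool_attn(q, h, h)           # attention weights hint at which posts matter
        return self.head(pooled.squeeze(1))              # user-level prediction

logits = PostSetClassifier()(torch.randn(2, 10, 256))    # 2 users, 10 post embeddings each
```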

In computer vision, transformers are being employed to solve a variety of problems. One study introduces a comprehensive dataset for event recognition in laparoscopic gynecology videos and proposes a hybrid transformer architecture that leverages inter-frame dependencies to counteract content occlusion and motion blur, significantly improving event recognition accuracy [9]. Beyond vision, GLassoformer [7] proposes a query-sparse transformer for post-fault power grid voltage prediction, and in speech coding, a GAN-based post-processor that relies on a-priori information transmitted from the encoder surpasses previously published methods, improving the quality of coded speech in both subjective evaluations and objective scores [10].

Transformers are also making inroads into more specialized problems. One study uses transformer models to detect a proxy for potential comorbid ADHD in people reporting anxiety symptoms in social media data [20]. In a different sense of the word "architecture," a qualitative study of architectural design issues in DevOps identified eight specific, contextual design issues faced by two practitioner teams and classified the issues discussed on Stack Overflow and DevOps Stack Exchange into 11 groups [21].

Hardware Acceleration for Enhanced Transformer Performance

Beyond architectural and algorithmic improvements, hardware acceleration plays a crucial role in enabling efficient transformer deployment. Several studies investigate specialized hardware designs to optimize transformer performance. T-REX [19] introduces novel training and post-training compression schemes to reduce external memory access during transformer model inference [19]. TATAA [28] employs mixed-precision arithmetic for both linear and non-linear operations in a unified and programmable framework [28]. The hardware switches between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations [28].
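Functionally, the mixed-precision split can be pictured as follows: matrix multiplications operate on int8 operands (accumulated at higher precision), while non-linear functions such as softmax run in bfloat16. The sketch below simulates this numerically in PyTorch using float tensors that hold integer values, since true int8 matmul kernels are hardware-specific; the scale factors are assumed calibration outputs, and the sketch is an illustration of the dataflow split rather than the accelerator's actual implementation.

```python
import torch

def mixed_precision_attention(q_int, k_int, v, scale_q, scale_k):
    """Integer-valued matmuls stand in for int8 MAC arrays; the softmax
    runs in bfloat16, standing in for a vectorized floating-point mode."""
    scores = torch.matmul(q_int, k_int.transpose(-1, -2))      # "int8 x int8" matmul (simulated in float)
    scores = scores.to(torch.bfloat16) * (scale_q * scale_k)   # rescale to real-valued logits
    probs = torch.softmax(scores, dim=-1)                      # non-linear op in bfloat16
    return torch.matmul(probs.float(), v)                      # second matmul, kept in float here

# Integer-valued activations as produced by an upstream quantization step.
q_int = torch.randint(-128, 128, (1, 4, 16)).float()
k_int = torch.randint(-128, 128, (1, 4, 16)).float()
v = torch.randn(1, 4, 16)
out = mixed_precision_attention(q_int, k_int, v, scale_q=0.02, scale_k=0.02)
```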

Future Directions

The field of post-transformer architecture is rapidly evolving, with several promising avenues for future research.

  • Further refinement of architectural designs: Continued exploration of hybrid normalization strategies, such as those presented in ResiDual [1], HybridNorm [11], and Peri-LN [22], is likely to yield further improvements in training stability, convergence speed, and performance. Investigating the optimal placement of attention mechanisms and feed-forward networks within the transformer block will be crucial.
  • Advanced quantization techniques: Research on PTQ should focus on developing more sophisticated quantizers that can effectively handle the complex activation distributions and outliers in transformers [2, 3, 18, 24]. This includes exploring mixed-precision quantization schemes, adaptive quantization methods, and quantization-aware training techniques.
  • Specialized hardware for efficient deployment: The development of specialized hardware accelerators, such as those proposed in T-REX [19] and TATAA [28], will be essential for enabling efficient deployment of transformers on resource-constrained devices and for real-time applications.
  • Adaptation to new modalities and tasks: Transformers are still being adapted to new modalities and tasks. The development of foundation transformers [23] that can serve as a go-to architecture for various tasks and modalities is an important goal.
  • Interpretability and explainability: As transformers become more complex, understanding their decision-making processes becomes increasingly important. Future research should focus on developing techniques for interpreting and explaining the behavior of transformer models, such as the feature attribution methods used in [4].
  • Integration of domain knowledge: Incorporating domain-specific knowledge into the transformer architecture can improve performance and efficiency. For example, the AlgoFormer [13] framework leverages prior knowledge of tasks and underlying algorithmic structures.
  • Robustness and reliability: Future research should focus on improving the robustness and reliability of transformers, particularly in the face of adversarial attacks and noisy data.

In conclusion, the post-transformer landscape is characterized by continuous innovation, driven by the need to overcome the limitations of the original architecture and to expand its applicability across diverse domains. Architectural modifications, advanced quantization techniques, specialized hardware, and task-specific adaptations are all contributing to the ongoing evolution of this transformative technology. Continued research in these areas will be crucial for unlocking the full potential of transformers and enabling their widespread deployment in the years to come.

==================================================

References

  • [1] Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan. ResiDual: Transformer with Dual Residual Connections. arXiv:2304.14802v1 (2023). Available at: http://arxiv.org/abs/2304.14802v1
  • [2] Runqing Jiang, Ye Zhang, Longguang Wang, Pengpeng Yu, Yulan Guo. AIQViT: Architecture-Informed Post-Training Quantization for Vision Transformers. arXiv:2502.04628v1 (2025). Available at: http://arxiv.org/abs/2502.04628v1
  • [3] Yijiang Liu, Huanrui Yang, Zhen Dong, Kurt Keutzer, Li Du, Shanghang Zhang. NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers. arXiv:2211.16056v2 (2022). Available at: http://arxiv.org/abs/2211.16056v2
  • [4] Ana-Maria Bucur, Adrian Cosma, Liviu P. Dinu, Paolo Rosso. An End-to-End Set Transformer for User-Level Classification of Depression and Gambling Disorder. arXiv:2207.00753v1 (2022). Available at: http://arxiv.org/abs/2207.00753v1
  • [5] Hongfei Xu, Qiuhui Liu, Josef van Genabith. UdS Submission for the WMT 19 Automatic Post-Editing Task. arXiv:1908.03402v1 (2019). Available at: http://arxiv.org/abs/1908.03402v1
  • [6] Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki. B2T Connection: Serving Stability and Performance in Deep Transformers. arXiv:2206.00330v2 (2022). Available at: http://arxiv.org/abs/2206.00330v2
  • [7] Yunling Zheng, Carson Hu, Guang Lin, Meng Yue, Bao Wang, Jack Xin. GLassoformer: A Query-Sparse Transformer for Post-Fault Power Grid Voltage Prediction. arXiv:2201.09145v1 (2022). Available at: http://arxiv.org/abs/2201.09145v1
  • [8] Yizhe Xiong, Wei Huang, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Zhenpeng Su, Jungong Han, Guiguang Ding. UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs. arXiv:2502.00439v1 (2025). Available at: http://arxiv.org/abs/2502.00439v1
  • [9] Sahar Nasirihaghighi, Negin Ghamsarian, Heinrich Husslein, Klaus Schoeffmann. Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers. arXiv:2312.00593v1 (2023). Available at: http://arxiv.org/abs/2312.00593v1
  • [10] Srikanth Korse, Nicola Pia, Kishan Gupta, Guillaume Fuchs. PostGAN: A GAN-Based Post-Processor to Enhance the Quality of Coded Speech. arXiv:2201.13093v1 (2022). Available at: http://arxiv.org/abs/2201.13093v1
  • [11] Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma. HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization. arXiv:2503.04598v1 (2025). Available at: http://arxiv.org/abs/2503.04598v1
  • [12] Jemin Lee, Yongin Kwon, Sihyeong Park, Misun Yu, Jeman Park, Hwanjun Song. Q-HyViT: Post-Training Quantization of Hybrid Vision Transformers with Bridge Block Reconstruction for IoT Systems. arXiv:2303.12557v3 (2023). Available at: http://arxiv.org/abs/2303.12557v3
  • [13] Yihang Gao, Chuanyang Zheng, Enze Xie, Han Shi, Tianyang Hu, Yu Li, Michael K. Ng, Zhenguo Li, Zhaoqiang Liu. AlgoFormer: An Efficient Transformer Framework with Algorithmic Structures. arXiv:2402.13572v2 (2024). Available at: http://arxiv.org/abs/2402.13572v2
  • [14] Zhenhua Liu, Yunhe Wang, Kai Han, Siwei Ma, Wen Gao. Post-Training Quantization for Vision Transformer. arXiv:2106.14156v1 (2021). Available at: http://arxiv.org/abs/2106.14156v1
  • [15] Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort. Understanding and Overcoming the Challenges of Efficient Transformer Quantization. arXiv:2109.12948v1 (2021). Available at: http://arxiv.org/abs/2109.12948v1
  • [16] Arkadipta De, Venkatesh E, Kaushal Kumar Maurya, Maunendra Sankar Desarkar. Coarse and Fine-Grained Hostility Detection in Hindi Posts using Fine Tuned Multilingual Embeddings. arXiv:2101.04998v1 (2021). Available at: http://arxiv.org/abs/2101.04998v1
  • [17] Sebastian Kula, Michal Gregor. Multilingual Models for Check-Worthy Social Media Posts Detection. arXiv:2408.06737v1 (2024). Available at: http://arxiv.org/abs/2408.06737v1
  • [18] Zhuguanyu Wu, Jiaxin Chen, Hanwen Zhong, Di Huang, Yunhong Wang. AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer. arXiv:2407.12951v1 (2024). Available at: http://arxiv.org/abs/2407.12951v1
  • [19] Seunghyun Moon, Mao Li, Gregory Chen, Phil Knag, Ram Krishnamurthy, Mingoo Seok. T-REX: A 68-567 μs/token, 0.41-3.95 μJ/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET. arXiv:2503.00322v1 (2025). Available at: http://arxiv.org/abs/2503.00322v1
  • [20] Claire S. Lee, Noelle Lim, Michael Guerzhoy. Detecting a Proxy for Potential Comorbid ADHD in People Reporting Anxiety Symptoms from Social Media Data. arXiv:2403.05561v1 (2024). Available at: http://arxiv.org/abs/2403.05561v1
  • [21] Mojtaba Shahin, Ali Rezaei Nasab, Muhammad Ali Babar. A Qualitative Study of Architectural Design Issues in DevOps. arXiv:2108.06705v2 (2021). Available at: http://arxiv.org/abs/2108.06705v2
  • [22] Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, Kang Min Yoo. Peri-LN: Revisiting Layer Normalization in the Transformer Architecture. arXiv:2502.02732v2 (2025). Available at: http://arxiv.org/abs/2502.02732v2
  • [23] Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun Liu, Vishrav Chaudhary, Xia Song, Furu Wei. Foundation Transformers. arXiv:2210.06423v2 (2022). Available at: http://arxiv.org/abs/2210.06423v2
  • [24] Yifu Ding, Haotong Qin, Qinghua Yan, Zhenhua Chai, Junjie Liu, Xiaolin Wei, Xianglong Liu. Towards Accurate Post-Training Quantization for Vision Transformer. arXiv:2303.14341v1 (2023). Available at: http://arxiv.org/abs/2303.14341v1
  • [25] Amir Mohammadian, Foad Ghaderi. SiamixFormer: A Fully-Transformer Siamese Network with Temporal Fusion for Accurate Building Detection and Change Detection in Bi-Temporal Remote Sensing Images. arXiv:2208.00657v2 (2022). Available at: http://arxiv.org/abs/2208.00657v2
  • [26] Subhadra Vadlamannati, Ryan Solgi. Partial Tensorized Transformers for Natural Language Processing. arXiv:2310.20077v1 (2023). Available at: http://arxiv.org/abs/2310.20077v1
  • [27] Bing Li, Ning Chen, Ulf Schlichtmann. Fast Statistical Timing Analysis for Circuits with Post-Silicon Tunable Clock Buffers. arXiv:1705.04979v1 (2017). Available at: http://arxiv.org/abs/1705.04979v1
  • [28] Jiajun Wu, Mo Song, Jingmin Zhao, Yizhao Gao, Jia Li, Hayden Kwok-Hay So. TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture. arXiv:2411.03697v1 (2024). Available at: http://arxiv.org/abs/2411.03697v1