The transformer architecture, initially conceived for natural language processing [5], has rapidly become the dominant paradigm across diverse domains, including computer vision, speech recognition, and multimodal learning [9, 13]. This success stems from its ability to model long-range dependencies through the self-attention mechanism, yielding state-of-the-art performance on a wide array of tasks [1, 14]. However, the computational cost, memory footprint, and training instability of transformers have spurred a wave of research focused on optimizing and adapting the architecture for efficient deployment and broader applicability. This literature review surveys recent advancements in post-transformer architectures, focusing on architectural modifications for improved training and performance, post-training quantization for reduced resource consumption, applications of transformers to novel tasks, and hardware acceleration.
Architectural Innovations for Enhanced Training and Performance
A central area of investigation concerns refining the core transformer architecture. The placement of layer normalization (LN), a crucial component for stabilizing training and accelerating convergence, has been a focal point of debate. Two primary variants exist: Post-Layer-Normalization (Post-LN), where LN is applied after the residual addition, and Pre-Layer-Normalization (Pre-LN), where LN is applied to the sublayer input inside the residual branch [1]. Pre-LN generally yields more stable training, particularly for deep networks, but can limit model capacity, whereas Post-LN is prone to vanishing gradients [1, 6].
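To make the distinction concrete, the following minimal sketch contrasts the two placements for a single residual sublayer; it illustrates the standard formulations rather than code from any of the cited works.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: normalize after the residual addition."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-LN: normalize the sublayer input, inside the residual branch."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))

# `sublayer` stands in for either the self-attention or feed-forward module.
block = PreLNBlock(d_model=512, sublayer=nn.Linear(512, 512))
y = block(torch.randn(2, 16, 512))
```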
To address these limitations, researchers have proposed hybrid approaches. ResiDual [1] introduces a Pre-Post-LN (PPLN) strategy that integrates residual connections from both Post-LN and Pre-LN. Theoretical analysis and empirical experiments show that ResiDual avoids the gradient vanishing issue while maintaining diverse model representations, outperforming both Pre-LN and Post-LN on several machine translation benchmarks [1]. B2T Connection [6] instead modifies Post-LN to improve training stability without sacrificing performance: the authors identify the LN in Post-LN as a primary source of vanishing gradients and propose a connection that preserves larger gradient norms in higher layers during back-propagation [6]. HybridNorm [11] adopts a similar philosophy, combining QKV normalization within the attention mechanism with Post-Norm in the feed-forward network (FFN) of each transformer block; this design aims to capture the benefits of both Pre-Norm and Post-Norm, improving training stability and performance, particularly in large language models (LLMs) [11]. Peri-LN [22] places layer normalization peripherally around each sublayer, balancing variance growth and gradient flow and thereby stabilizing convergence in large-scale Transformer training [22].
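As a rough illustration of how such hybrid placements can be expressed, the sketch below combines QKV normalization inside the attention sublayer with Post-Norm around the FFN, loosely following the description of HybridNorm above; the norm type, residual placement, and other details are assumptions, and the exact formulation in [11] may differ.

```python
import torch
import torch.nn as nn

class HybridNormStyleBlock(nn.Module):
    """Sketch only: QKV normalization inside attention, Post-Norm around the FFN.
    Norm type and residual placement are assumptions, not the exact design of [11]."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.q_norm = nn.LayerNorm(d_model)
        self.k_norm = nn.LayerNorm(d_model)
        self.v_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize the inputs feeding the query/key/value projections.
        q, k, v = self.q_norm(x), self.k_norm(x), self.v_norm(x)
        attn_out, _ = self.attn(q, k, v, need_weights=False)
        x = x + attn_out
        # Post-Norm: normalize after the FFN residual addition.
        return self.ffn_norm(x + self.ffn(x))
```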
Beyond layer normalization, other architectural modifications target efficiency and performance. AlgoFormer [13] proposes a transformer framework with algorithmic structure, comprising a pre-transformer for task preprocessing, a looped transformer for iterative optimization, and a post-transformer for producing the final output [13]. This design exploits prior knowledge of the task and its underlying algorithmic structure, enabling efficient performance on specific tasks [13]. SiamixFormer [25] introduces a fully transformer-based Siamese network with temporal fusion for building and change detection in bi-temporal remote sensing images [25]. The model takes pre- and post-disaster images as input and fuses their features with temporal transformers, outperforming state-of-the-art methods on the relevant datasets [25].
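The pre/looped/post composition can be sketched as follows; layer configurations and the number of loop iterations are illustrative assumptions rather than the settings used in [13].

```python
import torch
import torch.nn as nn

class AlgoFormerStyle(nn.Module):
    """Structural sketch of the pre/looped/post composition described for AlgoFormer [13].
    Layer counts, widths, and loop iterations are illustrative assumptions."""
    def __init__(self, d_model: int, n_heads: int, n_loops: int = 4):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.pre = make()      # task preprocessing
        self.looped = make()   # reused each iteration, mimicking an optimization step
        self.post = make()     # produces the final output
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pre(x)
        for _ in range(self.n_loops):   # weight-shared iterative refinement
            x = self.looped(x)
        return self.post(x)
```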
Post-Training Quantization for Resource Efficiency
The computational demands and memory requirements of transformers, especially large models, pose significant challenges for deployment on resource-constrained devices. Post-training quantization (PTQ) emerges as a promising solution, enabling reduced storage and computational costs by representing model weights and activations with lower precision [2, 3, 14, 15, 18]. However, the unique characteristics of transformer architectures, such as high dynamic activation ranges and the presence of structured outliers, complicate PTQ [15].
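For readers unfamiliar with PTQ, the following sketch shows a generic asymmetric (affine) quantizer and illustrates how a single activation outlier stretches the quantization range; this is a textbook-style illustration, not the method of any particular paper cited here.

```python
import torch

def quantize_affine(x: torch.Tensor, n_bits: int = 8):
    """Generic asymmetric post-training quantization (illustrative only)."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(qmin - x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    return (q - zero_point) * scale

# A single large outlier stretches the quantization range and wastes resolution
# on the bulk of the values -- the difficulty noted for transformer activations [15].
x = torch.randn(4096)
x[0] = 40.0  # simulated outlier
q, s, z = quantize_affine(x)
mean_error = (dequantize_affine(q, s, z) - x).abs().mean()
```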
Several studies have focused on developing PTQ methods specifically tailored for transformers. AIQViT [2] introduces an architecture-informed low-rank compensation mechanism and a dynamic focusing quantizer to address, respectively, the information loss incurred by weight quantization and the unbalanced distribution of post-Softmax activations [2]. NoisyQuant [3] proposes a quantizer-agnostic enhancement that adds a fixed, uniformly sampled noisy bias to the values being quantized, significantly reducing quantization error under provable conditions [3]. AdaLog [18] introduces a non-uniform quantizer with an adaptive logarithm base to accommodate the power-law-like distributions of activations while remaining hardware-friendly [18]. APQ-ViT [24] presents a unified Bottom-elimination Blockwise Calibration scheme and a Matthew-effect Preserving Quantization for Softmax to improve accuracy in low-bit-width settings [24]. Q-HyViT [12] addresses challenges in quantizing efficient hybrid vision transformers, proposing solutions for highly dynamic ranges, zero-point overflow, diverse normalization, and limited model parameters [12]. Collectively, these methods show that effective PTQ for transformers is not simply a matter of lowering bit-widths; it also requires handling the architecture's distinctive activation statistics.
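As a much-simplified illustration of one of these ideas, the sketch below adds a fixed noisy bias before quantization and removes it after dequantization, in the spirit of NoisyQuant [3]; the noise scale and the uniform quantizer used here are assumptions, not the paper's exact formulation or error analysis.

```python
import torch

def noisy_quant_sketch(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Simplified sketch of adding a fixed noisy bias before quantization [3].
    Noise scale and quantizer are assumptions, not the paper's formulation."""
    qmin, qmax = 0, 2 ** n_bits - 1
    # In practice the noisy bias is sampled once per layer and then kept fixed.
    noise = torch.empty_like(x).uniform_(-0.5, 0.5)
    y = x + noise
    scale = (y.max() - y.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(qmin - y.min() / scale)
    q = torch.clamp(torch.round(y / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale - noise  # dequantize, then subtract the known bias
```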
Other studies focus on understanding and overcoming the challenges of efficient transformer quantization [15]. One study shows that transformers pose unique quantization challenges, such as high dynamic activation ranges that are difficult to represent in a low-bit fixed-point format, and presents three solutions based on post-training quantization and quantization-aware training [15]. Another study explores tensor-train decomposition for compressing transformer-based vision-language networks while preserving accuracy, focusing on embedding-layer compression and partial tensorization of the network through an algorithmic approach [26].
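The simplest instance of this idea, shown below, factorizes the embedding matrix with a truncated SVD, which corresponds to a two-core tensor train; the higher-order tensorization used in [26] generalizes this, so the sketch is only a minimal illustration of the compression principle.

```python
import torch

def compress_embedding_low_rank(weight: torch.Tensor, rank: int):
    """Truncated-SVD factorization of an embedding matrix -- the two-core special
    case of a tensor train. The higher-order tensorization in [26] generalizes this."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (vocab, rank)
    B = Vh[:rank, :]             # (rank, d_model); lookup becomes A[token_ids] @ B
    return A, B

vocab, d_model, rank = 32000, 768, 64
W = torch.randn(vocab, d_model)
A, B = compress_embedding_low_rank(W, rank)
compression_ratio = W.numel() / (A.numel() + B.numel())  # about 11.7x here
```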
Applications of Transformers in Diverse Domains
The versatility of the transformer architecture has led to its adoption across a wide range of applications, often requiring task-specific adaptations. This section highlights examples from various domains, showcasing the adaptability of the transformer framework.
In natural language processing, transformers remain the workhorse for a variety of tasks. One study adapts a neural machine translation (NMT) architecture to automatic post-editing (APE), implementing it in a custom transformer model and exploring joint training of the APE task with a de-noising encoder [5]. Another work proposes an end-to-end set transformer for user-level classification of depression and gambling disorder [4]. The architecture processes the set of social media posts from a given individual, exploiting interactions between posts and eliminating label noise at the post level [4]; the model is interpretable with modern feature-attribution methods and enables automatic dataset creation by identifying discriminating posts in a user's text set [4]. Further NLP applications include multilingual models for detecting check-worthy social media posts [17] and for hostility detection in Hindi posts [16].
In computer vision, one study introduces a comprehensive dataset for event recognition in laparoscopic gynecology videos and proposes a hybrid transformer architecture to recognize specific events [9]. The architecture leverages inter-frame dependencies to counteract the adverse effects of content occlusion and motion blur, significantly enhancing event-recognition accuracy [9]. Beyond vision, GLassoformer [7] proposes a query-sparse transformer for post-fault power grid voltage prediction [7], and in speech coding, another study presents a post-processor that relies on a-priori information transmitted from the encoder [10]; subjective evaluations and objective scores show that this post-processor surpasses previously published methods and improves the quality of coded speech [10].
Transformers are also making inroads into more specialized domains. One study explores their use to detect a proxy for potential comorbid ADHD in people reporting anxiety symptoms on social media [20]. Another examines architectural design issues in DevOps, identifying eight specific, contextual design issues faced by two teams and classifying the issues discussed on Stack Overflow and DevOps Stack Exchange into 11 groups [21].
Hardware Acceleration for Enhanced Transformer Performance
Beyond architectural and algorithmic improvements, hardware acceleration plays a crucial role in enabling efficient transformer deployment. Several studies investigate specialized hardware designs to optimize transformer performance. T-REX [19] introduces novel training and post-training compression schemes to reduce external memory access during transformer model inference [19]. TATAA [28] employs mixed-precision arithmetic for both linear and non-linear operations in a unified and programmable framework [28]. The hardware switches between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations [28].
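To illustrate the mixed-precision split that such designs exploit, the sketch below quantizes a matrix multiplication to int8 and executes the following non-linearity in bfloat16; the symmetric per-tensor quantization and the choice of GELU are assumptions made for illustration, not details of the TATAA hardware.

```python
import torch
import torch.nn.functional as F

def mixed_precision_sketch(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Illustration of the split described for TATAA [28]: linear ops in int8,
    non-linear ops in bfloat16. Quantization scheme and GELU are assumptions."""
    # int8 path for the matrix multiplication (systolic-array mode in hardware).
    s_x = x.abs().max().clamp(min=1e-8) / 127
    s_w = w.abs().max().clamp(min=1e-8) / 127
    x_q = torch.clamp(torch.round(x / s_x), -127, 127).to(torch.int8)
    w_q = torch.clamp(torch.round(w / s_w), -127, 127).to(torch.int8)
    # Float matmul stands in for int8 MACs with int32 accumulation on real hardware.
    acc = x_q.to(torch.float32) @ w_q.to(torch.float32)
    y = acc * (s_x * s_w)  # dequantize
    # bfloat16 path for the non-linearity (SIMD mode in hardware).
    return F.gelu(y.to(torch.bfloat16))

out = mixed_precision_sketch(torch.randn(16, 512), torch.randn(512, 512))
```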
Future Directions
The field of post-transformer architecture is rapidly evolving, and the work surveyed above suggests several promising avenues for future research: a clearer theoretical account of how normalization placement governs gradient flow and representational capacity at scale; post-training quantization methods that extend to hybrid and non-standard transformer variants, whose dynamic ranges and outlier structure differ from those of the models most PTQ techniques target; and closer co-design of quantization schemes with the mixed-precision hardware that will ultimately execute them.
In conclusion, the post-transformer landscape is characterized by continuous innovation, driven by the need to overcome the limitations of the original architecture and to expand its applicability across diverse domains. Architectural modifications, advanced quantization techniques, specialized hardware, and task-specific adaptations are all contributing to the ongoing evolution of this transformative technology. Continued research in these areas will be crucial for unlocking the full potential of transformers and enabling their widespread deployment in the years to come.
References