AI Infra, Quantization
LLM-Specific Quantization Algorithms and Outlier Handling
This group of papers targets the distinctive activation distributions of LLMs (outliers) and their memory bottleneck (the KV cache), proposing techniques such as weight-error compensation, rotation transforms (SpinQuant), activation-aware quantization (AWQ), and smoothing-based scaling, with the goal of preventing accuracy collapse at extremely low bit widths. A minimal code sketch of the outlier-decomposition idea follows the reference list below.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale(Tim Dettmers, M. Lewis, Younes Belkada, Luke Zettlemoyer, 2022, ArXiv)
- SpinQuant: LLM quantization with learned rotations(Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort, 2024, ArXiv)
- PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization(Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo, 2024, ArXiv Preprint)
- IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact(Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zheng-Jun Xu, Lu Hou, Jun Yao, Chun Yuan, 2024, ArXiv)
- WUSH: Near-Optimal Adaptive Transforms for LLM Quantization(Jiale Chen, Vage Egiazarian, Roberto L. Castro, Torsten Hoefler, Dan Alistarh, 2025, ArXiv Preprint)
- OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization(Cong Guo, Jiaming Tang, Weiming Hu, Jingwen Leng, Chen Zhang, Fan Yang, Yun-Bo Liu, Minyi Guo, Yuhao Zhu, 2023, Proceedings of the 50th Annual International Symposium on Computer Architecture)
- Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other(Yifei Gao, Jie Ou, Lei Wang, Yuting Xiao, Zhiyuan Xiang, Ruiting Dai, Jun Cheng, 2024, ArXiv Preprint)
- LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid(Tianyi Zhang, Anshumali Shrivastava, 2024, No journal)
- Self-calibration for Language Model Quantization and Pruning(Miles Williams, G. Chrysostomou, Nikolaos Aletras, 2024, ArXiv)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration(Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han, 2023, ArXiv Preprint)
- ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms(Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang, 2025, ArXiv)
- PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression(Bo Jiang, Taolue Yang, Youyuan Liu, Xubin He, Sheng Di, Sian Jin, 2025, ArXiv)
- FPTQuant: Function-Preserving Transforms for LLM Quantization(B. V. Breugel, Yelysei Bondarenko, Paul N. Whatmough, Markus Nagel, 2025, ArXiv)
- GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance(Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, Hyun Oh Song, 2025, ArXiv)
- Outlier Matters: A Statistical Analysis of LLM Tensor Distributions and Quantization Effects(Taein Kim, Seongwook Kim, Sukhyun Han, Woojin Cho, Youngjae Choi, Youngseok Bae, Seokin Hong, 2025, 2025 International Technical Conference on Circuits/Systems, Computers, and Communications (ITC-CSCC))
- Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs(Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei, 2024, ArXiv)
- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners(Yifei Gao, Jie Ou, Lei Wang, Jun Cheng, Mengchu Zhou, 2024, ArXiv Preprint)
- LLM Compression: How Far Can We Go in Balancing Size and Performance?(Sahil Sk, Debasish Dhal, Sonal Khosla, Sk Shahid, S. Shekhar, A. Dhaka, Shantipriya Parida, Dilip K. Prasad, O. Bojar, 2025, No journal)
- KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization(Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava, 2024, ArXiv)
- VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference(Steve Dai, Rangharajan Venkatesan, Haoxing Ren, B. Zimmer, W. Dally, Brucek Khailany, 2021, ArXiv)
- Dynamic Stashing Quantization for Efficient Transformer Training(Guofu Yang, Daniel Lo, R. Mullins, Yiren Zhao, 2023, No journal)
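To make the outlier-handling theme concrete, the sketch below illustrates the mixed-precision decomposition popularized by LLM.int8(): feature dimensions whose activations exceed a magnitude threshold are computed in floating point, while the remaining dimensions go through an INT8 matmul. This is a minimal NumPy illustration with invented shapes and a hypothetical threshold, not the authors' CUDA implementation.

```python
import numpy as np

def int8_matmul_with_outliers(x, w, threshold=6.0):
    """Mixed-precision decomposition in the spirit of LLM.int8().

    x: (tokens, d_in) float activations, w: (d_in, d_out) float weights.
    Feature dimensions of x whose max magnitude exceeds `threshold` are
    kept in float; the remaining dimensions are quantized to INT8.
    """
    outlier_cols = np.max(np.abs(x), axis=0) > threshold   # dims containing outliers
    reg, out = ~outlier_cols, outlier_cols

    # Per-row scale for activations, per-column scale for weights (absmax).
    sx = np.max(np.abs(x[:, reg]), axis=1, keepdims=True) / 127.0 + 1e-12
    sw = np.max(np.abs(w[reg, :]), axis=0, keepdims=True) / 127.0 + 1e-12
    xq = np.clip(np.round(x[:, reg] / sx), -127, 127).astype(np.int8)
    wq = np.clip(np.round(w[reg, :] / sw), -127, 127).astype(np.int8)

    # INT8 GEMM accumulated in int32, then dequantized ...
    y_reg = (xq.astype(np.int32) @ wq.astype(np.int32)) * sx * sw
    # ... plus the outlier dimensions computed in full precision.
    y_out = x[:, out] @ w[out, :]
    return y_reg + y_out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)); x[:, 3] *= 20      # inject one outlier feature dimension
w = rng.normal(size=(16, 8))
print(np.max(np.abs(int8_matmul_with_outliers(x, w) - x @ w)))
```

Rotation- and smoothing-based methods (SpinQuant, AWQ-style per-channel scaling) instead reshape the weight/activation distributions so that fewer values need this kind of special handling.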
Hardware-Software Co-Design and Dedicated Hardware Accelerator Architectures
This group focuses on low-level hardware implementation, covering accelerator designs based on FPGAs, ASICs, NPUs, and compute-in-memory (CIM). The emphasis is on tightly coupling quantized operators with hardware circuits, e.g., systolic-array optimization, low-power MAC units, and novel architectures for RRAM/memristors. A sketch of fine-grained per-vector scaling, a recurring primitive in these designs, follows the reference list below.
- CMQ: Crossbar-Aware Neural Network Mixed-Precision Quantization via Differentiable Architecture Search(Jie Peng, Haijun Liu, ZhongJin Zhao, Zhiwei Li, Sen Liu, Qingjiang Li, 2022, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
- Efficient Low-Bit Neural Network With Memristor-Based Reconfigurable Circuits(He Xiao, Xiaofang Hu, Tongtong Gao, Yue Zhou, Shukai Duan, Yiran Chen, 2024, IEEE Transactions on Circuits and Systems II: Express Briefs)
- Accelerator for LLM-Enhanced GNN with Product Quantization and Unified Indexing(Jiaming Xu, Jinhao Li, Jun Liu, Hao Zhou, Guohao Dai, 2025, Proceedings of the 30th Asia and South Pacific Design Automation Conference)
- Synergizing spintronics and quaternary logic: a hardware accelerator for neural networks with optimized quantization algorithm(Motahareh BahmanAbadi, Abdolah Amirany, M. H. Moaiyeri, Kian Jafari, 2025, The Journal of Supercomputing)
- Asymmetric quantization in hardware accelerator(K. H. Tsoi, Chao Xiong, Wei Zou, Xinyu Niu, 2023, No journal)
- Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation(Eunhyeok Park, Dongyoung Kim, Sungjoo Yoo, 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA))
- Low-Power Artificial Neural Network Perceptron Based on Monolayer MoS2(Guilherme Migliato Marega, Zhenyu Wang, M. Paliy, G. Giusi, S. Strangio, F. Castiglione, Christian Callegari, M. Tripathi, A. Radenović, G. Iannaccone, A. Kis, 2022, ACS Nano)
- LLM on FPGA: Squeezing Language Models by Quantization and Multi-Query Attention and its Efficient Hardware Architecture(Seoyoon Chae, Taewook Kang, 2025, 2025 22nd International SoC Design Conference (ISOCC))
- Synthesis of CNN Accelerator with Weight Sharing through Quantization(G. Aasthikka, K. Anusha, 2025, 2025 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT))
- Entropy-Based Early-Exit in a FPGA-Based Low-Precision Neural Network(Minxuan Kong, J. Núñez-Yáñez, 2022, No journal)
- A Hardware Accelerator for Image Super-Resolution with Algorithm Lightweighting and Custom Fusion Engine(Menghan Li, Sheng Lu, Jun Han, 2024, 2024 IEEE 17th International Conference on Solid-State & Integrated Circuit Technology (ICSICT))
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving(Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 2024, ArXiv Preprint)
- AxCore: A Quantization-Aware Approximate GEMM Unit for LLM Inference(Jiaxiang Zou, Yonghao Chen, Xingyu Chen, Chenxi Xu, Xinyu Chen, 2025, Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture)
- LLM-NPU: Towards Efficient Foundation Model Inference on Low-Power Neural Processing Units(Arnab Raha, Souvik Kundu, S. N. Sridhar, Shamik Kundu, Soumendu Kumar Ghosh, Alessandro Palla, Arghadip Das, Darren Crews, Deepak A. Mathaikutty, 2025, 2025 IEEE International Conference on Omni-layer Intelligent Systems (COINS))
- Bitwise Systolic Array Architecture for Runtime-Reconfigurable Multi-precision Quantized Multiplication on Hardware Accelerators(Yuhao Liu, Salim Ullah, Akash Kumar, 2026, ArXiv Preprint)
- T-REX: A 68-to-567μs/Token 0.41-to-3.95μJ/Token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET(Seunghyun Moon, Mao Li, Gregory K. Chen, Phil C. Knag, Ram K. Krishnamurthy, Mingoo Seok, 2025, 2025 IEEE International Solid-State Circuits Conference (ISSCC))
- A 16×16 High-Utilization Systolic Array Hardware Accelerator for Long-Sequence Flash-Attention Computation in Transformer(Zhenkun Li, Liji Wu, Yi Yang, Tianling Ren, Le Wu, Xiangmin Zhang, 2025, 2025 IEEE 16th International Conference on ASIC (ASICON))
- Binary Precision Neural Network Manycore Accelerator(M. Hosseini, T. Mohsenin, 2021, ACM Journal on Emerging Technologies in Computing Systems (JETC))
- MixPE: Quantization and Hardware Co-design for Efficient LLM Inference(Yu Zhang, Mingzi Wang, Lancheng Zou, Wulong Liu, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu, 2024, ArXiv)
- OA-LAMA: An Outlier-Adaptive LLM Inference Accelerator with Memory-Aligned Mixed-Precision Group Quantization(Huangxu Chen, Yingbo Hao, Yi Zou, Xinyu Chen, 2025, 2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD))
- Hardware Accelerator for Bidirectional Encoder Representations from Transformers (BERT)(Yimin Wang, 2024, 2024 International Conference on Microelectronics (ICM))
- A High‐Efficiency CNN Accelerator With Mixed Low‐Precision Quantization(Xianghong Hu, Jinhui Pan, Yue Ding, Wenjin Huang, Zhejun Zheng, Xueming Li, Hongmin Huang, Xiaoming Xiong, 2025, IET Circuits)
- Medha: Microcoded Hardware Accelerator for computing on Encrypted Data(Ahmet Can Mert, Aikata, Sunmin Kwon, Youngsam Shin, Donghoon Yoo, Yongwoo Lee, Sujoy Sinha Roy, 2022, ArXiv Preprint)
- PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration(Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah, 2023, ArXiv Preprint)
- A dual-domain compute-in-memory system for general neural network inference(Ze Wang, Ruihua Yu, Zhiping Jia, Zhifan He, Tianhao Yang, B. Gao, Yang Li, Zhenping Hu, Zhenqi Hao, Yun Liu, Jianghai Lu, P. Yao, Jianshi Tang, Qi Liu, H. Qian, Huaqiang Wu, 2025, Nature Electronics)
- Low-Bit Precision Neural Network Architecture with High Immunity to Variability and Random Telegraph Noise based on Resistive Memories(T. Zanotti, F. Puglisi, P. Pavan, 2021, 2021 IEEE International Reliability Physics Symposium (IRPS))
- HLC: A Hardware-friendly Quantization and Cache-based Accelerator for Transformer(Xiangfeng Sun, Yuanting Zhang, Yunchang Jiang, Zheng Li, Bingjin Han, Junyi Mai, Zhibin Luo, Enyi Yao, 2024, 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS))
- Optimized Winograd CNN Hardware Accelerator with Quantized Computation(R. Aishwarya, Protyusha Ray, R. Akanksha, B. Rajeshwari, 2025, 2025 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT))
- AdderNet 2.0: Optimal AdderNet Accelerator Designs With Activation-Oriented Quantization and Fused Bias Removal-Based Memory Optimization(Yunxiang Zhang, Omar Al Kailani, Bin Zhou, Wenfeng Zhao, 2025, IEEE Transactions on Circuits and Systems I: Regular Papers)
- SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers(Alberto Marchisio, David Durà, Maurizio Capra, Maurizio Martina, G. Masera, Muhammad Shafique, 2023, 2023 International Joint Conference on Neural Networks (IJCNN))
- VersaQ-3D: A Reconfigurable Accelerator Enabling Feed-Forward and Generalizable 3D Reconstruction via Versatile Quantization(Yipu Zhang, Jintao Cheng, Xingyu Liu, Zeyu Li, Carol Jingyi Li, Jin Wu, Lin Jiang, Yuan Xie, Jiang Xu, Wei Zhang, 2026, ArXiv)
- ATE-GCN: An FPGA-Based Graph Convolutional Network Accelerator with Asymmetrical Ternary Quantization(Ruiqi Chen, Jiayu Liu, Shi-xiong Tang, Yang Liu, Yanxiang Zhu, Ming Ling, Bruno da Silva, 2025, 2025 Design, Automation & Test in Europe Conference (DATE))
- OutlierCIM: Outlier-Aware Digital CIM-Based LLM Accelerator with Hybrid-Strategy Quantization and Unified FP-INT Computation(Zihan Zou, Shikuang Chen, Chen Zhang, Xing Wang, Zhichao Liu, Haoran Du, Xin Si, Hao Cai, Bo Liu, 2025, 2025 62nd ACM/IEEE Design Automation Conference (DAC))
- Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment(Yuhao Ji, Chao Fang, Shaobo Ma, Haikuo Shao, Zhongfeng Wang, 2024, 2024 ACM/IEEE International Conference On Computer Aided Design (ICCAD))
- Low-power FPGA reconfigurable hardware accelerator design for lightweight CNNs(Youyao Liu, Yituo Qiao, Xiao Xiong, Enci Wang, 2025, No journal)
- 基于异构多核并行加速的嵌入式神经网络人脸识别方法 (Embedded Neural Network Face Recognition Method Based on Heterogeneous Multicore Parallel Acceleration)(Fang Gao, Zhangqin Huang, 2018, 计算机科学)
- Neural Network-Inspired Analog-to-Digital Conversion to Achieve Super-Resolution with Low-Precision RRAM Devices(Weidong Cao, Liu Ke, Ayan Chakrabarti, Xuan Zhang, 2019, 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD))
- Evaluating Neural Network-Inspired Analog-to-Digital Conversion With Low-Precision RRAM(Weidong Cao, Liu Ke, Ayan Chakrabarti, Xuan Zhang, 2021, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
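Many of the accelerators above rely on fine-grained scaling (e.g., the per-vector scheme of VS-Quant), because one scale per small vector maps naturally onto MAC-array datapaths while containing the damage of isolated outliers. The sketch below is a simplified NumPy illustration of per-vector absmax quantization; the vector size and bit width are arbitrary choices, not any specific accelerator's configuration.

```python
import numpy as np

def per_vector_quantize(w, vec_size=16, n_bits=4):
    """Per-vector scaled quantization (VS-Quant style, simplified).

    Each contiguous vector of `vec_size` weights gets its own absmax
    scale, so a single outlier only inflates the scale of its vector
    instead of the whole channel or tensor.
    """
    qmax = 2 ** (n_bits - 1) - 1
    flat = w.reshape(-1, vec_size)                      # assume size divides evenly
    scale = np.max(np.abs(flat), axis=1, keepdims=True) / qmax + 1e-12
    q = np.clip(np.round(flat / scale), -qmax, qmax)
    return (q * scale).reshape(w.shape)                 # dequantized reconstruction

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_hat = per_vector_quantize(w, vec_size=16, n_bits=4)
print("mean abs error:", np.mean(np.abs(w - w_hat)))
```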
Mixed-Precision Allocation and Adaptive Bit-Width Search
This group examines strategies that apply different bit widths at the granularity of layers, blocks, or tokens, balancing model accuracy against compute resources through Hessian analysis, neural architecture search (NAS), or dynamic routing (e.g., MoE). A greedy sensitivity-based bit-allocation sketch follows the reference list below.
- DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference(Jiajun Zhou, Jiajun Wu, Yizhao Gao, Yuhao Ding, Chaofan Tao, Bo Li, Fengbin Tu, Kwang-Ting Cheng, Hayden Kwok-Hay So, Ngai Wong, 2023, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
- MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts(Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, Jianzong Wang, 2025, No journal)
- Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference(Wei Tao, Bin Zhang, Xiaoyang Qu, Jiguang Wan, Jianzong Wang, 2025, 2025 Design, Automation & Test in Europe Conference (DATE))
- QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference(Xiangchen Li, Saeid Ghafouri, Bo Ji, Hans Vandierendonck, Deepu John, Dimitrios S. Nikolopoulos, 2025, ArXiv)
- MoQE: Improve Quantization Model performance via Mixture of Quantization Experts(Jinhao Zhang, Yunquan Zhang, Boyang Zhang, Zeyu Liu, Daning Cheng, 2025, ArXiv)
- BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference(Wonsuk Jang, Thierry Tambe, 2025, ArXiv)
- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models(Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 2024, ArXiv)
- FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference(Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Y. Shao, Brucek Khailany, 2025, ArXiv)
- Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators(Jan Klhufek, Miroslav Safar, Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina, 2024, ArXiv Preprint)
- GradFreeBits: Gradient Free Bit Allocation for Dynamic Low Precision Neural Networks(Ben Bodner, G. B. Shalom, Eran Treister, 2021, ArXiv)
- Mixed Precision Low-Bit Quantization of Neural Network Language Models for Speech Recognition(Junhao Xu, Jianwei Yu, Shoukang Hu, Xunying Liu, H. Meng, 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing)
- I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference(Zhikai Li, Qingyi Gu, 2022, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning(Jiun-Man Chen, Yu-Hsuan Chao, Yuji Wang, Ming-Der Shieh, Chih-Chung Hsu, Wei Lin, 2024, 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR))
- KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference(Xing Li, Zeyu Xing, Yiming Li, Lin Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, S. J. Pan, Mingxuan Yuan, 2025, ArXiv)
- QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts(Pingzhi Li, Xiaolong Jin, Zhen Tan, Yu Cheng, Tianlong Chen, 2024, ArXiv Preprint)
- AirGun: Adaptive Granularity Quantization for Accelerating Large Language Models(Sungbin Kim, Hyunwuk Lee, Sungwoo Kim, Cheolhwan Kim, Won Woo Ro, 2024, 2024 IEEE 42nd International Conference on Computer Design (ICCD))
- An Automatic Neural Network Architecture-and-Quantization Joint Optimization Framework for Efficient Model Inference(Lian Liu, Ying Wang, Xiandong Zhao, Weiwei Chen, Huawei Li, Xiaowei Li, Yinhe Han, 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
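A common thread in these mixed-precision papers is to allocate bit widths from a per-layer sensitivity signal (Hessian traces, Fisher information, or measured loss deltas) under a memory or latency budget. The sketch below shows one generic greedy allocator with made-up layer sizes and sensitivity scores; it is an illustrative baseline, not the algorithm of any single paper above.

```python
import numpy as np

def greedy_bit_allocation(sizes, sensitivities, budget_bits, choices=(8, 4, 2)):
    """Assign a bit width to each layer under a total-bit budget.

    Start every layer at the highest precision, then repeatedly demote
    the layer whose demotion adds the least (sensitivity-weighted) error
    per bit saved, until the budget is met.
    """
    bits = {i: choices[0] for i in range(len(sizes))}
    total = sum(sizes[i] * bits[i] for i in bits)

    def demotion_cost(i):
        nxt = choices[choices.index(bits[i]) + 1]
        # error of a b-bit uniform quantizer shrinks roughly as 4^-b
        extra_err = sensitivities[i] * (4.0 ** -nxt - 4.0 ** -bits[i])
        saved = sizes[i] * (bits[i] - nxt)
        return extra_err / saved, nxt

    while total > budget_bits:
        candidates = [i for i in bits if bits[i] != choices[-1]]
        if not candidates:
            break
        i = min(candidates, key=lambda j: demotion_cost(j)[0])
        _, nxt = demotion_cost(i)
        total -= sizes[i] * (bits[i] - nxt)
        bits[i] = nxt
    return bits

sizes = [4096 * 4096] * 6                      # parameter counts per layer (toy)
sens = [5.0, 1.0, 0.2, 0.2, 1.0, 8.0]          # hypothetical sensitivity scores
print(greedy_bit_allocation(sizes, sens, budget_bits=int(sum(sizes) * 4.5)))
```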
Ultra-Low-Bit and Non-Uniform Quantization, and Emerging Floating-Point Formats
This group studies emerging low-bit floating-point formats such as FP8/FP4, along with 1-bit (binarized), 2-bit, and non-uniform quantization (e.g., logarithmic number systems and additive powers-of-two), probing the limits of model compression. A sketch of additive powers-of-two quantization follows the reference list below.
- Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks(Léopold Cambier, Anahita Bhiwandiwalla, Ting Gong, M. Nekuii, Oguz H. Elibol, Hanlin Tang, 2020, ArXiv)
- Block and Subword-Scaling Floating-Point (BSFP) : An Efficient Non-Uniform Quantization For Low Precision Inference(Yun-Chen Lo, Tse-Kuang Lee, Ren-Shuo Liu, 2023, No journal)
- Low-Precision Floating-Point for Efficient On-Board Deep Neural Network Processing(Cédric Gernigon, Silviu-Ioan Filip, Olivier Sentieys, Clément Coggiola, Mickael Bruno, 2023, 2023 European Data Handling & Data Processing Conference (EDHPC))
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers(Shih-Yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng, 2023, No journal)
- Low-Precision Floating-Point Schemes for Neural Network Training(Marc Ortiz, A. Cristal, E. Ayguadé, Marc Casas, 2018, ArXiv)
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation(Liqun Ma, Mingjie Sun, Zhiqiang Shen, 2024, ArXiv Preprint)
- Extreme Compression of Large Language Models via Additive Quantization(Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh, 2024, ArXiv)
- decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points(Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu, 2024, ArXiv Preprint)
- PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models(He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zhen Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong, 2025, ArXiv Preprint)
- Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks(Yuhang Li, Xin Dong, Wei Wang, 2019, ArXiv Preprint)
- Low-precision logarithmic arithmetic for neural network accelerators(Maxime Christ, F. D. Dinechin, F. Pétrot, 2022, 2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP))
- Pruning Ternary Quantization(Dan Liu, Xi Chen, Jie Fu, Chen Ma, Xue Liu, 2021, ArXiv Preprint)
- AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs(Gunho Park, Jeongin Bae, Beomseok Kwon, Byeongwook Kim, S. Kwon, Dongsoo Lee, 2025, ArXiv)
- TAB: Unified and Optimized Ternary, Binary, and Mixed-precision Neural Network Inference on the Edge(Shien Zhu, Luan H. K. Duong, Weichen Liu, 2022, ACM Transactions on Embedded Computing Systems (TECS))
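As an example of the non-uniform formats discussed here, additive powers-of-two (APoT) quantization restricts each level to a sum of a few power-of-two terms, which keeps multiplications shift-friendly while concentrating levels near zero. The sketch below builds a small APoT-style level set and rounds weights to the nearest level; the term count, exponent set, and clipping rule are simplified assumptions rather than the paper's exact construction.

```python
import numpy as np

def apot_levels(n_terms=2, k_bits=2):
    """Build additive powers-of-two levels: each level is a sum of
    `n_terms` terms, each term being 0 or a power of two drawn from a
    small exponent set (2^-1, 2^-2, ...)."""
    base = [0.0] + [2.0 ** -(i + 1) for i in range(2 ** k_bits - 1)]
    levels = {0.0}
    for a in base:
        for b in (base if n_terms == 2 else [0.0]):
            levels.add(a + b)
    levels = sorted(levels)
    m = max(levels)
    return np.array([l / m for l in levels])      # normalize magnitudes to [0, 1]

def apot_quantize(w, levels):
    """Symmetric non-uniform quantization onto the APoT level set."""
    alpha = np.max(np.abs(w)) + 1e-12             # clipping / scale factor
    mag = np.abs(w) / alpha
    idx = np.argmin(np.abs(mag[..., None] - levels[None, :]), axis=-1)
    return np.sign(w) * levels[idx] * alpha

lv = apot_levels()
w = np.random.default_rng(0).normal(size=(8, 8))
print("levels:", lv)
print("max abs error:", np.max(np.abs(w - apot_quantize(w, lv))))
```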
Quantization for Emerging Architectures and Specific Models (Mamba/SNN/Diffusion)
Quantization optimizations tailored to the distinctive compute patterns of non-Transformer architectures (e.g., the Mamba state-space model), spiking neural networks (SNNs), diffusion models, and graph neural networks (GNNs).
- FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization(Aotao Wang, Haikuo Shao, Shaobo Ma, Zhongfeng Wang, 2025, 2025 IEEE Computer Society Annual Symposium on VLSI (ISVLSI))
- LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design(Renjie Wei, Songqiang Xu, Linfeng Zhong, Zebin Yang, Qingyu Guo, Yuan Wang, Runsheng Wang, Meng Li, 2025, 2025 Design, Automation & Test in Europe Conference (DATE))
- An Efficient FPGA-Based Hardware Accelerator of Fully Quantized Mamba-2(Kailing Zhou, Han Jiao, Wenjin Huang, Yihua Huang, 2025, 2025 IEEE 33rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM))
- lpSpikeCon: Enabling Low-Precision Spiking Neural Network Processing for Efficient Unsupervised Continual Learning on Autonomous Agents(Rachmad Vidya Wicaksana Putra, Muhammad Shafique, 2022, 2022 International Joint Conference on Neural Networks (IJCNN))
- Low Precision Quantization-aware Training in Spiking Neural Networks with Differentiable Quantization Function(Ayan Shymyrbay, M. Fouda, A. Eltawil, 2023, 2023 International Joint Conference on Neural Networks (IJCNN))
- Exploring Extreme Quantization in Spiking Language Models(Malyaban Bal, Yi Jiang, Abhronil Sengupta, 2024, 2024 International Conference on Neuromorphic Systems (ICONS))
- 23.3 EdgeDiff: 418.4mJ/Inference Multi-Modal Few-Step Diffusion Model Accelerator with Mixed-Precision and Reordered Group Quantization(Sangjin Kim, Jungjun Oh, Jeonggyu So, Yuseon Choi, Sangyeob Kim, Dongseok Im, Gwangtae Park, H.-J. Yoo, 2025, 2025 IEEE International Solid-State Circuits Conference (ISSCC))
- Temporal Feature Matters: A Framework for Diffusion Model Quantization(Yushi Huang, Ruihao Gong, Xianglong Liu, Jing Liu, Yuhang Li, Jiwen Lu, Dacheng Tao, 2024, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- BiKA: Binarized KAN-inspired Neural Network for Efficient Hardware Accelerator Designs(Yuhao Liu, Salim Ullah, Akash Kumar, 2025, 2025 IEEE 33rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM))
- ATE-GCN: An FPGA-Based Graph Convolutional Network Accelerator with Asymmetrical Ternary Quantization(Ruiqi Chen, Jiayu Liu, Shi-xiong Tang, Yang Liu, Yanxiang Zhu, Ming Ling, Bruno da Silva, 2025, 2025 Design, Automation & Test in Europe Conference (DATE))
Quantization Training Techniques, Robustness Analysis, and Secure Inference
This group covers quantization-aware training (QAT), dynamic precision adjustment during training (CPT), and the impact of quantization on model robustness, privacy protection (defense against membership inference attacks), and secure multi-party computation (MPC). A minimal fake-quantization/straight-through-estimator training sketch follows the reference list below.
- Scaling Law for Quantization-Aware Training(Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo, 2025, ArXiv Preprint)
- Using Quantization-Aware Training Technique with Post-Training Fine-Tuning Quantization to Implement a MobileNet Hardware Accelerator(Ching-Che Chung, Wei-Ting Chen, Ya-Ching Chang, 2020, 2020 Indo – Taiwan 2nd International Conference on Computing, Analytics and Networks (Indo-Taiwan ICAN))
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference(Benoit Jacob, S. Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, Dmitry Kalenichenko, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition)
- Compressing Low Precision Deep Neural Networks Using Sparsity-Induced Regularization in Ternary Networks(Julian Faraone, Nicholas Fraser, Giulio Gambardella, Michaela Blott, Philip H. W. Leong, 2017, ArXiv Preprint)
- Towards Model Quantization on the Resilience Against Membership Inference Attacks(C. Kowalski, Azadeh Famili, Yingjie Lao, 2022, 2022 IEEE International Conference on Image Processing (ICIP))
- WOLF: Weight-Level OutLier and Fault Integration for Reliable LLM Deployment(Chong Wang, Wanyi Fu, Jiangwei Zhang, Shiyao Li, Rui Hou, Jian Yang, Yu Wang, 2025, IEEE Transactions on Computers)
- Analyzing inference robustness of RRAM synaptic array in low-precision neural network(Rui Liu, Heng-Yuan Lee, Shimeng Yu, 2017, 2017 47th European Solid-State Device Research Conference (ESSDERC))
- Ditto: Quantization-aware Secure Inference of Transformers upon MPC(Haoqi Wu, Wenjing Fang, Yancheng Zheng, Junming Ma, Jin Tan, Yinggui Wang, Lei Wang, 2024, ArXiv)
- Model Hemorrhage and the Robustness Limits of Large Language Models(Ziyang Ma, Zuchao Li, Lefei Zhang, Gui-Song Xia, Bo Du, Liangpei Zhang, Dacheng Tao, 2025, ArXiv)
- CPT: Efficient Deep Neural Network Training via Cyclic Precision(Y. Fu, Han Guo, Meng Li, Xin Yang, Yining Ding, V. Chandra, Yingyan Lin, 2021, ArXiv)
- Joint Training of Low-Precision Neural Network with Quantization Interval Parameters(S. Jung, Changyong Son, Seohyung Lee, JinWoo Son, Youngjun Kwak, Jae-Joon Han, Changkyu Choi, 2018, ArXiv)
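The QAT works above share one mechanism: simulate ("fake") quantization in the forward pass and let gradients pass through the rounding with a straight-through estimator (STE). Below is a minimal PyTorch sketch of that mechanism; the per-tensor absmax scaling, the 4-bit/8-bit choice, and the toy regression task are illustrative assumptions, not any specific paper's recipe.

```python
import torch
import torch.nn as nn

class FakeQuant(torch.autograd.Function):
    """Uniform symmetric fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, n_bits=8):
        qmax = 2 ** (n_bits - 1) - 1
        scale = x.detach().abs().max() / qmax + 1e-12
        return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None          # STE: pass gradients through unchanged

class QATLinear(nn.Linear):
    def forward(self, x):
        w_q = FakeQuant.apply(self.weight, 4)      # 4-bit weights
        x_q = FakeQuant.apply(x, 8)                # 8-bit activations
        return nn.functional.linear(x_q, w_q, self.bias)

layer = QATLinear(16, 16)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, y = torch.randn(32, 16), torch.randn(32, 16)
for _ in range(10):                                # tiny training loop
    loss = nn.functional.mse_loss(layer(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```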
System Deployment Tools, Performance Evaluation, and Edge Deployment Practice
This group focuses on the AI Infra software stack, including automated deployment toolkits (Torch2Chip), inference-engine evaluation (TensorRT), energy-efficiency estimation frameworks (ArchTune), and deployment results in vertical domains such as healthcare, communications, and e-commerce. A minimal latency-benchmarking sketch follows the reference list below.
- LLM Inference Unveiled: Survey and Roofline Model Insights(Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 2024, ArXiv)
- Model Compression and Efficient Inference for Large Language Models: A Survey(Wenxiao Wang, Wei Chen, Yi Luo, Y. Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 2024, ArXiv)
- FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design(Jiahao Zhang, Zifan He, Nicholas Fraser, M. Blott, Yizhou Sun, Jason Cong, 2026, ArXiv)
- Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design(Jian Meng, Yuan Liao, Anupreetham Anupreetham, Ahmed Hassan, Shixing Yu, Han-Sok Suh, Xiaofeng Hu, Jae-sun Seo, 2024, ArXiv)
- SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision(Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kimmo Yan, 2022, No journal)
- ArchTune: A Predictive Energy Estimation Framework for LLM Inference on Edge Accelerators(Arghyajoy Mondal, Rajdeep Samanta, Ashwin Krishnan, Sparsh Mittal, M. Nambiar, Rekha Singhal, 2025, 2025 5th International Conference on AI-ML-Systems (AIMLSystems))
- MQBench: Towards Reproducible and Deployable Model Quantization Benchmark(Yuhang Li, Mingzhu Shen, Yan Ren, Mingxin Zhao, Qi Zhang, Ruihao Gong, F. Yu, Junjie Yan, 2021, ArXiv)
- Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective(A. Benazir, F. Lin, 2025, ArXiv)
- Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime(Saad Ashfaq, Mohammadhossein Askarihemmat, Sudhakar Sah, Ehsan Saboori, Olivier Mastropietro, Alexander Hoffman, 2022, ArXiv)
- TensorRT Implementations of Model Quantization on Edge SoC(Yuxiao Zhou, Zhishan Guo, Zheng Dong, Kecheng Yang, 2023, 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC))
- Being-ahead: Benchmarking and Exploring Accelerators for Hardware-Efficient AI Deployment(Xiaofan Zhang, Hanchen Ye, Deming Chen, 2021, ArXiv Preprint)
- EdgeShard: Efficient LLM Inference via Collaborative Edge Computing(Mingjin Zhang, Xiaoming Shen, Jiannong Cao, Zeyang Cui, Shan Jiang, 2025, IEEE Internet of Things Journal)
- Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs(Luchang Li, Shengyi Qian, Jie Lu, L. Yuan, Rui Wang, Qin Xie, 2024, ArXiv)
- On-Device Qwen2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration(Maoyang Xiang, R. Fernando, Bo Wang, 2025, ArXiv)
- Enhancing E-commerce Chatbots with Falcon-7B and 16-bit Full Quantization(Yang Luo, Zibu Wei, Guokun Xu, Zhengning Li, Ying Xie, Yibo Yin, 2024, Journal of Theory and Practice of Engineering Science)
- Design of YOLOv5 hardware accelerator based on FPGA+ARM(Minghao Zhai, Lei Yu, 2025, No journal)
- Low-Precision Neural Network Decoding of Polar Codes(Igor Wodiany, Antoniu Pop, 2019, 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC))
- Post-Training Quantization for 3D Medical Image Segmentation: A Practical Study on Real Inference Engines(Chongyu Qu, Ritchie Zhao, Ye Yu, Bin Liu, Tianyuan Yao, Junchao Zhu, Bennett A. Landman, Yucheng Tang, Yuankai Huo, 2025, ArXiv)
- Beyond the Cloud: Edge Inference for Generative Large Language Models in Wireless Networks(Xinyuan Zhang, Jiangtian Nie, Yudong Huang, Gaochang Xie, Zehui Xiong, Jiang Liu, Dusist Niyato, X. Shen, 2025, IEEE Transactions on Wireless Communications)
- QRazor: Reliable and Effortless 4-bit LLM Quantization by Significant Data Razoring(Dongyoung Lee, Seungkyu Choi, Ik Joon Chang, 2025, ArXiv)
- HamQ: Hamming Weight-Based Energy-Aware Quantization for Analog Compute-in-Memory Accelerator in Intelligent Sensors(Sudarshan Sharma, Beomseok Kang, N. V. Kidambi, S. Mukhopadhyay, 2025, IEEE Sensors Journal)
- Energy awareness in low precision neural networks(Nurit Spingarn-Eliezer, Ron Banner, Elad Hoffer, Hilla Ben-Yaacov, T. Michaeli, 2022, ArXiv)
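Much of the deployment work above boils down to measuring what quantization actually buys on the target runtime. The sketch below is a minimal wall-clock comparison using PyTorch's built-in dynamic INT8 quantization on CPU; it is a rough probe for sanity checks, not a replacement for TensorRT- or ArchTune-style profiling, and the model size and iteration counts are arbitrary.

```python
import time
import torch
import torch.nn as nn

def bench(model, x, iters=50):
    """Average wall-clock latency per forward pass on CPU."""
    with torch.inference_mode():
        for _ in range(5):                 # warm-up
            model(x)
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - t0) / iters

fp32 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
int8 = torch.quantization.quantize_dynamic(fp32, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 1024)
print(f"fp32: {bench(fp32, x) * 1e3:.2f} ms   int8: {bench(int8, x) * 1e3:.2f} ms")
```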
This report synthesizes full-stack research on AI Infra and quantization. The core trends are: 1) at the algorithm level, quantization techniques specialized for LLM outliers and the KV cache have become mainstream; 2) at the hardware level, hardware-software co-design is shifting from traditional FPGA/ASIC toward more efficient compute-in-memory (CIM) and mixed-precision architectures; 3) at the architecture level, quantization research has expanded beyond Transformers to emerging models such as Mamba, SNNs, and diffusion models; 4) at the engineering level, automated toolchains and hardware-aware quantization search (NAS) are accelerating the industrial deployment of quantized models on edge and mobile devices. Overall, the field is evolving from single-precision compression toward system-level energy-efficiency optimization.
A total of 180 related references.
The rapid development of artificial intelligence has driven the continuous advancement of large language models (LLMs). Among them, OpenAI's ChatGPT and DeepSeek-AI's DeepSeek-R1 have garnered significant attention. ChatGPT, built upon the GPT-4 architecture, demonstrates strong natural language understanding and wide-ranging applications, whereas DeepSeek-R1 leverages reinforcement learning techniques to optimize reasoning capabilities, excelling in mathematical reasoning and programming tasks. This paper, based on the latest research on DeepSeek-R1, provides a comprehensive comparison between ChatGPT and DeepSeek-R1 in terms of model architecture, training methods, reasoning capabilities, application scenarios, and openness. The study reveals that ChatGPT relies on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), making it highly effective in natural language processing tasks. In contrast, DeepSeek-R1 emphasizes reinforcement learning to enhance reasoning abilities, particularly excelling in mathematical reasoning and code generation tasks. Moreover, ChatGPT follows a closed-source approach, primarily for commercial use, while DeepSeek-R1 adopts an open-source model, offering greater flexibility for researchers and developers. This study provides valuable insights for AI researchers and developers, contributing to the advancement of LLM technology and future model optimization strategies.
As machine learning inferences increasingly move to edge devices, adapting to diverse computational capabilities, hardware, and memory constraints becomes more critical. Instead of relying on a pre-trained model fixed for all future inference queries across diverse edge devices, we argue that planning an inference pattern with a request-specific model tailored to the device's computational capacity, accuracy requirements, and time constraints is more cost-efficient and robust to diverse scenarios. To this end, we propose an accuracy-aware and workload-balanced inference system that integrates joint model quantization and inference partitioning. In this approach, the server dynamically responds to inference queries by sending a quantized model and adaptively sharing the inference workload with the device. Meanwhile, the device's computational power, channel capacity, and accuracy requirements are considered when deciding. Furthermore, we introduce a new optimization framework for the inference system, incorporating joint model quantization and partitioning. Our approach optimizes layer-wise quantization bit width and partition points to minimize time consumption and cost while accounting for varying accuracy requirements of tasks through an accuracy degradation metric in our optimization model. To our knowledge, this work represents the first exploration of optimizing quantization layer-wise bit-width in the inference serving system, by introducing theoretical measurement of accuracy degradation. Simulation results demonstrate a substantial reduction in overall time and power consumption, with computation payloads decreasing by over 80% and accuracy degradation kept below 1%.
The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As a result, the proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.
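To make the integer-arithmetic-only idea concrete: with the affine mapping r ≈ S·(q − Z), a product of two quantized operands can be accumulated entirely in int32 and rescaled once at the end. The following NumPy sketch shows that mapping in its simplest per-tensor form (no fused bias, activation, or fixed-point requantization multiplier, which the full scheme also specifies).

```python
import numpy as np

def affine_quantize(r, n_bits=8):
    """r ≈ scale * (q - zero_point), with q an unsigned n-bit integer."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (r.max() - r.min()) / (qmax - qmin) + 1e-12
    zero_point = int(round(qmin - r.min() / scale))
    q = np.clip(np.round(r / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def integer_matmul(qa, sa, za, qb, sb, zb):
    """Accumulate (qa - za)(qb - zb) in int32, rescale to float at the end."""
    acc = (qa.astype(np.int32) - za) @ (qb.astype(np.int32) - zb)
    return sa * sb * acc

rng = np.random.default_rng(0)
a, b = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
(qa, sa, za), (qb, sb, zb) = affine_quantize(a), affine_quantize(b)
print(np.max(np.abs(integer_matmul(qa, sa, za, qb, sb, zb) - a @ b)))
```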
Post-training quantization is a key technique for reducing the memory and inference latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of hidden features to the end loss or, when incorporating end loss, (2) neglect the critical interactions between model weights. To address these limitations, we propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which is guaranteed to monotonically decrease the quantization objective value, and outperforms existing methods in this category. We release the code at https://github.com/snu-mllab/GuidedQuant.
Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model quality. Furthermore, we demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit.
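The central observation above is that KV channels are statistically dependent, so quantizing small groups of channels jointly is more information-efficient than quantizing each channel alone. The toy NumPy sketch below couples channels in pairs and encodes each pair with a shared codebook learned by a tiny k-means; the pairing, codebook size, and clustering are illustrative stand-ins for the paper's actual CQ construction and bit packing.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Tiny k-means for building a joint codebook over coupled channels."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean(axis=0)
    return centers

def coupled_quantize(kv, bits_per_channel=2):
    """Quantize channel pairs jointly with a shared 2-D codebook.

    kv: (tokens, channels) activations; channels are grouped in pairs,
    and each pair is mapped to one of 2^(2*bits) codebook entries.
    """
    pairs = kv.reshape(kv.shape[0], -1, 2)              # (tokens, pairs, 2)
    out = np.empty_like(pairs)
    k = 2 ** (2 * bits_per_channel)
    for p in range(pairs.shape[1]):
        pts = pairs[:, p, :]
        centers = kmeans(pts, min(k, len(pts)))
        assign = np.argmin(((pts[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        out[:, p, :] = centers[assign]
    return out.reshape(kv.shape)

rng = np.random.default_rng(0)
kv = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=256)  # correlated channels
print("coupled MSE:", np.mean((kv - coupled_quantize(kv)) ** 2))
```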
As neural networks get deeper and more computationally intensive, model quantization has emerged as a promising compression tool offering lower computational costs with limited performance degradation, enabling deployment on edge devices. Meanwhile, recent studies have shown that neural network models are vulnerable to various security and privacy threats. Among these, membership inference attacks (MIAs) are capable of breaching user privacy by identifying training data from neural network models. This paper investigates the impact of model quantization on the resistance of neural networks against MIA through empirical studies. We demonstrate that quantized models are less likely to leak private information of training data than their full precision counterparts. Our experimental results show that the precision of MIAs on quantized models is 7 to 9 points lower than on their full-precision counterparts at the same recall. To the best of our knowledge, this paper is the first work to study the implication of model quantization on the resistance of neural network models against MIA.
The increasing demand for image generation on mobile devices [1] highlights the need for high-performing image-generative models, including the diffusion model (DM) [2], [3]. A conventional DM requires numerous UNet-based denoising timesteps (~50), leading to high computation and external memory access (EMA) costs. Recently, the Few-Step Diffusion Model (FSDM) [4] was introduced, as shown in Fig. 23.3.1, to reduce the denoising timesteps to 1–4 through knowledge distillation, while maintaining high image quality, reducing computations and EMA by 22.0× and 42.3×, respectively. However, prior diffusion-model architectures, which accelerated many steps of a DM [5], [6] through inter-timestep redundancy in the UNet, fail to speed up the few denoising steps of a FSDM due to the lack of redundancy between timesteps. Moreover, a multi-modal DM introduces additional computational costs for the encoder, and a FSDM shifts computational bottlenecks from the UNet to the encoder and decoder. Additionally, a FSDM becomes more sensitive to quantization due to increased precision demands with fewer denoising steps. To tackle these challenges, we exploit mixed-precision and group quantization [7] as a unified optimization scheme applicable to the encoder, UNet, and decoder in a FSDM, even without inter-timestep redundancy.
Large language models (LLMs) excel in natural language processing but demand intensive computation. To mitigate this, various quantization methods have been explored, yet they compromise LLM performance. This paper unveils a previously overlooked type of outliers in LLMs. Such outliers are found to allocate most of the attention scores on initial tokens of input, termed as pivot tokens, which are crucial to the performance of quantized LLMs. Given that, we propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple and easy to combine with existing quantization solutions with no extra inference overhead. Besides, IntactKV can be calibrated as additional LLM parameters to boost the quantized LLMs further with minimal training costs. Mathematical analysis also proves that IntactKV effectively reduces the upper bound of quantization error. Empirical results show that IntactKV brings consistent improvement over various quantization methods across different LLMs and downstream tasks, leading to the new state-of-the-art for LLM quantization. The codes are available at https://github.com/ruikangliu/IntactKV.
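Mechanically, the idea is easy to picture: generate the KV entries of the first few pivot tokens once with the full-precision model and splice them into the quantized model's cache, leaving every other position quantized. The sketch below demonstrates the splice on dummy cache tensors (the shapes and the per-tensor quantization used to fake a lossy cache are illustrative; in practice the prefix KV comes from the FP16 checkpoint).

```python
import torch

def splice_intact_prefix(kv_quant, kv_full, n_pivot=4):
    """Keep the KV cache of the first `n_pivot` tokens lossless.

    kv_quant / kv_full: (layers, 2, heads, seq, head_dim) tensors holding
    the quantized and full-precision caches for the same prompt.
    """
    kv = kv_quant.clone()
    kv[..., :n_pivot, :] = kv_full[..., :n_pivot, :]   # overwrite pivot positions
    return kv

layers, heads, seq, hd = 2, 4, 16, 8
kv_full = torch.randn(layers, 2, heads, seq, hd)
kv_quant = torch.dequantize(torch.quantize_per_tensor(kv_full, 0.1, 0, torch.qint8))
kv_mixed = splice_intact_prefix(kv_quant, kv_full, n_pivot=4)
print(torch.allclose(kv_mixed[..., :4, :], kv_full[..., :4, :]))
```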
A systematic understanding of Apple Silicon is lacking in the current landscape of hardware efficiency; research focus is largely centered on accelerating GPUs for large-scale training or inference on CUDA devices. This paper investigates Apple Silicon's unique memory architecture that offers a unified memory integrating CPU and GPU memory and its implications for on-device LLM inference. We decipher myths about whether Apple Silicon is efficient for on-device inference compared to competitors such as NVIDIA GPUs by directly conducting latency and throughput comparison benchmarks. We explain the performance gap between them through profiling low level hardware metrics - ALU utilization, memory bandwidth, buffer usage, cache residency etc. at runtime. We draw several insights regarding performance bottlenecks such as dequantization overhead, compute throughput and memory bandwidth. We debunk existing false claims regarding large language model inference such as compressing models to lower bit precision is a defacto promise for faster inference across all hardware platforms. We find that the large unified memory enables Apple Silicon to be both cost effective and efficient against NVIDIA GPUs for ultra large language models. Our large scale evaluation on 5 hardware testbeds incorporating three Apple M-series devices: M2 Ultra, M2 Max and M4 Pro and two NVIDIA GPUs: NVIDIA RTX A6000, a multi GPU setup with 2xNVIDIA RTX A6000, 5 model scales ranging from 8B to 405B parameters and 14 quantization schemes gives an understanding of how Apple Silicon fits within the paradigm of on-device LLM inference. Our analysis reveals multiple resource interdependencies and unexpected findings, while also quantifying established insights. To the best of our knowledge, this study makes the first attempt to present a thorough characterization and analysis of Apple Silicon for on-device inference.
Generative Artificial Intelligence (GAI) is taking the world by storm with its unparalleled content creation ability. Large Language Models (LLMs) are at the forefront of this movement. However, the significant resource demands of LLMs often require cloud hosting, which raises issues regarding privacy, latency, and usage limitations. Although edge intelligence has long been utilized to solve these challenges by enabling real-time AI computation on ubiquitous edge resources close to data sources, most research has focused on traditional AI models and has left a gap in addressing the unique characteristics of LLM inference, such as considerable model size, auto-regressive processes, and self-attention mechanisms. In this paper, we present an edge intelligence optimization problem tailored for LLM inference. Specifically, with the deployment of the batching technique and model quantization on resource-limited edge devices, we formulate an inference model for transformer decoder-based LLMs. Furthermore, our approach aims to maximize the inference throughput via batch scheduling and joint allocation of communication and computation resources, while also considering edge resource constraints and varying user requirements of latency and accuracy. To address this NP-hard problem, we develop an optimal Depth-First Tree-Searching algorithm with online tree-Pruning (DFTSP) that operates within a feasible time complexity. Simulation results indicate that DFTSP surpasses other batching benchmarks in throughput across diverse user settings and quantization techniques, and it reduces time complexity by over 45% compared to the brute-force searching method.
Multimodal large language models (MLLMs) have garnered widespread attention due to their ability to understand multimodal input. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of multimodal large language models (MLLMs). Conventional quantization often struggles with MLLMs because of (a) high inference latency from large visual token counts, (b) distributional disparities between visual and textual tokens, and (c) extreme outliers introduced by Hadamard-based transformations. To address these issues, MQuant introduces: • Modality-Specific Static Quantization (MSQ), assigning distinct static scales for visual vs. textual tokens; • Attention-Invariant Flexible Switching (AIFS), reordering tokens to preserve causal attention while eliminating expensive token-wise scale computations; • Rotation Magnitude Suppression (RMS), mitigating weight outliers arising from online Hadamard rotations. On five mainstream MLLMs (including Qwen-VL, MiniCPM-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30%, significantly outperforming existing PTQ baselines. Our MQuant effectively bridges the gap for efficient and accurate MLLM inference in resource-constrained devices. Code will be released in https://github.com/StiphyJay/MQuant.
Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising technique to reduce memory requirements and decoding latency. However, recent accurate quantization methods often depend on specialized computations or custom data formats to achieve better model quality, which limits their compatibility with popular frameworks, as they require dedicated inference kernels tailored to specific hardware and software platforms, hindering wider adoption. Furthermore, many competitive methods have high resource requirements and computational overhead for quantizing models, making it challenging to scale them to hundreds of billions of parameters. In response to these challenges, we propose LeanQuant (Loss-Error-Aware Network Quantization), a novel quantization method that is accurate, versatile, and scalable. In the existing popular iterative loss-error-based quantization framework, we identify a critical limitation in prior methods: the min-max affine quantization grid fails to preserve model quality due to outliers in inverse Hessian diagonals. To overcome this fundamental issue, we propose learning loss-error-aware grids, instead of using non-adaptive min-max affine grids. Our approach not only produces quantized models that are more accurate but also generalizes to a wider range of quantization types, including affine and non-uniform quantization, enhancing compatibility with more frameworks. Extensive experiments with recent LLMs demonstrate that LeanQuant is highly accurate, comparing favorably against competitive baselines in model quality, and scalable, achieving very accurate quantization of Llama-3.1 405B, one of the largest open-source LLMs to date, using two Quadro RTX 8000-48GB GPUs in 21 hours.
Transformer-based models have gained widespread popularity in varying fields. Deploying these large models for real-world applications inevitably requires fine-tuning on task-specific data, followed by quantization for efficient inference in real-world scenarios. However, QAT (Quantization-Aware Training) is time-consuming and computationally expensive, while PTQ (Post-Training Quantization) often leads to substantial accuracy loss. Since fine-tuning is inevitable, one question arises: can we just make the model quantization-friendly during fine-tuning? The answer is affirmative. We propose QuantTune, a quantization-friendly fine-tuning method based on restricting the dynamic range amplification effect of outliers across Transformer-based models using the proposed outlier-driven loss. Importantly, QuantTune seamlessly integrates into existing fine-tuning workflows without increasing training time or requiring extra inference complexity. Our approach achieves significant improvements across Transformer-based models, including ViT, BERT-base, and OPT, using naive PTQ only. QuantTune reduces accuracy drops by 12.09% at 8-bit quantization compared to top calibration methods, outperforming state-of-the-art solutions by over 18.84% across ViT models, also demonstrating that the proposed QuantTune achieves the best trade-off in real-world scenarios.
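The outlier-driven loss can be approximated as a regularizer that penalizes activations exceeding a target dynamic range during fine-tuning. The toy PyTorch sketch below implements such a penalty with forward hooks; the range limit, the penalty weight, and hooking only nn.Linear outputs are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class RangePenalty:
    """Collect activations via forward hooks and penalize values whose
    magnitude exceeds a target dynamic range (outlier suppression)."""

    def __init__(self, model, limit=4.0):
        self.limit, self.penalty = limit, 0.0
        for m in model.modules():
            if isinstance(m, nn.Linear):
                m.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        excess = (output.abs() - self.limit).clamp(min=0.0)
        self.penalty = self.penalty + excess.pow(2).mean()

    def pop(self):
        p, self.penalty = self.penalty, 0.0
        return p

model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 10))
reg = RangePenalty(model, limit=4.0)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
for _ in range(5):
    # the task loss is evaluated first, which fills the penalty via hooks
    loss = nn.functional.cross_entropy(model(x), y) + 0.1 * reg.pop()
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```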
Diffusion models, widely used for image generation, face significant challenges related to their broad applicability due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues. However, unlike traditional models, diffusion models critically rely on the time-step for the multi-round denoising. Typically, each time-step is encoded into a hypersensitive temporal feature by several modules. Despite this, existing PTQ methods do not optimize these modules individually. Instead, they employ unsuitable reconstruction objectives and complex calibration methods, leading to significant disturbances in the temporal feature and denoising trajectory, as well as reduced compression efficiency. To address these challenges, we introduce a novel quantization framework that includes three strategies: 1) TIB-based Maintenance: Based on our innovative Temporal Information Block (TIB) definition, Temporal Information-aware Reconstruction (TIAR) and Finite Set Calibration (FSC) are developed to efficiently align original temporal features. 2) Cache-based Maintenance: Instead of indirect and complex optimization for the related modules, pre-computing and caching quantized counterparts of temporal features are developed to minimize errors. 3) Disturbance-aware Selection: Employ temporal feature errors to guide a fine-grained selection between the two maintenance strategies for further disturbance reduction. This framework preserves most of the temporal information and ensures high-quality end-to-end generation. Extensive testing on various datasets, diffusion models and hardware confirms our superior performance and acceleration.
Efficient deep learning models, especially those optimized for edge devices, benefit from advantages ranging from low inference latency to efficient energy consumption. Two classical techniques for efficient model inference are lightweight neural architecture search (NAS), which automatically designs compact network models, and quantization, which reduces the bit-precision of neural network models. As a consequence, joint design of both neural architecture and quantization precision settings is becoming increasingly popular. There are three main aspects that affect the performance of the joint optimization between neural architecture and quantization: 1) quantization precision selection (QPS); 2) quantization-aware training (QAT); and 3) NAS. However, existing works focus on at most two of these aspects, resulting in suboptimal performance. To this end, we propose a novel automatic optimization framework, DAQU, that allows jointly searching for Pareto-optimal neural architecture and quantization precision combination among more than $10^{47}$ quantized subnet models. To overcome the instability of the conventional automatic optimization framework, DAQU incorporates a warm-up strategy to reduce the accuracy gap among different neural architectures, and a precision-transfer training approach to maintain flexibility among different quantization precision settings. Our experiments show that the quantized lightweight neural networks generated by DAQU consistently outperform state-of-the-art NAS and quantization joint optimization methods.
Quantization and pruning are fundamental approaches for model compression, enabling efficient inference for language models. In a post-training setting, state-of-the-art quantization and pruning methods require calibration data, a small set of unlabeled examples. Conventionally, this is randomly sampled web text, aiming to reflect the model training data. However, this poses two key problems: (1) unrepresentative calibration examples can harm model performance, and (2) organizations increasingly avoid releasing model training data. In this paper, we propose self-calibration as a solution. Our approach requires no external data, instead leveraging the model itself to generate synthetic calibration data, with a view to better approximating the pre-training data distribution. We extensively compare the performance of self-calibration with several baselines, across a variety of models, compression methods, and tasks. Our approach proves consistently competitive in maximizing downstream task performance, frequently outperforming even using real data.
Model compression has emerged as a mainstream solution to reduce memory usage and computational overhead. This paper presents Group Quantization and Sparse Acceleration (GQSA), a novel compression technique tailored for LLMs. Traditional methods typically focus exclusively on either quantization or sparsification, but relying on a single strategy often results in significant performance loss at high compression rates. In contrast, GQSA integrates quantization and sparsification in a tightly coupled manner, leveraging GPU-friendly structured group sparsity and quantization for efficient acceleration. Building upon system-algorithm co-design principles, we propose a two-stage sparse optimization strategy that ensures the performance superiority of the compressed model. On the engine side, we introduce a "task-centric" parallel strategy, which, to the best of our knowledge, is the first application in the domain of sparse computing. Compared to the traditional 2:4 sparse method, GQSA offers a more flexible and adjustable sparsity rate, as well as a higher weight compression rate, and is efficiently compatible with weight-only quantization methods. Experimental results demonstrate that, under the GQSA W4S50% compression setting, the model's accuracy surpasses that of both 2:4 pruning and W2 quantization. Furthermore, at the inference level, GQSA outperforms W2 by 1.26× and 2:4 pruning by 2.35× in terms of speed.
Quantization is a powerful tool to improve large language model (LLM) inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations to low precision is challenging without degrading model accuracy. We propose fine-grained mixed precision (FGMP) quantization, a post-training mixed-precision quantization hardware-software co-design methodology that maintains accuracy while quantizing the majority of weights and activations to reduced precision. Our work makes the following contributions: 1) We develop a policy that uses the perturbation in each value, weighted by the Fisher information, to select which weight and activation blocks to keep in higher precision. This approach preserves accuracy by identifying which weight and activation blocks need to be retained in higher precision to minimize the perturbation in the model loss. 2) We also propose a sensitivity-weighted clipping approach for fine-grained quantization which helps retain accuracy for blocks that are quantized to low precision. 3) We then propose hardware augmentations to leverage the efficiency benefits of FGMP quantization. Our hardware implementation encompasses i) datapath support for FGMP at block granularity, and ii) a mixed-precision activation quantization unit to assign activation blocks to high or low precision on the fly with minimal runtime and energy overhead. Our design, prototyped using NVFP4 (an FP4 format with microscaling) as the low-precision datatype and FP8 as the high-precision datatype, facilitates efficient FGMP quantization, attaining <1% perplexity degradation on Wikitext-103 for the Llama-2-7B model relative to an all-FP8 baseline design while consuming 14% less energy during inference and requiring 30% less weight memory.
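The selection policy in FGMP-style schemes can be illustrated with a small sketch: score each weight block by its Fisher-weighted quantization perturbation and keep the top-scoring fraction in high precision. In the NumPy sketch below, the block size, bit widths, keep ratio, and the random stand-in for Fisher information are all placeholders; the actual design targets NVFP4/FP8 blocks with hardware support.

```python
import numpy as np

def quantize_block(w, n_bits):
    qmax = 2 ** (n_bits - 1) - 1
    s = np.max(np.abs(w)) / qmax + 1e-12
    return np.clip(np.round(w / s), -qmax, qmax) * s

def fgmp_select(weights, fisher, block=16, low_bits=4, keep_frac=0.1):
    """Keep roughly the `keep_frac` blocks with the largest Fisher-weighted
    perturbation in high precision; quantize the rest to `low_bits`."""
    wb = weights.reshape(-1, block)
    fb = fisher.reshape(-1, block)
    scores = np.array([np.sum(fb[i] * (wb[i] - quantize_block(wb[i], low_bits)) ** 2)
                       for i in range(len(wb))])
    keep = scores >= np.quantile(scores, 1.0 - keep_frac)
    out = np.stack([wb[i] if keep[i] else quantize_block(wb[i], low_bits)
                    for i in range(len(wb))])
    return out.reshape(weights.shape), keep

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
fisher = rng.gamma(1.0, 1.0, size=(64, 64))        # stand-in for Fisher information
w_mixed, keep_mask = fgmp_select(w, fisher)
print("blocks kept in high precision:", int(keep_mask.sum()), "/", keep_mask.size)
```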
Quantizing deep neural networks, reducing the precision (bit-width) of their computations, can remarkably decrease memory usage and accelerate processing, making these models more suitable for large-scale medical imaging applications with limited computational resources. However, many existing methods studied "fake quantization", which simulates lower precision operations during inference, but does not actually reduce model size or improve real-world inference speed. Moreover, the potential of deploying real 3D low-bit quantization on modern GPUs is still unexplored. In this study, we introduce a real post-training quantization (PTQ) framework that successfully implements true 8-bit quantization on state-of-the-art (SOTA) 3D medical segmentation models, i.e., U-Net, SegResNet, SwinUNETR, nnU-Net, UNesT, TransUNet, ST-UNet, and VISTA3D. Our approach involves two main steps. First, we use TensorRT to perform fake quantization for both weights and activations with an unlabeled calibration dataset. Second, we convert this fake quantization into real quantization via the TensorRT engine on real GPUs, resulting in real-world reductions in model size and inference latency. Extensive experiments demonstrate that our framework effectively performs 8-bit quantization on GPUs without sacrificing model performance. This advancement enables the deployment of efficient deep learning models in medical imaging applications where computational resources are constrained. The code and models have been released, including U-Net, TransUNet pretrained on the BTCV dataset for abdominal (13-label) segmentation, UNesT pretrained on the Whole Brain Dataset for whole brain (133-label) segmentation, and nnU-Net, SegResNet, SwinUNETR and VISTA3D pretrained on TotalSegmentator V2 for full body (104-label) segmentation. https://github.com/hrlblab/PTQ.
The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
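The "formatbook" mechanism can be sketched as follows: predefine a handful of normalized value grids and, for each block, pick the grid that minimizes reconstruction error. The grids below are invented stand-ins rather than the paper's DialectFP4 tables; the point is only to show the per-block selection mechanics.

```python
import numpy as np

# A toy "formatbook": each entry is a normalized set of representable magnitudes.
FORMATBOOK = {
    "uniform":   np.linspace(0, 1, 8),
    "fp4-like":  np.array([0, 0.0625, 0.125, 0.25, 0.375, 0.5, 0.75, 1.0]),
    "log-heavy": np.array([0, 0.02, 0.05, 0.1, 0.2, 0.4, 0.7, 1.0]),
}

def quantize_with_grid(block, grid):
    scale = np.max(np.abs(block)) + 1e-12
    mag = np.abs(block) / scale
    idx = np.argmin(np.abs(mag[:, None] - grid[None, :]), axis=1)
    return np.sign(block) * grid[idx] * scale

def blockdialect_like(x, block=32):
    """Pick, per block, the format whose grid gives the lowest MSE."""
    blocks = x.reshape(-1, block)
    out, chosen = np.empty_like(blocks), []
    for i, b in enumerate(blocks):
        errs = {name: np.mean((b - quantize_with_grid(b, g)) ** 2)
                for name, g in FORMATBOOK.items()}
        best = min(errs, key=errs.get)
        chosen.append(best)
        out[i] = quantize_with_grid(b, FORMATBOOK[best])
    return out.reshape(x.shape), chosen

x = np.random.default_rng(0).standard_t(df=3, size=(4, 64))   # heavy-tailed activations
xq, formats = blockdialect_like(x)
print("chosen formats per block:", formats)
```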
Deep neural networks have shown remarkable capabilities in computer vision applications. However, their complex architectures can pose challenges for efficient real-time deployment on edge devices, as they demand significant computational resources and energy. To overcome these challenges, TensorRT has been developed to optimize neural network models trained on major frameworks to speed up inference and minimize latency. It enables inference optimization using techniques such as model quantization, which reduces computation by lowering the precision of the data type. The focus of our paper is to evaluate the effectiveness of TensorRT for model quantization. We conduct a comprehensive assessment of the accuracy, inference time, and throughput of TensorRT-quantized models on an edge device. Our findings indicate that quantization in TensorRT significantly enhances the efficiency of inference metrics while maintaining a high level of inference accuracy. Additionally, we explore various workflows for implementing quantization using TensorRT and discuss their advantages and disadvantages. Based on our analysis of these workflows, we provide recommendations for selecting an appropriate workflow for different application scenarios.
With the recent trend of using Large Language Models (LLMs) for different applications within smart cities, there is a need to push these models toward the network edge while still preserving their performance. Edge Computing (EC), as a computing resource physically closer to end users, can help reduce the communication delay in serving end users' tasks for LLM-dependent services. However, EC servers have limited communication, computation, and storage capacity. This paper introduces DILEMMA, a novel framework addressing the challenges of deploying LLMs in EC systems by jointly optimizing layer placement and layer quantization. DILEMMA formulates an Integer Linear Programming problem to minimize total inference delay while ensuring acceptable LLM performance levels, leveraging layer-wise quantization and knowledge distillation for LLM performance control. Experimental evaluations on the OPT-350 model using the SQuAD dataset demonstrate that DILEMMA achieves a quantization ratio of up to 12.75% while preserving model loss, highlighting its effectiveness in resource-constrained environments.
One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache. Existing approaches, such as quantization, have demonstrated promising results in reducing memory usage. However, current quantization methods cannot take both effectiveness and efficiency into account. In this paper, we propose MoQAE, a novel mixed-precision quantization method via mixture of quantization-aware experts. First, we view different quantization bit-width configurations as experts and use the traditional mixture of experts (MoE) method to select the optimal configuration. To avoid the inefficiency caused by inputting tokens one by one into the router in the traditional MoE method, we input the tokens into the router chunk by chunk. Second, we design a lightweight router-only fine-tuning process to train MoQAE with a comprehensive loss to learn the trade-off between model accuracy and memory usage. Finally, we introduce a routing freezing (RF) and a routing sharing (RS) mechanism to further reduce the inference overhead. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.
Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerable inference latency and GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache in LLMs at token granularity, which makes the search process time-consuming and the computation hardware-inefficient. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search quickly determines the optimal bitwidth configuration of the KV cache chunks based on the similarity scores between the corresponding context chunks and the query, maintaining model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization during inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets. Our code is available at https://github.com/Sullivan12138/Cocktail.
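A minimal sketch of the chunk-level KV cache idea shared by the two entries above: the cache is split into chunks and each chunk is assigned a bitwidth. The relevance rule below (mean key-query score, top third gets 8 bits) is an illustrative assumption, not Cocktail's actual search or MoQAE's router.

```python
import numpy as np

def quantize_chunk(x, n_bits):
    """Asymmetric uniform quantization of one KV chunk."""
    qmax = 2 ** n_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax + 1e-12
    return np.round((x - lo) / scale) * scale + lo

def quantize_kv_by_chunk(kv, query, chunk_len=64, bit_levels=(8, 4, 2)):
    """Give more bits to chunks whose keys score higher against the current query (illustrative rule)."""
    n_chunks = int(np.ceil(len(kv) / chunk_len))
    scores = np.array([float(np.mean(kv[i * chunk_len:(i + 1) * chunk_len] @ query))
                       for i in range(n_chunks)])
    order = np.argsort(scores)[::-1]                       # most query-relevant chunks first
    bits = np.empty(n_chunks, dtype=int)
    for level, idx in zip(bit_levels, np.array_split(order, len(bit_levels))):
        bits[idx] = level
    out = kv.copy()
    for i in range(n_chunks):
        sl = slice(i * chunk_len, (i + 1) * chunk_len)
        out[sl] = quantize_chunk(kv[sl], bits[i])
    return out, bits

rng = np.random.default_rng(0)
kv = rng.normal(size=(512, 64)).astype(np.float32)         # cached keys (or values), one row per token
query = rng.normal(size=64).astype(np.float32)
kv_q, chunk_bits = quantize_kv_by_chunk(kv, query)
print(chunk_bits)
```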
Model quantization has emerged as an indispensable technique to accelerate deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and undeployable, because researchers do not choose consistent training pipelines and ignore the requirements of hardware deployment. In this work, we propose Model Quantization Benchmark (MQBench), a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms. We choose multiple different platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate extensive state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts as a bridge connecting algorithms and hardware. We conduct a comprehensive analysis and uncover a number of intuitive and counter-intuitive insights. By aligning the training settings, we find that existing algorithms have about the same performance on the conventional academic track, while for hardware-deployable quantization a large accuracy gap remains unresolved. Surprisingly, no existing algorithm wins every challenge in MQBench, and we hope this work can inspire future research directions.
Due to the rising privacy concerns on sensitive client data and trained models like Transformers, secure multi-party computation (MPC) techniques are employed to enable secure inference despite attendant overhead. Existing works attempt to reduce the overhead using more MPC-friendly non-linear function approximations. However, the integration of quantization widely used in plaintext inference into the MPC domain remains unclear. To bridge this gap, we propose the framework named Ditto to enable more efficient quantization-aware secure Transformer inference. Concretely, we first incorporate an MPC-friendly quantization into Transformer inference and employ a quantization-aware distillation procedure to maintain the model utility. Then, we propose novel MPC primitives to support the type conversions that are essential in quantization and implement the quantization-aware MPC execution of secure quantized inference. This approach significantly decreases both computation and communication overhead, leading to improvements in overall efficiency. We conduct extensive experiments on Bert and GPT2 models to evaluate the performance of Ditto. The results demonstrate that Ditto is about $3.14\sim 4.40\times$ faster than MPCFormer (ICLR 2023) and $1.44\sim 2.35\times$ faster than the state-of-the-art work PUMA with negligible utility degradation.
Convolutional neural networks (CNNs) are widely utilized in intelligent edge computing applications such as computer vision and image processing. However, as the number of layers of a CNN model increases, the number of parameters and computations grows, making it increasingly challenging to accelerate in edge computing applications. To effectively balance the speed and accuracy of CNN inference for smart applications, this paper proposes APPQ-CNN, an FPGA-based adaptive CNN inference accelerator that synergistically combines filter pruning, fixed-point parameter quantization, and multi-computing-unit parallelism. First, we devise a hybrid pruning algorithm based on the L1-norm and APoZ to measure filter importance, and a configurable fixed-point parameter quantization computing architecture in place of a floating-point architecture. We then design a pipelined CNN kernel architecture cascaded with configurable multiple computation units. Finally, we conduct extensive performance exploration and comparison experiments on various real and synthetic datasets. With negligible accuracy loss, APPQ-CNN outperforms the current state-of-the-art FPGA-based accelerators PipeCNN and OctCNN in speed by 2.15× and 1.91×, respectively. Furthermore, APPQ-CNN provides configurable fixed-point quantization bit-widths, filter pruning rates, and computation unit counts to meet practical application performance requirements in edge computing.
Transformer-based large language models (LLMs) have achieved remarkable success as model sizes continue to grow, yet their deployment remains challenging due to significant computational and memory demands. Quantization has emerged as a promising solution, and state-of-the-art quantization algorithms for LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), where lower-precision weights are multiplied with higher-precision activations. Despite its benefits, current hardware accelerators such as GPUs and TPUs lack native support for efficient mpGEMM, leading to inefficient dequantization operations in the main sequential loop. To address this limitation, we introduce MixPE, a specialized mixed-precision processing element designed for efficient low-bit quantization in LLM inference. MixPE leverages two key innovations to minimize dequantization overhead and unlock the full potential of low-bit quantization. First, recognizing that scale and zero point are shared within each quantization group, we propose performing dequantization after per-group mpGEMM, significantly reducing dequantization overhead. Second, instead of relying on conventional multipliers, MixPE utilizes efficient shift-and-add operations for multiplication, optimizing both computation and energy efficiency. Our experimental results demonstrate that MixPE surpasses the state-of-the-art quantization accelerators by $2.6\times$ speedup and $1.4\times$ energy reduction.
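A small numpy sketch of the operation ordering the entry above advocates: accumulate the integer products within each quantization group first, then apply the shared scales once per group rather than once per multiply-accumulate. Group size, bitwidths, and layout here are assumptions for illustration.

```python
import numpy as np

def group_quantize(x, n_bits=4, group=32):
    """Symmetric per-group quantization along the reduction axis; returns int codes and fp scales."""
    qmax = 2 ** (n_bits - 1) - 1
    xg = x.reshape(x.shape[0], -1, group)                        # (rows, n_groups, group)
    scale = np.max(np.abs(xg), axis=-1, keepdims=True) / qmax + 1e-12
    q = np.round(xg / scale).clip(-qmax, qmax).astype(np.int32)
    return q, scale

def grouped_mpgemm(wq, ws, xq, xs):
    """Integer accumulation inside each group, one dequantizing multiply per group, then sum groups."""
    # wq: (out, n_groups, g) int codes, xq: (rows, n_groups, g) int codes.
    acc = np.einsum("ojg,ijg->oij", wq, xq)                      # integer partial sums per group
    deq = acc * (ws[:, None, :, 0] * xs[None, :, :, 0])          # per-group scale applied after the GEMM
    return deq.sum(axis=-1)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 128)).astype(np.float32)                 # weights: (out_features, in_features)
x = rng.normal(size=(4, 128)).astype(np.float32)                 # activations: (tokens, in_features)
wq, ws = group_quantize(w, n_bits=4)
xq, xs = group_quantize(x, n_bits=8)
approx = grouped_mpgemm(wq, ws, xq, xs)
print(np.abs(approx - w @ x.T).max())                            # small quantization error, not exact
```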
Systems that use Deep Learning (DL) models extensively utilize cloud computing for inference tasks in various domains such as traffic monitoring, healthcare, and IoT. However, applications like autonomous vehicles, surveillance systems, and spacecraft are transitioning towards edge computing due to bandwidth limitations, transmission delays, and network connectivity issues. Edge computing mitigates these challenges by reducing latency through local data and model processing on the device. Implementing Deep Neural Networks (DNNs) on edge devices faces resource constraints, such as limited memory and computing power. DNNs employ 32-bit floating-point precision for accuracy, leading to inflated model sizes. Quantization offers a solution by converting high-precision floating-point (FP) values to lower-precision or integer (INT) values, improving throughput and latency. This paper presents a comparative study of the accuracy and performance of 64-bit, 32-bit, and 16-bit floating-point instructions, along with 8-bit integer instructions, using Post Training Quantization (PTQ) and Quantization-Aware Training (QAT), on multiple networks including CustomNets, inferenced on a GPU as well as a Xilinx Deep Processing Unit (DPU). The models were evaluated on a sample of the EuroSat Remote Sensing dataset. Quantizing models to FP16 and INT8 resulted in 2-3x and 4x faster inference, respectively, with a negligible accuracy decrease of 1-4%. FP64 exhibited a 2-3x decrease in speed but a slight accuracy improvement (2%). On the DPU, models showed minimal accuracy degradation of about 1%. Overall, model size decreased by a constant 2x and 4x from FP32 to FP16 and INT8, respectively, while increasing by 2x for FP64. This reduction in size, with negligible loss in accuracy, enables onboard storage along with faster and accurate inference on resource-constrained systems.
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4 bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model (acoustic encoder and language model) and achieve near-iso-accuracy. We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance while limiting the computational overhead of QAT. Density ratio Language Model fusion has shown remarkable accuracy gains on RNN-T workloads but it severely increases the computational cost of inference. We show that our quantization strategies enable using large beam widths for hypothesis search while achieving streaming-compatible runtimes and a full model compression ratio of 7.6$\times$ compared to the full precision model. Via hardware simulations, we estimate a 3.4$\times$ acceleration from FP16 to INT4 for the end-to-end quantized RNN-T inclusive of LM fusion, resulting in a Real Time Factor (RTF) of 0.06. On the NIST Hub5 2000, Hub5 2001, and RT-03 test sets, we retain most of the gains associated with LM fusion, improving the average WER by $>$1.5%.
The latest industrial inference engines, such as FasterTransformer and TurboTransformers, have verified that half-precision floating point (FP16) and 8-bit integer (INT8) quantization can greatly improve model inference speed. However, existing INT8 quantization methods are too complicated, and improper usage can greatly degrade model performance. In this paper, we develop a toolkit for users to easily quantize their models for inference, in which Self-Adaptive Mixed-Precision (SAMP) is proposed to automatically control the quantization rate through a mixed-precision architecture that balances model accuracy and efficiency. Experimental results show that our SAMP toolkit achieves a higher speedup than PyTorch and FasterTransformer while ensuring the required accuracy. In addition, SAMP is based on a modular design that decouples the tokenizer, embedding, encoder, and target layers, which allows users to handle various downstream tasks and can be seamlessly integrated into PyTorch.
Deep Learning has been one of the most disruptive technological advancements in recent times. The high performance of deep learning models comes at the expense of high computational, storage, and power requirements. Sensing the immediate need for accelerating and compressing these models to improve on-device performance, we introduce Deeplite Neutrino for production-ready optimization of the models and Deeplite Runtime for deployment of ultra-low bit quantized models on Arm-based platforms. We implement low-level quantization kernels for Armv7 and Armv8 architectures, enabling deployment on the vast array of 32-bit and 64-bit Arm-based devices. With efficient implementations using vectorization, parallelization, and tiling, we realize speedups of up to 2x and 2.2x compared to TensorFlow Lite with the XNNPACK backend on classification and detection models, respectively. We also achieve significant speedups of up to 5x and 3.2x compared to ONNX Runtime for classification and detection models, respectively.
Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between LLMs' computation/memory overheads and hardware capacity. However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads. This paper proposes FlightLLM, enabling efficient LLM inference with a complete mapping flow on FPGAs. In FlightLLM, we highlight that the computation and memory overhead of LLMs can be addressed by utilizing FPGA-specific resources (e.g., DSP48 and a heterogeneous memory hierarchy). First, we propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length-adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0× higher energy efficiency and 1.8× better cost efficiency against commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant at a batch size of one. FlightLLM beats the NVIDIA A100 GPU with 1.2× higher throughput using the latest Versal VHK158 FPGA.
Machine learning deployment on edge devices faces challenges such as computational costs and privacy issues. A membership inference attack (MIA) is an attack in which the adversary aims to infer whether a data sample belongs to the training set; in other words, user data privacy might be compromised by MIA on a well-trained model. It is therefore vital to have defense mechanisms in place to protect training data, especially in privacy-sensitive applications such as healthcare. This paper examines the implications of quantization for privacy leakage and proposes a novel quantization method that enhances the resistance of a neural network against MIA. Recent studies have shown that model quantization leads to resistance against membership inference attacks; however, existing quantization approaches primarily prioritize performance and energy efficiency. Unlike conventional quantization methods whose primary objectives are compression or increased speed, our proposed quantization framework has the main objective of boosting resistance against MIA. We evaluate the effectiveness of our methods on various popular benchmark datasets and model architectures. All popular evaluation metrics, including precision, recall, and F1-score, show improvement when compared to the full-bitwidth model. For example, for ResNet on CIFAR-10, our experimental results show that our algorithm can reduce the attack accuracy of MIA by 14%, the true positive rate by 37%, and the F1-score of members by 39% compared to the full-bitwidth network. Here, a reduction in true positive rate means the attacker will not be able to identify the training dataset members, which is the main goal of the MIA.
No abstract available
Large language models (LLMs) have shown great success in content generation and intelligent decision making for IoT systems. Traditionally, LLMs are deployed on the cloud, incurring prolonged latency, high bandwidth costs, and privacy concerns. More recently, edge computing has been considered promising in addressing such concerns because edge devices are closer to data sources. However, edge devices are constrained by their limited resources and can hardly afford LLMs. Existing studies address this limitation by offloading heavy workloads from edge to cloud or compressing LLMs via model quantization. These methods either still rely heavily on the remote cloud or suffer substantial accuracy loss. This work is the first to deploy LLMs in a collaborative edge computing environment, in which edge devices and cloud servers share resources and collaborate to infer LLMs with high efficiency and no accuracy loss. We design EdgeShard, a novel approach to partition a computation-intensive LLM into affordable shards and deploy them on distributed devices. The partition and distribution are nontrivial, considering device heterogeneity, bandwidth limitations, and model complexity. To this end, we formulate an adaptive joint device selection and model partition problem and design an efficient dynamic programming algorithm to optimize inference latency and throughput. Extensive experiments with the popular Llama2 series models on a real-world testbed reveal that EdgeShard achieves up to 50% latency reduction and 2× throughput improvement over the state-of-the-art.
Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging. Quantization is a promising approach to reducing model complexity, and the dyadic arithmetic pipeline can allow quantized models to perform efficient integer-only inference. Unfortunately, dyadic arithmetic is based on the homogeneity condition in convolutional neural networks, which is not applicable to the non-linear components in ViTs, making integer-only inference of ViTs an open issue. In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs, to enable ViTs to perform the entire computational graph of inference with integer arithmetic and bit-shifting, without any floating-point arithmetic. In I-ViT, linear operations (e.g., MatMul and Dense) follow the integer-only pipeline with dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and LayerNorm) are approximated by the proposed lightweight integer-only arithmetic methods. More specifically, I-ViT applies the proposed Shiftmax and ShiftGELU, which are designed to use integer bit-shifting to approximate the corresponding floating-point operations. We evaluate I-ViT on various benchmark models and the results show that integer-only INT8 quantization achieves comparable (or even slightly higher) accuracy than the full-precision (FP) baseline. Furthermore, we utilize TVM for practical hardware deployment on the GPU's integer arithmetic units, achieving a 3.72×-4.11× inference speedup compared to the FP model. Code for both PyTorch and TVM is released at https://github.com/zkkli/I-ViT.
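The exact Shiftmax and ShiftGELU constructions are defined in the paper; the sketch below only illustrates the underlying trick of replacing the exponential with a base-2 decomposition so that the integer part becomes a bit shift and the fractional part a cheap linear term. The fixed-point format and base-2 (rather than base-e) softmax are simplifying assumptions.

```python
import numpy as np

def shift_exp2(x_int, frac_bits=8):
    """Approximate 2**x for non-positive fixed-point x (x_int = round(x * 2**frac_bits)).

    Write x = -k + f with integer k >= 0 and f in [0, 1): then 2**x = 2**f >> k,
    and 2**f is approximated linearly by (1 + f), so the whole thing is an add and a shift.
    """
    one = 1 << frac_bits
    k = -(x_int >> frac_bits)               # floor division keeps the remainder non-negative
    f = x_int + (k << frac_bits)            # fractional part in fixed point, 0 <= f < one
    return (one + f) >> k

def shift_softmax(x_int, frac_bits=8):
    """Integer-only softmax sketch: subtract the max, approximate the exponential, normalise.

    Base-2 exponentials are used here; real integer-only kernels fold the log2(e) factor into
    the fixed-point inputs, and the final division would also be integerised in hardware.
    """
    x_int = np.asarray(x_int, dtype=np.int64) - x_int.max()
    e = shift_exp2(x_int, frac_bits)
    return e / e.sum()

scores = np.array([-3.0, -1.2, 0.0, -0.4])
print(shift_softmax(np.round(scores * 256).astype(np.int64)))
```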
Quantization is commonly used in Deep Neural Networks (DNNs) to reduce storage and computational complexity by decreasing the arithmetic precision of activations and weights, a.k.a. tensors. Efficient hardware architectures employ linear quantization to enable the deployment of recent DNNs onto embedded systems and mobile devices. However, linear uniform quantization cannot usually reduce the numerical precision to less than 8 bits without sacrificing model accuracy. The performance loss is due to the fact that tensors do not follow uniform distributions. In this paper, we show that a significant fraction of tensors follow an exponential distribution. We then propose DNA-TEQ, which quantizes DNN tensors exponentially with an adaptive scheme that achieves the best trade-off between numerical precision and accuracy loss. The experimental results show that DNA-TEQ provides a much lower quantization bit-width than previous proposals, resulting in an average compression ratio of 40% over the linear INT8 baseline, with negligible accuracy loss and without retraining the DNNs. Besides, DNA-TEQ leads the way in performing dot-product operations in the exponential domain. On average, for a set of widely used DNNs, DNA-TEQ provides 1.5x speedup and 2.5x energy savings over a baseline DNN accelerator based on 3D-stacked memory.
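A hedged sketch of the representational idea above: snap each value to the nearest level of the form ±alpha·base^n. The crude grid search over the base is an illustrative stand-in for the adaptive scheme, not the paper's algorithm.

```python
import numpy as np

def exp_quantize(x, n_levels=15, alpha=None, base=0.8):
    """Snap |x| to the nearest alpha * base**n (n = 0..n_levels-1), keeping the sign; tiny values map to 0."""
    mags = np.abs(x)
    if alpha is None:
        alpha = mags.max() + 1e-12
    grid = alpha * base ** np.arange(n_levels)          # descending representable magnitudes
    idx = np.abs(mags.reshape(-1, 1) - grid).argmin(axis=1)
    q = np.sign(x) * grid[idx].reshape(x.shape)
    q[mags < grid[-1] / 2] = 0.0                        # below the smallest level -> zero
    return q

def fit_base(x, candidates=np.linspace(0.5, 0.95, 10)):
    """Pick the decay base with the lowest squared reconstruction error (simple stand-in for the adaptive search)."""
    errs = [np.sum((x - exp_quantize(x, base=b)) ** 2) for b in candidates]
    return float(candidates[int(np.argmin(errs))])

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
b = fit_base(w)
print("base:", b, "MSE:", np.mean((w - exp_quantize(w, base=b)) ** 2))
```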
Quantization plays a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture, aiming to jointly improve the performance of quantized models. MoQE combines multiple quantization variants of one full-precision model as specialized "quantization experts" and dynamically routes input data to the most suitable expert based on its characteristics. MoQE alleviates the performance degradation commonly seen in single quantization models through these specialized quantization expert models. We design lightweight, structure-aware router models tailored for both CV and NLP tasks. Experimental evaluations on the ResNet, LLaMA, and Qwen model families across benchmark datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that MoQE achieves performance comparable to SOTA quantization models without incurring significant increases in inference latency.
Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Excessive quantization, reducing precision too aggressively, results in accuracy degradation. When scale factors are shared at a coarse granularity across many dimensions of each tensor, the effective precision of individual elements within the tensor is limited. To reduce quantization-related accuracy loss, we propose using a separate scale factor for each small vector of ($\approx$16-64) elements within a single dimension of a tensor. To achieve an efficient hardware implementation, the per-vector scale factors can be implemented with low-bitwidth integers when calibrated using a two-level quantization scheme. We find that per-vector scaling consistently achieves better inference accuracy at low precision compared to conventional scaling techniques for popular neural networks, without requiring retraining. We also modify a deep learning accelerator hardware design to study the area and energy overheads of per-vector scaling support. Our evaluation demonstrates that per-vector scaled quantization with 4-bit weights and activations achieves 37% area saving and 24% energy saving while maintaining over 75% accuracy for ResNet50 on ImageNet. 4-bit weights and 8-bit activations achieve near-full-precision accuracy for both BERT-base and BERT-large on SQuAD while reducing area by 26% compared to an 8-bit baseline.
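A minimal numpy sketch of the two-level, per-vector scaling described above: each small vector gets its own fine scale, and the fine scales are themselves quantized to low-bit integers against a coarse per-row scale. The vector size and bitwidths below are illustrative choices.

```python
import numpy as np

def per_vector_quantize(w, vec=16, wt_bits=4, scale_bits=8):
    """Two-level quantization: int weights per vector, int per-vector scales, one fp scale per row."""
    qmax_w = 2 ** (wt_bits - 1) - 1
    qmax_s = 2 ** scale_bits - 1
    rows, cols = w.shape
    wv = w.reshape(rows, cols // vec, vec)
    # Level 1: ideal (float) scale for each vector.
    s_fine = np.max(np.abs(wv), axis=-1) / qmax_w + 1e-12            # (rows, n_vec)
    # Level 2: quantize the fine scales against one coarse float scale per row.
    s_coarse = s_fine.max(axis=-1, keepdims=True) / qmax_s + 1e-12   # (rows, 1)
    s_fine_q = np.round(s_fine / s_coarse).clip(1, qmax_s)           # low-bit integer scales
    scale = s_fine_q * s_coarse                                      # effective per-vector scale
    q = np.round(wv / scale[..., None]).clip(-qmax_w, qmax_w)
    return (q * scale[..., None]).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 128)).astype(np.float32)
print("per-vector MSE:", np.mean((w - per_vector_quantize(w)) ** 2))
```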
No abstract available
No abstract available
No abstract available
Neural Network (NN) polar decoders have been getting much attention as a viable replacement for conventional decoders in 5G New Radio (NR). Despite scalability issues, the NN-based decoder is a promising technology as it can improve the latency of the standard Successive Cancellation (SC) decoder. It was shown that the Neural Successive Cancellation (NSC) decoder has an improved theoretical latency compared to the standard SC decoder. However, in contrast to SC, the NSC decoder uses large floating-point weight matrices which do not fit in CPU caches, leading to higher energy usage and lower computational performance due to the increased memory traffic. Additionally, such a high memory requirement would be expensive to implement in hardware and would require complex floating-point arithmetic. This paper presents a new low-precision NN decoder that can replace memory-heavy NN decoders inside the NSC decoder. We show that up to a 54-times reduction in weight size can be achieved with wireless performance degradation varying between 0.1 dB and 0.4 dB compared to the floating-point implementation. Moreover, we show reductions of up to 438× and 555× in L1 and L2 data cache misses, respectively, in our prototype software implementation.
Low-precision quantization is recognized for its efficacy in neural network optimization. Our analysis reveals that non-quantized elementwise operations, which are prevalent in layers such as parameterized activation functions, batch normalization, and quantization scaling, dominate the inference cost of low-precision models. These non-quantized elementwise operations are commonly overlooked in SOTA efficiency metrics such as Arithmetic Computation Effort (ACE) [46]. In this paper, we propose ACEv2, an extended version of ACE which offers a better alignment with the inference cost of quantized models and their energy consumption on ML hardware. Moreover, we introduce PikeLPN (Pike is a slim, fast fish; LPN stands for Low-Precision Network), a model that addresses these efficiency issues by applying quantization to both elementwise operations and multiply-accumulate operations. In particular, we present a novel quantization technique for batch normalization layers named QuantNorm which allows for quantizing the batch normalization parameters without compromising the model performance. Additionally, we propose applying Double Quantization where the quantization scaling parameters are quantized. Furthermore, we recognize and resolve the issue of distribution mismatch in Separable Convolution layers by introducing Distribution-Heterogeneous Quantization which enables quantizing them to low precision. PikeLPN achieves Pareto-optimality in the efficiency-accuracy trade-off with up to 3× efficiency improvement compared to SOTA low-precision models.
One of the major bottlenecks in high-resolution Earth Observation (EO) space systems is the downlink between the satellite and the ground. Due to hardware limitations, on-board power limitations or ground-station operation costs, there is a strong need to reduce the amount of data transmitted. Various processing methods can be used to compress the data. One of them is the use of on-board deep learning to extract relevant information in the data. However, most ground-based deep neural network parameters and computations are performed using single-precision floating-point arithmetic, which is not adapted to the context of on-board processing. We propose to rely on quantized neural networks and study how to combine low precision (mini) floating-point arithmetic with a Quantization-Aware Training methodology. We evaluate our approach with a semantic segmentation task for ship detection using satellite images from the Airbus Ship dataset. Our results show that 6-bit floating-point quantization for both weights and activations can compete with single-precision without significant accuracy degradation. Using a Thin U-Net 32 model, only a 0.3% accuracy degradation is observed with 6-bit minifloat quantization (a 6-bit equivalent integer-based approach leads to a 0.5% degradation). An initial hardware study also confirms the potential impact of such low-precision floating-point designs, but further investigation at the scale of a full inference accelerator is needed before concluding whether they are relevant in a practical on-board scenario.
Quantized neural networks (QNNs) are among the main approaches for deploying deep neural networks on low-resource edge devices. Training QNNs using different levels of precision throughout the network (dynamic quantization) typically achieves superior trade-offs between performance and computational load. However, optimizing the different precision levels of QNNs can be complicated, as the bit allocations are discrete values that are difficult to differentiate through. Also, adequately accounting for the dependencies between the bit allocations of different layers is not straightforward. To meet these challenges, in this work we propose GradFreeBits: a novel joint optimization scheme for training dynamic QNNs, which alternates between gradient-based optimization for the weights and gradient-free optimization for the bit allocation. Our method achieves performance better than or on par with current state-of-the-art low-precision neural networks on CIFAR10/100 and ImageNet classification. Furthermore, our approach can be extended to a variety of other applications involving neural networks used in conjunction with parameters that are difficult to optimize for.
Power consumption is a major obstacle in the deployment of deep neural networks (DNNs) on end devices. Existing approaches for reducing power consumption rely on quite general principles, including avoidance of multiplication operations and aggressive quantization of weights and activations. However, these methods do not take into account the precise power consumed by each module in the network, and are therefore not optimal. In this paper we develop accurate power consumption models for all arithmetic operations in the DNN, under various working conditions. We reveal several important factors that have been overlooked to date. Based on our analysis, we present PANN (power-aware neural network), a simple approach for approximating any full-precision network by a low-power fixed-precision variant. Our method can be applied to a pre-trained network, and can also be used during training to achieve improved performance. In contrast to previous methods, PANN incurs only a minor degradation in accuracy w.r.t. the full-precision version of the network, even when working at the power-budget of a 2-bit quantized variant. In addition, our scheme enables to seamlessly traverse the power-accuracy trade-off at deployment time, which is a major advantage over existing quantization methods that are constrained to specific bit widths.
Recent advances have shown that Spiking Neural Network (SNN)-based systems can efficiently perform unsupervised continual learning due to their bio-plausible learning rule, e.g., Spike-Timing-Dependent Plasticity (STDP). Such learning capabilities are especially beneficial for use cases like autonomous agents (e.g., robots and UAVs) that need to continuously adapt to dynamically changing scenarios/environments, where new data gathered directly from the environment may have novel features that should be learned online. Current state-of-the-art works employ high-precision weights (i.e., 32 bit) for both training and inference phases, which pose high memory and energy costs, thereby hindering efficient embedded implementations of such systems for battery-driven mobile autonomous systems. On the other hand, precision reduction may jeopardize the quality of unsupervised continual learning due to information loss. Towards this, we propose lpSpikeCon, a novel methodology to enable low-precision SNN processing for efficient unsupervised continual learning on resource-constrained autonomous agents/systems. Our lpSpikeCon methodology employs the following key steps: (1) analyzing the impacts of training the SNN model under unsupervised continual learning settings with reduced weight precision on the inference accuracy; (2) leveraging this study to identify SNN parameters that have a significant impact on the inference accuracy; and (3) developing an algorithm for searching the respective SNN parameter values that improve the quality of unsupervised continual learning. The experimental results show that our lpSpikeCon can reduce weight memory of the SNN model by 8x (i.e., by judiciously employing 4-bit weights) for performing online training with unsupervised continual learning and achieve no accuracy loss in the inference phase, as compared to the baseline model with 32-bit weights across different network sizes.
In-memory computing architectures based on Resistive random access memory technologies (RRAM) are a promising candidate for the development of ultra-low power hardware accelerators that could enable the deployment of deep neural networks inference algorithms on energy constrained devices at the edge of the communication network. However, the study of the reliability of such circuits is non-trivial due to the intrinsic RRAM devices nonlinearity and stochasticity. For instance, RRAM devices are subject not only to device-to-device and cycle-to-cycle resistance variations but also to Random Telegraph Noise which introduces additional time dependent resistance fluctuations that could result in reduced circuit performance. Previous studies exploited simplified statistical models to show that such device nonidealities may reduce the classification accuracy even when binarized neural networks are employed. However, a circuit reliability analysis based on full circuit-level simulations is still missing. In this work, we develop and train a low-bit precision neural network which employs binary weights and 4-bits activations. We further analyze the impact of RRAM nonidealities (e.g., variability and Random Telegraph Noise) on the classification accuracy by means of full circuit-level simulations enabled by a physics-based RRAM compact model, calibrated on experimental data from the literature. Results show that combining binary weights with low-precision activations allows retaining software-level accuracy even in the presence of Random Telegraph Noise and weight variability.
Resource requirements for hardware acceleration of neural network inference are notoriously high, both in terms of computation and storage. One way to mitigate this issue is to quantize parameters and activations. This is usually done by scaling and centering the distributions of weights and activations, on a kernel-per-kernel basis, so that a low-precision binary integer representation can be used. This work studies a low-precision logarithmic number system (LNS) as an efficient alternative. Firstly, LNS has more dynamic range than fixed-point for the same number of bits. Thus, when quantizing MNIST and CIFAR reference networks without retraining, the smallest format size achieving top-1 accuracy comparable to floating point is 1 to 3 bits smaller with LNS than with fixed point. In addition, it is shown that the zero bit of classical LNS is not needed in this context, and that the sign bit can be saved for activations. The proposed LNS neuron is detailed and its implementation on FPGA is shown to be smaller and faster than a fixed-point one for comparable accuracy. Secondly, low-precision LNS enables efficient inference architectures where 1) multiplications reduce to additions; 2) the weighted inputs are converted back to the classical linear domain, but the tables needed for this conversion remain very small thanks to the low precision; and 3) the conversion of the output activation back to LNS can be merged with an arbitrary activation function.
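A hedged sketch of the LNS idea above: values are stored as quantized log2 codes, so each multiplication is an addition of codes, with a small conversion back to the linear domain before accumulation. The code width and conversion here are illustrative, not the paper's neuron design.

```python
import numpy as np

FRAC_BITS = 3                                        # fractional bits of the log2 code (illustrative)

def to_lns(x):
    """Encode values as (sign, quantized log2 magnitude)."""
    sign = np.sign(x).astype(np.int8)
    code = np.round(np.log2(np.abs(x) + 1e-30) * (1 << FRAC_BITS)).astype(np.int32)
    return sign, code

def lns_dot(a, b):
    """Dot product where each multiplication becomes an addition of LNS codes."""
    sa, ca = to_lns(a)
    sb, cb = to_lns(b)
    prod_code = ca + cb                              # log-domain "multiplication"
    prod_sign = sa * sb
    # Conversion back to the linear domain (would be a tiny lookup table in hardware).
    linear = prod_sign * np.exp2(prod_code / (1 << FRAC_BITS))
    return float(linear.sum())

rng = np.random.default_rng(0)
a, b = rng.normal(size=64), rng.normal(size=64)
print(lns_dot(a, b), "vs", float(a @ b))             # close, up to low-precision log quantization error
```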
Owing to the presence of large values, which we call outliers, conventional methods of quantization fail to achieve significantly low precision, e.g., four bits, for very deep neural networks such as ResNet-101. In this study, we propose a hardware accelerator called the outlier-aware accelerator (OLAccel). It performs dense and low-precision computations for the majority of data (weights and activations) while efficiently handling a small number of sparse and high-precision outliers (e.g., amounting to 3% of total data). OLAccel is based on 4-bit multiply-accumulate (MAC) units and handles outlier weights and activations in a different manner. For outlier weights, it equips SIMD lanes of MAC units with an additional MAC unit, which helps avoid cycle overhead for the majority of outlier occurrences, i.e., a single occurrence in the SIMD lanes. OLAccel performs computations involving outlier activations on dedicated, high-precision MAC units. In order to avoid coherence problems due to updates from low- and high-precision computation units, both units update partial sums in a pipelined manner. Our experiments show that OLAccel can reduce energy consumption by 43.5% (27.0%), 56.7% (36.3%), and 62.2% (49.5%) for AlexNet, VGG-16, and ResNet-18, respectively, compared with a 16-bit (8-bit) state-of-the-art zero-aware accelerator. The energy gain mostly comes from the memory components, the DRAM, and on-chip memory due to reduced precision.
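A minimal sketch of the data decomposition the accelerator above relies on: split a tensor into a dense low-precision part and a small sparse high-precision outlier part. The 3% threshold and 4-bit uniform format are illustrative assumptions.

```python
import numpy as np

def split_outliers(x, outlier_frac=0.03, low_bits=4):
    """Quantize the bulk of the tensor to low precision and keep the largest values exact (sparse)."""
    thresh = np.quantile(np.abs(x), 1.0 - outlier_frac)
    outlier_mask = np.abs(x) > thresh
    qmax = 2 ** (low_bits - 1) - 1
    inlier = np.where(outlier_mask, 0.0, x)
    scale = np.max(np.abs(inlier)) / qmax + 1e-12              # scale set by the inliers only
    dense_low = np.round(inlier / scale).clip(-qmax, qmax) * scale
    sparse_high = np.where(outlier_mask, x, 0.0)               # ~3% of values, kept at high precision
    return dense_low, sparse_high

rng = np.random.default_rng(0)
x = rng.normal(size=10000)
x[rng.choice(10000, 50, replace=False)] *= 20                  # inject a few large outliers
dense, sparse = split_outliers(x)
print("reconstruction MSE:", np.mean((x - (dense + sparse)) ** 2))
```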
Biometrics such as facial features, fingerprint, and iris are being used increasingly in modern authentication systems. These methods are now popular and have found their way into many portable electronics such as smartphones, tablets, and laptops. Furthermore, the use of biometrics enables secure access to private medical data, now collected in wearable devices such as smartwatches. In this work, we present an accurate low-power device authentication system that employs electrocardiogram (ECG) signals as the biometric modality. The proposed ECG processor consists of front-end signal processing of ECG signals and back-end neural networks (NNs) for accurate authentication. The NNs are trained using a cost function that minimizes intra-individual distance over time and maximizes inter-individual distance. Efficient low-power hardware was implemented by using fixed coefficients for ECG signal pre-processing and by using joint optimization of low-precision and structured sparsity for the NNs. We implemented two instances of ECG authentication hardware with 4X and 8X structurally-compressed NNs in 65 nm LP CMOS, which consume low power of 62.37 $\mu$W and 75.41 $\mu$W for real-time ECG authentication with a low equal error rate of 1.36% and 1.21%, respectively, for a large 741-subject in-house ECG database. The hardware was evaluated at 10 kHz clock frequency and 1.2 V voltage supply.
Large-scale deep neural networks (DNNs) have been successfully used in a number of tasks from image recognition to natural language processing. They are trained using large training sets on large models, making them computationally and memory intensive. As such, there is much research interest in faster training and test time. In this paper, we present a unique approach using lower-precision weights for a more efficient and faster training phase. We separate imagery into different frequency bands (e.g., with different information content) such that the neural net can better learn using fewer bits. We present this approach as a complement to existing methods such as pruning network connections and encoding learning weights. We show results where this approach supports more stable learning with a 2-4X reduction in precision and a 17X reduction in DNN parameters.
Spiking neural networks (SNNs) use biologically inspired neuron models coupled with Spike-Timing-Dependent Plasticity (STDP) to enable unsupervised continuous learning in artificial intelligence (AI) platforms. However, current SNN algorithms show low accuracy on complex problems and are hard to operate at reduced precision. This paper demonstrates a GPU-accelerated SNN architecture that uses stochasticity in the STDP coupled with higher-frequency input spike trains. The simulation results demonstrate 2 to 3 times faster learning compared to deterministic SNN architectures while maintaining high accuracy for the MNIST (simple) and Fashion-MNIST (complex) datasets. Further, we show that stochastic STDP enables learning even with 2 bits of operation, while deterministic STDP fails.
TAB: Unified and Optimized Ternary, Binary, and Mixed-precision Neural Network Inference on the Edge
Ternary Neural Networks (TNNs) and mixed-precision Ternary Binary Networks (TBNs) have demonstrated higher accuracy compared to Binary Neural Networks (BNNs) while providing fast, low-power, and memory-efficient inference. Related works have improved the accuracy of TNNs and TBNs but overlooked their optimization on CPU and GPU platforms. First, there is no unified encoding for the binary and ternary values in TNNs and TBNs. Second, existing works store the 2-bit quantized data sequentially in 32/64-bit integers, resulting in bit-extraction overhead. Last, adopting standard 2-bit multiplications for ternary values leads to a complex computation pipeline, and efficient mixed-precision multiplication between ternary and binary values is unavailable. In this article, we propose TAB as a unified and optimized inference method for ternary, binary, and mixed-precision neural networks. TAB includes a unified value representation, an efficient data storage scheme, and novel bitwise dot product pipelines on CPU/GPU platforms. We adopt signed integers for consistent value representation across binary and ternary values. We introduce a bitwidth-last data format that stores the first and second bits of the ternary values separately to remove the bit-extraction overhead. We design the ternary and binary bitwise dot product pipelines based on Gated-XOR using up to 40% fewer operations than State-Of-The-Art (SOTA) methods. Theoretical speedup analysis shows that our proposed TAB-TNN is 2.3× as fast as the SOTA ternary method RTN, 9.8× as fast as 8-bit integer quantization (INT8), and 39.4× as fast as 32-bit full-precision convolution (FP32). Experiment results on CPU and GPU platforms show that our TAB-TNN achieves up to 34.6× speedup and 16× storage size reduction compared with FP32 layers. TBN, Binary-activation Ternary-weight Network (BTN), and BNN in TAB are up to 40.7×, 56.2×, and 72.2× as fast as FP32. TAB-TNN is up to 70.1% faster and 12.8% more power-efficient than RTN on Darknet-19 while keeping the same accuracy. TAB is open source as a PyTorch extension for easy integration with existing CNN models.
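The exact Gated-XOR pipeline and bitwidth-last layout are defined in the paper; the sketch below only illustrates the general bit-plane idea: a ternary vector is packed into a nonzero-mask plane and a sign plane, so a ternary dot product reduces to AND, XOR, and popcount.

```python
def encode_ternary(values):
    """Pack a ternary vector {-1, 0, +1} into two bit-planes: nonzero mask and sign (1 = negative)."""
    mask = sign = 0
    for i, v in enumerate(values):
        if v != 0:
            mask |= 1 << i
            if v < 0:
                sign |= 1 << i
    return mask, sign

def ternary_dot(a, b):
    """Dot product of two encoded ternary vectors using only bitwise ops and popcount (Python 3.10+)."""
    mask_a, sign_a = a
    mask_b, sign_b = b
    both = mask_a & mask_b                     # positions where both operands are nonzero
    disagree = (sign_a ^ sign_b) & both        # nonzero pairs with opposite signs contribute -1
    return both.bit_count() - 2 * disagree.bit_count()

va = [1, -1, 0, 1, 1, 0, -1, 1]
vb = [1, 1, -1, 0, -1, 0, -1, 1]
assert ternary_dot(encode_ternary(va), encode_ternary(vb)) == sum(x * y for x, y in zip(va, vb))
```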
The use of low-precision fixed-point arithmetic along with stochastic rounding has been proposed as a promising alternative to the commonly used 32-bit floating-point arithmetic to enhance neural network training in terms of performance and energy efficiency. In the first part of this paper, the behaviour of 12-bit fixed-point arithmetic when training a convolutional neural network with the CIFAR-10 dataset is analysed, showing that such arithmetic is not the most appropriate for the training phase. After that, the paper presents and evaluates, under the same conditions, alternative low-precision arithmetics, starting with 12-bit floating-point arithmetic. These two representations are then leveraged using local scaling in order to increase accuracy and get closer to the baseline 32-bit floating-point arithmetic. Finally, the paper introduces a simplified model in which both the outputs and the gradients of the neural networks are constrained to power-of-two values, using just 7 bits for their representation. The evaluation demonstrates a minimal loss in accuracy for the proposed power-of-two neural network, avoiding the use of multiplications and divisions and thereby significantly reducing the training time as well as the energy consumption and memory requirements during the training and inference phases.
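A small sketch of the two numeric ingredients mentioned above: fixed-point quantization with stochastic rounding (unbiased in expectation) and a power-of-two constraint. Bit splits are illustrative.

```python
import numpy as np

def stochastic_round(x, rng):
    """Round each value up with probability equal to its fractional part (unbiased in expectation)."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

def to_fixed_point(x, frac_bits=8, rng=None):
    """Quantize to a fixed-point grid of step 2**-frac_bits using stochastic rounding."""
    rng = rng or np.random.default_rng()
    step = 2.0 ** -frac_bits
    return stochastic_round(x / step, rng) * step

def to_power_of_two(x):
    """Constrain values to signed powers of two (in the spirit of the simplified 7-bit model)."""
    mag = np.where(x == 0, 1e-30, np.abs(x))
    return np.sign(x) * 2.0 ** np.round(np.log2(mag))

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
print(to_fixed_point(x, frac_bits=2, rng=rng).mean())   # ~0.3 on average despite a 0.25 grid
```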
No abstract available
This article presents a low-power, programmable, domain-specific manycore accelerator, Binarized neural Network Manycore Accelerator (BiNMAC), which adopts and efficiently executes binary precision weight/activation neural network models. Such networks have compact models in which weights are constrained to only 1 bit and can be packed several in one memory entry that minimizes memory footprint to its finest. Packing weights also facilitates executing single instruction, multiple data with simple circuitry that allows maximizing performance and efficiency. The proposed BiNMAC has light-weight cores that support domain-specific instructions, and a router-based memory access architecture that helps with efficient implementation of layers in binary precision weight/activation neural networks of proper size. With only 3.73% and 1.98% area and average power overhead, respectively, novel instructions such as Combined Population-Count-XNOR, Patch-Select, and Bit-based Accumulation are added to the instruction set architecture of the BiNMAC, each of which replaces execution cycles of frequently used functions with 1 clock cycle that otherwise would have taken 54, 4, and 3 clock cycles, respectively. Additionally, customized logic is added to every core to transpose 16×16-bit blocks of memory on a bit-level basis, that expedites reshaping intermediate data to be well-aligned for bitwise operations. A 64-cluster architecture of the BiNMAC is fully placed and routed in 65-nm TSMC CMOS technology, where a single cluster occupies an area of 0.53 mm2 with an average power of 232 mW at 1-GHz clock frequency and 1.1 V. The 64-cluster architecture takes 36.5 mm2 area and, if fully exploited, consumes a total power of 16.4 W and can perform 1,360 Giga Operations Per Second (GOPS) while providing full programmability. To demonstrate its scalability, four binarized case studies including ResNet-20 and LeNet-5 for high-performance image classification, as well as a ConvNet and a multilayer perceptron for low-power physiological applications were implemented on BiNMAC. The implementation results indicate that the population-count instruction alone can expedite the performance by approximately 5×. When other new instructions are added to a RISC machine with existing population-count instruction, the performance is increased by 58% on average. To compare the performance of the BiNMAC with other commercial-off-the-shelf platforms, the case studies with their double-precision floating-point models are also implemented on the NVIDIA Jetson TX2 SoC (CPU+GPU). The results indicate that, within a margin of ∼2.1%--9.5% accuracy loss, BiNMAC on average outperforms the TX2 GPU by approximately 1.9× (or 7.5× with fabrication technology scaled) in energy consumption for image classification applications. On low power settings and within a margin of ∼3.7%--5.5% accuracy loss compared to ARM Cortex-A57 CPU implementation, BiNMAC is roughly ∼9.7×--17.2× (or 38.8×--68.8× with fabrication technology scaled) more energy efficient for physiological applications while meeting the application deadline.
Precision-scalable deep neural network (DNN) accelerator designs have attracted much research interest. Since the computation of most DNNs is dominated by multiply-accumulate (MAC) operations, designing efficient precision-scalable MAC (PSMAC) units is of central importance. This brief proposes two low-complexity PSMAC unit architectures based on the well-known one, Fusion Unit (FU), which is composed of a few basic units called Bit Bricks (BBs). We first simplify the architecture of BB through optimizing some redundant logic. Then a top-level architecture for PSMAC unit is devised by recursively employing BBs. Accordingly, two low-complexity PSMAC unit architectures are presented for two different kinds of quantization schemes. Moreover, we provide an insight into the decomposed multiplications and further reduce the bitwidths of the two architectures. Experimental results show that our proposed architectures can save up to 44.18% area cost and 45.45% power consumption when compared with the state-of-the-art design.
As neural network models are developed and optimized, the use of neural networks in edge devices is increasing, where low-bit neural networks, such as binary neural networks and mixed-precision neural networks, are ideal for edge AI applications. Peripheral circuits and in-memory computing macros are the main components for deploying low-bit precision neural networks on edge AI. However, existing peripheral circuits, including communication units, control modules and analog-to-digital converters (ADCs), are implemented by software or mixed-signal circuits, resulting in significant power and area overheads. To address this issue, memristor-based reconfigurable circuits are proposed for a fully analog implementation of low-bit neural networks without ADCs. In addition, a memristor-based mixed-precision network with a variety of mixed-precision modes is illustrated to verify the effectiveness of deploying low-bit neural networks on edge devices based on the proposed circuits. Furthermore, hybrid simulation results demonstrate that the proposed memristor-based mixed-precision network achieves 84.8%-87.5% accuracy on the CIFAR-10 dataset, and the parameter scale of the network model is reduced by 1.6x-20x. The circuit analysis demonstrated that the proposed circuits are accurate, robust, and energy-efficient with varying mixed precision, providing a promising and universal solution for applying low-bit neural networks on edge devices.
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications. Low-bit neural network quantization provides a powerful solution to dramatically reduce their model size. Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors. To this end, novel mixed precision neural network LM quantization methods are proposed in this paper. The optimal local precision choices for LSTM-RNN and Transformer based neural LMs are automatically learned using three techniques. The first two approaches are based on quantization sensitivity metrics in the form of either the KL-divergence measured between full precision and quantized LMs, or Hessian trace weighted quantization perturbation that can be approximated efficiently using matrix free techniques. The third approach is based on mixed precision neural architecture search. In order to overcome the difficulty in using gradient descent methods to directly estimate discrete quantized weights, alternating direction methods of multipliers (ADMM) are used to efficiently train quantized LMs. Experiments were conducted on state-of-the-art LF-MMI CNN-TDNN systems featuring speed perturbation, i-Vector and learning hidden unit contribution (LHUC) based speaker adaptation on two tasks: Switchboard telephone speech and AMI meeting transcription. The proposed mixed precision quantization techniques achieved “lossless” quantization on both tasks, by producing model size compression ratios of up to approximately 16 times over the full precision LSTM and Transformer baseline LMs, while incurring no statistically significant word error rate increase.
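A hedged sketch of just one of the sensitivity signals mentioned above: the KL divergence between full-precision and quantized outputs as a per-component sensitivity proxy, with a crude greedy bit assignment. The Hessian-trace and architecture-search variants, and the actual precision search, are not reproduced here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_sensitivity(logits_fp, logits_q):
    """Average KL(full-precision || quantized) over samples: a quantization sensitivity proxy."""
    p, q = softmax(logits_fp), softmax(logits_q)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

def assign_precisions(sensitivities, low_bits=4, high_bits=8):
    """Greedy stand-in for the precision search: the more sensitive half gets the higher bitwidth."""
    order = np.argsort(sensitivities)
    bits = np.empty(len(sensitivities), dtype=int)
    half = len(order) // 2
    bits[order[:half]] = low_bits
    bits[order[half:]] = high_bits
    return bits

rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 10))
noisy = logits + rng.normal(scale=0.3, size=logits.shape)   # stand-in for a quantized component's logits
print(kl_sensitivity(logits, noisy))
```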
Deep neural networks have been proven to be highly effective tools in various domains, yet their computational and memory costs restrict them from being widely deployed on portable devices. The recent rapid increase of edge computing devices has led to an active search for techniques to address the abovementioned limitations of machine learning frameworks. The quantization of artificial neural networks (ANNs), which converts the full-precision synaptic weights into low-bit versions, emerged as one of the solutions. At the same time, spiking neural networks (SNNs) have become an attractive alternative to conventional ANNs due to their temporal information processing capability, energy efficiency, and high biological plausibility. Despite being driven by the same motivation, the simultaneous utilization of both concepts has yet to be thoroughly studied. Therefore, this work aims to bridge the gap between recent progress in quantized neural networks and SNNs. It presents an extensive study on the performance of the quantization function, represented as a linear combination of sigmoid functions, exploited in low-bit weight quantization in SNNs. The presented quantization function demonstrates the state-of-the-art performance on four popular benchmarks, CIFAR10-DVS, DVS128 Gesture, N-Caltech101, and N-MNIST, for binary networks (64.05%, 95.45%, 68.71%, and 99.43% respectively) with small accuracy drops and up to 31 × memory savings, which outperforms existing methods.
This paper presents incremental network quantization (INQ), a novel method targeting the efficient conversion of any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version whose weights are constrained to be either powers of two or zero. Unlike existing methods, which struggle with noticeable accuracy loss, our INQ has the potential to resolve this issue, benefiting from two innovations. On one hand, we introduce three interdependent operations, namely weight partition, group-wise quantization and re-training. A well-proven measure is employed to divide the weights in each layer of a pre-trained CNN model into two disjoint groups. The weights in the first group are responsible for forming a low-precision base, thus they are quantized by a variable-length encoding method. The weights in the other group are responsible for compensating for the accuracy loss from the quantization, thus they are the ones to be re-trained. On the other hand, these three operations are repeated on the latest re-trained group in an iterative manner until all the weights are converted into low-precision ones, acting as an incremental network quantization and accuracy enhancement procedure. Extensive experiments on the ImageNet classification task using almost all known deep CNN architectures including AlexNet, VGG-16, GoogleNet and ResNets testify to the efficacy of the proposed method. Specifically, at 5-bit quantization, our models have improved accuracy over the 32-bit floating-point references. Taking ResNet-18 as an example, we further show that our quantized models with 4-bit, 3-bit and 2-bit ternary weights have improved or very similar accuracy compared to the 32-bit floating-point baseline. Besides, impressive results with the combination of network pruning and INQ are also reported. The code is available at this https URL.
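A hedged sketch of the two ingredients described above: snapping a group of weights to powers of two (or zero) and freezing them, while the remaining group stays full precision for retraining. The partition rule here (largest magnitudes first) and the exponent budget are common choices but assumptions, not the paper's exact procedure.

```python
import numpy as np

def quantize_pow2(w, n_bits=5):
    """Map weights to zero or signed powers of two within a range set by the largest magnitude."""
    max_exp = int(np.floor(np.log2(np.max(np.abs(w)) + 1e-12)))
    min_exp = max_exp - (2 ** (n_bits - 2) - 1)              # remaining codes cover lower powers
    mag = np.abs(w)
    exp = np.clip(np.round(np.log2(mag + 1e-30)), min_exp, max_exp)
    q = np.sign(w) * 2.0 ** exp
    q[mag < 2.0 ** (min_exp - 1)] = 0.0                      # too-small weights become zero
    return q

def inq_step(w, frozen_mask, quantize_frac=0.5):
    """Quantize and freeze the largest not-yet-frozen weights; the rest remain trainable."""
    candidates = np.flatnonzero(~frozen_mask)
    k = int(len(candidates) * quantize_frac)
    chosen = candidates[np.argsort(np.abs(w.flat[candidates]))[-k:]]
    w, frozen = w.copy(), frozen_mask.copy()
    w.flat[chosen] = quantize_pow2(w.flat[chosen])
    frozen.flat[chosen] = True
    return w, frozen                                          # retraining would now update only w[~frozen]

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
frozen = np.zeros_like(w, dtype=bool)
for _ in range(3):                                            # e.g. freeze 50%, then 75%, then 87.5%
    w, frozen = inq_step(w, frozen)                           # (a real run retrains between steps)
print(frozen.mean())
```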
To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth numbers is actively researched. A prominent challenge is to quantize the DNN models into low-bitwidth numbers without significant accuracy degradation, especially at very low bitwidths (<8 bits). This work targets an adaptive data representation with variable-length encoding called DyBit. DyBit can dynamically adjust the precision and range of separate bit-fields to be adapted to the DNN weights/activations distribution. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off the inference accuracy and speedup. Experimental results demonstrate that the ImageNet inference accuracy via DyBit is 1.97% higher than the state-of-the-art at 4-bit quantization, and the proposed framework can achieve up to 8.1× speedup compared with the original ResNet-50 model.
No abstract available
Model quantization is an important mechanism for energy-efficient deployment of deep neural networks on resource-constrained devices by reducing the bit precision of weights and activations. However, it remains challenging to maintain high accuracy as bit precision decreases, especially for low-precision networks (e.g., 2-bit MobileNetV2). Existing methods have been explored to address this problem by minimizing the quantization error or mimicking the data distribution of full-precision networks. In this work, we propose a novel weight regularization algorithm for improving low-precision network quantization. Instead of constraining the overall data distribution, we separably optimize all elements in each quantization bin to be as close to the target quantized value as possible. Such bin regularization (BR) mechanism encourages the weight distribution of each quantization bin to be sharp and approximate to a Dirac delta distribution ideally. Experiments demonstrate that our method achieves consistent improvements over the state-of-the-art quantization-aware training methods for different low-precision networks. Particularly, our bin regularization improves LSQ for 2-bit MobileNetV2 and MobileNetV3-Small by 3.9% and 4.9% top-1 accuracy on ImageNet, respectively.
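A rough NumPy sketch of the bin-regularization idea described above: each weight is pulled toward the quantized target of the bin it currently falls in. The per-tensor symmetric grid and the plain L2 penalty are assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def bin_regularization_loss(w, n_bits=2):
    """Pull every weight toward the centre of its own quantization bin (BR-style sketch)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = (np.abs(w).max() + 1e-12) / max(qmax, 1)        # assumed per-tensor symmetric grid
    target = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    # Minimizing this alongside the task loss sharpens each bin's weight distribution
    # toward a Dirac-delta-like shape, as the abstract describes.
    return float(np.mean((w - target) ** 2))
```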
Low-precision neural network models are crucial for reducing the memory footprint and computational density. However, existing methods typically rely on 32-bit floating-point (FP32) arithmetic to maintain accuracy. Floating-point numbers impose substantial memory requirements in convolutional and deep neural network models, and large bit-widths cause excessive computational density in hardware architectures. Moreover, models keep evolving into deeper networks with millions or billions of parameters to solve today's problems. The large number of model parameters increases the computational complexity and causes memory allocation problems, hence existing hardware accelerators become insufficient to address these problems. In applications where accuracy can be traded off for the sake of hardware complexity, quantization of models enables the use of limited hardware resources to implement neural networks. From a hardware design point of view, quantized models are more advantageous in terms of speed, memory and power consumption than FP32. In this study, we compared the training and testing accuracy of the quantized LeNet and our own ConvNet neural network models at different epochs. We quantized the models using low-precision int-4, int-8 and int-16. As a result of the tests, we observed that the LeNet model could only reach 63.59% test accuracy at 400 epochs with int-16. On the other hand, the ConvNet model achieved a test accuracy of 76.78% at only 40 epochs with low-precision int-8 quantization.
Quantization is a technique to reduce the computation and memory cost of DNN models, which are getting increasingly large. Existing quantization solutions use fixed-point integer or floating-point types, which have limited benefits, as both require more bits to maintain the accuracy of original models. On the other hand, variable-length quantization uses low-bit quantization for normal values and high-precision for a fraction of outlier values. Even though this line of work brings algorithmic benefits, it also introduces significant hardware overheads due to variable-length encoding and decoding. In this work, we propose a fixed-length adaptive numerical data type called ANT to achieve low-bit quantization with tiny hardware overheads. Our data type ANT leverages two key innovations to exploit the intra-tensor and inter-tensor adaptive opportunities in DNN models. First, we propose a particular data type, flint, that combines the advantages of float and int for adapting to the importance of different values within a tensor. Second, we propose an adaptive framework that selects the best type for each tensor according to its distribution characteristics. We design a unified processing element architecture for ANT and show its ease of integration with existing DNN accelerators. Our design results in 2.8× speedup and 2.5× energy efficiency improvement over the state-of-the-art quantization accelerators.
Low-precision deep neural network (DNN) training has gained tremendous attention as reducing precision is one of the most effective knobs for boosting DNNs' training time/energy efficiency. In this paper, we attempt to explore low-precision training from a new perspective as inspired by recent findings in understanding DNN training: we conjecture that DNNs' precision might have a similar effect as the learning rate during DNN training, and advocate dynamic precision along the training trajectory for further boosting the time/energy efficiency of DNN training. Specifically, we propose Cyclic Precision Training (CPT) to cyclically vary the precision between two boundary values which can be identified using a simple precision range test within the first few training epochs. Extensive simulations and ablation studies on five datasets and eleven models demonstrate that CPT's effectiveness is consistent across various models/tasks (including classification and language modeling). Furthermore, through experiments and visualization we show that CPT helps to (1) converge to a wider minima with a lower generalization error and (2) reduce training variance which we believe opens up a new design knob for simultaneously improving the optimization and efficiency of DNN training. Our codes are available at: this https URL
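The cyclic precision schedule is simple to reproduce; below is a sketch assuming a cosine shape between the two boundary bit-widths (the paper determines the boundaries with a precision range test, which is omitted here).

```python
import math

def cyclic_precision(step, period=1000, p_min=4, p_max=8):
    """Cosine-style cyclic precision schedule between two boundary bit-widths (CPT-like sketch)."""
    phase = (step % period) / period
    p = p_min + 0.5 * (p_max - p_min) * (1 - math.cos(2 * math.pi * phase))
    return int(round(p))

# Example: precision sweeps p_min -> p_max -> p_min once per period of training steps.
schedule = [cyclic_precision(s, period=8) for s in range(16)]
```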
RRAM-based accelerators have become very popular candidates for neural network acceleration because they perform matrix-vector multiplication in-memory with high storage density and low latency. Many related works have used fixed-precision quantization to achieve model compression and enhance tolerance to process variation, but these methods still suffer a large accuracy degradation and poor robustness to nonideal effects. In this work, we propose a crossbar-aware mixed-precision quantization scheme, which searches for the optimal precision of each part of the network to improve accuracy and robustness to noise. First, we introduce a group quantization strategy that can flexibly adjust the group size dynamically according to the crossbar size. Then, we propose a detailed mixed-precision search flow to search for the optimal precision set of the network. Finally, we give a noise injection adaption training method to enhance the tolerance to noise. Experimental results show that our proposed method can improve inference accuracy by at least 2.04% compared to fixed-precision quantization under the same resource cost. The searched most-accurate (MA) architecture achieves an accuracy of 92.39% and a resource saving of 93.30% compared to the full precision model. The searched most-efficient (ME) architecture, with the biggest resource savings, achieves an accuracy of 91.11% and a resource saving of 95.57% compared to the full precision model. Moreover, the average precision of the ME mixed-precision architecture is only 1.4 bits. Besides, the results show the mixed-precision network with noise adaption training is more robust to noise than the fixed-precision network with noise adaption training.
Energy-constrained neural network processing is in high demand for various mobile applications. Binary neural networks aggressively enhance computational efficiency but, in contrast, suffer from accuracy degradation due to their extreme approximation. We propose a novel, accurate neural network model based on binarization and "dithering" that distributes the quantization error to neighboring pixels. The quantization errors introduced by binarization are diffused across the plane, so that a pixel in the multi-level source is more accurately represented in the resulting binarized plane by multiple pixels. We designed a low-overhead binary-based hardware architecture for the proposed model. The evaluation results show that this method can be realized with a few additional lightweight hardware components.
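The "dithering" idea is essentially error diffusion applied during binarization. The sketch below uses Floyd-Steinberg coefficients as a stand-in; the paper's exact diffusion kernel and its hardware-friendly variant are not reproduced here.

```python
import numpy as np

def dithered_binarize(x):
    """Binarize a 2-D activation map while diffusing the quantization error to neighbours
    (Floyd-Steinberg-style sketch of the dithering idea; coefficients are assumptions)."""
    x = x.astype(np.float64).copy()
    out = np.zeros_like(x)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = 1.0 if x[i, j] >= 0.5 else 0.0
            err = x[i, j] - out[i, j]                      # error pushed to unprocessed neighbours
            if j + 1 < w:               x[i, j + 1]     += err * 7 / 16
            if i + 1 < h and j > 0:     x[i + 1, j - 1] += err * 3 / 16
            if i + 1 < h:               x[i + 1, j]     += err * 5 / 16
            if i + 1 < h and j + 1 < w: x[i + 1, j + 1] += err * 1 / 16
    return out
```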
No abstract available
Training with larger number of parameters while keeping fast iterations is an increasingly adopted strategy and trend for developing better performing Deep Neural Network (DNN) models. This necessitates increased memory footprint and computational requirements for training. Here we introduce a novel methodology for training deep neural networks using 8-bit floating point (FP8) numbers. Reduced bit precision allows for a larger effective memory and increased computational speed. We name this method Shifted and Squeezed FP8 (S2FP8). We show that, unlike previous 8-bit precision training methods, the proposed method works out-of-the-box for representative models: ResNet-50, Transformer and NCF. The method can maintain model accuracy without requiring fine-tuning loss scaling parameters or keeping certain layers in single precision. We introduce two learnable statistics of the DNN tensors - shifted and squeezed factors that are used to optimally adjust the range of the tensors in 8-bits, thus minimizing the loss in information due to quantization.
Machine learning and signal processing on the edge are poised to influence our everyday lives with devices that will learn and infer from data generated by smart sensors and other devices for the Internet of Things. The next leap toward ubiquitous electronics requires increased energy efficiency of processors for specialized data-driven applications. Here, we show how an in-memory processor fabricated using a two-dimensional materials platform can potentially outperform its silicon counterparts in both standard and nontraditional Von Neumann architectures for artificial neural networks. We have fabricated a flash memory array with a two-dimensional channel using wafer-scale MoS2. Simulations and experiments show that the device can be scaled down to sub-micrometer channel length without any significant impact on its memory performance and that in simulation a reasonable memory window still exists at sub-50 nm channel lengths. Each device conductance in our circuit can be tuned with a 4-bit precision by closed-loop programming. Using our physical circuit, we demonstrate seven-segment digit display classification with a 91.5% accuracy with training performed ex situ and transferred from a host. Further simulations project that at a system level, the large memory arrays can perform AlexNet classification with an upper limit of 50 000 TOpS/W, potentially outperforming neural network integrated circuits based on double-poly CMOS technology.
No abstract available
No abstract available
Vision Transformer (ViT) models demonstrate substantial promise in image processing tasks, but deploying these computationally intensive models on edge devices poses significant challenges due to constraints in power consumption and computational resources. This paper proposes a dedicated hardware accelerator for the lightweight EfficientViT model to address this issue, focusing on its ReLU-based global attention mechanism. Our approach is underpinned by two key hardware optimization strategies: block-wise attention, which decomposes large matrices for efficient processing, and a simplified quantization technique to convert costly floating-point operations into efficient integers with minimal precision loss. The proposed design is fabricated in TSMC 40nm process technology. Operating at a maximum frequency of 1.29 GHz, the chip consumes 340 mW and achieves a peak throughput of 368.9 GOPS, with an area efficiency of 265.4 GOPS/mm². These results showcase an exceptional balance between performance and resource utilization, delivering a high-performance ViT solution ideally suited for edge computing.
Brain-inspired Spiking Neural Networks (SNNs) have attracted significant attention for their potential to enable energy-efficient computing on neuromorphic hardware. However, the current SNN community focuses primarily on performance improvement by developing large-scale models on CPU or GPU platforms, which limits the applicability of SNNs in resource-limited edge devices. In this paper, we propose JQA, a software-hardware co-design framework for efficiently deploying high-performance SNNs on resource-constrained platforms. On the software side, we introduce a hardware-friendly quantization strategy, which enables SNNs to perform the entire inference process using integer arithmetic and bit-shifting. On the hardware side, we develop an efficient SNN accelerator that adopts a row stationary (RS) dataflow while incorporating tiling and parallelism schemes specifically designed for SNNs. JQA is designed at the register-transfer level (RTL) and implemented on the Xilinx Zynq XC7Z035 FPGA board with limited resources. Extensive experiments demonstrate that JQA outperforms existing works in terms of accuracy, throughput, and energy efficiency. These state-of-the-art results suggest that JQA offers promising potential for real-world SNN applications in resource-constrained scenarios.
Transformer-based language representations have demonstrated superior accuracy in various natural language processing (NLP) tasks. However, their deployment on terminal hardware is challenging due to the involvement of dense matrix operations and complex data flows. In this paper, we propose a heterogeneous multi-core low-power cache-based architecture (HLC) for TIFP11, a top-K half-precision integerized floating-point format. A quantization method is proposed to enable a hardware-friendly model compression technique, TIFP11. Furthermore, parallelized processing elements (PE) and storage mechanisms for TIFP11 are carefully designed. Additionally, a cache architecture based on heterogeneous multi-core systems that significantly enhances the efficiency of transformer operations is described. We implement the transformer model using appropriate hardware scheduling. TIFP11 was deployed, and its performance during storage and computation was verified. Experimental results demonstrate that the transformer deployed on VCK190 achieves a latency of 9.31 ms with a batch size of 32, resulting in a 37.9× speedup compared to the CPU platform and a 1.94× speedup compared to the GPU platform.
In the context of computer numerical control (CNC) machinery, fault diagnosis traditionally involves complex formula conversions to extract characteristics and categorize faults. However, such a method is unsuitable for hardware implementation due to high resource usage. This paper proposes a convolution neural network (CNN) approach for fault classification and hardware acceleration using ternary quantization and batch normalization techniques to reduce data access for weights and improve accuracy. The proposed CNN hardware accelerator is implemented on FPGA (VC707) and reduces memory usage by 83.8% compared to floating-point operations. Furthermore, the proposed method achieves 97.6% accuracy in CNC machinery fault classification.
This paper presents an efficient implementation of asymmetric quantization in a hardware accelerator for deep learning applications. In this work, we show that asymmetric quantization provides better accuracy in AI inferencing with the same storage and bandwidth requirements as a symmetric approach. We also provide a method to support the asymmetric approach in digital circuits. The results show that this software and hardware collaboration provides sufficient AI performance while achieving a significant reduction in silicon resources.
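For reference, the difference between symmetric and asymmetric uniform quantization comes down to a zero-point term that lets the integer grid cover the full [min, max] range. The generic NumPy sketch below illustrates that idea only; it is not the paper's circuit-level method.

```python
import numpy as np

def asymmetric_quantize(x, n_bits=8):
    """Asymmetric (zero-point) uniform quantization: the grid spans [min, max] instead of [-a, a]."""
    qmin, qmax = 0, 2 ** n_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale
```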
State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computation complexity with the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator get drastically improved. We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65–6.06× higher energy efficiency over the GPU baseline. When evaluated on Alveo U280 FPGA, LightMamba reaches 93 tokens/s, which is 1.43× that of the GPU baseline.
The continuously growing size of Neural Network (NN) models makes the design of lightweight neural network accelerators for edge devices an emerging subject in recent research. Previous works explored different lightweight technologies or even emerging neural network structures, such as quantization, approximate computing, neuromorphic computing, etc., to reduce hardware resource consumption in accelerator designs. This inspired our interest in exploring the potential of other emerging network structures in hardware accelerator designs. The Kolmogorov-Arnold Network (KAN) [1] is a recently proposed novel neural network structure that replaces the multiplications and activation functions in Artificial Neural Networks (ANNs) with learnable nonlinear functions, which has the potential to transform the paradigm of neural network design. However, considering the complexity of implementing nonlinear functions in hardware, lightweight hardware accelerator design for KAN has not yet been thoroughly studied.
This paper presents a disaster detection framework based on aerial imagery, utilizing a Branch Convolutional Neural Network (B-CNN) to enhance feature learning efficiency. The B-CNN architecture incorporates branch training, enabling effective training and inference with reduced model parameters. To further optimize resource usage, the framework integrates DoReFa-Net for weight quantization and fixed-point parameter representation. An early exit mechanism is introduced to support low-latency, energy-efficient predictions. The proposed B-CNN hardware accelerator is implemented using TSMC 16 nm CMOS technology, incorporating power gating techniques to manage memory power consumption. Post-layout simulations demonstrate that the proposed hardware accelerator operates at 500 MHz with a power consumption of 37.56 mW. The system achieves a disaster prediction accuracy of 88.18%, highlighting its effectiveness and suitability for low-power, real-time applications in aerial disaster monitoring.
In this study, we elaborate on our work in which we have optimized a Convolutional Neural Network (CNN) for hardware implementation. The CNN model is trained on GPU with the CIFAR-10 dataset, using an iterative quantization function to obtain filter weights as powers of 2, replacing resource-heavy multiplications with simple shift operations. The trained parameters are used to implement a design on hardware using the Winograd Convolution algorithm, which minimizes the number of computations thus reducing complexity. In doing so, we obtain substantial computational savings: about a 42% reduction in LUTs compared to previous work, a 15% reduction in additions compared to traditional convolution, and the elimination of multiplications. The images are also dimensionally processed so that convolution can be parallelized in the architecture designed for the hardware. The hardware accelerator is optimized to achieve about 10^5 times speedup compared to GPUs implemented in software. Our implementation achieved an accuracy of 76.67% as tested on a small fraction of the test data. The design leverages parallelism and resource reuse to achieve an optimal balance between accuracy and performance, enabling a highly efficient CNN hardware accelerator for image classification.
The Mamba-2 model introduces a State Space Duality (SSD) mechanism, based on original State Space Models (SSMs), that accelerates training and improves accuracy. However, efficient hardware acceleration for Mamba-2 faces challenges. Numerous element-wise operations fail to fully utilize GPU tensor cores, diminishing inference efficiency. Furthermore, research on full-quantization strategies for Mamba-2 is lacking. To address this, we propose a hybrid-precision full-quantization strategy, Hfqmamba2, balancing performance and hardware resource usage. Applying this strategy to Mamba-2 models of various sizes shows that accuracy loss remains within an acceptable range. We also propose an efficient FPGA-based hardware accelerator for Mamba-2. Given the distinct data flow characteristics of the two RMSNorm (Root Mean Square Normalization) layers in the hardware implementation, we introduce a reconfigurable hardware architecture based on a segmented quantization strategy, improving efficiency and flexibility by using a segmented lookup table to approximate the inverse square root operation. For the selective SSM layer operations, we design an intra-layer computation pipeline to enhance processing efficiency. Through design space exploration, we configure two versions of the hardware accelerator and evaluate their performance on the Alveo U50 platform. Experimental results show that both configurations achieve 99.63% bandwidth utilization. Compared to the CPU, the hardware accelerator achieves a 114.05× speedup and a 282.75× improvement in energy efficiency. It also outperforms the PyTorch implementation on the GPU, achieving a 29.81× speedup and a 297.87× improvement in energy efficiency. Additionally, the hardware implementation shows a 1.94× speedup and a 35.89× improvement in energy efficiency over the official CUDA-accelerated version.
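A segmented lookup table for the inverse square root (needed by RMSNorm) can be sketched as a piecewise-linear approximation; the segment count, the interval, and the assumption that inputs are first range-reduced into that interval are illustrative choices, not the paper's configuration.

```python
import numpy as np

def build_rsqrt_lut(n_segments=16, lo=0.5, hi=2.0):
    """Piecewise-linear segments approximating 1/sqrt(x) on [lo, hi); assumes range-reduced inputs."""
    edges = np.linspace(lo, hi, n_segments + 1)
    slopes, intercepts = [], []
    for a, b in zip(edges[:-1], edges[1:]):
        fa, fb = 1 / np.sqrt(a), 1 / np.sqrt(b)
        k = (fb - fa) / (b - a)              # chord slope over the segment
        slopes.append(k)
        intercepts.append(fa - k * a)
    return edges, np.array(slopes), np.array(intercepts)

def rsqrt_lut(x, edges, slopes, intercepts):
    """Evaluate the segmented approximation: one table lookup plus one multiply-add per value."""
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(slopes) - 1)
    return slopes[idx] * x + intercepts[idx]
```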
To tackle the contradiction between high power consumption, large resource usage and real-time requirements in deploying Convolutional Neural Networks (CNNs) on resource-constrained embedded edge devices, this paper proposes a low-power FPGA reconfigurable hardware accelerator design for lightweight Depthwise Separable Convolutional Neural Networks (DSCNNs). Based on the MobileNetV2 model, this design conducts in-depth analysis on the computational characteristics of Depthwise Separable Convolution (DSC), and innovatively designs a reconfigurable computing engine (including a decomposable PE array) that supports dynamic switching between Depthwise Convolution (DWC) and Pointwise Convolution (PWC) modes, significantly improving hardware resource utilization (DSP utilization reaches 99%). Combined with dynamic numerical quantization technology (adaptively selecting 16-bit quantization fractional bit-width for different network layers, with an accuracy loss of only 1.3% on the ImageNet dataset), efficient on-chip BRAM reuse data scheduling strategy (avoiding redundant off-chip DDR access), and pipeline processing mechanism based on Pingpong Buffer, tight overlap between computation and data transmission is realized. The accelerator is deployed and verified on the resource-constrained Xilinx Zynq-7000 FPGA platform. Experimental results show that while maintaining 71.5% ImageNet Top-1 classification accuracy, the system achieves an image classification frame rate of 18 FPS, a peak computation throughput of 11.56 GOPS, a total on-chip system power consumption of only 2.484 W, and an energy efficiency ratio of up to 4.65 GOPS/W. Compared with similar works, this design achieves an efficient balance between accuracy, power consumption and real-time performance on low-resource FPGA platforms, providing a feasible solution for efficient deployment of lightweight CNN models at the edge, and has good scalability.
This paper presents a Flash-Attention accelerator design methodology based on a 16×16 high-utilization systolic array architecture for long-sequence Transformer applications. By reformulating the Flash-Attention algorithm into a blocked matrix computation pattern combined with an improved softmax architecture and on-chip memory optimization strategy, an accelerator system operating at 200 MHz is implemented on a Xilinx Virtex-7 XC7VX690T-2FFG1761C FPGA platform. Experimental results demonstrate that the accelerator achieves an average speedup of 4.6× compared to conventional CPU implementations while maintaining a mean squared error (MSE) on the order of 10^-7 and structural similarity (SSIM) above 0.98. Furthermore, we present: (1) a dynamic weight reloading mechanism for small systolic arrays, improving processing element utilization to 79.01% in typical NLP application scenarios; (2) a hybrid-precision quantization-based matrix computation optimization scheme preserving model accuracy under 8/16-bit integer quantization. This research provides an effective hardware solution for lightweight Transformer deployment in the edge computing domain.
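The "blocked matrix computation pattern combined with an improved softmax" refers to the online-softmax reformulation behind Flash-Attention; a single-head NumPy sketch of that blocked computation (without the quantization or the systolic-array mapping) is shown below.

```python
import numpy as np

def blocked_attention(q, k, v, block=64):
    """Attention computed key-block by key-block with an online softmax,
    so the full attention matrix is never materialized (Flash-Attention-style sketch)."""
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    m = np.full(n, -np.inf)                 # running row-wise maximum of the scores
    l = np.zeros(n)                         # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)           # scores against this key block only
        m_new = np.maximum(m, s.max(axis=1))
        corr = np.exp(m - m_new)            # rescale the previously accumulated partial results
        p = np.exp(s - m_new[:, None])
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ vb
        m = m_new
    return out / l[:, None]
```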
No abstract available
Accurate defect detection is crucial for product quality. As manufacturing lines upgrade, real-time defect detection becomes more critical. However, deploying on resource-constrained edge devices is challenging. This paper presents a method for fast deployment of YOLOv5s on Xilinx UltraScale+ FPGA devices using industry-standard architectures to build an efficient underwater object detection system. YOLOv5 achieves an mAP of 91.3% when trained on the PKU-Market-Phone dataset. We used Xilinx's Vitis AI toolchain to quantize the PyTorch model to int8 and compile it, resulting in a post-quantization accuracy of 90.1%. Combining with the DPU, we achieved edge deployment with 38.46 FPS and a low power consumption of 6.024 W, meeting the requirements for low power and real-time performance. Comparisons with other platforms demonstrate the system's superiority, offering a viable solution for industrial edge deployment of defect detection.
Transformer, a recent mainstream model in deep learning, has revolutionized a wide range of AI applications, which motivates a surge in research to develop energy-efficient hardware accelerators. Most prior efforts have concentrated on enhancing on-chip computational energy efficiency through several strategies such as encoder-only models [1]–[7], quantization/sparsity [8]–[18], and layer pruning [19]. However, recent works [20], [21] show that external memory access (EMA) dominates total energy consumption. Our analysis based on [22], [23] also indicates that EMA accounts for up to 81% of the total energy usage (Fig. 23.1.1). Additionally, we recognize that the prior works exhibit low hardware utilization, as low as 9% in [4], which negatively impacts latency performance.
Recently the Bidirectional Encoder Representations from Transformers (BERT) model has gained a lot of attention because of its state-of-the-art performance in multiple natural language processing (NLP) tasks. However, just like many other deep learning based tasks, the large model size and intensive computation load of BERT make it difficult and expensive to run and implement on general purpose processors. The proposed hardware accelerator for the BERT model realizes faster inference speed and higher energy efficiency. The design procedure is elaborated in two stages: model compression and hardware architecture. Quantization is chosen as the compression technique because of its good speed-up performance as well as small model size and low complexity. In the hardware design, a systolic tensor array (STA) is applied as the processing element (PE) array to achieve lower area and power consumption by reducing the ratio between registers and the number of floating-point operations per second (FLOPS). Dedicated hardware is designed for the Softmax and layer normalization operations. Mathematical transformation is used to replace complicated nonlinear functions with simple operations to reduce the required hardware resources. Performance is evaluated based on the transformer-base model. The maximum speed of the overall hardware design is 125 MHz and the total latency is 165.9 us. Compared to the same task run on GPU, 22.4× and 7.5× speedups are achieved in multi-head attention (MHA) and feed-forward networks (FFN), respectively. The peak performance of this design is 4.1 TOPs/s and the maximum required memory bandwidth is 80 GB/s.
Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment
Transformer models have revolutionized AI tasks, but their large size hinders real-world deployment on resource-constrained and latency-critical edge devices. While binarized Transformers offer a promising solution by significantly reducing model size, existing approaches suffer from algorithm-hardware mismatches with limited co-design exploration, leading to suboptimal performance on edge devices. Hence, we propose a co-design method for efficient end-to-end edge deployment of Transformers from three aspects: algorithm, hardware, and joint optimization. First, we propose BMT, a novel hardware-friendly binarized Transformer with optimized quantization methods and components, and we further enhance its model accuracy by leveraging the weighted ternary weight splitting training technique. Second, we develop a streaming processor mixed binarized Transformer accelerator, namely BAT, which is equipped with specialized units and scheduling pipelines for efficient inference of binarized Transformers. Finally, we co-optimize the algorithm and hardware through a design space exploration approach to achieve a global trade-off between accuracy, latency, and robustness for real-world deployments. Experimental results show our co-design achieves up to 2.14–49.37× throughput gains and 3.72–88.53× better energy efficiency over state-of-the-art Transformer accelerators, enabling efficient end-to-end edge deployment.
The development of model compression is continuously motivated by the evolution of various neural network accelerators with ASIC or FPGA. On the algorithm side, the ultimate goal of quantization or pruning is accelerating the expensive DNN computations on low-power hardware. However, such a "design-and-deploy" workflow faces under-explored challenges in the current hardware-algorithm co-design community. First, although the state-of-the-art quantization algorithm can achieve low precision with negligible degradation of accuracy, the latest deep learning framework (e.g., PyTorch) can only support non-customizable 8-bit precision, data format, and parameter extraction. Secondly, the objective of quantization is to enable the computation with low-precision data. However, the current SoTA algorithm treats the quantized integer as an intermediate result, while the final output of the quantizer is the "discretized" floating-point values, ignoring the practical needs and adding additional workload to hardware designers for integer parameter extraction and layer fusion. Finally, the compression toolkits designed by the industry are constrained to their in-house product or a handful of algorithms. The limited degree of freedom in the current toolkit and the under-explored customization hinder the prototype ASIC or FPGA-based accelerator design. To resolve these challenges, we propose Torch2Chip, an open-sourced, fully customizable, and high-performance toolkit that supports user-designed compression followed by automatic model fusion and parameter extraction. Torch2Chip incorporates the hierarchical design workflow, and the user-customized compression algorithm will be directly packed into the deployment-ready format for prototype chip verification with either CNN or vision transformer (ViT). The code is available at https://github.com/SeoLabCornell/torch2chip.
In recent years, the internet of things (IoT) has become pervasive in everyday life. For real-time data analysis at the edge device, a lightweight deep neural network (DNN) is required. In this paper, the lightweight MobileNet model is used to design an energy-efficient hardware accelerator for edge devices. In the software framework (TensorFlow), the quantization-aware training technique with post-training fine-tuning quantization is applied to quantize the model, improving training convergence speed and minimizing parameters. In hardware design considerations, fixed-point operations can reduce computational complexity and memory storage space compared to floating-point operations, which directly affects the power consumption of the circuit. The proposed MobileNet hardware accelerator achieves low power consumption and is suitable for edge devices.
Super-resolution (SR) is a crucial component of end-side image processing tasks in constrained sensor environments. However, the existing convolutional neural networks (CNNs) used for SR have significant computational and parameter requirements, necessitating the use of specifically optimized acceleration hardware for the deployment of SR tasks. Accordingly, in this paper, we initially adopt effective lightweight strategies and mixed-precision quantization to obtain the hardware-friendly Light-FSRCNN, which reduces storage consumption by 73.4% in comparison to the original network. Furthermore, we devise a mixed-precision computation engine with a reduced area overhead, which is 15.9% more compact than traditional engines. The hardware processor constructed with this engine exhibits an energy efficiency of 1750.9 GOPS/W under TSMC 12nm synthesis, outperforming CPU, GPU, and analogous SR accelerators.
This study proposes a compact deep learning (DL) architecture and a highly parallelized computing hardware platform to reconstruct the blood flow index (BFi) in diffuse correlation spectroscopy (DCS). We leveraged a rigorous analytical model to generate autocorrelation functions (ACFs) to train the DL network. We assessed the accuracy of the proposed DL using simulated and milk phantom data. Compared to convolutional neural networks (CNN), our lightweight DL architecture achieves 66.7% and 18.5% improvement in MSE for BFi and the coherence factor β, using synthetic data evaluation. The accuracy of rBFi over different algorithms was also investigated. We further simplified the DL computing primitives using subtraction for feature extraction, considering further hardware implementation. We extensively explored computing parallelism and fixed-point quantization within the DL architecture. With the DL model's compact size, we employed unrolling and pipelining optimizations for computation-intensive for-loops in the DL model while storing all learned parameters in on-chip BRAMs. We also achieved pixel-wise parallelism, enabling simultaneous, real-time processing of 10 and 15 autocorrelation functions on Zynq-7000 and Zynq-UltraScale+ field programmable gate array (FPGA), respectively. Unlike existing FPGA accelerators that produce BFi and the β from autocorrelation functions on standalone hardware, our approach is an encapsulated, end-to-end on-chip conversion process from intensity photon data to the temporal intensity ACF and subsequently reconstructing BFi and β. This hardware platform achieves an on-chip solution to replace post-processing and miniaturize modern DCS systems that use single-photon cameras. We also comprehensively compared the computational efficiency of our FPGA accelerator to CPU and GPU solutions.
Convolutional neural networks (CNNs) are computationally demanding due to expensive Multiply-ACcumulate (MAC) operations. Emerging neural network models, such as AdderNet, exploit efficient arithmetic alternatives like sum-of-absolute-difference (SAD) operations to replace the costly MAC operations, while still achieving competitive model accuracy as compared with the CNN counterparts. Nevertheless, existing AdderNet accelerators still face critical implementation challenges to achieve maximal hardware and energy efficiency at the cost of model inference accuracy loss. This paper presents AdderNet 2.0, an algorithm-hardware co-design framework featuring a novel Activation-Oriented Quantization (AOQ) strategy, a Fused Bias Removal (FBR) scheme for on-chip feature map memory bitwidth reduction, and optimal PE designs to improve the overall resource utilization towards optimal AdderNet accelerator designs. Multiple AdderNet 2.0 accelerator design variants were implemented on Xilinx KV-260 FPGA. Experimental results show that the INT6 AdderNet 2.0 accelerators achieve significant hardware resource and energy savings when compared to prior CNN and AdderNet designs.
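The sum-of-absolute-difference substitution mentioned above is easy to state in code: the dot product in a linear or convolutional layer is replaced by a negated L1 distance. A fully-connected NumPy sketch (without AOQ, FBR, or the hardware-specific parts) follows.

```python
import numpy as np

def sad_layer(x, w):
    """AdderNet-style layer: negated sum of absolute differences instead of a dot product.
    x: (batch, in_features), w: (out_features, in_features); output: (batch, out_features)."""
    # |x - w| summed over the feature dimension; the sign flip makes larger = more similar.
    return -np.abs(x[:, None, :] - w[None, :, :]).sum(axis=-1)

# Usage: y = sad_layer(np.random.randn(4, 16), np.random.randn(8, 16))  # -> shape (4, 8)
```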
State Space Models (SSMs), like the recent Mamba2, have achieved remarkable performance and received extensive attention. However, deploying Mamba2 on resource-constrained edge devices encounters many problems: severe outliers within the linear layer challenging the quantization, diverse and irregular element-wise tensor operations, and hardware-unfriendly nonlinear functions in the SSM block. To address these issues, this paper presents FastMamba, a dedicated accelerator on FPGA with hardware-algorithm co-design to promote the deployment efficiency of Mamba2. Specifically, we successfully achieve 8-bit quantization for linear layers through Hadamard transformation to eliminate outliers. Moreover, a hardware-friendly and fine-grained power-of-two quantization framework is presented for the SSM block and convolution layer, and a first-order linear approximation is developed to optimize the nonlinear functions. Based on the accurate algorithm quantization, we propose an accelerator that integrates parallel vector processing units, pipelined execution dataflow, and an efficient SSM Nonlinear Approximation Unit, which enhances computational efficiency and reduces hardware complexity. Finally, we evaluate FastMamba on Xilinx VC709 FPGA. For the input prefill task on Mamba2-130M, FastMamba achieves 68.80× and 8.90× speedup over an Intel Xeon 4210R CPU and an NVIDIA RTX 3090 GPU, respectively. In the output decode experiment with Mamba2-2.7B, FastMamba attains 1.65× higher energy efficiency than the RTX 3090 GPU.
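The Hadamard trick used here (and in several other rotation-based methods in this collection) relies on the transform being orthogonal, so it can be folded into the weights without changing the full-precision output while spreading activation outliers across channels. A minimal NumPy sketch, assuming the hidden size is a power of two:

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix of size n (assumes n is a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def rotate_for_quant(x, w):
    """Fold H into a linear layer: y = x W^T = (x H)(W H)^T because H H^T = I.
    The rotated activations x H have their outliers spread out, so they quantize better."""
    n = x.shape[-1]
    h = hadamard(n)
    return x @ h, w @ h      # rotated activations, rotated weights
```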
Compute-in-memory (CIM) has gained prominence as a promising hardware architecture for machine-learning accelerators (MLAs) within the landscape of intelligent sensors (ISs). The acceleration of deep neural networks (DNNs) by MLAs highlights the need for improved energy efficiency. In recent years, CIM-aware DNN model compression techniques, such as low-precision quantization, have been extensively investigated to enhance the energy efficiency of tiny machine-learning (TinyML) models for edge devices. However, existing approaches primarily focus on the posttraining compression of pretrained models and overlook the energy consumption during compression-aware training. In this article, we propose a hamming weight (HW)-based quantization framework, named HamQ, to enhance the energy efficiency of analog CIM. A key contribution of this work is in a novel regularizer to reduce HW of quantized model weights, thereby implementing the crossbar with a lesser amount of ON bit-cells. This constraint results in lower bitline currents in crossbar arrays, which are often a major energy overhead in analog CIM accelerators. We analytically prove that HamQ evolves the probabilistic density of model weights to be high in low HW ranges while making it low in high HW ranges. Our method is evaluated on image classification and keyword spotting (KWS) tasks with TinyML models. Simulation results illustrate that, in comparison to models without regularization, HamQ reduces per-inference energy consumption by 54.0% with a marginal accuracy degradation of 1.5% for the 8-bit ResNet-18 model in CIFAR-10 image classification and by 42.7% with a 3.5% degradation for the 6-bit DS-CNN model in the KWS task.
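The HamQ regularizer targets the number of ON bits in the stored weight codes, since ON bit-cells drive the bitline current in an analog CIM crossbar. An evaluation-time count can be sketched as below; training would need a differentiable relaxation, which is omitted, and the two's-complement code view is an assumption.

```python
import numpy as np

def hamming_weight_penalty(w_int, n_bits=8):
    """Average number of ON bits per quantized weight code (HamQ-style penalty, sketch only)."""
    codes = w_int.astype(np.int64) & ((1 << n_bits) - 1)     # two's-complement view of signed codes
    bits = (codes[..., None] >> np.arange(n_bits)) & 1       # unpack each code into its bit-planes
    return float(bits.sum(axis=-1).mean())
```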
In the field of hardware accelerators for convolutional neural network (CNN) inference, quantization techniques have been widely employed to enhance performance. The prevailing quantization scheme in current accelerators uses signed 8-bit integer variables (INT8). CNN accelerators support INT8, while the lower precision INT4 is less common; accelerators supporting INT4 depthwise separable convolution (DWC) are even rarer. Therefore, this article presents a high-performance CNN accelerator that not only supports 8-bit and 4-bit data but also supports standard convolution (SC) and DWC. Additionally, in order to improve the transmission efficiency of DWC, an intermediate cache strategy is proposed, using a pointwise convolution (PW) input buffer (PW BUF) to store output data from depthwise convolution (DW) to avoid off-chip transmission. Furthermore, to address the issue that a DSP cannot perform two 4 × 4-bit multiplications when dealing with DW, a processing element (PE) is designed to make full use of DSP hardware resources. Finally, this accelerator is implemented on ZYNQ ZC706 with a frequency of 200 MHz. Experimental results show that it achieves a performance of up to 307.88 giga operations per second (GOPS) on VGG, reaching 97.9% of peak performance, while on MobileNet it achieves efficient performance with 206.43 GOPS using only 392 DSPs. Compared with mainstream CNN accelerators, it increases the DSP utilization rate (GOPS/DSP) by 1.5× to 33.5×.
Ternary quantization can effectively simplify matrix multiplication, which is the primary computational operation in neural network models. It has shown success in FPGA-based accelerator designs for emerging models such as GAT and Transformer. However, existing ternary quantization methods can lead to substantial accuracy loss under certain weight distribution patterns, such as GCN. Furthermore, current FPGA-based ternary weight designs often focus on reducing resource consumption while neglecting full utilization of FPGA DSP blocks, limiting maximum performance. To address these challenges, we propose ATE-GCN, an FPGA-based asymmetrical ternary quantization GCN accelerator using a software-hardware co-optimization approach. First, we adopt an asymmetrical quantization strategy with specific interval divisions tailored to the bimodal distribution of GCN weights, reducing accuracy loss. Second, we design a unified processing element (PE) array on FPGA to support various matrix computation forms, optimizing FPGA resource usage while leveraging the benefits of cascade design and ternary quantization, significantly boosting performance. Finally, we implement the ATE-GCN prototype on the VCU118 FPGA board. The results show that ATE-GCN maintains an accuracy loss below 2%. Additionally, ATE-GCN achieves average performance improvements of 224.13× and 11.1×, with up to 898.82× and 69.9× energy consumption savings compared to CPU and GPU, respectively. Moreover, compared to state-of-the-art FPGA-based GCN accelerators, ATE-GCN improves DSP efficiency by 63% with an average latency reduction of 11%.
Quantization is an important technique for the acceleration of transformer-based neural networks. Prior related works mainly consider quantization at the algorithm level, and their hardware implementation is inefficient. In this brief, we propose an efficient vision transformer accelerator with retraining-free and fine-tuning-free hybrid-precision quantization. At the algorithm level, the features and weights are divided into two parts: normal values and outlier values. These two parts are quantized with different bit widths and scaling factors. We use matrix transformation and a group-wise quantization policy to improve hardware utilization. At the hardware level, we propose a two-stage FIFO group structure and a hierarchical interleaving data flow to further improve the utilization of the PE array. As a result, the input and weight matrices are quantized to 5.71 bits on average with 0.526% accuracy loss on Swin-T. The accelerator achieves a frame rate of 118.9 FPS and an energy efficiency of 43.58 GOPS/W on the ZCU102 FPGA board, better than state-of-the-art works.
The deployment of large language models (LLMs) is increasingly constrained by memory and latency bottlenecks, motivating the need for quantization techniques that flexibly balance accuracy and efficiency. Recent work has introduced multi-precision models, which enable inference at multiple precisions within a single model depending on runtime constraints. To support such flexibility, quantized weights are often stored as bit-planes, where hardware efficiency improves when the compute operates directly at the bit-plane level and activates only the precision required by each request. In this work, we present AnyBCQ, a hardware-friendly multi-precision extension of Binary-Coded Quantization (BCQ) that supports direct bit-plane operations. By representing weights as binary bit-planes with corresponding scale factors, AnyBCQ enables bit-plane-level computation and maps naturally to accelerator-friendly, bit-parallel arithmetic. Our progressive precision expansion mechanism incrementally refines scaling factors while reusing previously assigned binary codes, yielding monotonic improvements in accuracy as additional bits are enabled. We further co-design a specialized kernel that exploits the BCQ structure to support dynamic per-request precision selection with negligible overhead. Experiments on recent LLMs demonstrate that AnyBCQ significantly narrows the accuracy drop in the low-bit regime (e.g. 2-bit), remains competitive at higher precision, and achieves throughput gains of up to 3.0x over half precision and 1.2x over state-of-the-art multi-precision methods. By aligning algorithmic flexibility with hardware efficiency, AnyBCQ provides a practical foundation for multi-precision LLM deployment across diverse service-level objectives.
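Binary-Coded Quantization represents a weight tensor as a sum of scaled {-1, +1} bit-planes, which is what makes per-request precision selection cheap: dropping trailing planes lowers the precision without touching the remaining codes. A greedy NumPy sketch (not AnyBCQ's progressive-expansion training) follows.

```python
import numpy as np

def bcq_decode(bit_planes, scales):
    """Reconstruct weights from binary bit-planes and per-plane scales: w ≈ sum_k s_k * b_k, b_k in {-1,+1}."""
    w = np.zeros(bit_planes[0].shape, dtype=np.float32)
    for b, s in zip(bit_planes, scales):
        w += s * (2.0 * b - 1.0)            # map the {0,1} plane to {-1,+1}
    return w

def bcq_encode_greedy(w, n_planes=2):
    """Greedy residual fitting of bit-planes; a simple (not optimal) encoder sketch."""
    planes, scales, r = [], [], w.astype(np.float32).copy()
    for _ in range(n_planes):
        b = (r >= 0).astype(np.float32)     # sign of the residual becomes the next plane
        s = np.abs(r).mean()                # per-plane scale
        planes.append(b)
        scales.append(s)
        r = r - s * (2.0 * b - 1.0)
    return planes, scales
```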
No abstract available
Precision-scalable convolutional neural networks (CNNs) offer a promising solution to balance network accuracy and hardware efficiency, facilitating high-performance execution on embedded devices. However, the requirement for small fine-grained multiplication calculations in precision-scalable (PS) networks has resulted in limited exploration on FPGA platforms. It is found that the deployment of PS accelerators encounters the following challenges: LUT-based multiply-accumulates (MACs) fail to make full use of DSPs, and DSP-based MACs support limited precision combinations and cannot efficiently utilize DSPs. Therefore, this brief proposes a DSP-based precision-scalable MAC with hybrid dataflow that supports most precision combinations and ensures high-efficiency utilization of DSP and LUT resources. Evaluated on mixed 4b/8b VGG16, compared with the 8b baseline, the proposed accelerator achieves a 3.97× improvement in performance with only a 0.37% accuracy degradation. Additionally, compared with state-of-the-art accelerators, the proposed accelerator achieves 1.20×–2.69× improvement in DSP efficiency and 1.63×–6.34× improvement in LUT efficiency.
Low-precision training has emerged as a powerful technique for reducing computational and storage costs in Deep Neural Network (DNN) model training, enabling on-chip training or fine-tuning on edge devices. However, existing low-precision training methods often require higher bit-widths to maintain accuracy as model sizes increase. In this paper, we introduce an outlier-aware quantization strategy for low-precision training. While traditional value-aware quantization methods require costly online distribution statistics operations on computational data, impeding the efficiency gains of low-precision training, our approach addresses this challenge through a novel Learnable Threshold based Outlier-Aware Quantization (LT-OAQ) training framework. This method concurrently updates outlier thresholds and model weights through gradient descent, eliminating the need for costly data-statistics operations. To efficiently support the LT-OAQ training framework, we designed a hardware accelerator based on the systolic array architecture. This accelerator introduces a processing element (PE) fusion mechanism that dynamically fuses adjacent PEs into clusters to support outlier computations, optimizing the mapping of outlier computation tasks, enabling mixed-precision training, and implementing online quantization. Our approach maintains model accuracy while significantly reducing computational complexity and storage resource requirements. Experimental results demonstrate that our design achieves a 2.9× speedup in performance and a 2.17× reduction in energy consumption compared to state-of-the-art low-precision accelerators.
The increasing demand for deploying Convolutional Neural Networks (CNNs) on FPGAs necessitates efficient resource utilization. This work presents a CNN accelerator that integrates weight sharing through quantization to optimize hardware efficiency for applications that can tolerate a minor trade-off of accuracy for reduced resource and power consumption. A model trained on CIFAR-10 is used, where Uniform Scalar Quantization (INT4) is leveraged to reduce the bitwidth of the weights. K-means clustering is applied to form a weight-sharing matrix, mapping quantized weights to shared indices and minimizing storage overhead. The accelerator is designed with low-resource and low-power constraints in mind and synthesized in Verilog using Vivado, targeting the Zynq-7000 FPGA. Resource utilization results reveal a significant reduction in LUTs and DSPs compared to conventional quantized implementations. The trained model achieves 80.2% accuracy in software, an 8% drop from the full-precision baseline, which is justified by the substantial savings in area and power, making it well-suited for embedded or edge applications where such trade-offs are acceptable.
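The weight-sharing step pairs the quantized weights with a K-means codebook so that only small indices are stored per weight. A 1-D K-means sketch in NumPy; the cluster count and the initialization are illustrative assumptions.

```python
import numpy as np

def kmeans_weight_sharing(w, n_clusters=16, iters=20, seed=0):
    """Cluster (already quantized) weights into a shared codebook plus per-weight indices (sketch)."""
    rng = np.random.default_rng(seed)
    flat = w.reshape(-1, 1).astype(np.float64)
    centers = flat[rng.choice(len(flat), n_clusters, replace=False)].copy()
    for _ in range(iters):
        idx = np.argmin(np.abs(flat - centers.T), axis=1)      # nearest centroid per weight
        for c in range(n_clusters):
            if np.any(idx == c):
                centers[c] = flat[idx == c].mean()              # update centroid
    return centers.ravel(), idx.reshape(w.shape)                # codebook + index matrix

# Storage cost: log2(n_clusters) bits per weight for the indices, plus a tiny codebook.
```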
No abstract available
To address the computational power and energy efficiency challenges in Llama2 large-model inference, this letter proposes a hardware-software co-design method and finally implements a high energy efficiency accelerator named QLlama based on FPGA. This work first employs a novel quantization method based on a microscaling data format, which allows sharing a scaling factor in E8M0 format for each subtensor block, thus enabling quantization and dequantization operations to be completed using only shift operations. Second, on this basis, a mixed precision configuration is implemented for different layers of Llama2 to balance accuracy loss and computational efficiency. Finally, a dedicated accelerator QLlama is designed, whose core units include a quantization unit for dynamic quantization, a vector-matrix multiplication unit for high density computation of quantized weights, a scaled dot product unit, and a basic operator unit. Experimental results show that this scheme achieves energy efficiency improvements of 2.13–10.66× with negligible accuracy loss, i.e., <0.2 perplexity increase. The code is available at https://github.com/wendadawen/QLlama.
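The shift-only dequantization comes from constraining each block's shared scale to a power of two (an E8M0-style exponent). A per-block NumPy sketch, with the block handling, rounding mode, and bit-width as assumptions:

```python
import numpy as np

def mx_quantize_block(x, n_bits=8):
    """Microscaling-style block quantization with a shared power-of-two scale,
    so dequantization reduces to an integer times a shift (sketch only)."""
    qmax = 2 ** (n_bits - 1) - 1
    amax = float(np.abs(x).max())
    exp = int(np.ceil(np.log2(amax / qmax))) if amax > 0 else 0   # shared exponent for the block
    scale = 2.0 ** exp
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, exp            # dequantize as q * 2**exp, i.e. a shift by the shared exponent
```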
Large language models (LLMs) face significant deployment challenges due to their substantial memory and computational demands. While low-precision quantization offers a promising solution, the presence of activation outliers severely degrades model accuracy. Existing approaches either compromise hardware efficiency through misaligned memory access or sacrifice quantization granularity through rigid bit-width allocation, particularly when handling non-uniform tensor distributions across and within layers. This paper presents a hardware-software co-designed framework resulting in an outlier-adaptive LLM inference accelerator with memory-aligned mixed-precision group quantization, named OA-LAMA. The framework comprises three key innovations: First, an outlier-adaptive memory-aligned mixed-precision group (OAMAG) format with a novel outlier reordering technique is proposed to preserve accuracy while maintaining DRAM-aligned memory access. Second, a distribution-aware group allocation strategy is proposed to address inter-layer outlier ratio variance. Finally, we design the OA-LAMA hardware architecture with a three-level accumulation architecture and timing-balanced processing elements to support the OAMAG format efficiently. Evaluations demonstrate that OA-LAMA achieves better accuracy than state-of-the-art 4-bit quantization methods while delivering 1.21–3.09× performance improvement and 1.35–2.47× energy efficiency gains over leading LLM accelerators. OA-LAMA establishes new Pareto frontiers in accuracy-efficiency co-optimization for LLM inference. OA-LAMA is open-sourced at https://github.com/CLab-HKUST-GZ/ICCAD25_OA-LAMA.git.
Activation outliers in Large Language Models (LLMs), which exhibit large magnitudes but small quantities, significantly affect model performance and pose challenges for the acceleration of LLMs. To address this bottleneck, researchers have proposed several co-design frameworks with outlier-aware algorithms and dedicated hardware. However, they face challenges balancing model accuracy with hardware efficiency when accelerating LLMs in a low bit-width manner. To this end, we propose OutlierCIM, the first algorithm and hardware co-design framework for the compute-in-memory (CIM) accelerator with an outlier-aware quantization algorithm. The key contributions of OutlierCIM are 1) an outlier-clustered tiling strategy that regulates memory access and reduces inefficient workloads, both of which are introduced by outliers, 2) a hybrid-strategy quantization and a reconfigurable double-bit CIM macro array that overcome the low storage utilization and high latency of outlier-based LLM quantization, and 3) a quantization factor post-processing strategy and a dedicated quantizer that efficiently unify the multiplication and accumulation of outlier-caused FP-INT workloads. Implemented in a 28 nm CMOS technology, OutlierCIM occupies an area of 2.25 mm². When evaluated on comprehensive benchmarks, OutlierCIM achieves up to 4.54× energy efficiency improvement and 3.91× speedup compared to the state-of-the-art outlier-aware accelerators.
To alleviate the vulnerability of graph neural networks (GNNs) on unseen graphs, many works propose to integrate large language models (LLMs) into GNNs, called graph foundation models (GFMs). The LLM-enhanced GNN, a typical integration method of GFMs, has achieved state-of-the-art performance in most graph-related tasks. However, intensive general matrix multiplications (GEMMs) overhead of LLMs poses a significant challenge to end-to-end inference latency. The introduction of LLMs brings 100× more workload than original GNNs, with GEMMs accounting for more than 99%, becoming the bottleneck of end-to-end inference. To tackle the above challenge, we present GFMEngine, an algorithm and hardware co-design accelerator supporting LLM-enhanced GNNs at multiple levels. (1) At the algorithm level, we point out that the computational precision of LLMs has little impact on the end-to-end accuracy, and propose a product-quantization-based (PQ-based) matrix multiplication for LLMs to alleviate the intensive GEMMs in LLMs, reducing more than 70% computation with negligible accuracy loss. (2) At the hardware level, we point out that the implementation of PQ-based matrix multiplication effectively alleviates the intensive GEMMs but results in a substantial increase in dynamic memory access. Coupled with the dynamic memory access inherent in GNNs, we design a unified indexing unit as the hardware support, reducing ~ 30% memory access in end-to-end inference. (3) At the compilation level, we further design an extensible instruction set architecture as the software support, GFM-ISA, for various real-world GFM tasks. We implement GFMEngine with TSMC 28nm process, and extensive experiments show that GFMEngine achieves up to 3.93×, 38.66×, 22.32×, 2.96× speedup and up to 102.52×, 37.82×, 28.37×, 2.56× energy efficiency improvement compared with NVIDIA Tesla A100 and the domain-specific accelerators, SGCN, MEGA, FACT, respectively.
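Product-quantization-based matrix multiplication replaces most of the GEMM with codebook lookups of precomputed partial products. A deliberately naive NumPy sketch; the codebook construction (sampled rows instead of trained centroids), subspace count, and code count are all simplifications rather than the paper's method.

```python
import numpy as np

def pq_matmul(x, w, n_subspaces=4, n_codes=16, seed=0):
    """PQ-style approximate x @ w: snap activation sub-vectors to a small codebook,
    then reuse precomputed codebook-times-weight partial products instead of full GEMM."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    d_sub = d // n_subspaces
    y = np.zeros((n, w.shape[1]))
    for s in range(n_subspaces):
        xs = x[:, s * d_sub:(s + 1) * d_sub]
        ws = w[s * d_sub:(s + 1) * d_sub, :]
        codebook = xs[rng.choice(n, min(n_codes, n), replace=False)]        # naive codebook
        idx = np.argmin(((xs[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
        table = codebook @ ws                                               # precomputed partial products
        y += table[idx]                                                     # lookup instead of multiply
    return y
```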
Transformer-based video generation models have demonstrated significant potential in content creation. However, the current state-of-the-art model employing "3D full attention" encounters substantial computation and storage challenges. For instance, the attention map for CogVideoX-5B requires 56.50 GB, and generating a video of 49 frames takes approximately 1 minute on an NVIDIA A100 GPU under FP16. Although model quantization has proven effective in reducing both memory and computational costs, applying it to video generation models still faces challenges in preserving algorithm performance while ensuring efficient hardware processing. To address these issues, we introduce PARO, a video generation accelerator with pattern-aware reorder-based attention quantization. PARO investigates the diverse attention patterns of 3D full attention and proposes a novel reorder technique that unifies these patterns into a single "block diagonal" structure. Block-wise mixed-precision quantization is further applied to achieve lossless compression at an average bitwidth of 4.80 bits. In terms of hardware, to overcome the limitation that existing mixed-precision computing units cannot fully utilize the attention-map bitwidth to accelerate QK multiplication, PARO designs an output-bitwidth-aware mixed-precision processing element (PE) array through hardware-software co-design. This ensures that the mixed-precision characteristics are fully exploited to enhance hardware efficiency in the bottleneck attention computation. Experiments demonstrate that PARO delivers up to 2.71× improvement in end-to-end performance compared to an NVIDIA A100 GPU and achieves 6.38–7.05× speedup over state-of-the-art ASIC-based accelerators on the CogVideoX-2B and 5B models.
Transformers' compute-intensive operations pose enormous challenges for their deployment in resource-constrained EdgeAI/tinyML devices. As an established neural network compression technique, quantization reduces the hardware computational and memory resources. In particular, fixed-point quantization is desirable to ease the computations using lightweight blocks, like adders and multipliers, of the underlying hardware. However, deploying fully-quantized Transformers on existing general-purpose hardware, generic AI accelerators, or specialized architectures for Transformers with floating-point units might be infeasible and/or inefficient. Towards this, we propose SwiftTron, an efficient specialized hardware accelerator designed for Quantized Transformers. SwiftTron supports the execution of different types of Transformer operations (such as Attention, Softmax, GELU, and Layer Normalization) and accounts for diverse scaling factors to perform correct computations. We synthesize the complete SwiftTron architecture in a 65 nm CMOS technology with the ASIC design flow. Our accelerator executes the RoBERTa-base model in 1.83 ns, while consuming 33.64 mW power and occupying an area of 273 mm². To ease reproducibility, the RTL of our SwiftTron architecture is released at https://github.com/albertomarchisio/SwiftTron.
Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot. Code is available at https://github.com/facebookresearch/SpinQuant.
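To make the computational-invariance argument concrete, the following minimal numpy sketch (illustrative only, not the SpinQuant code; the shapes, the per-tensor int4 quantizer, and the use of a random rather than learned rotation are assumptions) shows that rotating weights and activations by the same orthogonal matrix leaves the full-precision output unchanged while typically shrinking the 4-bit quantization error when one activation channel carries outliers.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int4(x):
    """Symmetric per-tensor 4-bit round-to-nearest quantization (illustrative)."""
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

d, n = 64, 16
W = rng.normal(size=(d, d))
X = rng.normal(size=(n, d))
X[:, 3] *= 50.0                        # one outlier channel dominates the range

# Random orthogonal rotation Q (SpinQuant *learns* Q; a random one already helps).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

Y_ref = X @ W.T                         # full-precision reference
# Computational invariance: (X Q)(W Q)^T = X Q Q^T W^T = X W^T
assert np.allclose((X @ Q) @ (W @ Q).T, Y_ref)

Y_plain   = quantize_int4(X) @ quantize_int4(W).T
Y_rotated = quantize_int4(X @ Q) @ quantize_int4(W @ Q).T

err = lambda Y: np.linalg.norm(Y - Y_ref) / np.linalg.norm(Y_ref)
print(f"relative error, no rotation:   {err(Y_plain):.3f}")
print(f"relative error, with rotation: {err(Y_rotated):.3f}")
```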
Large language models (LLMs) require substantial compute, and thus energy, at inference time. While quantizing weights and activations is effective at improving efficiency, naive quantization of LLMs can significantly degrade performance due to large-magnitude outliers. This paper describes FPTQuant, which introduces four novel, lightweight, and expressive function-preserving transforms (FPTs) to facilitate quantization of transformers: (1) a mergeable pre-RoPE transform for queries and keys, (2) a mergeable transform for values, (3) a mergeable scaling transform within the MLP block, and (4) a cheap, dynamic scaling transform. By leveraging the equivariances and independencies inherent to canonical transformer operations, we design these FPTs to maintain the model's function while shaping the intermediate activation distributions to be more quantization-friendly. FPTQuant requires no custom kernels and adds virtually no overhead during inference. The FPTs are trained both locally, to reduce outliers, and end-to-end, such that the outputs of the quantized and full-precision models match. FPTQuant enables static INT4 quantization with minimal overhead and shows a state-of-the-art speed-up of up to 3.9× over FP. Empirically, FPTQuant has an excellent accuracy-speed trade-off: it performs on par with or exceeds most prior work and shows only slightly lower accuracy than a method that is up to 29% slower.
Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$--that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries that are non-differentiable and thus prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU.
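As a concrete illustration of that parameterization, the sketch below (a minimal construction under assumed index layout, not the ButterflyQuant implementation) composes log2(n) stages of 2x2 Givens rotations into an n x n matrix that is orthogonal by construction and is described by n·log2(n)/2 continuous angles, which is what allows gradient-based learning where Hadamard's discrete entries do not.

```python
import numpy as np

def butterfly_matrix(thetas, n):
    """Compose log2(n) stages of 2x2 Givens rotations into an orthogonal n x n matrix."""
    assert n & (n - 1) == 0, "n must be a power of two"
    stages = int(np.log2(n))
    assert len(thetas) == stages * n // 2
    B = np.eye(n)
    t = iter(thetas)
    for s in range(stages):
        stride = 1 << s
        G = np.eye(n)
        for block in range(0, n, 2 * stride):
            for k in range(stride):
                i, j = block + k, block + k + stride
                th = next(t)
                c, sn = np.cos(th), np.sin(th)
                G[i, i], G[i, j] = c, -sn
                G[j, i], G[j, j] = sn, c
        B = G @ B                         # each stage is orthogonal, so the product is too
    return B

rng = np.random.default_rng(0)
n = 8
thetas = rng.uniform(0.0, 2.0 * np.pi, size=n * int(np.log2(n)) // 2)
B = butterfly_matrix(thetas, n)
print(np.allclose(B @ B.T, np.eye(n)))    # True: orthogonal for any choice of angles
```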
Large-scale language models (LLMs) excel in language processing tasks but face deployment challenges due to high memory and computational demands. While low-bit quantization, such as 4-bit techniques, offers a potential solution, these methods often suffer from significant accuracy loss or require considerable implementation effort such as reordering, rotation, etc. To address these challenges, we propose QRazor, a simple yet effective quantization scheme that enables 4-bit quantization of weights, activations, and the KV cache in transformer-based LLMs. QRazor operates in two stages: first, quantizing data to 8- or 16-bit integers with absolute-max scaling to preserve accuracy close to full-precision models, and second, compressing the quantized data to 4 bits using our significant data razoring (SDR) technique, which retains only the four most salient bits. Without any fine-tuning or additional training, QRazor achieves performance similar to or better than state-of-the-art 4-bit quantization methods, surpassing SmoothQuant and QLLM by over 12 points and QuaRot (RTN) by more than 2.9 points in zero-shot reasoning accuracy on the LLaMA2-7B model. Additionally, we introduce an integer-based arithmetic unit optimized for QRazor, allowing direct low-precision operations on SDR data without decompression.
KV cache quantization can improve Large Language Model (LLM) inference throughput and latency in long-context and large-batch-size scenarios while preserving model effectiveness. However, current methods have three unsolved issues: they overlook layer-wise sensitivity to KV cache quantization, incur high overhead for online fine-grained decision-making, and offer low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why the key cache is generally more important than the value cache for quantization error reduction. We further propose a simple yet effective framework, KVTuner, to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization, and directly utilize the offline-searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed-precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
Large Language Models (LLMs) have become foundational to modern natural language processing, yet their immense computational and memory demands pose major obstacles for efficient inference. Transformer-based LLMs rely heavily on floating-point general matrix-matrix multiplication (FP-GEMM), which dominates both compute and bandwidth. In this paper, we introduce AxCore, a quantization-aware, approximate GEMM unit that combines weight-only quantization with floating-point multiplication approximation (FPMA) to deliver highly efficient and accurate LLM inference. Unlike traditional GEMM units, AxCore eliminates multipliers entirely, replacing them with low-bit integer additions in a novel systolic array. AxCore features several key innovations: (1) a mixed-precision FPMA-based processing element that supports direct computation on compressed weights and high-precision activations; (2) a lightweight accuracy preservation strategy, including subnormal number handling, error compensation, and format-aware quantization; and (3) a set of systolic array optimizations, including shared correction and normalization logic. Evaluations on open-source LLMs show that AxCore achieves up to 6.3×–12.5× higher compute density than conventional FP GEMM units. Compared to state-of-the-art INT4-based accelerators, FIGLUT and FIGNA, AxCore improves compute density by 53% and 70%, respectively, while also delivering lower perplexity. AxCore is open-sourced at: https://github.com/CLab-HKUST-GZ/micro58-axcore.
As transformer-based Large Language Models (LLMs) grow, deploying them under resource constraints has become increasingly complex, making quantization a vital technique for efficient inference. However, unlike convolutional neural networks (CNNs), LLMs exhibit unique tensor distribution characteristics, particularly in activations, significantly hindering low-bit quantization. This paper uses a statistical analysis grounded in standard distribution theory to reveal that LLM activations contain rare but high-magnitude outliers significantly influencing model performance. Our empirical findings show that these outliers are not merely noise but contain semantically critical information, and their improper handling during quantization leads to severe accuracy degradation. To address this, we propose an efficient Outlier-Rescaled quantization method that preserves expressive outlier representations using a lightweight shift-based mechanism within a 4-bit format. Evaluations demonstrate that our method substantially restores performance lost under INT4 quantization, particularly in LLMs, without requiring additional hardware or mixed-precision schemes. This study underscores the importance of activation-aware design in LLM quantization and provides a practical path forward for ultra-low-bit deployment.
We present an on-chip implementation of a compressed Transformer-based language model on a Xilinx Artix-7 FPGA. Our contributions include: (1) combining ultra-low-precision quantization (4 bits) and multi-query attention (MQA) to compress the KV cache by 8×, enabling sequence lengths up to 256 tokens; (2) a streaming hardware architecture in Verilog that implements pre-layernorm, attention, and feed-forward sublayers using block RAM (BRAM) and DSPs; and (3) post-synthesis results demonstrating real-time throughput (4.4 K tokens/s) with BRAM and DSP utilizations of 31.9% and 85%, respectively. The prototype supports generative inference entirely on-chip, paving the way for privacy-preserving, edge-scale LLMs. Code and scripts are available at https://github.com/chae-sy/squeezing_lm
Large Language Models (LLMs) have demonstrated impressive performance on a range of Natural Language Processing (NLP) tasks. Unfortunately, the immense amount of computations and memory accesses required for LLM training makes them prohibitively expensive in terms of hardware cost, and thus challenging to deploy in use cases such as on-device learning. In this paper, motivated by the observation that LLM training is memory-bound, we propose a novel dynamic quantization strategy, termed Dynamic Stashing Quantization (DSQ), that puts a special focus on reducing the memory operations, but also enjoys the other benefits of low precision training, such as the reduced arithmetic cost. We conduct a thorough study on two translation tasks (trained-from-scratch) and three classification tasks (fine-tuning). DSQ reduces the amount of arithmetic operations by $20.95\times$ and the number of DRAM operations by $2.55\times$ on IWSLT17 compared to the standard 16-bit fixed-point, which is widely used in on-device learning.
The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques that can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression, defined as targeting extremely low bit counts such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state of the art in LLM compression via two innovations: 1) learned additive quantization of weight matrices in an input-adaptive fashion, and 2) joint optimization of codebook parameters across each transformer block. Broadly, AQLM is the first scheme that is Pareto optimal in terms of accuracy vs. model size when compressing to less than 3 bits per parameter, and it significantly improves upon all known schemes in the extreme compression (2-bit) regime. In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed while executing in a much smaller memory footprint.
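The additive multi-codebook idea can be illustrated in a few lines (a toy sketch with random codebooks and greedy residual encoding; AQLM itself learns codebooks and codes jointly and input-adaptively, so the sizes and selection rule below are assumptions): each weight group is stored as M small code indices whose codewords sum to an approximation of the group.

```python
import numpy as np

rng = np.random.default_rng(0)
g, M, K = 8, 2, 16                        # group size, number of codebooks, codewords each
codebooks = rng.normal(size=(M, K, g))    # AQLM learns these from calibration data

def encode(v):
    """Greedy residual encoding: pick one codeword per codebook."""
    codes, residual = [], v.copy()
    for m in range(M):
        idx = int(np.argmin(np.linalg.norm(codebooks[m] - residual, axis=1)))
        codes.append(idx)
        residual -= codebooks[m, idx]
    return codes

def decode(codes):
    """Reconstruct the group as the sum of the selected codewords."""
    return sum(codebooks[m, c] for m, c in enumerate(codes))

v = rng.normal(size=g)
codes = encode(v)
print("codes:", codes, " error:", round(float(np.linalg.norm(v - decode(codes))), 3))
```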
Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference in half while retaining full-precision performance. With our method, a 175B-parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication to quantize most of the features. For the emergent outliers, however, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while more than 99.9% of values are still multiplied in 8-bit. Using LLM.int8(), we show empirically that it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software.
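The two-part decomposition can be sketched in a few lines of numpy (an illustration of the idea only, not the bitsandbytes kernels; the 6.0 threshold, the shapes, and the use of float32 instead of fp16 for the outlier path are assumptions): outlier feature dimensions are pulled out and multiplied in floating point, while everything else goes through vector-wise int8 quantization with per-row and per-column scales.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_rows_int8(M):
    """Row-wise absmax int8 quantization; returns int8 matrix and per-row scales."""
    scale = np.abs(M).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    return np.round(M / scale).astype(np.int8), scale

n, d, m = 8, 64, 32
X = rng.normal(size=(n, d)).astype(np.float32)
X[:, 5] *= 60.0                                    # an emergent outlier feature dimension
W = rng.normal(size=(d, m)).astype(np.float32)

threshold = 6.0
outlier_cols = np.abs(X).max(axis=0) > threshold   # which input features are outliers

# int8 path: vector-wise quantization of the non-outlier part of X and of W
Xq, sx = quantize_rows_int8(X[:, ~outlier_cols])
Wq, sw = quantize_rows_int8(W[~outlier_cols].T)    # per-output-column scales
Y_int8 = (Xq.astype(np.int32) @ Wq.T.astype(np.int32)) * sx * sw.T

# higher-precision path: only the few outlier dimensions
Y_fp = X[:, outlier_cols] @ W[outlier_cols]

Y = Y_int8 + Y_fp
Y_ref = X @ W
print(np.linalg.norm(Y - Y_ref) / np.linalg.norm(Y_ref))   # small relative error
```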
Transformer-based Large Language Models (LLMs) have significantly advanced AI capabilities but pose considerable challenges for deployment on edge devices due to high computational demands, memory bandwidth constraints, and energy consumption. This paper addresses these challenges by presenting an efficient framework for deploying the Qwen2.5-0.5B model on the Xilinx Kria KV260 edge platform, a heterogeneous system integrating an ARM Cortex-A53 CPU with reconfigurable FPGA logic. Leveraging Activation-aware Weight Quantization (AWQ) with FPGA-accelerated execution pipelines, the proposed approach enhances both model compression rate and system throughput. Additionally, we propose a hybrid execution strategy that intelligently offloads compute-intensive operations to the FPGA while utilizing the CPU for lighter tasks, effectively balancing the computational workload and maximizing overall performance. Our framework achieves a model compression rate of 55.08% compared to the original model and produces output at a rate of 5.1 tokens per second, outperforming the baseline performance of 2.8 tokens per second.
Neural processing units (NPUs) have become essential in modern client and edge platforms, offering unparalleled efficiency by delivering high throughput at low power. This is critical to improve the TOPS/W of the NPU, leading to longer battery life. While NPUs were initially designed to efficiently execute computer vision (CV) workloads such as CNNs, the rising demand to run transformer-based large language models (LLMs) locally now calls for significant architectural and software adaptation. This paper presents LLM-NPU, a comprehensive software-hardware co-optimization framework that enables scalable, power-efficient LLM deployment on NPUs under tight compute and memory budgets. We present software solutions such as vertical and horizontal operator fusion, quantization-aware weight compression, hybrid key-value (KV) quantization, eviction strategies, and static-shape inference that target memory bottlenecks and compute inefficiencies in LLM execution. On the hardware side, we explore domain-specialized NPU enhancements, including processing-in-memory architectures, extended input channel accumulation, structured sparsity acceleration, GEMM engine optimizations, mixed precision, microscaling format support, and fusion-aware execution pipelines. These co-designed innovations can collectively improve the energy, throughput, and latency of NPUs for LLM workloads.
Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Post-training quantization (PTQ) is considered an effective method to accelerate LLM inference. Despite its growing popularity in LLM model compression, PTQ deployment faces two major challenges. First, low-bit quantization leads to performance degradation. Second, restricted by the limited integer computing unit types on GPUs, quantized matrix operations with different precisions cannot be effectively accelerated. To address these issues, we introduce ABQ-LLM, a novel arbitrary-bit quantization algorithm and inference framework. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU. ABQ-LLM introduces several key innovations: (1) a distribution correction method for transformer blocks to mitigate distribution differences caused by full quantization of weights and activations, improving performance at low bit-widths; (2) a bit balance strategy to counteract performance degradation from asymmetric distribution issues at very low bit-widths (e.g., 2-bit); and (3) an innovative quantization acceleration framework that reconstructs the quantized matrix multiplication of arbitrary precision combinations from BTC (Binary TensorCore) equivalents, removing the limitations of INT4/INT8 computing units. ABQ-LLM can convert each component's bit-width gain into actual acceleration gain, maximizing performance under mixed precision (e.g., W6A6, W2A8). With the W2*A8 quantization configuration on the LLaMA-7B model, it achieves a WikiText2 perplexity of 7.59 (down 2.17 from AffineQuant's 9.76). Compared to SmoothQuant, we realize a 1.6x acceleration improvement and a 2.7x memory compression gain.
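The arbitrary-precision reconstruction rests on a simple identity: once weights and activations are expanded into bit-planes, any WxAy matrix multiplication becomes a weighted sum of 1-bit matrix multiplications, which is the operation binary tensor cores execute. The toy sketch below (unsigned operands, no sign handling, scaling, or kernel details; not the ABQ-LLM code) verifies that identity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
WB, AB = 2, 4                                   # weight / activation bit-widths
W = rng.integers(0, 2**WB, size=(8, 16))        # unsigned quantized weights
A = rng.integers(0, 2**AB, size=(16, 4))        # unsigned quantized activations

acc = np.zeros((8, 4), dtype=np.int64)
for i in range(WB):                             # weight bit-planes
    Wi = (W >> i) & 1
    for j in range(AB):                         # activation bit-planes
        Aj = (A >> j) & 1
        acc += (Wi @ Aj) << (i + j)             # each term is a pure binary matmul

print(np.array_equal(acc, W @ A))               # True: exact reconstruction
```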
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner. Existing post-training quantization (PTQ) solutions are primarily integer-based and struggle with bit widths below 8 bits. Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions, and it has emerged as a default choice in many hardware platforms. One characteristic of FP quantization is that its performance largely depends on the choice of exponent bits and clipping range. In this regard, we construct a strong FP-PTQ baseline by searching for the optimal quantization parameters. Furthermore, we observe a high inter-channel variance and low intra-channel variance pattern in activation distributions, which adds activation quantization difficulty. We recognize this pattern to be consistent across a spectrum of transformer models designed for diverse tasks, such as LLMs, BERT, and Vision Transformer models. To tackle this, we propose per-channel activation quantization and show that these additional scaling factors can be reparameterized as exponential biases of weights, incurring a negligible cost. Our method, for the first time, can quantize both weights and activations in the LLaMA-13B to only 4-bit and achieves an average score of 63.1 on the common sense zero-shot reasoning tasks, which is only 5.8 lower than the full-precision model, significantly outperforming the previous state-of-the-art by 12.7 points. Code is available at: https://github.com/nbasyl/LLM-FP4.
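The reparameterization of per-channel activation scales is a small algebraic identity and is easy to sanity-check. Below is a minimal sketch of the folding idea for a linear layer (LLM-FP4 additionally restricts the scales so they can be absorbed as exponent biases of the FP4 weights, which is not modeled here; shapes and names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 4, 16, 8
X = rng.normal(size=(n, d)) * rng.uniform(0.1, 20.0, size=d)   # high inter-channel variance
W = rng.normal(size=(m, d))

s = np.abs(X).max(axis=0)             # one scale per input channel
X_scaled = X / s                      # channels now have comparable ranges, easier to quantize
W_folded = W * s                      # absorb the scales into the weights

# The layer function is unchanged: (X / s) (W * s)^T == X W^T
print(np.allclose(X_scaled @ W_folded.T, X @ W.T))   # True
```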
Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency across various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics, namely accuracy, inference latency, and throughput (total output tokens generated per second), providing insights into the suitability of low-bit quantization for real-world deployment. Using the results, users can then make suitable decisions based on the specifications that need to be met. We discuss the pros and cons of GSQ and GPTQ techniques on models of different sizes, which also serve as a benchmark for future experiments.
Transformer-based Large Language Models (LLMs) power applications from virtual assistants and code generation to scientific discovery. As their capabilities grow, they are used for emerging use cases such as offline AI copilots, on-device personalization, and edge inference. This demands efficient deployment of fine-tuned LLMs or compact Small Language Models (SLMs) on resource-constrained edge devices. However, deploying LLMs on the edge presents a significant hardware design challenge. The diversity in model architectures, input/output token lengths, and batch sizes leads to widely varying compute and memory demands. Moreover, design parameters like systolic array sizes, vector lengths, data widths, and operating frequencies drastically affect energy consumption and latency. While hardware-aware quantization is often adopted for power-performance gains, determining the optimal hardware configuration that meets tight power and performance budgets remains a non-trivial task, especially when the design space spans millions of possible configurations. Exhaustive exploration is computationally prohibitive and delays time to market. We introduce Architecture-Tuner (ArchTune), a lightweight analytical framework that predicts power, latency, and energy consumption for RISC-V-based accelerators featuring systolic arrays and vector processing units (VPUs). Given an LLM workload, ArchTune rapidly estimates energy across millions of configurations using calibrated analytical models, eliminating the need for exhaustive simulations. ArchTune achieves R² = 99.42% with 10.41% MAPE for systolic arrays and 8.2% MAPE for VPUs. By combining these models with systematic latency and memory analysis, ArchTune empowers early-stage design-space exploration, enabling designers to select energy-efficient hardware tailored for specific LLM workloads on edge platforms.
No abstract available
Transformer-based large language models (LLMs) have demonstrated remarkable potential across a wide range of practical applications. However, long-context inference remains a significant challenge due to the substantial memory requirements of the key-value (KV) cache, which can scale to several gigabytes as sequence length and batch size increase. In this paper, we present PackKV, a generic and efficient KV cache management framework optimized for long-context generation. PackKV introduces novel lossy compression techniques specifically tailored to the characteristics of KV cache data, featuring a careful co-design of compression algorithms and system architecture. Our approach is compatible with the dynamically growing nature of the KV cache while preserving high computational efficiency. Experimental results show that, with the same minimal accuracy drop as state-of-the-art quantization methods, PackKV achieves, on average, a 153.2% higher memory reduction rate for the K cache and 179.6% for the V cache. Furthermore, PackKV delivers extremely high execution throughput, effectively eliminating decompression overhead and accelerating the matrix-vector multiplication operation. Specifically, PackKV achieves an average throughput improvement of 75.7% for K and 171.7% for V across A100 and RTX Pro 6000 GPUs, compared to cuBLAS matrix-vector multiplication kernels, while demanding less GPU memory bandwidth. Code is available at https://github.com/BoJiang03/PackKV
Transformer-based large language models (LLMs) have achieved great success with the growing model size. LLMs' size grows by 240× every two years, which outpaces the hardware progress and makes model inference increasingly costly. Model quantization is a promising approach to mitigate the widening gap between LLM size and hardware capacity. However, the existence of outliers, values with significant magnitudes, in LLMs makes existing quantization methods less effective. Prior outlier-aware quantization schemes adopt sparsity encoding techniques to separate outliers from normal values where the process requires global coordination (e.g., a global sparsity coordination list). This incurs complex encoding/decoding hardware logics and an extra orchestration controller for the computation between outlier and normal values. As such, it is not hardware-efficient and hence only achieves sub-optimal quantization benefits. We propose OliVe, an algorithm/architecture co-designed solution that adopts an outlier-victim pair (OVP) quantization and handles outlier values locally with low hardware overheads and high performance gains. The key insight of OliVe is that outliers are important while the normal values next to them are not. Thus those normal values (called victims) can be sacrificed to accommodate outliers. This enables a memory-aligned OVP encoding scheme, which can be efficiently integrated to the existing hardware accelerators like systolic array and tensor core. As a result, OliVe-based accelerator surpasses the existing outlier-aware accelerator, GOBO, by 4.5× speedup and 4.0× energy reduction, respectively, with a superior model accuracy.
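The outlier-victim pairing can be illustrated with a toy encoder (the scales, threshold, and simple "prune the neighbour" rule below are simplifications assumed for illustration; OliVe's actual number formats and encoding are more involved): each aligned pair either keeps two normal 4-bit values or one coarsely quantized outlier plus a zeroed victim, so the memory layout never changes.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16)
w[3], w[10] = 9.5, -8.2             # two outliers among normal values

def quant_int4(x, scale):
    """Symmetric 4-bit round-to-nearest at a given scale."""
    return np.clip(np.round(x / scale), -8, 7) * scale

normal_scale = 0.5                   # scale chosen for the bulk of the distribution
outlier_scale = 2.0                  # coarser scale reserved for outlier magnitudes
threshold = 7 * normal_scale         # anything beyond the normal int4 range is an outlier

out = np.empty_like(w)
for i in range(0, len(w), 2):        # process memory-aligned pairs
    a, b = w[i], w[i + 1]
    if abs(a) > threshold or abs(b) > threshold:
        # keep the outlier (coarsely quantized), sacrifice the victim next to it
        if abs(a) >= abs(b):
            out[i], out[i + 1] = quant_int4(a, outlier_scale), 0.0
        else:
            out[i], out[i + 1] = 0.0, quant_int4(b, outlier_scale)
    else:
        out[i], out[i + 1] = quant_int4(a, normal_scale), quant_int4(b, normal_scale)

print(np.round(out, 2))
```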
The Large Language Model (LLM) is widely employed for tasks such as intelligent assistants, text summarization, translation, and multi-modality on mobile phones. However, the current methods for on-device LLM deployment maintain slow inference speed, which causes poor user experience. To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbolic expression-based approach to support dynamic shape model inference; (b) operator optimizations and execution priority setting to enhance inference speed and reduce phone lagging; (c) an FP4 quantization method termed M0E4 to reduce dequantization overhead; (d) a sub-tensor-based technique to eliminate the need for copying KV cache after LLM inference. Furthermore, we implement these methods in our mobile inference engine, Transformer-Lite, which is compatible with both Qualcomm and MTK processors. We evaluated Transformer-Lite's performance using LLMs with varied architectures and parameters ranging from 2B to 14B. Specifically, we achieved prefill and decoding speeds of 121 token/s and 14 token/s for ChatGLM2 6B, and 330 token/s and 30 token/s for smaller Gemma 2B, respectively. Compared with CPU-based FastLLM and GPU-based MLC-LLM, our engine attains over 10x speedup for the prefill speed and 2~3x speedup for the decoding speed.
E-commerce chatbots play a crucial role in customer service but often struggle with understanding complex queries. This study introduces a breakthrough approach leveraging the Falcon-7B model, a state-of-the-art Large Language Model (LLM) with 7 billion parameters. Trained on a vast dataset of 1,500 billion tokens from RefinedWeb and curated corpora, the Falcon-7B model excels in natural language understanding and generation. Notably, its 16-bit full quantization transformer ensures efficient computation without compromising scalability or performance. By harnessing cutting-edge machine learning techniques, our method aims to redefine e-commerce chatbot systems, providing businesses with a robust solution for delivering personalized customer experiences.
We present FlexLLM, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for prefill and decode, and provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient quantization (12.68 WikiText-2 PPL) surpassing SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29$\times$ end-to-end speedup, 1.64$\times$ higher decode throughput, and 3.14$\times$ better energy efficiency than an NVIDIA A100 GPU (7nm) running BF16 inference; projected results on the V80 FPGA at 7nm reach 4.71$\times$, 6.55$\times$, and 4.13$\times$, respectively. In long-context scenarios, integrating the HMT plug-in reduces prefill latency by 23.23$\times$ and extends the context window by 64$\times$, delivering 1.10$\times$/4.86$\times$ lower end-to-end latency and 5.21$\times$/6.27$\times$ higher energy efficiency on the U280/V80 compared to the A100 baseline. FlexLLM thus bridges algorithmic innovation in LLM inference and high-performance accelerators with minimal manual effort.
Despite the growing prevalence of large language model (LLM) architectures, a crucial concern persists regarding their energy and power consumption, which still lags far behind the remarkable energy efficiency of the human brain. Recent strides in spiking language models (LM) and transformer architectures aim to address this concern by harnessing the spiking activity of biological neurons to enhance energy/power efficiency. Doubling down on the principles of model quantization and energy efficiency, this paper proposes the development of a novel binary/ternary (1/1.58-bit) spiking LM architecture. Achieving scalability comparable to a deep spiking LM architecture is facilitated by an efficient knowledge distillation technique, wherein knowledge from a non-spiking full-precision “teacher” model is transferred to an extremely weight quantized spiking “student” LM. Our proposed model represents a significant advancement as the first-of-its-kind 1/1.58-bit spiking LM, and its performance is rigorously evaluated on multiple text classification tasks of the GLUE benchmark.
Transformer-based models have evolved into Large Language Models (LLMs) by increasing model sizes to achieve higher accuracy, but they incur significant computational and memory costs. As quantization is a promising method to mitigate the huge cost of LLMs, the presence of outliers can lead to accuracy drops during quantization. Previous work pointed out LLMs have outliers only in specific input channels of activations. This suggests that per-input channel quantization would be beneficial, but it poses excessive computational overhead without optimization. To address these challenges, we propose a hardware and software co-design that mitigates the overhead of per-input channel quantization. We first propose AirGun, a quantization method that adaptively quantizes LLM modules. We observe that LLMs have high quantization sensitivity only in specific modules. Based on our observation, AirGun applies hardware-efficient per-tensor quantization for non-sensitive modules and per-input channel quantization for sensitive modules. For per-input channel quantization, we introduce early reconstruction and adaptive dyadic numbering, dismissing the overhead while exploiting its advantages. Additionally, we propose the AirGun accelerator that fully utilizes the advantages of AirGun. As a result, the AirGun accelerator achieves a 4.19 × speedup and 63.16 % lower energy consumption compared to the previous LM accelerator while achieving higher accuracy.
The Visual Geometry Grounded Transformer (VGGT) enables strong feed-forward 3D reconstruction without per-scene optimization. However, its billion-parameter scale creates high memory and compute demands, hindering on-device deployment. Existing LLM quantization methods fail on VGGT due to saturated activation channels and diverse 3D semantics, which cause unreliable calibration. Furthermore, VGGT presents hardware challenges regarding precision-sensitive nonlinear operators and memory-intensive global attention. To address this, we propose VersaQ-3D, an algorithm-architecture co-design framework. Algorithmically, we introduce the first calibration-free, scene-agnostic quantization for VGGT down to 4-bit, leveraging orthogonal transforms to decorrelate features and suppress outliers. Architecturally, we design a reconfigurable accelerator supporting BF16, INT8, and INT4. A unified systolic datapath handles both linear and nonlinear operators, reducing latency by 60%, while two-stage recomputation-based tiling alleviates memory pressure for long-sequence attention. Evaluations show VersaQ-3D preserves 98-99% accuracy at W4A8. At W4A4, it outperforms prior methods by 1.61x-2.39x across diverse scenes. The accelerator delivers 5.2x-10.8x speedup over edge GPUs with low power, enabling efficient instant 3D reconstruction.
The field of Artificial Intelligence has witnessed remarkable progress in recent years, especially with the emergence of powerful large language models (LLMs) based on the transformer architecture. Cloud-based LLMs, such as OpenAI's ChatGPT, offer impressive capabilities but come with concerns regarding latency and privacy due to network dependencies. This article presents an innovative approach to LLM inference, envisioning a future where LLMs with billions of parameters can be executed directly on mobile devices without network connectivity. The article showcases a fine-tuned GPT LLM with 3 billion parameters that can operate smoothly on devices with as low as 4GB of memory. Through the integration of native code and model quantization techniques, the application not only serves as a general-purpose assistant but also facilitates seamless mobile interactions with text-to-actions features. The article provides insights into the training pipeline, implementation details, test results, and future directions of on-device LLM inference. This breakthrough technology opens up possibilities for empowering users with sophisticated AI capabilities while preserving their privacy and eliminating latency concerns.
Generative Artificial Intelligence (GAI) is revolutionizing the world with its unprecedented content creation ability. The Large Language Model (LLM) is one of its most embraced branches. However, due to an LLM's substantial size and resource-intensive nature, it is typically cloud-hosted, raising concerns about privacy, usage limitations, and latency. In this paper, we propose to utilize ubiquitous distributed wireless edge computing resources for real-time LLM inference. Specifically, we introduce a novel LLM edge inference framework, incorporating batching and model quantization to ensure high-throughput inference on resource-limited edge devices. Then, based on the architecture of transformer decoder-based LLMs, we formulate an edge inference optimization problem, which is NP-hard, considering batch scheduling and joint allocation of communication and computation resources. The solution is the optimal throughput under edge resource constraints and heterogeneous user requirements on latency and accuracy. To solve this NP-hard problem, we develop an OT-GAH (Optimal Tree-search with Generalized Assignment Heuristics) algorithm with reasonable complexity and a 1/2-approximation ratio. We first design the OT algorithm with online tree-pruning for the single-edge-node multi-user case, which navigates the inference request selection within the tree structure to maximize throughput. We then consider the multi-edge-node case and propose the GAH algorithm, which recursively invokes OT in each node's inference scheduling iteration. Simulation results demonstrate the superiority of OT-GAH batching over other benchmarks, revealing an over 45% time complexity reduction compared to brute-force searching.
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment through quantization, pruning, or decoding strategy adjustments. We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes. Through systematic analysis of various LLM frameworks, we identify key vulnerability patterns: layer expansion frequently disrupts attention mechanisms, compression techniques induce information loss cascades, and decoding adjustments amplify prediction divergences. Our investigation reveals transformer architectures exhibit inherent robustness thresholds that determine hemorrhage severity across modification types. We propose three mitigation strategies: gradient-aware pruning preserves critical weight pathways, dynamic quantization scaling maintains activation integrity, and decoding calibration aligns generation trajectories with original model distributions. This work establishes foundational metrics for evaluating model stability during adaptation, providing practical guidelines for maintaining performance while enabling efficient LLM deployment. Our findings advance understanding of neural network resilience under architectural transformations, particularly for large-scale language models.
This work presents a Retrieval-Augmented Generation (RAG) pipeline that integrates document preprocessing, embedding-based retrieval, and large language model (LLM) generation into a unified framework. The pipeline begins with the ingestion of PDF documents, followed by text cleaning, sentence segmentation, and chunking to ensure compatibility with embedding model constraints. High-dimensional vector representations are generated using transformer-based embedding models and stored for downstream use. Semantic similarity search, implemented via dot product and cosine similarity, enables efficient retrieval of contextually relevant text. For scalability, the framework is designed to accommodate vector indexing methods such as Faiss. On the generation side, a locally hosted LLM (Gemma-7B) is employed with optional quantization for reduced resource consumption. Retrieved context is integrated with user queries to enhance the accuracy and relevance of generated responses. This pipeline demonstrates a practical approach for building domain-specific, retrieval-augmented applications that balance efficiency, scalability, and adaptability to local compute environments.
The rapid expansion of academic literature presents significant challenges for manual analysis and categorization, making it difficult to identify key research gaps. In this context, the LeanDL-HPC 2025 challenge aims to automatically classify Brazilian theses and dissertations based on their adherence to state-level strategic themes while also addressing critical constraints in computational resources such as memory, runtime, and energy. Considering these challenges, this work proposes a comparison of several efficient approaches for adapting modern models, such as LLM and BERT, under resource constraints. Specifically, it explores Parameter-Efficient Fine-Tuning (PEFT) through QLoRA—which reduces memory consumption by using 4-bit quantization and low-rank adapters — and a recent improvement of a traditional transformer encoder (ModernBERT). Moreover, the experiments employed a Balanced Loss Function, which was also used to overcome class imbalance, penalizing the misclassification of minority labels.
The quantization of large language models (LLMs) has been a prominent research area aimed at enabling their lightweight deployment in practice. Existing research on LLM quantization has mainly explored the interplay between weights and activations or employed auxiliary components, while neglecting the need to adjust weights during quantization. Consequently, original weight distributions frequently fail to yield desired results after round-to-nearest (RTN) quantization. Even though incorporating techniques such as mixed precision and low-rank error approximation into LLM quantization can yield improved results, they inevitably introduce additional computational overhead. On the other hand, traditional techniques for weight quantization, such as Generative Post-Training Quantization, rely on manually tweaking weight distributions to minimize local errors, but they fall short of achieving globally optimal outcomes. Although the recently proposed Learnable Singular-value Increment improves global weight quantization by modifying weight distributions, it disrupts the original distribution considerably. This introduces a pronounced bias toward the training data and can degrade downstream task performance. In this paper, we introduce Singular-value Diagonal Expansion, a more nuanced approach to refining weight distributions to achieve better quantization alignment. Furthermore, we introduce Cross-layer Learning, which improves overall quantization outcomes by distributing errors more evenly across layers. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches, including OmniQuant, DuQuant, and PrefixQuant.
Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.
Existing weight-activation quantization methods for Large Language Models (LLMs) primarily address channel-wise outliers but often neglect token-wise outliers, which limits the accuracy of quantized models. In this work, we propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization) by effectively isolating token-wise outliers. First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache, a process that is training-free and highly efficient (e.g., 1 minute for Llama-3-70B). Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error. Our experiments show that PrefixQuant significantly outperforms existing dynamic quantization methods, even under coarser static quantization settings. For instance, PrefixQuant achieves an average accuracy improvement of +3.08 and +2.85 points over SpinQuant (dynamic quantization) on five zero-shot reasoning tasks under dynamic and static quantization settings, respectively, on W4A4KV4 Llama-3-8B. Additionally, we demonstrate up to 2.74x prefilling speedup and 2.16x decoding speedup for LLMs using W4A4 PrefixQuant. Our code is available at https://github.com/ChenMnZ/PrefixQuant.
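A toy experiment makes the motivation concrete (illustrative only; the real method identifies specific outlier tokens, places them as a prefix in the KV cache, and additionally trains block-wise parameters, none of which is modeled here): removing the couple of tokens with extreme key norms from the quantized range sharply reduces the 4-bit error on the remaining cache.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 128, 64
K = rng.normal(size=(T, d))
K[[0, 17]] *= 40.0                        # a couple of outlier tokens dominate the range

def int4_error(M):
    """Relative error of per-tensor symmetric 4-bit quantization."""
    scale = np.abs(M).max() / 7.0
    Mq = np.clip(np.round(M / scale), -8, 7) * scale
    return np.linalg.norm(Mq - M) / np.linalg.norm(M)

norms = np.linalg.norm(K, axis=1)
prefix = np.argsort(norms)[-2:]           # tokens treated as a full-precision prefix
rest = np.setdiff1d(np.arange(T), prefix)

print("quantize everything:      ", round(int4_error(K), 3))
print("quantize non-prefix only: ", round(int4_error(K[rest]), 3))
```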
Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and representational capacity. While existing ultra-low-bit methods rely on binary approximations or quantization-aware training (QAT), they often suffer from either limited representational capacity or huge training resource overhead. We introduce PTQ to Trit-Planes (PTQTP), a structured PTQ framework that decomposes weight matrices into dual ternary {-1, 0, 1} trit-planes. This approach achieves multiplication-free additive inference by decoupling weights into discrete topology (trit-planes) and continuous magnitude (scales), effectively enabling high-fidelity sparse approximation. PTQTP provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment without architectural modifications; and (3) uniform ternary operations that eliminate mixed-precision overhead. Comprehensive experiments on LLaMA3.x and Qwen3 (0.6B-70B) demonstrate that PTQTP significantly outperforms sub-4-bit PTQ methods on language reasoning, mathematical reasoning, and coding tasks. PTQTP rivals 1.58-bit QAT performance while requiring only a single hour of quantization, compared to 10-14 GPU-days for training-based methods, and its end-to-end inference is 4.63$\times$ faster than the FP16 baseline model, establishing a new and practical solution for efficient LLM deployment in resource-constrained environments. Code will be available at https://github.com/HeXiao-55/PTQTP.
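The trit-plane decomposition can be sketched directly (a rough illustration: the ternary fit below is a simple thresholding heuristic applied greedily to the residual, not PTQTP's progressive approximation algorithm, and the per-vector scale layout is assumed): a weight vector is replaced by two {-1, 0, 1} patterns and two scales, so inference needs only additions and subtractions per plane.

```python
import numpy as np

def ternary_fit(v):
    """Heuristic ternary approximation of a vector: scale * {-1, 0, 1} pattern."""
    thr = 0.7 * np.mean(np.abs(v))
    T = np.sign(v) * (np.abs(v) > thr)
    a = np.abs(v[T != 0]).mean() if np.any(T) else 0.0
    return a, T

rng = np.random.default_rng(0)
w = rng.normal(size=256)

a1, T1 = ternary_fit(w)                   # first trit-plane
a2, T2 = ternary_fit(w - a1 * T1)         # second plane fits the residual
w_hat = a1 * T1 + a2 * T2

print("relative error:", round(float(np.linalg.norm(w - w_hat) / np.linalg.norm(w)), 3))
# Inference is multiplication-free per plane: x @ T reduces to additions and subtractions.
```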
Emergent Large Language Models (LLMs) are set apart from traditional language models by their extraordinary performance and powerful reasoning capacity. However, the computational and storage costs of these LLMs are staggering, so quantization has become a trending topic. To address the accuracy decay caused by quantization, two streams of work in post-training quantization stand out. One uses other weights to compensate for existing quantization error, while the other transfers the quantization difficulty to other parts of the model. Combining both merits, we introduce Learnable Singular value Increment (LSI) as an advanced solution. LSI uses Singular Value Decomposition to extract the singular values of the weights and makes them learnable, helping weights compensate for each other conditioned on activations. Incorporating LSI with existing techniques, we achieve state-of-the-art performance in diverse quantization settings, whether weight-only, weight-activation, or extremely low-bit scenarios. By unleashing the potential of LSI, efficient fine-tuning of quantized models is no longer a prohibitive problem.
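A bare-bones sketch of the parameterization (not the LSI training procedure; the int4 quantizer, the calibration setup, and the crude random search standing in for gradient-based learning are all assumptions): the weight is kept as U diag(S + delta) V^T and only the small increment vector delta is adjusted so that the quantized layer better reproduces the full-precision outputs on calibration activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def quant_int4(W):
    """Per-tensor symmetric 4-bit quantization (illustrative)."""
    s = np.abs(W).max() / 7.0
    return np.clip(np.round(W / s), -8, 7) * s

d = 32
W = rng.normal(size=(d, d))
X = rng.normal(size=(128, d))              # calibration activations
Y_ref = X @ W.T

U, S, Vt = np.linalg.svd(W)
loss = lambda delta: np.linalg.norm(X @ quant_int4(U @ np.diag(S + delta) @ Vt).T - Y_ref)

delta = np.zeros_like(S)
best = loss(delta)
print("quantized-output error before:", round(float(best), 3))
for _ in range(500):                        # crude random search stands in for learning delta
    cand = delta + rng.normal(scale=1e-2, size=S.shape)
    c = loss(cand)
    if c < best:
        best, delta = c, cand
print("quantized-output error after: ", round(float(best), 3))
```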
Large language models with billions of parameters are often over-provisioned: many layers contribute little unique information yet dominate the memory and energy footprint during inference. We present LieQ (Layer-wise Information Effectiveness Quantization), a hardware-native, metric-driven post-training quantization framework that addresses the critical challenge of maintaining accuracy in sub-8B models (fewer than 8B parameters) under extreme low-bit compression. LieQ keeps a uniform bit-width within each layer while mixing precision across layers, preserving standard multiplication kernels and avoiding codebooks, irregular formats, or irregular memory access at inference time. Our method uncovers a strong correlation between layer-wise functional saliency and representational compactness, revealing that layers with higher training-induced energy concentration are functionally irreplaceable. Leveraging this insight, we propose a purely geometry-driven sensitivity proxy that enables automatic bit-width allocation under a target average-bit budget without expensive gradient updates or inference-based perplexity probing. At sub-2-bit precision, LieQ consistently reduces the large accuracy gap typically observed for naive 2-bit baselines on the Qwen3 and LLaMA3.x families, while retaining standard-kernel efficiency. These properties make LieQ a practical path toward deploying small language models on resource-constrained edge devices. Code will be available at https://github.com/HeXiao-55/LieQ-official.git.
Inference time, model size, and accuracy are three key factors in deep model compression. Most of the existing work addresses these three key factors separately as it is difficult to optimize them all at the same time. For example, low-bit quantization aims at obtaining a faster model; weight sharing quantization aims at improving compression ratio and accuracy; and mixed-precision quantization aims at balancing accuracy and inference time. To simultaneously optimize bit-width, model size, and accuracy, we propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method. We integrate L2 normalization, pruning, and the weight decay term to reduce the weight discrepancy in the gradient estimator during quantization, thus producing highly compressed ternary weights. Our method brings the highest test accuracy and the highest compression ratio. For example, it produces a 939kb (49$\times$) 2bit ternary ResNet-18 model with only 4\% accuracy drop on the ImageNet dataset. It compresses 170MB Mask R-CNN to 5MB (34$\times$) with only 2.8\% average precision drop. Our method is verified on image classification, object detection/segmentation tasks with different network structures such as ResNet-18, ResNet-50, and MobileNetV2.
We propose Additive Powers-of-Two~(APoT) quantization, an efficient non-uniform quantization scheme for the bell-shaped and long-tailed distribution of weights and activations in neural networks. By constraining all quantization levels as the sum of Powers-of-Two terms, APoT quantization enjoys high computational efficiency and a good match with the distribution of weights. A simple reparameterization of the clipping function is applied to generate a better-defined gradient for learning the clipping threshold. Moreover, weight normalization is presented to refine the distribution of weights to make the training more stable and consistent. Experimental results show that our proposed method outperforms state-of-the-art methods, and is even competitive with the full-precision models, demonstrating the effectiveness of our proposed APoT quantization. For example, our 4-bit quantized ResNet-50 on ImageNet achieves 76.6% top-1 accuracy without bells and whistles; meanwhile, our model reduces 22% computational cost compared with the uniformly quantized counterpart. The code is available at https://github.com/yhhhli/APoT_Quantization.
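The level construction is the heart of the scheme and fits in a few lines (a minimal sketch with two additive terms and an arbitrary term set; the clipping parameter alpha and the term choices are illustrative, and the learned clipping and weight normalization from the paper are omitted): each quantization level is a sum of power-of-two terms, which concentrates levels near zero where bell-shaped weights are dense while remaining shift-friendly in hardware.

```python
import numpy as np

# Two additive terms, each drawn from a small power-of-two set (0 allowed).
terms = [0.0, 2.0**-1, 2.0**-2, 2.0**-3, 2.0**-4]
levels = np.unique([a + b for a in terms for b in terms])
levels = np.unique(np.concatenate([-levels, levels]))    # symmetric level set

def apot_quantize(w, alpha=1.0):
    """Project each weight onto the nearest APoT level, scaled by clipping range alpha."""
    idx = np.argmin(np.abs(w[..., None] / alpha - levels), axis=-1)
    return alpha * levels[idx]

rng = np.random.default_rng(0)
w = rng.normal(scale=0.3, size=10)
print(np.round(w, 3))
print(np.round(apot_quantize(w), 3))
```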
Quantization has emerged as one of the most promising compression technologies for deploying efficient large models in various real-time applications in recent years. Considering that the storage and IO of weights take up the vast majority of the overhead inside a large model, weight-only quantization can lead to large gains. However, existing quantization schemes suffer from significant accuracy degradation at very low bits, or require additional computational overhead when deployed, making them difficult to apply to large-scale applications in industry. In this paper, we propose decoupleQ, achieving a substantial increase in model accuracy, especially at very low bits. decoupleQ abandons the traditional heuristic quantization paradigm and decouples the model parameters into integer and floating-point parts, thus transforming the quantization problem into a traditional mathematical optimization problem with constraints, which is then solved alternately by off-the-shelf optimization methods. Quantization via decoupleQ is linear and uniform, making it more hardware-friendly than non-uniform counterparts, and enabling the idea to be migrated to high-bit quantization to enhance its robustness. Our method achieves online accuracy close to fp16/bf16 for 2-bit quantization of large speech models at ByteDance. The code is available at https://github.com/bytedance/decoupleQ.
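A toy version of the decoupling idea, alternating between an integer rounding step and a least-squares fit of the floating-point scale and offset per row; it minimizes plain weight MSE rather than the paper's constrained layer-output objective, so treat it as illustrative only.

```python
import numpy as np

def decoupled_quantize(w, bits=2, iters=5):
    """Sketch of decoupling weights into an integer grid q and a floating-point
    (scale, offset) pair, solved by alternating minimization."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    s = (w.max(axis=1, keepdims=True) - w.min(axis=1, keepdims=True)) / (qmax - qmin) + 1e-8
    z = w.mean(axis=1, keepdims=True)
    q = np.zeros_like(w)
    for _ in range(iters):
        q = np.clip(np.round((w - z) / s), qmin, qmax)        # integer step
        for r in range(w.shape[0]):                           # float step, per row
            A = np.stack([q[r], np.ones_like(q[r])], axis=1)  # fit w_r ~ s*q_r + z
            sol, *_ = np.linalg.lstsq(A, w[r], rcond=None)
            s[r, 0], z[r, 0] = sol[0] + 1e-12, sol[1]
    return s * q + z

if __name__ == "__main__":
    w = np.random.randn(4, 64)
    wq = decoupled_quantize(w, bits=2)
    print("per-row MSE:", np.mean((w - wq) ** 2, axis=1))
```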
Despite the outstanding performance of transformers in both language and vision tasks, their expanding computation and model size have increased the demand for efficient deployment. To address the heavy computation and parameter drawbacks, quantization is frequently studied in the community as a representative model compression technique and has seen extensive use on ConvNets. However, due to the unique properties of transformers, low-bit quantization applications are still limited and underexplored. In this paper, we attribute the difficulty of low-bit quantization-aware training for transformers to their unique variation behaviors, which differ significantly from those of ConvNets. Based on comprehensive quantitative analysis, we observe variation at three levels: varying module quantization sensitivities, outliers in static weight and activation distributions, and oscillation in dynamic parameter fluctuations. These variations bring instability to quantization-aware training (QAT) and negatively influence performance. We explore best practices to alleviate the variations' influence during low-bit transformer QAT and propose a variation-aware quantization scheme for both vision and language transformers. We extensively verify and show that our scheme can alleviate the variation and improve the performance of transformers across various models and tasks. Our solution substantially improves the 2-bit Swin-T and binary BERT-base, achieving 3.35% and 1.4% accuracy improvements over previous state-of-the-art methods on ImageNet-1K and GLUE. Codes and models are available at https://github.com/HuangOwen/Quantization-Variation.
Mixture-of-Experts (MoE) is a promising way to scale up the learning capacity of large language models. It increases the number of parameters while keeping FLOPs nearly constant during inference through sparse activation. Yet, it still suffers from significant memory overheads due to the vast parameter size, necessitating model compression techniques. Post-training quantization offers a powerful approach for model compression. Existing methods adopt a fixed quantization precision for the entire MoE model. This rigid setup can lead to suboptimal performance because it ignores the inherent sparse structure. For example, MoE's sparse routing mechanism leads to different activation patterns, where shared experts are accessed by all tokens while token-conditioned experts are selectively activated. This activation disparity suggests different quantization requirements, with consistently activated shared experts potentially needing higher precision to maintain model quality. In this paper, we study a fine-grained precision setup for MoE quantization. We explore MoE structure-aware quantization heuristics, ranging from coarse (e.g., MoE layers) to fine granularity (e.g., linear layers). Our investigations reveal critical principles, where different MoE structures require varying numbers of bits for effective quantization. Conclusions are supported by extensive benchmarking across two representative MoE models and six tasks, including commonsense reasoning and natural language understanding. We further show that an MoE quantized with fine-grained mixed precision achieves a state-of-the-art 65.35% average performance, compared to 64.30% for the baseline (i.e., GPTQ). Moreover, based on these findings, we introduce novel data-driven techniques for optimizing bit allocation in MoE quantization, including an outlier-aware linear layer scorer and an MoE block importance predictor.
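The intuition about activation disparity can be encoded as a trivial structure-aware bit plan, sketched below; the bit-widths and the name-matching rules are illustrative placeholders, not the paper's data-driven scorers.

```python
def moe_bit_allocation(layers, default_bits=2, shared_expert_bits=4, attention_bits=4):
    """Hedged sketch of structure-aware mixed precision for MoE: always-active
    components (attention, shared experts) get more bits than sparsely routed
    experts. The paper's heuristics and learned scorers are richer; this only
    encodes the activation-disparity intuition."""
    plan = {}
    for name in layers:
        if "shared_expert" in name:
            plan[name] = shared_expert_bits
        elif "attn" in name:
            plan[name] = attention_bits
        else:                      # token-conditioned (routed) experts
            plan[name] = default_bits
    return plan

if __name__ == "__main__":
    names = ["blk0.attn.qkv", "blk0.shared_expert.w1",
             "blk0.expert3.w1", "blk0.expert7.w2"]
    print(moe_bit_allocation(names))
```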
Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
This paper describes the design and implementation of parallel neural networks (PNNs) with the novel programming language Golang. Our approach follows the classical Single-Program Multiple-Data (SPMD) model, where a PNN is composed of several sequential neural networks, each trained with a proportional share of the training dataset. We used for this purpose the MNIST dataset, which contains binary images of handwritten digits. Our analysis focuses on different activation functions and optimizations in the form of stochastic gradients and initialization of weights and biases. We conduct a thorough performance analysis, where network configurations and different performance factors are analyzed and interpreted. Golang and its inherent parallelization support proved well suited to parallel neural network simulation, yielding considerably decreased processing times compared to sequential variants.
This paper presents a framework for estimating the remaining useful life (RUL) of mechanical systems. The framework consists of a multi-layer perceptron and an evolutionary algorithm for optimizing the data-related parameters. The framework makes use of a strided time window to estimate the RUL of mechanical components. Tuning the data-related parameters can become a very time-consuming task. The framework presented here automatically reshapes the data such that the efficiency of the model is increased. Furthermore, the complexity of the model is kept low, e.g., neural networks with few hidden layers and few neurons at each layer. Having simple models has several advantages, such as short training times and the capacity to run in environments with limited computational resources such as embedded systems. The proposed method is evaluated on the publicly available C-MAPSS dataset, and its accuracy is compared against other state-of-the-art methods for the same dataset.
A low-precision deep neural network training technique for producing sparse, ternary neural networks is presented. The technique incorporates hardware implementation costs during training to achieve significant model compression for inference. Training involves three stages: network training using L2 regularization and a quantization threshold regularizer, quantization pruning, and finally retraining. The resulting networks achieve improved accuracy, reduced memory footprint, and reduced computational complexity compared with conventional methods, on the MNIST and CIFAR10 datasets. Our networks are up to 98% sparse and 5 and 11 times smaller than equivalent binary and ternary models, translating to significant resource and speed benefits for hardware implementations.
Homomorphic encryption (HE) enables computation on encrypted data, and hence it has great potential in privacy-preserving outsourcing of computations to the cloud. Hardware acceleration of HE is crucial as software implementations are very slow. In this paper, we present design methodologies for building a programmable hardware accelerator for speeding up cloud-side homomorphic evaluations on encrypted data. First, we propose a divide-and-conquer technique that enables homomorphic evaluations in a large polynomial ring $R_{Q,2N}$ to use a hardware accelerator that has been built for the smaller ring $R_{Q,N}$. The technique makes it possible to use a single hardware accelerator flexibly for supporting several HE parameter sets. Next, we present several architectural design methods that we use to realize the flexible and instruction-set accelerator architecture, which we call `Medha'. At every level of the implementation hierarchy, we explore possibilities for parallel processing. Starting from hardware-friendly parallel algorithms for the basic building blocks, we gradually build heavily parallel RNS polynomial arithmetic units. Next, many of these parallel units are interconnected elegantly so that their interconnections require the minimum number of nets, therefore making the overall architecture placement-friendly on the platform. For Medha, we take a memory-conservative design approach and eliminate any off-chip memory access during homomorphic evaluations. Finally, we implement Medha in a Xilinx Alveo U250 FPGA and measure the timing performance of the microcoded homomorphic addition, multiplication, key-switching, and rescaling for the leveled HE scheme RNS-HEAAN at a 200 MHz clock frequency. For two large parameter sets, Medha achieves accelerations of up to 68x and 78x, respectively, compared to a highly optimized software implementation, Microsoft SEAL, running at 2.3 GHz.
Customized hardware accelerators have been developed to provide improved performance and efficiency for DNN inference and training. However, the existing hardware accelerators may not always be suitable for handling various DNN models as their architecture paradigms and configuration tradeoffs are highly application-specific. It is important to benchmark the accelerator candidates in the earliest stage to gather comprehensive performance metrics and locate the potential bottlenecks. Further demands also emerge after benchmarking, which require adequate solutions to address the bottlenecks and improve the current designs for targeted workloads. To achieve these goals, in this paper, we leverage an automation tool called DNNExplorer for benchmarking customized DNN hardware accelerators and exploring novel accelerator designs with improved performance and efficiency. Key features include (1) direct support for popular machine learning frameworks for DNN workload analysis and accurate analytical models for fast accelerator benchmarking; (2) a novel accelerator design paradigm with high-dimensional design space support and fine-grained adjustability to overcome the existing design drawbacks; and (3) a design space exploration (DSE) engine to generate optimized accelerators by considering targeted AI workloads and available hardware resources. Results show that accelerators adopting the proposed novel paradigm can deliver up to 4.2X higher throughput (GOP/s) than the state-of-the-art pipeline design in DNNBuilder and up to 2.0X higher efficiency than the recently published generic design in HybridDNN given the same DNN model and resource budgets. With DNNExplorer's benchmarking and exploration features, we can stay ahead in building and optimizing customized AI accelerators and enable more efficient AI applications.
Energy efficiency and memory footprint of a convolutional neural network (CNN) implemented on a CNN inference accelerator depend on many factors, including a weight quantization strategy (i.e., data types and bit-widths) and mapping (i.e., placement and scheduling of DNN elementary operations on hardware units of the accelerator). We show that enabling rich mixed quantization schemes during the implementation can open a previously hidden space of mappings that utilize the hardware resources more effectively. CNNs utilizing quantized weights and activations and suitable mappings can significantly improve trade-offs among the accuracy, energy, and memory requirements compared to less carefully optimized CNN implementations. To find, analyze, and exploit these mappings, we: (i) extend a general-purpose state-of-the-art mapping tool (Timeloop) to support mixed quantization, which is not currently available; (ii) propose an efficient multi-objective optimization algorithm to find the most suitable bit-widths and mapping for each DNN layer executed on the accelerator; and (iii) conduct a detailed experimental evaluation to validate the proposed method. On two CNNs (MobileNetV1 and MobileNetV2) and two accelerators (Eyeriss and Simba) we show that for a given quality metric (such as the accuracy on ImageNet), energy savings are up to 37% without any accuracy drop.
Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs), especially convolutional neural networks (CNNs). Recently, product quantization (PQ) has been applied to these workloads, replacing MACs with memory lookups of pre-computed dot products. To better understand the efficiency tradeoffs of product-quantized DNNs (PQ-DNNs), we create a custom hardware accelerator to parallelize and accelerate nearest-neighbor search and dot-product lookups. Additionally, we perform an empirical study to investigate the efficiency--accuracy tradeoffs of different PQ parameterizations and training methods. We identify PQ configurations that improve performance-per-area for ResNet20 by up to 3.1$\times$, even when compared to a highly optimized conventional DNN accelerator, with similar improvements on two additional compact DNNs. When compared to recent PQ solutions, we outperform prior work by $4\times$ in terms of performance-per-area with a 0.6% accuracy degradation. Finally, we reduce the bitwidth of PQ operations to investigate the impact on both hardware efficiency and accuracy. With only 2-6-bit precision on three compact DNNs, we were able to maintain DNN accuracy while eliminating the need for DSPs.
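The sketch below shows how a product-quantized linear layer replaces MACs with table lookups: codeword/weight dot products are precomputed offline, and inference reduces to nearest-codeword assignment plus table sums. The shapes and random codebooks are illustrative only; a trained codebook is needed for a useful approximation.

```python
import numpy as np

def pq_linear(x, weight, codebooks):
    """Sketch of a product-quantized linear layer.
    Shapes: x (d,), weight (out, d), codebooks (n_sub, n_codes, d_sub)."""
    n_sub, n_codes, d_sub = codebooks.shape
    out_dim = weight.shape[0]
    # Offline: table[s, c, o] = <codebook[s, c], weight[o, s*d_sub:(s+1)*d_sub]>
    w_sub = weight.reshape(out_dim, n_sub, d_sub)
    table = np.einsum("scd,osd->sco", codebooks, w_sub)
    # Online: nearest-codeword assignment per subvector, then table lookups.
    x_sub = x.reshape(n_sub, d_sub)
    idx = np.argmin(((x_sub[:, None, :] - codebooks) ** 2).sum(-1), axis=1)
    return sum(table[s, idx[s]] for s in range(n_sub))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, out_dim, n_sub, n_codes = 16, 8, 4, 16
    x = rng.standard_normal(d)
    w = rng.standard_normal((out_dim, d))
    cb = rng.standard_normal((n_sub, n_codes, d // n_sub))
    print("exact:", w @ x)
    print("PQ   :", pq_linear(x, w, cb))
```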
Neural network accelerators have been widely applied to edge devices for complex tasks like object tracking, image recognition, etc. Previous works have explored quantization technologies in related lightweight accelerator designs to reduce hardware resource consumption. However, low precision leads to high accuracy loss in inference. Therefore, mixed-precision quantization becomes an alternative solution by applying different precision in different layers to trade off resource consumption and accuracy. Because regular hardware designs for multiplication cannot support precision reconfiguration for a multi-precision Quantized Neural Network (QNN) model at runtime, we propose a runtime-reconfigurable multi-precision multi-channel bitwise systolic array design for QNN accelerators. We have implemented and evaluated our work on the Ultra96 FPGA platform. Results show that our work achieves a 1.3185x to 3.5671x speedup in inferring mixed-precision models and has a shorter critical path delay, supporting a higher clock frequency (250 MHz).
Quantization is critical for efficiently deploying large language models (LLMs). Yet conventional methods remain hardware-agnostic, limited to bit-width constraints, and do not account for intrinsic circuit characteristics such as the timing behaviors and energy profiles of Multiply-Accumulate (MAC) units. This disconnect from circuit-level behavior limits the ability to exploit available timing margins and energy-saving opportunities, reducing the overall efficiency of deployment on modern accelerators. To address these limitations, we propose HALO, a versatile framework for Hardware-Aware Post-Training Quantization (PTQ). Unlike traditional methods, HALO explicitly incorporates detailed hardware characteristics, including critical-path timing and power consumption, into its quantization approach. HALO strategically selects weights with low critical-path-delays enabling higher operational frequencies and dynamic frequency scaling without disrupting the architecture's dataflow. Remarkably, HALO achieves these improvements with only a few dynamic voltage and frequency scaling (DVFS) adjustments, ensuring simplicity and practicality in deployment. Additionally, by reducing switching activity within the MAC units, HALO effectively lowers energy consumption. Evaluations on accelerators such as Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs) demonstrate that HALO significantly enhances inference efficiency, achieving average performance improvements of 270% and energy savings of 51% over baseline quantization methods, all with minimal impact on accuracy.
Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resources pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% of salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not the weights. To avoid hardware-inefficient mixed-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.
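A minimal numeric sketch of the activation-aware scaling idea: per-input-channel scales derived from calibration activation magnitudes are folded into the weights before round-to-nearest AbsMax quantization and folded back out afterwards. The single-exponent scale rule is a simplification; AWQ searches the exponent on calibration data and ships fused low-bit kernels.

```python
import numpy as np

def rtn_quantize(w, bits=4, group=128):
    """Plain round-to-nearest, per-row per-group AbsMax weight quantization."""
    qmax = 2 ** (bits - 1) - 1
    out = np.empty_like(w)
    for i in range(0, w.shape[1], group):
        blk = w[:, i:i + group]
        scale = np.abs(blk).max(axis=1, keepdims=True) / qmax
        scale[scale == 0] = 1.0
        out[:, i:i + group] = np.clip(np.round(blk / scale), -qmax, qmax) * scale
    return out

def awq_scale_and_quantize(w, act_absmean, alpha=0.5, bits=4):
    """Sketch of the AWQ idea: scale salient input channels up (guided by
    activation magnitude) before quantization, then fold the inverse scale
    back so the layer stays mathematically equivalent."""
    s = np.power(act_absmean, alpha)
    s = s / s.mean()                       # keep the overall magnitude stable
    wq = rtn_quantize(w * s[None, :], bits=bits)
    return wq / s[None, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((256, 256)).astype(np.float32)
    x = rng.standard_normal((64, 256)).astype(np.float32)
    x[:, :8] *= 20.0                                # a few outlier channels
    act = np.abs(x).mean(axis=0)                    # calibration statistic
    for name, wq in [("RTN", rtn_quantize(w, bits=3)),
                     ("AWQ-style", awq_scale_and_quantize(w, act, bits=3))]:
        err = np.mean((x @ w.T - x @ wq.T) ** 2)
        print(name, "output MSE:", round(float(err), 5))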
Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in QoQ algorithm, we introduce progressive quantization that can allow low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen1.5-72B by 2.4x on A100, 3.5x on L40S, compared to TensorRT-LLM. Remarkably, QServe on L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. Code is available at https://github.com/mit-han-lab/omniserve.
Recently, considerable efforts have been directed towards compressing Large Language Models (LLMs), which showcase groundbreaking capabilities across diverse applications but entail significant deployment costs due to their large sizes. Meanwhile, much less attention has been given to mitigating the costs associated with deploying multiple LLMs of varying sizes despite its practical significance. Thus, this paper introduces any-precision LLM, extending the concept of any-precision DNN to LLMs. Addressing challenges in any-precision LLM, we propose a lightweight method for any-precision quantization of LLMs, leveraging a post-training quantization framework, and develop a specialized software engine for its efficient serving. As a result, our solution significantly reduces the high costs of deploying multiple, different-sized LLMs by overlaying LLMs quantized to varying bit-widths, such as 3, 4, ..., $n$ bits, into a memory footprint comparable to a single $n$-bit LLM. All the supported LLMs with varying bit-widths demonstrate state-of-the-art model quality and inference throughput, proving our approach to be a compelling option for the deployment of multiple, different-sized LLMs. Our code is open-sourced and available online.
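One way to picture the overlaid memory footprint is a bit-plane view: quantize once at the parent bit-width and derive each lower-precision child by keeping only the top bits, as sketched below. The paper's actual any-precision quantization uses an incremental upscaling procedure; this shows only the overlay intuition.

```python
import numpy as np

def quantize_parent(w, bits=8):
    """Quantize once at the highest ("parent") precision: asymmetric per-row grid."""
    qmax = 2 ** bits - 1
    scale = (w.max(axis=1, keepdims=True) - w.min(axis=1, keepdims=True)) / qmax
    zero = w.min(axis=1, keepdims=True)
    q = np.clip(np.round((w - zero) / scale), 0, qmax).astype(np.uint8)
    return q, scale, zero

def child_dequantize(q, scale, zero, parent_bits=8, child_bits=4):
    """Derive a lower-precision "child" by keeping only the top child_bits of
    the parent integer codes (a right shift), so every bit-width shares one
    overlaid set of stored bits. Truncation instead of re-rounding is a
    simplification."""
    shift = parent_bits - child_bits
    q_child = (q >> shift).astype(np.float64)
    return q_child * (scale * (2 ** shift)) + zero

if __name__ == "__main__":
    w = np.random.randn(4, 16)
    q, s, z = quantize_parent(w, bits=8)
    for b in (8, 4, 3):
        wb = child_dequantize(q, s, z, parent_bits=8, child_bits=b)
        print(b, "bits, MSE:", round(float(np.mean((w - wb) ** 2)), 6))
```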
This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (not a partially binary or ternary LLM like BitNet b1.58) to match the performance of its full-precision counterparts (e.g., FP16 or BF16) in transformer-based LLMs. It achieves this by employing an autoregressive distillation (AD) loss while maintaining equivalent model dimensions (130M, 1.3B, 7B) and training data volume as regular LLM pretraining, and delivers competitive results in terms of perplexity and task-specific effectiveness. Intriguingly, by analyzing the training trajectory, we find that pretrained weights are not necessary for training binarized LLMs from scratch. This research encourages a new computational framework and may facilitate the future design of specialized hardware tailored for fully 1-bit LLMs. We make all models, code, and training datasets fully accessible and transparent to support further research (Code: https://github.com/LiqunMa/FBI-LLM. Model: https://huggingface.co/LiqunMa/).
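For reference, a minimal sketch of the 1-bit weight format such models rely on: sign weights plus a per-output-channel floating-point scale. FBI-LLM learns its scaling factors and trains with a straight-through estimator; the mean-of-absolute-values scale below is just a common analytic choice, not the paper's exact layer.

```python
import numpy as np

def binarize_linear_weight(w):
    """Sketch of a binarized linear weight: sign(W) plus a per-output-channel
    scale alpha (here the mean absolute value of each row of W, shape
    (out_features, in_features)). FBI-LLM instead learns its scales/biases."""
    alpha = np.abs(w).mean(axis=1, keepdims=True)
    return np.sign(w), alpha

if __name__ == "__main__":
    w = np.random.randn(16, 8)
    b, alpha = binarize_linear_weight(w)
    w_hat = b * alpha
    print("reconstruction MSE:", float(np.mean((w - w_hat) ** 2)))
```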
Quantizing LLM weights and activations is a standard approach for efficient deployment, but a few extreme outliers can stretch the dynamic range and amplify low-bit quantization errors. Prior transform-based mitigations (e.g., Hadamard rotations) are fixed and data-agnostic, and their optimality for quantization has remained unclear. We derive closed-form optimal linear blockwise transforms for joint weight-activation quantization under standard RTN AbsMax-scaled block quantizers, covering both integer and floating-point formats. The resulting construction, WUSH, combines a Hadamard backbone with a data-dependent second-moment component to form a non-orthogonal transform that is provably near-optimal for FP and INT quantizers under mild assumptions while admitting an efficient fused GPU implementation. Empirically, WUSH improves W4A4 accuracy over the strongest Hadamard-based baselines (e.g., on Llama-3.1-8B-Instruct in MXFP4, it gains +2.8 average points with RTN and +0.7 with GPTQ) while delivering up to 6.6$\times$ per-layer throughput over BF16 via FP4 MatMul. Source code is available at https://github.com/IST-DASLab/WUSH.
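The Hadamard backbone that WUSH starts from can be illustrated in a few lines: rotate activations and weights by an orthonormal Hadamard matrix, quantize both with RTN AbsMax, and let the rotations cancel in the matmul. The data-dependent second-moment component that makes WUSH near-optimal is omitted here, so this is only the baseline the paper improves upon.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix (n must be a power of two), orthonormal."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def absmax_rtn(t, bits=4):
    """Per-row AbsMax round-to-nearest quantization (weights or activations)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_in, d_out, batch = 128, 64, 32
    W = rng.standard_normal((d_out, d_in))
    x = rng.standard_normal((batch, d_in))
    x[:, :4] *= 30.0                                   # activation outliers
    H = hadamard(d_in)                                 # (xH)(WH)^T == xW^T exactly
    y_ref = x @ W.T
    y_plain = absmax_rtn(x) @ absmax_rtn(W).T          # quantize in the original basis
    y_rot = absmax_rtn(x @ H) @ absmax_rtn(W @ H).T    # quantize in the rotated basis
    print("no transform MSE:", np.mean((y_ref - y_plain) ** 2))
    print("Hadamard     MSE:", np.mean((y_ref - y_rot) ** 2))
```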
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Although the field has expanded and is vibrant, there hasn't been a concise framework that analyzes the various methods of LLM inference to provide a clear understanding of this domain. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model for systematic analysis of LLM inference techniques. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems, such as why LLMs are memory-bound, how much memory and computation they need, and how to choose the right hardware. We systematically collate the latest advancements in efficient LLM inference, covering crucial areas such as model compression (e.g., Knowledge Distillation and Quantization), algorithm improvements (e.g., Early Exit and Mixture-of-Experts), and both hardware and system-level enhancements. Our survey stands out by analyzing these methods with the roofline model, helping us understand their impact on memory access and computation. This distinctive approach not only showcases the current research landscape but also delivers valuable insights for practical implementation, positioning our work as an indispensable resource for researchers new to the field as well as for those seeking to deepen their understanding of efficient LLM deployment. The analysis tool, LLM-Viewer, is open-sourced.
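In the spirit of the survey's roofline analysis, the sketch below estimates whether a decode-time GEMV is memory- or compute-bound and how weight-only quantization shifts it; the hardware peak numbers are rough, illustrative values, not measurements from the paper.

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Tiny roofline sketch: a kernel is memory-bound when its arithmetic
    intensity (FLOPs per byte) is below peak_flops / peak_bw, and its time is
    bounded by the slower of the compute and memory roofs."""
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bw
    t = max(flops / peak_flops, bytes_moved / peak_bw)
    return t, ("memory-bound" if intensity < ridge else "compute-bound")

if __name__ == "__main__":
    # Single-token decode GEMV over one 4096x4096 weight matrix.
    d = 4096
    flops = 2 * d * d                     # multiply-accumulate count
    bytes_fp16 = 2 * d * d                # weight traffic dominates at batch 1
    peak_flops, peak_bw = 300e12, 2e12    # illustrative accelerator peaks
    for name, nbytes in [("FP16 weights", bytes_fp16),
                         ("INT4 weights", bytes_fp16 // 4)]:
        t, bound = roofline_time(flops, nbytes, peak_flops, peak_bw)
        print(f"{name}: {t * 1e6:.2f} us, {bound}")
```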
Transformer-based large language models have achieved tremendous success. However, the significant memory and computational costs incurred during inference make it challenging to deploy large models on resource-constrained devices. In this paper, we investigate compression and efficient inference methods for large language models from an algorithmic perspective. Regarding taxonomy, as with smaller models, compression and acceleration algorithms for large language models can still be categorized into quantization, pruning, distillation, compact architecture design, and dynamic networks. However, large language models have two prominent characteristics compared to smaller models: (1) Most compression algorithms require fine-tuning or even retraining the model after compression, and the most notable aspect of large models is the very high cost associated with model fine-tuning or training; therefore, many algorithms for large models, such as quantization and pruning, have started to explore tuning-free approaches. (2) Large models emphasize versatility and generalization rather than performance on a single task; hence, many algorithms, such as knowledge distillation, focus on preserving versatility and generalization after compression. Since these two characteristics were not very pronounced in early large models, we further distinguish large language models into medium models and "real" large models. Additionally, we provide an introduction to some mature frameworks for efficient inference of large models, which can support basic compression or acceleration algorithms, greatly facilitating model deployment for users.
Recent work has demonstrated the great potential of neural network-inspired analog-to-digital converters (NNADCs) in many emerging applications. These NNADCs often rely on resistive random-access memory (RRAM) devices to realize basic NN operations, and usually need high-precision RRAM (6–12 b) to achieve moderate quantization resolutions (4–8 b). Such an optimistic assumption of RRAM precision, however, is not well supported by practical RRAM arrays in the large-scale production process. In this article, we evaluate two new designs of NNADCs with low-precision RRAM devices. They take advantage of a traditional two-stage/pipelined hardware architecture and a custom deep-learning-based building block design methodology. Results obtained from SPICE simulations demonstrate a robust design of an 8-b subranging NNADC using 4-b RRAM devices, as well as a 14-b pipelined NNADC using 3-b RRAM devices. The evaluations of the two NNADCs suggest that the pipelined architecture is better suited to achieving higher resolution with lower-precision RRAM. We also perform design space exploration on the building blocks of NNADCs to achieve a balanced performance tradeoff. Comprehensive comparisons reveal improved power and speed performance and competitive figures of merit (FoMs) for the pipelined NNADC compared with state-of-the-art NNADCs and traditional ADCs. In addition, the proposed pipelined NNADC can support reconfigurable high-resolution nonlinear quantization with high conversion speed and low conversion energy, enabling intelligent analog-to-information interfaces for near-sensor processing.
Recent works propose neural network- (NN-) inspired analog-to-digital converters (NNADCs) and demonstrate their great potential in many emerging applications. These NNADCs often rely on resistive random-access memory (RRAM) devices to realize the NN operations and require high-precision RRAM cells (6∼12-bit) to achieve a moderate quantization resolution (4∼8-bit). Such an optimistic assumption of RRAM resolution, however, is not supported by fabrication data of RRAM arrays in large-scale production processes. In this paper, we propose an NN-inspired super-resolution ADC based on low-precision RRAM devices by taking advantage of a co-design methodology that combines a pipelined hardware architecture with a custom NN training framework. Results obtained from SPICE simulations demonstrate that our method leads to a robust design of a 14-bit super-resolution ADC using 3-bit RRAM devices with improved power and speed performance and competitive figures of merit (FoMs). In addition to linear uniform quantization, the proposed ADC can also support configurable high-resolution nonlinear quantization with high conversion speed and low conversion energy, enabling future intelligent analog-to-information interfaces for near-sensor analytics and processing.
In order to shorten the ship design optimization cycle and address the time-consuming nature of computational fluid dynamics (CFD) numerical calculation, this paper proposes a multi-precision back-propagation neural network (MP-BP) approximation technique. Fewer high-precision ship samples and more low-precision ship samples are used to construct an approximate model, and a back-propagation (BP) neural network is trained on the multi-precision samples so that the approximate model stays as close as possible to the real model, achieving the effect of a high-precision approximation model. Numerical verification and typical hull-form verification are then given. Based on CFD and Rankine theory, a multi-objective design optimization framework for overall ship navigation performance is constructed. The multi-objective approximation model of the KCS ship is built with the MP-BP approximation technique and optimized by a particle swarm optimization (PSO) algorithm. The results show that the multi-objective optimization design framework using the MP-BP approximation model can capture the global optimal solution and improve the efficiency of the entire hull-form design optimization, providing technical support for green ships and low-carbon shipping.
One of the most critical concerns in power system reliability is the timely and accurate detection of transmission line faults. Accurate detection and localisation of these faults are therefore necessary to avert system collapse. This paper focuses on using Artificial Neural Networks for fault detection and localisation to attain accuracy, precision, and speed of execution. A 330 kV, 500 km three-phase transmission line was modelled to extract faulty current and voltage data from the line. The Artificial Neural Network technique was used to train on this data, attaining an accuracy of 100% for fault detection and about 99.5% for fault localisation at different distances, with a detection time of 0.0017 μs and an average error of 0%–0.5%. This model outperforms Support Vector Machine and Principal Component Analysis approaches in fault detection time. The proposed model serves as the basis for a transmission line fault protection and management system.
This report synthesizes full-stack research findings in AI Infra and quantization. The core trends are: 1) at the algorithm level, quantization techniques tailored to LLM outliers and the KV Cache have become mainstream; 2) at the hardware level, hardware-software co-design is shifting from traditional FPGA/ASIC toward more efficient compute-in-memory (CIM) and mixed-precision architectures; 3) at the architecture level, quantization research has expanded from Transformers to emerging areas such as Mamba, SNNs, and diffusion models; 4) at the engineering level, automated toolchains and hardware-aware quantization search (NAS) are accelerating the industrial deployment of quantized models on edge and mobile devices. Overall, the research shows a deep evolution from single-precision compression toward system-level energy-efficiency optimization.