Autonomous Driving Combined with Compute-in-Memory
Autonomous-driving system constraints and end-to-end acceleration architectures (bottleneck identification and heterogeneous compute acceleration)
These papers focus on end-to-end system architecture design for autonomous driving (and ADAS): identifying key computational bottlenecks under real-time, safety, and predictability constraints; discussing implementation paths through GPU/FPGA/ASIC accelerators and heterogeneous computing platforms; and stressing the impact of the energy-efficiency metric (TOPS per watt) and the memory wall / data-movement problem on in-vehicle deployment.
- The Architectural Implications of Autonomous Driving: Constraints and Acceleration(Shi-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md E. Haque, Lingjia Tang, Jason Mars, 2018, Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems)
- Accelerating Automated Driving and ADAS Using HW/SW Codesign(Shubham Rai, Cecilia De la Parra, Martin Rapp, Jan Micha Borrmann, Nina Bretz, Stefan Metzlaff, T. Soliman, Christoph Schorn, 2024, 2024 IEEE 37th International System-on-Chip Conference (SOCC))
- Driving into the memory wall: the role of memory for advanced driver assistance systems and autonomous driving(Matthias Jung, S. Mckee, C. Sudarshan, Christoph Dropmann, C. Weis, N. Wehn, 2018, Proceedings of the International Symposium on Memory Systems)
- ZuSE Ki-Avf: Application-Specific AI Processor for Intelligent Sensor Signal Processing in Autonomous Driving(Gia Bao Thieu, Sven Gesper, G. Payá-Vayá, C. Riggers, Oliver Renke, Till Fiedler, Jakob Marten, Tobias Stuckenberg, Holger Blume, C. Weis, Lukas Steiner, C. Sudarshan, N. Wehn, Lennart M. Reimann, R. Leupers, Michael Beyer, D. Köhler, Alisa Jauch, Jan Micha Borrmann, Setareh Jaberansari, T. Berthold, Meinolf Blawat, Markus Kock, Gregor Schewior, Jens Benndorf, Frederik Kautz, Hans-Martin Bluethgen, C. Sauer, 2023, 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE))
Overall approaches and architectures for compute-in-memory (PIM/AiM, HW/SW co-design, and functional safety)
What these works share is a macro-level survey of compute-in-memory (PIM/AiM) as an architectural answer to the data-movement bottleneck and energy-efficiency problem: the architectural trend of ReRAM serving a dual compute/storage role, key challenges such as device non-idealities and the EDA toolchain, and HW/SW co-design and deployment perspectives (e.g., in-vehicle inference, reliability, and safety).
- Holistic approaches to memory solutions for the Autonomous Driving Era(Daeyong Shim, Chunseok Jeong, Euncheol Lee, Junmo Kang, S. Yoon, Yongkee Kwon, Il Park, Hyun Ahn, Seonyong Cha, Jinkook Kim, 2022, 2022 IEEE International Symposium on Circuits and Systems (ISCAS))
- Resistive-RAM-Based In-Memory Computing for Neural Network: A Review(Weijian Chen, Zhi Qi, Zahid Akhtar, Kamran Siddique, 2022, Electronics)
- Three Challenges in ReRAM-Based Process-In-Memory for Neural Network(Ziyi Yang, Kehan Liu, Yiru Duan, Mingjia Fan, Qiyue Zhang, Zhou Jin, 2023, 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS))
- A Survey of ReRAM-Based Architectures for Processing-In-Memory and Neural Networks(Sparsh Mittal, 2018, Machine Learning and Knowledge Extraction)
- A Survey of Memristor-Based Compute-in-Memory Accelerators (周恒, 刘锦鹏, 冯丹, 童薇)
In-memory computing for in-vehicle DNN inference: ReRAM crossbars and activation/weight sparsity optimization
These papers center on how in-memory computing actually improves computational efficiency: they work through the implementation details of parallel MAC/arithmetic units built on ReRAM crossbars; they use bit-level sparsity exploitation, low precision and quantization, and dedicated circuit design to cut A/D conversion overhead and raise throughput and energy efficiency; and they present in-memory acceleration approaches for realistic model workloads (including LLM-oriented accelerator architectures).
- ERA-BS: Boosting the Efficiency of ReRAM-Based PIM Accelerator With Fine-Grained Bit-Level Sparsity(Fangxin Liu, Wenbo Zhao, Zongwu Wang, Yongbiao Chen, Xiaoyao Liang, Li Jiang, 2024, IEEE Transactions on Computers)
- FPCAS: In-Memory Floating Point Computations for Autonomous Systems(Sina Sayyah Ensan, Swaroop Ghosh, 2019, 2019 International Joint Conference on Neural Networks (IJCNN))
- RoboPIM: A ReRAM-based Accelerator for LLM-based Robotics Applications via Dynamic Task Slicing(Wenjing Xiao, Jianyu Wang, Dan Chen, Huize Li, Mohsen Guizani, Min Chen, T. Wu, 2026, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
High-bandwidth near-pixel / optical in-memory neural networks: reducing redundancy, energy, and latency
The shared goal is to reduce data movement and redundancy in in-vehicle vision and other perception pipelines. The first work introduces near-pixel computing at the sensor (with 3D-stacked CIS and embedded-DRAM buffering) to perform temporal frame filtering and suppress inter-frame redundancy; the second proposes a free-space optical in-memory neural network that achieves ultra-high throughput with ultra-low energy and latency through highly parallel optical weight/input modulation.
- Temporal Frame Filtering for Autonomous Driving Using 3D-Stacked Global Shutter CIS With IWO Buffer Memory and Near-Pixel Compute(Janak Sharda, Wantong Li, Qiucheng Wu, Shiyu Chang, Shimeng Yu, 2023, IEEE Transactions on Circuits and Systems I: Regular Papers)
- High-clockrate free-space optical in-memory computing(Yuanhao Liang, James Wang, Kaiwen Xue, Xinyi Ren, Ran Yin, Shaoyuan Ou, Lian Zhou, Yuan Li, T. Heuser, N. Heermeier, Ian Christen, James A. Lott, S. Reitzenstein, Mengjie Yu, Zaijun Chen, 2025, Light: Science & Applications)
Hardware acceleration of key autonomous-driving subtasks (QP solving for path planning, sparse computation, and dataflow)
This group targets a concrete algorithmic core of autonomous driving: QP solving and linear-system solving for path planning, implemented on FPGAs. Building on an ADMM framework, PCG, and modular hardware such as customized sparse-matrix storage and sparse matrix-matrix/matrix-vector multiplication units, the work pipelines dataflow at both the operator and system level to raise end-to-end throughput while lowering resource use and energy consumption.
- A Sparsity-Aware Autonomous Path Planning Accelerator with HW/SW Co-Design and Multi-Level Dataflow Optimization(Yifan Zhang, Xiaoyu Niu, Hongzheng Tian, Yanjun Zhang, Bo Yu, Shaoshan Liu, Sitao Huang, 2025, ACM Transactions on Architecture and Code Optimization)
Memristive/nanoelectronic associative memory and parallel search for in-vehicle decision making
The common thread is grounding compute-in-memory in the parallel matching/pattern search that decision making requires: associative memories built from memristors/nanodevices and hybrid SRAM-memristor architectures (as TCAM replacements), evaluated for real-time decision tasks along dimensions such as power, search latency, area, and robustness.
- Design and Implementation of Nanoelectronics-Based Advanced Associative Memory Architecture for Autonomous Vehicles(D. N. Nithilam, B. Paulchamy, 2025, Journal of Nanoelectronics and Optoelectronics)
CIM-friendly arithmetic/quantization and hardware co-design (robust edge DNN quantization, in-memory execution)
This group addresses two keys to deployment: the choice of computational representation and making compute-in-memory practical. Topics span novel DNN arithmetic and activation functions, quantization trade-offs and edge robustness, and HW/SW co-design (algorithmic transformations plus hardware-aware quantization, multi-core designs, partitioned memory, and voltage scaling). A shared finding is that combining aggressive quantization with in-memory execution yields substantial energy and latency gains, closing the problem-solution loop with the in-vehicle data-movement bottleneck (the memory wall).
- A Hardware/Software Co-Design Vision for Deep Learning at the Edge(Flavio Ponzina, Simone Machetti, M. Rios, B. Denkinger, A. Levisse, G. Ansaloni, Miguel Peón-Quirós, David Atienza Alonso, 2022, IEEE Micro)
- Novel Arithmetics in Deep Neural Networks Signal Processing for Autonomous Driving: Challenges and Opportunities(M. Cococcioni, Federico Rossi, E. Ruffaldi, S. Saponara, Benoît Dupont de Dinechin, 2021, IEEE Signal Processing Magazine)
- Driving into the memory wall: the role of memory for advanced driver assistance systems and autonomous driving(Matthias Jung, S. Mckee, C. Sudarshan, Christoph Dropmann, C. Weis, N. Wehn, 2018, Proceedings of the International Symposium on Memory Systems)
Neuromorphic / sensing-memory-computing integrated vision sensing and preprocessing (temporal redundancy and perception efficiency)
The common point is moving compute-in-memory forward into vision sensing and preprocessing: neuromorphic vision sensors integrate sensing, storage, and information preprocessing in a single circuit, reducing the transfer latency and energy cost caused by the traditional sensor-processor separation and improving information-processing efficiency.
- Novel Neuromorphic Vision Sensors with Integrated Sensing, Memory, and Computing (廖付友, 柴扬, 2021, 物理)
Around "autonomous driving + compute-in-memory", these references form a complete technical chain: from end-to-end system constraints (real time/safety/predictability, energy efficiency, and the memory wall), through device- and architecture-level implementations (ReRAM PIM, in-memory floating point and sparsity optimization, near-pixel CIS and free-space optics, associative-memory parallel search, neuromorphic front-end computing), to HW/SW co-design (quantization, arithmetic choice, HW/SW codesign, and deployment challenges). The overall research direction: use compute-in-memory to cut data movement and A/D conversion cost, improving the throughput, energy efficiency, and tail latency of key tasks such as inference and planning while meeting in-vehicle real-time and reliability constraints.
A total of 22 related references.
A conventional digital image-processing system comprises an image sensor and an image-processing unit that are physically separate, and transferring image data between them introduces latency and energy cost. Moreover, because digital image sensors operate frame by frame, they can lose important information or produce redundant data. The human visual system offers an efficient, parallel way of processing information. Neuromorphic vision sensors can mimic the function of the human retina, simultaneously sensing optical signals, storing them, and preprocessing the information. Such sensing-memory-computing integrated neuromorphic vision sensors simplify the circuitry of artificial vision systems, improve information-processing efficiency, and save system power. The article summarizes the problems of conventional digital image sensors, introduces several important artificial neural networks, and discusses the research progress and open problems of novel neuromorphic vision sensors.
To keep pace with the rapid development of artificial intelligence (AI) algorithms, the demand for computing resources is growing exponentially, posing a major challenge for the hardware deployment of AI models. Memristor-based compute-in-memory accelerators offer a promising solution to the energy-efficiency and latency problems of deploying large AI models: computation is performed directly in the memory cells that store the data. This drastically reduces the frequent data movement between processing and memory units in the von Neumann architecture, greatly lowering time and energy overheads. Research in this field has advanced quickly in recent years, with memristor technology making the key transition from proof of concept toward commercial products; existing prototype systems can already accelerate AI model inference in a variety of application scenarios. The paper systematically reviews memristive devices and crossbar arrays, system architectures, software tools, typical applications, and development trends, and identifies the key technical problems that remain to be solved.
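To make the in-cell computing idea concrete, here is a minimal numerical sketch (all device values are hypothetical) of the analog crossbar operation such accelerators build on: input activations drive the rows as voltages, weights are programmed as cell conductances, and by Ohm's and Kirchhoff's laws each column current is a complete dot product obtained in a single step, with no weight movement.

```python
import numpy as np

# Hypothetical 4x3 crossbar: G[i, j] is the programmed conductance (siemens)
# of the cell at row i, column j, each encoding one weight.
G = np.array([[1.0e-6, 2.0e-6, 0.5e-6],
              [3.0e-6, 1.5e-6, 2.5e-6],
              [0.5e-6, 0.5e-6, 1.0e-6],
              [2.0e-6, 1.0e-6, 3.0e-6]])

# Input activations applied as row voltages (volts).
V = np.array([0.2, 0.1, 0.3, 0.05])

# Kirchhoff's current law: each column current I_j = sum_i V_i * G[i, j],
# i.e. a full dot product per column in one analog step instead of O(rows)
# sequential digital MACs, and with no weight read-out at all.
I = V @ G
print(I)  # column currents in amperes
```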
Abstract: Technology push and demand pull determine which industries the future selects, and the emergence of compute-in-memory bears this out. In recent years the once-lofty term "large model" has entered ordinary households, and people have come to rely on machines' efficient and comprehensive way of thinking. Large AI models depend on chips with massive compute, which places corresponding demands on the chip: it must better support parallel processing while keeping data flowing efficiently. Many programs and policies at home and abroad have been launched to support research on this technology; academia and industry have picked up a concept proposed half a century ago and carried out extensive research across architecture, process, and integration, exploring next-generation chip technology for the post-Moore era. The compute-in-memory market keeps expanding and appears to be on the eve of taking off; as things stand, bringing the technology to production is the key problem for compute-in-memory. Drawing on experience in both academia and industry, the author discusses a key scientific problem of compute-in-memory: process integration technology, in the hope that peers of all stripes will benefit and that deeper discussion will follow.
Autonomous driving is disrupting conventional automotive development. Underlying reasons include control unit consolidation, the use of components originally developed for the consumer market, and the large amount of data that must be processed. For instance, Audi's zFAS or NVIDIA's Xavier platform integrate GPUs, custom accelerators, and CPUs within a single domain controller to perform sensor fusion, processing, and decision making. The communication between these heterogeneous components and the algorithms for Advanced Driver Assistance Systems and Autonomous Driving require low latency and huge memory bandwidth, bringing the Memory Wall from high-performance computing in data centers directly to our cars. In this paper we discuss these and other requirements in using DRAM for near-term autonomous driving architectures.
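To make the bandwidth pressure concrete, a back-of-envelope sketch of raw sensor traffic follows; every figure in it (camera count, resolution, LiDAR rate) is an illustrative assumption rather than a number from the paper.

```python
# Back-of-envelope estimate of raw sensor input bandwidth for one vehicle.
# All figures are illustrative assumptions, not values from the paper.
cameras = 6
width, height, fps = 1920, 1080, 30
bytes_per_pixel = 2                                   # e.g. a padded RAW12 format
camera_bw = cameras * width * height * bytes_per_pixel * fps   # bytes/s

lidar_points_per_s = 1_300_000                        # a typical spinning-lidar rate
lidar_bytes_per_point = 16                            # x, y, z, intensity as float32
lidar_bw = lidar_points_per_s * lidar_bytes_per_point

total_gb_s = (camera_bw + lidar_bw) / 1e9
print(f"raw sensor input: ~{total_gb_s:.2f} GB/s before any DNN traffic")
```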
As DNNs push state-of-the-art accuracy on many artificial intelligence (AI) applications, such as computer-vision processing for autonomous driving, the data bandwidth and power consumed between the neural-network accelerator and off-chip memory have become major obstacles to improving the TOPS/watt compute-performance metric. To overcome the limited compute and energy resources of the automotive environment, inference with PIM (processing in memory) or AiM (accelerator in memory), which deploys MAC (multiply-and-accumulate) units and activation functions inside the DRAM, is a key solution exploiting multi-bank parallelism and the memory-cell architecture. As memory technology with analog logic inside matures in the near future, ultra-low-power analog-accelerator-based neuromorphic computing architectures will lead future autonomous-driving solutions.
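A toy behavioral model of the bank-parallel MAC idea may help: the weight vector is striped across banks, each bank's MAC unit reduces its own slice locally, and only the small partial sums cross the memory interface. Bank count and sizes below are assumptions.

```python
import numpy as np

# Toy model of bank-parallel MAC in a PIM/AiM DRAM. The weight vector is
# striped across banks; each bank's MAC unit reduces its own slice locally,
# so only one partial sum per bank (not the whole vector) leaves the die.
BANKS = 16
FEATURES = 1024
rng = np.random.default_rng(0)
W = rng.standard_normal(FEATURES)   # weights of one output neuron, striped
x = rng.standard_normal(FEATURES)   # input activations, broadcast to banks

bank_slices = np.array_split(np.arange(FEATURES), BANKS)
partials = [W[idx] @ x[idx] for idx in bank_slices]  # conceptually parallel
y = sum(partials)                                    # cheap on-die reduction

assert np.isclose(y, W @ x)         # same result as the monolithic dot product
```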
This article focuses on the trends, opportunities, and challenges of novel arithmetic for deep neural network (DNN) signal processing, with particular reference to assisted- and autonomous-driving applications. Due to strict constraints on the latency, dependability, and security of autonomous driving, machine perception (i.e., detection and decision tasks) based on DNNs cannot be implemented by relying on remote cloud access. These tasks must be performed in real time in embedded systems on board the vehicle, particularly for the inference phase (considering the use of DNNs pretrained during an offline step). When developing a DNN computing platform, the choice of the computing arithmetic matters. Moreover, functionally safe applications, such as autonomous driving, impose severe constraints on the effect that signal-processing accuracy has on the final rate of wrong detections/decisions. Hence, after reviewing the different choices and tradeoffs concerning arithmetic, both in academia and industry, we highlight the issues in implementing DNN accelerators to achieve accurate and low-complexity processing of automotive sensor signals (the latter coming from diverse sources, such as cameras, radar, lidar, and ultrasonics). The focus is on both general-purpose operations massively used in DNNs, such as multiplying, accumulating, and comparing, and on specific functions, including, for example, the sigmoid or hyperbolic tangent used for neuron activation.
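On the activation-function side, a standard low-complexity option in this design space is a piecewise-linear sigmoid whose slope is a power of two, so it reduces to a shift and an add in hardware. The sketch below uses a hypothetical three-segment variant and measures its worst-case deviation from the exact sigmoid.

```python
import numpy as np

def sigmoid(x):
    """Exact sigmoid, the reference implementation."""
    return 1.0 / (1.0 + np.exp(-x))

def hard_sigmoid(x):
    """Three-segment piecewise-linear approximation. The slope 1/4 is a
    power of two, so the multiply becomes a 2-bit right shift in hardware."""
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

x = np.linspace(-8.0, 8.0, 2001)
err = np.max(np.abs(sigmoid(x) - hard_sigmoid(x)))
print(f"worst-case approximation error: {err:.3f}")  # ~0.12 for this variant
```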
Autonomous driving systems have attracted a significant amount of interest recently, and many industry leaders, such as Google, Uber, Tesla, and Mobileye, have invested a large amount of capital and engineering power in developing such systems. Building autonomous driving systems is particularly challenging due to stringent performance requirements in terms of both making safe operational decisions and finishing processing in real time. Despite recent advancements in technology, such systems are still largely under experimentation, and architecting end-to-end autonomous driving systems remains an open research question. To investigate this question, we first present and formalize the design constraints for building an autonomous driving system in terms of performance, predictability, storage, thermal, and power. We then build an end-to-end autonomous driving system using state-of-the-art award-winning algorithms to understand the design trade-offs for building such systems. In our real-system characterization, we identify three computational bottlenecks which conventional multicore CPUs are incapable of processing under the identified design constraints. To meet these constraints, we accelerate these algorithms using three accelerator platforms, including GPUs, FPGAs, and ASICs, which can reduce the tail latency of the system by 169x, 10x, and 93x, respectively. With accelerator-based designs, we are able to build an end-to-end autonomous driving system that meets all the design constraints, and explore the trade-offs among performance, power, and the higher accuracy enabled by higher-resolution cameras.
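The predictability constraint can be sketched in a few lines: what gets checked against the latency budget is a high percentile of per-frame latency, not the mean. The sample distribution and the 100 ms budget below are made up for illustration.

```python
import numpy as np

# Fake per-frame latency measurements (ms); a heavy-tailed distribution is
# typical of real pipelines, which is why the mean alone is misleading.
rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=3.2, sigma=0.35, size=100_000)

budget_ms = 100.0                       # hypothetical end-to-end budget
p999 = np.percentile(latencies_ms, 99.9)
verdict = "meets" if p999 <= budget_ms else "violates"
print(f"mean = {latencies_ms.mean():.1f} ms, p99.9 = {p999:.1f} ms "
      f"-> {verdict} the {budget_ms:.0f} ms budget")
```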
Modern and future AI-based automotive applications, such as autonomous driving, require the efficient real-time processing of huge amounts of data from different sensors, like camera, radar, and LiDAR. In the ZuSE-KI-AVF project, multiple university and industry partners collaborate to develop a novel massively parallel processor architecture, based on a customized RISC-V host processor and an efficient high-performance vertical vector coprocessor. In addition, a software development framework is provided to efficiently program AI-based sensor-processing applications. The proposed processor system was verified and evaluated on a state-of-the-art UltraScale+ FPGA board, reaching a processing performance of up to 126.9 FPS while executing the YOLO-LITE CNN on 224×224 input images. Further optimizations of the FPGA design and the realization of the processor system in a 22nm FDSOI CMOS technology are planned.
With the advancement of deep learning to solve autonomous driving problems, the computation and memory requirements have been growing rapidly. Near-pixel compute-based CMOS image sensors (CIS) have been investigated as a potential candidate to perform the initial computations of workloads close to the pixel and reduce data movement. In this work, we design a near-pixel compute CIS capable of implementing a temporal frame filtering network, which rejects redundant image frames, targeting autonomous driving applications. To improve performance and avoid image distortion, a 3D-stacked global shutter CIS is proposed. This architecture integrates photodiodes with memory and compute units using Cu-Cu hybrid bonding. We propose to use back-end-of-line (BEOL) compatible Tungsten-doped Indium Oxide transistor (IWO FET) based embedded DRAM as buffer memory to achieve refresh-free storage and high-bandwidth connections between the various components. The near-pixel compute circuit is optimized by including a sparsity-aware adder tree and using NOR gates as data buffers. The two-tier system comprises photodiodes on tier 1 in a 40 nm node, and near-pixel compute and buffer memory on tier 2 in a 22 nm node. We perform simulations in Cadence, obtaining an energy efficiency of 65 TOPS/W and a compute density of 1.04 TOPS/mm2 for 8×8b MAC, with a total latency of 1.15 ms/frame.
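A functional sketch of the frame-filtering idea: forward a frame only if it differs enough from the last kept one. The mean-absolute-difference rule and the threshold below are simplifying assumptions standing in for the paper's learned filtering network.

```python
import numpy as np

def keep_frame(prev, curr, thresh=8.0):
    """Keep `curr` only if its mean absolute difference from the last kept
    frame exceeds `thresh` grey levels; otherwise it is redundant."""
    mad = np.mean(np.abs(curr.astype(np.int16) - prev.astype(np.int16)))
    return mad > thresh

rng = np.random.default_rng(1)
last_kept = rng.integers(0, 256, (224, 224), dtype=np.uint8)
kept = 0
for _ in range(10):
    # Simulate a nearly static scene: small per-pixel jitter only.
    frame = np.clip(last_kept.astype(np.int16)
                    + rng.integers(-3, 4, last_kept.shape), 0, 255).astype(np.uint8)
    if keep_frame(last_kept, frame):
        last_kept = frame           # forward downstream
        kept += 1                   # otherwise: dropped as redundant
print(f"kept {kept} of 10 frames")
```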
Autonomous systems, e.g., cars and drones, generate vast amounts of sensor data that need to be processed in a timely fashion to make accurate and safe decisions. The majority of these computations deal with floating-point (FP) numbers. The conventional von Neumann computing paradigm suffers from overheads associated with data transfer. In-memory computing (IMC) can solve this challenge by processing the data locally. However, in-memory FP computing has not been investigated before. We propose FP arithmetic (adder/subtractor and multiplier) using Resistive RAM (ReRAM) crossbar-based IMC. A novel shift circuitry is proposed to lower the shift overhead inherently present in FP arithmetic. The proposed single-precision FP adder consumes 335 pJ and 322 pJ for NAND-NAND and NOR-NOR based implementations of addition/subtraction, respectively. The proposed adder/subtractor improves latency, power, and energy by 828X, 3.2X, and 3.7X, respectively, compared to MAGIC [1]. Furthermore, the proposed multiplier reduces energy per operation by 1.13X and improves performance by 4.4X compared to ReVAMP [2].
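For intuition about why the shift circuitry matters, here is a step-by-step sketch of the stages an FP adder performs (compare exponents, shift-align mantissas, integer-add, renormalize), simplified to positive operands with no rounding or special cases; the alignment shift is exactly the overhead the proposed circuitry attacks.

```python
def fp_add(e1, m1, e2, m2, mant_bits=23):
    """Add two positive floats given as (exponent, mantissa) pairs, where the
    mantissa is an integer with the implicit leading 1 already attached.
    Simplified: no signs, no rounding, no NaN/inf/subnormal handling."""
    if e1 < e2:                      # operate with the larger exponent first
        e1, m1, e2, m2 = e2, m2, e1, m1
    m2 >>= (e1 - e2)                 # alignment shift: the costly step in IMC
    m = m1 + m2                      # plain integer addition
    e = e1
    if m >> (mant_bits + 1):         # carry-out: renormalize by one position
        m >>= 1
        e += 1
    return e, m

# 1.5 * 2^1 + 1.0 * 2^0 = 4.0  ->  exponent 2, mantissa 1.0
e, m = fp_add(1, 3 << 22, 0, 1 << 23)
print(e, m / (1 << 23))              # 2 1.0
```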
The ability to process and act on data in real time is increasingly critical for applications ranging from autonomous vehicles to three-dimensional environmental sensing and remote robotics. However, the deployment of deep neural networks (DNNs) in edge devices is hindered by the lack of energy-efficient, scalable computing hardware. Here, we introduce a fanout spatial time-of-flight optical neural network (FAST-ONN) that calculates billions of convolutions per second with ultralow latency and power consumption. This is enabled by combining high-speed dense arrays of vertical-cavity surface-emitting lasers (VCSELs) for input modulation with high-pixel-count spatial light modulators for in-memory weighting. In a three-dimensional optical system, parallel differential readout allows signed weight values for accurate inference in a single shot. The performance is benchmarked with feature extraction in You-Only-Look-Once (YOLO) for convolution at 100 million frames per second (MFPS), and with in-system backward-propagation training with photonic reprogrammability. The VCSEL transmitters can be implemented in any free-space optical computing system to raise the clock rate to over a gigahertz, where high scalability in device counts and channel parallelism opens a new avenue to scale up free-space computing hardware. We demonstrated high-speed VCSEL in-memory neural networks that deliver billions of optical convolutions per second for massively parallel edge intelligence at ultralow energy and latency.
The increasing demand for high-speed and energy-efficient memory solutions in autonomous vehicles has led to the development of advanced memory architectures. This paper presents the design and implementation of a Nanoelectronics-Based Advanced Associative Memory Architecture (NAAMA) as an alternative to Ternary Content Addressable Memory (TCAM) for real-time decision-making in autonomous vehicles. The proposed memory system enhances pattern matching efficiency while reducing power consumption and latency. The architecture leverages nanoelectronic devices, including memristors and FinFET-based memory cells, to improve performance and scalability. A hybrid SRAM and memristor-based associative memory design is implemented to optimize storage efficiency and search operations. The proposed model is evaluated in terms of power consumption, search speed, area efficiency, and scalability using Cadence Virtuoso and HSPICE simulations at a 7 nm technology node. Experimental results show that the proposed NAAMA achieves a 36% reduction in power consumption, a 28% improvement in search latency, and a 42% increase in area efficiency compared to conventional TCAM implementations. The integration of nanoelectronic components significantly enhances the system's ability to perform high-speed parallel searches, making it a viable solution for real-time applications in autonomous vehicle decision-making systems. The study also demonstrates the robustness of the proposed architecture under varying traffic and environmental conditions, ensuring reliable and accurate decision-making for autonomous navigation. The findings suggest that nanoelectronics-based associative memory architectures can offer substantial advantages in energy efficiency, computational speed, and integration density, paving the way for future innovations in autonomous vehicle computing and intelligent transportation systems.
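A behavioral model of the TCAM-style associative search this architecture replaces: every stored entry is compared against the search key in parallel, with "don't care" positions masked out. The stored patterns and key below are made up.

```python
import numpy as np

# Stored entries (rows) with per-bit care masks: care == 0 marks a
# "don't care" position, the ternary feature that sets TCAM apart.
stored = np.array([[1, 0, 1, 1],
                   [0, 1, 1, 0],
                   [1, 1, 0, 0]], dtype=np.uint8)
care   = np.array([[1, 1, 1, 0],
                   [1, 1, 1, 1],
                   [1, 0, 1, 1]], dtype=np.uint8)
key    = np.array([1, 0, 1, 0], dtype=np.uint8)

# One vectorized step stands in for the row-parallel match lines of the
# memory array: a row hits when every cared-about bit equals the key.
hits = np.all((stored == key) | (care == 0), axis=1)
print(np.nonzero(hits)[0])   # -> [0]: entry 0 matches under its wildcard
```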
With the growing demands of increasing levels of automation in driving capabilities, AI workloads are being increasingly deployed. However, AI-based solutions are compute-intensive. Hence, to achieve market readiness, practical automated driving solutions need to deliver adequate efficiency, measured as compute performance (TOPS) per given power budget. For example, this ratio directly impacts the range of electric vehicles (EVs) and plays a crucial role in deciding the foreseeable future of the automotive sector. Emerging technology such as in-memory computing can be a viable enabler of this next level of AI acceleration, paving the route towards actual product realizations. In-memory computing is a disruptive technology that shifts the computing paradigm from classical digital computing with frequent data transfers into the analog and static data domain, at up to 1000 TOPS. However, to truly tap the potential of in-memory computing, it requires efficient integration into the HW/SW stack. This paper presents a holistic HW/SW-codesign approach covering the entire stack, from neural architecture search (NAS) to generate efficient networks, through optimization of the network with compression techniques, to deployment strategies on the HW, along with a discussion of functional safety. It is demonstrated that with joint optimization, a 150% improvement in FPS (frames per second) over the baseline is reached.
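The figure of merit the paper optimizes can be written out in a few lines; the operation count, power budget, and sustained throughput below are invented for illustration.

```python
# Compute-per-watt and the FPS it buys for a given workload.
# All numbers are illustrative assumptions, not values from the paper.
ops_per_inference = 4.5e9     # total ops for one pass of a detection network
power_w = 15.0                # platform power budget (watts)
sustained_tops = 30.0         # delivered (not peak) tera-ops per second

tops_per_watt = sustained_tops / power_w
fps = sustained_tops * 1e12 / ops_per_inference
print(f"{tops_per_watt:.1f} TOPS/W -> {fps:.0f} FPS for this workload")
```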
The growing popularity of edge AI requires novel solutions to support the deployment of compute-intense algorithms in embedded devices. In this article, we advocate for a holistic approach, where application-level transformations are jointly conceived with dedicated hardware platforms. We embody such a stance in a strategy that employs ensemble-based algorithmic transformations to increase robustness and accuracy in convolutional neural networks, enabling the aggressive quantization of weights and activations. Opportunities offered by algorithmic optimizations are then harnessed in domain-specific hardware solutions, such as the use of multiple ultra-low-power processing cores, the provision of shared acceleration resources, the presence of independently power-managed memory banks, and voltage scaling to ultra-low levels, greatly reducing (up to 60% in our experiments) energy requirements. Furthermore, we show that aggressive quantization schemes can be leveraged to perform efficient computations directly in memory banks, adopting in-memory computing solutions. We showcase that the combination of parallel in-memory execution and aggressive quantization leads to more than 70% energy and latency gains compared to baseline implementations.
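A minimal sketch of the aggressive symmetric quantization such schemes rely on (4-bit here; the bit-width and the max-abs scaling rule are assumptions):

```python
import numpy as np

def quantize(x, bits=4):
    """Symmetric linear quantization: map floats to signed `bits`-bit
    integers with a single per-tensor scale derived from max |x|."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize(w)                                   # q is what the memory bank holds
rmse = np.sqrt(np.mean((q * s - w) ** 2))
print(f"4-bit reconstruction RMSE: {rmse:.4f}")
```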
Path planning is a critical task for autonomous driving, aiming to generate smooth, collision-free, and feasible paths based on input perception and localization information. The planning task is both highly time-sensitive and computationally intensive, posing significant challenges to resource-constrained autonomous driving hardware. In this article, we propose an end-to-end framework for accelerating path planning on FPGA platforms. This framework focuses on accelerating quadratic programming (QP) solving, which is the core of optimization-based path planning and has the most computationally-intensive workloads. Our method leverages a hardware-friendly alternating direction method of multipliers (ADMM) to solve QP problems while employing a highly parallelizable preconditioned conjugate gradient (PCG) method for solving the associated linear systems. We analyze the sparse patterns of matrix operations in QP and design customized storage schemes along with efficient sparse matrix multiplication and sparse matrix-vector multiplication units. Our customized design significantly reduces resource consumption for data storage and computation while dramatically speeding up matrix operations. Additionally, we propose a multi-level dataflow optimization strategy. Within individual operators, we achieve acceleration through parallelization and pipelining. For different operators in an algorithm, we analyze inter-operator data dependencies to enable fine-grained pipelining. At the system level, we map different steps of the planning process to the CPU and FPGA and pipeline these steps to enhance end-to-end throughput. We implement and validate our design on the AMD ZCU102 platform. Our implementation achieves state-of-the-art performance in both latency and energy efficiency compared with existing works, including an average 1.48× speedup over the best FPGA-based design, a 2.89× speedup compared with the state-of-the-art QP solver on an Intel i7-11800H CPU, a 5.62× speedup over an ARM Cortex-A57 embedded CPU, and a 1.56× speedup over state-of-the-art GPU-based work. Furthermore, our design delivers a 2.05× improvement in throughput compared with the state-of-the-art FPGA-based design.
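A compact sketch of the algorithmic core under simplifying assumptions: ADMM on a box-constrained QP (minimize 1/2 x'Px + q'x subject to l <= x <= u), with the x-update's linear system solved by Jacobi-preconditioned conjugate gradients. The problem data and rho are made up, and the customized sparse storage and pipelining of the actual design are omitted.

```python
import numpy as np

def pcg(A, b, x, tol=1e-8, iters=200):
    """Jacobi-preconditioned conjugate gradients for SPD systems A x = b."""
    r = b - A @ x
    minv = 1.0 / np.diag(A)            # diagonal (Jacobi) preconditioner
    z = minv * r
    p = z.copy()
    rz = r @ z
    for _ in range(iters):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        z = minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def admm_box_qp(P, q, l, u, rho=1.0, iters=100):
    """min 1/2 x'Px + q'x  s.t.  l <= x <= u, split as x = z with z boxed."""
    n = len(q)
    x, z, y = np.zeros(n), np.zeros(n), np.zeros(n)
    A = P + rho * np.eye(n)            # fixed SPD matrix of the x-update
    for _ in range(iters):
        x = pcg(A, rho * (z - y) - q, x)   # x-update: the PCG-accelerated solve
        z = np.clip(x + y, l, u)           # z-update: projection onto the box
        y = y + (x - z)                    # scaled dual ascent
    return z

P = np.array([[4.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, 1.0])
print(admm_box_qp(P, q, l=np.zeros(2), u=np.ones(2)))  # ~[0, 0]
```

One reason this maps well to an FPGA: the matrix A stays fixed across ADMM iterations, so its sparsity pattern can be baked into custom storage and the repeated SpMV inside PCG can be deeply pipelined.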
Artificial intelligence (AI) has been successfully applied to various fields of natural science. One of the biggest challenges in AI acceleration is the performance and energy bottleneck caused by the limited capacity and bandwidth of massive data movement between memory and processing units. In the past decade, much AI accelerator work based on process-in-memory (PIM) has been published, especially on emerging non-volatile resistive random access memory (ReRAM). In this paper, we provide a comprehensive perspective on ReRAM-based AI accelerators, including software-hardware co-design, the status of chip fabrication, research on ReRAM non-idealities, and support for the EDA toolchain. Finally, we summarize and provide three directions for future trends: support for complex model patterns; addressing the impact of non-idealities, such as improving endurance and handling process perturbations and leakage current; and addressing the lack of EDA tools.
As data movement operations and power budgets become key bottlenecks in the design of computing systems, interest in unconventional approaches such as processing-in-memory (PIM), machine learning (ML), and especially neural network (NN)-based accelerators has grown significantly. Resistive random access memory (ReRAM) is a promising technology for efficiently architecting PIM- and NN-based accelerators due to its ability to work both as high-density/low-energy storage and as an in-memory computation/search engine. In this paper, we present a survey of techniques for designing ReRAM-based PIM and NN architectures. By classifying the techniques based on key parameters, we underscore their similarities and differences. This paper will be valuable for computer architects, chip designers, and researchers in the area of machine learning.
Resistive Random-Access-Memory (ReRAM) crossbars are among the most promising neural network accelerators, thanks to their in-memory and in-situ analog computing abilities for Matrix Multiply-and-Accumulate (MAC) operations. The key limitations are: 1) the number of rows and columns of ReRAM cells available for concurrent MAC execution is constrained, limiting in-memory computing throughput; and 2) the cost of high-precision analog-to-digital (A/D) conversion can offset the efficiency and performance benefits of ReRAM-based processing-in-memory (PIM). Meanwhile, it is challenging to deploy Deep Neural Network (DNN) models with a large model size in the crossbar, since DNN sparsity cannot be effectively exploited in the crossbar structure, especially sparsity in the activations. As a countermeasure, we develop a novel ReRAM-based PIM accelerator, namely ERA-BS, which exploits the correlation between bit-level sparsity (in both weights and activations) and the performance of the ReRAM-based crossbar. We propose a superior bit-flip scheme combined with exponent-based quantization, which can adaptively flip the bits of the mapped DNNs to release redundant space without sacrificing much accuracy or incurring much hardware overhead. Meanwhile, we design an architecture that integrates these techniques to shrink the crossbar footprint so it can be used at scale. We further propose a dynamic activation-sparsity exploitation scheme in conjunction with the tightly coupled structure of the crossbar, including crossbar-aware activation pruning and ancillary run-time hardware support. In this way, we exploit fine-grained sparsity in weights (static) and activations (dynamic) to improve performance while reducing computation energy with negligible overheads. Our experiments on a wide variety of networks show that, compared to the well-known ReRAM-based PIM accelerator ISAAC, ERA-BS achieves up to 43×, 78×, and 73× gains in energy efficiency, area efficiency, and throughput, respectively. Compared to the state-of-the-art ReRAM-based design PIM-Prune, ERA-BS also achieves 5.3× energy efficiency, 7.2× area efficiency, and a 32× performance gain with similar or even higher accuracy.
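To see why bit-level sparsity pays off on a crossbar, here is a toy bit-serial model: the dot product is a weighted sum over weight bit-planes, and an all-zero plane costs no crossbar pass at all. (Unsigned 4-bit weights; the skipping rule is a heavy simplification of ERA-BS's bit-flip and pruning schemes.)

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.integers(0, 16, size=(4, 8))   # unsigned 4-bit weights, rows = inputs
x = rng.integers(0, 4, size=4)         # small unsigned activations

acc = np.zeros(8, dtype=np.int64)
passes = 0
for b in range(4):                     # one crossbar pass per weight bit-plane
    plane = (W >> b) & 1               # binary matrix mapped onto the array
    if not plane.any():                # all-zero plane: skip the pass entirely
        continue
    acc += (x @ plane) << b            # shift-and-add recombination
    passes += 1

assert np.array_equal(acc, x @ W)      # identical to the full-precision result
print(f"crossbar passes used: {passes} of 4")
```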
Processing-in-memory (PIM) is a promising architecture for designing various types of neural network accelerators, as it ensures computational efficiency when paired with Resistive Random Access Memory (ReRAM). ReRAM has become a promising means of enhancing computing efficiency due to its crossbar structure. This paper reviews ReRAM-based PIM neural network accelerators, discussing the methods and designs of various schemes and surveying the models and architectures implemented, to identify research trends. The limitations and challenges of ReRAM in neural networks are also addressed.