Find papers on explosion scaling equivalence and neural networks
Multi-Scale Physical Modeling and Temporal Equivalence Prediction
This group of papers explores how neural networks can process physical signals with multi-scale characteristics. Through multi-scale decomposition, timestep shrinking, and equivalence mappings between physical quantities (e.g., phase gradients, needle-tip deflection, seismic responses), they provide a modeling foundation for scale equivalence in complex physical processes such as explosion simulation.
- Shrinking Your TimeStep: Towards Low-Latency Neuromorphic Object Recognition with Spiking Neural Network(Yongqi Ding, Lin Zuo, Mengmeng Jing, Pei He, Yongjun Xiao, 2024, No journal)
- Motor fault diagnosis method based on spiking convolutional neural network with multi-scale decomposition local features.(Gongping Wu, Zhiwen Huang, Zhuo Long, Fengqin Huang, Minghai Wang, Xiaofei Zhang, 2025, ISA transactions)
- Hi-MDTCN: Hierarchical Multi-Scale Dilated Temporal Convolutional Network for Tool Condition Monitoring(Anying Chai, Zhaobo Fang, Mengjia Lian, Ping Huang, Chenyang Guo, Wanda Yin, Lei Wang, Enqiu He, Siwen Li, 2025, Sensors (Basel, Switzerland))
- Amplitude-Dependent Phase-Gradient Directional Beamforming for IRS: A Scalable Optimization Framework(Zhuang Mao, Wei Wang, Q. Xia, Chongwen Huang, Xinhua Pan, Zhizhen Ye, 2024, IEEE Transactions on Communications)
- DAPS-AGF: Depth-Aware Perceptual Similarity with Adaptive Gradient Filtering for Enhanced Outdoor Scene Reconstruction(A. Yousaf, Arkajyoti Mitra, Paul Agbaje, A. Anjum, Habeeb Olufowobi, 2025, 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW))
- Rigorous dynamics of expectation-propagation signal detection via the conjugate gradient method(K. Takeuchi, Chao-Kai Wen, 2017, 2017 IEEE 18th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC))
- Prediction and Analysis of Bevel-Tip Needle Deflection Using Radial Basis Network for Robot-Aided Procedures(Bulbul Behera, M. F. Orlando, Tarun K. Podder, R. Anand, 2025, IEEE Transactions on Automation Science and Engineering)
- On data and parameters of pre-trained neural networks for solar power forecasting(Robbe Vander Eeckt, Joris Depoortere, Hussain Kazmi, 2025, 2025 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT Europe))
- Gated Recurrent Units Based Neural Network For Tool Condition Monitoring(Huan Xu, Chong Zhang, G. Hong, Junhong Zhou, Jihoon Hong, K. Woon, 2018, 2018 International Joint Conference on Neural Networks (IJCNN))
Deep Learning Scaling Laws and Performance Evolution
This group of papers focuses on how performance grows as model scale (parameters, data, resolution) increases. The studies cover compound scaling of convolutional networks and Transformers, scaling of heterogeneous pre-training for visual learning, and the scaling behavior of open foundation models; together they form the core theory for understanding how "scale equivalence" manifests at the level of model capacity.
- Scaling transformer neural networks for skillful and reliable medium-range weather forecasting(Tung Nguyen, Rohan Shah, Hritik Bansal, Troy Arcomano, Sandeep Madireddy, R. Maulik, V. Kotamarthi, Ian Foster, Aditya Grover, 2023, ArXiv)
- Efficient and accurate compound scaling for convolutional neural networks(Chengmin Lin, Pengfei Yang, Quan Wang, Zeyu Qiu, Wenkai Lv, Zhenyi Wang, 2023, Neural networks : the official journal of the International Neural Network Society)
- Resolution Based Incremental Scaling Methodology for CNNs(J. Lim, Soomi Lee, Soonhoi Ha, 2023, IEEE Access)
- Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers(Lirui Wang, Xinlei Chen, Jialiang Zhao, Kaiming He, 2024, ArXiv)
- YuE: Scaling Open Foundation Models for Long-Form Music Generation(Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yi Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Junlin Zhan, Chunhui Wang, Yatian Wang, Xiao-Qian Chi, Xinyue Zhang, Zhen Yang, Xiangzhou Wang, Shan-Ling Liu, Ling Mei, Peng Li, Junjie Wang, Jian-Xiu Yu, Guojian Pang, Xu Li, Zihao Wang, Xiaohuan Zhou, Lijun Yu, Emmanouil Benetos, Yong Chen, Cheng-Ju Lin, Xie Chen, Gus G. Xia, Zhaoxiang Zhang, Chao Zhang, Wenhu Chen, Xinyu Zhou, Xipeng Qiu, R. Dannenberg, Jia-Hua Liu, Jian Yang, Wenhao Huang, Wei Xue, Xu Tan, Yi-Ting Guo, 2025, ArXiv)
- Scaling Laws For Deep Learning Based Image Reconstruction(Tobit Klug, Reinhard Heckel, 2022, ArXiv)
Gradient Explosion Suppression and Training Dynamics Stability
Addressing the instability of deep networks (RNNs, Transformers, diffusion models) under extreme parameter fluctuations, this group studies strategies for suppressing exploding gradients, including gradient clipping, sharpness-aware minimization (SAM), analyses of initialization failure modes, and asymptotic-stability proofs for non-autonomous systems, to ensure convergence in complex dynamic environments (a minimal gradient-clipping sketch follows this list).
- Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation(Alessio Giorlandino, Sebastian Goldt, 2025, ArXiv)
- Inducing Uniform Asymptotic Stability in Non-Autonomous Accelerated Optimization Dynamics via Hybrid Regularization(J. Poveda, Na Li, 2019, 2019 IEEE 58th Conference on Decision and Control (CDC))
- Dynamical Effects of Neuron Activation Gradient on Hopfield Neural Network: Numerical Analyses and Hardware Experiments(B. Bao, Chengjie Chen, H. Bao, Xi Zhang, Quan Xu, Mo Chen, 2019, Int. J. Bifurc. Chaos)
- Scaling Learning-based Policy Optimization for Temporal Logic Tasks by Controller Network Dropout(Navid Hashemi, Bardh Hoxha, Danil Prokhorov, Georgios Fainekos, J. Deshmukh, 2024, ACM Transactions on Cyber-Physical Systems)
- Privacy-Preserving Federated Recurrent Neural Networks(Sinem Sav, Abdulrahman Diaa, Apostolos Pyrgelis, Jean-Philippe Bossuat, Jean-Pierre Hubaux, 2022, Proc. Priv. Enhancing Technol.)
- Research on Ubiquitous Power Internet of Things in Distribution Networks Using Adaptive Gradient Algorithm with Convolution Neural Network(Song Liu, 2024, 2024 International Conference on Data Science and Network Security (ICDSNS))
- Hyperbolic-SAM: sharpness-aware minimization in hyperbolic space for enhanced deep learning generalization(Zeyang Kang, 2025, No journal)
- Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation(Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian Tang, 2024, 2025 IEEE International Conference on Robotics and Automation (ICRA))
- Proximal Gradient Dynamics: Monotonicity, Exponential Convergence, and Applications(Anand Gokhale, A. Davydov, Francesco Bullo, 2024, IEEE Control Systems Letters)
- Instabilities in Convnets for Raw Audio(Daniel Haider, Vincent Lostanlen, Martin Ehler, Péter Balázs, 2023, IEEE Signal Processing Letters)
- Unraveling the Gradient Descent Dynamics of Transformers(Bingqing Song, Boran Han, Shuai Zhang, J. Ding, Mingyi Hong, 2024, ArXiv)
- Convergence and stability analysis of recurrent neural networks for rapid structural damage assessment under seismic loads(Feng Zeng, Fujiang Chen, Yongyi Yang, Xin Zhang, 2025, PLOS One)
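To make the gradient-clipping strategy mentioned in the group summary above concrete, here is a minimal sketch of global-norm clipping in plain NumPy; the threshold and toy gradients are illustrative and not taken from any of the listed papers.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Global-norm gradient clipping: if the joint L2 norm of all gradient
    tensors exceeds max_norm, rescale every tensor by the same factor."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Toy example: two parameter groups whose joint gradient norm is 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm, clipped)   # norm 13.0 -> gradients rescaled by 5/13
```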
Spiking Neural Networks (SNNs) and Neuromorphic Equivalent Computing
This group focuses on scaling and equivalence mechanisms for third-generation neural networks. The studies cover spike-train-level backpropagation, surrogate-gradient scaling, and adaptive synaptic scaling, aiming to address gradient flow and energy-efficiency optimization as brain-inspired architectures are made deeper.
- The architecture design and training optimization of spiking neural network with low-latency and high-performance for classification and segmentation(Wujian Ye, Shaozhen Chen, Haoxian Liu, Yijun Liu, Yuehai Chen, Youfeng Cui, Wenjie Lin, 2025, Neural networks : the official journal of the International Neural Network Society)
- Re-parameterization Convolution Spiking Neural Network for Object Detection*(Jun Zhou, Ziliang Ren, Qieshi Zhang, Kadyrkulova Kyial Kudayberdievna, Taalaybekova Aizharkyn, 2025, 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC))
- Surrogate gradient scaling for directly training spiking neural networks(Tao Chen, Shu Wang, Yuyuan Gong, Lidan Wang, Shukai Duan, 2023, Applied Intelligence)
- Surrogate Gradient Spiking Neural Networks as Encoders for Large Vocabulary Continuous Speech Recognition(Alexandre Bittar, Philip N. Garner, 2022, ArXiv)
- Spike-Train Level Backpropagation for Training Deep Recurrent Spiking Neural Networks(Wenrui Zhang, Peng Li, 2019, No journal)
- Adaptive Synaptic Scaling in Spiking Networks for Continual Learning and Enhanced Robustness(M. Xu, Faqiang Liu, Yifan Hu, Hongyi Li, Yuanyuan Wei, Shuai Zhong, Jing Pei, Lei Deng, 2024, IEEE Transactions on Neural Networks and Learning Systems)
- Scaling Up Resonate-and-Fire Networks for Fast Deep Learning(T. Huber, Jules Lecomte, Borislav Polovnikov, A. V. Arnim, 2025, ArXiv)
Large-Scale Distributed Optimization and Hardware Architecture Scaling
This group of papers addresses system-level scaling in heterogeneous computing environments and on emerging hardware (e.g., photonic chips). It covers large-scale applications of zeroth-order optimization, efficiency improvements for distributed training, and optical-electrical residual designs that overcome the physical limits conventional hardware places on gradient propagation.
- Scaling Learning based Policy Optimization for Temporal Tasks via Dropout(Navid Hashemi, Bardh Hoxha, Danil Prokhorov, Georgios Fainekos, J. Deshmukh, 2024, ArXiv)
- An Adaptive Proximal Inexact Gradient Framework and Its Application to Per-Antenna Constrained Joint Beamforming and Compression Design(Xilai Fan, Bo Jiang, Ya-Feng Liu, 2025, IEEE Transactions on Signal Processing)
- DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training(Aochuan Chen, Yimeng Zhang, Jinghan Jia, James Diffenderfer, Jiancheng Liu, Konstantinos Parasyris, Yihua Zhang, Zheng Zhang, B. Kailkhura, Sijia Liu, 2023, ArXiv)
- Heterogeneous gradient computing optimization for scalable deep neural networks(S. Moreno-Álvarez, Mercedes Eugenia Paoletti, Juan A. Rico-Gallego, J. Haut, 2022, The Journal of Supercomputing)
- Architectural Shift Necessitated by Voracious Energy Demands of Generative Artificial Intelligence (GAI) and High-Performance Computing(Parth Shukla, M. L. Sharma, Sunil Kumar, A. Garg, Mahir Pandey, 2025, International Scientific Journal of Engineering and Management)
- Reservoir direct feedback alignment: deep learning by physical dynamics(M. Nakajima, Yongbo Zhang, Katsuma Inoue, Yasuo Kuniyoshi, Toshikazu Hashimoto, Kohei Nakajima, 2024, Communications Physics)
- On-chip deep residual photonic neural networks using optical-electrical shortcut connections.(Kaiyuan Wang, Zihao Tang, Yunlong Li, Yantao Wu, Shuang Zheng, Minming Zhang, 2025, Optics letters)
- A Distributed Neural Network Training Method Based on Hybrid Gradient Computing(Zhenzhou Lu, Meng Lu, Yan Liang, 2020, Scalable Comput. Pract. Exp.)
- Distributed neural network control with dependability guarantees: a compositional port-Hamiltonian approach(Luca Furieri, C. Galimberti, M. Zakwan, G. Ferrari-Trecate, 2021, No journal)
- Maximum-Likelihood Detection With QAOA for Massive MIMO and Sherrington-Kirkpatrick Model With Local Field at Infinite Size(Burhan Gülbahar, 2024, IEEE Transactions on Wireless Communications)
Taken together, the merged groups build a complete framework spanning physical equivalence modeling, model scaling theory, and system stability guarantees. The focus is on: 1) using multi-scale neural networks to capture scaling regularities in physical phenomena; 2) predicting the performance evolution of large models via scaling laws; 3) addressing the exploding-gradient and stability challenges deep learning faces when simulating high-energy, transient processes (e.g., explosion-related gradient fluctuations); and 4) exploring the distinctive advantages of spiking neural networks and novel hardware for equivalence-oriented computation. Together these provide multi-dimensional theoretical and technical support for neural-computational simulation of explosion scaling equivalence.
A total of 44 related papers.
No abstract available
In China, several distribution automation devices have been deployed in distribution networks to increase the reliability of the power supply. The major challenge of the ubiquitous electric Internet of Things (IoT) is the resource aspect of enabling distributed computational sources. Therefore, the Adaptive Gradient Algorithm with Convolution Neural Network (Adagrad-CNN) is proposed for the ubiquitous electric IoT in distribution networks. Adagrad is helpful when dealing with sparse features or data spanning an extensive range of values. The CNN has a multi-layer architecture for classifying pixel patterns; it filters noise and captures significant predictive properties. The Adagrad-CNN leads to better generalization and enhances CNN performance by mitigating exploding-gradient issues, which is essential when dealing with difficult and large datasets in the ubiquitous electric IoT. The dataset is collected from various devices and sensors in the power grid and is preprocessed by Z-score normalization. This preprocessing improves convergence rates and tames widely differing feature scales; it is essential when features are in different ranges, as it prevents any particular feature from dominating. The Adagrad-CNN attains 96.54% accuracy, 94.21% recall, 95.73% precision, and a 94.59% F1-score.
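As a rough illustration of the two ingredients the abstract above combines, Z-score normalization and the Adagrad update, here is a minimal NumPy sketch on a toy linear model; shapes, learning rate, and data are hypothetical and unrelated to the paper's power-grid dataset.

```python
import numpy as np

def zscore(x, eps=1e-8):
    """Z-score normalization: zero mean, unit variance per feature column."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad update: the per-parameter step size shrinks as squared
    gradients accumulate, damping parameters with large or exploding gradients."""
    accum += grad ** 2
    w -= lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# Toy usage on a linear regression problem (illustrative only).
rng = np.random.default_rng(0)
X = zscore(rng.normal(size=(128, 16)))
y = rng.normal(size=128)
w, accum = np.zeros(16), np.zeros(16)
for _ in range(200):
    grad = X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
    w, accum = adagrad_step(w, grad, accum)
print(w[:4])
```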
Compared to conventional artificial neurons that produce dense and real-valued responses, biologically-inspired spiking neurons transmit sparse and binary information, which can also lead to energy-efficient implementations. Recent research has shown that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method. They have shown promising results on speech command recognition tasks. Using the same technique, we show that they are scalable to large vocabulary continuous speech recognition, where they are capable of replacing LSTMs in the encoder with only minor loss of performance. This suggests that they may be applicable to more involved sequence-to-sequence tasks. Moreover, in contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
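The surrogate gradient method mentioned above replaces the undefined derivative of the spike threshold with a smooth stand-in during backpropagation. A minimal PyTorch sketch of one common choice (a "fast sigmoid" surrogate; the threshold and slope values are arbitrary, and this is not the paper's encoder) is:

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass; the derivative of a 'fast sigmoid'
    as the surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, v, threshold, slope):
        ctx.save_for_backward(v)
        ctx.threshold, ctx.slope = threshold, slope
        return (v >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # d(spike)/dv approximated by 1 / (1 + slope * |v - threshold|)^2
        surrogate = 1.0 / (1.0 + ctx.slope * (v - ctx.threshold).abs()) ** 2
        return grad_output * surrogate, None, None

v = torch.randn(8, requires_grad=True)
spikes = SurrogateSpike.apply(v, 1.0, 10.0)
spikes.sum().backward()        # gradients flow despite the hard threshold
print(v.grad)
```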
Hyperbolic tangent function, a bounded monotone differentiable function, is usually taken as a neuron activation function, whose activation gradient, i.e. gain scaling parameter, can reflect the response speed in the neuronal electrical activities. However, the previously published literatures have not yet paid attention to the dynamical effects of the neuron activation gradient on Hopfield neural network (HNN). Taking the neuron activation gradient as an adjustable control parameter, dynamical behaviors with the variation of the control parameter are investigated through stability analyses of the equilibrium states, numerical analyses of the mathematical model, and experimental measurements on a hardware level. The results demonstrate that complex dynamical behaviors associated with the neuron activation gradient emerge in the HNN model, including coexisting limit cycle oscillations, coexisting chaotic spiral attractors, chaotic double scrolls, forward and reverse period-doubling cascades, and crisis scenarios, which are effectively confirmed by neuron activation gradient-dependent local attraction basins and parameter-space plots as well. Additionally, the experimentally measured results have nice consistency to numerical simulations.
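To show where the "activation gradient" (gain) parameter enters such a model, here is a toy Euler integration of a tanh-activated Hopfield-type network; the weight matrix, gain values, and step size are placeholders, not the parameters analyzed in the paper.

```python
import numpy as np

def hnn_step(x, W, gain, dt=0.01):
    """One Euler step of a Hopfield-type network dx/dt = -x + W @ tanh(gain * x).
    'gain' plays the role of the activation gradient: larger gain steepens
    tanh and can push the network into a different dynamical regime."""
    return x + dt * (-x + W @ np.tanh(gain * x))

W = np.array([[ 0.0, -1.2,  1.0],
              [ 1.5,  0.0, -0.8],
              [-1.1,  1.3,  0.0]])          # placeholder synaptic weights
x0 = np.array([0.1, -0.2, 0.3])
for gain in (1.0, 5.0, 20.0):                # sweep the activation gradient
    state = x0.copy()
    for _ in range(5000):
        state = hnn_step(state, W, gain)
    print(gain, state)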
Diffusion Policy is a powerful technique for learning end-to-end visuomotor robot control. It is expected that Diffusion Policy possesses scalability, a key attribute for deep neural networks, typically suggesting that increasing model size would lead to enhanced performance. However, our observations indicate that Diffusion Policy in transformer architecture (DP-T) struggles to scale effectively; even minor additions of layers can deteriorate training outcomes. To address this issue, we introduce Scalable Diffusion Transformer Policy for visuomotor learning. Our proposed method, namely ScaleDP, introduces two modules that improve the training dynamics of Diffusion Policy and allow the network to better handle multimodal action distributions. First, we identify that DP-T suffers from large gradient issues, making the optimization of Diffusion Policy unstable. To resolve this issue, we factorize the feature embedding of observations into multiple affine layers and integrate it into the transformer blocks. Additionally, we utilize non-causal attention, which allows the policy network to “see” future actions during prediction, helping to reduce compounding errors. We demonstrate that our proposed method successfully scales the Diffusion Policy from 10 million to 1 billion parameters. This new model, named ScaleDP, can effectively scale up the model size with improved performance and generalization. We benchmark ScaleDP across 50 different tasks from MetaWorld and find that our largest ScaleDP outperforms DP-T with an average improvement of 21.6%. Across 7 real-world robot tasks, our ScaleDP demonstrates an average improvement of 36.25% over DP-T on four single-arm tasks and 75% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning. The project page is available at https://scaling-diffusion-policy.github.io/.
Neuromorphic object recognition with spiking neural networks (SNNs) is the cornerstone of low-power neuromorphic computing. However, existing SNNs suffer from significant latency, utilizing 10 to 40 timesteps or more, to recognize neuromorphic objects. At low latencies, the performance of existing SNNs is drastically degraded. In this work, we propose the Shrinking SNN (SSNN) to achieve low-latency neuromorphic object recognition without reducing performance. Concretely, we alleviate the temporal redundancy in SNNs by dividing SNNs into multiple stages with progressively shrinking timesteps, which significantly reduces the inference latency. During timestep shrinkage, the temporal transformer smoothly transforms the temporal scale and preserves the information maximally. Moreover, we add multiple early classifiers to the SNN during training to mitigate the mismatch between the surrogate gradient and the true gradient, as well as the gradient vanishing/exploding, thus eliminating the performance degradation at low latency. Extensive experiments on neuromorphic datasets, CIFAR10-DVS, N-Caltech101, and DVS-Gesture have revealed that SSNN is able to improve the baseline accuracy by 6.55% ~ 21.41%. With only 5 average timesteps and without any data augmentation, SSNN is able to achieve an accuracy of 73.63% on CIFAR10-DVS. This work presents a heterogeneous temporal scale SNN and provides valuable insights into the development of high-performance, low-latency SNNs.
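A crude way to picture the progressive timestep shrinkage described above is to pool an event sequence into fewer and fewer time bins between stages. The sketch below uses simple sum-and-binarize pooling and a hypothetical 16 -> 8 -> 4 schedule; the paper's learned temporal transformer is far more elaborate.

```python
import numpy as np

def shrink_timesteps(spike_seq, out_steps):
    """Collapse a [T, N] spike sequence to [out_steps, N] by summing and
    re-binarizing equal time bins -- a crude stand-in for a learned
    temporal transformation between SNN stages."""
    T, _ = spike_seq.shape
    bins = np.array_split(np.arange(T), out_steps)
    pooled = np.stack([spike_seq[idx].sum(axis=0) for idx in bins])
    return (pooled > 0).astype(float)

# Hypothetical 3-stage schedule: 16 -> 8 -> 4 timesteps.
events = (np.random.default_rng(0).random((16, 32)) < 0.2).astype(float)
stage1 = events                       # processed at 16 timesteps
stage2 = shrink_timesteps(stage1, 8)  # shrunk before the next stage
stage3 = shrink_timesteps(stage2, 4)
print(stage1.shape, stage2.shape, stage3.shape)
```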
On-chip photonic neural networks (PNNs) have recently emerged as an attractive hardware accelerator for deep learning applications. However, deep PNNs with higher inference complexity are harder to train due to gradient vanishing and exploding problems. In this work, we propose an on-chip deep residual photonic neural network architecture (Res-PNN), which enables the training of deeper PNNs by using optical-electrical shortcut connections. The optical-electrical shortcut connection is designed using a power splitter, a wavelength demultiplexer, and photodetectors to directly connect the input and the output across optical weight layers. This optical-electrical shortcut connection alleviates the gradient vanishing and exploding problems by providing a direct path for gradient backpropagation, ensuring stable training of deeper PNNs. The proposed Res-PNN achieves classification accuracies of 88.4% on the CIFAR-10 dataset and 80.3% on the CIFAR-100 dataset. Compared to fully connected PNNs, Res-PNN improves classification accuracy by 3.2% on the CIFAR-10 dataset and 11.3% on the CIFAR-100 dataset.
As the third generation of neural networks, Spiking Neural Networks (SNNs) have biological plausibility and low-power advantages over Artificial Neural Networks (ANNs). However, applying SNNs to object detection tasks presents challenges in achieving both high detection accuracy and fast processing speed. To overcome these problems, we propose a Re-parameterization SpikeYOLO (RepSpikeYOLO) for high-performance and energy-efficient object detection. Our design revolves around the network architecture and the SNN residual block. Foremost, SNNs are difficult to train, mainly owing to the complex dynamics of their neurons and non-differentiable spike operations. We design a YOLO architecture to solve this problem by training the SNN with surrogate gradients. Second, object detection is more sensitive to gradient vanishing or exploding when training deep SNNs. To address this challenge, we design a new SNN residual block, which can effectively extend the depth of the directly trained SNN with low power consumption. The proposed approach is validated on both the COCO dataset and the PASCAL VOC dataset. It is shown that our model achieves performance comparable to the ANN with the same architecture. On the COCO dataset, we obtain 54% mAP@50 and 33.7% mAP@50:95, which are +3.9% and +3.7% higher than the prior state-of-the-art SNN, respectively. On the PASCAL VOC dataset, we achieve 75.1% mAP@50, which is +21.05% higher than the prior state-of-the-art SNN.
Non-stationary earthquake responses and sensor noise often make RNN-based damage assessment difficult to optimize and unstable at inference. We develop a stability-controlled, lightweight LSTM that: (i) penalizes gradient overshoot to smooth the update trajectory and prevent exploding/vanishing gradients; (ii) uses a temporal attention gate to emphasize damage-critical segments; and (iii) performs multi-scale sliding-window inference to stabilize long-horizon predictions. Casting the LSTM-with-attention into a discrete-time state-space view, we provide sufficient conditions for non-expansive updates and BIBO stability by bounding the Jacobian spectral norm and constraining attention gains. Empirically, under 10 dB noise our method reaches loss < 0.01 in 18 epochs with only 3 gradient-explosion events, and achieves σ(out)=0.032 with max Δ-rate = 0.085 ± 0.009, outperforming standard LSTM/GRU/BiLSTM/RNN baselines in accuracy, stability, and latency. On-device tests (Jetson Nano) confirm < 5 ms end-to-end delay at 100 Hz, supporting real-time deployment.
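The multi-scale sliding-window inference mentioned in the abstract can be pictured as slicing the response signal at several window lengths and aggregating per-window predictions. A minimal slicing sketch (window lengths, hop size, and the synthetic signal are assumptions, not the paper's settings) is:

```python
import numpy as np

def multiscale_windows(signal, window_sizes, hop):
    """Multi-scale sliding-window slicing: cut the response signal at several
    window lengths; per-window model outputs would then be aggregated."""
    views = {}
    for w in window_sizes:
        starts = range(0, len(signal) - w + 1, hop)
        views[w] = np.stack([signal[s:s + w] for s in starts])
    return views

acc = np.random.default_rng(0).normal(size=1000)   # synthetic accelerogram
for w, batch in multiscale_windows(acc, (64, 128, 256), hop=32).items():
    print(w, batch.shape)                           # (num_windows, window_len)
```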
The application of deep learning in industry often needs to train large-scale neural networks and use large-scale data sets. However, larger networks and larger data sets lead to longer training time, which hinders the research of algorithms and the progress of actual engineering development. Data-parallel distributed training is a commonly used solution, but it is still in the stage of technical exploration. In this paper, we study how to improve the training accuracy and speed of distributed training, and propose a distributed training strategy based on hybrid gradient computing. Specifically, in the gradient descent stage, we propose a hybrid method, which combines a new warmup scheme with the linear-scaling stochastic gradient descent (SGD) algorithm to effectively improve the training accuracy and convergence rate. At the same time, we adopt the mixed precision gradient computing. In the single-GPU gradient computing and inter-GPU gradient synchronization, we use the mixed numerical precision of single precision (FP32) and half precision (FP16), which not only improves the training speed of single-GPU, but also improves the speed of inter-GPU communication. Through the integration of various training strategies and system engineering implementation, we finished ResNet-50 training in 20 minutes on a cluster of 24 V100 GPUs, with 75.6% Top-1 accuracy, and 97.5% GPU scaling efficiency. In addition, this paper proposes a new criterion for the evaluation of the distributed training efficiency, that is, the actual average single-GPU training time, which can evaluate the improvement of training methods in a more reasonable manner than just the improved performance due to the increased number of GPUs. In terms of this criterion, our method outperforms those existing methods.
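The warmup plus linear-scaling SGD schedule described above is a standard recipe; a minimal sketch of the learning-rate logic (all constants are illustrative, not the paper's settings) could look like:

```python
def scaled_lr(epoch, step_in_epoch, steps_per_epoch,
              base_lr=0.1, base_batch=256, batch=2048, warmup_epochs=5):
    """Linear-scaling rule: the target learning rate grows with the global
    batch size, but is ramped up linearly over the first few epochs to
    avoid early divergence. All constants here are illustrative."""
    target_lr = base_lr * batch / base_batch
    if epoch < warmup_epochs:
        progress = (epoch * steps_per_epoch + step_in_epoch) / (warmup_epochs * steps_per_epoch)
        return target_lr * progress
    return target_lr

# Learning rate at the start, mid-warmup, and after warmup (100 steps/epoch).
print(scaled_lr(0, 0, 100), scaled_lr(2, 50, 100), scaled_lr(10, 0, 100))
```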
Synaptic plasticity plays a critical role in the expression power of brain neural networks. Among diverse plasticity rules, synaptic scaling presents indispensable effects on homeostasis maintenance and synaptic strength regulation. In the current modeling of brain-inspired spiking neural networks (SNN), backpropagation through time is widely adopted because it can achieve high performance using a small number of time steps. Nevertheless, the synaptic scaling mechanism has not yet been well touched. In this work, we propose an experience-dependent adaptive synaptic scaling mechanism (AS-SNN) for spiking neural networks. The learning process has two stages: First, in the forward path, adaptive short-term potentiation or depression is triggered for each synapse according to afferent stimuli intensity accumulated by presynaptic historical neural activities. Second, in the backward path, long-term consolidation is executed through gradient signals regulated by the corresponding scaling factor. This mechanism shapes the pattern selectivity of synapses and the information transfer they mediate. We theoretically prove that the proposed adaptive synaptic scaling function follows a contraction map and finally converges to an expected fixed point, in accordance with state-of-the-art results in three tasks on perturbation resistance, continual learning, and graph learning. Specifically, for the perturbation resistance and continual learning tasks, our approach improves the accuracy on the N-MNIST benchmark over the baseline by 44% and 25%, respectively. An expected firing rate callback and sparse coding can be observed in graph learning. Extensive experiments on ablation study and cost evaluation evidence the effectiveness and efficiency of our nonparametric adaptive scaling method, which demonstrates the great potential of SNN in continual learning and robust learning.
Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain. However, the scalability of ZO optimization remains an open problem: Its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. To the best of our knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch through three primary innovations. First, we demonstrate the advantages of coordinate-wise gradient estimation (CGE) over randomized vector-wise gradient estimation in training accuracy and computational efficiency. Second, we propose a sparsity-induced ZO training protocol that extends the model pruning methodology using only finite differences to explore and exploit the sparse DL prior in CGE. Third, we develop the methods of feature reuse and forward parallelization to advance the practical implementations of ZO training. Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. Furthermore, we show the practical utility of DeepZero in applications of certified adversarial defense and DL-based partial differential equation error correction, achieving 10-20% improvement over SOTA. We believe our results will inspire future research on scalable ZO optimization and contribute to advancing DL with black-box optimization.
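The core primitive behind coordinate-wise gradient estimation (CGE) is a one-coordinate-at-a-time finite difference. A minimal sketch on a toy quadratic (the smoothing parameter and test function are illustrative, not DeepZero's training loop) is:

```python
import numpy as np

def cge_gradient(f, x, mu=1e-3, coords=None):
    """Coordinate-wise gradient estimation (CGE): perturb one coordinate at a
    time and use a forward finite difference. `coords` can restrict estimation
    to a sparse subset of coordinates, as in sparsity-induced ZO training."""
    coords = range(len(x)) if coords is None else coords
    g = np.zeros_like(x)
    fx = f(x)
    for i in coords:
        e = np.zeros_like(x)
        e[i] = mu
        g[i] = (f(x + e) - fx) / mu
    return g

# Illustrative check against the analytic gradient of a quadratic.
f = lambda x: 0.5 * np.dot(x, x)
x = np.array([1.0, -2.0, 3.0])
print(cge_gradient(f, x))   # approximately [1, -2, 3]
```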
Nowadays, data processing applications based on neural networks cope with the growth in the amount of data to be processed and with the increase in both the depth and complexity of neural network architectures, and hence in the number of parameters to be learned. High-performance computing platforms provide fast computing resources, including multi-core processors and graphical processing units, to manage the computational burden of deep neural network applications. A common optimization technique is to distribute the workload among the processes deployed on the resources of the platform. This approach is known as data parallelism. Each process, known as a replica, trains its own copy of the model on a disjoint data partition. Nevertheless, the heterogeneity of the computational resources composing the platform requires the workload to be unevenly distributed among the replicas according to their computational capabilities, to optimize overall execution performance. Since the amount of data to be processed differs between replicas, the influence of the gradients computed by each replica on the global parameter update should also differ. This work proposes a modification of the gradient computation method that accounts for the different speeds of the replicas and hence for the amount of data assigned to each. Experiments were conducted on heterogeneous high-performance computing platforms for a wide range of models and datasets, showing an improvement in final accuracy with respect to current techniques at comparable performance.
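A minimal sketch of the idea, weighting each replica's gradient by its share of the assigned data before the global update, is shown below; the replica gradients and shard sizes are placeholders, not the paper's configuration.

```python
import numpy as np

def aggregate(gradients, samples_per_replica):
    """Weight each replica's gradient by the fraction of the global batch it
    processed, so replicas with larger shards influence the update more."""
    weights = np.asarray(samples_per_replica, dtype=float)
    weights /= weights.sum()
    return sum(w * g for w, g in zip(weights, gradients))

# Three heterogeneous replicas with uneven data shards (placeholder values).
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
shards = [512, 256, 128]          # samples assigned per replica
print(aggregate(grads, shards))
```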
What makes waveform-based deep learning so hard? Despite numerous attempts at training convolutional neural networks (convnets) for filterbank design, they often fail to outperform hand-crafted baselines. These baselines are linear time-invariant systems: as such, they can be approximated by convnets with wide receptive fields. Yet, in practice, gradient-based optimization leads to suboptimal results. In our article, we approach this problem from the perspective of initialization. We present a theory of large deviations for the energy response of FIR filterbanks with random Gaussian weights. We find that deviations worsen for large filters and locally periodic input signals, which are both typical for audio signal processing applications. Numerical simulations align with our theory and suggest that the condition number of a convolutional layer follows a logarithmic scaling law between the number and length of the filters, which is reminiscent of discrete wavelet bases.
In this paper, we propose an adaptive proximal inexact gradient (APIG) framework for solving a class of nonsmooth composite optimization problems involving function and gradient errors. Unlike existing inexact proximal gradient methods, the proposed framework introduces a new line search condition that jointly adapts to function and gradient errors, enabling adaptive stepsize selection while maintaining theoretical guarantees. Specifically, we prove that the proposed framework achieves an $\epsilon$-stationary point within $\mathcal{O}(\epsilon^{-2})$ iterations for nonconvex objectives and an $\epsilon$-optimal solution within $\mathcal{O}(\epsilon^{-1})$ iterations for convex cases, matching the best-known complexity in this context. We then custom-apply the APIG framework to an important signal processing problem: the joint beamforming and compression problem (JBCP) with per-antenna power constraints (PAPCs) in cooperative cellular networks. This customized application requires careful exploitation of the problem’s special structure such as the tightness of the semidefinite relaxation (SDR) and the differentiability of the dual. Numerical experiments demonstrate the superior performance of our custom-application over state-of-the-art benchmarks for the JBCP.
Intelligent reflecting surface (IRS) usually consists of a large number of passive elements, for which the element-grouping strategies can be adopted to group adjacent elements into a sub-surface for lower computational complexity. For the grouped elements of a sub-surface, the linear gradient phase shift configuration can achieve directional IRS reflect beam towards the intended receiver. In this paper, we propose a practical scalable optimization framework for element-grouping IRS by adopting the amplitude-dependent phase-gradient directional beamforming, which induces a new amplitude-phase coupling to the reflected signal. Specifically, by deriving the phase-gradient condition from Fermat’s principle, we propose a practical phase-gradient IRS reflection model. Under this practical model, the amplitude-phase coupling becomes complicated, which brings technical challenges to the IRS beamforming optimization. We study a joint transmit and reflect beamforming optimization problem to minimize the transmit power. By designing a trigonometric transformation to deal with the complicated amplitude-phase coupling, we propose a penalty-based phase control strategy under given element grouping. Subsequently, to solve the element-grouping combinatorial problem with performance guarantee, we propose a low-complexity IRS reflect beamforming algorithm based on Markov approximation. Simulation results demonstrate that the proposed algorithm achieves substantial performance gains compared to conventional schemes.
We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
Accurate identification of tool wear conditions is of great significance for extending tool life, ensuring processing quality, and improving production efficiency. Current research shows that signals collected by a single sensor have limited dimensions and cannot comprehensively capture the degradation process of tool wear, while multi-sensor fusion recognition methods cannot effectively handle the complementarity and redundancy between heterogeneous sensor data in feature extraction and fusion. To address these issues, this paper proposes Hi-MDTCN (Hierarchical Multi-scale Dilated Temporal Convolutional Network). In the network, we propose a hierarchical signal analysis framework that processes the signal in segments. When processing intra-segment signals, we design a Multi-channel one-dimensional convolutional network with attention mechanism to capture local wear features at different time scales and fuse them into a unified representation. When processing signal segments, we design a Bi-TCN module to further capture long-term dependencies in wear evolution, mining the overall trend of tool wear over time. Hi-MDTCN adopts a dilated convolution mechanism, which can achieve an extremely large receptive field without building an overly deep network structure, effectively solving problems faced by recurrent neural networks in long sequence modeling such as gradient vanishing, low training efficiency, and poor parallel computing capability, achieving efficient parallel capture of long-range dependencies in time series. Finally, the proposed method is applied to the PHM2010 milling data. Experimental results show that the model’s tool condition recognition accuracy is higher than traditional methods, demonstrating its effectiveness for practical applications.
Designing efficient and accurate network architectures to support various workloads, from servers to edge devices, is a fundamental problem as the use of Convolutional Neural Networks (ConvNets) becomes increasingly widespread. One simple yet effective method is to scale ConvNets by systematically adjusting the dimensions of the baseline network, including width, depth, and resolution, enabling it to adapt to diverse workloads by varying its computational complexity and representation ability. However, current state-of-the-art (SOTA) scaling methods for neural network architectures overlook the inter-dimensional relationships within the network and the impact of scaling on inference speed, resulting in suboptimal trade-offs between accuracy and inference speed. To overcome those limitations, we propose a scaling method for ConvNets that utilizes dimension relationship and runtime proxy constraints to improve accuracy and inference speed. Specifically, our research notes that higher input resolutions in convolutional layers lead to redundant filters (convolutional width) due to increased similarity between information in different positions, suggesting a potential benefit in reducing filters while increasing input resolution. Based on this observation, the relationship between the width and resolution is empirically quantified in our work, enabling models with higher parametric efficiency to be prioritized through our scaling strategy. Furthermore, we introduce a novel runtime prediction model that focuses on fine-grained layer tasks with different computational properties for more accurate identification of efficient network configurations. Comprehensive experiments show that our method outperforms prior works in creating a set of models with a trade-off between accuracy and inference speed on the ImageNet datasets for various ConvNets.
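For reference, the baseline compound-scaling rule that such work refines grows depth, width, and resolution jointly by per-dimension coefficients raised to a single scaling exponent. A generic sketch (coefficients and baseline dimensions are illustrative placeholders, not the paper's learned width-resolution relationship) is:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15,
                   base_depth=18, base_width=64, base_resolution=224):
    """EfficientNet-style compound scaling: depth, width, and resolution grow
    jointly as alpha**phi, beta**phi, gamma**phi for a scaling exponent phi.
    The coefficients and baseline network here are illustrative only."""
    depth = round(base_depth * alpha ** phi)
    width = round(base_width * beta ** phi)
    resolution = round(base_resolution * gamma ** phi)
    flops_factor = alpha * beta ** 2 * gamma ** 2   # growth per unit of phi
    return depth, width, resolution, flops_factor

for phi in range(4):
    print(phi, compound_scale(phi))
```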
Spiking Neural Networks (SNNs) are the new third generation of bio-mimetic neural networks, suitable for large-scale parallel computation due to their advantages of low power consumption and low latency. However, most of the training algorithms and network architectures of existing SNNs are designed on the basis of traditional Artificial Neural Networks (ANNs), which require a large number of time steps for inference and have high requirements for membrane potential storage space, resulting in large latency and large memory consumption. In this paper, we propose a spiking neurons-shared ResNet network (Spiking-NSNet) for image classification and a spiking semantic segmentation network (Spiking-SSegNet) for image segmentation, based on our designed neurons-shared architecture and hybrid attenuation strategy. Firstly, a novel Neurons-Shared Block (NS-Block) is designed for locally sharing the membrane potential parameters of neurons, reducing parameters and accelerating inference. Secondly, different attenuation factors are set for neurons in different NS-Blocks, so that different neurons have different activities and better match biological dynamic characteristics. Finally, a temporal correlated (TC) loss algorithm is designed to optimize the SNN direct training process for faster convergence and better performance. Based on the above improvements, the Spiking-NSNet and the Spiking-SSegNet are designed using the architectures of ResNet and UNet, respectively, and are trained by realizing the pre-training and transfer learning of SNNs for the first time. The experiments show that the proposed Spiking-NSNet obtains high recognition accuracies of 94.65%, 77.4% and 79% with a low latency of four time steps on the static datasets CIFAR-10 and CIFAR-100 and the dynamic dataset DVS-CIFAR-10. The mIoUs of the designed Spiking-SSegNet reach 43.2% and 53.4% on the static dataset PASCAL-VOC2012 and the dynamic dataset DDD17. Thus, for recognition and segmentation tasks, the proposed methods effectively reduce the number of time steps and model parameters for training and inference while remaining comparable to traditional ANN models.
Weather forecasting is a fundamental problem for anticipating and mitigating the impacts of climate change. Recently, data-driven approaches for weather forecasting based on deep learning have shown great promise, achieving accuracies that are competitive with operational systems. However, those methods often employ complex, customized architectures without sufficient ablation analysis, making it difficult to understand what truly contributes to their success. Here we introduce Stormer, a simple transformer model that achieves state-of-the-art performance on weather forecasting with minimal changes to the standard transformer backbone. We identify the key components of Stormer through careful empirical analyses, including weather-specific embedding, randomized dynamics forecast, and pressure-weighted loss. At the core of Stormer is a randomized forecasting objective that trains the model to forecast the weather dynamics over varying time intervals. During inference, this allows us to produce multiple forecasts for a target lead time and combine them to obtain better forecast accuracy. On WeatherBench 2, Stormer performs competitively at short to medium-range forecasts and outperforms current methods beyond 7 days, while requiring orders-of-magnitude less training data and compute. Additionally, we demonstrate Stormer's favorable scaling properties, showing consistent improvements in forecast accuracy with increases in model size and training tokens. Code and checkpoints are available at https://github.com/tung-nd/stormer.
One of the roadblocks for training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train with one specific embodiment for one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task and embodiment agnostic shared representation. This general architecture aligns the specific proprioception and vision inputs from distinct embodiments to a short sequence of tokens and then processes such tokens to map to control robots for different tasks. Leveraging the recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed robots, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives, to the extent of 52 datasets. HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. See the project website (https://liruiw.github.io/hpt/) for code and videos.
Sharpness-Aware Minimization (SAM) is a powerful technique for discovering generalizable solutions, yet its Euclidean formulation restricts its effectiveness for data with inherent hierarchical structures. We introduce HyperbolicSAM, a novel optimizer that extends the principles of SAM to the Poincaré ball manifold. Our framework overcomes the notorious instability of hyperbolic neural networks by employing a principled Riemannian gradient computation and an adaptive parameter perturbation strategy. Critically, we introduce a hybrid optimization scheme that applies distinct learning dynamics to Euclidean and hyperbolic parameters, effectively mitigating gradient explosion issues. This approach demonstrably enhances performance; on the CIFAR-10 dataset, HyperbolicSAM reduces the error rate to 2.34% from the 2.86% achieved by its Euclidean counterpart, with more pronounced advantages on datasets with complex topological structures. Our work provides a robust and theoretically grounded pathway for applying sharpness-aware optimization in non-Euclidean geometric deep learning.
While the Transformer architecture has achieved remarkable success across various domains, a thorough theoretical foundation explaining its optimization dynamics is yet to be fully developed. In this study, we aim to bridge this understanding gap by answering the following two core questions: (1) Which types of Transformer architectures allow Gradient Descent (GD) to achieve guaranteed convergence? and (2) Under what initial conditions and architectural specifics does the Transformer achieve rapid convergence during training? By analyzing the loss landscape of a single Transformer layer using Softmax and Gaussian attention kernels, our work provides concrete answers to these questions. Our findings demonstrate that, with appropriate weight initialization, GD can train a Transformer model (with either kernel type) to achieve a globally optimal solution, especially when the input embedding dimension is large. Nonetheless, certain scenarios highlight potential pitfalls: training a Transformer using the Softmax attention kernel may sometimes lead to suboptimal local solutions. In contrast, the Gaussian attention kernel exhibits much more favorable behavior. Our empirical study further validates the theoretical findings.
In this letter we study the proximal gradient dynamics. This recently-proposed continuous-time dynamics solves optimization problems whose cost functions are separable into a nonsmooth convex and a smooth component. First, we show that the cost function decreases monotonically along the trajectories of the proximal gradient dynamics. We then introduce a new condition that guarantees exponential convergence of the cost function to its optimal value, and show that this condition implies the proximal Polyak-Łojasiewicz condition. We also show that the proximal Polyak-Łojasiewicz condition guarantees exponential convergence of the cost function. Moreover, we extend these results to time-varying optimization problems, providing bounds for equilibrium tracking. Finally, we discuss applications of these findings, including the LASSO problem, certain matrix based problems and a numerical experiment on a feed-forward neural network.
There have been many recent efforts to study accelerated optimization algorithms from the perspective of dynamical systems. In this paper, we focus on the robustness properties of the time-varying continuous-time version of these dynamics. These properties are critical for the implementation of accelerated algorithms in feedback-based control and optimization architectures. We show that a family of dynamics related to the continuous-time limit of Nesterov’s accelerated gradient method can be rendered unstable under arbitrarily small bounded disturbances. Indeed, while solutions of these dynamics may converge to the set of optimizers, in general, this set may not be uniformly asymptotically stable. To induce uniformity, and robustness as a byproduct, we propose a framework where the dynamics are regularized by using resetting mechanisms that are modeled by well-posed hybrid dynamical systems. For these hybrid dynamics, we establish uniform asymptotic stability and robustness properties, as well as convergence rates that are similar to those of the non-hybrid dynamics. We finish by characterizing a family of discretization mechanisms that retain the main stability and robustness properties of the hybrid algorithms.
Quantum-approximate optimization algorithm (QAOA) is promising in Noisy Intermediate-Scale Quantum (NISQ) computers with applications for NP-hard combinatorial optimization problems. It is recently utilized for the NP-hard maximum-likelihood (ML) detection problem with challenges of optimization, simulation and performance analysis for $n \times n$ multiple-input multiple output (MIMO) systems with large $n$. QAOA is recently applied by Farhi et al. on the infinite size limit of the Sherrington-Kirkpatrick (SK) model with a cost model including only quadratic terms. In this article, we extend the model by including also linear terms and then realize SK modeling of massive MIMO ML detection. The proposed design targets near ML performance while with complexity including $O(16^{p})$ initial operations independent from problem instance and size $n$ for optimizing QAOA angles and $O(n^{2} p)$ quantum operations for each instance. We provide both optimized and extrapolated angles for $p \in [1, 14]$ and signal-to-noise (SNR) < 12 dB achieving near-optimum ML performance with $p \geq 4$ for $25 \times 25$ and $12 \times 12$ MIMO systems modulated with BPSK and QPSK, respectively. We present two conjectures about concentration properties of QAOA and near-optimum performance for next generation massive MIMO systems covering $n < 300$.
Designing an optimal CNN for each embedded device with a different resource budget would be time-consuming and inefficient. Network scaling provides a viable solution to tackle this challenge. In this work, we propose a novel network scaling strategy called RBIS (resolution-based incremental scaling). Unlike previous works that consider the width, depth, and input resolution together, we first find the input resolution candidates on a given hardware platform. For each resolution candidate, we incrementally scale the depth and width of each stage up to the limit of available resources. Comparison with other scaling methods proves the superiority of the proposed scaling methodology. RBIS finds a more accurate model by up to 0.53% for EfficientNet-B1 and 0.67% for the S3NAS B2 scale.
Deep neural networks trained end-to-end to map a measurement of a (noisy) image to a clean image perform excellently on a variety of linear inverse problems. Current methods are only trained on a few hundred or thousand images, as opposed to the millions of examples deep networks are trained on in other domains. In this work, we study whether major performance gains are expected from scaling up the training set size. We consider image denoising, accelerated magnetic resonance imaging, and super-resolution and empirically determine the reconstruction quality as a function of training set size, while simultaneously scaling the network size. For all three tasks we find that an initially steep power-law scaling slows significantly already at moderate training set sizes. Interpolating those scaling laws suggests that even training on millions of images would not significantly improve performance. To understand the expected behavior, we analytically characterize the performance of a linear estimator learned with early stopped gradient descent. The result formalizes the intuition that once the error induced by learning the signal model is small relative to the error floor, more training examples do not improve performance.
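The kind of extrapolation described above amounts to fitting an error-versus-dataset-size curve of the form a·n^(-b) + c, where c is the error floor. A minimal curve-fitting sketch on synthetic points (all numbers are made up for illustration, not the paper's measurements) is:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Error model with a floor: err(n) = a * n**(-b) + c, used to extrapolate
    reconstruction quality versus training-set size."""
    return a * np.power(n, -b) + c

# Synthetic (training-set size, error) pairs standing in for measured points.
n = np.array([100, 300, 1_000, 3_000, 10_000, 30_000], dtype=float)
err = 2.0 * n ** -0.3 + 0.05 + np.random.default_rng(0).normal(0, 0.003, n.size)

(a, b, c), _ = curve_fit(power_law, n, err, p0=(1.0, 0.5, 0.01))
print(f"fit: err ~ {a:.2f} * n^(-{b:.2f}) + {c:.3f}")
print("extrapolated error at n=1e6:", power_law(1e6, a, b, c))
```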
Solar generation forecasting models are a key element in unlocking prosumer flexibility and improved grid operation. As learning-based forecasting models still face challenges in practice due to limited data availability, transfer learning offers a solution to this drawback. However, current studies limit their focus to accuracy assessments of resulting models and do not consider the impact or interaction of data quantity, model size, and input features. This study overcomes this gap by pre-training a Long Short-Term Memory on varying volumes of source data and utilizing integrated gradients to understand model predictions beyond accuracy. Using real-world data from ten sites for one year, we show that increasing source data volumes improve zero-shot model performance, but only up to a certain point. We find that the strongest pre-trained model outperforms a baseline persistence model by 36%, despite having seen no actual site data whatsoever. Perhaps more surprisingly, no correlation seems to exist between model size and available data volume, thus going against the established grain. Additionally, the integrated gradients reveal the internal parametrization of the model, including potentially irrelevant features. These results increase the understanding of the solar power forecasting problem, guiding effective transfer learning in data-limited scenarios. By deviating from known neural scaling laws in other domains, it also points towards future models that can effectively leverage greater amounts of source data.
Spiking neural networks (SNNs) well support spatiotemporal learning and energy-efficient event-driven hardware neuromorphic processors. As an important class of SNNs, recurrent spiking neural networks (RSNNs) possess great computational power. However, the practical application of RSNNs is severely limited by challenges in training. Biologically-inspired unsupervised learning has limited capability in boosting the performance of RSNNs. On the other hand, existing backpropagation (BP) methods suffer from high complexity of unrolling in time, vanishing and exploding gradients, and approximate differentiation of discontinuous spiking activities when applied to RSNNs. To enable supervised training of RSNNs under a well-defined loss function, we present a novel Spike-Train level RSNNs Backpropagation (ST-RSBP) algorithm for training deep RSNNs. The proposed ST-RSBP directly computes the gradient of a rate-coded loss function defined at the output layer of the network w.r.t. tunable parameters. The scalability of ST-RSBP is achieved by the proposed spike-train level computation during which the temporal effects of the SNN are captured in both the forward and backward pass of BP. Our ST-RSBP algorithm can be broadly applied to RSNNs with a single recurrent layer or deep RSNNs with multiple feed-forward and recurrent layers. Based upon challenging speech and image datasets including TI46, N-TIDIGITS, Fashion-MNIST and MNIST, ST-RSBP is able to train RSNNs with an accuracy surpassing that of the current state-of-the-art SNN BP algorithms and conventional non-spiking deep learning models.
No abstract available
This article introduces a model-based approach for training feedback controllers for an autonomous agent operating in a highly non-linear (albeit deterministic) environment. We desire the trained policy to ensure that the agent satisfies specific task objectives and safety constraints, both expressed in Discrete-Time Signal Temporal Logic (DT-STL). One advantage for reformulation of a task via formal frameworks, like DT-STL, is that it permits quantitative satisfaction semantics. In other words, given a trajectory and a DT-STL formula, we can compute the robustness, which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories satisfying the formula. We utilize feedback control, and we assume a feed forward neural network for learning the feedback controller. We show how this learning problem is similar to training recurrent neural networks (RNNs), where the number of recurrent units is proportional to the temporal horizon of the agent’s task objectives. This poses a challenge: RNNs are susceptible to vanishing and exploding gradients, and naïve gradient descent-based strategies to solve long-horizon task objectives thus suffer from the same problems. To address this challenge, we introduce a novel gradient approximation algorithm based on the idea of dropout or gradient sampling. One of the main contributions is the notion of controller network dropout, where we approximate the NN controller in several timesteps in the task horizon by the control input obtained using the controller in a previous training step. We show that our control synthesis methodology can be quite helpful for stochastic gradient descent to converge with less numerical issues, enabling scalable back-propagation over longer time horizons and trajectories over higher-dimensional state spaces. We demonstrate the efficacy of our approach on various motion planning applications requiring complex spatio-temporal and sequential tasks ranging over thousands of timesteps.
We present RHODE, a novel system that enables privacy-preserving training of and prediction on Recurrent Neural Networks (RNNs) in a cross-silo federated learning setting by relying on multiparty homomorphic encryption. RHODE preserves the confidentiality of the training data, the model, and the prediction data; and it mitigates federated learning attacks that target the gradients under a passive-adversary threat model. We propose a packing scheme, multi-dimensional packing, for a better utilization of Single Instruction, Multiple Data (SIMD) operations under encryption. With multi-dimensional packing, RHODE enables the efficient processing, in parallel, of a batch of samples. To avoid the exploding gradients problem, RHODE provides several clipping approximations for performing gradient clipping under encryption. We experimentally show that the model performance with RHODE remains similar to non-secure solutions both for homogeneous and heterogeneous data distributions among the data holders. Our experimental evaluation shows that RHODE scales linearly with the number of data holders and the number of timesteps, sub-linearly and sub-quadratically with the number of features and the number of hidden units of RNNs, respectively. To the best of our knowledge, RHODE is the first system that provides the building blocks for the training of RNNs and its variants, under encryption in a federated learning setting.
Large-scale cyber-physical systems require that control policies are distributed, that is, that they only rely on local real-time measurements and communication with neighboring agents. Optimal Distributed Control (ODC) problems are, however, highly intractable even in seemingly simple cases. Recent work has thus proposed training Neural Network (NN) distributed controllers. A main challenge of NN controllers is that they are not dependable during and after training, that is, the closed-loop system may be unstable, and the training may fail due to vanishing and exploding gradients. In this paper, we address these issues for networks of nonlinear port-Hamiltonian (pH) systems, whose modeling power ranges from energy systems to non-holonomic vehicles and chemical reactions. Specifically, we embrace the compositional properties of pH systems to characterize deep Hamiltonian control policies with built-in closed-loop stability guarantees, irrespective of the interconnection topology and the chosen NN parameters. Furthermore, our setup enables leveraging recent results on well-behaved neural ODEs to prevent the phenomenon of vanishing gradients by design. Numerical experiments corroborate the dependability of the proposed architecture, while matching the performance of general neural network policies.
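A short numerical sketch of why the port-Hamiltonian structure yields stability "by design": with dynamics dx/dt = (J − R)∇H(x), a skew-symmetric J and a positive-semidefinite R make the energy H non-increasing along trajectories. The matrices, energy function, and integrator below are toy choices, not the paper's controller parameterization.

```python
# Sketch of the built-in stability of port-Hamiltonian dynamics:
# dx/dt = (J - R) grad H(x) with J skew-symmetric, R positive semidefinite
# implies dH/dt = -grad H^T R grad H <= 0. Values are toy numbers.
import numpy as np

def simulate_ph(x0, J, R, grad_H, dt=0.01, steps=2000):
    x = x0.copy()
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * (J - R) @ grad_H(x)        # explicit Euler step
        traj.append(x.copy())
    return np.array(traj)

J = np.array([[0.0, 1.0], [-1.0, 0.0]])         # skew-symmetric interconnection
R = np.diag([0.1, 0.1])                         # dissipation (PSD)
grad_H = lambda x: x                            # quadratic energy H = 0.5 * |x|^2

traj = simulate_ph(np.array([1.0, 0.0]), J, R, grad_H)
energies = 0.5 * np.sum(traj ** 2, axis=1)
assert np.all(np.diff(energies) <= 1e-9)        # energy is non-increasing
```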
This study presents a tissue-independent neural network approach for predicting needle deflection in minimally invasive procedures. Precision in percutaneous interventions critically depends on accurate needle deflection estimation. A Radial Basis Function Network (RBFN) with Xavier initialization is implemented to ensure optimal weight scaling, preventing vanishing/exploding gradients and enhancing network stability. Additionally, a triangular learning rate scheduler is used to adaptively adjust learning rates, accelerating convergence and improving generalization across various insertion depths. RBFN’s localized activation mechanism ensures that deflection predictions are primarily influenced by data points near cluster centroids, effectively capturing region-specific needle-tissue interaction dynamics. The model represents needle-tissue interaction as a combination of a distributed force and a concentrated force at the bevel tip, the primary cause of deflection. Experimental validation across ten trials (90 insertions) shows a mean absolute error of 2.1 ± 1.9 mm. Statistical comparison (two-sample t-test, ANOVA) of the proposed neural network model and the conventional beam theory-based approach against the ground truth confirmed that the deflection predicted by the proposed neural network-based approach is closer to the ground truth and is likely to be suitable for clinical use. Note to Practitioners—Recently, minimally invasive surgeries have gained significant attention due to their numerous benefits, such as reduced patient trauma, post-surgical pain and surgical costs. However, needles used in clinical procedures tend to deviate from their intended path when interacting with surrounding tissue, reducing treatment efficiency. To address this issue, researchers have developed needle-tissue interaction models based on the mechanical properties of both needle and tissue. These models estimate deflection based on several inputs that are challenging to measure accurately, especially Young’s modulus in real clinical settings. The proposed model aims to predict needle deflection independent of Young’s modulus. Leveraging the power of neural networks can significantly enhance surgical procedures by enabling more accurate needle trajectory prediction.
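The two ingredients this abstract emphasizes, a Gaussian radial-basis layer and a triangular (cyclical) learning-rate schedule, can be sketched as follows. The synthetic depth-to-deflection data, centre count, widths, and learning rates are assumptions for illustration only, not the authors' trained model.

```python
# Minimal sketch of an RBF regressor with a triangular learning-rate schedule.
# Hyper-parameters and the synthetic needle data are illustrative assumptions.
import numpy as np

def rbf_features(x, centers, width):
    """Gaussian activations: each prediction is dominated by nearby centers."""
    d2 = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def triangular_lr(step, base_lr=1e-3, max_lr=1e-2, cycle=200):
    """Learning rate ramps linearly up then down within each cycle."""
    pos = (step % cycle) / cycle
    tri = 1.0 - abs(2.0 * pos - 1.0)            # 0 -> 1 -> 0 over one cycle
    return base_lr + (max_lr - base_lr) * tri

# Example: fit insertion-depth -> tip-deflection with gradient descent.
rng = np.random.default_rng(0)
depth = rng.uniform(0, 100, size=(200, 1))                    # mm (synthetic)
deflection = 0.002 * depth[:, 0] ** 1.5 + rng.normal(0, 0.3, 200)
centers = np.linspace(0, 100, 12).reshape(-1, 1)
Phi = rbf_features(depth, centers, width=10.0)
w = np.zeros(Phi.shape[1])
for step in range(1000):
    pred = Phi @ w
    grad = Phi.T @ (pred - deflection) / len(deflection)
    w -= triangular_lr(step) * grad
```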
Motor fault diagnosis has attracted wide attention across various manufacturing systems. Traditional neural networks have limitations in extracting temporal features from data. This paper proposes a motor fault diagnosis method based on a spiking convolutional neural network with multi-scale decomposition local features. The method extracts local features of the raw motor fault signals at different scales (frequency and time) using the Discrete Wavelet Transform (DWT), capturing detailed information from various frequency bands, from high-frequency instantaneous changes to low-frequency steady trends. Then, Gaussian population encoding is used to convert these features into spike times, enhancing the accuracy and expressiveness of the feature representation, helping to avoid local optima, and improving the model's generalization performance. To further improve the performance of the network, the Spiking Convolutional Neural Network (SCNN) is combined with Batch Normalization Through Time (BNTT). BNTT performs batch normalization at the temporal level, effectively enhancing the training stability of the neural network, reducing issues like vanishing or exploding gradients, and accelerating the convergence process. In addition, the surrogate gradient method is used to overcome the backpropagation problem in spiking neural networks, allowing the network to be trained smoothly. Finally, experiments and comparisons are conducted on the Induction Motor Data Sets (IMDS) and Case Western Reserve University (CWRU) datasets. The proposed method achieves test accuracies of 99.49% and 96.31% on IMDS and CWRU, respectively. The results show that this method offers high test accuracy at low computational cost.
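A rough sketch of the two pre-processing steps described above, multi-scale wavelet decomposition and Gaussian population spike encoding, is given below. The hand-rolled Haar wavelet, number of levels, neuron count, and time window are assumptions, not the paper's settings.

```python
# Sketch of multi-scale decomposition (hand-rolled Haar DWT) followed by
# Gaussian population encoding of a per-scale feature into spike times.
# All parameters are illustrative assumptions.
import numpy as np

def haar_dwt_levels(signal, levels=3):
    """Split a 1-D signal into detail coefficients per scale plus a coarse trend."""
    details, approx = [], signal.astype(float)
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # high-frequency detail
        approx = (even + odd) / np.sqrt(2)          # low-frequency trend
    return details, approx

def gaussian_population_encode(value, n_neurons=8, vmin=0.0, vmax=1.0, t_max=20):
    """Map a scalar feature to first-spike times via overlapping Gaussian fields."""
    centers = np.linspace(vmin, vmax, n_neurons)
    sigma = (vmax - vmin) / (n_neurons - 1)
    response = np.exp(-0.5 * ((value - centers) / sigma) ** 2)   # in (0, 1]
    return np.round((1.0 - response) * t_max).astype(int)        # strong -> early spike

x = np.sin(np.linspace(0, 20 * np.pi, 1024)) + 0.1 * np.random.randn(1024)
details, trend = haar_dwt_levels(x, levels=3)
energy = np.array([np.mean(d ** 2) for d in details])            # per-scale feature
spike_times = gaussian_population_encode(energy[0], vmin=energy.min(), vmax=energy.max())
```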
Tool condition monitoring (TCM) is a prerequisite to ensure high finishing quality of workpieces in manufacturing automation. One of the most important components in a TCM system is tool wear estimation. How to achieve estimation with high accuracy is still an open question. In the past few decades, recurrent neural networks (RNNs) have shown great success in learning long-term dependence of sequential data. However, traditional RNNs (e.g., the vanilla RNN) suffer from vanishing or exploding gradients as well as long training times when trained through backpropagation through time (BPTT). To address these issues, we propose a gated recurrent unit (GRU) based neural network to estimate tool wear for tool condition monitoring. The GRU neural network can analyze time-series data on multiple time scales and can avoid gradient vanishing during training. A real-world gun drilling experimental dataset is used as a case study for tool condition monitoring in this paper. The performance of the proposed GRU-based TCM approach is compared with other well-known models including support vector regression (SVR) and the multi-layer perceptron (MLP). The experimental results show that the proposed GRU-based TCM approach outperforms the other competing models on this real-world gun drilling dataset.
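A skeleton of a GRU regressor for sequence-to-one tool-wear estimation might look as follows; the channel count, hidden size, window length, and synthetic data are placeholders rather than the paper's gun-drilling configuration.

```python
# Sketch of a sequence-to-one GRU wear estimator. Sizes and data are placeholders.
import torch
import torch.nn as nn

class GRUWearEstimator(nn.Module):
    def __init__(self, n_channels=6, hidden=64, layers=2):
        super().__init__()
        self.gru = nn.GRU(n_channels, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, time, channels)
        out, _ = self.gru(x)
        return self.head(out[:, -1, :])    # wear estimate from the last hidden state

model = GRUWearEstimator()
signals = torch.randn(8, 500, 6)           # 8 windows of 500 sensor samples
wear = torch.rand(8, 1)                    # mm of flank wear (synthetic)
loss = nn.functional.mse_loss(model(signals), wear)
loss.backward()                            # gating mitigates vanilla-RNN gradient issues
```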
The computational substrate of the 21st century is undergoing a radical phase transition. The deterministic certainty that defined the era of Moore’s Law—where performance gains were achieved through the reliable shrinking of transistors without a penalty in power density—has irrevocably collapsed. As the semiconductor industry confronts the breakdown of Dennard scaling and the physical limits of lithography, a new paradigm has emerged: Approximate Computing (AC). This architectural shift, necessitated by the voracious energy demands of generative artificial intelligence and high-performance computing, deliberately trades bit-level precision for gains in energy efficiency and throughput. However, this transition from exactitude to approximation is not merely a technical optimization; it is a profound reordering of the sociotechnical contract between human operators and machine agents. This report, "Accountable Approximation," provides an exhaustive analysis of the implications of this shift. By synthesizing data from energy audits, hardware security research, legal theory, and the geometry of neural loss landscapes, we demonstrate that the introduction of stochastic error into the hardware layer possesses significant, yet largely unexamined, agency. We explore how quantization noise—the arithmetic distortion introduced by reducing numerical precision—interacts with the high-dimensional geometry of deep learning models to disproportionately erode the representation of minority data, effectively embedding bias into the silicon itself. Furthermore, we examine the security paradox where the "fog of error" sanctioned by approximation creates a camouflage for hardware Trojans, rendering traditional redundancy-based detection methods obsolete. Synthesizing the latest findings from the International Energy Agency (IEA), Google’s 2025 environmental reports, and cutting-edge research into "Fair-GPTQ" algorithms, this report argues that the sustainability of the AI revolution hinges on our ability to govern this new "technological unconscious." We propose a framework of Accountable Approximation that demands transparency in error budgets, rigorous auditing of the bias-variance trade-off in hardware, and a modernization of liability laws to address the non-deterministic nature of future computing systems. The era of the perfect machine is over; the era of the accountable machine must begin.
Keywords: Approximate computing, Thermodynamic computing, Hessian Spectrum Analysis, Large language models, Fair-GPTQ, Post-Moore's Law computing, Gradient Normal Disparity
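To make the "quantization noise" discussed above concrete, the toy sketch below uniformly quantizes a weight tensor at several bit-widths and reports the resulting rounding error; the bit-widths and weight distribution are arbitrary, and the sketch says nothing about the fairness effects the report analyzes.

```python
# Toy illustration of quantization noise: uniformly quantizing weights to b bits
# injects a rounding error whose magnitude grows as precision shrinks.
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantizer: round to 2^bits levels over [-max|w|, max|w|]."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=10_000)
for bits in (8, 4, 2):
    err = quantize_uniform(w, bits) - w
    print(bits, "bits  RMS quantization error:", np.sqrt(np.mean(err ** 2)))
```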
Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward pass and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.
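A quick numerical probe of the entropy-collapse failure mode described above: as the initialization scale of the query/key weights grows, softmax attention concentrates and its entropy drops toward zero. The sizes and scales are arbitrary; this is an empirical illustration, not the paper's analytical theory.

```python
# Measure mean attention entropy at initialization for several weight scales.
# Arbitrary sizes; illustrates entropy collapse, not the paper's exact theory.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d = 64, 128
x = rng.normal(size=(n_tokens, d))

for sigma_w in (0.02, 0.2, 2.0):                  # init std of W_q, W_k
    Wq = rng.normal(0, sigma_w, size=(d, d))
    Wk = rng.normal(0, sigma_w, size=(d, d))
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
    attn = softmax(scores, axis=-1)
    entropy = -np.sum(attn * np.log(attn + 1e-12), axis=-1).mean()
    print(f"init std {sigma_w}: mean attention entropy {entropy:.2f} "
          f"(max possible {np.log(n_tokens):.2f})")
```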
3D Gaussian Splatting (3DGS) is a recent technique for real-time scene reconstruction. However, in large outdoor scenes, it often fails to reconstruct peripheral or distant regions accurately. These areas appear in fewer views and receive weaker multi-view supervision, leading to lower gradient signals, fewer Gaussians, and ultimately degraded structural quality. To address this limitation, we propose two complementary enhancements to the 3DGS pipeline. Our Depth-Aware Perceptual Similarity (DAPS) module uses monocular depth to strengthen optimization in weakly supervised regions, leading to sharper edges and improved reconstruction quality. Additionally, we introduce Adaptive Gradient Filtering (AGF), a dynamic densification mechanism that selectively clones Gaussians based on gradient statistics, ensuring visually faithful reconstruction without excessive memory growth. We curate a challenging outdoor benchmark by selecting scenes from the Tanks & Temples and Mip-NeRF 360 datasets, designed to test reconstruction quality and memory efficiency under real-world visual complexity. Experimental results show DAPS improves SSIM by 3.05%, PSNR by 3.94%, and reduces LPIPS by 17.78%, while AGF cuts memory usage by 38.5% over baseline 3DGS without sacrificing image quality.
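A hedged sketch of a gradient-statistics densification rule in the spirit of AGF: clone only Gaussians whose accumulated view-space gradient exceeds an adaptive, percentile-based threshold rather than a fixed one. The actual AGF criterion may differ; the function name, percentile, view-count guard, and synthetic statistics below are assumptions.

```python
# Sketch of gradient-statistics-based densification: select points to clone by an
# adaptive (percentile) threshold on their mean accumulated gradient. Illustrative only.
import numpy as np

def select_for_densification(grad_accum, n_views_seen, percentile=90.0, min_views=3):
    """Return indices of Gaussians to clone based on per-point gradient statistics."""
    mean_grad = grad_accum / np.maximum(n_views_seen, 1)
    threshold = np.percentile(mean_grad, percentile)   # adapts to the scene
    return np.where((mean_grad > threshold) & (n_views_seen >= min_views))[0]

rng = np.random.default_rng(0)
grad_accum = rng.gamma(2.0, 1e-4, size=100_000)        # accumulated |dL/d(xy)| (synthetic)
n_views = rng.integers(1, 40, size=100_000)
clone_ids = select_for_densification(grad_accum, n_views)
```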
No abstract available
Spiking neural networks (SNNs) present a promising computing paradigm for neuromorphic processing of event-based sensor data. The resonate-and-fire (RF) neuron, in particular, appeals through its biological plausibility, complex dynamics, yet computational simplicity. Despite theoretically predicted benefits, challenges in parameter initialization and efficient learning have inhibited the implementation of RF networks, constraining their use to a single layer. In this paper, we address these shortcomings by deriving the RF neuron as a structured state space model (SSM) from the HiPPO framework. We introduce S5-RF, a new SSM layer composed of RF neurons based on the S5 model, that features a generic initialization scheme and fast training within a deep architecture. S5-RF scales an RF network for the first time to a deep SNN with up to four layers and, at 78.8%, achieves a new state-of-the-art result for recurrent SNNs on the Spiking Speech Commands dataset in under three hours of training time. Moreover, compared to the reference SNNs that solve our benchmarking tasks, it achieves similar performance with far fewer spiking operations. Our code is publicly available at https://github.com/ThomasEHuber/s5-rf.
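A minimal discretized resonate-and-fire update, independent of the S5 parameterisation, can be written as a damped complex oscillator that fires when its imaginary part crosses a threshold; the decay, frequency, threshold, and drive signals below are illustrative assumptions, not the S5-RF initialization.

```python
# Minimal resonate-and-fire neuron: a driven, damped complex oscillator that
# emits a spike when its imaginary part crosses a threshold. Parameters are
# illustrative; this is a plain RF update, not the S5-RF state-space layer.
import numpy as np

def rf_neuron(inputs, b=-0.1, omega=2.0 * np.pi * 5, dt=1e-3, threshold=1.0):
    u = 0.0 + 0.0j                                # complex membrane state
    spikes = np.zeros(len(inputs))
    for t, x in enumerate(inputs):
        u = u + dt * ((b + 1j * omega) * u + x)   # resonate: rotate + decay + drive
        if u.imag > threshold:                    # fire near the oscillation peak
            spikes[t] = 1.0
            u = complex(u.real, 0.0)              # soft reset of the imaginary part
    return spikes

t = np.arange(0, 1.0, 1e-3)
drive = 5.0 * np.sin(2.0 * np.pi * 5 * t)         # input at the resonant frequency
resonant_spikes = rf_neuron(drive)
off_resonant_spikes = rf_neuron(5.0 * np.sin(2.0 * np.pi * 23 * t))
print(resonant_spikes.sum(), off_resonant_spikes.sum())   # resonant drive spikes more
```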
The rapid advancement of deep learning has motivated various analog computing devices for energy-efficient non-von Neumann computing. While recent demonstrations have shown their excellent performance, particularly in the inference phase, computation of training using analog hardware is still challenging due to the complexity of training algorithms such as backpropagation. Here, we present an alternative training algorithm that combines two emerging concepts: reservoir computing (RC) and biologically inspired training. Instead of backpropagated errors, the proposed method computes the error projection using nonlinear dynamics (i.e., a reservoir), which is highly suitable for physical implementation because it only requires a single passive dynamical system with a smaller number of nodes. Numerical simulation with Lyapunov analysis showed some interesting features of the proposed algorithm itself: the reservoir should basically be selected to satisfy the echo-state property; even chaotic dynamics can be used for training when its time scale is below the Lyapunov time; and the performance is maximized near the edge of chaos, similar to the standard RC framework. Furthermore, we experimentally demonstrated the training of feedforward neural networks by using an optoelectronic reservoir computer. Our approach provides an alternative solution for deep learning computation and its physical acceleration.
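A standard echo-state-network baseline captures the reservoir idea referenced above: a fixed random recurrent reservoir (spectral radius below one for the echo-state property) whose only trained part is a linear readout fitted by ridge regression. The sizes, spectral radius, and toy prediction task are assumptions, not the paper's optoelectronic setup.

```python
# Echo state network baseline: fixed random reservoir + ridge-regressed readout.
# Sizes, spectral radius, and the one-step-ahead sine task are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 300
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))       # spectral radius < 1 (echo-state property)

def run_reservoir(u_seq):
    x, states = np.zeros(n_res), []
    for u in u_seq:
        x = np.tanh(W @ x + W_in @ np.atleast_1d(u))   # tanh reservoir update
        states.append(x.copy())
    return np.array(states)

# Task: one-step-ahead prediction of a noisy sine wave.
t = np.linspace(0, 50, 2000)
u = np.sin(t) + 0.05 * rng.normal(size=t.size)
X = run_reservoir(u[:-1])
y = u[1:]
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)   # trained readout
pred = X @ W_out
```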
The final merged set of references builds a complete framework running from "physical equivalence modeling" through "model scaling theory" to "system stability guarantees". The research focus is fourfold: 1) using multi-scale neural networks to capture scaling regularities in physical phenomena; 2) using scaling laws to predict the performance evolution of large-scale models; 3) addressing the exploding-gradient and stability challenges that deep learning faces when simulating high-energy, transient processes (such as explosion-related gradient fluctuations); and 4) exploring the unique advantages of spiking neural networks and novel hardware for equivalence-computation tasks. Together, these provide multi-dimensional theoretical and technical support for neural-computational simulation of explosion scaling equivalence.