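Electromagnetics, Multimodal
Multimodal Perception and Fusion in Autonomous Driving and Unmanned Systems
This group of papers focuses on autonomous driving, unmanned aerial vehicle (UAV), and unmanned surface vehicle (USV) scenarios. By fusing millimeter-wave/4D radar with vision sensors and using BEV spatial representations, Transformer architectures, and deep learning algorithms, these works address object detection, environment perception, and safety in complex environments.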
- Target Detection for USVs by Radar–Vision Fusion With Swag-Robust Distance-Aware Probabilistic Multimodal Data Association(Zhenglin Li, Tianxin Yuan, Liyan Ma, Yang Zhou, Yan Peng, 2024, IEEE Sensors Journal)
- A multi-robot system for the detection of explosive devices(Ken Hasselmann, Mario Malizia, R. Caballero, Fabio Polisano, S. Govindaraj, Jakob Stigler, Oleksii Ilchenko, M. Bajic, G. D. Cubber, 2024, arXiv.org)
- 融合毫米波雷达与机器视觉的雾天车辆检测(李颀, 叶小敏, 冯文斌, 2023, 液晶与显示)
- Radar-vision multimodal fusion for dynamic target trajectory prediction and threat assessment in power transmission corridors(Jun Zhang, Zhiwei Zhang, Xiao Tan, Jin Ling, Qingjian Deng, 2026, Scientific Reports)
- SparseFusion3D: Sparse Sensor Fusion for 3D Object Detection by Radar and Camera in Environmental Perception(Zedong Yu, Weibing Wan, Maiyu Ren, Xiuyuan Zheng, Zhijun Fang, 2024, IEEE Transactions on Intelligent Vehicles)
- RCFusion: Fusing 4-D Radar and Camera With Bird’s-Eye View Features for 3-D Object Detection(Lianqing Zheng, Sen Li, Bin Tan, Long Yang, Sihan Chen, Libo Huang, Jie Bai, Xichan Zhu, Zhixiong Ma, 2023, IEEE Transactions on Instrumentation and Measurement)
- CRN: Camera Radar Net for Accurate, Robust, Efficient 3D Perception(Youngseok Kim, Sanmin Kim, J. Shin, Junwon Choi, Dongsuk Kum, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- Augmented Millimeter Wave Radar and Vision Fusion Simulator for Roadside Perception(Haodong Liu, Jian Wan, Peng Zhou, Shanshan Ding, Wei Huang, 2024, Electronics)
- Unleashing HyDRa: Hybrid Fusion, Depth Consistency and Radar for Unified 3D Perception(Philipp Wolters, Johannes Gilg, Torben Teepe, Fabian Herzog, Anouar Laouichi, M. Hofmann, Gerhard Rigoll, 2024, 2025 IEEE International Conference on Robotics and Automation (ICRA))
- A Survey of Deep Learning Based Radar and Vision Fusion for 3D Object Detection in Autonomous Driving(Di Wu, Feng Yang, Benlian Xu, Pan Liao, Bo Liu, 2024, arXiv.org)
- Zfusion: an Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving(Sheng Yang, Tong Zhan, Shichen Qiao, Jicheng Gong, Qing Yang, Jian Wang, Yanfeng Lu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- UniBEVFusion: Unified Radar-Vision Bevfusion for 3D Object Detection(Haocheng Zhao, Runwei Guan, Taoyu Wu, Ka Lok Man, Limin Yu, Yutao Yue, 2024, 2025 IEEE International Conference on Robotics and Automation (ICRA))
- Multi-camera Bird's Eye View Perception for Autonomous Driving(David Unger, Nikhil Gosala, Varun Ravi Kumar, Shubhankar Borse, Abhinav Valada, S. Yogamani, 2023, arXiv.org)
- FARFusion: A Practical Roadside Radar-Camera Fusion System for Far-Range Perception(Yao Li, Yingjie Wang, Chengzhen Meng, Yifan Duan, Jianmin Ji, Yu Zhang, Yanyong Zhang, 2024, IEEE Robotics and Automation Letters)
- RVIFNet: Radar–Visual Information Fusion Network for All-Weather Vehicle Perception in Roadside Monitoring Scenarios(Kong Li, Hua Cui, Zhe Dai, Huansheng Song, 2024, IEEE Sensors Journal)
- 基于Transformer的毫米波雷达/激光雷达/相机融合3D目标检测方法(陈坤泽, 刘晓晨, 申冲)
- Real-Time Volumetric Perception for Unmanned Surface Vehicles Through Fusion of Radar and Camera(Hu Xu, Xiaomin Zhang, Ju He, Yang Yu, Yuwei Cheng, 2024, IEEE Transactions on Instrumentation and Measurement)
- GaussianCaR: Gaussian Splatting for Efficient Camera-Radar Fusion(Santiago Montiel-Marín, Miguel Antunes-García, Fabio Sánchez-García, Ángel Llamazares, Holger Caesar, L. M. Bergasa, 2026, arXiv.org)
- Robust BEV Perception via Dual 4D Radar–Camera Fusion Under Adverse Conditions with Fog-Aware Enhancement(Zhengqing Li, B. Singh, 2026, Electronics)
- UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data(Longkun Zou, Jiale Wang, Rongqin Liang, Hai Wu, Ke Chen, Yaowei Wang, 2025, arXiv.org)
- Achelous++: Power-Oriented Water-Surface Panoptic Perception Framework on Edge Devices based on Vision-Radar Fusion and Pruning of Heterogeneous Modalities(Runwei Guan, Haocheng Zhao, Shanliang Yao, Ka Lok Man, Xiaohui Zhu, Limin Yu, Yong Yue, Jeremy S. Smith, Eng Gee Lim, Weiping Ding, Yutao Yue, 2023, arXiv.org)
- BEV-Guided Multi-Modality Fusion for Driving Perception(Yunze Man, Liangyan Gui, Yu-Xiong Wang, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
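Electromagnetic Signal Recognition, Integrated Sensing and Communication, and Anti-Jamming Techniques
This group of papers focuses on signal processing in complex electromagnetic environments, covering modulation recognition, integrated sensing and communication (ISAC), anti-jamming decision-making, and analysis of physical characteristics. These works use multimodal feature fusion, cross-modal attention mechanisms, and contrastive learning to improve spectrum cognition.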
- 基于特征融合的电磁信号对抗样本检测方法(徐东伟, 蒋斌, 陈嘉峻, 宣琦, 王巍, 赵文红, 杨小牛, 2023, 电波科学学报)
- 基于多模特征融合的雷达干扰信号识别(魏赓力, 李凉海, 闫海鹏, 李世宝, 杨爽, 2023, 宇航遥测遥控)
- Electromagnetic Spectrum Situation Cognition and Conflict Resolution Method Based on Multimodal Causal Transformer (MC-Trans)(Ziyu Wang, 2025, 2025 IEEE International Conference on Communication Networks and Computing (CNC))
- A Cross-Modality Contrastive Learning Method for Radar Jamming Recognition(Ganggang Dong, Zixuan Wang, Hongwei Liu, 2025, IEEE Transactions on Instrumentation and Measurement)
- CAEP: Cross-Modal Adaptive Embedding Prediction for Self-Supervised Modulation Classification(Yuanfeng Wu, Yu Hong, Zuqi Ma, Ao Wu, Xiangsong Huang, Mengfan Xue, Shuyuan Yang, 2026, Electronics)
- 基于多模态融合的小样本轻量化雷达有源干扰识别算法(张钟升, 李凉海, 张剑琦, 师亚辉, 2025, 遥测遥控)
- CMA: A Cross-Modal Attack on Radar Signal Recognition Model Based on Time-Frequency Analysis(Mengchao Wang, Sicheng Zhang, Qi Xuan, Yun Lin, 2024, ICC 2024 - IEEE International Conference on Communications)
- A Unified Anti-Jamming Design in Complex Environments Based on Cross-Modal Fusion and Intelligent Decision-Making(Huake Wang, Xu Han, B. Cai, Guisheng Liao, Yinghui Quan, 2025, IEEE Transactions on Aerospace and Electronic Systems)
- 生成式人工智能赋能无线电频谱认知:进展与挑战(刘志远, 宋令阳, 刘庆昱, 张舒航, 张泓亮, 2026, 国防科技大学学报)
- Contrastive Learning-Based Multimodal Fusion Model for Automatic Modulation Recognition(Fugang Liu, Jingyi Pan, Ruolin Zhou, 2024, IEEE Communications Letters)
- Broadband electromagnetic signal modulation identification model based on multimodal fusion Transformer(Kaiyuan Jiang, Tong Ding, 2025, 2025 IEEE 7th International Conference on Civil Aviation Safety and Information Technology (ICCASIT))
- Electromagnetic Imaging Boosted Visual Object Recognition Under Difficult Visual Conditions(Min Tan, Tao Jin, Danhui Ye, Kuiwen Xu, Xiaoling Gu, Jun Yu, 2023, IEEE Transactions on Geoscience and Remote Sensing)
- EMind: A Foundation Model for Multi-task Electromagnetic Signals Understanding(Luqing Luo, Wenjin Gui, Yunfei Liu, Ziyue Zhang, Yunxi Zhang, Fengxiang Wang, Zonghao Guo, Zizhi Ma, Xinzhu Liu, Hanxiang He, Jinhai Li, Xin Qiu, Wupeng Xie, Yangang Sun, 2025, arXiv.org)
- Automatic Modulation Recognition Using Hybrid Modal Representation in Complicated Electromagnetic Environment(Sijia Yan, Jiang Wang, Hongying Tang, Rui Guo, 2025, IEEE Internet of Things Journal)
- 雷达通信一体化波形设计综述(唐波, 吴文俊, 夏学成, 王雄鹏, 2026, 国防科技大学学报)
- RFusion: Dynamic Multimodal RF Fusion for Few-Shot Human Activity Recognition(Chao Feng, Jiashen Chen, Shuo Liang, Xiaopeng Peng, Baizhou Yang, Xuan Wang, Zexuan Huang, Xianji Meng, Xiaojiang Chen, 2026, IEEE Transactions on Mobile Computing)
- Multimodal Fusion-Based Channel Prediction and Characterization for mmWave UAV A2G Communications(Zhichao Xin, Yu Liu, Jianping Xing, Jie Huang, Ji Bian, Zongkai Bai, Chuanteng Wang, 2026, IEEE Transactions on Communications)
- A Multi-Modal Foundational Model for Wireless Communication and Sensing(Vahid Yazdnian, Yasaman Ghasempour, 2026, arXiv.org)
- CRFusion: Fine-Grained Object Identification Using RF-Image Modality Fusion(Liyang Xiao, Yanni Yang, Zhe Chen, Yue Gao, Prasant Mohapatra, Pengfei Hu, 2025, IEEE Transactions on Mobile Computing)
- 随机信号体制下MIMO通信感知一体化系统收发预编码设计(刘凡, 卢仕航, 陈子豪, 2025, 雷达学报(中英文))
- Cross Attention Mechanism Based Multi-Modal Deep Learning Method for Space Electromagnetic Signal Recognition(Yi Wei, Yanghao Wang, Yuan Qiu, Lili Cao, Shangrong Ouyang, 2025, 2025 5th International Conference on Electronic Information Engineering and Computer Science (EIECS))
- Electromagnetic signal recognition using multimodal tri-branch semantic fusion network in the UAV-assist integrated sensing and communication systems(Tiantian Wang, Nan Yan, Chaosan Yang, Zeliang An, Gongjing Zhang, Yuqing Xu, 2025, Digital Signal Processing)
- Novel Hybrid-Learning Algorithms for Improved Millimeter-Wave Imaging Systems(Josiah W. Smith, 2023, arXiv.org)
- Large Language Model-Driven Distributed Integrated Multimodal Sensing and Semantic Communications(Yubo Peng, Luping Xiang, Bingxin Zhang, Kun Yang, 2025, arXiv.org)
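Industrial Inspection, Medical Imaging, and Physiological and Behavioral Monitoring
This group of papers emphasizes non-contact sensing applications. By fusing electromagnetic, radio-frequency, and magnetic sensor data with visual/optical data, these works achieve industrial nondestructive testing, equipment fault diagnosis, human activity recognition, and vital-sign monitoring.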
- RFID-WiFi-Radar Fusion for Health Monitoring: A Cross-Modal Supervision Framework(Xiangguo Li, Yaohua Guo, Xuehong Sun, Liping Liu, Xinjuan Wang, Yanpeng Zhang, Xiaoyong Song, 2025, IEEE Sensors Journal)
- Fusion-Vital: Video-RF Fusion Transformer for Advanced Remote Physiological Measurement(Jae-Ho Choi, Ki-Bong Kang, Kyung-Tae Kim, 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
- XFall: Domain Adaptive Wi-Fi-Based Fall Detection With Cross-Modal Supervision(Guoxuan Chi, Guidong Zhang, Xuan Ding, Qiang Ma, Zheng Yang, Zhenguo Du, Houfei Xiao, Zhuan Liu, 2024, IEEE Journal on Selected Areas in Communications)
- Review on Systems Combining Computer Vision and Radio Frequency Identification(Emanuele Tavanti, P. Nepa, Roberto Gabbrielli, Marco Pirozzi, 2025, IEEE Internet of Things Journal)
- Physical Coupling Fusion of Electromagnetic Acoustic Guided Wave With “Near-Zero” Magnetic Sensing System for Multitype Defect Detection(Qin Tang, Bin Gao, Q. Ma, Gai Ru, Songwen Xue, Wenze Shi, Fei Luo, Wai Lok Woo, 2025, IEEE Transactions on Instrumentation and Measurement)
- Electromagnetic valve fault diagnosis based on multi-source signal feature fusion(W Li, Y Li, J Mao, C Zhu, XM Ye, 2025, Measurement Science and …)
- CNN-ELMNet: fault diagnosis of induction motor bearing based on cross-modal vector fusion(L Yi, Y Huang, J Zhan, Y Wang, T Sun, 2024, Measurement …)
- Cross-modal fusion of external magnetic sensing and simulated 2D imaging for 3D guidewire pose estimation(Wei Wei, Zhengqian Li, Nan Xiao, Zihan Gao, Dongni. Yang, Jianming Guo, Qian Zheng, Jiaqian Li, 2026, Proceedings of the Institution of Mechanical Engineers, Part H: Journal of Engineering in Medicine)
- MFA-Net:一种面向复杂对抗环境的反舰导弹智能识别多模态自适应融合网络(张龙, 朱连宏, 杨波, 雷震, 冯轩铭, 2026, 系统工程与电子技术)
- PSB-Net: Physics-Synergized Bidirectional Network for Robust Cross-Modal Fault Diagnosis of PMSMs in Power Electronic Drive Systems(Dongdong Li, Yueqi Wang, Yang Mi, 2026, IEEE Transactions on Power Electronics)
- Synchronous Imaging and Multimodal Fusion of Optical and Electromagnetic Measurements for Overlapping Defects Inspection(Na Zhang, Haoran Dong, C. Ye, 2024, IEEE Transactions on Industrial Informatics)
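Advanced Radar Imaging, Physical Feature Sensing, and General Multimodal Representation
This group of papers explores the physical mechanisms of radar (such as micro-motion, scattering, and orbital angular momentum) and imaging techniques, and, in combination with large models, physics-informed injection, and embodied intelligence, studies general-purpose multimodal perception frameworks across domains.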
- 基于模态相关性加权与自适应正则化的涡旋电磁波雷达超分辨成像(杨亭, 史洪印, 郭建文, 2025, 雷达学报(中英文))
- Physical Coupling Fusion of Multisensor Data and Multidomain Mixed Features of Electromagnetic Acoustic Sensing System for Stress Measurement(Fasheng Qiu, Biaohua Rao, Xi Chen, Salisu Saad Gharzali, Guiyun Tian, 2026, IEEE Transactions on Instrumentation and Measurement)
- 雷达微弱目标智能化处理技术与应用(陈小龙, 何肖阳, 邓振华, 关键, 杜晓林, 薛伟, 苏宁远, 王金豪, 2024, 雷达学报(中英文))
- Recognition of Micro-Motion Space Targets Based on Attention-Augmented Cross-Modal Feature Fusion Recognition Network(Xudong Tian, Xueru Bai, Feng Zhou, 2023, IEEE Transactions on Geoscience and Remote Sensing)
- 面向雷达智能感知的语义电磁散射建模(徐丰, 张旭, 岳子瑜, 卫江涛, 2025)
- Fine-Grained Image Generation Network With Radar Range Profiles Using Cross-Modal Visual Supervision(Jiacheng Bao, Da Li, Shiyong Li, Guoqiang Zhao, Houjun Sun, Yi Zhang, 2024, IEEE Transactions on Microwave Theory and Techniques)
- Deep-Learning Approach for Developing Bilayered Electromagnetic Interference Shielding Composite Aerogels Based on Multimodal Data Fusion Neural Networks(Chuntao He, Lin-Feng Yu, Yun Jiang, Lan Xie, Xiaoping Mai, Ai Peng, Bai Xue, 2024, Journal of Colloid and …)
- A Novel Multimodal Fusion Sensing-Based Channel Prediction Method for UAV Communications(Zhichao Xin, Yu Liu, Jianping Xing, Jie Huang, Ji Bian, Yi Zhang, 2025, IEEE Internet of Things Journal)
- Electronic skins with multimodal sensing and perception(J. Tu, Ming Wang, Wenlong Li, Jiangtao Su, Yanzhen Li, Zhisheng Lv, Haicheng Li, Xue Feng, Xiaodong Chen, 2023, Soft Science)
- Multi-Modal Fusion Sensing: A Comprehensive Review of Millimeter-Wave Radar and Its Integration With Other Modalities(Shuai Wang, Luoyu Mei, Ruofeng Liu, Wenchao Jiang, Zhimeng Yin, Xianjun Deng, Tian He, 2025, IEEE Communications Surveys & Tutorials)
- Improving rainfall retrieval accuracy using cross-Modal deep learning: Merging wifi with commercial microwave link(Weitao Tao, Bin Lian, Zhongcheng Wei, Luming Song, Lili Huang, Jijun Zhao, 2026, Physical Communication)
- Multimodal Geophysics-Informed Neural Network for Joint Inversion of Seismic Electromagnetic and Well-logging(Cao Song, Wenkai Lu, Shugang Ye, Weihua Yao, Xianxu Zhang, 2026, IEEE Transactions on Geoscience and Remote Sensing)
- RFSensingGPT: A Multi-Modal RAG-Enhanced Framework for Integrated Sensing and Communications Intelligence in 6G Networks(Muhammad Zakir Khan, Yao Ge, Michael S. Mollel, J. Mccann, Q. Abbasi, M. Imran, 2026, IEEE Transactions on Cognitive Communications and Networking)
- R1-Onevision: Advancing Generalized Multimodal Reasoning Through Cross-Modal Formalization(Yi Yang, Xiaoxuan He, H. Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, Wei Chen, 2025, 2025 IEEE/CVF International Conference on Computer Vision (ICCV))
- 具身雷达的概念、架构和发展(徐丰, 雒梅逸香, 卫江涛, 许京伟, 仇晓兰, 武俊杰, 万显荣, 金亚秋, 2026)
- Multi-Modal Electrophysiological Source Imaging With Attention Neural Networks Based on Deep Fusion of EEG and MEG(Meng Jiao, Shihao Yang, Xiaochen Xian, N. Fotedar, Feng Liu, 2024, IEEE Transactions on Neural Systems and Rehabilitation Engineering)
- Multimodal super-resolution: discovering hidden physics and its application to fusion plasmas(A. Jalalvand, Sangkyeun Kim, J. Seo, Q. Hu, Max Curie, Peter Steiner, A. O. Nelson, Yong-Su Na, E. Kolemen, 2024, Nature Communications)
- Flexible Passive Wireless Sensing Platform with Frequency Mapping and Multimodal Fusion.(Kai Wang, Lifeng Wang, Jiawei Si, Rui Wang, Ziyuan Wang, Chuyuan Gao, Jin Yang, Xiaohan Yang, Hanqiang Zhang, Lei Han, 2025, ACS Applied Materials & Interfaces)
- One-stop multi-sensor fusion and multimodal precise quantified traditional Chinese medicine imaging health examination technology(Chuanxue Li, Ping Wang, M. Zheng, Wenxiang Li, Jun Zhou, Lin Fu, 2024, Journal of Radiation Research and Applied Sciences)
- MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources(Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, Wei Chen, 2026, arXiv.org)
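This report divides research on electromagnetic and multimodal perception into four core directions: environment perception for autonomous driving and unmanned systems; electromagnetic signal processing and integrated sensing and communication; precision sensing for industrial, medical, and physiological monitoring; and physics-based radar imaging and general multimodal representation. The research trend shows a shift from traditional sensor fusion toward deep learning paradigms driven by physical mechanisms and toward embodied intelligence.
77 related papers in total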
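With the rapid development of radar electronic jamming technology, the diversity of active jamming faced by radars and the variability of jamming strategies keep increasing, and the need for radars to identify the type of active jamming is increasingly urgent. Traditional feature-based active jamming recognition methods have limited recognition performance and poor generality, while existing deep learning-based methods have large parameter counts and high data requirements, which limits their development and application. To improve active jamming recognition under limited parameter and data budgets, this paper studies a few-shot, lightweight active jamming recognition algorithm based on multimodal fusion. It exploits temporal locality to achieve lightweight fusion of signal time-frequency features and high-resolution range profile features, and uses metric learning and feature retrieval to improve jamming recognition accuracy in few-shot scenarios. Experiments on simulated and measured data show that the proposed method achieves good recognition performance in a variety of cases.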
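The new active jamming patterns faced by traditional guidance radars are becoming increasingly complex, and the radar must discriminate among the various jamming types. Traditional jamming recognition methods are effective only for a specific single pattern, have poor generality and weak generalization ability, and cannot cope with complex and changing jamming countermeasure environments. Therefore, a more intelligent and more robust general-purpose jamming recognition method must be developed to improve the anti-jamming capability of guided weapons. To improve the accuracy of jamming signal recognition, a multimodal feature fusion algorithm is studied, in which time-domain, time-frequency-domain, and information-theoretic features are ultimately fused for classification. Concepts from information theory such as entropy, relative entropy, and relative distance are introduced into jamming signal classification for the first time. Simulation experiments show that common jamming types can be recognized effectively, with good recognition accuracy even at low jamming-to-noise ratios.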
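As a rough illustration of the information-theoretic features mentioned above, the following sketch computes Shannon entropy and relative entropy (Kullback-Leibler divergence) from amplitude histograms. The helper names, the histogram binning, and the choice of base-2 logarithms are assumptions made for illustration; the abstract does not specify how the paper defines these features, and its "relative distance" feature is not reproduced here.

```python
import math

def histogram(samples, bins=8, lo=None, hi=None):
    """Normalized histogram (list of probabilities) of real-valued samples."""
    lo = min(samples) if lo is None else lo
    hi = max(samples) if hi is None else hi
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all samples are equal
    counts = [0] * bins
    for x in samples:
        idx = max(0, min(int((x - lo) / width), bins - 1))  # clamp to valid bin
        counts[idx] += 1
    return [c / len(samples) for c in counts]

def shannon_entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def relative_entropy(p, q, eps=1e-12):
    """Relative entropy D(p || q) in bits; eps guards against empty bins in q."""
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
```

In a classifier of the kind described, such scalars would be concatenated with time-domain and time-frequency features before classification; that pipeline is not shown here.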
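To address the vulnerability of intelligent models for electromagnetic signal modulation recognition to adversarial example attacks, an adversarial example detection method for electromagnetic signals based on feature fusion is proposed. The method first denoises the test sample by variational mode decomposition to obtain a denoised electromagnetic signal sample, then feeds the signal samples before and after denoising into a neural network model, next computes the cosine similarity and the confidence difference between the model output vectors before and after denoising, and finally fuses the two features and feeds them into a neural network model for detection. Compared with baseline methods, the method achieves a higher detection success rate in experiments. The method has low time complexity and is easy to implement, and it provides a novel adversarial example detection approach for intelligent electromagnetic signal modulation recognition models.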
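The two detection features described above can be sketched as follows. This is a minimal illustration, assuming the model outputs are class-probability vectors and that the "confidence difference" is the absolute difference between the top probabilities before and after denoising; the paper's exact definition may differ. The function names are hypothetical, and the denoising step and the final detector network are not shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def confidence_difference(u, v):
    """Assumed definition: absolute gap between the top class probabilities."""
    return abs(max(u) - max(v))

def detection_features(out_raw, out_denoised):
    """Fuse the two scalars into one feature tuple for a downstream detector."""
    return (cosine_similarity(out_raw, out_denoised),
            confidence_difference(out_raw, out_denoised))
```

The intuition is that denoising barely changes the output for a clean sample (similarity near 1, gap near 0) but can change it substantially for an adversarial one.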
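Vortex electromagnetic wave radar (VEWR) exploits the orthogonality of orbital angular momentum (OAM) modes, which in theory provides a new physical dimension for breaking the azimuth resolution limit of conventional radar and thus opens new avenues for target micro-motion sensing and forward-looking imaging. In practice, however, the limited number of available modes and complex electromagnetic noise cause severe mode aliasing and resolution degradation, and existing sparse imaging methods generally suffer from an accuracy-efficiency imbalance and insufficient noise robustness. This paper proposes a super-resolution imaging framework that combines modal correlation weighting and adaptive regularization (MCW-AR). First, the VEWR forward-looking imaging geometry and the wavefront-modulated signal model are constructed. Next, an OAM modal correlation matrix is designed to quantify the non-uniform distribution of radiated energy across modes, and Bessel-function amplitude weighting is used to strengthen the low-rank constraint on the dominant modes. Finally, a composite optimization model with joint sparsity and low-rankness is established, an adaptive weighting mechanism is introduced to dynamically balance structure preservation and noise suppression, and a joint optimization framework based on the alternating direction method of multipliers (ADMM) and the augmented Lagrangian method (ALM) is designed, in which the core image update subproblem is solved efficiently by a momentum-accelerated two-dimensional conjugate gradient least squares (2D-CGLS) method. Numerical and electromagnetic simulation experiments show that the method preserves the structural integrity of targets under limited modes and strong noise, and that computational efficiency and imaging quality are significantly improved.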
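In ADMM-type solvers, a sparsity term is commonly handled by a soft-thresholding step, the proximal operator of the l1 norm. The sketch below shows that generic step for complex-valued image pixels. It is not the MCW-AR algorithm itself: the modal weighting, the low-rank update, and the 2D-CGLS image update are omitted, and the function names are hypothetical.

```python
def soft_threshold(x, t):
    """Complex soft-thresholding: shrink the magnitude of x by t, keep its phase."""
    mag = abs(x)
    if mag <= t:
        return 0j
    return x * (1.0 - t / mag)

def shrink_image(image, t):
    """Apply soft-thresholding to every pixel of a 2D list of complex values."""
    return [[soft_threshold(px, t) for px in row] for row in image]
```

Pixels whose magnitude is below the threshold are set to zero, which is what promotes a sparse image; larger thresholds suppress more noise at the cost of weak scatterers.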
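By reusing random communication signals and implementing integrated sensing and communication (ISAC) on the communication architecture of existing networks, the cost of ISAC can be significantly reduced and the integration of sensing functions into existing communication networks accelerated. However, the randomness of communication data causes random fluctuations in the sensing function, making sensing performance unstable. To obtain robust sensing performance, this paper studies spatial-domain signal processing for ISAC with random signals and proposes a joint transmit-receive precoding optimization scheme for multiple-input multiple-output ISAC (MIMO-ISAC) systems. Specifically, considering the estimation of the target response matrix, the paper first defines the ergodic Cramér-Rao bound (ECRB) of the sensing system under random signals and derives a closed-form expression for the ECRB based on the distribution of the complex inverse Wishart matrix, which explains theoretically the performance loss of sensing with random signals relative to conventional sensing with deterministic orthogonal signals. Furthermore, the paper considers the sensing-optimal problem of ECRB minimization and the communication-optimal problem of multi-antenna multi-user signal estimation, and obtains sensing-optimal and communication-optimal precoding designs. The paper then extends this transmit-receive precoding optimization approach to the ISAC scenario. Finally, extensive simulations verify the effectiveness of the proposed method; the results show that the proposed joint transmit-receive precoding design supports high-accuracy estimation of the target response matrix while enabling a flexible trade-off between the communication signal estimation error and the target response matrix estimation error.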
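Radar image interpretation is one of the key technologies for improving the application benefits of radar satellites and supporting future unmanned intelligent platforms. Microwave vision is the inverse perception problem of understanding the physical world from electromagnetic data; its core task is to infer semantic information from microwave radar observation images through modeling based on physical laws. The forward problem of microwave vision is "microwave graphics", which characterizes and models the interaction mechanisms between electromagnetic waves and the real physical world, developing electromagnetic scattering modeling suited to the inverse perception problem. This paper proposes developing semantic electromagnetic scattering modeling for intelligent radar perception, which centers on target semantics, introduces diversity-oriented stochastic modeling, and shifts from pursuing exact consistency for a single sample to pursuing consistency of the sample distribution. The paper describes the background, basic attributes, and key tasks of semantic electromagnetic scattering modeling, presents several advances and technical approaches at two levels, namely the semantic electromagnetic scattering primitive dictionary and the semantic electromagnetic scattering representation tree, and finally briefly introduces the related prior research of the authors' team.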
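Vehicle detection is essential for driver assistance systems. Because road scenes are severely degraded in fog, vehicle information in images is indistinct, which leads to missed and false detections. To address these problems, this paper proposes a vehicle detection method for foggy weather that fuses millimeter-wave radar and machine vision. First, the dark channel dehazing algorithm is used to preprocess the images and increase the saliency of vehicle information in foggy images. Then, the YOLOv5s algorithm is improved with knowledge distillation: knowledge distillation is introduced into the feature extraction network of YOLOv5s, distillation losses are computed in the object localization and classification stages, and the losses are backpropagated to train a small network model, which increases detection speed while maintaining visual detection accuracy. Finally, a distance matching algorithm based on searching potential target detection regions performs decision-level fusion of the visual detection results and the millimeter-wave radar detection results. Using the type and distance of the detected targets as matching criteria, interference and erroneous information are filtered out and targets with high detection confidence after fusing the radar and visual detections are retained, thereby improving vehicle detection accuracy. Experimental results show that the method reaches a maximum detection accuracy of 92.8% and a recall of 90.7% in fog, and can detect vehicles in foggy weather.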
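The decision-level fusion step described above, matching visual and radar detections by type and distance and keeping only confident fused targets, might look roughly like the sketch below. The `Detection` fields, the `max_gap` and `min_score` thresholds, and the noisy-OR rule for combining confidences are all assumptions for illustration; the abstract does not give the paper's matching rule or its search over potential detection regions.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    kind: str        # object class, e.g. "car"
    distance: float  # range to the target in metres
    score: float     # detection confidence in [0, 1]

def fuse_detections(vision, radar, max_gap=2.0, min_score=0.5):
    """Greedy matching by class and range; unmatched detections are dropped."""
    fused, used = [], set()
    for v in vision:
        best_j, best_gap = None, max_gap
        for j, r in enumerate(radar):
            if j in used or r.kind != v.kind:
                continue
            gap = abs(r.distance - v.distance)
            if gap <= best_gap:
                best_j, best_gap = j, gap
        if best_j is None:
            continue  # no radar support: treat as interference and filter out
        used.add(best_j)
        r = radar[best_j]
        score = 1.0 - (1.0 - v.score) * (1.0 - r.score)  # assumed noisy-OR rule
        if score >= min_score:
            fused.append(Detection(v.kind, r.distance, score))  # keep radar range
    return fused
```

The fused target keeps the radar range because radar ranging is usually more reliable than monocular distance estimates, especially in fog; this is a design choice of the sketch, not a claim about the paper.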
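To meet future needs such as detection and perception for autonomous intelligent unmanned systems, this paper describes the concept of embodied radar: an integrated platform-radar autonomous sensing system that deeply couples radar sensing with platform maneuvering and intelligent decision-making. Its core is to overcome the "fixed mode, one-way processing, passive sensing" constraints of conventional radar and to develop a closed-loop "perception-decision-action" processing paradigm, so that the radar can actively choose its detection modes, maneuver paths, and interaction strategies and thereby improve performance against dynamic targets, in partially observable environments, and in strongly contested electromagnetic scenarios. Conventional radars mostly follow a task-specific design approach, characterized by fixed detection modes, non-adjustable parameters, and preset trajectories. Their signal processing chains rely mainly on open-loop processing of one-way data flows and lack the ability to tune themselves autonomously after gaining knowledge of the environment and targets, so they struggle to meet the needs of unmanned systems for real-time modeling and decision-making in complex environments. Embodied radar couples platform mobility, detection and sensing, and agent planning strategies, builds an electromagnetic world model to represent the dynamic "electromagnetic field-target-environment-platform-radar" relationships, and applies real-time closed-loop feedback through an interactive information processing framework, thereby jointly optimizing detection and maneuver strategies. Building on unmanned systems to advance the embodied intelligent sensing paradigm, embodied radar is expected to significantly improve detection effectiveness and autonomous operation capability in complex scenarios, and it is of great significance for reshaping modes of social production and future unmanned combat systems.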
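Radar weak target processing is the foundation and prerequisite for excellent detection performance. In complex real-world applications, strong clutter interference, weak target signals, indistinct image features, and the difficulty of extracting effective features have long made weak target detection and recognition one of the hard problems in radar processing. Traditional model-based processing methods do not match the actual operating background and target characteristics accurately, which limits their generality. In recent years, deep learning has made notable progress in intelligent radar information processing: by building deep neural networks, deep learning algorithms can automatically learn feature representations from large amounts of radar data and improve target detection and recognition performance. This paper systematically reviews and summarizes recent research progress in intelligent processing of radar weak targets from the perspectives of weak target signal processing, image processing, and feature learning, specifically including noise and clutter suppression and weak target signal enhancement; processing of low- and high-resolution radar images and feature maps; and feature extraction, fusion, and target classification and recognition. In view of the limited generalization ability, single-feature reliance, and insufficient interpretability of current intelligent weak target processing applications, future developments are discussed in terms of few-shot target detection (transfer learning, reinforcement learning), multi-dimensional multi-feature fusion detection, interpretability of network models, and joint knowledge- and data-driven approaches.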
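To address the performance limitations of a single sensor in environment perception tasks, a Transformer-based 3D object detection method that fuses millimeter-wave radar, LiDAR, and camera data is proposed. The method consists of three key modules: 1) a camera module, which combines image features with initial 3D object predictions for visual enhancement; 2) a radar module, which, to address the sparsity of millimeter-wave radar point clouds, proposes a temporal multi-frame fusion algorithm that fuses five consecutive frames of millimeter-wave radar data, and also proposes a point cloud weighted fusion algorithm that builds an enhanced radar point cloud based on the complementary descriptive capabilities of millimeter-wave radar and LiDAR point clouds; and 3) a fusion module, which uses a Transformer decoder to fully integrate the radar point cloud and camera features to improve 3D object detection performance. Experiments were conducted on a self-built urban road dataset and evaluated using the nuScenes dataset metrics. The results show that, compared with existing methods, the proposed method improves mAP by 6.38% and NDS by 5.93%.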
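The idea of densifying a sparse radar point cloud by accumulating five consecutive frames can be sketched as below. This is a simplified illustration, assuming 2D points and translation-only ego-motion compensation (no rotation, no per-point velocity compensation); the class name `RadarFrameBuffer` and its interface are hypothetical, and the paper's actual fusion and weighting algorithms are not described in the abstract.

```python
from collections import deque

class RadarFrameBuffer:
    """Keeps the most recent n_frames radar sweeps and merges them."""

    def __init__(self, n_frames=5):
        self.frames = deque(maxlen=n_frames)  # oldest frame is dropped automatically

    def push(self, points, ego_xy):
        """points: (x, y) tuples in the sensor frame; ego_xy: ego position in the world frame."""
        self.frames.append((list(points), ego_xy))

    def fused(self):
        """Return all buffered points expressed in the latest sensor frame."""
        if not self.frames:
            return []
        _, (ex, ey) = self.frames[-1]
        merged = []
        for points, (px, py) in self.frames:
            dx, dy = px - ex, py - ey  # shift from the older ego pose to the latest one
            merged.extend((x + dx, y + dy) for x, y in points)
        return merged
```

With this compensation, a static object observed in several frames maps to the same location in the latest frame, so its returns accumulate instead of smearing along the ego trajectory.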
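Radar-communication integration, through hardware resource sharing and coordinated signal waveform design, overcomes the spectrum conflict, hardware redundancy, and electromagnetic compatibility bottlenecks of traditional separate architectures and improves the overall combat effectiveness and battlefield survivability of the platform. The paper systematically reviews the development of radar-communication integration technology from conceptual origins through architectural evolution to system implementation, and focuses on waveform design methods for radar-communication integration based on mainstream signal schemes such as linear frequency modulation, orthogonal frequency division multiplexing, and orthogonal time frequency space. It discusses in depth a variety of sensing-centric waveform design criteria, including beampattern matching, Cramér-Rao bound optimization, and information-theoretic methods. It also traces the path of engineering practice from software-defined radio compatibility verification and airborne multimodal waveform fusion to multi-node and multi-domain cooperation, clearly showing how the technology has evolved from theory to practice. By presenting the complete technical trajectory of radar-communication integration systems from conventional schemes to multiple-input multiple-output schemes and then to prototype verification, the paper provides systematic theoretical guidance and practical reference for future research and development.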
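To address the blurred features, highly deceptive decoys, and insufficient robustness of traditional recognition algorithms in anti-ship missile target recognition under complex adversarial environments, a multimodal adaptive fusion network (MFA-Net) is proposed. The model uses branches without parameter sharing to extract heterogeneous features from radar, infrared, and electronic support measures, achieves cross-modal adaptive fusion through a channel-spatial dual-dimension attention mechanism, and introduces an adversarial training strategy based on the momentum iterative method that uses a min-max optimization framework to strengthen the model's intrinsic robustness and improve its decision stability under jamming. A nonlinear comprehensive recognition effectiveness index is constructed from the Macro-F1 score, anti-jamming robustness, and inference latency. Experiments show that the comprehensive recognition effectiveness index of MFA-Net reaches 0.8531, significantly better than several comparison models, and a sensitivity analysis over jamming intensity further verifies the model's performance stability across adversarial levels.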
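In recent years, generative artificial intelligence, with its strong ability to fit data distributions and to generate and complete data, has gradually been introduced into radio spectrum cognition and, compared with traditional methods that rely on physical modeling, mathematical interpolation, and discriminative artificial intelligence, has substantially improved cognition accuracy. This paper systematically reviews research progress in generative AI-enabled radio spectrum cognition, focuses on the technical principles, application scenarios, and representative work of different generative paradigms, and discusses in depth the challenges that generative AI faces in radio spectrum cognition, such as scarce training data, insufficient generalization to unseen scenarios, and limited model interpretability. In the future, through cross-modal knowledge fusion, embedding of physical mechanisms, and the construction of trustworthy evaluation, generative AI is expected to drive radio spectrum cognition toward high accuracy, strong generalization, and interpretability, effectively supporting efficient use of spectrum resources.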
Accurately detecting overlapping defects is a challenging problem. This article proposes a system that can obtain optical and electromagnetic signals synchronously. It contains an industrial camera, a chromatic confocal displacement sensor (CCDS), and an eddy current testing (ECT) probe with high-resolution array tunneling magnetoresistance sensors. A multimodal fusion algorithm is proposed to combine the advantages of optical testing and ECT. Fusing the visual image with the CCDS measurement yields the morphology of the surface defects. To recognize the overlapping buried defect, a finite-element method model is developed to estimate the surface defect signal using parameters from the optical measurements, and the estimated surface defect signal is then subtracted from the experimental image, from which the overlapping defects are precisely detected. This method can be widely used in industry to comprehensively assess the health and integrity of structures.
… (UAV) signals in dynamic electromagnetic environments has … propose a novel Multimodal Tri-branch Fusion Network (MTF-… ; (3) A hierarchical fusion module implementing cross-branch …
With the increasing complexity of modern communication systems, modulation recognition of broadband electromagnetic signals faces challenges such as diverse signal forms, low signal-to-noise ratio, and multipath interference. To enhance recognition accuracy and robustness, this paper proposes a modulation recognition model based on a multimodal fusion Transformer. The model integrates three signal modalities: the time domain, the frequency domain, and the time-frequency domain. It extracts the temporal features, spectral features, and non-stationary characteristics of the signals, respectively, and then achieves effective feature fusion and discriminative learning through an improved Transformer architecture. The paper combines the local feature extraction ability of the convolutional neural network (CNN) with the global dependency modeling ability of the attention mechanism and proposes a new algorithm that integrates CNN and Transformer, named CT-Transformer. The algorithm introduces one-dimensional convolution in the feature embedding stage for local feature enhancement and a gating mechanism in the self-attention mechanism to improve the ability to focus on key features. Experimental results show that, compared with the traditional CNN and the standard Transformer model, CT-Transformer achieves better recognition accuracy and generalization ability under various signal-to-noise ratio conditions, with especially significant improvement in low signal-to-noise ratio environments. This study provides a new technical path for signal modulation recognition in complex electromagnetic environments.
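The three input modalities named in this abstract (time domain, frequency domain, and time-frequency domain) can all be derived from the same sampled signal. The sketch below builds them with a plain discrete Fourier transform and a short-time Fourier transform using a rectangular window. The function names, window length, and hop size are illustrative assumptions, and the CT-Transformer network itself is not shown.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def three_views(signal, win=8, hop=4):
    """Return (time view, magnitude spectrum, STFT magnitude frames)."""
    time_view = list(signal)
    freq_view = [abs(c) for c in dft(signal)]
    tf_view = [[abs(c) for c in dft(signal[s:s + win])]  # rectangular window
               for s in range(0, len(signal) - win + 1, hop)]
    return time_view, freq_view, tf_view
```

In practice an FFT routine and a tapered window (for example Hann) would replace the naive DFT and rectangular window; the naive form is used here only to keep the example dependency-free.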
Millimeter-wave (mmWave) radar, with its high resolution, sensitivity to micro-vibrations, and adaptability to various environmental conditions, holds immense potential across multi-modal fusion sensing. Although there exist review papers on mmWave radar, there is a noticeable lack of comprehensive reviews focusing on its multi-modal fusion sensing capabilities. Addressing this gap, our review offers an extensive exploration of mmWave radar multi-modal fusion sensing, emphasizing its integration with other modalities. This review discusses the complex realm of millimeter-wave radar multi-modal fusion sensing, detailing its importance, hardware and software aspects, principles, and current potential and applications. It delves into data characteristics and datasets associated with mmWave radar, focusing on Doppler, point cloud, and multi-modal data formats. The review highlights how these data types enhance multi-modal fusion sensing and discusses methodologies, including signal processing and learning algorithms. Three categories of multi-modal fusion methodologies are proposed to optimally manage and interpret fused data. Various practical applications of mmWave radar multi-modal fusion sensing are illustrated, underlining the unique capabilities it provides when integrated with other sensors. The review concludes by identifying potential future research avenues, underscoring the immense potential of this field for further exploration and advancement.
Understanding complex physical systems often requires integrating data from multiple diagnostics, each with limited resolution or coverage. We present a machine learning framework that reconstructs synthetic high-temporal-resolution data for a target diagnostic using information from other diagnostics, without direct target measurements during the inference. This multimodal super-resolution technique improves diagnostic robustness and enables monitoring even in case of measurement failures or degradation. Applied to fusion plasmas, our method targets edge-localized modes (ELMs), which can damage plasma-facing materials. By reconstructing super-resolution Thomson Scattering data from complementary diagnostics, we uncover fine-scale plasma dynamics and validate the role of resonant magnetic perturbations (RMPs) in ELM suppression through magnetic island formation. The approach provides new observation supporting the plasma profile flattening due to these islands. Our results demonstrate the framework’s ability to generate high-fidelity synthetic diagnostics, offering a powerful tool for ELM control development in future reactors like ITER. The approach is broadly transferable to other domains facing sparse, incomplete, or degraded diagnostic data, opening new avenues for discovery. Sensor failures and limited resolution challenge many complex systems. Here, authors develop a multimodal AI method to generate super-resolution of a sensor using other available sensors in the system, revealing hidden dynamics in fusion plasmas and enabling cost-effective, high-resolution diagnostics.
Stress concentrations in ferromagnetic materials, which can arise during manufacturing or in-service use, often compromise structural integrity. Therefore, effective stress monitoring is critical for steel components. Conventional stress analysis methods that rely on single features from individual sensors often lack the required accuracy and robustness. This article proposes a novel multisensor information fusion framework designed to significantly enhance stress prediction accuracy. The framework integrates magnetoacoustic emission (MAE) and magnetic Barkhausen noise (MBN) signals, both of which are highly sensitive to stress-induced magnetic microstructure alterations. The proposed methodology involves extracting diverse time-domain and frequency-domain features from MAE and MBN signals acquired under various applied stress conditions. Subsequently, principal component analysis (PCA) is employed for dimensionality reduction of this comprehensive feature set. The reduced features are then input into a developed multimodal Gaussian kernel product fusion model. The parameters of this fusion model are optimized using the limited-memory Broyden–Fletcher–Goldfarb–Shanno with bounds (L-BFGS-B) algorithm. Experimental results obtained from the specimens demonstrate the superior predictive performance of the proposed Gaussian fusion model, which achieved an R-squared ($R^2$) value of 0.9425 and a root mean square error (RMSE) of 7.58. This performance represents an improvement over other evaluated fusion models and approaches based on single-sensor data or individual features. This study validates the proposed multiphysics, multisensor, and multifeature fusion strategy as a highly reliable and precise method for stress assessment in ferromagnetic materials.
Geophysical joint inversion can comprehensively exploit the sensitivity of different geophysical data to various physical parameters to reduce the multiplicity of solutions inherent in a single inversion method. To address the challenges of modality differences in geophysical observation data and the limitations of existing joint inversion methods in achieving high-resolution prediction and fine-scale evaluations of underground resources, we propose a multimodal geophysics-informed neural network (MGINN) designed for the joint inversion of seismic, electromagnetic, and well-logging data. The proposed method integrates physical information into the neural network architecture from three key aspects: network structure, data constraints, and loss functions. It achieves multimodal data fusion through feature map concatenation and gradient backpropagation, complemented by a multistage optimization process that enhances the accuracy of inversion results. Comparative experimental results demonstrate that our method produces inversion results with superior spatial continuity and the ability to resolve finer geological structures. It can offer massive data support for subsequent underground resource extraction efforts.
As one of the core parts of the Internet of Things (IoT), multimodal sensors have exhibited great advantages in fields such as human-machine interaction, electronic skin, and environmental monitoring. However, current multimodal sensors introduce a bloated equipment architecture and a complicated decoupling mechanism. In this work we propose a multimodal fusion sensing platform based on a power-dependent piecewise linear decoupling mechanism, allowing four parameters to be perceived and decoded from a single passive wireless component, which greatly broadens the configurable freedom of a sensor in the IoT. A systematic model is employed to analyze the linear sensing properties and ensure the feasibility of the scheme. The excitation power dependence provides an efficient and quantitative linear decoupling strategy of unidentified combinations for multiple stimuli. As a validation for a wearable device such as electronic skin (e-skin), the functionalized sensing film polyaniline/graphene oxide (PANI/GO) serves to synchronously monitor humidity, temperature, ultraviolet, and proximity through the mapping in resonant frequency (fs). Compared with output errors of ∼18.00%, ∼17.50%, ∼15.00%, and ∼20.00%, the maximum experimental errors of temperature, humidity, ultraviolet, and proximity after decoupling are 5.70%, 4.00%, 5.00%, and 8.30%, respectively. In general, the developed single-component multimodal fusion sensing platform offers a strategic advantage for a miniaturized, passive wireless, and inexpensive (less than $1) signal identification system with a facile circuit layout.
Unmanned-aerial-vehicle (UAV) communications, as a critical application scenario in the sixth generation (6G) wireless communication field, have garnered widespread attention. During UAV-to-ground communication, channel data plays a pivotal role. Analyzing channel data enables an understanding of the diversity and temporal variability of communication environments, thereby facilitating the construction of more efficient communication systems. This article proposes a novel UAV-to-ground channel prediction method based on multimodal fusion. The method aims to achieve real-time and precise prediction of UAV-to-ground channel data in the 3-D airspace by integrating various sources of information, including UAV-captured images, location data of transmitters and receivers, and communication settings. The network uses a fused architecture combining a convolutional neural network (CNN) and a Transformer to extract and integrate features from diverse information sources. This fusion strategy significantly enhances the accuracy of UAV-to-ground channel prediction. Incorporating image information enables the network to better comprehend the complexity and dynamics of communication environments, thereby assisting in achieving more precise UAV-to-ground channel prediction. Experimental results demonstrate that the proposed method achieves real-time prediction of ground channels across various flight altitudes and communication frequency bands. This provides robust technical support for advancing UAV communication and offers new insights for optimizing and upgrading future wireless communication systems.
Object imaging and recognition under difficult visual conditions is extremely challenging because of the low quality of the captured images, and traditional optical-based recognition methods always fail in this task. In this article, we propose to utilize the visual–microwave image pairs captured by both visual cameras and microwave sensors for imaging and recognition. To address the heavy noise in the low-quality optical images, we retrieve the physically quantitative images from associated scattered field data and enhance visual features using both optical and retrieved images. We develop a cross-modal enhanced attentive visual–microwave fusion (EAVMF) object recognition model to jointly learn the cross-modal generator and multimodal recognizer. In addition, an attention module for the visual subnetwork is utilized to highlight the regions of interest. Two multimodal datasets with synthetic visual–microwave image pairs are built to simulate the difficult visual condition. The numerical results on these datasets demonstrate that: 1) the multimodal fusion, cross-modal enhancement, and visual attention module can enhance the performance; and 2) compared with the existing methods, the proposed EAVMF not only performs better in terms of accuracy, but also has good scalability and one-shot learning ability.
Wireless sensing technology, with its advantages in privacy protection and high recognition accuracy, has demonstrated remarkable performance in complex behavioral monitoring environments. However, single-modal sensing technologies often fail to meet the demands of multitask complex behavior recognition. Although multimodal fusion can significantly enhance system performance, challenges such as data heterogeneity and uneven modal contributions severely limit fusion efficiency. To address these challenges, this article proposes an innovative multimodal data fusion framework called the multimodal temporal-topology feature fusion network (MTTFNet). By combining the strengths of a convolutional neural network (CNN) and transformer architectures, the framework introduces a temporal feature enhancement module (TFEM) and incorporates an adaptive weighting fusion mechanism alongside a cross-modal supervision strategy. This design facilitates efficient collaborative optimization of modal features, effectively mitigating the challenges posed by heterogeneity and contribution imbalance. Experimental results demonstrate that the recognition accuracy of the radio frequency identification (RFID) modality has increased to 70.81%, while the accuracy of multimodal fusion consistently exceeds 90%. Notably, the fusion of all three modalities achieves an accuracy of 96.83%, significantly outperforming existing algorithms. These findings validate the effectiveness of MTTFNet and provide a robust solution for complex behavior recognition tasks.
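An adaptive weighting fusion step of the kind mentioned above can be sketched as a softmax-weighted sum of per-modality feature vectors. This is only an illustration under stated assumptions: in MTTFNet the weights would be learned by the network, whereas here the per-modality scores are supplied by hand, and the names `softmax` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, scores):
    """Weighted sum of equally sized feature vectors, one per modality.

    `scores` are unnormalized modality scores; a higher score gives that
    modality a larger share of the fused representation.
    """
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


# Three made-up modality features with equal scores: plain averaging.
fused_equal, weights_equal = fuse(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
)

# A much higher score lets the first modality dominate the fusion.
fused_skewed, weights_skewed = fuse([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
```

Normalizing the weights keeps the fused feature on the same scale regardless of how many modalities are present, which is one simple way to limit the effect of uneven modal contributions.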
Conventional shear horizontal (SH0) guided wave electromagnetic acoustic transducer (EMAT) has been widely used for pipeline detection due to its advantages of nondispersion and simple vibration mode. While echo amplitude can be utilized to detect defects, it presents limitations in discerning between those residing on the surface and those located subsurface. This article proposes a hybrid sensing methodology that integrates SH0 guided wave EMAT with permanent magnetic field perturbation (PMFP) to enhance the accuracy of both defect identification and classification. From a physical coupling perspective, the system shares the same magnetic circuit structure, and the sensing information is received by different receivers. In particular, the significant frequency difference between the PMFP and the EMAT signal can directly suppress mutual electromagnetic interference (EMI). In addition, the homopolar permanent magnet arrays in EMAT provide an inherent “near-zero” magnetic field region that improves the signal-to-noise ratio (SNR) of the PMFP signal. Both simulations and experiments have been undertaken to demonstrate the feasibility of the fusion sensing system. The system has also been validated with high sensitivity in detecting pits, slits, and large corrosion defects in plates and pipes.
… electromagnetic valve fault diagnosis method, which can effectively expand the number of diagnosable fault types for electromagnetic … Then, feature extraction and fusion of these three …
A channel prediction modeling method based on multimodal fusion perception is proposed for complex environments in unmanned aerial vehicle (UAV) air-to-ground (A2G) communications. Two-dimensional (2D) environmental information and three-dimensional (3D) point cloud data are fused to enhance the model’s ability to capture environment blockage, reflection, and multipath effects. The 2D information provides target objects’ planar distribution and texture features, while the 3D point clouds supplement spatial position, size, and height information. These complementary modalities comprehensively describe the geometric structures present in complex environments. A multimodal modeling network is constructed to explore the nonlinear mapping between environmental perception data and channel data. The network takes 2D building distribution, 3D point cloud data, global image features, UAV and receiver positions, and communication parameters as joint inputs. Feature extraction and fusion modules achieve effective joint encoding of heterogeneous multimodal features. A spatial feature decoupling (SFD) module is designed to address interference caused by coupled features. It separates the data distributions corresponding to different channel characteristics, improving the accuracy of channel impulse response (CIR) prediction. Experimental results demonstrate that the proposed method significantly improves the reliability and adaptability of UAV channel modeling in complex urban scenarios.
The process of reconstructing underlying cortical and subcortical electrical activities from Electroencephalography (EEG) or Magnetoencephalography (MEG) recordings is called Electrophysiological Source Imaging (ESI). Given the complementarity between EEG and MEG in measuring radial and tangential cortical sources, combined EEG/MEG is considered beneficial in improving the reconstruction performance of ESI algorithms. Traditional algorithms mainly emphasize incorporating predesigned neurophysiological priors to solve the ESI problem. Deep learning frameworks aim to directly learn the mapping from scalp EEG/MEG measurements to the underlying brain source activities in a data-driven manner, demonstrating superior performance compared to traditional methods. However, most of the existing deep learning approaches for the ESI problem are performed on a single modality of EEG or MEG, meaning the complementarity of these two modalities has not been fully utilized. How to fuse the EEG and MEG in a more principled manner under the deep learning paradigm remains a challenging question. This study develops a Multi-Modal Deep Fusion (MMDF) framework using Attention Neural Networks (ANN) to fully leverage the complementary information between EEG and MEG for solving the ESI inverse problem, which is termed as MMDF-ANN. Specifically, our proposed brain source imaging approach consists of four phases, including feature extraction, weight generation, deep feature fusion, and source mapping. Our experimental results on both synthetic dataset and real dataset demonstrated that using a fusion of EEG and MEG can significantly improve the source localization accuracy compared to using a single-modality of EEG or MEG. Compared to the benchmark algorithms, MMDF-ANN demonstrated good stability when reconstructing sources with extended activation areas and situations of EEG/MEG measurements with a low signal-to-noise ratio.
Automatic modulation recognition (AMR) plays a crucial role in noncooperative communication environments for identifying the modulation types of received radio signals. Recently, the achievements of deep learning (DL) have sparked significant interest in applying DL to the field of AMR. However, existing DL-based AMR methods use only the image modality or the sequence modality as input, which cannot leverage sufficient information about the signal in complicated electromagnetic environments characterized by scarce labeled samples and multipath fading. To overcome this limitation, we explore different modal representations of the signal to fully exploit their complementary information and propose a hybrid modal contrast and fusion method for AMR (HMCF-AMR). It consists of two stages: 1) modal-level feature contrast for self-supervised pretraining and 2) modal-level feature fusion for supervised fine-tuning. In modal-level feature contrast, a sequence encoder and an image encoder are designed to extract multiscale features of the modulated signal from the sequence modality and the image modality. Meanwhile, a multitask collaborative pretraining method combining generative and contrastive learning is used to enhance and align the different modal representations. In modal-level feature fusion, an attentional feature fusion mechanism integrates the features learned from the different modalities to further improve modulation recognition performance, and online learning is implemented by fine-tuning to handle different scenarios. Simulation results show that our proposed HMCF-AMR outperforms other baseline models in both adequate-sample and few-shot scenarios and demonstrates greater robustness in complicated multipath fading channels.
In the context of autonomous driving environment perception, multi-modal fusion plays a pivotal role in enhancing robustness, completeness, and accuracy, thereby extending the performance boundary of the perception system. However, directly applying LiDAR-related algorithms to radar and camera fusion leads to significant challenges, such as radar sparsity, absence of height information, and noise, resulting in substantial performance loss. To address these issues, our proposed method, SparseFusion3D, utilizes a dual-branch feature-level fusion network that fully models sensor interactions, effectively mitigating the adverse effects of radar sparsity and noise on modality association. Additionally, to enhance modal correlations and accuracy while alleviating radar point cloud sparsity and measurement ambiguity, we introduce MSPCP, which compensates for point cloud offset. Moreover, we integrate Radar Painter to leverage image information and further enhance MSPCP. SparseFusion3D exhibits competitive performance compared to previous radar-camera fusion models, achieving approximately 1.5x inference speedup with similar performance to dense query methods, while also improving by 20.1% compared to the baseline approach.
In recent years, unmanned surface vehicles (USVs) have played an increasingly important role in various applications. Due to the expansion of USV application scenarios from common marine areas to inland waters with complex environments, environmental perception has become an essential requirement for autonomous navigation systems of USVs. Traditional perception methods utilize either light detection and ranging (LiDAR) or radar to construct volumetric maps for environmental perception. To improve the accuracy of perception systems and reduce deployment costs, this article proposes a novel radar and camera fusion volumetric map network named FVMNet for real-time volumetric perception. FVMNet is based on a novel radar and image fusion architecture and comprises four modules: 1) the radar and image encoders extract modality-specific features; 2) an auxiliary segmentation head, used only during training and therefore adding no extra time cost at inference, improves the image encoder; 3) to eliminate the representation difference between image features and radar features, the BEV spatial transformer module transfers image feature representations from the perspective view to BEV space; and 4) the fusion segmentation head predicts the volumetric perception results. Compared to other baseline methods that use a single modality, FVMNet achieves state-of-the-art accuracy on the public USVInland dataset and our collected wharf dataset. We conducted comprehensive ablation experiments to validate the efficacy of the designed modules. Moreover, the proposed method demonstrates generalization in zero-shot real-world scenarios and robustness under extreme weather conditions.
Reliable 3D object perception is essential in autonomous driving. Owing to its sensing capabilities in all weather conditions, 4D radar has recently received much attention. However, compared to LiDAR, 4D radar provides a much sparser point cloud. In this paper, we propose a 3D object detection method, termed ZFusion, which fuses the 4D radar and vision modalities. As the core of ZFusion, our proposed FP-DDCA (Feature Pyramid-Double Deformable Cross Attention) fuser effectively complements the (sparse) radar information with the (dense) vision information. Specifically, with a feature-pyramid structure, the FP-DDCA fuser packs Transformer blocks to interactively fuse multi-modal features at different scales, thus enhancing perception accuracy. In addition, we utilize the Depth-Context-Split view transformation module due to the physical properties of 4D radar. Considering that 4D radar has a much lower cost than LiDAR, ZFusion is an attractive alternative to LiDAR-based methods. In typical traffic scenarios like the VoD (View-of-Delft) dataset, experiments show that, with reasonable inference speed, ZFusion achieves state-of-the-art mAP (mean average precision) in the region of interest while having competitive mAP in the entire area compared to the baseline methods, demonstrating performance close to LiDAR and greatly outperforming camera-only methods.
Millimeter-wave radar has the advantages of strong penetration, high-precision speed detection and low power consumption. It can be used to conduct robust object detection in abnormal lighting and severe weather conditions. The emerging 4D millimeter-wave radar has improved the quality and quantity of generated point clouds. Adding radar–camera fusion enhances the tracking reliability of transportation system operation. However, it is challenging due to the absence of standardized testing methods. Hence, this paper proposes a radar–camera fusion algorithm testing framework in a highway roadside scenario using the SUMO and CARLA simulators. First, we propose a 4D millimeter-wave radar simulation method. A roadside multi-sensor perception dataset is generated in a 3D environment through co-simulation. Then, deep-learning object detection models are trained under different weather and lighting conditions. Finally, we propose a baseline fusion method for the algorithm testing framework. This framework provides a realistic virtual environment for device selection, algorithm testing and parameter tuning for millimeter-wave radar–camera fusion algorithms. Results show that the method proposed in this paper can provide a realistic virtual environment for radar–camera fusion algorithm testing for roadside traffic perception. Compared to the camera-only tracking method, the proposed radar–vision fusion method significantly improves tracking performance in rainy night scenarios. The trajectory RMSE is improved by 68.61% in expressway scenarios and 67.45% in urban scenarios. This method can also be applied to improve the detection of stop-and-go waves on congested expressways.
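For readers who want to reproduce this kind of comparison, the following minimal sketch computes a trajectory RMSE and a relative improvement. It assumes the reported percentages are relative reductions in RMSE with respect to the camera-only tracker, which the abstract does not state explicitly. The tracks are made up and the function names are hypothetical.

```python
import math


def trajectory_rmse(predicted, reference):
    """Root-mean-square Euclidean error between two equal-length 2-D tracks."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("tracks must be non-empty and equally long")
    squared = [
        (px - rx) ** 2 + (py - ry) ** 2
        for (px, py), (rx, ry) in zip(predicted, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse, fused_rmse):
    """Percentage reduction of the RMSE relative to the baseline."""
    return 100.0 * (baseline_rmse - fused_rmse) / baseline_rmse


# Made-up tracks in metres, purely for illustration.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
camera_only = [(0.0, 2.0), (1.0, 2.0), (2.0, 2.0)]  # constant 2 m offset
fused = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]  # constant 0.5 m offset

rmse_camera = trajectory_rmse(camera_only, reference)
rmse_fused = trajectory_rmse(fused, reference)
gain = relative_improvement(rmse_camera, rmse_fused)
```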
Achieving all-weather vehicle perception in roadside monitoring systems composed of cameras and millimeter-wave radars is challenging, primarily due to the ineffective integration of information from these two sources. Particularly in adverse weather conditions, this lack of integration can lead to the monitoring system’s inability to promptly identify and address hazardous situations. Moreover, current fusion methods are often dominated by visual information and do not sufficiently utilize the complementary aspects of the two information sources. In this article, we present the radar and visual information fusion network (RVIFNet), a novel method that tackles these challenges through enhanced radar data representation and multilevel fusion strategies. First, we develop a pseudoimage representation method for sparse radar data and its feature extraction technique, which enhances its expressiveness and lays the groundwork for feature-level fusion with visual features. Second, we propose a multilevel fusion approach that leverages the complementary attributes of the dual-modal data in terms of spatial localization, resolution, and semantic understanding to achieve fusion at the low-level semantic, high-level semantic, and anchor box levels. In addition, we introduce the monitoring perspective for radar and camera (MPRC) dataset, collected and annotated specifically for roadside monitoring scenarios, and elaborate on the spatial–temporal synchronization method for the dual-source data. We evaluate RVIFNet on MPRC and the widely used in-vehicle dataset NuScenes, confirming its effectiveness for all-weather vehicle detection. To the best of the authors’ knowledge, this work is among the early attempts to fuse radar and camera data for all-weather vehicle perception in roadside monitoring scenarios.
Far-range perception is essential for intelligent transportation systems. The main challenge of far-range perception is due to the difficulty of performing accurate object detection and tracking under far distances (e.g., $> 150\,\text{m}$) at low cost. To cope with such challenges, deploying millimeter wave Radars and high-definition (HD) cameras, and fusing their data for joint perception has become a common practice. The key to this solution is the precise association between two types of data captured from different perspectives. Towards this goal, the first question is which plane to conduct the association, i.e., the 2D image plane or the BEV plane. We argue that the former is more suitable because the location errors of the perspective projection points are smaller at far distances and can lead to more accurate associations. Thus, we project Radar-based target locations from the BEV to the 2D plane and then associate them with camera-based object locations. Subsequently, we map the camera-based object locations to the BEV plane through inverse projection mapping (IPM) with corresponding depth information from Radar data. Finally, we engage a BEV tracking module to generate target trajectories for traffic monitoring. We devise a transformation parameters refining approach based on the depth scaling technique. We have deployed an actual testbed on an urban expressway and conducted extensive experiments for evaluation. The results show that our system can improve $\text{AP}_{\text{BEV}}$ by 32%, and reduce the location error by $0.56\,\text{m}$. Our system is capable of achieving an average location accuracy of $1.3\,\text{m}$ within the $500\,\text{m}$ range.
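The two geometric steps described above (projecting a radar target into the image, then mapping an image location back to the BEV plane using radar depth) can be sketched with a plain pinhole camera model. The sketch assumes the radar and camera frames are already aligned, with x to the right, y downward, and z forward, and it ignores lens distortion and the parameter refinement the paper describes. The intrinsics and the target position are made-up values, and the function names are hypothetical.

```python
def project_to_image(x_lateral, y_down, z_forward, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    if z_forward <= 0:
        raise ValueError("point must lie in front of the camera")
    u = fx * x_lateral / z_forward + cx
    v = fy * y_down / z_forward + cy
    return u, v


def back_project_to_bev(u, depth, fx, cx):
    """Map a pixel column plus a radar depth to (lateral, forward) in BEV."""
    return (u - cx) * depth / fx, depth


# Made-up intrinsics for a 1920x1080 camera.
FX, FY, CX, CY = 2000.0, 2000.0, 960.0, 540.0

# Hypothetical radar target: 3 m to the right, 300 m ahead, on a road
# surface 6 m below the camera (so y_down = 6).
u, v = project_to_image(3.0, 6.0, 300.0, FX, FY, CX, CY)

# After association in the image plane, the radar depth lifts the
# camera-based location back to the BEV plane.
lateral, forward = back_project_to_bev(u, 300.0, FX, CX)
```

With these made-up numbers the target sits only 20 pixels from the image centre column, which illustrates why small pixel errors translate into large metric errors at long range.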
Camera and millimeter-wave (MMW) radar fusion is essential for accurate and robust autonomous driving systems. With the advancement of radar technology, next-generation high-resolution automotive radar, i.e., 4-D radar, has emerged. In addition to the target range, azimuth, and Doppler velocity measurements of traditional radar, 4-D radar provides elevation measurement to create a denser “point cloud.” In this study, we propose a camera and 4-D radar fusion network called RCFusion, which achieves multimodal feature fusion under a unified bird’s-eye view (BEV) space to accomplish 3-D object detection tasks. In the camera stream, multiscale feature maps are obtained by the image backbone and feature pyramid network (FPN); they are then converted into orthographic feature maps by an orthographic feature transform (OFT). Next, enhanced and fine-grained image BEV features are obtained via a designed shared attention encoder. Meanwhile, in the 4-D radar stream, a newly designed component named radar PillarNet efficiently encodes the radar features to generate radar pseudo-images, which are fed into the point cloud backbone to create radar BEV features. An interactive attention module (IAM) is proposed for the fusion stage, which outputs a valid fusion of the two-modal BEV features. Finally, a generic detection head predicts the object classes and locations. The proposed RCFusion is validated on the TJ4DRadSet and view-of-delft (VoD) datasets. The experimental results and analysis show that the proposed method can effectively fuse camera and 4-D radar features to achieve robust detection performance.
Autonomous driving requires an accurate and fast 3D perception system that includes 3D object detection, tracking, and segmentation. Although recent low-cost camera-based approaches have shown promising results, they are susceptible to poor illumination or bad weather conditions and have a large localization error. Hence, fusing camera with low-cost radar, which provides precise long-range measurement and operates reliably in all environments, is promising but has not yet been thoroughly investigated. In this paper, we propose Camera Radar Net (CRN), a novel camera-radar fusion framework that generates a semantically rich and spatially accurate bird’s-eye-view (BEV) feature map for various tasks. To overcome the lack of spatial information in an image, we transform perspective view image features to BEV with the help of sparse but accurate radar points. We further aggregate image and radar feature maps in BEV using multi-modal deformable attention designed to tackle the spatial misalignment between inputs. CRN with real-time setting operates at 20 FPS while achieving comparable performance to LiDAR detectors on nuScenes, and even outperforms at a far distance on 100m setting. Moreover, CRN with offline setting yields 62.4% NDS, 57.5% mAP on nuScenes test set and ranks first among all camera and camera-radar 3D object detectors.
Unmanned surface vehicles (USVs) have been used for a wide range of tasks in the past decades. Accurate perception of the surrounding environment on the water surface under complex conditions is crucial for USVs to conduct effective operations. This article proposes a radar-vision fusion framework for USVs to accurately detect typical targets on the water surface. The modality difference between images and radar measurements, along with their perpendicular coordinates, presents challenges in the fusion process. The swaying of USVs on water and the extensive areas of perception increase the difficulty of multisensor data association. To address these problems, we propose two modules to enhance multisensor fusion performance: a movement-compensated projection module and a distance-aware probabilistic data association module. The former effectively reduces projection bias during the alignment of radar and camera signals by compensating for sensor movement using roll and pitch angles measured by the inertial measurement unit (IMU). The latter models the target region guided by each radar measurement as a bivariate Gaussian distribution, with its covariance matrix adaptively derived from the distance between the target and the camera. Consequently, the association of radar points and images is robust to projection errors and works well for multiscale objects. Features of radar points and images are subsequently extracted with two parallel backbones and fused at different levels to provide sufficient semantic information for robust object detection. The proposed framework achieves an average precision (AP) of 0.501 on the challenging real-world dataset established by us, outperforming state-of-the-art vision-only and radar-vision fusion methods.
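The distance-aware association can be illustrated with a simplified sketch. It assumes an isotropic Gaussian centred on the projected radar point whose standard deviation shrinks with target distance, because far targets occupy fewer pixels. The paper derives a full covariance matrix, and its exact form is not given in the abstract, so the `scale`, `min_sigma`, and `threshold` values and both function names are hypothetical.

```python
import math


def association_score(radar_px, box_center, distance_m, scale=400.0, min_sigma=4.0):
    """Unnormalized Gaussian likelihood that a detection belongs to a radar point.

    The standard deviation (in pixels) is inversely proportional to the
    target distance, with a lower bound so far targets keep some tolerance.
    """
    sigma = max(scale / distance_m, min_sigma)
    du = box_center[0] - radar_px[0]
    dv = box_center[1] - radar_px[1]
    return math.exp(-0.5 * (du * du + dv * dv) / (sigma * sigma))


def associate(radar_px, distance_m, box_centers, threshold=0.1):
    """Return the index of the best-matching detection, or None."""
    if not box_centers:
        return None
    scores = [association_score(radar_px, c, distance_m) for c in box_centers]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


# A near target (10 m) tolerates a projection error of about 11 pixels.
near_match = associate((100.0, 100.0), 10.0, [(300.0, 300.0), (110.0, 105.0)])

# A far target (100 m) rejects a detection that is 20 pixels away.
far_match = associate((100.0, 100.0), 100.0, [(120.0, 100.0)])
```

Tying the tolerance to distance is what lets one rule handle both large nearby vessels and small distant ones without a separate threshold per object scale.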
Low-cost, vision-centric 3D perception systems for autonomous driving have made significant progress in recent years, narrowing the gap to expensive LiDAR-based methods. The primary challenge in becoming a fully reliable alternative lies in robust depth prediction capabilities, as camera-based systems struggle with long detection ranges and adverse lighting and weather conditions. In this work, we introduce HyDRa, a novel camera-radar fusion architecture for diverse 3D perception tasks. Building upon the principles of dense Bird's-Eye-View (BEV)-based architectures, HyDRa introduces a hybrid fusion approach to combine the strengths of complementary camera and radar features in two distinct representation spaces. Our Height Association Transformer module leverages radar features already in the perspective view to produce more robust and accurate depth predictions. In the BEV, we refine the initial sparse representation by a Radar-weighted Depth Consistency. HyDRa achieves a new state-of-the-art for camera-radar fusion of 64.2 NDS (+1.8) and 58.4 AMOTA (+1.5) on the public nuScenes dataset. Moreover, our new semantically rich and spatially accurate BEV features can be directly converted into a powerful occupancy representation, beating all previous camera-based methods on the Occ3D benchmark by an impressive 3.7 mIoU. Code and models are available at https://github.com/phi-wol/hydra.
Integrating multiple sensors and addressing diverse tasks in an end-to-end algorithm are challenging yet critical topics for autonomous driving. To this end, we introduce BEVGuide, a novel Bird's-Eye-View (BEV) representation learning framework, representing the first attempt to unify a wide range of sensors under direct BEV guidance in an end-to-end fashion. Our architecture accepts input from a diverse sensor pool, including but not limited to Camera, Lidar and Radar sensors, and extracts BEV feature embeddings using a versatile and general transformer backbone. We design a BEV-guided multi-sensor attention block to take queries from BEV embeddings and learn the BEV representation from sensor-specific features. BEVGuide is efficient due to its lightweight backbone design and highly flexible as it supports almost any input sensor configuration. Extensive experiments demonstrate that our framework achieves exceptional performance in BEV perception tasks with a diverse sensor set. Project page is at https://yunzeman.github.io/BEVGuide.
4D millimeter-wave (MMW) radar, which adds height information and provides denser point cloud data than 3D MMW radar, has become increasingly popular in 3D object detection. In recent years, radar-vision fusion models have demonstrated performance close to that of LiDAR-based models, offering advantages in terms of lower hardware costs and better resilience in extreme conditions. However, many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information. Additionally, these multi-modal networks are often sensitive to the failure of a single modality, particularly vision. To address these challenges, we propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process, enhancing the quality of visual Bird's-Eye View (BEV) features. We further introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities using a shared module. To assess the robustness of multimodal models, we develop a novel Failure Test (FT) ablation experiment, which simulates vision modality failure by injecting Gaussian noise. We conduct extensive experiments on the View-of-Delft (VoD) and TJ4D datasets. The results demonstrate that our proposed Unified BEVFusion (UniBEVFusion) network significantly outperforms state-of-the-art models on the TJ4D dataset, with improvements of 3.96% in 3D and 4.17% in BEV object detection accuracy.
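The Failure Test idea of simulating a degraded vision modality by injecting Gaussian noise can be sketched as follows. The abstract does not say where the noise is injected or at what strength, so this sketch assumes it is added to normalized pixel intensities and clipped back to the valid range. The function name `corrupt_image` and the noise levels are hypothetical.

```python
import random


def corrupt_image(image, sigma, seed=0):
    """Add zero-mean Gaussian noise to a grayscale image with values in [0, 1].

    `image` is a list of rows of floats. A fixed seed keeps the corruption
    reproducible across evaluation runs.
    """
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


# Tiny made-up image used only to exercise the function.
clean = [[0.0, 0.5, 1.0], [0.25, 0.75, 0.5]]

unchanged = corrupt_image(clean, sigma=0.0)  # no failure
mild = corrupt_image(clean, sigma=0.1)  # partial degradation
severe = corrupt_image(clean, sigma=1.0)  # near-total vision failure
```

Evaluating a fusion model on `mild` and `severe` inputs while leaving the radar branch untouched shows how much of its accuracy depends on the camera alone.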
Bird’s-eye-view (BEV) perception has emerged as a key representation for unified scene understanding in autonomous driving. However, current BEV methods relying solely on monocular cameras suffer from severe degradation under adverse weather and dynamic scenes due to limited depth cues and illumination dependency. To address these challenges, we propose a robust multi-modal BEV perception framework that integrates dual-source 4D millimeter-wave radar and multi-view camera images. The proposed architecture systematically exploits Doppler velocity and temporal information from 4D radar to model dynamic object motion, while introducing a deformable fusion strategy in the BEV space for accurate semantic alignment across modalities. Our design includes four key modules: a Doppler-Aware Radar Encoder (DARE) that enhances motion-sensitive features via velocity-guided attention; a Fog-Aware Feature Denoising Module (FADM) that suppresses modality inconsistency in low-visibility conditions through cross-modal attention and residual enhancement; a Multi-Modal Temporal Fusion Module (TFM) that encodes radar temporal sequences using a Transformer encoder for motion continuity modeling; and a confidence-aware multi-task loss that jointly supervises semantic segmentation, motion estimation, and object detection. Extensive experiments on the DualRadar dataset and adverse-weather simulations demonstrate that our method achieves significant gains over state-of-the-art baselines in BEV segmentation accuracy, detection robustness, and motion stability. The proposed framework offers a scalable and resilient solution for real-world autonomous perception, especially under challenging environmental conditions.
Fine-Grained Image Generation Network With Radar Range Profiles Using Cross-Modal Visual Supervision
Electromagnetic imaging methods mainly utilize converted sampling, dimensional transformation, and coherent processing to obtain spatial images of targets, which often suffer from accuracy and efficiency problems. Deep neural network (DNN)-based high-resolution imaging methods have achieved impressive results in improving resolution and reducing computational costs. However, previous works exploit single modality information from electromagnetic data; thus, the performances are limited. In this article, we propose an electromagnetic image generation network (EMIG-Net), which translates electromagnetic data of multiview 1-D range profiles (1DRPs), directly into bird-view 2-D high-resolution images under cross-modal supervision. We construct an adversarial generative framework with visual images as supervision to significantly improve the imaging accuracy. Moreover, the network structure is carefully designed to optimize computational efficiency. Experiments on self-built synthetic data and experimental data in the anechoic chamber show that our network has the ability to generate high-resolution images, whose visual quality is superior to that of traditional imaging methods and DNN-based methods, while consuming less computational cost. Compared with the backprojection (BP) algorithm, the EMIG-Net gains a significant improvement in entropy (72%), peak signal-to-noise ratio (PSNR; 150%), and structural similarity (SSIM; 153%). Our work shows the broad prospects of deep learning in radar data representation and high-resolution imaging and provides a path for researching electromagnetic imaging based on learning theory.
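As a reference for the reported image-quality figures, the following minimal sketch computes the peak signal-to-noise ratio using its standard definition, 10·log10(peak² / MSE). How the paper normalizes its images before computing PSNR is not stated in the abstract, so the peak value of 1.0 and the tiny test images are assumptions.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    flat_ref = [p for row in reference for p in row]
    flat_est = [p for row in estimate for p in row]
    if len(flat_ref) != len(flat_est) or not flat_ref:
        raise ValueError("images must be non-empty and equally sized")
    mse = sum((r - e) ** 2 for r, e in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Made-up 2x2 images: every pixel of the estimate is off by 0.1.
reference_image = [[0.0, 1.0], [1.0, 0.0]]
estimated_image = [[0.1, 0.9], [0.9, 0.1]]
quality_db = psnr(reference_image, estimated_image)
```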
As the electromagnetic environment has become increasingly complex in near-Earth orbital space, severe threats are posed to the link stability of space-based satellite communication systems. It is imperative to develop effective signal recognition technologies for complex electromagnetic environments as a basis for cognitive anti-jamming methods. While existing deep learning approaches for signal recognition predominantly rely on single-modal inputs, this paper proposes CA3M-Former, a novel attention-based multi-modal framework that synergistically integrates I/Q sequences, A/P sequences, and spectrograms to exploit complementary time-frequency features. The model introduces two key innovations: (1) a cross-attention fusion module to dynamically align and combine tri-modal features while preserving temporal coherence, and (2) a cascaded temporal-spatial attention mechanism in the encoder to capture global contextual dependencies. Performance evaluations show that the proposed method outperforms its counterparts in terms of recognition accuracy under various SNRs.
Recognizing jamming signals is important for downstream tasks. Although deep learning has brought great improvements, existing methods need large numbers of labeled signals, a requirement that is difficult to meet in practical applications. To solve this problem, a cross-modality contrastive learning method is proposed in this article. A signalwise hierarchy and an imagewise hierarchy are presented to learn features from the IQ tensor and the TF image, respectively. The cross-domain features are then aggregated and delivered to a learning architecture composed of pretraining and fine-tuning stages. Unlabeled signals are first used to pretrain a base model, which is optimized with a similarity loss that makes the positive more similar to the signal than the negative. The pretrained model is then fine-tuned with a small number of labeled signals to perform the recognition task. In this way, labeled and unlabeled signals are jointly exploited, and two different kinds of modal data are unified into a single framework. Multiple rounds of experiments show the superiority of the proposed method over standard and recent techniques.
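The similarity loss described above pulls a positive sample closer to the anchor signal than the negatives. One common way to express that idea is an InfoNCE-style loss over cosine similarities, sketched below. The choice of InfoNCE, the `temperature` value, and the function names are assumptions for illustration; the abstract does not give the paper's actual loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the positive is the closest candidate."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]
```

In a cross-modal setting, the anchor could be an embedding of the IQ tensor and the positive an embedding of the TF image of the same signal, with embeddings of other signals acting as negatives.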
Although self-supervised learning methods have shown promising progress in addressing the issue of scarce labeled data in automatic modulation classification, they remain constrained by heavy reliance on extensive negative samples and an inability to effectively capture inter-modal feature correlations. To overcome these limitations, we propose a novel self-supervised automatic modulation classification algorithm based on multi-path embedding prediction, termed CAEP. In CAEP, the raw signal is first dynamically segmented into current and future sub-series. Then, dedicated encoders are utilized to extract embeddings for both sub-series and leverage current information to predict future states, while randomly masking the corresponding time–frequency images transformed from the time-domain signal to predict the obscured spectral components. Furthermore, latent temporal embeddings are deployed to predict information within the time–frequency domain to achieve cross-modal retrieval. Finally, a classification head is connected alongside a temporal modal encoder, which is fine-tuned using a limited set of labeled samples to accomplish modulation classification. Experimental results on two benchmark datasets demonstrate that the proposed method achieves robust performance across varying noise conditions.
… both the mechanical and electromagnetic systems of the motors. … This paper proposed a novel cross-modal vector fusion fault … layer with an improved machine learning classifier, so that …
… This study proposes a cross-modal deep learning model that merges channel state information (… A gated cross-modal attention mechanism is then utilized for feature-level fusion, and a …
During percutaneous coronary intervention, conventional 2D X-ray imaging lacks depth information, making it difficult for clinicians to determine the 3D position of the guidewire. While some recent approaches incorporate micro-sensors to assist with pose estimation, many rely on implanted electromagnetic sensors, which can introduce additional clinical risks. In this paper, we present a non-invasive alternative that uses an external 3-axis electronic magnetometer array. We further propose a Local-Global Magneto-Visual Network framework (LG-MagNet) that fuses magnetic field information with image data to enable precise 3D pose estimation of the guidewire. Specifically, we first apply a shared encoder for cross-modal feature fusion. Then we employ convolutional operations that integrate local and global features. Finally, we utilize a lightweight prediction head for end-to-end depth regression. We constructed experimental equipment and collected a clinical simulation dataset. Results show a root mean square error (RMSE) of 0.797 ± 0.095 mm for depth prediction along the Z-axis and an overall RMSE of 1.216 ± 0.072 mm for 3D guidewire shape reconstruction. Quantitative analysis indicates that fusing external magnetometer data with 2D imaging improves pose estimation stability, particularly in regions with curvature.
Recent years have witnessed an increasing demand for human fall detection systems. Among all existing methods, Wi-Fi-based fall detection has become one of the most promising solutions due to its pervasiveness. However, when applied to a new domain, existing Wi-Fi-based solutions suffer from severe performance degradation caused by low generalizability. In this paper, we propose XFall, a domain-adaptive fall detection system based on Wi-Fi. XFall overcomes the generalization problem from three aspects. To advance cross-environment sensing, XFall exploits an environment-independent feature called speed distribution profile, which is irrelevant to indoor layout and device deployment. To ensure sensitivity across all fall types, an attention-based encoder is designed to extract the general fall representation by associating both the spatial and temporal dimensions of the input. To train a large model with limited amounts of Wi-Fi data, we design a cross-modal learning framework, adopting a pre-trained visual model for supervision during the training process. We implement and evaluate XFall on one of the latest commercial wireless products through a year-long deployment in real-world settings. The result shows XFall achieves an overall accuracy of 96.8%, with a miss alarm rate of 3.1% and a false alarm rate of 3.3%, outperforming the state-of-the-art solutions in both in-domain and cross-domain evaluation.
Narrowband and wideband waveforms are usually adopted simultaneously during the observation of micro-motion space targets by inverse synthetic aperture radar (ISAR), which can collect rich multimodal information in the time-Doppler, time-range, and range-instantaneous-Doppler (RID) domains. In order to exploit the electromagnetic scattering, shape, structure, and motion characteristics, this article proposes an attention-augmented cross-modal feature fusion recognition network (ACM-FR Net). First, the ACM-FR Net adopts a convolutional neural network (CNN) to extract initial feature vectors from joint time–frequency (JTF) image, high-resolution range profiles (HRRPs), and RID image. Then, it transforms the feature vectors of the three modalities into feature sequences. Finally, it achieves interactive feature fusion by implementing ACM feature fusion. In the four-category micro-motion space target recognition experiments, the proposed ACM-FR Net has demonstrated high accuracy and noise robustness.
With the rapid development of radar jamming systems, especially digital radio frequency memory (DRFM), the electromagnetic environment has become increasingly complex. In recent years, most existing studies have focused solely on either jamming recognition or anti-jamming strategy design. In this paper, we propose a unified framework that integrates interference recognition with intelligent anti-jamming strategy selection. Specifically, time-frequency (TF) features of radar echoes are first extracted using both Short-Time Fourier Transform (STFT) and Smoothed Pseudo Wigner-Ville Distribution (SPWVD). A feature fusion method is then designed to effectively combine these two types of time-frequency representations. The fused TF features are further combined with time-domain features of the radar echoes through a cross-modal fusion module based on an attention mechanism. Finally, the recognition results, together with information obtained from the passive radar, are fed into a Deep Q-Network (DQN)-based intelligent anti-jamming strategy network to select jamming suppression waveforms. The key jamming parameters obtained by the passive radar provide essential information for intelligent decision-making, enabling the generation of more effective strategies tailored to specific jamming types. The proposed method demonstrates improvements in both jamming type recognition accuracy and the stability of anti-jamming strategy selection under complex environments. Experimental results show that our method achieves superior performance compared to Support Vector Machines (SVM), VGG-16, and 2D-CNN methods, with respective improvements of 1.41%, 2.5%, and 14.51% in overall accuracy. Moreover, in comparison with the SARSA algorithm, the designed algorithm achieves faster reward convergence and more stable strategy generation.
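The comparison between the DQN-based strategy network and SARSA above comes down to how each method forms its learning target. The tabular sketch below shows the two targets side by side. The function names, `gamma`, and `lr` are illustrative assumptions, and a real DQN replaces the table with a neural network, which is not shown here.

```python
def q_learning_target(reward, next_q_values, gamma=0.9, done=False):
    """Off-policy target used by Q-learning and DQN: bootstrap from the best next action."""
    return reward if done else reward + gamma * max(next_q_values)

def sarsa_target(reward, next_q_values, next_action, gamma=0.9, done=False):
    """On-policy target used by SARSA: bootstrap from the action actually taken next."""
    return reward if done else reward + gamma * next_q_values[next_action]

def td_update(q_value, target, lr=0.1):
    """Move the current estimate a small step toward the target."""
    return q_value + lr * (target - q_value)
```

In the setting of the abstract, each action index could stand for one candidate jamming-suppression waveform. That mapping is assumed here and is not taken from the paper.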
Reliable fault diagnosis of permanent magnet synchronous motors is critical, yet multimodal models using vibration and current signals often fail under industrial noise and data scarcity. The fundamental challenge is the inability of models to bridge the “semantic gap” between the distinct physical domains of mechanical dynamics (vibration) and electromagnetic principles (current). This article proposes a Physics-Synergized Bidirectional Network (PSB-Net) that injects domain knowledge to solve this gap. The core innovation is a two-stage, physics-guided approach: First, raw signals are encoded into two-dimensional feature maps with clear physical meaning and high noise immunity [cyclic spectral coherence (CSC) and spectral harmonic coherence (SHC)]. Second, a novel attention and fusion architecture (PSSAM and BCMIN) is specifically designed to interpret and fuse these physics-based maps. This approach ensures the model learns physically consistent correlations rather than spurious statistical ones. Experimental validation confirms PSB-Net’s superiority, achieving 99.46% peak accuracy. It demonstrates exceptional robustness, maintaining 92.95% accuracy under severe noise (SNR = 1) and 90.36% in data-scarce scenarios (4.0 testing-to-training ratio). Its generalization capability across six distinct domains proves its effectiveness for industrial applications.
In recent years, with the rise of deep learning, combining time-frequency analysis with deep learning to recognize radar signals has become a hot research topic. For the application of deep learning in radar signal recognition, however, the discovery of adversarial examples poses a tremendous security risk. Experiments indicate that radar signal recognition models based on time-frequency images are less vulnerable to adversarial attack methods that operate in the time domain. Therefore, we propose a cross-modal attack (CMA). Firstly, we establish a surrogate model architecture locally, including three parts: time-frequency analysis, data quantization, and classifier. Secondly, we train this architecture as a whole and generate adversarial examples utilizing the trained surrogate model architecture parameters and adversarial attack methods. Finally, we carry out the CMA on the radar signal recognition model based on the time-frequency image by adding adversarial perturbations to the original signal. According to experimental results, the CMA can reduce the model recognition accuracy by more than 30%, demonstrating good attack performance, when the perturbation strength is 0.1 and the signal-to-noise ratio is 0 dB.
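To make the idea of adding an adversarial perturbation to the original signal concrete, the sketch below applies a fast gradient sign method (FGSM) step to a toy logistic-regression surrogate that works directly on time-domain samples. FGSM, the linear surrogate, and all names are assumptions for illustration. The abstract does not name the attack methods used, and the actual CMA back-propagates through a time-frequency analysis stage that is omitted here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression surrogate."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(g):
    return (g > 0) - (g < 0)

def fgsm(w, b, x, y, eps=0.1):
    """One FGSM step: move each sample by eps in the direction that raises the loss.

    For cross-entropy loss on a logistic model, dLoss/dx = (p - y) * w.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]
```

Here `eps` plays the role of the perturbation strength, since no sample moves by more than `eps`.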
In complex electromagnetic environments, effectively identifying interference sources, rapidly resolving spectrum conflicts, and improving resource utilization are key challenges in intelligent spectrum cognition and management. This paper proposes a cognition and conflict resolution method based on MC-Trans. By constructing a spatiotemporal-spectrum-protocol trimodal fusion architecture, a causal reasoning module is designed to decouple the causal relationship between interference sources and normal signals. A dynamic weight adjustment mechanism is introduced in the conflict resolution layer, and the confidence level of each modality is calculated based on the attention graph. When a spectrum occupancy conflict is detected, the primary interference source is determined through causal graph reasoning, and a reinforcement learning strategy is used to dynamically allocate spectrum resources. Experimental validation using the WiSig dataset shows that MC-Trans achieves a situation misjudgment rate of only 12.9% in a densely populated scenario with 50 devices, a stable response time of less than 58.3ms, and a spectrum utilization rate of 78.1%, providing a new paradigm for intelligent decision-making in complex electromagnetic environments.
… RF-based multimodal framework that relies on limited labeled RF data and performs effectively in RF … capability, enabling it to effectively integrate and learn from diverse RF modalities. …
Object identification is a pivotal enabling technique for smart home and manufacturing applications. Traditional methodologies for object identification predominantly rely on a single sensor modality, which inherently limits their ability to furnish a detailed characterization of the target object. To address this deficiency, we introduce CRFusion, the first-of-its-kind system that integrates the object RGB image and the radio frequency (RF) signal reflected by the object for fine-grained object identification. CRFusion leverages the complementary characteristics of the visible light and radio frequency modalities to simultaneously determine the category and material of target objects. We design a multifaceted object feature from the RF signal, called the Energy Reflection Factor (ERF), which not only reveals the object texture but also complements the image modality for identifying the object category. By integrating the characteristics of radar, we obtain radar feature maps based on the ERF of target objects. Additionally, we have developed a modality fusion network to comprehensively integrate the image and ERF features. We conducted a comprehensive evaluation of CRFusion using a commercial mmWave radar development board and camera. The results show that CRFusion achieves a classification accuracy of over 96%, demonstrating its robustness and potential for application.
A systematic survey of the state-of-the-art on the systems merging radio frequency identification (RFID) and computer vision (CV) is reported in this article. This review is structured on the basis of the main application contexts in which these systems have been proposed: 1) inventory; 2) augmented reality (AR); 3) perception of human activity and human computer interaction (HCI); 4) robotics; 5) tracking of generic tagged targets; 6) assistance to elderly adults or people with disabilities; 7) electronic article surveillance (EAS); and 8) medical and veterinary research. The presented survey aims to summarize the algorithmic features of the existing approaches, and highlight the solved challenges and the limits of the available solutions.
Remote physiology, which involves monitoring vital signs without the need for physical contact, has great potential for various applications. Current remote physiology methods rely only on a single camera or radio frequency (RF) sensor to capture the microscopic signatures from vital movements. However, our study shows that fusing deep RGB and RF features from both sensor streams can further improve performance. Because these multimodal features are defined in distinct dimensions and have varying contextual importance, the main challenge in the fusion process lies in the effective alignment of them and adaptive integration of features under dynamic scenarios. To address this challenge, we propose a novel vital sensing model, named Fusion-Vital, that combines the RGB and RF modalities through the new introduction of pairwise input formats and transformer-based fusion strategies. We also perform comprehensive experiments based on a newly collected and released remote vital dataset comprising synchronized video-RF sensors, showing the superiority of the fusion approach over the previous single-sensor baselines in various aspects.
… To close this gap, we present a radar-vision multimodal fusion framework that integrates … The framework makes three principal contributions at the system-integration level. First, a …
We present RFSensingGPT, an integrated framework for radio frequency (RF) sensing that combines technical question-answering, code retrieval, and spectrogram analysis through retrieval-augmented generation (RAG). Our framework addresses the fundamental challenge of applying large language models to RF sensing applications, where specialized domain knowledge is underrepresented in general training corpora. The system leverages a filtered RedPajama dataset containing RF-relevant technical documents, processed through a hybrid retrieval mechanism that combines vector-based similarity search with best match (BM25)-based query fusion. Performance evaluation using document collections ranging from 5K to 80K demonstrates that RAG consistently maintains superior faithfulness across all dataset sizes (0.9033-0.9779 vs 0.8162-0.8506, average improvement of 13.0%) compared to baseline LLM implementations. Our hierarchical chunking approach using MarkdownHeaderTextSplitter achieves optimal precision (0.31-0.32) at lower k-values while maintaining correctness scores of 4.0-5.0. The framework integrates CLIP-based vision models for RF pattern recognition, achieving 93.23% accuracy in radar data analysis tasks. Implementation benchmarks show efficient processing with minimal GPU memory requirements (0.66GB) even at scale. Through a comprehensive evaluation of the embedding models, RFSensingGPT establishes a new benchmark for technical query understanding and RF spectrogram analysis in the emerging field of integrated sensing and communications systems for 6G networks.
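A hybrid retriever like the one described above has to merge a ranked list from vector similarity search with a ranked list from BM25. One widely used merging rule is reciprocal rank fusion (RRF), sketched below. RRF, the constant `k=60`, and the function name are assumptions for illustration; the abstract does not say how RFSensingGPT combines the two result lists.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it.
    Ties are broken by document id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, fusing a vector ranking `["a", "b", "c"]` with a BM25 ranking `["b", "c", "d"]` puts `"b"` first, because it ranks highly in both lists.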
Multimodal fusion-based methods are a research hotspot for Automatic Modulation Recognition (AMR). But the existing methods primarily emphasize information integration and neglect the balance between the modalities. This letter proposes a novel Contrastive Learning-based Multimodal Fusion (CLMF) model, which integrates both signals and key features to enhance AMR. To obtain adequate signal representations, a contrastive learning architecture is proposed to learn the meaningful representations from the multimodal fusion data, and a Multi-Layer Perceptron (MLP) is incorporated for precise signal classification. Moreover, a threshold discrimination disturbance strategy is designed to balance the information conflicts arising from the two modalities. The experiments demonstrate the efficiency of the CLMF model for AMR on the public dataset.
In order to clear the world of the threat posed by landmines and other explosive devices, robotic systems can play an important role. However, the development of such field robots that need to operate in hazardous conditions requires the careful consideration of multiple aspects related to the perception, mobility, and collaboration capabilities of the system. In the framework of a European challenge, the Artificial Intelligence for Detection of Explosive Devices - eXtended (AIDEDeX) project proposes to design a heterogeneous multi-robot system with advanced sensor fusion algorithms. This system is specifically designed to detect and classify improvised explosive devices, explosive ordnances, and landmines. This project integrates specialised sensors, including electromagnetic induction, ground penetrating radar, X-Ray backscatter imaging, Raman spectrometers, and multimodal cameras, to achieve comprehensive threat identification and localisation. The proposed system comprises a fleet of unmanned ground vehicles and unmanned aerial vehicles. This article details the operational phases of the AIDEDeX system, from rapid terrain exploration using unmanned aerial vehicles to specialised detection and classification by unmanned ground vehicles equipped with a robotic manipulator. Initially focusing on a centralised approach, the project will also explore the potential of a decentralised control architecture, taking inspiration from swarm robotics to provide a robust, adaptable, and scalable solution for explosive detection.
Robust and accurate perception of dynamic objects and map elements is crucial for autonomous vehicles performing safe navigation in complex traffic scenarios. While vision-only methods have become the de facto standard due to their technical advances, they can benefit from effective and cost-efficient fusion with radar measurements. In this work, we advance fusion methods by repurposing Gaussian Splatting as an efficient universal view transformer that bridges the view disparity gap, mapping both image pixels and radar points into a common Bird's-Eye View (BEV) representation. Our main contribution is GaussianCaR, an end-to-end network for BEV segmentation that, unlike prior BEV fusion methods, leverages Gaussian Splatting to map raw sensor information into latent features for efficient camera-radar fusion. Our architecture combines multi-scale fusion with a transformer decoder to efficiently extract BEV features. Experimental results demonstrate that our approach achieves performance on par with, or even surpassing, the state of the art on BEV segmentation tasks (57.3%, 82.9%, and 50.1% IoU for vehicles, roads, and lane dividers) on the nuScenes dataset, while maintaining a 3.2x faster inference runtime. Code and project page are available online.
Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
Accurate perception of UAVs in complex low-altitude environments is critical for airspace security and related intelligent systems. Developing reliable solutions requires large-scale, accurately annotated, and multimodal data. However, real-world UAV data collection faces inherent constraints due to airspace regulations, privacy concerns, and environmental variability, while manual annotation of 3D poses and cross-modal correspondences is time-consuming and costly. To overcome these challenges, we introduce UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It comprises 400K synchronized frames across diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions (clear, cloudy, rainy, foggy), featuring multiple UAV models (micro, small, medium-sized) and five modalities: RGB, IR, LiDAR, Radar, and DVS (Dynamic Vision Sensor). Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations, enabling core UAV-related tasks such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. We further propose LGFusionNet, a LiDAR-guided multimodal fusion baseline, and a dedicated UAV trajectory prediction baseline to facilitate benchmarking. With its controllable simulation environment, comprehensive scenario coverage, and rich annotations, UAV-MM3D offers a public benchmark for advancing 3D perception of UAVs.
With the rapid advancement of autonomous driving technology, there is a growing need for enhanced safety and efficiency in the automatic environmental perception of vehicles during their operation. In modern vehicle setups, cameras and mmWave radar (radar), the most extensively employed sensors, have complementary characteristics that make them well suited to fusion and help achieve both robust performance and cost-effectiveness. This paper presents a comprehensive survey of radar-vision (RV) fusion based on deep learning methods for 3D object detection in autonomous driving. We offer a comprehensive overview of each RV fusion category, specifically those employing region of interest (ROI) fusion and end-to-end fusion strategies. Because end-to-end fusion is the most promising fusion strategy at present, we provide a deeper classification of end-to-end fusion methods, including 3D bounding box prediction-based and BEV-based approaches. Moreover, aligning with recent advancements, we delineate the latest information on 4D radar and its cutting-edge applications in autonomous vehicles (AVs). Finally, we present the possible future trends of RV fusion and summarize this paper.
Urban water-surface robust perception serves as the foundation for intelligent monitoring of aquatic environments and the autonomous navigation and operation of unmanned vessels, especially in the context of waterway safety. It is worth noting that current multi-sensor fusion and multi-task learning models consume substantial power and heavily rely on high-power GPUs for inference. This contributes to increased carbon emissions, a concern that runs counter to the prevailing emphasis on environmental preservation and the pursuit of sustainable, low-carbon urban environments. In light of these concerns, this paper concentrates on low-power, lightweight, multi-task panoptic perception through the fusion of visual and 4D radar data, which is seen as a promising low-cost perception method. We propose a framework named Achelous++ that facilitates the development and comprehensive evaluation of multi-task water-surface panoptic perception models. Achelous++ can simultaneously execute five perception tasks with high speed and low power consumption, including object detection, object semantic segmentation, drivable-area segmentation, waterline segmentation, and radar point cloud semantic segmentation. Furthermore, to meet the demand for developers to customize models for real-time inference on low-performance devices, a novel multi-modal pruning strategy known as Heterogeneous-Aware SynFlow (HA-SynFlow) is proposed. Besides, Achelous++ also supports random pruning at initialization with different layer-wise sparsity, such as Uniform and Erdos-Renyi-Kernel (ERK). Overall, our Achelous++ framework achieves state-of-the-art performance on the WaterScenes benchmark, excelling in both accuracy and power efficiency compared to other single-task and multi-task models. We release and maintain the code at https://github.com/GuanRunwei/Achelous.
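The Erdos-Renyi-Kernel (ERK) option mentioned above assigns a different density to each layer when pruning at initialization. The sketch below follows the ERK rule as it is commonly described in the sparse-training literature: a convolution layer's density is proportional to `(out + in + kh + kw) / (out * in * kh * kw)`, rescaled to hit a global budget, with layers that would exceed density 1 kept fully dense. The function name and layer shapes are hypothetical, and this is not taken from the Achelous++ code base.

```python
def erk_densities(layer_shapes, target_density):
    """Per-layer densities under an ERK-style allocation.

    layer_shapes: list of (out_channels, in_channels, kernel_h, kernel_w).
    target_density: fraction of all weights to keep, strictly between 0 and 1.
    """
    if not 0.0 < target_density < 1.0:
        raise ValueError("target_density must be strictly between 0 and 1")
    sizes = [o * i * kh * kw for o, i, kh, kw in layer_shapes]
    raw = [(o + i + kh + kw) / (o * i * kh * kw) for o, i, kh, kw in layer_shapes]
    dense = set()  # layers forced to density 1.0
    while True:
        # Weights still available after paying for the fully dense layers.
        budget = target_density * sum(sizes) - sum(sizes[j] for j in dense)
        divisor = sum(raw[j] * sizes[j] for j in range(len(sizes)) if j not in dense)
        scale = budget / divisor
        over = [j for j in range(len(sizes)) if j not in dense and scale * raw[j] > 1.0]
        if not over:
            break
        dense.update(over)
    return [1.0 if j in dense else scale * raw[j] for j in range(len(sizes))]
```

Small layers end up denser than large ones. The Uniform alternative would simply return `target_density` for every layer.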
Most automated driving systems comprise a diverse sensor set, including several cameras, radars, and LiDARs, ensuring complete 360° coverage in near and far regions. Unlike radar and LiDAR, which measure directly in 3D, cameras capture a 2D perspective projection with inherent depth ambiguity. However, it is essential to produce perception outputs in 3D to enable spatial reasoning about other agents and structures for optimal path planning. The 3D space is typically simplified to the bird's-eye view (BEV) space by omitting the less relevant Z-coordinate, which corresponds to the height dimension. The most basic approach to obtaining the desired BEV representation from a camera image is inverse perspective mapping (IPM), which assumes a flat ground surface. Surround vision systems, which are common in new vehicles, use the IPM principle to generate a BEV image and show it to the driver on a display. However, this approach is not suited to autonomous driving because this overly simplistic transformation introduces severe distortions. More recent approaches use deep neural networks to output directly in BEV space. These methods transform camera images into BEV space using geometric constraints implicitly or explicitly in the network. Because a CNN has more context information, and a learnable transformation can be more flexible and adapt to image content, deep learning-based methods set the new benchmark for BEV transformation and achieve state-of-the-art performance. First, this chapter discusses the contemporary trends of multi-camera-based deep neural network (DNN) models that output object representations directly in the BEV space. Then, we discuss how this approach can extend to effective sensor fusion and to coupling with downstream tasks such as situation analysis and prediction. Finally, we show challenges and open problems in BEV perception.
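The flat-ground assumption behind inverse perspective mapping can be written as a ray-plane intersection. The sketch below maps one pixel to a ground-plane point for a pinhole camera mounted at a known height and pitched downward. The function name, the parameter names, and the axis conventions (camera x right, y down, z forward; ground frame X right, Y forward) are assumptions chosen for this illustration.

```python
import math

def pixel_to_ground(u, v, fx, fy, cx, cy, cam_height, pitch_rad=0.0):
    """Project pixel (u, v) onto a flat ground plane.

    Returns (X, Y) in metres in a frame below the camera, or None when the
    pixel's viewing ray does not hit the ground (at or above the horizon).
    """
    # Ray direction in the camera frame for a pinhole model.
    dx = (u - cx) / fx
    dy = (v - cy) / fy
    s, c = math.sin(pitch_rad), math.cos(pitch_rad)
    # Downward component of the ray after applying the camera pitch.
    descent = dy * c + s
    if descent <= 0:
        return None
    t = cam_height / descent  # scale at which the ray reaches the ground
    return t * dx, t * (c - dy * s)
```

Anything that is not on the ground plane, such as a vehicle body or a pedestrian, is mapped as if it were painted on the road. That is the source of the distortions the abstract refers to.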
Increasing attention is being paid to millimeter-wave (mmWave), 30 GHz to 300 GHz, and terahertz (THz), 300 GHz to 10 THz, sensing applications including security sensing, industrial packaging, medical imaging, and non-destructive testing. Traditional methods for perception and imaging are challenged by novel data-driven algorithms that offer improved resolution, localization, and detection rates. Over the past decade, deep learning technology has garnered substantial popularity, particularly in perception and computer vision applications. Whereas conventional signal processing techniques are more easily generalized to various applications, hybrid approaches where signal processing and learning-based algorithms are interleaved pose a promising compromise between performance and generalizability. Furthermore, such hybrid algorithms improve model training by leveraging the known characteristics of radio frequency (RF) waveforms, thus yielding more efficiently trained deep learning algorithms and offering higher performance than conventional methods. This dissertation introduces novel hybrid-learning algorithms for improved mmWave imaging systems applicable to a host of problems in perception and sensing. Various problem spaces are explored, including static and dynamic gesture classification; precise hand localization for human computer interaction; high-resolution near-field mmWave imaging using forward synthetic aperture radar (SAR); SAR under irregular scanning geometries; mmWave image super-resolution using deep neural network (DNN) and Vision Transformer (ViT) architectures; and data-level multiband radar fusion using a novel hybrid-learning architecture. Furthermore, we introduce several novel approaches for deep learning model training and dataset synthesis.
Artificial intelligence is a key enabler for next-generation wireless communication and sensing. Yet, today's learning-based wireless techniques do not generalize well: most models are task-specific, environment-dependent, and limited to narrow sensing modalities, requiring costly retraining when deployed in new scenarios. This work introduces a task-agnostic, multi-modal foundational model for physical-layer wireless systems that learns transferable, physics-aware representations across heterogeneous modalities, enabling robust generalization across tasks and environments. Our framework employs a physics-guided self-supervised pretraining strategy incorporating a dedicated physical token to capture cross-modal physical correspondences governed by electromagnetic propagation. The learned representations enable efficient adaptation to diverse downstream tasks, including massive multi-antenna optimization, wireless channel estimation, and device localization, using limited labeled data. Our extensive evaluations demonstrate superior generalization, robustness to deployment shifts, and reduced data requirements compared to task-specific baselines.
Deep understanding of electromagnetic signals is fundamental to dynamic spectrum management, intelligent transportation, autonomous driving, and unmanned vehicle perception. The field faces challenges because electromagnetic signals differ greatly from text and images: they show high heterogeneity, strong background noise, and complex joint time-frequency structure, which prevents existing general models from being used directly. Electromagnetic communication and sensing tasks are diverse, current methods lack cross-task generalization and transfer efficiency, and the scarcity of large, high-quality datasets blocks the creation of a truly general multitask learning framework. To overcome these issues, we introduce EMind, an electromagnetic signal foundation model that bridges large-scale pretraining and the unique nature of this modality. We build the first unified and largest standardized electromagnetic signal dataset covering multiple signal types and tasks. By exploiting the physical properties of electromagnetic signals, we devise a length-adaptive multi-signal packing method and a hardware-aware training strategy that enable efficient use and representation learning from heterogeneous multi-source signals. Experiments show that EMind achieves strong performance and broad generalization across many downstream tasks, moving decisively from task-specific models to a unified framework for electromagnetic intelligence. The code is available at: https://github.com/GabrielleTse/EMind.
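The EMind abstract above mentions a length-adaptive multi-signal packing method without giving its details. As a rough illustration of the general idea of packing variable-length signals into fixed-length training sequences, the sketch below uses first-fit-decreasing bin packing. This is a stand-in technique chosen for illustration, not the method of the EMind authors; the function name `pack_signals`, the `capacity` parameter, and the example lengths are all hypothetical.

```python
from typing import List


def pack_signals(lengths: List[int], capacity: int) -> List[List[int]]:
    """Group signal indices into bins whose total length fits `capacity`.

    Uses first-fit-decreasing bin packing (an assumption for illustration,
    not the packing method described by the EMind authors).
    """
    if any(n > capacity for n in lengths):
        raise ValueError("a signal exceeds the capacity; it would need chunking")

    # Visit signals from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    bins: List[List[int]] = []  # signal indices assigned to each packed sequence
    free: List[int] = []        # remaining room in each packed sequence

    for i in order:
        for b, room in enumerate(free):
            if lengths[i] <= room:
                # First existing bin with enough room takes the signal.
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([i])
            free.append(capacity - lengths[i])
    return bins


# Hypothetical sample counts for five signals of different lengths.
lengths = [700, 300, 512, 200, 100]
bins = pack_signals(lengths, capacity=1024)
print(bins)
```

Packing of this kind reduces padding compared with padding every signal to the longest one; a real pipeline would also need an attention mask so that signals sharing a packed sequence do not attend to each other.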
Traditional single-modal sensing systems, based solely on either radio frequency (RF) or visual data, struggle to cope with the demands of complex and dynamic environments. Furthermore, single-device systems are constrained by limited perspectives and insufficient spatial coverage, which impairs their effectiveness in urban or non-line-of-sight scenarios. To overcome these challenges, we propose a novel large language model (LLM)-driven distributed integrated multimodal sensing and semantic communication (LLM-DiSAC) framework. Specifically, our system consists of multiple collaborative sensing devices equipped with RF and camera modules, working together with an aggregation center to enhance sensing accuracy. First, on sensing devices, LLM-DiSAC develops an RF-vision fusion network (RVFN), which employs specialized feature extractors for RF and visual data, followed by a cross-attention module for effective multimodal integration. Second, an LLM-based semantic transmission network (LSTN) is proposed to enhance communication efficiency, where the LLM-based decoder leverages known channel parameters, such as transceiver distance and signal-to-noise ratio (SNR), to mitigate semantic distortion. Third, at the aggregation center, a transformer-based aggregation model (TRAM) with an adaptive aggregation attention mechanism is developed to fuse distributed features and enhance sensing accuracy. To preserve data privacy, a two-stage distributed learning strategy is introduced, allowing local model training at the device level and centralized aggregation model training using intermediate features. Finally, evaluations on a synthetic multi-view RF-visual dataset generated by the Genesis simulation engine show that LLM-DiSAC achieves good performance.
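The LLM-DiSAC abstract above describes an RF-vision fusion network in which modality-specific features are combined by a cross-attention module. The sketch below shows generic single-head scaled dot-product cross-attention in NumPy, with RF tokens as queries and visual tokens as keys and values. It is a minimal sketch of the general mechanism, not the RVFN architecture itself: the token counts, feature dimensions, random projection weights, and the choice of RF as the query side are all assumptions made for illustration.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(rf_tokens, vis_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: RF tokens query the visual tokens.

    rf_tokens:  (n_rf, d_rf)   features from a hypothetical RF extractor
    vis_tokens: (n_vis, d_vis) features from a hypothetical visual extractor
    Returns fused features (n_rf, d) and attention weights (n_rf, n_vis).
    """
    q = rf_tokens @ w_q    # (n_rf, d)
    k = vis_tokens @ w_k   # (n_vis, d)
    v = vis_tokens @ w_v   # (n_vis, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product similarity
    weights = softmax(scores, axis=-1)       # each RF token's weights sum to 1
    return weights @ v, weights


# Assumed shapes: 4 RF tokens of width 16, 9 visual tokens of width 32.
rng = np.random.default_rng(0)
rf = rng.standard_normal((4, 16))
vis = rng.standard_normal((9, 32))
w_q = rng.standard_normal((16, 8)) * 0.1  # random stand-ins for learned weights
w_k = rng.standard_normal((32, 8)) * 0.1
w_v = rng.standard_normal((32, 8)) * 0.1

fused, weights = cross_attention(rf, vis, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` indicates how strongly one RF token draws on each visual token, so the fused output keeps the RF token count while carrying visual information.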
A non-experimental approach to developing high-performance EMI shielding materials is urgently needed to reduce costs and manpower. In this investigation, a multimodal data fusion …
Multiple types of sensory information are detected and integrated to improve perceptual accuracy and sensitivity in biological cognition. However, current studies on electronic skin (e-skin) systems have mainly focused on optimizing modality-specific data acquisition and processing. Endowing e-skins with multimodal sensing, and with perception capable of high-level perceptual behaviors, has been insufficiently explored. Moreover, progress in perception for multisensory e-skin systems faces challenges at both the device and software levels. Here, we provide a perspective on the multisensory fusion of e-skins. The recent progress in e-skins realizing multimodal sensing is reviewed, followed by bottom-up and top-down multimodal perception. With the deepening understanding of neuroscience and the rapid advance of novel algorithms and devices, multimodal perception becomes possible and will promote the development of highly intelligent e-skin systems.
… Chinese medicine, intelligent fusion imaging is achieved … , Laplacian pyramid for image fusion. We have proposed an … diagnostic methods with knowledge graph fusion, as well as the …
Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to analyze and reason about visual content effectively, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks. Code, dataset, and benchmark are available at https://github.com/Fancy-MLLM/R1-Onevision
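This report divides research on electromagnetic and multimodal perception into four core directions: environment perception for autonomous driving and unmanned systems; electromagnetic signal processing and integrated sensing and communication; precision sensing for industrial, medical, and physiological monitoring; and radar physical imaging and general multimodal representation. Research trends show the field evolving from traditional sensor fusion toward deep learning paradigms driven by physical mechanisms and toward embodied intelligence.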