Camera-Radar Non-Rigid Registration
Sensor Extrinsic Calibration and Spatio-Temporal Synchronization
This group covers the basis for establishing the initial association between camera and radar, including automatic extrinsic calibration algorithms, time-offset calibration, mirror-assisted calibration, and self-pose estimation for mobile platforms, with the aim of eliminating geometric misalignment and clock bias. A minimal projection-based sanity check for a candidate calibration is sketched after the reference list below.
- A Low-Cost Feature Matching-Based Temporal Registration Method for Radar and Camera(Xinyu Liu, Gui Zhang, Zhenmiao Deng, Y. Ye, 2025, 2025 IEEE International Radar Conference (RADAR))
- 360° Camera-LiDAR Spatio-Temporal Calibration by Sensor-wise Object Trajectories Alignment(Shingo Takebayashi, Ahmed Farid, G. Gebreyesus, Yuya Ieiri, Osamu Yoshie, 2025, 2025 7th International Conference on Control and Robotics (ICCR))
- A Novel Method of Spatial Calibration for Camera and 2D Radar Based on Registration(Chan-Ho Song, Guk-Jin Son, Hee-Earn Kim, Dongyeong Gu, Jin-Hee Lee, Youngduk Kim, 2017, 2017 6th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI))
- Joint Registration and Fusion of an Infrared Camera and Scanning Radar in a Maritime Context(D. Cormack, Isabel Schlangen, J. Hopgood, Daniel E. Clark, 2020, IEEE Transactions on Aerospace and Electronic Systems)
- Automatic Spatial Calibration of Near-Field MIMO Radar With Respect to Optical Depth Sensors(V. Wirth, Johanna Braunig, Danti Khouri, Florian Gutsche, M. Vossiek, Tim Weyrich, Marc Stamminger, 2024, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- ShanghaiTech Mapping Robot is All You Need: Robot System for Collecting Universal Ground Vehicle Datasets(Bowen Xu, Xiting Zhao, Delin Feng, Yuanyuan Yang, Sören Schwertfeger, 2024, ArXiv)
- Differentiable Targetless Radar-Camera Extrinsic Calibration Based on Detection Attributes(Ganlin Zhang, Lin Cao, Zongmin Zhao, Kehu Yang, Dongfeng Wang, Chong Fu, 2025, IEEE Transactions on Instrumentation and Measurement)
- SPECal: Spatial Calibration Based on Self-Pose-Estimation Between mmWave Radar and Camera(Yi Li, Lei Xie, Yu He, Jingyi Ning, Sanglu Lu, 2025, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies)
- Online Camera LiDAR Fusion and Object Detection on Hybrid Data for Autonomous Driving(Koyel Banerjee, Dominik Notz, J. Windelen, Sumanth Gavarraju, Mingkang He, 2018, 2018 IEEE Intelligent Vehicles Symposium (IV))
- 3DRadar2ThermalCalib: Accurate Extrinsic Calibration between a 3D mmWave Radar and a Thermal Camera Using a Spherical-Trihedral(Jun Zhang, Shini Zhang, Guohao Peng, H. Zhang, Danwei W. Wang, 2022, 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC))
- An Automatic Extrinsic Calibration Method for mmWave Radar and Camera in Traffic Environment(Yulin Wu, Junran Fan, Yuxuan Ma, Lihang Huang, Guolong Cui, Shisheng Guo, 2026, IEEE Transactions on Intelligent Transportation Systems)
- Research on key technologies of spatiotemporal matching of radar video heterogeneous data fusion for T-CPS(Xiuling Wei, M. Sejera, 2025, No journal)
- Artemis: Contour-Guided 3-D Sensing and Localization With mmWave Radar for Infrastructure-Assisted Autonomous Vehicles(Kaikai Deng, Ling Xing, Honghai Wu, Huahong Ma, Jianping Gao, Yue Ling, 2025, IEEE Internet of Things Journal)
- Warping of Radar Data Into Camera Image for Cross-Modal Supervision in Automotive Applications(Christopher Grimm, T. Fei, Ernst Warsitz, Ridha Farhoud, Tobias Breddermann, R. Haeb-Umbach, 2020, IEEE Transactions on Vehicular Technology)
- A Universal Framework for Extrinsic Calibration of Camera, Radar, and LiDAR(Sijie Hu, Alessandro Goldwurm, Martín Mujica, Sylvain Cadou, Frédéric Lerasle, 2026, IEEE Robotics and Automation Letters)
- Mirror-assisted calibration of a multi-modal sensing array with a ground penetrating radar and a camera(Chieh Chou, Shu-Hao Yeh, Dezhen Song, 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Automatic Spatial Calibration of Near-Field MIMO Radar With Respect to Optical Sensors(V. Wirth, J. Bräunig, Danti Khouri, Florian Gutsche, M. Vossiek, Tim Weyrich, Marc Stamminger, 2024, ArXiv)
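To make the geometric core of this group concrete, the following is a minimal sketch, not taken from any of the cited works, of how radar detections can be projected into the camera image under a candidate extrinsic calibration and scored by reprojection error against associated image detections. The intrinsics K, the extrinsic (R, t), the function names, and the synthetic data in the demo are illustrative assumptions.

```python
import numpy as np

def project_radar_to_image(radar_xyz, K, R, t):
    """Project radar points (N, 3), given in the radar frame, into pixels.

    K : (3, 3) camera intrinsics; R, t : rotation and translation taking the
    radar frame into the camera frame. Returns (N, 2) pixel coordinates and a
    boolean mask of points lying in front of the camera.
    """
    cam = radar_xyz @ R.T + t                 # radar frame -> camera frame
    in_front = cam[:, 2] > 1e-6
    uvw = cam @ K.T                           # homogeneous pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3]
    return uv, in_front

def mean_reprojection_error(radar_xyz, image_uv, K, R, t):
    """Average pixel distance between projected radar detections and their
    associated image detections, a simple score for a candidate extrinsic."""
    uv, ok = project_radar_to_image(radar_xyz, K, R, t)
    return float(np.mean(np.linalg.norm(uv[ok] - image_uv[ok], axis=1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = np.array([[900.0, 0.0, 640.0], [0.0, 900.0, 360.0], [0.0, 0.0, 1.0]])
    R, t = np.eye(3), np.array([0.0, -0.2, 0.1])            # assumed "true" extrinsics
    radar_pts = rng.uniform([-5.0, -1.0, 5.0], [5.0, 1.0, 40.0], size=(50, 3))
    uv_gt, _ = project_radar_to_image(radar_pts, K, R, t)
    print("error at true extrinsics :", mean_reprojection_error(radar_pts, uv_gt, K, R, t))
    print("error with 10 cm x-offset:", mean_reprojection_error(radar_pts, uv_gt, K, R, t + np.array([0.1, 0.0, 0.0])))
```

The targetless methods above differ mainly in how the radar/image correspondences are obtained (tracked objects, trajectories, corner reflectors), but most can be validated with exactly this kind of reprojection check.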
Non-Rigid Deformation Registration and Heterogeneous-Modality (SAR-Optical) Feature Matching
This group studies non-linear geometric deviations caused by device deformation, unsynchronized motion, or differences in sensing principles. Topics include overcoming the radiometric differences between SAR and optical images, non-rigid point-set registration (e.g., CPD, RBF), and dynamic trajectory registration for complex curved surfaces. A compact CPD-style registration sketch follows the list below.
- AlignTrack: Top-Down Spatiotemporal Resolution Alignment for RGB-Event Visual Tracking(Chuanyu Sun, Jiqing Zhang, Yang Wang, Yuanchen Wang, Yutong Jiang, Baocai Yin, Xin Yang, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- 3D non-rigid registration using color: Color Coherent Point Drift(Marcelo Saval-Calvo, J. A. López, Andrés Fuster Guilló, Victor Villena-Martinez, Robert B. Fisher, 2018, ArXiv)
- Weighted registration of multiple trajectories of dynamic objects for online calibration of MMW radar and camera(Gang Huang, Qinlei Dong, Huiling Cao, Zhenling Chen, Zhaozheng Hu, 2025, Measurement Science and Technology)
- Global 3D Non-Rigid Registration of Deformable Objects Using a Single RGB-D Camera(Jingyu Yang, D. Guo, Kun Li, Zhenchao Wu, Yu-Kun Lai, 2019, IEEE Transactions on Image Processing)
- Research on ship trajectory monitoring methods based on multisource data(Jixiang Que, 2026, No journal)
- Geo-Registration of Terrestrial LiDAR Point Clouds with Satellite Images without GNSS(Xinyu Wang, Muhammad Ibrahim, Haitian Wang, A. Mansoor, Ajmal S. Mian, 2025, ArXiv)
- Spatio-Temporal Non-Rigid Registration of 3D Point Clouds of Plants(Nived Chebrolu, T. Läbe, C. Stachniss, 2020, 2020 IEEE International Conference on Robotics and Automation (ICRA))
- Automatic Registration of Mini-Rf S-Band Level-1 Data(Zihan Xu, Fei Zhao, Pingping Lu, Yao Gao, Tingyu Meng, Yanan Dang, Mofei Li, 2024, IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium)
- OS3Flow: Optical and SAR Image Registration Using Symmetry-Guided Semi-Dense Optical Flow(Zixuan Sun, Shuaifeng Zhi, K. Huo, Xuecong Liu, Weidong Jiang, Yongxiang Liu, 2024, IEEE Geoscience and Remote Sensing Letters)
- A Novel MSPA-OS Method for Robust and Fast Optical-to-SAR Image Registration(Shuangtian Ye, Jing Liu, Shuncheng Tan, Yanheng Ma, Jialiang Wei, Qianchao He, Wei Jia, 2025, IEEE Geoscience and Remote Sensing Letters)
- DAS-Net: A Dual-Branch Structure-Aware Network for SAR–Optical Image Registration in Agricultural and Natural Scenes(Qi Kang, Jixian Zhang, Guoman Huang, Ruyi Wang, 2025, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing)
- Optical and SAR Image Registration Based on Deep and Shallow Feature Concatenation(Minghao Wang, Boli Xiong, Gangyao Kuang, 2025, 2025 International Conference on Microwave and Millimeter Wave Technology (ICMMT))
- Non-rigid infrared and visible image registration by enhanced affine transformation(Chaobo Min, Yan Gu, Yingjie Li, Feng Yang, 2020, Pattern Recognit.)
- Generalized Non-rigid Point Set Registration with Hybrid Mixture Models Considering Anisotropic Positional Uncertainties(Z. Min, Li Liu, M. Meng, 2019, No journal)
- 3D Non-rigid Registration of Deformable Object Using GPU(Junesuk Lee, Eung-Su Kim, Soon-Yong Park, 2019, No journal)
- Global as-Conformal-as-Possible Non-Rigid Registration of Multi-view Scans(Zhenchao Wu, Kun Li, Yu-Kun Lai, Jingyu Yang, 2019, 2019 IEEE International Conference on Multimedia and Expo (ICME))
- PLISA: An Optical–SAR Remote Sensing Image Registration Method Based on Pseudo-Label Learning and Interactive Spatial Attention(Yixuan Zhang, Ruiqi Liu, Zeyu Zhang, Limin Shi, Lubin Weng, Lei Hu, 2025, Remote Sensing)
- Learning-Based SAR–Optical Registration for Navigation: Insights From a Multiyear, Multiseason Continental-Scale Dataset(Simon Bertrand, Guillaume Bourmaud, C. Vacar, Lionel Bombrun, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- Templateless Non-Rigid Reconstruction and Motion Tracking With a Single RGB-D Camera(Kangkan Wang, Guofeng Zhang, Shi-hong Xia, 2017, IEEE Transactions on Image Processing)
- Online Global Non‐rigid Registration for 3D Object Reconstruction Using Consumer‐level Depth Cameras(Jiamin Xu, Weiwei Xu, Y. Yang, Z. Deng, H. Bao, 2018, Computer Graphics Forum)
- Adaptive enhanced affine transformation for non-rigid registration of visible and infrared images(Chaobo Min, Yan Gu, Yingjie Li, Feng Yang, 2020, IET Image Process.)
- Feature Registration and Fusion of SAR and Optical Images Based on Deep Learning(Xin Meng, Sai Wang, Zhuang Liu, 2025, 2025 4th International Conference on Cloud Computing, Big Data Application and Software Engineering (CBASE))
- A Multilevel Point-Matching Algorithm Based on Hierarchical Feature Detection and Description for SAR-to-Optical Image Registration(Zhixin Lian, Shiyang Tang, Jiahao Han, Yue Wu, Mingjin Zhang, Zhanye Chen, Linrang Zhang, 2025, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing)
- Fast and Robust Optical-to-SAR Remote Sensing Image Registration Using Region-Aware Phase Descriptor(Yibin Ye, Qinwei Wang, Hong Zhao, Xichao Teng, Yijie Bian, Zhang Li, 2024, IEEE Transactions on Geoscience and Remote Sensing)
- Cosine Similarity Template Matching Networks for Optical and SAR Image Registration(Wenxuan Xiong, Mingyu Sun, Hua Du, Bangshu Xiong, Congxuan Zhang, Qiaofeng Ou, Zhibo Rao, 2025, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing)
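As an illustration of the non-rigid point-set registration tools named above, here is a compact, simplified Coherent Point Drift (CPD) iteration in NumPy. It follows the standard Gaussian-mixture formulation (soft correspondences plus a smooth displacement field on a Gaussian kernel); the parameter values and the synthetic deformation in the demo are illustrative, and this sketch omits the low-rank and fast-Gauss-transform accelerations used in practice.

```python
import numpy as np

def gaussian_kernel(Y, beta):
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * beta ** 2))

def cpd_nonrigid(X, Y, beta=2.0, lam=2.0, w=0.1, n_iter=50):
    """Simplified non-rigid Coherent Point Drift.

    X : (N, D) fixed point set, Y : (M, D) moving point set.
    Returns T, the deformed copy of Y registered onto X.
    """
    N, D = X.shape
    M = Y.shape[0]
    G = gaussian_kernel(Y, beta)                      # smoothness kernel on Y
    T = Y.copy()
    sigma2 = np.sum((X[None, :, :] - Y[:, None, :]) ** 2) / (D * M * N)
    for _ in range(n_iter):
        # E-step: soft correspondences between deformed points and X
        d2 = np.sum((X[None, :, :] - T[:, None, :]) ** 2, axis=-1)    # (M, N)
        num = np.exp(-d2 / (2.0 * sigma2))
        c = (2.0 * np.pi * sigma2) ** (D / 2.0) * w / (1.0 - w) * M / N
        P = num / (num.sum(axis=0, keepdims=True) + c)
        # M-step: smooth displacement field W over the kernel G
        dP1 = P.sum(axis=1)                                           # (M,)
        A = G * dP1[:, None] + lam * sigma2 * np.eye(M)
        B = P @ X - dP1[:, None] * Y
        W = np.linalg.solve(A, B)
        T = Y + G @ W
        # update the mixture variance from the weighted residuals
        Np = P.sum()
        xPx = np.sum(P.sum(axis=0) * np.sum(X ** 2, axis=1))
        trPXT = np.sum((P @ X) * T)
        tPt = np.sum(dP1 * np.sum(T ** 2, axis=1))
        sigma2 = max((xPx - 2.0 * trPXT + tPt) / (Np * D), 1e-10)
    return T

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Y = rng.uniform(-1.0, 1.0, size=(60, 2))                # moving set
    X = Y + 0.1 * np.sin(2.0 * Y[:, ::-1])                  # smoothly deformed copy
    T = cpd_nonrigid(X, Y)
    print("mean residual after CPD:", np.linalg.norm(T - X, axis=1).mean())
```

RBF-based variants referenced in this group follow the same fit-then-warp pattern with a different (interpolating) parameterization of the displacement field.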
Deep Feature Fusion with BEV and Transformer Architectures
These works align camera semantics with radar depth/velocity information at the feature level through a unified bird's-eye-view (BEV) representation or Transformer attention. The focus is on resolving cross-modal conflicts, compensating depth, and performing 3D object detection and tracking in autonomous driving scenarios. A minimal cross-attention fusion sketch follows the list below.
- RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network(Zhiwei Lin, Zhe Liu, Yongtao Wang, Le Zhang, Ce Zhu, 2024, ArXiv)
- MWRC3D: 3D Object Detection with Millimeter-Wave Radar and Camera Fusion(Ren Wang, Ningyun Lu, 2024, 2024 7th International Conference on Advanced Algorithms and Control Engineering (ICAACE))
- MMCRF: multi-modal camera-radar fusion for 3D object detection via polar-coordinate feature alignment(Tiezhen Jiang, Runjie Kang, Qingzhu Li, 2025, Engineering Research Express)
- Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection(Jiahui Fu, Chen Gao, Zitian Wang, Lirong Yang, Xiaofei Wang, Beipeng Mu, Si Liu, 2024, 2024 IEEE International Conference on Robotics and Automation (ICRA))
- MSSF: A 4D Radar and Camera Fusion Framework With Multi-Stage Sampling for 3D Object Detection in Autonomous Driving(Hongsi Liu, Jun Liu, Guangfeng Jiang, Xin Jin, 2024, IEEE Transactions on Intelligent Transportation Systems)
- SGDet3D: Semantics and Geometry Fusion for 3D Object Detection Using 4D Radar and Camera(Xiaokai Bai, Zhu Yu, Lianqing Zheng, Xiaohan Zhang, Zili Zhou, Xue Zhang, Fang Wang, Jie Bai, Hui-Liang Shen, 2025, IEEE Robotics and Automation Letters)
- MVFusion: Multi-View 3D Object Detection with Semantic-aligned Radar and Camera Fusion(Zizhang Wu, Gui Chen, Yuanzhu Gan, Lei Wang, Jian Pu, 2023, 2023 IEEE International Conference on Robotics and Automation (ICRA))
- BEVCar: Camera-Radar Fusion for BEV Map and Object Segmentation(Jonas Schramm, Niclas Vödisch, Kürsat Petek, Ravi Kiran, S. Yogamani, Wolfram Burgard, Abhinav Valada, 2024, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- CRN: Camera Radar Net for Accurate, Robust, Efficient 3D Perception(Youngseok Kim, Sanmin Kim, J. Shin, Junwon Choi, Dongsuk Kum, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection(Philipp Wolters, Johannes Gilg, Torben Teepe, Fabian Herzog, F. Fent, Gerhard Rigoll, 2024, ArXiv)
- TransCAR: Transformer-Based Camera-and-Radar Fusion for 3D Object Detection(Su Pang, Daniel Morris, H. Radha, 2023, 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- DEN: Depth Enhancement Network for 3-D Object Detection With the Fusion of mmWave Radar and Vision in Autonomous Driving(Wenxiang Wang, Jianping Han, Zhongmin Jiang, Zhiyuan Zhou, Yingxiao Wu, 2025, IEEE Internet of Things Journal)
- BiAdaFusion: Bi-directionally Adaptive Radar-Camera Fusion for Enhancing the 3D Object Detector Performances On Static Objects(Liye Jia, Fengyufan Yang, Xinyue Zhang, Ka Lok Man, Jeremy S. Smith, Sheng Xu, Young-Ae Jung, Yutao Yue, 2025, 2025 International Conference on Platform Technology and Service (PlatCon))
- RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird's Eye View for 3D Object Detection(Jingtong Yue, Zhiwei Lin, Xin Lin, Xiaoyu Zhou, Xiangtai Li, Lu Qi, Yongtao Wang, Ming-Hsuan Yang, 2025, ArXiv)
- Camera-Radar Fusion With Feature Alignment: Adding Camera Texture to Radar(Xinbei Jian, Xiaoming Gao, Wanli Dong, Zhilei Zhu, 2024, 2024 17th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI))
- CRKD: Enhanced Camera-Radar Object Detection with Cross-Modality Knowledge Distillation(Lingjun Zhao, Jingyu Song, Katherine A. Skinner, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Enhancing Camera-4D mmRadar Fusion with Voxelized Modulation and Deformable Cross-Attention(Cheng-Wei Chen, Sin-Ye Jhong, Wei-Jie Chang, Hsin-Chun Lin, Cheng-Cheng Huang, Yul-Lung Chang, Yung-Yao Chen, 2025, 2025 International Automatic Control Conference (CACS))
- Spatial Attention Fusion for Obstacle Detection Using MmWave Radar and Vision Sensor(Shuo Chang, Yifan Zhang, Fan Zhang, Xiaotong Zhao, Sai Huang, Zhiyong Feng, Zhiqing Wei, 2020, Sensors (Basel, Switzerland))
- Zfusion: an Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving(Sheng Yang, Tong Zhan, Shichen Qiao, Jicheng Gong, Qing Yang, Jian Wang, Yanfeng Lu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- MLF-4DRCNet: Multi-Level Fusion with 4D Radar and Camera for 3D Object Detection in Autonomous Driving(Yuzhi Wu, Li Xiao, Jun Liu, Guangfeng Jiang, Xianggen Xia, 2025, ArXiv)
- DiffRCF: Diffusion Model for Robust 3D Object Detection with Radar-Camera Fusion(Rong Liu, Jiayin Deng, Boning Zhu, Zhiqun Hu, Zhaoming Lu, 2025, 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC))
- RCDFNet: A 4-D Radar and Camera Dual-Level Fusion Network for 3-D Object Detection(Peifeng Cheng, Hang Yan, Yukang Wang, Luping Wang, 2025, IEEE Sensors Journal)
- Bidirectional Radar–Camera Fusion With Dual Cross-Attention for 3-D Object Detection(Wei He, Zhenmiao Deng, Ping-ping Pan, Y. Ye, 2025, IEEE Sensors Journal)
- Adaptive Cross-Attention Gated Network for Radar-Camera Fusion in BEV Space(Ji-Yong Lee, Jae-Hyeok Lee, Dong-oh Kang, 2025, 2025 27th International Conference on Advanced Communications Technology (ICACT))
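The papers in this group differ in detail, but many share a pattern in which camera BEV features act as queries and radar BEV features as keys/values under (often deformable) attention. The sketch below shows that pattern with plain dense multi-head attention for readability; the module, tensor shapes, and channel sizes are illustrative and not drawn from any specific codebase, and real systems typically use deformable or sparse attention to keep the cost of large BEV grids tractable.

```python
import torch
import torch.nn as nn

class RadarCameraBEVFusion(nn.Module):
    """Toy cross-attention fusion of camera and radar BEV feature maps.

    Camera BEV cells act as queries and radar BEV cells as keys/values, so
    radar depth/velocity evidence is injected where the image is ambiguous.
    """
    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.out = nn.Sequential(nn.Linear(2 * channels, channels), nn.ReLU())

    def forward(self, cam_bev: torch.Tensor, radar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev, radar_bev: (B, C, H, W) feature maps on the same BEV grid
        B, C, H, W = cam_bev.shape
        q = cam_bev.flatten(2).transpose(1, 2)       # (B, H*W, C) camera queries
        kv = radar_bev.flatten(2).transpose(1, 2)    # (B, H*W, C) radar keys/values
        attended, _ = self.attn(q, kv, kv)           # camera attends to radar
        fused = self.out(torch.cat([self.norm(q + attended), kv], dim=-1))
        return fused.transpose(1, 2).reshape(B, C, H, W)

if __name__ == "__main__":
    fuser = RadarCameraBEVFusion(channels=64)
    cam = torch.randn(2, 64, 16, 16)     # e.g. lifted multi-view image features
    radar = torch.randn(2, 64, 16, 16)   # e.g. pillarized radar point features
    print(fuser(cam, radar).shape)       # torch.Size([2, 64, 16, 16])
```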
Semantic Guidance, Knowledge Distillation, and Cross-Modal Consistency Enhancement
These works use high-level semantic priors (e.g., segmentation maps) to guide registration, or strengthen cross-modal feature consistency via knowledge distillation and contrastive learning; they also include robustness studies on radar multipath interference and outlier-point classification. A contrastive alignment loss sketch follows the list below.
- SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning(Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax, 2025, ArXiv)
- Cross-Modal Supervision-Based Multitask Learning With Automotive Radar Raw Data(Yi Jin, Anastasios Deligiannis, Juan-Carlos Fuentes-Michel, M. Vossiek, 2023, IEEE Transactions on Intelligent Vehicles)
- RF-Pose Estimation based on Contrastive Camera-Radar-Images Pretraining(Yen-Hsiang Tseng, Po-Hsuan Tseng, 2025, ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Adaptive Cross-Modal Denoising: Enhancing LiDAR–Camera Fusion Perception in Adverse Circumstances(Muhammad Arslan Ghaffar, Kangshuai Zhang, Nuo Pan, Lei Peng, 2026, Sensors (Basel, Switzerland))
- RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion(Geonho Bang, Minjae Seong, Jisong Kim, Geunju Baek, Daye Oh, Junhyun Kim, Junho Koh, J. Choi, 2025, ArXiv)
- Revisiting Radar Camera Alignment by Contrastive Learning for 3D Object Detection(Linhua Kong, Dongxia Chang, Liang Liu, Zisen Kong, Pengyuan Li, Yao Zhao, 2025, ArXiv)
- Deep Learning-based Anomaly Detection in Radar Data with Radar-Camera Fusion(Dian Ning, Dong Seog Han, 2023, 2023 28th Asia Pacific Conference on Communications (APCC))
- Non-parametric consistency test for multiple-sensing-modality data fusion(M. P. Gerardo-Castro, T. Peynot, F. Ramos, R. Fitch, 2015, 2015 18th International Conference on Information Fusion (Fusion))
- Radar Signal Abnormal Point Classification based on Camera-Radar Sensor Fusion(Hyojeong Seo, Dong Seog Han, 2023, 2023 International Conference on Artificial Intelligence in Information and Communication (ICAIIC))
- A Cross-Modal Attention-Driven Multi-Sensor Fusion Method for Semantic Segmentation of Point Clouds(Hui-sheng Shi, Xin Wang, Jianghong Zhao, Xinnan Hua, 2025, Sensors (Basel, Switzerland))
- RaSS: 4D mm-Wave Radar Point Cloud Semantic Segmentation with Cross-Modal Knowledge Distillation(Chenwei Zhang, Zhiyu Xiang, Ruoyu Xu, Hangguan Shan, Xijun Zhao, Ruina Dang, 2025, Sensors (Basel, Switzerland))
- Depth Estimation Based on MMwave Radar and Camera Fusion with Attention Mechanisms and Multi-Scale Features for Autonomous Driving Vehicles(Zhaohuan Zhu, Feng Wu, Wenqing Sun, Quanying Wu, Feng Liang, Wuhan Zhang, 2025, Electronics)
- Adaptive Multi-source Signal Fusion Algorithm for Enhanced Digital Signal Processing in Heterogeneous Systems(Shangyang Xu, Dingyi Chen, 2025, International Journal of Pattern Recognition and Artificial Intelligence)
- RODNet: Radar Object Detection using Cross-Modal Supervision(Yizhou Wang, Zhongyu Jiang, Xiangyu Gao, Jenq-Neng Hwang, Guanbin Xing, Hui Liu, 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV))
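A common building block in the contrastive alignment papers above is a symmetric InfoNCE loss over paired radar and camera embeddings. The following is a minimal sketch under the assumption that row i of each batch holds a positive pair; the function name, temperature, and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(radar_feat: torch.Tensor,
                        camera_feat: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling paired radar/camera embeddings together.

    radar_feat, camera_feat: (B, D) embeddings where row i of both tensors
    describes the same object/region (a positive pair); all other rows in the
    batch are treated as negatives.
    """
    r = F.normalize(radar_feat, dim=-1)
    c = F.normalize(camera_feat, dim=-1)
    logits = r @ c.t() / temperature              # (B, B) cosine-similarity logits
    targets = torch.arange(r.size(0), device=r.device)
    # radar->camera and camera->radar directions, averaged
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    radar = torch.randn(8, 128, requires_grad=True)
    camera = torch.randn(8, 128)
    loss = cross_modal_infonce(radar, camera)
    loss.backward()
    print(float(loss))
```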
Human Pose Recognition, Behavior Perception, and Health Monitoring
Fusion perception research dedicated to the human body: radar's penetration and vital-sign sensing is combined with the camera's visual detail to achieve contactless heart-rate monitoring, fall detection, multi-scale 3D pose estimation, and gait recognition. A simple radar vital-sign extraction sketch follows the list below.
- Walking Further: Semantic-Aware Multimodal Gait Recognition Under Long-Range Conditions(Zhiyang Lu, Wenzheng Jiang, Tian-Xiang Wu, Zhichao Wang, Changwang Zhang, Siqi Shen, Ming-Hsu Cheng, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark(Yuan-Hao Ho, Jen-Hao Cheng, Sheng-Yao Kuan, Zhongyu Jiang, Wenhao Chai, Hsiang-Wei Huang, Chih-Lung Lin, Jenq-Neng Hwang, 2024, No journal)
- Radar-Camera-Based Cross-Modal Bi-Contrastive Learning for Human Motion Recognition(Yuh-Shyan Chen, Kuang-Hung Cheng, 2024, 2024 IEEE Wireless Communications and Networking Conference (WCNC))
- Multimodal Remote Heart Rate Estimation via Spatio-Temporal Transformers and Adaptive Fusion(Hyunduk Kim, Sang-Heon Lee, Myoung-Kyu Sohn, Junkwang Kim, Hyeyoung Park, 2026, IEEE Transactions on Instrumentation and Measurement)
- Non-contact Robust Respiration Detection By Using Radar-Depth Camera Sensor Fusion(Heng Zhao, Xiaomeng Gao, Xiaonan Jiang, Hong Hong, X. Liu, 2020, 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC))
- Real-Time Driver State Detection Using mmWave Radar: A Spatiotemporal Fusion Network for Behavior Monitoring on Edge Platforms(Shih-Pang Tseng, Wun-Yang Wu, Jhing-Fa Wang, Dawei Tao, 2025, Electronics)
- M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction(Junqiao Fan, Yunjiao Zhou, Yizhuo Yang, Xinyuan Cui, Jiarui Zhang, Lihua Xie, Jianfei Yang, Chris Xiaoxuan Lu, Fangqiang Ding, 2025, ArXiv)
- End-to-End Target Liveness Detection via mmWave Radar and Vision Fusion for Autonomous Vehicles(Shuai Wang, Luoyu Mei, Zhimeng Yin, Hao Li, Ruofeng Liu, Wenchao Jiang, Chris Xiaoxuan Lu, 2023, ACM Transactions on Sensor Networks)
- Learning to Analyze Human Skeletal by Radar–Camera Supervision(Ziyi Jiang, Feng Ke, Wenyuan Kang, Yikui Zhai, Qian Zhang, X. Zhang, Xiangmin Xu, 2025, IEEE Transactions on Instrumentation and Measurement)
- VRFfall: Cross Vision-RF Fall Detection with Camera and mmWave Radar(Yanying Zhu, Haotian Song, Kaishun Wu, Min Sun, Li Zhou, 2024, 2024 IEEE 30th International Conference on Parallel and Distributed Systems (ICPADS))
- mmHPE: Robust Multiscale 3-D Human Pose Estimation Using a Single mmWave Radar(Yingxiao Wu, Zhongmin Jiang, Haocheng Ni, Changlin Mao, Zhiyuan Zhou, Wenxiang Wang, Jianping Han, 2025, IEEE Internet of Things Journal)
- MTGEA: A Multimodal Two-Stream GNN Framework for Efficient Point Cloud and Skeleton Data Alignment(Gawon Lee, Jihie Kim, 2023, Sensors (Basel, Switzerland))
- Contactless Monitoring Of Human Vitals: A Study With Simultaneous Measurements Using FMCW Radar And Thermal Camera(Soumitra Kundu, Gargi Panda, A. Routray, Rajlakshmi Guha, Pravansu Mohanty, 2023, 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC))
- Open-Set Occluded Person Identification With mmWave Radar(Tao Wang, Yang Zhao, Ming-Ching Chang, Jie Liu, 2025, IEEE Transactions on Mobile Computing)
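For the contactless vital-sign works above, the radar-side signal is typically the slow-time phase at the subject's range bin, whose dominant frequency within a physiological band gives the respiration or heart rate. The sketch below shows only that single step on a synthetic phase signal; the sampling rate, frequency bands, and chest-motion model are illustrative assumptions, and real pipelines add clutter removal, band-pass filtering, and motion compensation (often camera-assisted).

```python
import numpy as np

def dominant_rate_bpm(phase: np.ndarray, fs: float, band_hz: tuple) -> float:
    """Return the dominant frequency (beats/breaths per minute) of a radar
    phase signal within a physiological band.

    phase   : unwrapped phase (radians) at the target's range bin over time
    fs      : slow-time sampling rate in Hz (frames per second)
    band_hz : (low, high) band, e.g. (0.1, 0.5) for respiration,
              (0.8, 2.0) for heart rate
    """
    x = phase - np.mean(phase)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    peak = freqs[mask][np.argmax(spectrum[mask])]
    return 60.0 * peak

if __name__ == "__main__":
    fs, duration = 20.0, 30.0                       # 20 frames/s for 30 s
    t = np.arange(0, duration, 1.0 / fs)
    # synthetic chest motion: 0.25 Hz breathing + weaker 1.2 Hz heartbeat + noise
    phase = 1.0 * np.sin(2 * np.pi * 0.25 * t) + 0.1 * np.sin(2 * np.pi * 1.2 * t)
    phase += 0.02 * np.random.default_rng(2).standard_normal(t.size)
    print("respiration (per min):", dominant_rate_bpm(phase, fs, (0.1, 0.5)))
    print("heart rate  (per min):", dominant_rate_bpm(phase, fs, (0.8, 2.0)))
```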
Cooperative Perception Applications on Special Platforms and in Complex Environments
This group focuses on integrating registration techniques into non-traditional scenarios, such as unmanned surface vehicle (USV) navigation, unmanned aerial vehicle (UAV) inspection, robotic grasping, industrial vibration monitoring, and the detection of special targets (liquids, concealed objects).
- ASY-VRNet: Waterway Panoptic Driving Perception Model based on Asymmetric Fair Fusion of Vision and 4D mmWave Radar(Runwei Guan, Shanliang Yao, Xiaohui Zhu, K. Man, Yong Yue, Jeremy S. Smith, E. Lim, Yutao Yue, 2023, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Mask-VRDet: A robust riverway panoptic perception model based on dual graph fusion of vision and 4D mmWave radar(Runwei Guan, Shanliang Yao, Lulu Liu, Xiaohui Zhu, Ka Lok Man, Yong Yue, Jeremy S. Smith, Eng Gee Lim, Yutao Yue, 2023, Robotics Auton. Syst.)
- WaterVG: Waterway Visual Grounding Based on Text-Guided Vision and mmWave Radar(Runwei Guan, Liye Jia, Fengyufan Yang, Shanliang Yao, Erick Purwanto, Xiaohui Zhu, Eng Gee Lim, Jeremy S. Smith, Ka Lok Man, Yutao Yue, 2024, IEEE Transactions on Intelligent Transportation Systems)
- WMF-YOLOv7: A Water Surface Object Detection Algorithm Based on Radar Camera Fusion Improved YOLOv7(Yuanhui Wang, Jin Zhu, Shujie Sun, Kaiheng Dai, Alexander Inyutin, 2025, 2025 37th Chinese Control and Decision Conference (CCDC))
- Onboard Powerline Perception System for UAVs Using mmWave Radar and FPGA-Accelerated Vision(N. Malle, Frederik Falk Nyboe, E. Ebeid, 2022, IEEE Access)
- MCLiD: Multi-Target and Container-Independent Liquid Sensing via mmWave and Camera Fusion(Jiawen Gai, Cheng Peng, Zhekai Xu, Kaiyan Cui, Yiming Wang, Zhengxin Guo, F. Xiao, 2025, 2025 IEEE 31th International Conference on Parallel and Distributed Systems (ICPADS))
- FuseGrasp: Radar-Camera Fusion for Robotic Grasping of Transparent Objects(Hongyu Deng, Tianfan Xue, He Chen, 2025, IEEE Transactions on Mobile Computing)
- Enhancing Noncontact Vibration Monitoring With mmWave Radar and Camera Fusion(Yantao Han, Xiulong Liu, Hankai Liu, Xiaomin Zhou, Zhihua Yang, Xin Xie, Xinyu Tong, Keqiu Li, 2025, IEEE Internet of Things Journal)
- MMW-Carry: Enhancing Carry Object Detection Through Millimeter-Wave Radar–Camera Fusion(Xiangyu Gao, Youchen Luo, Ali Alansari, Yaping Sun, 2024, IEEE Sensors Journal)
- A high-accuracy hollowness inspection system with sensor fusion of ultra-wide-band radar and depth camera(Haoran Kang, Wentao Zhang, Yangtao Ge, Haiou Liao, Bangzhen Huang, Jing Wu, Rui-Jun Yan, I-Ming Chen, 2022, Robotica)
- Radar and Camera Fusion for Vacant Parking Space Detection(Bo-Xun Wu, Jiantao Lin, Hsien-Kai Kuo, Po-Yu Chen, Jiun-In Guo, 2022, 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS))
- NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar(Runwei Guan, Jianan Liu, Liye Jia, Haocheng Zhao, Shanliang Yao, Xiaohui Zhu, Ka Lok Man, Eng Gee Lim, Jeremy S. Smith, Yutao Yue, 2024, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- CMU-GPR Dataset: Ground Penetrating Radar Dataset for Robot Localization and Mapping(Alexander Baikovitz, Paloma Sodhi, M. Dille, M. Kaess, 2021, ArXiv)
Overall, the groups cover the full technical pipeline of camera-radar registration, from low-level geometric calibration (extrinsics, spatio-temporal synchronization), through mid-level non-rigid deformation correction (SAR-optical matching, non-rigid point-set registration), to high-level deep perception (BEV feature fusion, semantic enhancement). The research emphasis is shifting from early hand-crafted feature matching toward end-to-end, learning-based feature-level alignment, with particularly high application value in 3D perception for autonomous driving, biomedical sensing in complex environments, and special-purpose robotics.
A total of 157 related publications. Abstracts of selected entries are reproduced below.
We present a novel global non-rigid registration method for dynamic 3D objects. Our method allows objects to undergo large non-rigid deformations and achieves high-quality results even with substantial pose change or camera motion between views. In addition, our method does not require a template prior and uses less raw data than tracking-based methods since only a sparse set of scans is needed. We simultaneously compute the deformations of all the scans by optimizing a global alignment problem to avoid the well-known loop closure problem and use an as-rigid-as-possible constraint to eliminate the shrinkage problem of the deformed shapes, especially near open boundaries of scans. To cope with large-scale problems, we design a coarse-to-fine multi-resolution scheme, which also avoids the optimization being trapped into local minima. The proposed method is evaluated on public datasets and real datasets captured by an RGB-D sensor. The experimental results demonstrate that the proposed method obtains better results than several state-of-the-art methods.
Accurate point-wise velocity estimation in 3D is crucial for robot interaction with non-rigid, dynamic agents, such as humans, enabling robust performance in path planning, collision avoidance, and object manipulation in dynamic environments. To this end, this paper proposes a novel RADAR, LiDAR, and camera fusion pipeline for point-wise 3D velocity estimation named CaRLi-V. This pipeline leverages raw RADAR measurements to create a novel RADAR representation, the velocity cube, which densely represents radial velocities within the RADAR's field-of-view. By combining the velocity cube for radial velocity extraction, optical flow for tangential velocity estimation, and LiDAR for point-wise range measurements through a closed-form solution, our approach can produce 3D velocity estimates for a dense array of points. Developed as an open-source ROS2 package, CaRLi-V has been field-tested against a custom dataset and proven to produce low velocity error metrics relative to ground truth, enabling point-wise velocity estimation for robotic applications.
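As a worked illustration of the closed-form idea described in this abstract, the snippet below solves a 3x3 linear system that combines the two optical-flow equations of a pinhole camera with the radar radial-velocity constraint, given the point's 3D position (e.g., from LiDAR). This is a simplified single-point version with illustrative names and parameters in a shared sensor frame, not the CaRLi-V implementation.

```python
import numpy as np

def velocity_from_flow_radar_range(p_cam, flow_uv, v_radial, fx, fy):
    """Closed-form 3D velocity of a point from optical flow, radar radial
    speed, and a known 3D position (pinhole camera, shared frame assumed).

    p_cam    : (3,) point position [X, Y, Z] in the camera frame (from LiDAR)
    flow_uv  : (2,) pixel velocity [du/dt, dv/dt] from optical flow
    v_radial : radial speed measured by radar along the line of sight
    fx, fy   : camera focal lengths in pixels
    """
    X, Y, Z = p_cam
    rho = np.linalg.norm(p_cam)
    # Linear system A @ [Vx, Vy, Vz] = b: the first two rows come from
    # differentiating the pinhole projection, the last row is the radar
    # radial-velocity constraint.
    A = np.array([
        [fx / Z, 0.0, -fx * X / Z ** 2],
        [0.0, fy / Z, -fy * Y / Z ** 2],
        [X / rho, Y / rho, Z / rho],
    ])
    b = np.array([flow_uv[0], flow_uv[1], v_radial])
    return np.linalg.solve(A, b)

if __name__ == "__main__":
    fx = fy = 900.0
    p = np.array([2.0, -0.5, 15.0])
    v_true = np.array([1.0, 0.2, -3.0])                     # ground-truth velocity
    X, Y, Z = p
    flow = np.array([fx * (v_true[0] * Z - X * v_true[2]) / Z ** 2,
                     fy * (v_true[1] * Z - Y * v_true[2]) / Z ** 2])
    v_r = p @ v_true / np.linalg.norm(p)
    print(velocity_from_flow_radar_range(p, flow, v_r, fx, fy))  # ~ [1.0, 0.2, -3.0]
```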
Online Global Non‐rigid Registration for 3D Object Reconstruction Using Consumer‐level Depth Cameras
We investigate how to obtain high‐quality 360‐degree 3D reconstructions of small objects using consumer‐level depth cameras. For many homeware objects such as shoes and toys with dimensions around 0.06 – 0.4 meters, their whole projections, in the hand‐held scanning process, occupy fewer than 20% pixels of the camera's image. We observe that existing 3D reconstruction algorithms like KinectFusion and other similar methods often fail in such cases even under the close‐range depth setting. To achieve high‐quality 3D object reconstruction results at this scale, our algorithm relies on an online global non‐rigid registration, where embedded deformation graph is employed to handle the drifting of camera tracking and the possible nonlinear distortion in the captured depth data. We perform an automatic target object extraction from RGBD frames to remove the unrelated depth data so that the registration algorithm can focus on minimizing the geometric and photogrammetric distances of the RGBD data of target objects. Our algorithm is implemented using CUDA for a fast non‐rigid registration. The experimental results show that the proposed method can reconstruct high‐quality 3D shapes of various small objects with textures.
Analyzing sensor data of plants and monitoring plant performance is a central element in different agricultural robotics applications. In plant science, phenotyping refers to analyzing plant traits for monitoring growth, for describing plant properties, or characterizing the plant's overall performance. It plays a critical role in the agricultural tasks and in plant breeding. Recently, there is a rising interest in using 3D data obtained from laser scanners and 3D cameras to develop automated non-intrusive techniques for estimating plant traits. In this paper, we address the problem of registering 3D point clouds of the plants over time, which is a backbone of applications interested in tracking spatio-temporal traits of individual plants. Registering plants over time is challenging due to its changing topology, anisotropic growth, and non-rigid motion in between scans. We propose a novel approach that exploits the skeletal structure of the plant and determines correspondences over time and drives the registration process. Our approach explicitly accounts for the non-rigidity and the growth of the plant over time in the registration. We tested our approach on a challenging dataset acquired over the course of two weeks and successfully registered the 3D plant point clouds recorded with a laser scanner forming a basis for developing systems for automated temporal plant-trait analysis.
Non-rigid registration, performing well in all-weather and all-day/night conditions, directly determines the reliability of visible (VIS) and infrared (IR) image fusion. On account of non-planar scenes and differences between IR and VIS cameras, non-linear transformation models are more helpful for non-rigid image registration than the affine model. However, most non-linear models currently used in non-rigid registration are constructed from control points. Aiming at the issue that the adaptiveness and generalization of control-point-based models are limited, adaptive enhanced affine transformation (AEAT) is proposed for image registration, generalizing the affine model from the linear to the non-linear case. Firstly, a Gaussian-weighted shape context, measuring the structural similarity between multimodal images, is designed to extract putative matches from edge maps of IR and VIS images. Secondly, to implement global image registration, the optimal parameters of the AEAT model are estimated from the putative matches by a subsection optimization strategy. Experimental results show that this approach is robust across different registration tasks and outperforms several competitive methods in registration precision and speed.
In this paper, we present a novel framework for global non-rigid registration of multi-view scans captured using consumer-level depth cameras. In our method, all scans from different viewpoints are allowed to undergo large non-rigid deformations and finally fused into a complete high quality model. To avoid the well-known loop closure problem, we simultaneously optimize a global alignment problem instead of pairwise non-rigid registration in succession. We employ a joint point-to-point and point-to-plane positional constraint to reduce the influence of wrong correspondences, and incorporate an as-conformal-as-possible constraint to avoid mesh distortions during deformation. We also design a reweighting scheme on position and transformation to reduce registration errors. Experimental results on both public datasets and real scanned datasets demonstrate that our approach outperforms state-of-the-art methods through extensive quantitative and qualitative evaluations.
Object deformation has been intensely studied with computer vision techniques in recent years. A widely used technique is 3D non-rigid registration to estimate the transformation between two instances of a deforming structure. Despite many previous developments on this topic, it remains a challenging problem. In this paper we propose a novel approach to non-rigid registration combining two data spaces in order to robustly calculate the correspondences and transformation between two data sets. In particular, we use point color as well as 3D location as these are the common outputs of RGB-D cameras. We propose the Color Coherent Point Drift (CCPD) algorithm (an extension of the CPD method (Myronenko and Song, 2010)). Evaluation is performed using synthetic and real data. The synthetic data includes easy shapes that allow evaluation of the effect of noise, outliers and missing data. Moreover, an evaluation of realistic figures obtained using Blensor is carried out. Real data acquired using a general-purpose Primesense Carmine sensor is used to validate the CCPD for real shapes. For all tests, the proposed method is compared to the original CPD showing better results in registration accuracy in most cases.
Image registration is a prerequisite for infrared (IR) and visible (VIS) image fusion. In practical application, most scenes are not planar and there are significant differences between IR and VIS cameras. Therefore, for non-rigid IR and VIS image registration, non-linear transformation is more applicable than affine transformation. Typically, non-linear transformation is modeled with point features. However, this can degrade the generalization ability of the transformation model and increase computational complexity. Aiming at this problem, we propose an enhanced affine transformation (EAT) for non-rigid IR and VIS image registration. In this paper, image registration is transformed into point set registration and then the optimal EAT model constructed by global deformation is estimated from local features. At first, a Gaussian-fields-based objective function is established and simplified by using the potential correspondence between an image pair. With the combination of affine and polynomial transformation, the EAT model is then proposed to describe the regular pattern of non-rigid and global deformation between an image pair. Finally, a coarse-to-fine strategy based on the quasi-Newton method is designed and applied to determine the optimal transformation coefficients from edge point features of IR and VIS images, in order to accomplish non-rigid image registration. The qualitative and quantitative comparisons on synthesized point sets and real images demonstrate that the proposed method is superior to state-of-the-art methods in the accuracy and efficiency of image registration.
The registration and mapping of Miniature Radio Frequency (Mini-RF) images from the Lunar Reconnaissance Orbiter (LRO) and the derived data products have been a problem for lunar remote sensing analysis and multi-source data fusion. In this context, we propose an automated registration methodology for Mini-RF S-band level-1 data. The method corrects offsets from synthetic aperture radar (SAR) imaging to match SAR data with optical and DEM data in the map projection. It can produce maps in polar and non-polar areas with precision comparable to manually registered maps. Using manually labeled crater features for evaluation, the processed maps match LRO Wide Angle Camera (WAC) and Digital Elevation Model (DEM) data at the 100 m level.
Accurate geo-registration of LiDAR point clouds remains a significant challenge in urban environments where Global Navigation Satellite System (GNSS) signals are denied or degraded. Existing methods typically rely on real-time GNSS and Inertial Measurement Unit (IMU) data, which require pre-calibration and assume stable signals. However, this assumption often fails in dense cities, resulting in localization errors. To address this, we propose a structured post-hoc geo-registration method that accurately aligns LiDAR point clouds with satellite images. The proposed approach targets point cloud datasets where reliable GNSS information is unavailable or degraded, enabling city-scale geo-registration as a post-processing solution. Our method uses a pre-trained Point Transformer to segment road points, then extracts road skeletons and intersections from the point cloud and the satellite image. Global alignment is achieved through rigid transformation using corresponding intersection points, followed by local non-rigid refinement with radial basis function (RBF) interpolation. Elevation discrepancies are corrected using terrain data from the Shuttle Radar Topography Mission (SRTM). To evaluate geo-registration accuracy, we measure the absolute distances between the roads extracted from the two modalities. Our method is validated on the KITTI benchmark and a newly collected dataset of Perth, Western Australia. On KITTI, our method achieves a mean planimetric alignment error of 0.69m, corresponding to a 50% reduction in global geo-registration bias compared to the raw KITTI annotations. On Perth dataset, it achieves a mean planimetric error of 2.17m from GNSS values extracted from Google Maps, corresponding to 57.4% improvement over rigid alignment. Elevation correlation factor improved by 30.5% (KITTI) and 55.8% (Perth).
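The local refinement step described above (rigid alignment at road intersections followed by non-rigid correction with RBF interpolation) can be sketched as a Gaussian-RBF displacement field fitted at control points and then applied to the whole cloud. The kernel width, regularization, and synthetic residual drift below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def fit_rbf_warp(ctrl_src, ctrl_dst, eps=50.0, reg=1e-6):
    """Fit a Gaussian-RBF displacement field mapping control points ctrl_src
    (e.g. road intersections in the point cloud, after rigid alignment) onto
    their counterparts ctrl_dst (intersections in the satellite-image frame)."""
    d2 = np.sum((ctrl_src[:, None, :] - ctrl_src[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-d2 / (2.0 * eps ** 2))
    weights = np.linalg.solve(Phi + reg * np.eye(len(ctrl_src)), ctrl_dst - ctrl_src)
    return ctrl_src, weights, eps

def apply_rbf_warp(points, ctrl_src, weights, eps):
    """Apply the fitted displacement field to an arbitrary set of points."""
    d2 = np.sum((points[:, None, :] - ctrl_src[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-d2 / (2.0 * eps ** 2))
    return points + Phi @ weights

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    ctrl_src = rng.uniform(0, 500, size=(12, 2))            # intersections (metres)
    ctrl_dst = ctrl_src + 3.0 * np.sin(ctrl_src / 120.0)    # smooth residual drift
    src, w, eps = fit_rbf_warp(ctrl_src, ctrl_dst)
    cloud_xy = rng.uniform(0, 500, size=(1000, 2))          # whole cloud (x, y only)
    warped = apply_rbf_warp(cloud_xy, src, w, eps)
    print("max control-point residual:",
          np.abs(apply_rbf_warp(ctrl_src, src, w, eps) - ctrl_dst).max())
```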
Accurate spatiotemporal registration is essential for effective sensor fusion. However, achieving precise time synchronization at the hardware level is costly. This paper proposes a feature point matching-based temporal registration method to estimate the system's time offset, which is suitable for low-cost radar-vision fusion devices, such as radar-camera systems used in civilian applications. Feature point matching is the key approach for achieving temporal registration between radar and video data. It works by identifying common feature points between radar data and video images to align the timestamps. This method integrates the radar's target detection capability with the imaging ability of visual sensors. It primarily relies on the maximum doppler frequency shift in radar data as feature points, and the pixel peak of the target motion curve in the video image as the corresponding feature points. By matching these feature points, the temporal offset is calibrated. Experimental results show that the estimated time offset is very close to the ground truth, demonstrating that the feature point matching-based method can accurately estimate the time offset and achieve temporal registration between radar and video data.
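The temporal registration idea above reduces to aligning two one-dimensional time series that describe the same target motion, one derived from radar and one from the image. A minimal version of that alignment, using normalized cross-correlation to recover a constant time offset, is sketched below; the signals, sampling rate, and offset are synthetic and illustrative.

```python
import numpy as np

def estimate_time_offset(radar_sig, video_sig, fs):
    """Estimate the time offset between two signals describing the same target
    motion (e.g. radar Doppler magnitude vs. image-space target speed), both
    sampled at rate fs, via normalized cross-correlation."""
    r = (radar_sig - radar_sig.mean()) / (radar_sig.std() + 1e-12)
    v = (video_sig - video_sig.mean()) / (video_sig.std() + 1e-12)
    corr = np.correlate(r, v, mode="full")
    lag = np.argmax(corr) - (len(v) - 1)     # samples by which radar lags video
    return lag / fs

if __name__ == "__main__":
    fs, true_offset = 30.0, 0.4                            # 30 Hz, radar 0.4 s late
    t = np.arange(0, 20, 1.0 / fs)
    motion = np.abs(np.sin(0.7 * t)) * np.exp(-0.05 * t)   # shared target-motion cue
    video_sig = motion
    radar_sig = np.interp(t - true_offset, t, motion)      # delayed copy seen by radar
    print("estimated offset (s):", estimate_time_offset(radar_sig, video_sig, fs))
```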
Multi-sensor fusion can improve the accuracy and reliability of information perception and is widely used in autonomous driving, smart AGVs, drones, and other intelligent agents. Considering overall efficiency and economy, the fusion of millimeter wave (MMW) radar and camera with complementary advantages is a potential solution in autonomous driving perception. The calibration flexibility of the two sensors determines the usability of the perception system. Unlike traditional offline calibration, dynamic calibration allows for flexible calibration using dynamic targets found in the environment. It can be applied when cumbersome offline calibration cannot be performed or specific calibration templates are unavailable. This paper proposes dynamic calibration of MMW radar and camera by fusing trajectory constraints and multi-target radar cross sections (RCSs). On the one hand, trajectory constraints can preliminarily achieve the calibration of external parameters. On the other hand, introducing RCS can enhance the accurate tracking of target trajectories, enabling automated trajectory generation. By assigning higher calibration weights to trajectories with larger RCS values, both the calibration accuracy and the robustness of the entire perception system can be further improved. The selection of multiple targets ensures the wide applicability of the calibration results. RCS information reflects the signal-to-noise ratio of the target. Compared with existing traditional methods, this algorithm does not require dedicated calibration equipment. As targets are continuously acquired, more accurate calibration results can be iteratively refined. The algorithm offers high computational efficiency and good performance. Additionally, when sensors become loose while the vehicle is in operation, this algorithm can update the calibration results in conjunction with the system's miscalibration detection mechanism, ensuring the long-term reliability of the entire system. Simulations and real-world experiments demonstrate that this method significantly enhances calibration accuracy over traditional methods, validating its effectiveness. It achieves accurate, flexible, and rapid calibration without requiring specific templates.
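Trajectory-based calibration of this kind ultimately needs a weighted rigid alignment between paired trajectory points observed by the two sensors. The snippet below shows a weighted Kabsch solution of that step, with weights standing in for the RCS-derived trajectory weights; the weighting scheme, noise level, and data are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def weighted_rigid_align(src, dst, weights):
    """Weighted Kabsch: find R, t minimizing sum_i w_i ||R @ src_i + t - dst_i||^2.

    src, dst : (N, 3) paired trajectory points in the radar and camera frames
    weights  : (N,) per-point weights, e.g. larger for high-RCS detections
    """
    w = weights / weights.sum()
    mu_s = w @ src
    mu_d = w @ dst
    S = (src - mu_s).T @ ((dst - mu_d) * w[:, None])   # weighted cross-covariance
    U, _, Vt = np.linalg.svd(S)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    src = rng.uniform(-10, 10, size=(100, 3))                 # radar-frame trajectory
    angle = np.deg2rad(5.0)
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                       [np.sin(angle),  np.cos(angle), 0.0],
                       [0.0, 0.0, 1.0]])
    t_true = np.array([0.3, -0.1, 0.8])
    dst = src @ R_true.T + t_true + 0.02 * rng.standard_normal(src.shape)
    rcs = rng.uniform(1.0, 20.0, size=100)                    # stand-in RCS weights
    R, t = weighted_rigid_align(src, dst, rcs)
    print("rotation error:", np.linalg.norm(R - R_true),
          "translation error:", np.linalg.norm(t - t_true))
```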
While recent low-cost radar-camera approaches have shown promising results in multi-modal 3D object detection, both sensors face challenges from environmental and intrinsic disturbances. Poor lighting or adverse weather conditions degrade camera performance, while radar suffers from noise and positional ambiguity. Achieving robust radar-camera 3D object detection requires consistent performance across varying conditions, a topic that has not yet been fully explored. In this work, we first conduct a systematic analysis of robustness in radar-camera detection on five kinds of noises and propose RobuRCDet, a robust object detection model in BEV. Specifically, we design a 3D Gaussian Expansion (3DGE) module to mitigate inaccuracies in radar points, including position, Radar Cross-Section (RCS), and velocity. The 3DGE uses RCS and velocity priors to generate a deformable kernel map and variance for kernel size adjustment and value distribution. Additionally, we introduce a weather-adaptive fusion module, which adaptively fuses radar and camera features based on camera signal confidence. Extensive experiments on the popular benchmark, nuScenes, show that our model achieves competitive results in regular and noisy conditions.
Radar can enhance target sensing capability after fusion with visible light to achieve all-weather target detection and identification due to lower requirements for weather and light conditions. However, the mainstream radar and camera fusion methods now use decision-level fusion, which fuses the separately processed radar and image data detection results, and fails to take full advantage of the camera's semantic richness and radar's accurate detection distance. Based on this basic observation, we propose a novel feature-level fusion method, which first optimizes for the camera and radar feature misalignment problem by using a deformable attention mechanism to guide the camera features to offset to the corresponding radar positions and then integrates the optimized camera information into two consecutive cross-attention layers, which incorporate the camera and radar features in turn, exploiting the spatial and contextual relationships to achieve stable and efficient fusion. Extensive experimental results on the popular RADIATE dataset have shown the effectiveness of our method. Compared with the baselines, our method performs better under bad weather conditions. Moreover, the proposed method is robust against various real-world scenes such as rain, fog, and snow.
Autonomous driving requires an accurate and fast 3D perception system that includes 3D object detection, tracking, and segmentation. Although recent low-cost camera-based approaches have shown promising results, they are susceptible to poor illumination or bad weather conditions and have a large localization error. Hence, fusing camera with low-cost radar, which provides precise long-range measurement and operates reliably in all environments, is promising but has not yet been thoroughly investigated. In this paper, we propose Camera Radar Net (CRN), a novel camera-radar fusion framework that generates a semantically rich and spatially accurate bird’s-eye-view (BEV) feature map for various tasks. To overcome the lack of spatial information in an image, we transform perspective view image features to BEV with the help of sparse but accurate radar points. We further aggregate image and radar feature maps in BEV using multi-modal deformable attention designed to tackle the spatial misalignment between inputs. CRN with real-time setting operates at 20 FPS while achieving comparable performance to LiDAR detectors on nuScenes, and even outperforms at a far distance on 100m setting. Moreover, CRN with offline setting yields 62.4% NDS, 57.5% mAP on nuScenes test set and ranks first among all camera and camera-radar 3D object detectors.
Reliable 3D object perception is essential in autonomous driving. Owing to its sensing capabilities in all weather conditions, 4D radar has recently received much attention. However, compared to LiDAR, 4D radar provides a much sparser point cloud. In this paper, we propose a 3D object detection method, termed ZFusion, which fuses the 4D radar and vision modalities. As the core of ZFusion, our proposed FP-DDCA (Feature Pyramid-Double Deformable Cross Attention) fuser effectively complements the (sparse) radar information and (dense) vision information. Specifically, with a feature-pyramid structure, the FP-DDCA fuser packs Transformer blocks to interactively fuse multi-modal features at different scales, thus enhancing perception accuracy. In addition, we utilize the Depth-Context-Split view transformation module due to the physical properties of 4D radar. Considering that 4D radar has a much lower cost than LiDAR, ZFusion is an attractive alternative to LiDAR-based methods. In typical traffic scenarios like the VoD (View-of-Delft) dataset, experiments show that with reasonable inference speed, ZFusion achieved the state-of-the-art mAP (mean average precision) in the region of interest, while having competitive mAP in the entire area compared to the baseline methods, which demonstrates performance close to LiDAR and greatly outperforms camera-only methods.
The 4-D radar, as an advanced vehicle sensor, provides denser point clouds and elevation information than millimeter-wave (MMW) radar, establishing it as a valuable sensor for autonomous driving. Recently, the fusion of 4-D radar and camera has become a viable alternative to LiDAR for autonomous driving perception systems. However, existing fusion methods have not fully realized the potential of complementary advantages offered by these two modalities. To address this limitation, this study proposes RCDFNet, a dual-level fusion network for 3-D object detection with 4-D radar and camera. Specifically, we first exploit the geometric information from the 4-D radar to extract semantic image features from the perspective view (PV), yielding pseudo-camera features under the bird's-eye view (BEV). In the first fusion level, the pseudo-camera features are integrated with the 4-D radar features using the spatial point fusion (SPF) module to obtain the radar-guided fusion features. In the second fusion level, we employ the deformable attention (DA) mechanism to facilitate the interaction between camera-view-transform features and the radar-guided fusion features, to generate BEV-match fusion features. Furthermore, we propose the target-aware refinement (TAR) module to learn foreground targets' occupancy adaptively in the BEV, mitigating the impact of the background noise of the two sensors. We validate RCDFNet on the TJ4DRadSet and View-of-Delft (VoD) datasets. The experimental results demonstrate that RCDFNet significantly outperforms the baseline RCFusion, achieving state-of-the-art performance in 3-D object detection. Code and models are available at https://github.com/D-Hourse/RCDFNet/tree/master
Three-dimensional object detection is a critical task in autonomous driving. Although recent radar-camera fusion methods have achieved promising results in 3D detection, including under challenging conditions such as low illumination or adverse weather, they overlook the problem of sensor data loss and fail to fully exploit the correlations between different modal features. In this paper, we propose DiffRCF, a novel radar-camera fusion framework. Specifically, we leverage sparse yet accurate radar points to enhance perspective image features and transform them into bird's-eye-view (BEV) representations. DiffRCF integrates a modality-aware weighting mechanism to adaptively assess the importance of each modality under varying conditions and a conditional diffusion model to reconstruct missing information. Additionally, we employ deformable cross-attention and spatial attention mechanisms to better align and fuse multi-modal features. Experiments on the nuScenes dataset demonstrate that DiffRCF achieves state-of-the-art performance among single-frame radar-camera fusion methods and exhibits strong robustness against poor lighting and sensor degradation.
Fusing multimodal sensors for 3D object detection has been extensively researched in the field of autonomous driving. However, existing multimodal sensor fusion methods still struggle to provide reliable detection across different modalities under diverse environmental conditions. Specifically, straightforward methods like summation or concatenation in radar-camera fusion may lead to spatial misalignment and fail to localize objects in complex scenes. To address this, we propose Adaptive Cross-Attention Gated Network (ACAGN) to enhance radar-camera fusion capabilities in Bird's-Eye View (BEV) space. Our approach integrates a deformable cross-attention and an adaptive gated network mechanism. The deformable cross-attention aligns radar and camera features from BEV with greater spatial precision, handling variations between those features effectively. Meanwhile, the adaptive gated network dynamically filters and prioritizes the most relevant information from each sensor. This dual approach improves stability and robustness of detection, as demonstrated through extensive evaluations on the nuScenes dataset.
In recent years, many deep learning-based water surface object detection algorithms have been well applied in large-scale object recognition. However, in complex water surface environments, previous algorithms cannot achieve good performance due to the influence of darkness, rain and fog. To address the above problem, we propose a water surface object detection model, named WMF-YOLOv7, which is based on YOLOv7 with three modifications. Firstly, a novel radar image conversion method is proposed to extract effective features from the sparse point cloud of millimeter-wave radar; secondly, a deformable convolution-based radar feature extraction method is used to reduce the nuisance alarm rate of feature fusion caused by the misalignment of the optical image and radar image; finally, a data fusion module based on the normalised attention mechanism is used to effectively reduce the effect of noise in radar images through the sparse weight penalty mechanism. The experimental results show that the proposed method outperforms the benchmark algorithm in terms of recognition accuracy and robustness.
A Hybrid Model for Object Detection Based on Feature-Level Camera-Radar Fusion in Autonomous Driving
Preventing car collisions through object detection has always been a major research direction in the field of autonomous driving. In recent years, camera-based object detection technology has achieved great success. However, its performance is still insufficient under poor lighting or weather conditions. Therefore, the fusion of various sensor information has become a new trend in object detection for autonomous driving. This paper proposes a hybrid object detection model that fuses millimeter-wave radar and camera at the feature level. The model uses a traditional convolutional neural network to extract features from data collected by the radar and camera, and performs multi-scale deep fusion. Subsequently, a multi-scale deformable attention module is used to process the fused feature maps for object detection. We tested this model on the nuScenes autonomous driving dataset, which includes night and rainy scenes. The hybrid model achieved a mean average precision (mAP) of 47.8%, which is 1.4% higher than that of the baseline object detection model.
The number of nodes in sensor networks is continually increasing, and maintaining accurate track estimates inside their common surveillance region is a critical necessity. Modern sensor platforms are likely to carry a range of different sensor modalities, all providing data at differing rates, and with varying degrees of uncertainty. These factors complicate the fusion problem as multiple observation models are required, along with a dynamic prediction model. However, the problem is exacerbated when sensors are not registered correctly with respect to each other, i.e., if they are subject to a static or dynamic bias. In this case, measurements from different sensors may correspond to the same target, but do not correlate with each other when in the same Frame of Reference (FoR), which decreases track accuracy. This paper presents a method to jointly estimate the state of multiple targets in a surveillance region, and to correctly register a radar and an Infrared Search and Track (IRST) system onto the same FoR to perform sensor fusion. Previous work using this type of parent-offspring process has been successful when calibrating a pair of cameras, but has never been attempted on a heterogeneous sensor network, or in a maritime environment. This article presents results on both simulated scenarios and a segment of real data that show a significant increase in track quality in comparison to using incorrectly calibrated sensors or single-radar only.
Sensor fusion has become an active research field due to its numerous advantages, such as improved perception capabilities, enhanced environment understanding, and better object detection and tracking performance. However, existing multi-sensor fusion models primarily concentrate on the combination of LiDAR and camera and neglect the millimeter-wave radar (MWR), an inexpensive and promising sensor. To address this limitation, this paper integrates camera and MWR, and proposes a novel framework called MWRC3D to achieve low-cost and efficient 3D object detection. Specifically, we first propose an attention-based deep layer aggregation (ADLA) module for learning global associations and dependencies between individual pixels in an image, which improves the ability to characterize image features. Then, we introduce deformable convolutional networks (DCNs) to model geometric transformations, and an MWR data enhancement module is employed to correct the 3D offset of the radar point cloud from the image center point. Finally, we stitch and fuse the image features with the radar feature maps as input to the quadratic regression head to obtain accurate 3D object detection boxes. To evaluate the effectiveness of the proposed model, extensive experiments are conducted on the nuScenes dataset. The results demonstrate that our method outperforms all baselines while maintaining high efficiency.
Three-dimensional object detection is one of the key tasks in autonomous driving. To reduce costs in practice, low-cost multi-view cameras for 3D object detection are proposed to replace the expansive LiDAR sensors. However, relying solely on cameras is difficult to achieve highly accurate and robust 3D object detection. An effective solution to this issue is combining multi-view cameras with the economical millimeter-wave radar sensor to achieve more reliable multi-modal 3D object detection. In this paper, we introduce RCBEVDet, a radar-camera fusion 3D object detection method in the bird's eye view (BEV). Specifically, we first design RadarBEVNet for radar BEV feature extraction. RadarBEVNet consists of a dual-stream radar backbone and a Radar Cross-Section (RCS) aware BEV encoder. In the dual-stream radar backbone, a point-based encoder and a transformer-based encoder are proposed to extract radar features, with an injection and extraction module to facilitate communication between the two encoders. The RCS-aware BEV encoder takes RCS as the object size prior to scattering the point feature in BEV. Besides, we present the Cross-Attention Multi-layer Fusion module to automatically align the multi-modal BEV feature from radar and camera with the deformable attention mechanism, and then fuse the feature with channel and spatial fusion layers. Experimental results show that RCBEVDet achieves new state-of-the-art radar-camera fusion results on nuScenes and view-of-delft (VoD) 3D object detection benchmarks. Furthermore, RCBEVDet achieves better 3D detection results than all real-time camera-only and radar-camera 3D object detectors with a faster inference speed at 21∼28 FPS. The source code will be released at https://github.com/VDIGPKU/RCBEVDet.
The sparsity and uncertainty of millimeter-wave (MMW) radar point clouds make it difficult to extract features for matching with images, leading to a major challenge for targetless radar-camera extrinsic calibration. To address this problem, we propose a differentiable calibration method based on detection attributes, enabling targetless extrinsic calibration between MMW radar and camera. Our approach utilizes a three-branch neural network to extract cross-modality features between point clouds and images. By leveraging the dynamic properties and radar cross-section (RCS) values of detected objects, the sparse MMW radar point clouds are augmented, and by calculating the relevance between paired features, the impact of uncertainty in MMW radar detection is resolved. Subsequently, a differentiable probabilistic perspective-n-point (PnP) solver is employed to achieve end-to-end extrinsic parameter estimation without relying on initial extrinsic parameters or specific calibration targets. On the pose dataset constructed from the nuScenes dataset, the proposed method achieved a registration accuracy of 90.10%. Additionally, real-world experiments validate its precision and robustness in radar-camera calibration, demonstrating its effectiveness in targetless scenarios.
Place recognition is essential for achieving closed-loop or global positioning in autonomous vehicles and mobile robots. Despite recent advancements in place recognition using 2D cameras or 3D LiDAR, it remains to be seen how to use 4D radar, an increasingly popular sensor prized for its robustness against adverse weather and lighting conditions, for place recognition. Compared to LiDAR point clouds, radar data are drastically sparser, noisier, and of much lower resolution, which hampers their ability to effectively represent scenes and poses significant challenges for 4D radar-based place recognition. This work addresses these challenges by leveraging multimodal information from sequential 4D radar scans and effectively extracting and aggregating spatio-temporal features. Our approach follows a principled pipeline that comprises (1) dynamic point removal and ego-velocity estimation from the velocity property, (2) bird's eye view (BEV) feature encoding on the refined point cloud, (3) feature alignment using the BEV feature map motion trajectory calculated from the ego-velocity, and (4) extraction and aggregation of multi-scale spatio-temporal features from the aligned BEV feature maps. Real-world experimental results validate the feasibility of the proposed method and demonstrate its robustness in handling dynamic environments. Source codes are available.
The emerging 4D millimeter-wave radar, measuring the range, azimuth, elevation, and Doppler velocity of objects, is recognized for its cost-effectiveness and robustness in autonomous driving. Nevertheless, its point clouds exhibit significant sparsity and noise, restricting its standalone application in 3D object detection. Recent 4D radar-camera fusion methods have provided effective perception. Most existing approaches, however, adopt explicit Bird's-Eye-View fusion paradigms originally designed for LiDAR-camera fusion, neglecting radar's inherent drawbacks. Specifically, they overlook the sparse and incomplete geometry of radar point clouds and restrict fusion to coarse scene-level integration. To address these problems, we propose MLF-4DRCNet, a novel two-stage framework for 3D object detection via multi-level fusion of 4D radar and camera images. Our model incorporates the point-, scene-, and proposal-level multi-modal information, enabling comprehensive feature representation. It comprises three crucial components: the Enhanced Radar Point Encoder (ERPE) module, the Hierarchical Scene Fusion Pooling (HSFP) module, and the Proposal-Level Fusion Enhancement (PLFE) module. Operating at the point level, ERPE densifies radar point clouds with 2D image instances and encodes them into voxels via the proposed Triple-Attention Voxel Feature Encoder. HSFP dynamically integrates multi-scale voxel features with 2D image features using deformable attention to capture scene context and applies pooling to the fused features. PLFE refines region proposals by fusing image features, and further integrates with the pooled features from HSFP. Experimental results on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate that MLF-4DRCNet achieves the state-of-the-art performance. Notably, it attains performance comparable to LiDAR-based models on the VoD dataset.
Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi-modal sensors to collect comprehensive environmental data. Among these, the radar-camera multi-modal perception system is especially favored for its excellent sensing capabilities and cost-effectiveness. However, the substantial modality differences between radar and camera sensors pose challenges in fusing information. To address this problem, this paper presents RCBEVDet, a radar-camera fusion 3D object detection framework. Specifically, RCBEVDet is developed from an existing camera-based 3D object detector, supplemented by a specially designed radar feature extractor, RadarBEVNet, and a Cross-Attention Multi-layer Fusion (CAMF) module. Firstly, RadarBEVNet encodes sparse radar points into a dense bird's-eye-view (BEV) feature using a dual-stream radar backbone and a Radar Cross Section aware BEV encoder. Secondly, the CAMF module utilizes a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet's capabilities, we introduce RCBEVDet++, which advances the CAMF through sparse fusion, supports query-based multi-view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on nuScenes show that our method integrates seamlessly with existing camera-based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks. Notably, with ViT-L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test-time augmentation or model ensembling.
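To make the final fusion step above concrete, the sketch below shows one plausible form of "channel and spatial fusion layers" applied after the radar and camera BEV features have been aligned, following the common CBAM-style pattern. The deformable-attention alignment of CAMF itself is not shown, and the layer sizes are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of channel + spatial re-weighting for fusing aligned BEV features.
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    def __init__(self, radar_ch=64, cam_ch=64, out_ch=128, reduction=8):
        super().__init__()
        in_ch = radar_ch + cam_ch
        # Channel attention: squeeze global context, then re-weight channels.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, in_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch // reduction, in_ch, 1), nn.Sigmoid())
        # Spatial attention: a 7x7 conv over pooled maps produces a per-pixel gate.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.out_proj = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, radar_bev, cam_bev):
        x = torch.cat([radar_bev, cam_bev], dim=1)         # (B, Cr+Cc, H, W)
        x = x * self.channel_mlp(x)                          # channel re-weighting
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        x = x * self.spatial_conv(pooled)                    # spatial re-weighting
        return self.out_proj(x)

# Example: fuse 128x128 BEV grids from the two modalities.
fusion = ChannelSpatialFusion()
fused = fusion(torch.randn(2, 64, 128, 128), torch.randn(2, 64, 128, 128))
```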
As one of the automotive sensors that have emerged in recent years, 4D millimeter-wave radar has a higher resolution than conventional 3D radar and provides precise elevation measurements. However, its point clouds are still sparse and noisy, making it challenging to meet the requirements of autonomous driving. The camera, as another commonly used sensor, can capture rich semantic information. As a result, the fusion of 4D radar and camera can provide an affordable and robust perception solution for autonomous driving systems. However, previous radar-camera fusion methods have not yet been thoroughly investigated, resulting in a large performance gap compared to LiDAR-based methods. Specifically, they ignore the feature-blurring problem and do not deeply interact with image semantic information. To this end, we present a simple but effective multi-stage sampling fusion (MSSF) network based on 4D radar and camera. On the one hand, we design a fusion block that can deeply interact point cloud features with image features, and can be applied to commonly used single-modal backbones in a plug-and-play manner. The fusion block encompasses two types, namely, simple feature fusion (SFF) and multi-scale deformable feature fusion (MSDFF). The SFF is easy to implement, while the MSDFF has stronger fusion abilities. On the other hand, we propose a semantic-guided head to perform foreground-background segmentation on voxels with voxel feature re-weighting, further alleviating the problem of feature blurring. Extensive experiments on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate the effectiveness of our MSSF. Notably, compared to state-of-the-art methods, MSSF achieves a 7.0% and 4.0% improvement in 3D mean average precision on the VoD and TJ4DRadSet datasets, respectively. It even surpasses classical LiDAR-based methods on the VoD dataset.
With the rapid development of intelligent vehicle technology, 3D object detection and tracking play a crucial role in the field of autonomous driving. This article studies point cloud and image fusion 3D object detection and tracking algorithms for autonomous driving. LiDAR and surround-view cameras are used as perception sensors, and, based on deep learning theory and methods, the focus is on overcoming the difficulties of multi-modal data fusion for 3D detection and tracking under complex traffic conditions. This article proposes a series of innovative algorithms, including the MaskSensing algorithm based on image instance segmentation, the DeformFusion algorithm based on the Transformer architecture, the MixFusion algorithm with a hybrid fusion strategy, and the DeepTrack3D algorithm. These algorithms have achieved significant results on datasets such as nuScenes, effectively improving the accuracy and robustness of 3D object detection and tracking. In the future, further research is needed in areas such as temporal fusion, interactive fusion, and unsupervised learning to enhance the performance of autonomous driving technology.
Traffic information perception equipment provides basic data support for traffic situation prediction, signal control, and other applications of the traffic cyber-physical system (T-CPS). Building on research into the traffic cyber-physical fusion system and its data fusion technology route, this study analyzed the incomplete information perception and low accuracy of single sensors and proposed traffic information perception based on heterogeneous fusion of millimeter-wave radar and visual sensor data. It carried out targeted research on the key technologies of millimeter-wave radar and video spatiotemporal registration, completed the conversion of radar coordinate-system data into the video pixel coordinate system, and realized camera lens distortion correction. It further performed intrinsic parameter acquisition and spatiotemporal matching of radar-video features so that the data can be displayed in the same dimension, ensuring the accuracy of subsequent data fusion and providing effective and reliable support for it.
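The coordinate conversion described above can be illustrated with a short sketch: a 2D traffic-radar detection given as (range, azimuth) is lifted into the camera frame through a pre-calibrated extrinsic and then projected with the camera intrinsics, applying a simple radial distortion model so the point lands on the raw image. The function name, matrices, and distortion coefficients below are placeholders, not calibrated values.

```python
# Hypothetical radar-to-pixel projection assuming a planar 2D radar and a pinhole camera.
import numpy as np

def radar_to_pixel(rng, azimuth, R_cam_radar, t_cam_radar, K, dist_k=(0.0, 0.0)):
    # Radar measures in a polar ground plane; convert to Cartesian radar coordinates.
    p_radar = np.array([rng * np.sin(azimuth), 0.0, rng * np.cos(azimuth)])
    # Extrinsic transform into the camera frame.
    p_cam = R_cam_radar @ p_radar + t_cam_radar
    x, y = p_cam[0] / p_cam[2], p_cam[1] / p_cam[2]        # normalized image plane
    # Radial distortion (Brown model, first two terms only) so points match the raw image.
    r2 = x * x + y * y
    scale = 1.0 + dist_k[0] * r2 + dist_k[1] * r2 * r2
    xd, yd = x * scale, y * scale
    u = K[0, 0] * xd + K[0, 2]
    v = K[1, 1] * yd + K[1, 2]
    return u, v

K = np.array([[1200.0, 0.0, 960.0], [0.0, 1200.0, 540.0], [0.0, 0.0, 1.0]])
R = np.eye(3)                       # placeholder rotation, radar frame to camera frame
t = np.array([0.0, 1.2, 0.0])       # placeholder lever arm in metres
print(radar_to_pixel(35.0, np.deg2rad(5.0), R, t, K, dist_k=(-0.3, 0.1)))
```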
Highlights (main findings): (1) An Adaptive Cross-Modal Denoising (ACMD) framework is presented, introducing a reliability-driven uni-directional fusion mechanism that selectively refines the noisy modality using semantic cues from the cleaner sensor. (2) A novel attention-based ABC + CMD pipeline is developed, enabling efficient noise-aware feature alignment and outperforming state-of-the-art unimodal and multimodal denoising methods across LiDAR-camera perception tasks. Implications: ACMD enhances the robustness of autonomous perception in adverse weather by achieving large gains in PSNR, Chamfer Distance, and Joint Denoising Effect without adding computational burden; the plug-and-play ACMD design generalizes to any encoder-decoder backbone, making it suitable for deployment in real-time AV systems and for future multimodal sensing combinations (LiDAR-thermal, radar-camera). Abstract: Autonomous vehicles (AVs) rely on LiDAR and camera sensors to perceive their environment. However, adverse weather conditions, such as rain, snow, and fog, negatively affect these sensors, reducing their reliability by introducing unwanted noise. Effective denoising of multimodal sensor data is crucial for safe and reliable AV operation in such circumstances. Existing denoising methods primarily focus on unimodal approaches, addressing noise in individual modalities without fully leveraging the complementary nature of LiDAR and camera data. To enhance multimodal perception in adverse weather, we propose a novel Adaptive Cross-Modal Denoising (ACMD) framework, which leverages modality-specific self-denoising encoders, followed by an Adaptive Bridge Controller (ABC) to evaluate residual noise and guide the direction of cross-modal denoising. Following this, the Cross-Modal Denoising (CMD) module is introduced, which selectively refines the noisier modality using semantic guidance from the cleaner modality. Synthetic noise was added to both sensors' data during training to simulate real-world noisy conditions. Experiments on the WeatherKITTI dataset show that ACMD surpasses traditional unimodal denoising methods (Restormer, PathNet, BM3D, PointCleanNet) by 28.2% in PSNR and 33.3% in CD, and outperforms state-of-the-art fusion models by 16.2% in JDE. The ACMD framework enhances AV reliability in adverse weather conditions, supporting safe autonomous driving.
Four-dimensional (4D) millimeter-wave (mmWave) radars are promising for developing three-dimensional (3D) object detection systems for autonomous driving (AD) and advanced driver-assistance systems (ADAS). Existing radar-based approaches focus more on augmenting average performance than on detecting static objects. In this paper, we propose BiAdaFusion, a detector optimized for static objects based on radar-camera fusion (RCF). To mitigate cross-modal inconsistency and enhance the quality of the fusion features, the Bi-directional Adaptive Feature Alignment (BiA-FA) mechanism consists of the Spatial-Channel Adaptive Feature Aligner (SCA-FA) and Channel-Spatial Adaptive Feature Aligner (CSA-FA) for the fusion stages within BiAdaFusion. Experimental results on the View-of-Delft (VoD) dataset demonstrate that BiAdaFusion achieves an average precision (AP) of over 48 for car objects, the majority of which are static, while maintaining a mean average precision (mAP) above 54 for 3D driving scene understanding. This research demonstrates that BiAdaFusion provides a novel paradigm for designing RCF-based 3D object detectors, enhancing performance on static objects.
Accurate extrinsic calibration of camera, radar, and LiDAR is critical for multi-modal sensor fusion in autonomous vehicles and mobile robots. Existing methods typically perform pair-wise calibration and rely on specialized targets, limiting scalability and flexibility. We introduce a universal calibration framework based on an Iterative Best Match (IBM) algorithm that refines alignment by optimizing correspondences between sensors, eliminating traditional point-to-point matching. IBM naturally extends to simultaneous camera–LiDAR–radar calibration and leverages tracked natural targets (e.g., pedestrians) to establish cross-modal correspondences without predefined calibration markers. Experiments on a realistic multi-sensor platform (fisheye-camera, LiDAR, and radar) and the KITTI dataset validate the accuracy, robustness, and efficiency of our method.
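The iterate-correspondences-then-refine principle behind the framework above can be sketched in a few lines: given tracked target centroids seen by two sensors, alternately (1) pick the best match for each point and (2) re-estimate the rigid extrinsic in closed form (Kabsch/SVD). This ICP-style loop is only an illustration of the idea under a small-misalignment assumption, not the paper's exact Iterative Best Match algorithm.

```python
# Hedged sketch: alternating best-match correspondences and closed-form rigid refinement.
import numpy as np

def kabsch(src, dst):
    """Closed-form rigid transform (R, t) minimizing ||R @ src + t - dst||."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s

def iterative_best_match(radar_pts, lidar_pts, iters=20):
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        moved = radar_pts @ R.T + t
        # Best match: nearest LiDAR-tracked target for every radar-tracked target.
        d = np.linalg.norm(moved[:, None, :] - lidar_pts[None, :, :], axis=2)
        matches = lidar_pts[d.argmin(axis=1)]
        R, t = kabsch(radar_pts, matches)
    return R, t

# Toy example: recover a small known offset between two sets of tracked target centroids.
rng = np.random.default_rng(0)
lidar_pts = rng.uniform(-10, 10, size=(40, 3))
yaw = np.deg2rad(5.0)
R_true = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw),  np.cos(yaw), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 0.1])
radar_pts = (lidar_pts - t_true) @ R_true      # simulated radar view of the same targets
R_est, t_est = iterative_best_match(radar_pts, lidar_pts)
```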
In Advanced Driver Assistance Systems (ADAS), environmental perception and object detection are crucial for ensuring safe autonomous driving. Single-modality systems often struggle under adverse weather conditions, underscoring the need for multi-modal approaches. Current fusion methods typically rely on simplistic concatenation of multi-modal features, which neglects semantic alignment and does not fully exploit inter-modal correlations. This paper proposes a cross-attention feature fusion module specifically designed to enhance the global correlation between camera and radar features. By dynamically adjusting feature weights through cross-attention, our approach significantly improves feature integration. Furthermore, we propose a depth-weighted voting fusion strategy to select the most accurate sensor depth, thereby enhancing decision-making stability. Experimental results on the nuScenes dataset show substantial improvements, with mean Average Precision (mAP) of 0.399 and mean Average Translation Error (mATE) of 0.602, highlighting the effectiveness of our approach in enhancing the robustness and accuracy of multi-modal fusion.
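A minimal sketch of this kind of cross-attention fusion is shown below: flattened camera tokens act as queries and radar tokens as keys/values, so the fused feature is dynamically re-weighted by inter-modal correlation. The token sizes, single-block design, and feed-forward width are illustrative assumptions, not the paper's architecture.

```python
# Minimal cross-attention fusion block between camera and radar token features.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, cam_tokens, radar_tokens):
        # cam_tokens: (B, Nc, C) flattened camera features; radar_tokens: (B, Nr, C).
        attended, _ = self.attn(query=cam_tokens, key=radar_tokens, value=radar_tokens)
        x = self.norm(cam_tokens + attended)       # residual keeps the camera semantics
        return x + self.ffn(x)

# Example: 2D feature maps flattened into token sequences before fusion.
cam = torch.randn(2, 32 * 88, 256)      # e.g. a 32x88 camera feature map
radar = torch.randn(2, 300, 256)        # e.g. 300 encoded radar points
fused = CrossAttentionFusion()(cam, radar)
```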
Accurate environmental perception is fundamental to safe autonomous driving; however, most existing multimodal systems rely on fixed or heuristic sensor fusion strategies that cannot adapt to scene-dependent variations in sensor reliability. This paper proposes Cross-Modal Adaptive Attention (CMAA), a unified end-to-end Bird’s-Eye-View (BEV) perception framework that dynamically fuses camera, LiDAR, and RADAR information through learnable, context-aware modality gating. Unlike static fusion approaches, CMAA adaptively reweights sensor contributions based on global scene descriptors, enabling the robust integration of semantic, geometric, and motion cues without manual tuning. The proposed architecture jointly performs 3D object detection, multi-object tracking, and motion forecasting within a shared BEV representation, preserving spatial alignment across tasks and supporting efficient real-time deployment. Experiments conducted on the official nuScenes validation split demonstrate that CMAA achieves 0.528 mAP and 0.691 NDS, outperforming fixed-weight fusion baselines while maintaining a compact model size and efficient inference. Additional tracking evaluation using the official nuScenes tracking devkit reports improved tracking performance, while motion forecasting experiments show reduced trajectory displacement errors (minADE and minFDE). Ablation studies further confirm the complementary contributions of adaptive modality gating and bidirectional cross-modal refinement, and a stratified dynamic analysis reveals consistent reductions in velocity estimation error across object classes, motion regimes, and environmental conditions. These results demonstrate that adaptive multimodal fusion improves robustness, motion reasoning, and perception reliability in complex traffic environments while remaining computationally efficient for deployment in safety-critical autonomous driving systems.
In this work, we present SpaRC, a novel Sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird's Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance metrics of 67.1 NDS and 63.1 AMOTA. The code and pretrained models are available at https://github.com/phi-wol/sparc.
Multi-view radar-camera fused 3D object detection provides a farther detection range and more helpful features for autonomous driving, especially under adverse weather. Current radar-camera fusion methods propose various designs to fuse radar information with camera data. However, these fusion approaches usually adopt a straightforward concatenation operation between multi-modal features, which ignores semantic alignment of the radar features and sufficient cross-modal correlation. In this paper, we present MVFusion, a novel Multi-View radar-camera Fusion method to achieve semantic-aligned radar features and enhance cross-modal information interaction. To this end, we inject semantic alignment into the radar features via the semantic-aligned radar encoder (SARE) to produce image-guided radar features. Then, we propose the radar-guided fusion transformer (RGFT) to fuse our radar and image features and strengthen the correlation between the two modalities at a global scope via the cross-attention mechanism. Extensive experiments show that MVFusion achieves state-of-the-art performance (51.7% NDS and 45.3% mAP) on the nuScenes dataset. We shall release our code and trained networks upon publication.
Autonomous perception systems demand robust performance across diverse conditions. While visual sensors provide rich semantic information, their performance degrades significantly under adverse weather and lighting. Conversely, millimeter-wave radar sensors offer strong all-weather robustness and direct velocity sensing but produce sparse, low-resolution data with limited semantics, hindering precise object detection. To address this issue, this paper proposes the Multi-Modal Camera-Radar Fusion (MMCRF) method. This approach skips traditional signal processing by directly utilizing raw radar Range-Doppler (RD) data. Meanwhile, an independent image processing network is responsible for handling camera data. The first step involves the projection of the images onto a polar coordinate grid within a Bird's Eye View (BEV) perspective. Then, depth features are extracted through a specifically designed encoder-decoder network. These visual features are then deeply fused with Range-Azimuth (RA) features from the radar RD spectrum for object detection. This method significantly outperforms existing fusion detection frameworks in distance and azimuth error metrics: the distance error (RE) reaches 0.11 m, 8.3% lower than the current best method, and the azimuth error (AE) is 0.09, 18.2% lower than the next-best method, while the detector attains an AP of 96.12%, an AR of 92.23%, and real-time inference at 58.91 FPS.
Radar-camera fusion methods have emerged as a cost-effective approach for 3D object detection but still lag behind LiDAR-based methods in performance. Recent works have focused on employing temporal fusion and Knowledge Distillation (KD) strategies to overcome these limitations. However, existing approaches have not sufficiently accounted for uncertainties arising from object motion or sensor-specific errors inherent in radar and camera modalities. In this work, we propose RCTDistill, a novel cross-modal KD method based on temporal fusion, comprising three key modules: Range-Azimuth Knowledge Distillation (RAKD), Temporal Knowledge Distillation (TKD), and Region-Decoupled Knowledge Distillation (RDKD). RAKD is designed to consider the inherent errors in the range and azimuth directions, enabling effective knowledge transfer from LiDAR features to refine inaccurate BEV representations. TKD mitigates temporal misalignment caused by dynamic objects by aligning historical radar-camera BEV features with current LiDAR representations. RDKD enhances feature discrimination by distilling relational knowledge from the teacher model, allowing the student to differentiate foreground and background features. RCTDistill achieves state-of-the-art radar-camera fusion performance on both the nuScenes and View-of-Delft (VoD) datasets, with the fastest inference speed of 26.2 FPS.
LiDAR, radar, and cameras are widely used in autonomous driving systems, but each modality has inherent limitations, such as LiDAR's sensitivity to adverse weather, radar's low spatial resolution, and cameras' dependence on lighting conditions. To address these challenges, this study proposes a novel multi-modal fusion framework that integrates these sensors to enhance object detection accuracy and robustness. A cross-modal feature alignment strategy ensures spatial and semantic consistency across sensor data, while an attention-based mechanism dynamically adjusts the contributions of each modality based on their reliability in different scenarios. Experimental results on the KITTI and nuScenes datasets show that the framework achieves a mean Average Precision (mAP) of 89.4% while maintaining real-time efficiency at 36.8 FPS. Compared to single-modality baselines and traditional fusion methods, the proposed framework demonstrates superior detection performance, particularly in scenarios involving occlusion, low-light conditions, and dense traffic. Ablation studies validate the effectiveness of the cross-modal alignment and attention mechanisms, highlighting their critical roles in achieving robust detection. The proposed framework offers a scalable and efficient solution for autonomous driving systems, effectively addressing the limitations of single-modality sensors.
Radar-based human motion recognition (HMR) is gaining increasing attention, primarily due to its robust performance in various lighting conditions, especially in healthcare and safety applications, with a specific emphasis on personal privacy. This paper introduces a novel cross-modal bi-contrastive learning method, named BiCLR. Utilizing a Transformer-based network [1] for temporal modeling, BiCLR excels in discriminating instances across both single-modal and cross-modal settings through self-supervised learning. Additionally, the "Radar Combination Map (RCM)" is proposed to provide a comprehensive representation by seamlessly integrating the Range-Doppler Map (RDM), Range-Azimuth Map (RAM), and Range-Elevation Map (REM) into a unified map. The primary objective of this work is to address the inherent sparsity in radar data through cross-modality and the newly introduced RCM, presenting a transferable framework applicable to various downstream tasks. This contributes to a deeper understanding of radar-based recognition systems. Despite being trained on a smaller dataset, the pre-trained encoder demonstrates remarkable effectiveness in leveraging cross-modal and contrastive learning methods and the newly introduced radar data format in an HMR task using solely radar data, as substantiated by a thorough evaluation.
In the domain of autonomous driving, the integration of multi-modal perception techniques based on data from diverse sensors has demonstrated substantial progress. Effectively surpassing the capabilities of state-of-the-art single-modality detectors through sensor fusion remains an active challenge. This work leverages the respective advantages of cameras in perspective view and radars in Bird's Eye View (BEV) to greatly enhance overall detection and tracking performance. Our approach, the Camera-Radar Associated Fusion Tracking Booster (CRAFTBooster), represents a pioneering effort to enhance radar-camera fusion in the tracking stage, contributing to improved 3D MOT accuracy. The superior experimental results on the K-Radar dataset, which show a 5-6% IDF1 tracking performance gain, validate the potential of effective sensor fusion in advancing autonomous driving.
Environmental perception is an essential task for autonomous driving, which is typically based on LiDAR or camera sensors. In recent years, 4D mm-Wave radar, which acquires 3D point cloud together with point-wise Doppler velocities, has drawn substantial attention owing to its robust performance under adverse weather conditions. Nonetheless, due to the high sparsity and substantial noise inherent in radar measurements, most radar perception studies are limited to object-level tasks, with point-level tasks such as semantic segmentation remaining largely underexplored. This paper aims to explore the possibility of using 4D radar in semantic segmentation. We set up the ZJUSSet dataset containing accurate point-wise class labels for radar and LiDAR. Then we propose a cross-modal distillation framework RaSS to fulfill the task. An adaptive Doppler compensation module is also designed to facilitate the segmentation. Experimental results on ZJUSSet and VoD dataset demonstrate that our RaSS model significantly outperforms the baselines and competitors. Code and dataset will be available upon paper acceptance.
Recently, 3D object detection algorithms based on radar and camera fusion have shown excellent performance, setting the stage for their application in autonomous driving perception tasks. Existing methods have focused on dealing with feature misalignment caused by the domain gap between radar and camera. However, existing methods either neglect inter-modal feature interaction during alignment or fail to effectively align features at the same spatial location across modalities. To alleviate the above problems, we propose a new alignment model called Radar Camera Alignment (RCAlign). Specifically, we design a Dual-Route Alignment (DRA) module based on contrastive learning to align and fuse the features between radar and camera. Moreover, considering the sparsity of radar BEV features, a Radar Feature Enhancement (RFE) module is proposed to improve the densification of radar BEV features with the knowledge distillation loss. Experiments show RCAlign achieves a new state-of-the-art on the public nuScenes benchmark for radar-camera fusion 3D object detection. Furthermore, RCAlign achieves a significant performance gain (4.3% NDS and 8.4% mAP) in real-time 3D detection compared to the latest state-of-the-art method (RCBEVDet).
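A hedged sketch of contrastive alignment between radar and camera BEV features is given below: features sampled from the same BEV cell form positive pairs, while all other cells in the batch serve as negatives (InfoNCE). This illustrates only the generic contrastive-alignment idea; it is not the paper's Dual-Route Alignment module, and the temperature and sampling scheme are assumptions.

```python
# Symmetric InfoNCE loss for aligning radar and camera BEV features cell by cell.
import torch
import torch.nn.functional as F

def bev_contrastive_loss(radar_feats, cam_feats, temperature=0.07):
    """radar_feats, cam_feats: (N, C) features sampled from matching BEV cells."""
    r = F.normalize(radar_feats, dim=1)
    c = F.normalize(cam_feats, dim=1)
    logits = r @ c.T / temperature                  # (N, N) similarity matrix
    targets = torch.arange(r.size(0), device=r.device)
    # Symmetric loss: radar-to-camera and camera-to-radar retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = bev_contrastive_loss(torch.randn(512, 128), torch.randn(512, 128))
```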
In the field of 3D object detection for autonomous driving, LiDAR-Camera (LC) fusion is the top-performing sensor configuration. However, LiDAR is relatively high-cost, which hinders adoption of this technology in consumer automobiles. Alternatively, camera and radar are commonly deployed on vehicles already on the road today, but the performance of Camera-Radar (CR) fusion falls behind LC fusion. In this work, we propose Camera-Radar Knowledge Distillation (CRKD) to bridge the performance gap between LC and CR detectors with a novel cross-modality KD framework. We use the Bird's-Eye-View (BEV) representation as the shared feature space to enable effective knowledge distillation. To accommodate the unique cross-modality KD path, we propose four distillation losses to help the student learn crucial features from the teacher model. We present extensive evaluations on the nuScenes dataset to demonstrate the effectiveness of the proposed CRKD framework. The project page for CRKD is https://song-jingyu.github.io/CRKD.
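The sketch below illustrates the generic shape of cross-modality distillation in a shared BEV space: the camera-radar student's BEV feature is pulled toward the LiDAR-camera teacher's, optionally weighted by a foreground mask so object regions dominate. This shows only the general BEV feature-distillation pattern; the four specific CRKD losses are not reproduced, and the mask here is a stand-in.

```python
# Foreground-weighted BEV feature distillation between a teacher and a student detector.
import torch
import torch.nn.functional as F

def bev_feature_distill(student_bev, teacher_bev, fg_mask=None):
    """student_bev, teacher_bev: (B, C, H, W); fg_mask: (B, 1, H, W) in [0, 1] or None."""
    if student_bev.shape[1] != teacher_bev.shape[1]:
        raise ValueError("project the student to the teacher's channel size first")
    err = F.mse_loss(student_bev, teacher_bev, reduction="none")    # (B, C, H, W)
    if fg_mask is None:
        return err.mean()
    w = fg_mask / fg_mask.sum().clamp(min=1e-6)                      # normalize mask weights
    return (err * w).sum() / student_bev.shape[1]

student = torch.randn(2, 256, 128, 128)                 # camera-radar student BEV feature
teacher = torch.randn(2, 256, 128, 128)                 # LiDAR-camera teacher BEV feature
mask = (torch.rand(2, 1, 128, 128) > 0.9).float()       # stand-in for a GT-box foreground mask
loss = bev_feature_distill(student, teacher, mask)
```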
Recent 3D object detectors typically utilize multi-sensor data and unify multi-modal features in the shared bird’s-eye view (BEV) representation space. However, our empirical findings indicate that previous methods have limitations in generating fusion BEV features free from cross-modal conflicts. These conflicts encompass extrinsic conflicts caused by BEV feature construction and inherent conflicts stemming from heterogeneous sensor signals. Therefore, we propose a novel Eliminating Conflicts Fusion (ECFusion) method to explicitly eliminate the extrinsic/inherent conflicts in BEV space and produce improved multi-modal BEV features. Specifically, we devise a Semantic-guided Flow-based Alignment (SFA) module to resolve extrinsic conflicts via unifying spatial distribution in BEV space before fusion. Moreover, we design a Dissolved Query Recovering (DQR) mechanism to remedy inherent conflicts by preserving objectness clues that are lost in the fusion BEV feature. In general, our method maximizes the effective information utilization of each modality and leverages inter-modal complementarity. Our method achieves state-of-the-art performance in the highly competitive nuScenes 3D object detection dataset. The code is released at https://github.com/fjhzhixi/ECFusion.
We present an approach to automatically generate semantic labels for real recordings of automotive range-Doppler (RD) radar spectra. Such labels are required when training a neural network for object recognition from radar data. The automatic labeling approach rests on the simultaneous recording of camera and lidar data in addition to the radar spectrum. By warping radar spectra into the camera image, state-of-the-art object recognition algorithms can be applied to label relevant objects, such as cars, in the camera image. The warping operation is designed to be fully differentiable, which allows backpropagating the gradient computed on the camera image through the warping operation to the neural network operating on the radar data. As the warping operation relies on accurate scene flow estimation, we further propose a novel scene flow estimation algorithm which exploits information from camera, lidar and radar sensors. The proposed scene flow estimation approach is compared against a state-of-the-art scene flow algorithm, and it outperforms it by approximately 30% w.r.t. mean average error. The feasibility of the overall framework for automatic label generation for RD spectra is verified by evaluating the performance of neural networks trained with the proposed framework for Direction-of-Arrival estimation.
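The differentiable-warping idea above can be sketched compactly: once every camera pixel has been associated with a (range, Doppler) coordinate (here a random placeholder standing in for the projection and scene-flow step), the radar spectrum is sampled at those coordinates with a bilinear, fully differentiable lookup, so gradients from an image-space loss can reach the radar network. The tensor shapes are illustrative assumptions.

```python
# Differentiable resampling of a range-Doppler map onto the camera pixel grid.
import torch
import torch.nn.functional as F

def warp_radar_to_image(rd_spectrum, pixel_to_rd):
    """
    rd_spectrum: (B, C, R, D) range-Doppler feature map.
    pixel_to_rd: (B, H, W, 2) normalized (doppler, range) coordinates in [-1, 1] per pixel.
    Returns (B, C, H, W) radar features resampled onto the camera grid.
    """
    return F.grid_sample(rd_spectrum, pixel_to_rd, mode="bilinear", align_corners=True)

rd = torch.randn(1, 8, 128, 64, requires_grad=True)     # toy RD feature map
coords = torch.rand(1, 90, 160, 2) * 2 - 1              # placeholder pixel-to-RD correspondences
warped = warp_radar_to_image(rd, coords)                 # (1, 8, 90, 160)
warped.mean().backward()                                  # gradient flows back to the radar map
```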
The 3-D object detection is one of the core technologies for autonomous driving. Relying on a single type of sensor is inadequate for effectively perceiving the driving environment, and multisensor fusion solutions mitigate this limitation. This article presents a novel approach for early and middle multimodal fusion, referred to as Bi-RC Fusion, that not only maximizes the utilization of raw data from heterogeneous sensors in the early fusion (EF) but also facilitates bidirectional assistance between radar and camera through our module in the middle fusion (MF) stage, which consists of two parallel sets of multihead cross-attention. In the final fusion stage, we have developed an adaptive feature fusion (AFF) module to enhance the efficiency of cross-modal integration. In addition, we have designed an efficient multiscale feature fusion (MFF) module to learn the fused feature representations across different scales. We have conducted extensive experiments on the nuScenes dataset, including robustness experiments particularly designed for challenging weather conditions. The experimental results demonstrate that our proposed method significantly enhances the performance of 3-D object detection, particularly regarding metrics related to velocity and attribute errors.
4D millimeter-wave radar has gained attention as an emerging sensor for autonomous driving in recent years. However, existing 4D radar and camera fusion models often fail to fully exploit complementary information within each modality and lack deep cross-modal interactions. To address these issues, we propose a novel 4D radar and camera fusion method, named SGDet3D, for 3D object detection. Specifically, we first introduce a dual-branch fusion module that employs geometric depth completion and semantic radar PillarNet to comprehensively leverage geometric and semantic information within each modality. Then we introduce an object-oriented attention module that employs localization-aware cross-attention to facilitate deep interactions across modalities by allowing queries in bird's-eye view (BEV) to attend to interested image tokens. We validate our SGDet3D on the TJ4DRadSet and View-of-Delft (VoD) datasets. Experimental results demonstrate that SGDet3D effectively fuses 4D radar data and camera images and achieves state-of-the-art performance.
Accurate fall detection systems are vital to address the global health concern of elderly falls, which often lead to severe injuries, hospitalizations, and fatalities. Since falls can happen at any time in any location, it is imperative to have a comprehensive system that boasts high applicability across a broad range of scenarios, operating seamlessly 24/7. However, most existing fall detection systems are built upon mono-modal sensors and are thus inherently constrained by their shortcomings. To overcome the constraints of mono-modal systems, we introduce VRFfall, a novel multi-modal fall detection system that seamlessly fuses mmWave radar and camera technologies. As a system with high generalization capabilities, VRFfall supports both multi-modal and mono-modal inputs with its independent feature extraction pipeline for each modality. Utilizing a cross-modal knowledge transfer design, VRFfall enhances performance with mono-modal input by leveraging fused knowledge from the other modality. Moreover, to ensure optimal fusion decisions under modal discrepancies, VRFfall incorporates an adaptive Modal Quality Assessment Module (MQAM) that dynamically evaluates and fuses features from both modalities. Extensive evaluations using a dataset collected from 20 volunteers across two environments and three conditions have been conducted on VRFfall. The results demonstrate its high performance and excellent generalization across diverse environments and conditions, promising a 24/7 continuous fall detection system.
With the rapid development of autonomous driving technology, radar sensors play a vital role in the perception system due to their robustness under harsh environmental conditions and exact range and velocity perception capability. However, the state-of-the-art performance of algorithms based solely on radar for various perception tasks, such as classifying road users and infrastructure, still lags far behind expectation. Their failure can mainly be attributed to the extreme sparseness of radar point clouds for objects, low angular resolution, and the issue of ghost targets. In this work, we propose a novel network that employs the complex range-Doppler matrix as input to achieve radar-tailored panoptic segmentation (i.e., free-space segmentation and object detection). Our network surpasses previous works in free-space segmentation and object detection tasks, and the improvement in the former task is especially notable. During training, the segmented camera image with radar-customized adaptation is utilized as the ground truth. Through such a cross-modal supervision method, the labeling expense is alleviated considerably. Based on it, we further design an innovative camera-radar system concept that is able to automatically train deep neural networks with radar measurements.
This paper presents a novel framework for robust 3D object detection from point clouds via cross-modal hallucination. Our proposed approach is agnostic to either hallucination direction between LiDAR and 4D radar. We introduce multiple alignments on both spatial and feature levels to achieve simultaneous backbone refinement and hallucination generation. Specifically, spatial alignment is proposed to deal with the geometry discrepancy for better instance matching between LiDAR and radar. The feature alignment step further bridges the intrinsic attribute gap between the sensing modalities and stabilizes the training. The trained object detection models can deal with difficult detection cases better, even though only single-modal data is used as the input during the inference stage. Extensive experiments on the View-of-Delft (VoD) dataset show that our proposed method outperforms the state-of-the-art (SOTA) methods for both radar and LiDAR object detection while maintaining competitive efficiency in runtime.
To bridge the modality gap between camera images and LiDAR point clouds in autonomous driving systems, a critical challenge exacerbated by current fusion methods' inability to effectively integrate cross-modal features, we propose the Cross-Modal Fusion (CMF) framework. This attention-driven architecture enables hierarchical multi-sensor data fusion, achieving state-of-the-art performance in semantic segmentation tasks. The CMF framework first projects point clouds onto the camera coordinates through perspective projection to provide spatio-depth information for RGB images. Then, a two-stream feature extraction network extracts features from the two modalities separately, and multilevel fusion of the two modalities is realized by a residual fusion module (RCF) with cross-modal attention. Finally, we design a perceptual alignment loss that integrates cross-entropy with feature matching terms, effectively minimizing the semantic discrepancy between camera and LiDAR representations during fusion. Experimental results on the SemanticKITTI and nuScenes benchmark datasets demonstrate that CMF achieves mean intersection over union (mIoU) scores of 64.2% and 79.3%, respectively, outperforming existing state-of-the-art methods in accuracy and exhibiting enhanced robustness in complex scenarios. The ablation studies further validate that enhancing feature interaction and fusion in semantic segmentation models through cross-modal attention and the perceptually guided cross-entropy loss (Pgce) is effective in improving segmentation accuracy and robustness.
3D single object tracking based on point clouds is a key challenge in robotics and autonomous driving technology. Mainstream methods rely on point clouds for geometric matching or motion estimation between the target template and the search area. However, the lack of texture and the sparsity of incomplete point clouds make it difficult for unimodal trackers to distinguish objects with similar structures. To overcome the limitations of previous methods, this letter proposes a cross-modal fusion conflict elimination tracker (CCETrack). The point clouds collected by LiDAR provide accurate depth and shape information about the surrounding environment, while the camera sensor provides RGB images containing rich semantic and texture information. CCETrack fully leverages both modalities to track 3D objects. Specifically, to address cross-modal conflicts caused by heterogeneous sensors, we propose a global context alignment module that aligns RGB images with point clouds and generates enhanced image features. Then, a sparse feature enhancement module is designed to optimize voxelized point cloud features using the rich image features. In the feature fusion stage, both modalities are converted into BEV features, with the template and search area features fused separately. A self-attention mechanism is employed to establish bidirectional communication between regions. Our method maximizes the use of effective information and achieves state-of-the-art performance on the KITTI and nuScenes datasets through multimodal complementarity.
The environmental perception capability of autonomous driving technology is the key to system reliability. Tesla's camera-based vision solution has insufficient robustness in scenes such as complex lighting and bad weather; the accident rate of Autopilot in extreme weather is 27% higher than that in normal scenes. This study designs a cross-modal fusion algorithm based on a dynamic weighted attention mechanism for the Tesla HW4.0 platform to address challenges such as multi-sensor spatiotemporal calibration, dynamic scene weight allocation, and computational efficiency optimization. The algorithm realizes data fusion of LiDAR, camera, and millimeter-wave radar through a four-layer architecture, and uses a scene-adaptive weight mechanism (e.g., a sunny-day weight allocation of [0.5, 0.3, 0.2]) and spatiotemporal correlation modeling to improve perception accuracy in complex scenes. Experiments show that the algorithm achieves 92.1% mAP@0.5 in heavy rain scenarios, which is 15.6% higher than the Tesla BEV solution. The detection distance for traffic cones reaches 68 meters, and the false detection rate is only 6.3%. The energy efficiency ratio on the Jetson Orin platform is 1.31 FPS/W, which meets the real-time and low-power-consumption requirements of the vehicle. The research provides an efficient solution for cross-modal fusion in autonomous driving.
Transparent objects are prevalent in everyday environments, but their distinct physical properties pose significant challenges for camera-guided robotic arms. Current research is mainly dependent on camera-only approaches, which often falter in suboptimal conditions, such as low-light environments. In response to this challenge, we present FuseGrasp, the first radar-camera fusion system tailored to enhance the transparent objects manipulation. FuseGrasp exploits the weak penetrating property of millimeter-wave (mmWave) signals, which causes transparent materials to appear opaque, and combines it with the precise motion control of a robotic arm to acquire high-quality mmWave radar images of transparent objects. The system employs a carefully designed deep neural network to fuse radar and camera imagery, thereby improving depth completion and elevating the success rate of object grasping. Nevertheless, training FuseGrasp effectively is non-trivial, due to limited radar image datasets for transparent objects. We address this issue utilizing large RGB-D dataset, and propose an effective two-stage training approach: we first pre-train FuseGrasp on a large public RGB-D dataset of transparent objects, then fine-tune it on a self-built small RGB-D-Radar dataset. Furthermore, as a byproduct, FuseGrasp can determine the composition of transparent objects, such as glass or plastic, leveraging the material identification capability of mmWave radar. This identification result facilitates the robotic arm in modulating its grip force appropriately. Extensive testing reveals that FuseGrasp significantly improves the accuracy of depth reconstruction and material identification for transparent objects. Moreover, real-world robotic trials have confirmed that FuseGrasp markedly enhances the handling of transparent items.
No abstract available
No abstract available
This article introduces MMW-Carry, a system designed to predict the probability of individuals carrying various objects using millimeter-wave (MMWave) radar signals, complemented by camera input. The primary goal of MMW-Carry is to provide a rapid and cost-effective preliminary screening solution, specifically tailored for non-super-sensitive scenarios. Overall, MMW-Carry achieves significant advancements in two crucial aspects. First, it addresses localization challenges in complex indoor environments caused by multipath reflections, enhancing the system’s overall robustness. This is accomplished by the integration of camera-based human detection, tracking, and the radar–camera plane transformation for obtaining subjects’ spatial occupancy region, followed by a zooming-in operation on the radar images. Second, the system performance is elevated by leveraging long-term observation of a subject. This is realized through the intelligent fusion of neural network results from multiple different-view radar images of an in-track moving subject and their carried objects, facilitated by a proposed knowledge-transfer module. Our experiment results demonstrate that MMW-Carry detects objects with an average error rate of 25.22% false positives and a 21.71% missing rate (MR) for individuals moving randomly in a large indoor space, carrying the common-in-everyday-life objects, both in open carry or concealed ways. These findings affirm MMW-Carry’s potential to extend its capabilities to detect a broader range of objects for diverse applications.
In this paper, a non-contact respiration detection scheme based on Doppler radar-depth camera sensor fusion is proposed. A continuous-wave (CW) Doppler radar sensor and a depth camera are used to measure the respiratory motion separately. Then a Bayesian sensor fusion algorithm is used to estimate the cycle-to-cycle breathing rate. Experiments show that the proposed fusion scheme can provide a more accurate breathing rate estimation than using a single sensor. In particular, the proposed scheme can give a reasonable estimation even under the influence of body movement.
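For the Gaussian special case, the fusion step above reduces to inverse-variance weighting of the two per-cycle estimates, as sketched below. The numbers are illustrative, and the paper's actual cycle-to-cycle estimator and noise models are not reproduced.

```python
# Inverse-variance (Gaussian/Bayesian) fusion of radar and depth-camera breathing-rate estimates.
import numpy as np

def fuse_estimates(x_radar, var_radar, x_camera, var_camera):
    """Return the fused estimate and its variance for two independent Gaussian measurements."""
    w_r, w_c = 1.0 / var_radar, 1.0 / var_camera
    fused = (w_r * x_radar + w_c * x_camera) / (w_r + w_c)
    return fused, 1.0 / (w_r + w_c)

# Radar says 15.2 breaths/min (noisier this cycle), camera says 14.6 (steadier this cycle).
rate, var = fuse_estimates(15.2, 0.8, 14.6, 0.3)
print(f"fused breathing rate: {rate:.2f} bpm (variance {var:.2f})")
```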
Early and accurate detection of crossing pedestrians is crucial in automated driving in order to perform timely emergency manoeuvres. However, this is a difficult task in urban scenarios where pedestrians are often occluded (not visible) behind objects, e.g., other parked vehicles. We propose an occlusion aware fusion of stereo camera and radar sensors to address scenarios with crossing pedestrians behind such parked vehicles. Our proposed method adapts both the expected rate and properties of detections in different areas according to the visibility of the sensors. In our experiments on a real-world dataset, we show that the proposed occlusion aware fusion of radar and stereo camera detects the crossing pedestrians on average 0.26 seconds earlier than using the camera alone, and 0.15 seconds earlier than fusing the sensors without occlusion information. Our dataset containing 501 relevant recordings of pedestrians behind vehicles will be publicly available on our website for non-commercial, scientific use.
Liquid sensing is critical for food safety and public security. Although mmWave-based approaches enable non-invasive and high-accuracy sensing, they are typically limited to single-target and fixed-container scenarios, restricting their applicability in real-world scenarios. In this paper, we present MCLiD, a multi-modal liquid sensing framework that fuses mmWave radar and camera data to achieve simultaneous multi-target and container-independent liquid identification. The basic idea is to leverage camera-captured object positions and container information to guide mmWave data processing, generating robust and discriminative liquid-specific representations for identification. MCLiD addresses a series of practical challenges and integrates three specialized modules for image-mmWave signature construction, liquid-specific feature extraction, and identification. Experimental results show that MCLiD achieves an average accuracy of 96.46 % across all combinations of 10 liquids and 7 container types. In multi-target scenarios, it maintains 96.4% accuracy for two concurrent liquids and 94.02% for five. These results indicate that MCLiD could enable rapid, non-invasive liquid detection for food safety and high-throughput public security applications.
This article proposes a non-contact negative emotion recognition method. The method categorizes emotions into positive and negative, and utilizes facial expression features and heart rate as indicators to identify negative emotions among individuals in indoor spaces. The proposed method employs a combination of cameras and millimeter-wave radar as sensors for the first time, which enhances its environmental adaptability and improves the accuracy of emotion recognition. This method can be applied in commercial office spaces to evaluate the work quality and psychological state of employees from multiple perspectives and used for early detection of depression tendencies and stress detection.
Fatigue and distracted driving are among the leading causes of traffic accidents, highlighting the importance of developing efficient and non-intrusive driver monitoring systems. Traditional camera-based methods are often limited by lighting variations, occlusions, and privacy concerns. In contrast, millimeter-wave (mmWave) radar offers a non-contact, privacy-preserving, and environment-robust solution, providing a forward-looking alternative. This study introduces a novel deep learning model, RTSFN (radar-based temporal-spatial fusion network), which simultaneously analyzes the temporal motion changes and spatial posture features of the driver. RTSFN incorporates a cross-gated fusion mechanism that dynamically integrates multi-modal information, enhancing feature complementarity and stabilizing behavior recognition. Experimental results show that RTSFN effectively detects dangerous driving states with an average F1 score of 94% and recognizes specific high-risk behaviors with an average F1 score of 97% and can run in real-time on edge devices such as the NVIDIA Jetson Orin Nano, demonstrating its strong potential for deployment in intelligent transportation and in-vehicle safety systems.
With the dangerous and troublesome nature of hollow defects inside building structures, hollowness inspection has always been a challenge in the field of construction quality assessment. Several methods have been proposed for inspecting hollowness inside concrete structures. These methods have shown great advantages compared to manual inspection but still lack autonomy and have several limitations. In this paper, we propose a range-point-migration-based non-contact hollowness inspection system with sensor fusion of ultra-wide-band radar and a laser-based depth camera to extract both outer-surface and inner-hollowness information accurately and efficiently. Simulations evaluate the performance of the system based on both the original range-point migration algorithm and our proposed one, and the results of our system show strong competitiveness. Several simulation experiments on structures that are very common in reality are carried out to draw more convincing conclusions about the system. At the same time, a set of laboratory-made concrete components was used as experimental objects for the robotic system. Although still accompanied by some problems, these experiments demonstrate the feasibility of an automated hollowness inspection system.
This paper proposes a parking space detection system using the fusion of both radar and camera. Such a fused approach addresses the challenging case in which the borders of parking slots are missing and/or not well drawn. Considering the adoption cost of the system, we incorporate MediaTek's Autus R10 (MT2706) Ultra-Short Range Radar with a single-transmitter, single-receiver (1T1R) antenna configuration. Such a 1T1R radar has a lower cost than the mTmR counterparts used in the vast majority of papers on radar signal processing. To the best of our knowledge, this is the first system fusing camera and radar (1T1R) to detect vacant parking spaces and estimate precise parking space coordinates without border lines of parking slots. At the same time, the proposed approach can also handle extreme use cases, e.g., (1) occupation of the parking slot by non-car objects and (2) dim light at night or rainy days. In the proposed system, the camera and radar detect the vacant parking space separately, and the corresponding results are then fused to obtain the final parking space coordinates for the auto-parking system. The experimental evaluations showed that the space prediction error was less than 1% with sufficient light and around 8% in the extreme situations. In summary, the proposed system handles use cases in different light conditions and weather, such as sunny days, rainy days, and night.
Despite radar's popularity in the automotive industry, for fusion-based 3D object detection, most existing works focus on LiDAR and camera fusion. In this paper, we propose TransCAR, a Transformer-based Camera-And-Radar fusion solution for 3D object detection. Our TransCAR consists of two modules. The first module learns 2D features from surround-view camera images and then uses a sparse set of 3D object queries to index into these 2D features. The vision-updated queries then interact with each other via transformer self-attention layer. The second module learns radar features from multiple radar scans and then applies transformer decoder to learn the interactions between radar features and vision-updated queries. The cross-attention layer within the transformer decoder can adaptively learn the soft-association between the radar features and vision-updated queries instead of hard-association based on sensor calibration only. Finally, our model estimates a bounding box per query using set-to-set Hungarian loss, which enables the method to avoid non-maximum suppression. TransCAR improves the velocity estimation using the radar scans without temporal information. The superior experimental results of our TransCAR on the challenging nuScenes datasets illustrate that our TransCAR outperforms state-of-the-art Camera-Radar fusion-based 3D object detection approaches.
Automotive radar and camera fusion relies on linear point transformations from one sensor's coordinate system to the other. However, these transformations cannot handle non-linear dynamics and are susceptible to sensor noise. Furthermore, they operate on a point-to-point basis, so it is impossible to capture all the characteristics of an object. This paper introduces a method that performs detection-to-detection association by projecting heterogeneous object features from the two sensors into a common high-dimensional space. We associate 2D bounding boxes and radar detections based on the Euclidean distance between their projections. Our method utilizes deep neural networks to transform feature vectors instead of single points. Therefore, we can leverage real-world data to learn non-linear dynamics and utilize several features to provide a better description for each object. We evaluate our association method against a traditional rule-based method, showing that it improves the accuracy of the association algorithm and it is more robust in complex scenarios with multiple objects.
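The association scheme above can be illustrated with a short sketch: small MLPs embed camera boxes and radar detections into a common vector space, and pairs are associated by Euclidean distance, here resolved with a Hungarian assignment and a distance gate added for illustration. The input feature choices, network sizes, and gate threshold are assumptions, not the paper's design.

```python
# Learned common-space embedding plus distance-based detection-to-detection association.
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))

    def forward(self, x):
        return self.net(x)

cam_enc = ModalityEncoder(in_dim=5)      # e.g. (cx, cy, w, h, class score) of a 2D box
rad_enc = ModalityEncoder(in_dim=4)      # e.g. (range, azimuth, radial velocity, RCS)

cam_boxes = torch.randn(6, 5)            # 6 camera detections in one frame
radar_dets = torch.randn(4, 4)           # 4 radar detections in the same frame
cam_emb, rad_emb = cam_enc(cam_boxes), rad_enc(radar_dets)

cost = torch.cdist(cam_emb, rad_emb)                         # (6, 4) Euclidean distances
rows, cols = linear_sum_assignment(cost.detach().numpy())
pairs = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 5.0]   # gate unlikely matches
```

In training, the encoders would be supervised so that embeddings of the same physical object end up close together, which is what lets the Euclidean cost capture non-linear sensor dynamics that a point-to-point coordinate transform cannot.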
With the resurgence of non-contact vital sign sensing due to the COVID-19 pandemic, remote heart-rate monitoring has gained significant prominence. Many existing methods use cameras; however previous work shows a performance loss for darker skin tones. In this paper, we show through light transport analysis that the camera modality is fundamentally biased against darker skin tones. We propose to reduce this bias through multi-modal fusion with a complementary and fairer modality - radar. Through a novel debiasing oriented fusion framework, we achieve performance gains over all tested baselines and achieve skin tone fairness improvements over the RGB modality. That is, the associated Pareto frontier between performance and fairness is improved when compared to the RGB modality. In addition, performance improvements are obtained over the radar-based method, with small trade-offs in fairness. We also open-source the largest multi-modal remote heart-rate estimation dataset of paired camera and radar measurements with a focus on skin tone representation.
No abstract available
This paper proposes contactless monitoring of Heart Rate (HR) and Breath Rate (BR) with simultaneous measurements using a Frequency Modulated Continuous Wave (FMCW) radar and a thermal camera. The radar collects body movement signals that include Random Body Movements (RBMs). Non-negative Matrix Factorization (NMF) and wavelet analysis were applied to this signal to obtain accurate values of HR and BR. Similarly, with thermal imaging, the nostril and forehead regions are tracked to estimate BR as well as HR. We conducted an experiment with 50 subjects to compare the performance of the radar and the thermal camera when measuring HR and BR. These two methods were simultaneously validated against a pulse oximeter and a visual camera. From the visual camera we obtain the abdominal movements from which the BR can be ascertained, whereas the pulse oximeter gives the HR. Radar signals are degraded by large RBMs, whereas thermal signals get distorted by sudden temperature changes in the surroundings, sweating, and occlusion. We used a Signal Quality Metric (SQM) to assess the measurement quality of the vital signs. This SQM-based approach can further be used for sensor fusion to build a robust contactless system to monitor vital signs. Clinical relevance: contactless and accurate measurement of HR and BR is essential for continuous and comfortable monitoring of vitals. In this paper, we combine both an FMCW radar and a thermal camera so that one can complement the other in adverse scenarios on the basis of signal quality.
The Advanced Air Mobility (AAM) landscape is evolving rapidly, with Uncrewed Aircraft Systems (UAS) technology advancing at a fast pace and regulation efforts being carried out worldwide to make room for autonomous flights in the civil airspace. Focusing on the surveillance aspect for AAM, strategies should be designed to cope with dense volumes of operations of small targets flying close to the ground in and around urban areas. In this framework, this paper proposes a distributed sensing solution to enhance AAM surveillance by exploiting different ground-based nodes equipped with radars and cameras. The strategy exploits a node-level fusion, where radar and camera measurements are used to estimate the position and velocity of targets, and a network-level fusion to merge such estimates enhancing accuracy and coverage. Aiming to go beyond the purely non-cooperative approach, the proposed strategy also foresees the possibility to include information shared by cooperative platforms, notifying the surrounding traffic about their instantaneous position. The proposed strategy is tested on data collected during experimental tests with up to five UAS, assumed cooperative, and three ground-based nodes. The results achieved show that the strategy enables confirmation of all cooperative targets with a coverage above 90 %. Other non-cooperative targets, such as birds, are also identified thus achieving a comprehensive picture of the monitored low-altitude airspace.
A single sensor such as radar or a video monitoring system cannot cope with illumination changes, swaying branches, occlusion of targets by non-motor vehicles and other objects, or interference caused by drift in the sensor's own parameters, which limits the accuracy of vehicle trajectory identification. In this paper, by studying vehicle detection and tracking algorithms that fuse a millimeter-wave radar sensor and a camera vision sensor, a vehicle detection algorithm based on radar-video fusion is proposed. The radar sensor is the primary sensor and the video sensor is the auxiliary one; the advantages of the two are combined to track and identify vehicles. Finally, the experimental results show that the algorithm can effectively deal with non-motor vehicle interference, frequent occlusion, illumination changes, and other situations, and improves the accuracy of vehicle detection.
Environment perception for autonomous driving traditionally uses sensor fusion to combine the object detections from various sensors mounted on the car into a single representation of the environment. Non-calibrated sensors result in artifacts and aberrations in the environment model, which makes tasks like free-space detection more challenging. In this study, we improve the LiDAR and camera fusion approach of Levinson and Thrun. We rely on intensity discontinuities and on erosion and dilation of the edge image for increased robustness against shadows and visual patterns, which is a recurring problem in point-cloud-related work. Furthermore, we use a gradient-free optimizer instead of an exhaustive grid search to find the extrinsic calibration. Hence, our fusion pipeline is lightweight and able to run in real time on a computer in the car. For the detection task, we modify the Faster R-CNN architecture to accommodate hybrid LiDAR-camera data for improved object detection and classification. We test our algorithms on the KITTI data set and locally collected urban scenarios. We also give an outlook on how radar can be added to the fusion pipeline via velocity matching.
In practical applications, multi-source systems-such as radar, cameras, and infrared sensors-provide diverse and complementary signal information. However, traditional digital signal processing (DSP) algorithms often fail to fully exploit the synergy among these heterogeneous sources, resulting in low information utilization and limited processing accuracy. To address this issue, this paper proposes an Adaptive Multi-source Signal Fusion Algorithm (AMSFA) designed for efficient digital signal processing in multi-source systems. AMSFA is a lightweight, non-neural model-based approach that combines optimization-driven weight computation and rule-based temporal alignment. AMSFA dynamically adjusts the fusion weights of each signal source based on quality indicators such as signal-to-noise ratio (SNR) and inter-source correlation, enabling adaptive and collaborative signal integration. The proposed method incorporates a weighted dynamic update mechanism and a temporal alignment module to enhance robustness and real-time performance in complex environments. Experimental results on simulated multi-source signals and publicly available multi-modal datasets demonstrate that AMSFA significantly outperforms conventional methods in signal enhancement, target recognition, and interference suppression. This work offers a novel and effective solution for signal fusion in multi-source systems and holds promising potential for real-world deployment.
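The abstract describes SNR-driven adaptive weighting; the following minimal sketch (assumed details, not the published AMSFA formulation) converts per-source SNR estimates into normalized fusion weights and applies them to time-aligned signals.

```python
import numpy as np

def fuse_adaptive(signals, snrs_db):
    """Weighted fusion where each source's weight grows with its SNR.

    signals: list of equal-length 1-D arrays (already time-aligned upstream).
    snrs_db: per-source SNR estimates in dB (assumed available from a quality module).
    """
    snr_lin = 10.0 ** (np.asarray(snrs_db) / 10.0)
    weights = snr_lin / snr_lin.sum()          # normalise so weights sum to 1
    stacked = np.vstack(signals)
    return weights @ stacked                   # weighted sum across sources

# Toy example: two noisy copies of the same sinusoid with different SNRs.
t = np.linspace(0, 1, 500)
clean = np.sin(2 * np.pi * 5 * t)
s1 = clean + 0.1 * np.random.randn(t.size)     # high-SNR source
s2 = clean + 0.5 * np.random.randn(t.size)     # low-SNR source
fused = fuse_adaptive([s1, s2], snrs_db=[20.0, 6.0])
```

A full implementation would also include the rule-based temporal alignment and the correlation term mentioned in the abstract.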
Radar is usually more robust than the camera in severe driving scenarios, e.g., weak/strong lighting and bad weather. However, unlike RGB images captured by a camera, the semantic information from the radar signals is noticeably difficult to extract. In this paper, we propose a deep radar object detection network (RODNet), to effectively detect objects purely from the carefully processed radar frequency data in the format of range-azimuth frequency heatmaps (RAMaps). Three different 3D autoencoder based architectures are introduced to predict object confidence distribution from each snippet of the input RAMaps. The final detection results are then calculated using our post-processing method, called location-based non-maximum suppression (L-NMS). Instead of using burdensome human-labeled ground truth, we train the RODNet using the annotations generated automatically by a novel 3D localization method using a camera-radar fusion (CRF) strategy. To train and evaluate our method, we build a new dataset – CRUW, containing synchronized videos and RAMaps in various driving scenarios. After intensive experiments, our RODNet shows favorable object detection performance without the presence of the camera.
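A simplified stand-in for the location-based suppression idea (L-NMS operates on confidence-heatmap peaks; the exact formulation is in the paper) can be written as a greedy NMS over point detections, where proximity in metric space replaces box IoU:

```python
import numpy as np

def location_nms(dets, dist_thresh=2.0):
    """Greedy location-based suppression over (x, y, confidence) detections.

    Nearby peaks (within dist_thresh metres of an already kept, stronger peak)
    are suppressed; this mimics the role of L-NMS without its exact details.
    """
    dets = sorted(dets, key=lambda d: d[2], reverse=True)
    kept = []
    for x, y, conf in dets:
        if all(np.hypot(x - kx, y - ky) > dist_thresh for kx, ky, _ in kept):
            kept.append((x, y, conf))
    return kept

peaks = [(10.0, 2.0, 0.9), (10.5, 2.3, 0.7), (25.0, -4.0, 0.8)]
print(location_nms(peaks))   # the second peak is suppressed by the first
```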
Multi-modal fusion is an important technology in the field of autonomous driving perception. It integrates information from radar, camera, LiDAR, and IMU to give a consistent interpretation of the surrounding environment. This paper presents HVDetFusion, an algorithmic improvement of CenterFusion that fuses camera and radar information. The BEVDepth4D module embedded in the model encodes the height, width, depth, and number of channels of the image. For feature extraction, this paper uses the powerful multi-modal general model InternImage-Base, which gives the model higher classification accuracy on the nuScenes dataset. In addition, preprocessing during training, data augmentation, removal of invalid point cloud noise, and point cloud velocity compensation improve prediction accuracy. HVDetFusion shows high mAP and NDS in multi-sensor fusion by comparison; focusing on data purification, kinematic compensation, and appropriate batch-size settings affects the final fusion quality. HVDetFusion's multi-modal fusion offers complementary advantages, stability, high accuracy, and strong anti-interference, thereby increasing the safety and decision-making efficiency of autonomous driving.
Advancements in autopilot driving and car technology have propelled the development of safe autonomy, particularly in navigating complex traffic environments. This paper delves into applying sensor fusion techniques in automobile systems, focusing on their relevance for safe navigation amidst the intricacies of Indian road conditions. Although Indian roads have improved considerably over time, India still grapples with a high incidence of road accidents exacerbated by non-compliance with traffic rules, so integrating sensor fusion technologies becomes pivotal in enhancing the safety and reliability of autonomous systems. By combining data from various sensors such as cameras, LiDAR, and radar, these systems gain a comprehensive understanding of their surroundings, enabling real-time decision-making in dynamic traffic scenarios. Furthermore, the paper explores the integration of machine learning algorithms to augment sensor fusion capabilities, facilitating adaptive responses to erratic traffic behaviour and irregular driving practices. Through a technology-centric lens, this research examines the role of sensor fusion in bolstering the efficacy of autonomous systems on Indian roads, thereby contributing to the mitigation of road accidents and promoting safer transportation infrastructure.
In this paper, a vision-assisted multipath recognition and suppression method is proposed for the problem of millimeter wave (mmWave) radar producing false targets under the influence of multipath interference. First, object detection is performed on the image and rectangular clustering is performed on the mmWave radar point cloud to complete the data preprocessing. Subsequently, nearest-neighbor frame matching and direct linear transform (DLT) algorithms are used to achieve spatio-temporal calibration of the two sensors. An axial adaptive cost-normalized matching algorithm is then proposed to associate targets from the two sensors, thereby establishing target association pairs. Finally, multipath ghosts in mmWave radar are recognized and suppressed based on the target association results. Experimental results show that the proposed method efficiently recognizes and suppresses multipath ghosts in traffic scenarios.
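To illustrate the DLT step used for the spatio-temporal calibration, a minimal homography DLT over already-associated radar-to-image correspondences might look like the following; the ground-plane assumption, the four-point minimum, and the example coordinates are standard DLT prerequisites and placeholders, not details taken from the paper.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate a 3x3 homography mapping src -> dst with the basic DLT.

    src: (N, 2) radar points on the ground plane, dst: (N, 2) image pixels,
    with N >= 4 correspondences already associated across the two sensors.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project(H, pts):
    """Apply the homography to (N, 2) points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return pts_h[:, :2] / pts_h[:, 2:3]

radar_pts = np.array([[5.0, 1.0], [10.0, -2.0], [15.0, 3.0], [20.0, 0.0]])
pixel_pts = np.array([[320.0, 400.0], [410.0, 300.0], [250.0, 260.0], [330.0, 240.0]])
H = dlt_homography(radar_pts, pixel_pts)
print(project(H, radar_pts))   # reproduces pixel_pts up to numerical error
```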
Spatial calibration aligns heterogeneous sensor data, e.g., millimeter-wave radar (mmWave radar) and camera, in a common coordinate system, but cross-modal correspondence remains challenging due to differing data representations. In this paper, we propose SPECal to perform spatial calibration between mmWave radars and cameras mounted on a moving platform. Our core idea is to leverage the moving platform as a bridge by separately estimating the transformation matrices of the radar and the camera relative to the platform, thereby constructing their mutual mapping. Specifically, to mitigate dynamic interference, we treat dynamic points as outliers and apply a RANSAC-based method combined with two strategies: radar pose consistency filtering and inlier persistence weighting. To estimate the radar's pose from the radar point cloud, we also introduce a velocity projection model, where the radar-measured velocity is the projection of the actual velocity along the radar's radial direction. Furthermore, we propose a cross-modal spatial alignment method based on camera depth maps to refine the estimated pose. Experiment results demonstrate that SPECal achieves an average rotation error (RE) of 4.38° with a standard deviation of 1.19°, and an average translation error (TE) of 14.34 mm with a standard deviation of 3.59 mm.
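The velocity projection model states that the radar-measured (radial) velocity is the projection of the true velocity onto the line of sight. Under the common assumption of static scatterers, this yields a small least-squares problem for the platform's ego-velocity, sketched below; this is illustrative only, and SPECal additionally applies RANSAC-style filtering and the other strategies described above.

```python
import numpy as np

def estimate_platform_velocity(points, radial_vels):
    """Least-squares ego-velocity from Doppler of (assumed static) radar points.

    For a static scatterer seen along unit direction u, the measured radial
    velocity is the projection of the negated platform velocity: v_r = -u . v_ego.
    """
    pts = np.asarray(points, dtype=float)
    u = pts / np.linalg.norm(pts, axis=1, keepdims=True)   # unit line-of-sight vectors
    v_r = np.asarray(radial_vels, dtype=float)
    v_ego, *_ = np.linalg.lstsq(-u, v_r, rcond=None)
    return v_ego

# Toy check: platform moving at (1.0, 0.5, 0.0) m/s, three static points.
pts = [(10, 0, 0), (0, 8, 1), (5, 5, 0)]
true_v = np.array([1.0, 0.5, 0.0])
meas = [-(np.array(p) / np.linalg.norm(p)) @ true_v for p in pts]
print(estimate_platform_velocity(pts, meas))   # ~ [1.0, 0.5, 0.0]
```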
Human detection and tracking in indoor environments are essential for applications such as smart homes, security monitoring, and elder care. Vision-based approaches provide detailed imagery, but are often constrained by occlusions, illumination changes, and privacy concerns. mmWave radar offers a robust alternative; however, the sparsity of its point clouds poses challenges for accurate skeleton reconstruction. To overcome this limitation, we present a deep learning framework that reconstructs 3D human skeletons from mmWave signals. The proposed system aligns radar point clouds with skeleton annotations from a vision sensor, allowing the network to learn a direct mapping from sparse radar inputs to skeletal structures. Experimental results confirm that our method enables reliable and privacy-preserving indoor human detection and tracking.
Digital Twins (DTs) are rapidly emerging as a transformative paradigm for real-time monitoring, simulation, and decision-making across domains such as smart mobility, industrial automation, and intelligent infrastructure. However, current DT implementations often rely on computationally intensive deep learning pipelines and vision-based sensing, which hinder their deployment in resource-constrained or privacy-sensitive environments. In this work, we introduce RadarVision-Twin, a lightweight, interpretable, and edge-deployable Digital Twin system that fuses mmWave radar and RGB camera inputs for real-time anomaly detection and feedback-driven visualization. Unlike traditional camera-centric systems, RadarVision-Twin leverages radar spectrograms to extract motion energy and camera frames to estimate edge density—two physically meaningful features that capture dynamic behavior and scene structure, respectively. These features are fused and fed into an XGBoost classifier, chosen for its efficiency and explainability, to detect motion anomalies in real-time. Our system supports synchronized multimodal visualization, live anomaly flagging, and operator feedback, forming a closed-loop DT that evolves with its environment. We validate our approach on a large-scale multimodal dataset comprising over 14,000 radar-camera frames, demonstrating that our fusion strategy achieves macro F1-scores of 98%, surpassing unimodal baselines. Furthermore, the system runs at low latency on commodity hardware and requires no GPU acceleration, making it suitable for embedded or edge deployments. By eliminating the need for deep models while maintaining interpretability and responsiveness, RadarVisionTwin paves the way for transparent, low-power, and privacy-aware Digital Twins in future 6G-enabled environments.
The knowledge of the precise 3D position of a target in tracking applications is a fundamental requirement. The lack of a low-cost single sensor capable of providing the three-dimensional position of a target makes it necessary to use complementary sensors together. This research presents a Local Positioning System (LPS) for outdoor scenarios, based on a data fusion approach for unmodified UAV tracking, combining a vision sensor and a mmWave radar. The proposed solution takes advantage of the radar's depth observation ability and the potential of a neural network for image processing. We evaluated five data association approaches for cluttered radar data to obtain a reliable set of radar observations. The results demonstrated that the estimated target position is close to an exogenous ground truth obtained from a Visual Inertial Odometry (VIO) algorithm executed onboard the target UAV. Moreover, the developed system's architecture is designed to be scalable, allowing the addition of other observation stations, which will increase the accuracy of the estimation and extend the operational area. To the best of our knowledge, this is the first work that uses a mmWave radar combined with a camera and a machine learning algorithm to track a UAV in an outdoor scenario.
In the realm of autonomous driving, precise and robust 3-D perception is paramount. Multimodal fusion for 3-D object detection is crucial for improving accuracy, generalization, and robustness in autonomous driving. In this article, we introduce the depth enhancement network (DEN), an innovative camera-radar fusion framework that generates an accurate depth estimation for 3-D object detection. To overcome the limitations caused by the lack of spatial information in an image, DEN estimates image depth using accurate radar points. Furthermore, to extract more comprehensive and fine-grained scene depth information, we present an innovative label optimization strategy (LOS) that enhances label density and quality. DEN achieves an 18.78% reduction in mean absolute error (MAE) and a 12.8% decrease in root mean-square error (RMSE) for depth estimation. Additionally, it improves 3-D object detection accuracy by 0.8% compared to the baseline model. Under low visibility conditions, DEN demonstrates a 6.7% reduction in MAE and a 9.6% reduction in RMSE compared to the baseline. These improvements demonstrated its robustness and enhanced performance under challenging conditions.
Waterway perception is critical for the special operations and autonomous navigation of Unmanned Surface Vessels (USVs), but current perception schemes are sensor-based, neglecting the interaction between humans and USVs for embodied perception in various operations. Therefore, inspired by visual grounding, we present WaterVG, the inaugural visual grounding dataset tailored for USV-based waterway perception guided by human prompts. WaterVG contains a wealth of prompts describing multiple targets, with instance-level annotations, including bounding boxes and masks. Specifically, WaterVG comprises 11,568 samples and 34,987 referred targets, integrating both visual and radar characteristics. The text-guided two-sensor pattern provides a fine granularity of text prompts aligned with the visual and radar features of the referent targets, containing both qualitative and numeric descriptions. To enhance the endurance and maintain the normal operations of USVs in open waterways, we propose Potamoi, a low-power visual grounding model. Potamoi is a multi-task model employing a sophisticated Phased Heterogeneous Modality Fusion (PHMF) mechanism, which includes Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). The ARW module utilizes a gating mechanism to adaptively extract essential radar features for fusion with visual inputs, ensuring prompt alignment. MHSCA, characterized by its low parameter count and computational efficiency (FLOPs), effectively integrates contextual information from both sensors with linguistic features, delivering outstanding performance in visual grounding tasks. Comprehensive experiments and evaluations on WaterVG demonstrate that Potamoi achieves state-of-the-art results compared to existing methods. The project is available at https://github.com/GuanRunwei/WaterVG.
Millimeter-wave (mmWave) radar and camera are two of the most critical sensors in modern intelligent transportation systems (ITS), enabling complex fusion-based perception tasks through collaborative working. Achieving high-quality data fusion in ITS requires precise extrinsic parameters (EPs), which define the relative spatial relationship between sensors. However, in real traffic environments, manual measurement of EPs between sensors is labor-intensive and limited in accuracy. To overcome these challenges, this paper proposes an automatic extrinsic calibration method for mmWave radar and camera in traffic environments, requiring only time-synchronized sensor data. First, we develop a novel calibration model with eight parameters, the radar-camera-ground (RCG) model, which describes the spatial relationships between the sensors and the ground. Then, a calibration method named Gaussian Modeling Linear Optical Projection (GLP) is proposed. Specifically, an image instance segmentation model is applied to detect targets from images. Simultaneously, 2D information of targets detected by the radar is extended into 3D space according to the RCG Model. Next, vehicle targets detected by radar and camera are transformed into corresponding 3D and 2D Gaussian models, respectively, leveraging their positional, velocity, and shape features. Afterward, a mapping relationship between the 3D and 2D Gaussian models is established through a linearized optical projection function. Finally, the optimal EPs are estimated via the global optimization algorithm, minimizing the designed calibration loss function based on Bhattacharyya distance. Experimental results on practical traffic scenario data demonstrate that the proposed method outperforms existing approaches in average calibration accuracy and robustness, validating its reliability and superiority.
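The calibration loss is built on the Bhattacharyya distance between radar-derived and camera-derived Gaussian models. As a hedged sketch of that building block (the closed form for two Gaussians, with made-up example moments rather than values from the paper), one could write:

```python
import numpy as np

def bhattacharyya_gaussians(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians, used here as a
    projection-consistency cost between an image-derived and a radar-derived
    2D Gaussian."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

# Image-space Gaussian of a detected vehicle vs. the projected radar Gaussian.
d = bhattacharyya_gaussians([320, 240], [[400, 0], [0, 900]],
                            [330, 250], [[500, 50], [50, 800]])
print(d)
```

In the full method this distance would be summed over all matched vehicle pairs and minimized over the eight RCG parameters by a global optimizer.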
The successful operation of autonomous vehicles hinges on their ability to accurately identify objects in their vicinity, particularly living targets such as bikers and pedestrians. However, visual interference inherent in real-world environments, such as omnipresent billboards, poses substantial challenges to existing vision-based detection technologies. These sources of visual interference exhibit visual attributes similar to living targets, leading to erroneous identification. We address this problem by harnessing the capabilities of mmWave radar, a vital sensor in autonomous vehicles, in combination with vision technology, thereby contributing a unique solution for liveness target detection. We propose a methodology that extracts features from the mmWave radar signal to achieve end-to-end liveness target detection by integrating mmWave radar and vision technology. The proposed methodology is implemented and evaluated on the commodity mmWave radar IWR6843ISK-ODS and a Logitech camera as the vision sensor. Our extensive evaluation reveals that the proposed method accomplishes liveness target detection with a mean average precision of 98.1%, surpassing the performance of existing studies.
Panoptic Driving Perception (PDP) is critical for the autonomous navigation of Unmanned Surface Vehicles (USVs). A PDP model typically integrates multiple tasks, necessitating the simultaneous and robust execution of various perception tasks to facilitate downstream path planning. The fusion of visual and radar sensors is currently acknowledged as a robust and cost-effective approach. However, most existing research has primarily focused on fusing visual and radar features dedicated to object detection or utilizing a shared feature space for multiple tasks, neglecting the individual representation differences between various tasks. To address this gap, we propose a pair of Asymmetric Fair Fusion (AFF) modules with favorable explainability designed to efficiently interact with independent features from both visual and radar modalities, tailored to the specific requirements of object detection and semantic segmentation tasks. The AFF modules treat image and radar maps as irregular point sets and transform these features into a crossed-shared feature space for multitasking, ensuring equitable treatment of vision and radar point cloud features. Leveraging AFF modules, we propose a novel and efficient PDP model, ASY-VRNet, which processes image and radar features based on irregular super-pixel point sets. Additionally, we propose an effective multi-task learning method specifically designed for PDP models. Compared to other lightweight models, ASY-VRNet achieves state-of-the-art performance in object detection, semantic segmentation, and drivable-area segmentation on the WaterScenes benchmark. Our project is publicly available at https://github.com/GuanRunwei/ASY-VRNet.
Single sensors often fail to meet the needs of practical applications due to their lack of robustness and poor detection accuracy in harsh weather and complex environments. To solve this problem, a vehicle detection method based on the fusion of millimeter wave (mmWave) radar and monocular vision is proposed in this paper. The method successfully combines the benefits of mmWave radar for measuring distance and speed with those of vision for classifying objects. Firstly, the raw point cloud data of the mmWave radar are processed by the proposed data pre-processing algorithm to obtain 3D detection points with higher confidence. Next, the density-based spatial clustering of applications with noise (DBSCAN) clustering fusion algorithm and the nearest neighbor algorithm are used to associate data within the same frame and across adjacent frames, respectively. Then, the effective targets from the mmWave radar and vision are matched under spatio-temporal alignment, and the successfully matched targets are output using a Kalman weighted fusion algorithm. Targets that are not successfully matched are marked as new targets for tracking and handled within a validity cycle. Finally, experiments demonstrated that the proposed method can improve target localization and detection accuracy, reduce missed detections, and efficiently fuse the data from the two sensors.
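The final fusion step, combining a matched radar measurement and a vision measurement of the same target, is commonly implemented as inverse-covariance (Kalman-style) weighting. The minimal sketch below uses assumed measurement covariances and is not claimed to match the paper's exact filter design.

```python
import numpy as np

def covariance_weighted_fusion(z_radar, R_radar, z_cam, R_cam):
    """Fuse two position measurements of the same target by inverse-covariance
    weighting (the static special case of a Kalman update)."""
    W_r = np.linalg.inv(R_radar)
    W_c = np.linalg.inv(R_cam)
    P = np.linalg.inv(W_r + W_c)                 # fused covariance
    x = P @ (W_r @ z_radar + W_c @ z_cam)        # fused estimate
    return x, P

# Illustrative covariances: radar accurate in range (x), camera in lateral position (y).
z_r, R_r = np.array([20.1, 3.4]), np.diag([0.05, 1.0])
z_c, R_c = np.array([21.0, 3.1]), np.diag([2.0, 0.04])
x_fused, P_fused = covariance_weighted_fusion(z_r, R_r, z_c, R_c)
print(x_fused)
```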
No abstract available
Sensor fusion is crucial for an accurate and robust perception system on autonomous vehicles. Most existing datasets and perception solutions focus on fusing cameras and LiDAR. However, the collaboration between camera and radar is significantly under-exploited. Incorporating rich semantic information from the camera and reliable 3D information from the radar can achieve an efficient, cheap, and portable solution for 3D perception tasks. It can also be robust to different lighting or all-weather driving scenarios due to the capability of mmWave radars. In this paper, we introduce the CRUW3D dataset, including 66K synchronized and well-calibrated camera, radar, and LiDAR frames in various driving scenarios. Unlike other large-scale autonomous driving datasets, our radar data is in the format of radio frequency (RF) tensors that contain not only 3D location information but also spatio-temporal semantic information. This kind of radar format can enable machine learning models to generate more reliable object perception results after interacting and fusing the information or features between the camera and radar. We run several camera- and radar-based baseline methods for 3D object detection and multi-object tracking on our dataset. We hope the CRUW3D dataset will foster radar and multi-modal 3D perception research. CRUW3D is available at https://huggingface.co/datasets/uwipl/CRUW3D
Both 3D mmWave radar (3D: $x, y, z$) and thermal cameras are robust in harsh environments. Fusing them is beneficial for V2X road-side units and unmanned vehicles operating in all-weather conditions. To fuse the two sensors, accurate extrinsic calibration is indispensable. However, little literature can be found, since 3D radar has only recently entered the market; most research focuses on calibrating 2D radar (2D: $x, y$) with an RGB camera. Recently, one approach was proposed for 3D radar and an RGB camera, but it requires continuous movement of the sensors to perform hand-eye calibration, which is impractical for sensors mounted at the road side in V2X, where the sensors cannot be moved. To solve these problems, 3DRadar2ThermalCalib is proposed. 1) A novel calibration target, a spherical-trihedral, is introduced; we use the sphere center as the common feature, which can be directly detected by both sensors. 2) We propose methods to automatically detect the sphere center from both sensors. 3) The optimal extrinsic parameters are obtained by minimizing the re-projection error. Both quantitative and qualitative analyses in real environments demonstrate that the method is accurate: the re-projection error can reach 1.88 pixels.
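Step 3, minimizing the re-projection error over the extrinsics, can be sketched as a small nonlinear least-squares problem once sphere-centre correspondences and the thermal camera intrinsics are available. The data below are synthetic placeholders (a made-up intrinsic matrix and a known ground-truth pose used only to check that the fit recovers it).

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, pts_radar, px_thermal, K):
    """Residuals between detected 2D sphere centres and projected radar centres.
    params = [rx, ry, rz, tx, ty, tz] (rotation vector + translation)."""
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    cam_pts = pts_radar @ R.T + t                # radar frame -> camera frame
    proj = cam_pts @ K.T                         # pinhole projection
    proj = proj[:, :2] / proj[:, 2:3]
    return (proj - px_thermal).ravel()

K = np.array([[600.0, 0, 320.0], [0, 600.0, 240.0], [0, 0, 1.0]])   # placeholder intrinsics

# Synthetic check: generate pixels from a known pose, then recover it.
R_true = Rotation.from_rotvec([0.05, -0.02, 0.1]).as_matrix()
t_true = np.array([0.10, -0.05, 0.02])
pts_radar = np.random.uniform([-1.0, -0.5, 1.0], [1.0, 0.5, 4.0], size=(8, 3))  # z forward
cam = pts_radar @ R_true.T + t_true
px_thermal = (cam @ K.T)[:, :2] / (cam @ K.T)[:, 2:3]

sol = least_squares(reprojection_residuals, x0=np.zeros(6),
                    args=(pts_radar, px_thermal, K))
R_opt = Rotation.from_rotvec(sol.x[:3]).as_matrix()
t_opt = sol.x[3:]
```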
Autonomous Unmanned Aerial Vehicle (UAV) interactions with powerlines, such as close-up inspections for fault detection or grasping and landing for recharging, require advanced onboard perception capabilities. To solve such tasks, the UAV must be equipped with perception abilities that allow it to navigate between powerlines and safely approach specific cables of interest. A perception system with such capabilities requires state-of-the-art sensor technologies and data processing while still being subject to the limited hardware and energy resources of the UAV. In this paper, we present an advanced embedded system based on the cutting-edge Multiprocessing System-on-Chip (MPSoC) for onboard UAV powerline perception. Our platform consists of a mmWave radar and an RGB camera with data processing carried out on the MPSoC, covering both CPU and Field-Programmable Gate Array (FPGA) computations. Following hardware-software co-design methodology, the heavy image processing tasks are accelerated in the FPGA and fused with computationally light mmWave data on the CPU, facilitating pose-estimation of the power lines. Utilizing the open-source autonomy frameworks PX4 and ROS2, we demonstrate integration of the system with onboard path planning based on the estimated cable positions. The robustness of the detection and pose-estimation methods have been demonstrated in several tests performed both in simulated and real-world powerline environments. The results show that our proposed perception system allows the UAV to safely navigate in close proximity to powerlines, by perceiving more individual cables at longer distances compared to previous work, while remaining lightweight, power-efficient, and low-cost.
This paper presents a millimeter wave (mmWave) radar and machine vision fusion system to alert drivers of potential pedestrian collisions. The system is composed of two subsystems, the mmWave Pedestrian Localization subsystem and the Machine Vision Pedestrian Classification subsystem. The mmWave Pedestrian Localization subsystem obtains the relative location of pedestrians using a mmWave radar sensor, while the Machine Vision Pedestrian Classification subsystem uses Histogram of Oriented Gradients and Support Vector Machine algorithms to classify pedestrians in a camera's field of view. The two-layer pedestrian detection design protects the system from missed detections within a single subsystem. By combining mmWave technology with machine vision, the safe operation of cars and the safety of pedestrians can be improved. The proposed system uses Texas Instruments' AWR1642BOOST mmWave radar together with the high computing power of NVIDIA's Jetson Nano.
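A minimal HOG-plus-linear-SVM classifier of the kind described, sketched with scikit-image and scikit-learn on placeholder crops and labels (the HOG parameters are typical defaults, not necessarily those of the paper):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Placeholder data: 64x128 grayscale crops labelled pedestrian (1) / background (0).
crops = np.random.rand(40, 128, 64)
labels = np.random.randint(0, 2, size=40)

features = np.array([
    hog(c, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for c in crops
])

clf = LinearSVC().fit(features, labels)

def classify_roi(crop):
    """Classify a candidate region, e.g. one corresponding to a radar-localized target."""
    f = hog(crop, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    return clf.predict(f[np.newaxis])[0]

print(classify_roi(crops[0]))
```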
For autonomous driving, it is important to detect obstacles at all scales accurately for safety. In this paper, we propose a new spatial attention fusion (SAF) method for obstacle detection using mmWave radar and a vision sensor, where the sparsity of radar points is explicitly considered. The proposed fusion method can be embedded in the feature-extraction stage, which leverages the features of the mmWave radar and vision sensor effectively. Based on the SAF, an attention weight matrix is generated to fuse the vision features, which differs from concatenation fusion and element-wise addition fusion. Moreover, the proposed SAF can be trained in an end-to-end manner together with recent deep learning object detection frameworks. In addition, we build a generation model that converts radar points to radar images for neural network training. Numerical results suggest that the newly developed fusion method achieves superior performance on public benchmarks. The source code will be released on GitHub.
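A stripped-down version of the spatial-attention idea, in which a rendered radar image is turned into an attention weight map that multiplicatively re-weights the vision feature map (layer sizes and the single-conv attention head are illustrative assumptions, not the paper's SAF block), could look like:

```python
import torch
import torch.nn as nn

class SpatialAttentionFusion(nn.Module):
    """Radar-derived spatial attention applied to image feature maps.

    The radar image is reduced to per-location weights in [0, 1] which
    re-weight the vision features, instead of concatenation or addition.
    """
    def __init__(self, radar_ch, vision_ch):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv2d(radar_ch, vision_ch, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, radar_img, vision_feat):
        w = self.att(radar_img)          # attention weight matrix
        return vision_feat * w           # element-wise re-weighting

fusion = SpatialAttentionFusion(radar_ch=1, vision_ch=64)
radar_img = torch.randn(2, 1, 80, 80)     # rendered radar image
vision_feat = torch.randn(2, 64, 80, 80)  # backbone feature map at the same resolution
fused = fusion(radar_img, vision_feat)
```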
Radio frequency sensors can penetrate non-metal objects and provide complementary information to vision sensors for person identification (PID) purposes. However, there is a lack of research on millimeter wave (mmWave) radar for PID under occlusions, particularly in addressing the open-set recognition problem. Thus, we propose an open-set occluded PID (OSO-PID) framework that can deal with various obstacle and occlusion scenarios with open-set recognition capability. We first introduce a new dataset, mmWave-ocPID, comprising mmWave radar measurements and RGB-depth images, collected from 23 human subjects. We next design a novel neural network, mm-PIDNet, for occluded person identification using mmWave radar measurements. mm-PIDNet incorporates a transformer encoder, a bidirectional long short-term memory module, and a novel supervised contrastive learning module to improve PID performance. For open-set recognition, we enhance the mmWave radar-based PID method by integrating supervised contrastive learning with the Weibull models, which can identify out-of-distribution samples. We perform extensive indoor experiments with a variety of obstacles and occlusion scenarios. Our experimental results show that mm-PIDNet achieves an F1-score of 0.93 on average, outperforming state-of-the-art methods by up to 13.41% for occluded cases. For open-set PID, the OSO-PID framework achieves an F1-score above 0.8 when the openness is less than 14.36%.
Nowadays, human pose estimation (HPE) is widely used in several application areas. The current mainstream method based on vision suffers from privacy leakage and relies on lighting conditions. To adopt a more privacy-preserving and pervasive HPE approach, recent studies have implemented 3-D HPE using commodity radio frequency (RF) signals. However, RF-based HPE faces issues, such as resolution limitations and complex data processing, which makes it challenging to extract and utilize multiscale human activity features. In this article, we propose mmHPE, a novel approach to detect and reconstruct 3-D human posture in multiscale scenarios using a single millimeter wave radar. mmHPE consists of three main parts. Specifically, we develop a 3-D target detection network (TDN) and design an optimized loss function for it to enhance its 3-D target bounding box (BBox) detection capability in radar 3-D space. Next, an enhanced point cloud generator (EPCG) algorithm based on the 3-D target BBox is proposed to generate a stable and accurate point cloud of the target. Furthermore, we design a multiscale coarse-fine HPE network (CFN) ranging from approximate to precise estimation for reconstructing a 3-D skeleton from point cloud data. Extensive experiments demonstrate that our method surpasses other methods for 3-D human pose reconstruction in multiscale scenes, with an average error of 4.50 cm. Our method is robust enough to accurately estimate the target pose even in occluded or low-light scenes.
Autonomous driving vehicles have strong path planning and obstacle avoidance capabilities, which provide great support to avoid traffic accidents, and autonomous driving has become a research hotspot worldwide. Depth estimation is a key technology in autonomous driving as it provides an important basis for accurately detecting traffic objects and avoiding collisions in advance. However, the current difficulties in depth estimation include insufficient estimation accuracy, difficulty in acquiring depth information using monocular vision, and the challenge of fusing multiple sensors for depth estimation. To enhance depth estimation performance in complex traffic environments, this study proposes a depth estimation method in which point clouds and images obtained from mmWave radar and cameras are fused. Firstly, a residual network is established to extract the multi-scale features of the mmWave radar point clouds and the corresponding image obtained simultaneously from the same location. Correlations between the radar points and the image are established by fusing the extracted multi-scale features, and a semi-dense depth estimate is obtained by assigning the depth value of each radar point to the most relevant image region. Secondly, a bidirectional feature fusion structure with additional fusion branches is designed to enhance the richness of the feature information; the information loss during the feature fusion process is reduced, and the robustness of the model is enhanced. Finally, parallel channel and position attention mechanisms are used to enhance the feature representation of the key areas in the fused feature map, the interference of irrelevant areas is suppressed, and the depth estimation accuracy is enhanced. The experimental results on the public nuScenes dataset show that, compared with the baseline model, the proposed method reduces the mean absolute error (MAE) by 4.7–6.3% and the root mean square error (RMSE) by 4.2–5.2%.
Infrastructure-assisted autonomous driving has become a new paradigm that enables autonomous vehicles to fuse sensor data and improve driving safety, where a key enabling technology for achieving this vision is real-time and accurate registration of 3-D mmWave radar point clouds between the infrastructure and the vehicle. To this end, we propose Artemis, a novel lightweight system capable of achieving real-time registration with decimeter-level localization. Artemis consists of three components: 1) a modal association-based salient object extraction component that leverages the complementary advantages of cameras and radars to extract the semantics and areas of salient objects from radar point clouds; 2) a salient object shape construction component that extracts the shape contour of salient objects based on their inherent geometries; and 3) a contour-guided 3-D point cloud registration component that combines two key strategies, a keypoint matching strategy and an early exit strategy, to quickly select keypoints and transformation directions for achieving accurate registration in real time. We implement and evaluate Artemis with two multiview datasets collected on the CARLA platform and on campus. The experiment results show that Artemis achieves an average registration error of 0.33 m within 32.26 ms.
Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, which enables various immersive applications. While existing large-scale HMR datasets rely heavily on line-of-sight RGB input, vision-based sensing is limited by occlusion, lighting variation, and privacy concerns. To overcome these limitations, recent efforts have explored radio-frequency (RF) mmWave radar for privacy-preserving indoor human sensing. However, current radar datasets are constrained by sparse skeleton labels, limited scale, and simple in-place actions. To advance the HMR research community, we introduce M4Human, the current largest-scale (661K-frame) ($9\times$ prior largest) multimodal benchmark, featuring high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC) to enable research across different levels of RF signal granularity. M4Human includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, and spans 20 subjects and 50 diverse actions, including in-place, sit-in-place, and free-space sports or rehabilitation movements. We establish benchmarks on both RT and RPC modalities, as well as multimodal fusion with RGB-D modalities. Extensive results highlight the significance of M4Human for radar-based human modeling while revealing persistent challenges under fast, unconstrained motion. The dataset and code will be released after the paper publication.
Automated manufacturing is the cornerstone of the Industrial Internet of Things (IIoT) ecosystem, where vibration monitoring technology is a critical tool for maintaining industrial machinery. The prevailing approach mostly employs inertial measurement units (IMUs), lasers, and cameras, each of which has deployment constraints. In recent years, millimeter-wave (mmWave) radar has shown high vibration measurement performance, but it faces challenges in accurately localizing vibrating objects and determining observation points. This study introduces a new system called VibCamera, which combines mmWave vibration measurement technology with computer vision (CV) algorithms for vibration monitoring. With the positional assistance of CV semantic segmentation, the radar can accurately determine sufficient observation points, thereby achieving precise measurement with high directionality. VibCamera includes two camera modes, RGB-only and RGB+depth, and solves two technical challenges: 1) integrating multimodal information for vibration target localization and 2) extracting high-quality vibration signals in interference environments. VibCamera provides more consistent and precise outcomes without the need for physical contact. The experimental results indicate that the RGB-only mode has amplitude and frequency errors below $27.04 \; \mu \rm m$ and 0.22 Hz, respectively, with 90% probability, and the RGB+depth mode has errors below $23.72 \; \mu \rm m$ and 0.21 Hz.
Accurate registration of synthetic aperture radar (SAR) and optical images is essential for multimodal data fusion, yet the nonlinear radiometric discrepancies and speckle noise inherent in SAR imagery make this task highly challenging. We propose DAS-Net, a lightweight dual-branch, attention-guided, structure-aware network designed to address these difficulties. The network introduces three key contributions: a modality-adaptive dual-branch feature extractor to capture robust structural cues, a multiscale context aggregation module with attention to enhance geometry-consistent representations, and a frequency-domain-driven coarse-to-fine matching strategy that achieves sub-pixel alignment. In addition, a structure-aware matching loss jointly enforces global semantic alignment and local spatial consistency. Extensive experiments on three SAR–optical datasets (two public and one self-constructed) show that DAS-Net consistently outperforms state-of-the-art handcrafted and learning-based methods in terms of matching precision, number of correct correspondences, and registration accuracy. The network also demonstrates strong adaptability in large-scale farmland scenes characterized by weak textures and repetitive patterns, where conventional approaches often fail. These results confirm the effectiveness and efficiency of DAS-Net for SAR–optical image registration and its potential for multimodal remote sensing applications.
High-precision registration of synthetic aperture radar (SAR) and optical images based on point features remains a particularly challenging task, as the detection and description of feature points are susceptible to nonlinear radiometric distortions and SAR speckle noise. For this purpose, a multilevel point-matching algorithm based on hierarchical feature detection and description is proposed in this letter to improve the accuracy of SAR-to-optical (S-O) image registration. First, a FAST feature detector (OIPC-Fast) is constructed by combining overlapping chunking, image stratification, and phase congruency (PC). The OIPC-Fast detector performs hierarchical feature detection on SAR and optical images based on image properties by two-dimensional discrete wavelet transform and multimoment of PC map, respectively. Feature points with high consistency are screened out by voting criteria. The repeatability of keypoints is effectively improved. Then, a multilevel matching strategy is proposed. The SAR feature descriptor is constructed in this strategy by capturing more layers of image information rather than using a single denoised SAR image information after preprocessing, thus enhancing the robustness of SAR feature descriptors. Ten sets of real image data are used for experimental validation. Compared with some of the most advanced algorithms, the results indicate that the registration accuracy can be improved by applying the proposed point-matching algorithm to S-O image registration.
The registration of optical and synthetic aperture radar (SAR) images is severely affected by nonlinear radiometric distortion (NRD) and speckle noise. To address these challenges, we propose a novel multiscale phase asymmetry-based optical SAR registration (MSPA-OS) method, which pioneeringly incorporates phase asymmetry (PA) into the feature extraction process. Compared with phase congruency (PC), PA is more robust to noise. By aggregating PA across multiple scales, we efficiently extract the comprehensive structural features of images. Moreover, a multiregion cross-scale matching (MRCSM) strategy with the rotation-invariant descriptors is devised to handle substantial geometric deformations. Furthermore, MSPA-OS employs a set of monogenic filters to process images, significantly increasing the computational speed. Finally, we compare the performance of MSPA-OS with those of seven state-of-the-art methods using synthetic and real datasets. The experimental results show that MSPA-OS exhibits competitive registration robustness and speed.
Optical and synthetic aperture radar (SAR) image registration is crucial for multimodal image fusion and applications. However, several challenges limit the performance of existing deep learning-based methods in cross-modal image registration: 1) significant nonlinear radiometric variations between optical and SAR images affect the shared feature learning and matching; 2) limited textures in images hinder discriminative feature extraction; and 3) the local receptive field of convolutional neural networks (CNNs) restricts the learning of contextual information, while the transformer can capture long-range global features but with high computational complexity. To address these issues, this article proposes a multiexpert learning framework with the state space model (ME-SSM) for optical and SAR image registration. First, to improve the registration performance with limited textures, ME-SSM constructs a multiexpert learning framework (MELF) to capture shared features from multimodal images. Specifically, it extracts features from various transformations of the input image and employs a learnable soft router to dynamically fuse these features, thereby enriching feature representations and improving registration performance. Second, ME-SSM introduces a state space model (SSM), Mamba, for feature extraction, which employs a multidirectional cross-scanning strategy to efficiently capture global contextual relationships with linear complexity. ME-SSM can expand the receptive field, enhance image registration accuracy, and avoid incurring high computational costs. In addition, ME-SSM uses a multilevel feature aggregation (MFA) module to enhance the multiscale feature fusion and interaction. Extensive experiments have demonstrated the effectiveness and advantages of our proposed ME-SSM on optical and SAR image registration. Specifically, ME-SSM improves the correct matching rate (CMR) by 7.14% and 1.95% based on thresholds 1 and 3, respectively, on the SEN1-2 dataset, and increases the CMR by 2.12% based on threshold 3 on the OS dataset. The code is available at https://github.com/Miraitowa515/ME-SSM
This article addresses the problem of multimodal image registration between in-flight synthetic aperture radar (SAR) imagery and high-resolution optical reference maps, a key challenge for autonomous navigation in global positioning system (GPS)-denied environments. Despite recent advances in deep learning-based registration, robustness under severe sensor degradation and generalization across diverse real-world geographies remain underexplored. To tackle this issue, we introduce a continental-scale dataset over Europe, covering 4.9 million $\mathrm {km}^{2}$ with multitemporal and multiseasonal SAR and optical imagery. Two experimental protocols are proposed: spatial separation and temporal separation. To simulate a terrain-aided navigation (TAN) scenario, SAR images are corrupted with a Gaussian blur and speckle noise. We evaluate four recent deep neural networks and demonstrate the benefits of combining their backbones and loss functions to improve registration accuracy. Detailed experiments on spatial and temporal separation protocols indicate that the right combination of architectural elements can lead to performance gains exceeding 120% in severely degraded conditions. In particular, our findings reveal that the pseudo-Siamese OSMNet network, when trained with a cross-entropy (CE) loss, demonstrates the highest robustness.
The registration of optical and synthetic aperture radar (SAR) images is a challenging technology in the field of Earth observation remote sensing. Precisely registering images of the same area obtained from different sensors lays the foundation for all-weather remote sensing applications. Traditional methods exploit the consistency of phase and gradient to extract common handcrafted features for optical and SAR image registration. However, there is a significant nonlinear radiometric difference between optical and SAR images, which severely restricts registration accuracy and precision. To improve registration accuracy, an efficient feature- and position-registration algorithm based on local depth feature descriptors is adopted. Depth feature descriptors are generated by a deep convolutional neural network composed of multiple dense convolutional blocks and cross-stage partial networks to achieve position registration of SAR and optical images. At the same time, the Delaunay triangulation method and a piecewise affine warping algorithm are combined to warp the SAR image onto the optical image, achieving fusion of the optical and SAR images. The experimental results show that the dense convolutional network performs well in feature extraction, especially for urban areas. Meanwhile, the deep-learning-based method for SAR and optical image feature registration and fusion achieves excellent matching and fusion results, with all errors within one pixel, which can provide favorable support for dynamic environmental monitoring, tracking of target changes, and military reconnaissance.
The spatial registration of optical and synthetic aperture radar (SAR) images is a key prerequisite for the fusion of their complementary information. However, due to the significant differences in radiometric and structural information caused by their different imaging principles, the registration of optical and SAR images faces many difficulties. In this paper, a deep learning-based approach is investigated under the framework of template matching. A feature extraction backbone network based on deep and shallow feature concatenation (DSFC) is proposed, and template matching is performed by a pixel-by-pixel sliding search. We conducted method comparison experiments and cross-validation experiments on optical and SAR images with different resolutions and surface types, which fully verify the superiority of the proposed method in terms of the registration accuracy of optical and SAR images.
Multimodal remote sensing image registration faces severe challenges due to geometric and radiometric differences, particularly between optical and synthetic aperture radar (SAR) images. These inherent disparities make extracting highly repeatable cross-modal feature points difficult. Current methods typically rely on image intensity extreme responses or network regression without keypoint supervision for feature point detection. Moreover, they not only lack explicit keypoint annotations as supervision signals but also fail to establish a clear and consistent definition of what constitutes a reliable feature point in cross-modal scenarios. To overcome this limitation, we propose PLISA—a novel heterogeneous image registration method. PLISA integrates two core components: an automated pseudo-labeling module (APLM) and a pseudo-twin interaction network (PTIF). The APLM introduces an innovative labeling strategy that explicitly defines keypoints as corner points, thereby generating consistent pseudo-labels for dual-modality images and effectively mitigating the instability caused by the absence of supervised keypoint annotations. These pseudo-labels subsequently train the PTIF, which adopts a pseudo-twin architecture incorporating a cross-modal interactive attention (CIA) module to effectively reconcile cross-modal commonalities and distinctive characteristics. Evaluations on the SEN1-2 dataset and OSdataset demonstrate PLISA’s state-of-the-art cross-modal feature point repeatability while maintaining robust registration accuracy across a range of challenging conditions, including rotations, scale variations, and SAR-specific speckle noise.
Synthetic aperture radar (SAR) and optical imagery are complementary methods in Earth observation. However, traditional similarity measures struggle to accurately align these heterogeneous images due to sensor differences and modality disparities. We propose a cosine similarity template matching network to address this challenge. Our approach leverages spatial search operations and cosine similarity to effectively quantify similarities between SAR and optical images. We introduce a pooling heatmap loss with label transform operation to facilitate smoother convergence. This method precisely identifies matching regions in heterogeneous datasets, significantly outperforming state-of-the-art methods. Moreover, we construct comprehensive datasets comprising spring, summer, fall, and winter subsets derived from SEN1-2 datasets, each containing diverse SAR and optical image pairs. These datasets serve as benchmarks for evaluating template matching algorithms in heterogeneous image scenarios, setting the stage for further advancements in template matching research.
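A brute-force sketch of cosine-similarity template matching is shown below; the paper matches learned features with a pooling heatmap loss, whereas here raw patches stand in for those features and the data are synthetic.

```python
import numpy as np

def cosine_similarity_map(reference, template):
    """Dense cosine similarity between a template and every same-sized window
    of the reference image (brute-force version of the matching head)."""
    th, tw = template.shape
    t = template.ravel()
    t = t / (np.linalg.norm(t) + 1e-8)
    H, W = reference.shape
    out = np.full((H - th + 1, W - tw + 1), -1.0)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            w = reference[i:i + th, j:j + tw].ravel()
            out[i, j] = t @ w / (np.linalg.norm(w) + 1e-8)
    return out

optical = np.random.rand(64, 64)
sar_patch = optical[20:36, 30:46] + 0.05 * np.random.randn(16, 16)  # noisy co-located patch
sim = cosine_similarity_map(optical, sar_patch)
print(np.unravel_index(sim.argmax(), sim.shape))   # ~ (20, 30)
```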
Fast and Robust Optical-to-SAR Remote Sensing Image Registration Using Region-Aware Phase Descriptor
Automatic registration of optical and synthetic aperture radar (SAR) remote sensing images has been extensively researched for decades. However, fast and robust multimodal image registration is still a challenge due to the significant region differences in many remote sensing scenarios (e.g., city and sea region coexist in single harbor imaging). To address this problem, this article proposes a novel optical-to-SAR registration method called FED-HOPC, which integrates the fast Fourier transform (FFT) and weighted edge density (WED) map into the histogram of oriented phase congruency (HOPC)-based registration framework. FFT is used to accelerate the 3-D normalized cross correlation (NCC) for coarse matching. WED map is employed to detect adaptive block-based interest points, which are used to build region-aware HOPC descriptors for the following fine registration. To evaluate the performance of FED-HOPC, extensive experiments on four optical-to-SAR registration datasets are conducted. The results demonstrate that FED-HOPC effectively registers optical-to-SAR images with significant region differences and outperforms several state-of-the-art (SOTA) methods, including HOPC, modality-independent neighborhood descriptor (MIND), modality independent region descriptor (MIRD), and channel feature of orientated gradient (CFOG).
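The coarse stage relies on FFT-accelerated normalized cross correlation. As a stand-in for the paper's 3-D NCC (and using a noisy optical sub-patch rather than a real SAR chip), scikit-image's FFT-backed match_template gives the flavour of that step:

```python
import numpy as np
from skimage.feature import match_template   # FFT-backed normalized cross-correlation

optical = np.random.rand(256, 256)
sar_like = optical[100:164, 80:144] + 0.1 * np.random.randn(64, 64)  # stand-in SAR chip

ncc = match_template(optical, sar_like)          # coarse similarity surface
row, col = np.unravel_index(ncc.argmax(), ncc.shape)
print(row, col)   # ~ (100, 80): coarse offset handed to the fine HOPC stage
```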
Registration of optical and synthetic aperture radar (SAR) image pairs is a fundamental task in various remote sensing applications, including image fusion, target localization, and object detection. Unlike homogeneous image pairs, optical and SAR image pairs exhibit a significant modality gap, making it exceptionally challenging to extract consistent and reliable features. Particularly for optical and SAR image pairs with substantial geometric differences, few methods can achieve high-precision registration. To address this challenging task, we introduce a novel registration framework, called OS3Flow, leveraging on the implicit symmetry between heterogeneous image pairs to extract high-quality semi-dense flow estimations. We start by training the network in a multitask manner using a standard flow regression loss as well as a symmetry loss with reverse input order. A confidence mask thus can be generated to measure the similarity between predictions at inference time. We then perform a linear regression upon selected flows with high confidence to estimate the parameters of underlying affine transformation. Under large transformations, our proposed method achieves an average registration error of less than three pixels on the public OS dataset and Wuhan University-optical (WHU-OPT)-SAR dataset, demonstrating superior accuracy and robustness compared to state-of-the-art methods.
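Once confident flow vectors are selected, the affine parameters follow from an ordinary least-squares fit. The minimal sketch below uses a synthetic affine transform and a crude confidence mask (not the paper's symmetry-based mask) to illustrate that regression step.

```python
import numpy as np

def fit_affine_from_flow(pts, flow, conf, thresh=0.8):
    """Fit a 2x3 affine transform from semi-dense flow, keeping only flow
    vectors whose confidence exceeds thresh."""
    keep = conf > thresh
    src = pts[keep]
    dst = src + flow[keep]
    A = np.hstack([src, np.ones((len(src), 1))])       # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(A, dst, rcond=None)   # (3, 2): one column per output coord
    return params.T                                    # 2x3 affine matrix

# Synthetic flows generated by a known affine, with a few low-confidence outliers.
M_true = np.array([[1.02, 0.01, 3.0], [-0.02, 0.99, -5.0]])
pts = np.random.uniform(0, 512, size=(200, 2))
dst = pts @ M_true[:, :2].T + M_true[:, 2]
flow = dst - pts
conf = np.ones(200); conf[:20] = 0.1; flow[:20] += 30.0   # corrupt the low-confidence flows
print(fit_affine_from_flow(pts, flow, conf))              # ~ M_true
```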
This study presents a radar-optical fusion detection method for unmanned aerial vehicles (UAVs) in maritime environments. Radar and camera technologies are integrated to improve the detection capabilities of the platforms. The proposed method generates regions of interest (ROIs) by projecting radar traces onto optical images through matrix transformation and geometric centroid registration. The generated ROIs are matched with YOLO detection boxes using the intersection-over-union (IoU) algorithm, enabling radar-optical fusion detection. A modified algorithm, called SPN-YOLOv7-tiny, is developed to address the challenge of detecting small UAV targets that are easily missed in images. In this algorithm, the convolutional layers in the backbone network are replaced with a space-to-depth convolution, and a small object detection layer is added. In addition, the loss function is replaced with a normalized weighted distance loss function. Experimental results demonstrate that, compared to the original YOLOv7-tiny method, SPN-YOLOv7-tiny improves mAP@0.5 (mean average precision at an IoU threshold of 0.5) from 0.852 to 0.93 while maintaining a high frame rate of 135.1 frames per second. Moreover, the proposed radar-optical fusion detection method achieves an accuracy of 96.98%, surpassing the individual detection results of the radar and camera. The proposed method effectively addresses the detection challenges posed by closely spaced overlapping targets on a radar chart.
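The projection-plus-IoU association described above reduces to two small utilities, sketched below under the assumption that the radar-to-camera extrinsics and the camera intrinsics are known; parameter names are illustrative.

```python
# Project radar detections into the image plane and match the resulting ROIs to
# detector boxes by intersection over union.
import numpy as np

def project_radar_points(points_radar: np.ndarray, T_cam_radar: np.ndarray,
                         K: np.ndarray) -> np.ndarray:
    """points_radar: (N, 3) in the radar frame; T_cam_radar: 4x4 extrinsics; K: 3x3 intrinsics."""
    pts_h = np.hstack([points_radar, np.ones((len(points_radar), 1))])
    pts_cam = (T_cam_radar @ pts_h.T)[:3]          # (3, N) points in the camera frame
    uv = K @ pts_cam                               # perspective projection
    return (uv[:2] / uv[2]).T                      # (N, 2) pixel coordinates

def iou(box_a, box_b) -> float:
    """Boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```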
No abstract available
Despite an emerging interest in MIMO radar, the utilization of its complementary strengths in combination with optical depth sensors has so far been limited to far-field applications, due to the challenges that arise from mutual sensor calibration in the near field. In fact, most related approaches in the autonomous industry propose target-based calibration methods using corner reflectors that have proven to be unsuitable for the near field. In contrast, we propose a novel, joint calibration approach for optical RGB-D sensors and MIMO radars that is designed to operate in the radar’s near-field range, within decimeters from the sensors. Our pipeline consists of a bespoke calibration target, allowing for automatic target detection and localization, followed by the spatial calibration of the two sensor coordinate systems through target registration. We validate our approach using two different depth sensing technologies from the optical domain. The experiments show the efficiency and accuracy of our calibration for various target displacements, as well as the robustness of our localization with respect to signal ambiguities.
Radar-based machine learning pipelines require extensive annotated datasets. However, producing large volumes of precise labels remains prohibitively laborious and prone to inconsistency, as radar signals lack a direct visual correspondence. To address this limitation, we introduce a fully automated, multi-modal annotation pipeline built around our custom RadarBox that co-registers a FMCW MIMO radar with an Azure Kinect RGB-D camera. Precise spatial calibration and hardware-level synchronization yield exact pixel-to-radar alignment. RGB images undergo panoptic segmentation to generate per-pixel human masks, which are fused with depth measurements to reconstruct a voxelized surface mesh. We extract 3D joint positions from the Kinect Body Tracking SDK and apply a bidirectional Kalman filter to derive precise per-joint positions and velocity vectors free from sudden, non-physiological fluctuations. These enhanced labels are projected into 5D radar cube slices and target lists through robust spatio-temporal association. As a demonstration, we train a deep neural network on annotated radar target lists for indoor people localization, achieving a mean positional error of 0.31 m and 91.8% occupancy accuracy, even under occlusion. Unlike prior semi-automatic or heuristic-based methods, our approach delivers consistent 5D labels at scale, bridging spatial, temporal, and Doppler dimensions, and thus paves the way for large-scale, learning-based radar sensing in human-centered applications.
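The bidirectional filtering step can be illustrated with a minimal forward-backward (RTS-smoothed) constant-velocity Kalman filter for a single joint coordinate; this is a generic stand-in for the pipeline's bidirectional filter, and the noise parameters `q` and `r` are assumptions.

```python
# Constant-velocity Kalman filter with a Rauch-Tung-Striebel backward pass (1-D sketch).
import numpy as np

def rts_smooth_1d(z: np.ndarray, dt: float, q: float = 1e-2, r: float = 1e-3):
    F = np.array([[1.0, dt], [0.0, 1.0]])                 # state: [position, velocity]
    H = np.array([[1.0, 0.0]])                            # only position is measured
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    R = np.array([[r]])
    n = len(z)
    x_f = np.zeros((n, 2)); P_f = np.zeros((n, 2, 2))
    x_p = np.zeros((n, 2)); P_p = np.zeros((n, 2, 2))
    x, P = np.array([z[0], 0.0]), np.eye(2)
    for k in range(n):                                     # forward filtering pass
        x_p[k], P_p[k] = F @ x, F @ P @ F.T + Q
        S = H @ P_p[k] @ H.T + R
        K = P_p[k] @ H.T @ np.linalg.inv(S)
        x = x_p[k] + (K @ (z[k] - H @ x_p[k])).ravel()
        P = (np.eye(2) - K @ H) @ P_p[k]
        x_f[k], P_f[k] = x, P
    x_s = x_f.copy()
    for k in range(n - 2, -1, -1):                         # backward smoothing pass
        G = P_f[k] @ F.T @ np.linalg.inv(P_p[k + 1])
        x_s[k] = x_f[k] + G @ (x_s[k + 1] - x_p[k + 1])
    return x_s[:, 0], x_s[:, 1]                            # smoothed positions and velocities
```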
Toward sensor fusion between 360° cameras and LiDAR sensors, this study proposes a spatio-temporal calibration method that uses the trajectory of an observed object in the local frame of each sensor and automatically aligns the trajectories through loss-minimization optimization. Dynamic object observations can vary in resolution depending on the sensor quality (e.g., sparseness of point clouds from few-beam LiDARs or radar) and/or position relative to the sensor system (e.g., infrastructure-based sensing). Under low-resolution observations, it can be challenging to calibrate sensors either by traditional known-target methods or by recent feature-based methods. Taking the spatial observations from a 16-ring LiDAR as ground truth, the presented calibration methodology achieves a mean average error of about 7 cm when trajectories are spatio-temporally aligned, despite the sparseness of the 16-ring LiDAR.
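A simple way to realize this kind of trajectory-based spatio-temporal alignment, sketched below under the assumptions of 3-D trajectories with increasing timestamps and a grid search over candidate time offsets (the authors' optimizer may differ), is to resample one trajectory at shifted timestamps and fit a rigid transform with the Kabsch algorithm:

```python
# Joint estimation of a time offset and a rigid transform between two object trajectories.
import numpy as np

def kabsch(A: np.ndarray, B: np.ndarray):
    """Rigid transform (R, t) minimizing ||R @ a_i + t - b_i|| for paired points."""
    ca, cb = A.mean(0), B.mean(0)
    U, _, Vt = np.linalg.svd((A - ca).T @ (B - cb))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cb - R @ ca

def align_trajectories(t_a, traj_a, t_b, traj_b, offsets):
    """traj_a: (Na, 3) at times t_a; traj_b: (Nb, 3) at times t_b; offsets: candidate time shifts."""
    best = (np.inf, None, None, None)
    for dt in offsets:
        resampled = np.column_stack(
            [np.interp(t_b + dt, t_a, traj_a[:, i]) for i in range(3)])
        R, t = kabsch(resampled, traj_b)
        rmse = np.sqrt(((resampled @ R.T + t - traj_b) ** 2).sum(1).mean())
        if rmse < best[0]:
            best = (rmse, dt, R, t)
    return best  # (alignment error, time offset, rotation, translation)
```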
Autonomous vehicles must operate reliably to prevent collisions while driving through dynamic and uncertain conditions. This paper introduces DriveSafeCore, an end-to-end high-precision system for collision avoidance that combines multimodal sensor fusion, spatio-temporal modeling, risk assessment, and reinforcement-based policy learning. Real-time environmental awareness is achieved by jointly processing LiDAR, radar, and RGB camera data with Kalman filtering and time-synchronized early fusion. Scene understanding is provided by a 3D U-Net with appended ConvLSTM modules, while a graph-based recurrent neural network predicts the motion of multiple agents. Risk is evaluated through Transformer-based time-to-collision (TTC) forecasts, with Monte Carlo Dropout providing uncertainty estimation. Decision-making is handled by a Deep Q-Network augmented with emergency override layers for safety supervision. The implemented system runs in real time, achieving 98.56% collision detection accuracy, a 1.23% false positive rate, a 42.3 ms inference time, and a throughput of 33.5 FPS. Real-world evaluations show that DriveSafeCore improves safety in hazardous situations, including sudden pedestrian encounters and intercepting vehicles, setting new performance benchmarks for autonomous vehicle safety.
Remote photoplethysmography (rPPG) enables noncontact heart rate (HR) estimation from facial videos. Despite recent advances, single-modality methods remain vulnerable to motion, illumination changes, and modality-specific degradations. We address these limitations with a multimodal framework that explicitly leverages complementary RGB and infrared (IR, thermal or NIR) streams. Built on a 3-D SwiftFormer backbone, the method integrates three modules: 1) a context-aware temporal difference convolution (CTDC) that amplifies motion-sensitive cues via multiscale temporal differencing; 2) a bidirectional cross-attention (BCA) that enables hierarchical information exchange between modalities; and 3) a cross-modal gating fusion (CMGF) that adaptively combines features using a temperature-scaled logit-difference gate. Training is guided by a hybrid objective over time and frequency, augmented with a scheduled soft-DTW alignment term. Extensive experiments on two public datasets demonstrate consistent improvements over state-of-the-art baselines, with ablation studies confirming the contributions of CTDC, BCA, CMGF, and soft-DTW. These results highlight the effectiveness of explicit cross-modal interaction and adaptive fusion for robust, accurate remote HR (rHR) estimation.
Most existing RGB-Event trackers rely on strictly aligned datasets, overlooking the asynchronous spatio-temporal resolutions common in real-world scenarios. This methodological limitation impedes effective RGB-Event feature alignment and ultimately degrades tracking performance. To overcome this limitation, we propose AlignTrack, a novel tracking framework built upon a Top-Down Alignment (TDA) strategy inspired by the human visual system. Our TDA framework follows an encode-decode-align paradigm: it first encodes multimodal features to generate target-related priors, which are then progressively decoded to guide a subsequent feature alignment pass. Within this framework, we introduce two key innovations: (1) a Cross-Prior Attention (CPA) module that effectively generates and integrates cross-modal priors, and (2) a Cross-Modal Semantic Alignment (CSA) loss that maximizes mutual information to enforce semantic consistency between modalities. Extensive experiments show that AlignTrack achieves state-of-the-art performance on four challenging RGB-Event tracking benchmarks, demonstrating its robustness in both aligned and unaligned scenarios. Ablation studies further validate the significant contribution of each proposed component.
Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment, and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
Traditional methods for human localization and pose estimation (HPE), which mainly rely on RGB images as an input modality, confront substantial limitations in real-world applications due to privacy concerns. In contrast, radar-based HPE methods emerge as a promising alternative, characterized by distinctive attributes such as through-wall recognition and privacy-preserving, rendering the method more conducive to practical deployments. This paper presents a Radar Tensor-based human pose (RT-Pose) dataset and an open-source benchmarking framework. The RT-Pose dataset comprises 4D radar tensors, LiDAR point clouds, and RGB images, and is collected for a total of 72k frames across 240 sequences with six different complexity-level actions. The 4D radar tensor provides raw spatio-temporal information, differentiating it from other radar point cloud-based datasets. We develop an annotation process using RGB images and LiDAR point clouds to accurately label 3D human skeletons. In addition, we propose HRRadarPose, the first single-stage architecture that extracts the high-resolution representation of 4D radar tensors in 3D space to aid human keypoint estimation. HRRadarPose outperforms previous radar-based HPE work on the RT-Pose benchmark. The overall HRRadarPose performance on the RT-Pose dataset, as reflected in a mean per joint position error (MPJPE) of 9.91cm, indicates the persistent challenges in achieving accurate HPE in complex real-world scenarios. RT-Pose is available at https://huggingface.co/datasets/uwipl/RT-Pose.
Because of societal changes, human activity recognition, part of home care systems, has become increasingly important. Camera-based recognition is mainstream but has privacy concerns and is less accurate under dim lighting. In contrast, radar sensors do not record sensitive information, avoid the invasion of privacy, and work in poor lighting. However, the collected data are often sparse. To address this issue, we propose a novel Multimodal Two-stream GNN Framework for Efficient Point Cloud and Skeleton Data Alignment (MTGEA), which improves recognition accuracy through accurate skeletal features from Kinect models. We first collected two datasets using the mmWave radar and Kinect v4 sensors. Then, we used zero-padding, Gaussian Noise (GN), and Agglomerative Hierarchical Clustering (AHC) to increase the number of collected point clouds to 25 per frame to match the skeleton data. Second, we used Spatial Temporal Graph Convolutional Network (ST-GCN) architecture to acquire multimodal representations in the spatio-temporal domain focusing on skeletal features. Finally, we implemented an attention mechanism aligning the two multimodal features to capture the correlation between point clouds and skeleton data. The resulting model was evaluated empirically on human activity data and shown to improve human activity recognition with radar data only. All datasets and codes are available in our GitHub.
The intelligent transportation system (ITS) is inseparable from people’s lives, and the development of artificial intelligence has made intelligent video surveillance systems more widely used. In practical traffic scenarios, the detection and tracking of vehicle targets is an important core aspect of intelligent surveillance systems and has become a hot topic of research today. However, in practical applications, there is a wide variety of targets and often interference factors such as occlusion, while a single sensor is unable to collect a wealth of information. In this paper, we propose an improved data matching method to fuse the video information obtained from the camera with the millimetre-wave radar information for the alignment and correlation of multi-target data in the spatial dimension, in order to address the problem of poor recognition alignment caused by mutual occlusion between vehicles and external environmental disturbances in intelligent transportation systems. The spatio-temporal alignment of the two sensors is first performed to determine the conversion relationship between the radar and pixel coordinate systems, and the calibration on the timeline is performed by Lagrangian interpolation. An improved Hausdorff distance matching algorithm is proposed for the data dimension to calculate the similarity between the data collected by the two sensors, to determine whether they are state descriptions of the same target, and to match the data with high similarity to delineate the region of interest (ROI) for target vehicle detection.
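Two of the spatio-temporal alignment ingredients named above, Lagrange interpolation of a radar track onto camera timestamps and a Hausdorff-style distance between the two sensors' measurements, can be sketched directly; the mean-based "modified" variant below is one common choice and is an assumption, not necessarily the paper's exact formulation.

```python
# Lagrange interpolation for timeline calibration and a modified Hausdorff distance
# for cross-sensor similarity.
import numpy as np

def lagrange_interp(t_query: float, t_samples: np.ndarray, values: np.ndarray) -> float:
    """Classic Lagrange interpolation through the nodes (t_samples, values)."""
    result = 0.0
    for i, (ti, vi) in enumerate(zip(t_samples, values)):
        w = np.prod([(t_query - tj) / (ti - tj)
                     for j, tj in enumerate(t_samples) if j != i])
        result += vi * w
    return result

def modified_hausdorff(A: np.ndarray, B: np.ndarray) -> float:
    """Mean directed nearest-neighbour distances, symmetrized by max; A, B: (N, d) point sets."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(d.min(axis=1).mean(), d.min(axis=0).mean())
```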
There has been exciting recent progress in using radar as a sensor for robot navigation due to its increased robustness to varying environmental conditions. However, within these different radar perception systems, ground penetrating radar (GPR) remains under-explored. By measuring structures beneath the ground, GPR can provide stable features that are less variant to ambient weather, scene, and lighting changes, making it a compelling choice for long-term spatio-temporal mapping. In this work, we present the CMU-GPR dataset--an open-source ground penetrating radar dataset for research in subsurface-aided perception for robot navigation. In total, the dataset contains 15 distinct trajectory sequences in 3 GPS-denied, indoor environments. Measurements from a GPR, wheel encoder, RGB camera, and inertial measurement unit were collected with ground truth positions from a robotic total station. In addition to the dataset, we also provide utility code to convert raw GPR data into processed images. This paper describes our recording platform, the data format, utility scripts, and proposed methods for using this data.
Real-time coarse-to-fine image segmentation is essential for improving dynamic object detection in autonomous vehicles so that they can navigate safely in diverse environments. We propose an Adaptive Multi-Modal Vision Transformer (AMVT) framework that combines hybrid CNN + Transformer fusion, multi-modal sensor fusion, and a variety of spatio-temporal and intensity attention mechanisms to achieve high-precision segmentation under varying conditions. AMVT adapts its segmentation using dynamic feedback from the environment in low light, fog, occlusion, and similar conditions. The Adaptive Real-Time Object Refinement (AROR) module uses edge-enhanced LiDAR data to improve object boundaries, while the Self-Adaptive Sensor Fusion Network (SASFN) dynamically weights RGB, LiDAR, and radar inputs to maximize detection accuracy. For low-latency inference, the model is optimized for Edge AI, achieving under 10 ms per frame for segmentation with no loss of accuracy. Evaluated on new benchmark datasets, it achieves over 90% mean Intersection over Union (mIoU), surpassing current deep learning-based methods for dynamic object detection. A Meta-Learning Adversarial Defense (MLAD) module further mitigates real-world adversarial attacks. Overall, this research offers a scalable, efficient, and resilient segmentation framework for next-generation autonomous driving systems that addresses key problems in perception and decision-making. Future work will extend AMVT to extreme environmental conditions and explore reinforcement learning of adaptive driving policies.
Aiming at the problem of multi-target track correlation caused by system deviation, noise interference and dimension mismatch in multi-sensor systems, this paper proposes a high-accuracy track correlation method for missing-dimensional targets. The method constructs training samples by performing time alignment and dimensional unification on data from different sensors, then applies Kalman filtering for noise reduction. A dual-branch spatio-temporal feature extraction network is designed: one branch employs a Long Short-Term Memory-based model to extract historical trajectory information features, while the other uses a Graph Convolutional Network-based model to extract spatial topological features. Features extracted from these two branches are then fused and concatenated to form the spatio-temporal features of the trajectory. Finally, spatiotemporal features from radar and electronic warfare tracks are input into a multi-layer perceptron-based similarity calculation module. Combined with the Hungarian algorithm, this achieves global optimal matching. Experimental results on simulation datasets demonstrate that this method effectively improves track association accuracy and stability. Ablation experiments further validate the effectiveness of each module. This research not only provides a novel solution for associating tracks of missing-dimensional target trajectories but also offers valuable insights for future related studies.
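The final association step described above, turning a learned similarity matrix into a globally optimal matching with the Hungarian algorithm, is a standard operation; a minimal sketch using SciPy follows, with the gating threshold as an assumption.

```python
# Global track association from a similarity matrix via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_tracks(similarity: np.ndarray, min_similarity: float = 0.5):
    """similarity[i, j]: predicted similarity of radar track i and electronic-warfare track j."""
    row, col = linear_sum_assignment(-similarity)          # negate to maximize total similarity
    return [(i, j) for i, j in zip(row, col) if similarity[i, j] >= min_similarity]
```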
Multisource data fusion has emerged as a pivotal technique for overcoming the inherent limitations of single-modality ship tracking systems in maritime environments. This paper proposes a robust ship trajectory monitoring framework. The framework integrates AIS, radar, and vision sensor streams using advanced spatio-temporal alignment strategies and dynamic sensor reliability weighting. The technique also uses density-aware clustering for real-time trajectory classification and adaptive noise cancellation through an improved Kalman filter. Large-scale experimental evaluation on the Singapore Strait Integrated Annotation Dataset shows that the accuracy of trajectory reconstruction and the reliability of anomaly detection are greatly improved. The system achieves sub-second end-to-end processing latency through hardware-accelerated computation while preserving detection performance. Through systematic analysis of the trade-off between latency and accuracy, the suitability of the framework for real-time maritime surveillance in busy traffic and adverse environmental conditions is further demonstrated. These findings lay a new technical foundation for intelligent ship trajectory monitoring and contribute to regulatory compliance and risk mitigation in congested waters.
This paper presents the ShanghaiTech Mapping Robot, a state-of-the-art unmanned ground vehicle (UGV) designed for collecting comprehensive multi-sensor datasets to support research in robotics, Simultaneous Localization and Mapping (SLAM), computer vision, and autonomous driving. The robot is equipped with a wide array of sensors including RGB cameras, RGB-D cameras, event-based cameras, IR cameras, LiDARs, mmWave radars, IMUs, ultrasonic range finders, and a GNSS RTK receiver. The sensor suite is integrated onto a specially designed mechanical structure with a centralized power system and a synchronization mechanism to ensure spatial and temporal alignment of the sensor data. A 16-node on-board computing cluster handles sensor control, data collection, and storage. We describe the hardware and software architecture of the robot in detail and discuss the calibration procedures for the various sensors and investigate the interference for LiDAR and RGB-D sensors. The capabilities of the platform are demonstrated through an extensive outdoor dataset collected in a diverse campus environment. Experiments with two LiDAR-based and two RGB-based SLAM approaches showcase the potential of the dataset to support development and benchmarking for robotics. To facilitate research, we make the dataset publicly available along with the associated robot sensor calibration data: https://slam-hive.net/wiki/ShanghaiTech_Datasets
Gait recognition is an emerging biometric technology that enables non-intrusive and hard-to-spoof human identification. However, most existing methods are confined to short-range, unimodal settings and fail to generalize to long-range and cross-distance scenarios under real-world conditions. To address this gap, we present LRGait, the first LiDAR-Camera multimodal benchmark designed for robust long-range gait recognition across diverse outdoor distances and environments. We further propose EMGaitNet, an end-to-end framework tailored for long-range multimodal gait recognition. To bridge the modality gap between RGB images and point clouds, we introduce a semantic-guided fusion pipeline. A CLIP-based Semantic Mining (SeMi) module first extracts human body-part-aware semantic cues, which are then employed to align 2D and 3D features via a Semantic-Guided Alignment (SGA) module within a unified embedding space. A Symmetric Cross-Attention Fusion (SCAF) module hierarchically integrates visual contours and 3D geometric features, and a Spatio-Temporal (ST) module captures global gait dynamics. Extensive experiments on various gait datasets validate the effectiveness of our method.
Accurate image registration is critical for lunar exploration, enabling surface mapping, resource localization, and mission planning. Aligning data from diverse lunar sensors -- optical (e.g., Orbital High Resolution Camera, Narrow and Wide Angle Cameras), hyperspectral (Imaging Infrared Spectrometer), and radar (e.g., Dual-Frequency Synthetic Aperture Radar, Selene/Kaguya mission) -- is challenging due to differences in resolution, illumination, and sensor distortion. We evaluate five feature matching algorithms: SIFT, ASIFT, AKAZE, RIFT2, and SuperGlue (a deep learning-based matcher), using cross-modality image pairs from equatorial and polar regions. A preprocessing pipeline is proposed, including georeferencing, resolution alignment, intensity normalization, and enhancements like adaptive histogram equalization, principal component analysis, and shadow correction. SuperGlue consistently yields the lowest root mean square error and fastest runtimes. Classical methods such as SIFT and AKAZE perform well near the equator but degrade under polar lighting. The results highlight the importance of preprocessing and learning-based approaches for robust lunar image registration across diverse conditions.
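The classical branch of the evaluated pipeline (contrast enhancement, SIFT matching with a ratio test, RANSAC model fitting) can be sketched with OpenCV, assuming OpenCV >= 4.4 where SIFT is available and 8-bit grayscale inputs; thresholds are typical defaults, not the paper's tuned values.

```python
# CLAHE preprocessing + SIFT matching + RANSAC homography between two image tiles.
import cv2
import numpy as np

def register_pair(img_a: np.ndarray, img_b: np.ndarray):
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    a, b = clahe.apply(img_a), clahe.apply(img_b)          # adaptive histogram equalization
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(a, None)
    kp_b, des_b = sift.detectAndCompute(b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des_a, des_b, k=2)
            if m.distance < 0.75 * n.distance]             # Lowe's ratio test
    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H, int(inliers.sum()) if inliers is not None else 0
```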
Autonomous driving holds great promise in addressing traffic safety concerns by leveraging artificial intelligence and sensor technology. Multi-Object Tracking plays a critical role in ensuring safer and more efficient navigation through complex traffic scenarios. This paper presents a novel deep learning-based method that integrates radar and camera data to enhance the accuracy and robustness of Multi-Object Tracking in autonomous driving systems. The proposed method leverages a Bi-directional Long Short-Term Memory network to incorporate long-term temporal information and improve motion prediction. An appearance feature model inspired by FaceNet is used to establish associations between objects across different frames, ensuring consistent tracking. A tri-output mechanism is employed, consisting of individual outputs for radar and camera sensors and a fusion output, to provide robustness against sensor failures and produce accurate tracking results. Through extensive evaluations of real-world datasets, our approach demonstrates remarkable improvements in tracking accuracy, ensuring reliable performance even in low-visibility scenarios.
Autonomous vehicles should be capable of operating in all types of weather conditions. Drivable road region detection is a core component of the perception stack of self-driving vehicles. Current approaches for detecting road regions perform well in good weather but lack in inclement weather conditions. In this paper, we examine the effect of inclement weather on the camera-based state-of-the-art deep learning approaches and introduce a new camera and automotive radar-based multimodal deep learning model to efficiently detect drivable road regions in all weather conditions. We also propose a novel approach to overcome the sparse resolution problem of automotive radars and a way to effectively use it in higher precision tasks such as image segmentation. To validate our work, we have augmented the nuScenes data with rain and fog to add challenging weather conditions. Experimental results show that the performance of the state-of-the-art techniques drops 18% in bad weather conditions while our proposed method improves the performance by 12% compared to the state-of-the-art.
This paper proposes a contrastive learning-based pretraining method that uses millimeter-wave radar signals for pose estimation. By integrating camera and radar data during model pretraining, we first apply the K-means clustering algorithm to divide the data into distinct clusters based on their spatial distribution. We then propose a contrastive camera-radar-image pretraining (CCRP) technique that performs contrastive learning on camera-derived coordinates and the radar signal heatmaps of the associated cluster. The trained radar heatmap encoder, based on a vision transformer capable of extracting highly distinctive features, reduces the training difficulty of the subsequent pose estimation network and improves performance after fine-tuning. On the HIBER dataset, the proposed CCRP achieves leading results, with a 31% performance improvement over other unsupervised pretraining methods.
No abstract available
Sensors such as cameras, lidars, and radars are crucial to understanding driving situations in autonomous vehicles. These sensors are susceptible to external and internal abnormalities, potentially leading to severe traffic accidents. A radar sensor is inevitably affected by obstruction caused by small objects, which can cause the system to malfunction. This paper presents a deep learning approach for detecting anomalies in radar data. The accuracy of anomaly detection is improved by using radar-camera fusion. Our proposed model detects data anomalies by calculating the deviation from the standard radar cross section (RCS) range. The results demonstrate that the model is capable of distinguishing the normal range of radar signals from anomalous signals across several different feature conditions. It enables the detection of potential hazards and warns drivers and higher-level control systems of dangers, creating a more resilient environment for ensuring autonomous driving safety.
Multisensor information fusion technology has been widely used in the perception of unmanned aerial vehicle environments. However, the perception accuracy needs to be improved in practice since multiple sensors have consistency limitations and fused data have limited utility. A deep-learning method based on the multistage fusion of millimeter-wave radar and camera is proposed in this article. In the data preprocessing stage, the radar reflection point and image pixels are fused in a Gaussian-weighted way to obtain the salient image. The salient density map of each pixel relative to the radar reflection point is calculated. Then, the threshold is set to segment the salient density map to complete visual target detection. In the detection stage based on deep learning, a network structure is designed to fuse the salient image and visual target detection images at different convolution depths. The classification, location, and size of targets are regressed by training. In the postdecoding stage, the radar reflection point is fused for local nonmaximum suppression. The nonmaximum suppression operation is started from the radar reflection point. Different from typical detection methods, the proposed method improves detection accuracy by fusing the feature information of the radar and camera in a multistage process. The experimental results demonstrated that mAP0.90 increased by 3.9% and 4.3%. For complex scenarios, mAP0.50 improved by 2.4%, mAP0.75 improved by 4.9%, and mAP0.90 improved by 6.9%, indicating that the proposed method is effective compared with the state-of-the-art model (YOLOv8).
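The salient-map construction in the preprocessing stage, a Gaussian-weighted density image built around projected radar reflection points, can be sketched as follows; the Gaussian width `sigma` and the segmentation threshold are assumptions, and the radar points are assumed to be already projected to pixel coordinates.

```python
# Build a Gaussian-weighted salient density map from projected radar reflection points.
import numpy as np

def salient_density_map(points_uv: np.ndarray, shape: tuple, sigma: float = 20.0) -> np.ndarray:
    """points_uv: (N, 2) radar points in pixel coordinates; shape: (H, W) of the image."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    density = np.zeros((H, W))
    for u, v in points_uv:
        density += np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
    return density / (density.max() + 1e-8)    # normalized salient map

# e.g. a threshold on the map segments candidate target regions:
# mask = salient_density_map(uv, image.shape[:2]) > 0.3
```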
Autonomous Vehicles (AVs) require a highly accurate perception system to reduce the likelihood of road accidents, which are most commonly caused by unrecognized targets, human mistakes, and other avoidable factors. This is achieved through modern AV sensors such as cameras, radars, and LiDARs. The millimeter-wave radar is one of the most commonly used sensors in the automotive industry and traffic control applications because of its high performance. However, these sensors are expensive and prone to false alarms. A recent approach uses object detection and classification algorithms together with a car-mounted camera to address this issue. Fusing camera and radar measurements provides a much more efficient detection system. In this paper, we introduce a more robust approach that fuses camera and radar outputs using neural networks and provides reliable accuracy even for low-quality radar readings. In our approach, we use only the box-size predictions (box height and box width) of YOLO-v4, together with simulated noisy radar readings, to classify car types. The proposed method can learn to improve object detection from radar measurements and, furthermore, classify car types with 60.0% accuracy when 10% noise is present in the radar readings. Our results show that it is possible to use cheaper radar sensors, along with a budget camera, and still provide predictions of car types.
Object detection in camera images using deep learning has proven successful in recent years. Rising detection rates and computationally efficient network structures are pushing this technique toward application in production vehicles. Nevertheless, camera sensor quality is limited in severe weather conditions and by increased sensor noise in sparsely lit areas and at night. Our approach enhances current 2D object detection networks by fusing camera data and projected sparse radar data in the network layers. The proposed CameraRadarFusionNet (CRF-Net) automatically learns at which level the fusion of the sensor data is most beneficial for the detection result. Additionally, we introduce BlackIn, a training strategy inspired by Dropout, which focuses the learning on a specific sensor type. We show that the fusion network is able to outperform a state-of-the-art image-only network on two different datasets. The code for this research will be made available to the public at: https://github.com/TUMFTM/CameraRadarFusionNet
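A rough, framework-agnostic illustration of a BlackIn-style augmentation (the exact mechanism in the paper may differ; the channel layout and blackout rate here are assumptions) is to zero out the camera channels of the fused input with some probability so the network must rely on the projected radar channels:

```python
# BlackIn-style sensor dropout on a fused (camera + radar) input tensor.
import numpy as np

def blackin(sample: np.ndarray, camera_channels: slice = slice(0, 3),
            p_blackout: float = 0.2, rng=np.random) -> np.ndarray:
    """sample: (C, H, W) stack of camera channels followed by projected radar channels."""
    out = sample.copy()
    if rng.random() < p_blackout:
        out[camera_channels] = 0.0   # black out the camera, keep the radar channels
    return out
```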
For safe driving, it is essential to accept reliable information from recognition sensors. In this paper, we present a deep learning model that classifies whether radar signals coming in are normal or abnormal. The abnormal signal is defined as noise from the radar and all signals received when the radar fails or is in trouble. It is difficult to determine whether reflected signals are normal or not based only on radar data. Therefore, the camera and radar sensors are used together, considering the radar cross section (RCS) distribution varies by the angle and distance of the object. The proposed model uses data received from camera and radar sensors to determine the normality of object signals. The model shows an accuracy of 96.24%. Through the results of this study, the reliability of radar signals can be determined in the actual driving environment, thereby ensuring the safety of vehicles and pedestrians.
No abstract available
This work proposes a first-of-its-kind SLAM architecture fusing an event-based camera and a Frequency Modulated Continuous Wave (FMCW) radar for drone navigation. Each sensor is processed by a bio-inspired Spiking Neural Network (SNN) with continual Spike-Timing-Dependent Plasticity (STDP) learning, as observed in the brain. In contrast to most learning-based SLAM systems, our method does not require any offline training phase, but rather the SNN continuously learns features from the input data on the fly via STDP. At the same time, the SNN outputs are used as feature descriptors for loop closure detection and map correction. We conduct numerous experiments to benchmark our system against state-of-the-art RGB methods and we demonstrate the robustness of our DVS-Radar SLAM approach under strong lighting variations.
Skeletal detection-based analysis of human behavior is of significant value for health monitoring. This study leverages 4-D millimeter-wave (mmWave) radar technology to conduct continuous indoor skeletal analysis. Initially, we introduce the radar-visual human activity dataset (RVHAD), an extensive benchmark comprising 240000 radar frames that capture various human actions, including standing, sitting, and falling. In addition, we propose a fully automated frame correlation labeling technique capable of annotating radar frames autonomously, even in instances of visual system failure. Subsequently, we develop the spatiotemporal constrained human skeletal analysis network (STC-HSANet). This network employs a 3-D Siamese plain pyramid network (3D-SPPN) to generate multilevel collaborative features, integrates salient features through a context information interaction module (CIIM), and refines the decoded keypoint locations using a positional modulation strategy (PMS). Our experimental results demonstrate that STC-HSANet surpasses current state-of-the-art methods, offering robust performance even under conditions of visual impairment. Our code and dataset can be found at: https://github.com/zylofor/STC-HSANet.
Semantic scene segmentation from a bird’s-eye-view (BEV) perspective plays a crucial role in facilitating planning and decision-making for mobile robots. Although recent vision-only methods have demonstrated notable advancements in performance, they often struggle under adverse illumination conditions such as rain or nighttime. While active sensors offer a solution to this challenge, the prohibitively high cost of LiDARs remains a limiting factor. Fusing camera data with automotive radars poses a more inexpensive alternative but has received less attention in prior research. In this work, we aim to advance this promising avenue by introducing BEVCar, a novel approach for joint BEV object and map segmentation. The core novelty of our approach lies in first learning a point-based encoding of raw radar data, which is then leveraged to efficiently initialize the lifting of image features into the BEV space. We perform extensive experiments on the nuScenes dataset and demonstrate that BEVCar outperforms the current state of the art. Moreover, we show that incorporating radar information significantly enhances robustness in challenging environmental conditions and improves segmentation performance for distant objects. To foster future research, we provide the weather split of the nuScenes dataset used in our experiments, along with our code and trained models at http://bevcar.cs.uni-freiburg.de.
With the rapid development of deep learning technology, traditional single sensors have certain limitations in the detection of moving vehicles. To solve this problem, this paper proposes a vehicle detection method based on the fusion of radar and camera information. Specifically, first, the data collected by the radar and camera are processed to extract the desired areas; then features are extracted from these areas, and the D-S evidence theory is used to detect and identify the targets, thereby obtaining accurate vehicle detection results. Finally, we verified and compared the proposed fusion model on the public dataset KITTI. The experimental results show that this method performs well in terms of detection performance and accuracy.
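The Dempster-Shafer (D-S) combination step referenced above follows Dempster's rule; the sketch below shows the rule for two mass functions over a shared frame of discernment, with purely illustrative mass values for the radar and camera evidence.

```python
# Dempster's rule of combination for two mass functions; focal elements are frozensets.
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb                     # mass assigned to contradictory evidence
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# illustrative usage: radar and camera evidence over {vehicle, background}
m_radar = {frozenset({"vehicle"}): 0.7, frozenset({"vehicle", "background"}): 0.3}
m_cam   = {frozenset({"vehicle"}): 0.6, frozenset({"vehicle", "background"}): 0.4}
fused = dempster_combine(m_radar, m_cam)
```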
No abstract available
Environment perception using camera, radar, and/or lidar sensors has significantly improved in the last few years because of deep learning-based methods. However, a large group of these methods fall into the category of supervised learning, which requires a considerable amount of annotated data. Due to uncertainties in multi-sensor data, automating the data labeling process is extremely challenging; hence, it is performed manually to a large extent. Even though full automation of such a process is difficult, semi-automation can be a significant step to ease this process. However, the available work in this regard is still very limited; hence, in this paper, a novel semi-automatic annotation methodology is developed for labeling RGB camera images and 3D automotive radar point cloud data using a smart infrastructure-based sensor setup. This paper also describes a new method for 3D radar background subtraction to remove clutter and a new object category, GROUP, for radar-based object detection for closely located vulnerable road users. To validate the work, a dataset named INFRA-3DRC is created using this methodology, where 75 % of the labels are automatically generated. In addition, a radar cluster classifier and an image classifier are developed, trained, and tested on this dataset, achieving accuracy of 98.26% and 94.86%, respectively. The dataset and Python scripts are available at https://fraunhoferivi.github.io/INFRA-3DRC-Dataset/.
Autonomous driving has driven the evolution of multimodal sensor fusion systems because of the need for safety, reliability, and real-time environmental awareness. This study proposes FusionNet, a deep learning-based visual perception framework with a transformer-enabled intermediate fusion approach that combines RGB camera, LiDAR, and radar data. In contrast to classic early or late fusion techniques, FusionNet uses modality-specific encoders and cross-attention layers to mutually align and dynamically merge semantic and geometric features. Extensive tests on the KITTI and nuScenes datasets show that FusionNet not only achieves higher mean Average Precision (mAP) than unimodal systems, but also improves especially in adverse scenarios such as fog, low light, and occlusion, where unimodal systems perform poorly. The model is real-time capable at 59 milliseconds per frame and is robust under different weather conditions and sensor degradation. FusionNet also achieves better localization quality at high IoU thresholds and can withstand modality dropout during training. These findings point to deep multimodal fusion as a key building block of future autonomous vehicle perception systems capable of reliable deployment across a wide range of urban and environmental contexts.
Recently, visual grounding and multi-sensor setups have been incorporated into perception systems for terrestrial autonomous driving and Unmanned Surface Vessels (USVs), yet the high complexity of modern learning-based multi-sensor visual grounding models prevents them from being deployed on USVs in real life. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, guiding both a camera and a 4D millimeter-wave radar to locate specific object(s) through natural language. NanoMVG can perform both box-level and mask-level visual grounding tasks simultaneously. Compared to other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments. Moreover, real-world experiments deploying NanoMVG on an embedded edge device of a USV demonstrate its fast inference speed for real-time perception and its ultra-low power consumption for long endurance.
This final group covers the complete technical pathway of camera-radar registration, from low-level geometric calibration (extrinsics, spatio-temporal synchronization), through mid-level non-rigid deformation correction (SAR matching, non-rigid point-set registration), to high-level deep perception (BEV feature fusion, semantic enhancement). The research focus is shifting from early handcrafted feature matching toward deep learning-based end-to-end feature-level alignment, which shows particularly high application value in 3D perception for autonomous driving, as well as in biomedical sensing in complex environments and special-purpose robotics.