Monocular Depth Estimation, Sparse Radar Enrichment, and Point Cloud Stitching
Self-Supervised Learning and Lightweight Architectures for Monocular Depth Estimation
This group of papers focuses on recovering depth from a single RGB image or from video sequences. Research priorities include self-supervised learning frameworks (e.g., handling scale ambiguity and sharpening boundaries), lightweight Transformer designs for embedded devices, and robustness analysis and applications in specific vertical domains such as endoscopy, panoramic imagery, and orchards. A minimal sketch of the photometric reprojection loss that underlies most of these self-supervised frameworks follows the reference list below.
- Lightweight Monocular Depth Estimation via Token-Sharing Transformer(Dong-Jae Lee, Jae Young Lee, Hyounguk Shon, Eojindl Yi, Yeong-Hun Park, Sung-Sik Cho, Junmo Kim, 2023, ArXiv Preprint)
- Depth estimation on embedded computers for robot swarms in forest(Chaoyue Niu, Danesh Tarapore, Klaus-Peter Zauner, 2020, ArXiv Preprint)
- Robust Semi-Supervised Monocular Depth Estimation with Reprojected Distances(V. Guizilini, Jie Li, Rares Ambrus, Sudeep Pillai, Adrien Gaidon, 2019, ArXiv)
- Monocular Depth Estimators: Vulnerabilities and Attacks(Alwyn Mathew, Aditya Prakash Patra, Jimson Mathew, 2020, ArXiv Preprint)
- Manydepth2: Motion-Aware Self-Supervised Monocular Depth Estimation in Dynamic Scenes(Kaichen Zhou, Jia-Wang Bian, Jian-Qing Zheng, Jiaxing Zhong, Qian Xie, Niki Trigoni, Andrew Markham, 2023, ArXiv Preprint)
- Deep Neural Networks for Accurate Depth Estimation with Latent Space Features(Siddiqui Muhammad Yasir, Hyunsik Ahn, 2025, ArXiv Preprint)
- Edge-Aware Monocular Dense Depth Estimation with Morphology(Zhi Li, Xiaoyang Zhu, Haitao Yu, Qi Zhang, Yongshi Jiang, 2021, 2020 25th International Conference on Pattern Recognition (ICPR))
- Coarse-to-fine Planar Regularization for Dense Monocular Depth Estimation(Stephan Liwicki, C. Zach, O. Mikšík, Philip H. S. Torr, 2016, No journal)
- Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation(Ning-Hsu Wang, Yu-Lun Liu, 2024, ArXiv Preprint)
- Dense Depth Estimation in Monocular Endoscopy With Self-Supervised Learning Methods(Xingtong Liu, Ayushi Sinha, M. Ishii, Gregory Hager, A. Reiter, R. Taylor, M. Unberath, 2019, IEEE Transactions on Medical Imaging)
- Self-supervised Learning for Dense Depth Estimation in Monocular Endoscopy(Xingtong Liu, Ayushi Sinha, M. Unberath, M. Ishii, Gregory Hager, R. Taylor, A. Reiter, 2018, ArXiv)
- Monocular Depth Estimation Based on Dilated Convolutions and Feature Fusion(Hang Li, Shuai Liu, Bin Wang, Yuanhao Wu, 2024, Applied Sciences)
- MonoIndoor++: Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments(Runze Li, Pan Ji, Yi Xu, Bir Bhanu, 2022, ArXiv Preprint)
- Adaptive confidence thresholding for monocular depth estimation(Hyesong Choi, Hunsang Lee, Sunkyung Kim, Sunok Kim, Seungryong Kim, Kwanghoon Sohn, Dongbo Min, 2020, ArXiv Preprint)
- InSpaceType: Reconsider Space Type in Indoor Monocular Depth Estimation(Cho-Ying Wu, Quankai Gao, Chin-Cheng Hsu, Te-Lin Wu, Jing-Wen Chen, Ulrich Neumann, 2023, ArXiv Preprint)
- 3D Visual Illusion Depth Estimation(Chengtang Yao, Zhidan Liu, Jiaxi Zeng, Lidong Yu, Yuwei Wu, Yunde Jia, 2025, ArXiv Preprint)
- Monocular dense reconstruction by depth estimation fusion(Tian Chen, Wendong Ding, Dapeng Zhang, Xilong Liu, 2018, 2018 Chinese Control And Decision Conference (CCDC))
- Monocular Depth Estimation with Affinity, Vertical Pooling, and Label Enhancement(Yukang Gan, Xiangyu Xu, Wenxiu Sun, Liang Lin, 2018, No journal)
- Unsupervised Monocular Depth and Ego-motion Learning with Structure and Semantics(Vincent Casser, Soeren Pirk, Reza Mahjourian, Anelia Angelova, 2019, ArXiv Preprint)
- OrchardDepth: Precise Metric Depth Estimation of Orchard Scene from Monocular Camera Images(Zhichao Zheng, Henry Williams, Bruce A. MacDonald, 2025, ArXiv)
- Depth as Points: Center Point-based Depth Estimation(Zhiheng Tu, Xinjian Huang, Yong He, Ruiyang Zhou, Bo Du, Weitao Wu, 2025, ArXiv Preprint)
- SPDepth: Enhancing Self-Supervised Indoor Monocular Depth Estimation via Self-Propagation(Xiaotong Guo, Huijie Zhao, Shuwei Shao, Xudong Li, Baochang Zhang, Na Li, 2024, Future Internet)
- Deep Triple-Supervision Learning Unannotated Surgical Endoscopic Video Data for Monocular Dense Depth Estimation(Wenkang Fan, Kaiyun Zhang, Hong Shi, Jianhua Chen, Yinran Chen, Xióngbiao Luó, 2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- OrchardDepth++: Binned KL-Flood Regularization for Monocular Depth Estimation of Orchard Scene(Zhichao Zheng, Henry Williams, Trevor Gee, Bruce A. MacDonald, 2025, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- ProbeMDE: Uncertainty-Guided Active Proprioception for Monocular Depth Estimation in Surgical Robotics(Britton Jordan, Jordan Thompson, Jesse F. d'Almeida, Hao Li, Nithesh Kumar, Susheela Sharma Stern, I. Oguz, Robert J. Webster, Daniel S. Brown, Alan Kuntz, James M. Ferguson, 2025, ArXiv)
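The self-supervised frameworks above are mostly trained with a photometric reprojection objective rather than ground-truth depth. Below is a minimal, hedged sketch of such a loss (SSIM plus L1 on a view re-synthesized from the predicted depth); `warp_to_target` is a placeholder for a differentiable inverse-warping step built from depth, relative pose, and camera intrinsics, and is not taken from any specific paper in the list.

```python
# Minimal sketch of the photometric loss used by many self-supervised
# monocular depth pipelines (SSIM + L1 on a view synthesized from the
# predicted depth). `warp_to_target` is a placeholder for a differentiable
# inverse-warping function; it is an assumption, not a library call.
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 neighbourhoods, returned as a dissimilarity map."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return torch.clamp((1 - ssim_map) / 2, 0, 1)

def photometric_loss(target, source, depth, pose, K, warp_to_target, alpha=0.85):
    """Reprojection loss between the target frame and a source frame
    re-synthesized through the predicted depth and relative pose."""
    synthesized = warp_to_target(source, depth, pose, K)  # differentiable warp
    l1 = torch.abs(target - synthesized).mean(1, keepdim=True)
    loss_map = alpha * ssim(target, synthesized).mean(1, keepdim=True) + (1 - alpha) * l1
    return loss_map.mean()
```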
Depth Perception Enhancement with Generative Models and Vision Foundation Models
This group reflects the current research frontier: diffusion models, vision foundation models (VFMs), and Gaussian Splatting are used to improve zero-shot generalization, geometric consistency, and the quality of 3D reconstruction from sparse views. A sketch of the scale-and-shift-invariant alignment that such distillation pipelines commonly rely on follows the list below.
- Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator(Xiankang He, Dongyan Guo, Hongji Li, Ruibo Li, Ying Cui, Chi Zhang, 2025, ArXiv Preprint)
- Depth Pro: Sharp Monocular Metric Depth in Less Than a Second(Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, Vladlen Koltun, 2024, ArXiv Preprint)
- Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion(Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, Anton Obukhov, 2024, ArXiv)
- Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion(Shenglun Chen, Xinzhu Ma, Hong Zhang, Haojie Li, Zhihui Wang, 2025, IEEE Transactions on Image Processing)
- StarryGazer: Leveraging Monocular Depth Estimation Models for Domain-Agnostic Single Depth Image Completion(Sangmin Hong, Suyoung Lee, K. Lee, 2025, ArXiv)
- Consistent3D: Diffusion-Driven Sparse View Completion and 3D Reconstruction with Geometric Priors(Qi Tan, Rong Wei, Zhiyu Xi, Jingqing Yang, 2025, 2025 International Conference on Digital Image Computing: Techniques and Applications (DICTA))
- MGS-SLAM: Monocular Sparse Tracking and Gaussian Mapping With Depth Smooth Regularization(Pengcheng Zhu, Yaoming Zhuang, Baoquan Chen, Li Li, Chen-Fei Wu, Zhanlin Liu, 2024, IEEE Robotics and Automation Letters)
- SplatSSC: Decoupled Depth-Guided Gaussian Splatting for Semantic Scene Completion(Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, Lihua Xie, 2025, ArXiv)
- Align3R: Aligned Monocular Depth Estimation for Dynamic Videos(Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, Yuan Liu, 2024, ArXiv Preprint)
- Diving into the Fusion of Monocular Priors for Generalized Stereo Matching(Chengtang Yao, Lidong Yu, Zhidan Liu, Jiaxi Zeng, Yuwei Wu, Yunde Jia, 2025, ArXiv Preprint)
- Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes(Rui Li, Dong Gong, Wei Yin, Hao Chen, Yu Zhu, Kaixuan Wang, Xiaozhi Chen, Jinqiu Sun, Yanning Zhang, 2023, ArXiv Preprint)
- Distilling Monocular Foundation Model for Fine-grained Depth Completion(Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Consistent Video Depth Estimation(Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, Johannes Kopf, 2020, ArXiv Preprint)
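Several of the distillation-style papers above supervise a student with the affine-invariant (relative) depth produced by a foundation-model teacher. A hedged sketch of the standard least-squares scale-and-shift alignment that such pipelines typically apply before computing a loss is shown below; `student_disp` and `teacher_disp` are hypothetical predictions, and this is a generic formulation rather than any cited paper's exact objective.

```python
# Hedged sketch: align a student's relative depth to a foundation-model
# teacher's relative depth with a least-squares scale/shift fit before the
# distillation loss (MiDaS-style alignment, not a specific paper's loss).
import torch

def align_scale_shift(pred, target, mask):
    """Least-squares scale s and shift t so that s * pred + t ~= target on mask."""
    p = pred[mask]
    t = target[mask]
    A = torch.stack([p, torch.ones_like(p)], dim=1)        # (N, 2)
    sol = torch.linalg.lstsq(A, t.unsqueeze(1)).solution   # (2, 1)
    return sol[0, 0], sol[1, 0]

def distillation_loss(student_disp, teacher_disp, valid_mask):
    # Fit scale/shift on detached values, then penalize the aligned residual.
    s, t = align_scale_shift(student_disp.detach(), teacher_disp, valid_mask)
    aligned = s * student_disp + t
    return torch.abs(aligned - teacher_disp)[valid_mask].mean()
```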
Sparse Depth Completion and Multi-Scale Densification
These papers study how sparse depth samples from LiDAR, VIO, or morphological operations can be turned into dense depth maps under RGB image guidance. Core techniques include multi-scale feature fusion, boundary-consistency enhancement, self-supervised completion frameworks, and real-time deployment in autonomous driving and micromanipulation tasks. A simple classical densification baseline is sketched after the list below.
- AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation(Yangchao Wu, Tian Yu Liu, Hyoungseob Park, S. Soatto, Dong Lao, Alex Wong, 2023, No journal)
- Depth Completion with Morphological Operations: An Intermediate Approach to Enhance Monocular Depth Estimation(R. Q. Mendes, E. G. Ribeiro, N. D. S. Rosa, V. Grassi, 2020, 2020 Latin American Robotics Symposium (LARS), 2020 Brazilian Symposium on Robotics (SBR) and 2020 Workshop on Robotics in Education (WRE))
- SelfDeco: Self-Supervised Monocular Depth Completion in Challenging Indoor Environments(Jaehoon Choi, Dongki Jung, Yonghan Lee, Deok-Won Kim, Dinesh Manocha, Dong-hwan Lee, 2020, 2021 IEEE International Conference on Robotics and Automation (ICRA))
- Efficiently Fusing Sparse Lidar for Enhanced Self-Supervised Monocular Depth Estimation(Yue Wang, Mingrong Gong, L. Xia, Qieshi Zhang, Jun Cheng, 2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Sparse depth densification for monocular depth estimation(Zhen Liang, Tiyu Fang, Yanzhu Hu, Yingjian Wang, 2023, Multimedia Tools and Applications)
- Sparse Beats Dense: Rethinking Supervision in Radar-Camera Depth Completion(Huadong Li, Minhao Jing, Jin Wang, Shichao Dong, Jiajun Liang, Haoqiang Fan, Renhe Ji, 2024, No journal)
- Robust Monocular Visual-Inertial Depth Completion for Embedded Systems(Nate Merrill, Patrick Geneva, Guoquan Huang, 2021, 2021 IEEE International Conference on Robotics and Automation (ICRA))
- MFF-Net: Towards Efficient Monocular Depth Completion With Multi-Modal Feature Fusion(Lina Liu, Xibin Song, Jiadai Sun, Xiaoyang Lyu, Lin Li, Yong Liu, Liangjun Zhang, 2023, IEEE Robotics and Automation Letters)
- Multi-Modal Masked Pre-Training for Monocular Panoramic Depth Completion(Zhiqiang Yan, Xiang Li, Kun Wang, Zhenyu Zhang, Jun Yu Li, Jian Yang, 2022, No journal)
- To Complete or to Estimate, That is the Question: A Multi-Task Approach to Depth Completion and Monocular Depth Estimation(Amir Atapour-Abarghouei, T. Breckon, 2019, 2019 International Conference on 3D Vision (3DV))
- Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation(Minseok Seo, Wonjun Lee, Jaehyuk Jang, Changick Kim, 2026, ArXiv Preprint)
- Adaptive LiDAR Sampling and Depth Completion using Ensemble Variance(Eyal Gofer, Shachar Praisler, Guy Gilboa, 2020, ArXiv Preprint)
- Weakly-Supervised Depth Completion during Robotic Micromanipulation from a Monocular Microscopic Image(Han Yang, Yufei Jin, Guanqiao Shan, Yibin Wang, Yongbin Zheng, Jiangfan Yu, Yu Sun, Zhuoran Zhang, 2024, 2024 IEEE International Conference on Robotics and Automation (ICRA))
- Sparse-to-Continuous: Enhancing Monocular Depth Estimation using Occupancy Maps(N. D. S. Rosa, V. Guizilini, V. Grassi, 2018, 2019 19th International Conference on Advanced Robotics (ICAR))
- Indoor Depth Completion with Boundary Consistency and Self-Attention(Yu-Kai Huang, Tsung-Han Wu, Yueh-Cheng Liu, Winston H. Hsu, 2019, ArXiv Preprint)
- Spacecraft Depth Completion Based on the Gray Image and the Sparse Depth Map(Xiang Liu, Hao Wang, Zhiqiang Yan, Yu Chen, Xinlong Chen, Weichun Chen, 2022, IEEE Transactions on Aerospace and Electronic Systems)
- HGAN: monocular 3D object depth completion method via hierarchical geometric-aware network(Chengcheng Li, Xili Xie, Shuo Zhang, Jiancong Chen, 2025, Measurement Science and Technology)
- NDDepth: Normal-Distance Assisted Monocular Depth Estimation and Completion(Shuwei Shao, Zhongcai Pei, Weihai Chen, Peter C. Y. Chen, Zhengguo Li, 2023, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion(Lina Liu, Xibin Song, Xiaoyang Lyu, Junwei Diao, Mengmeng Wang, Yong Liu, Liangjun Zhang, 2020, ArXiv)
- Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image(Fangchang Ma, S. Karaman, 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA))
- OASIS-DC: Generalizable Depth Completion via Output-level Alignment of Sparse-Integrated Monocular Pseudo Depth(Jaehyeon Cho, Jhonghyun An, 2026, ArXiv Preprint)
- Zero-shot Depth Completion via Test-time Alignment with Affine-invariant Depth Prior(Lee Hyoseok, Kyeong Seon Kim, Kwon Byung-Ki, Tae-Hyun Oh, 2025, ArXiv Preprint)
- Sparse Auxiliary Networks for Unified Monocular Depth Prediction and Completion(V. Guizilini, Rares Ambrus, Wolfram Burgard, Adrien Gaidon, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- PCTNet: 3D Point Cloud and Transformer Network for Monocular Depth Estimation(Yu Hong, Xiaolong Liu, H. Dai, Wenqi Tao, 2022, 2022 10th International Conference on Information and Education Technology (ICIET))
- SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments(Hongjie Zhang, G. Billings, Stefan Williams, 2025, ArXiv)
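As a reference point for the learned methods above, the classical, learning-free densification idea (as in the morphology-based entries) can be sketched in a few lines. This is a simplified baseline with illustrative kernel sizes, not the procedure of any specific paper in the list.

```python
# Hedged sketch of a classical densification baseline: dilate the sparse
# depth map, close small holes, and smooth. Kernel sizes are illustrative.
import cv2
import numpy as np

def densify_sparse_depth(sparse_depth: np.ndarray) -> np.ndarray:
    """sparse_depth: HxW float32 map with 0 where no measurement exists."""
    # Invert valid depths so that dilation propagates the nearest surface.
    max_d = float(sparse_depth.max()) + 1.0
    inverted = np.where(sparse_depth > 0, max_d - sparse_depth, 0).astype(np.float32)

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    dense = cv2.dilate(inverted, kernel)                      # spread around points
    dense = cv2.morphologyEx(dense, cv2.MORPH_CLOSE, kernel)  # close small holes
    dense = cv2.medianBlur(dense, 5)                          # suppress speckle

    # Undo the inversion; pixels never reached remain 0 (unknown).
    return np.where(dense > 0, max_d - dense, 0)
```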
Multi-Modal Fusion Perception with mmWave Radar and Vision
This group concentrates on deep fusion of millimeter-wave radar (including 4D imaging radar) with cameras. Radar's all-weather robustness is leveraged to resolve the scale ambiguity of vision and its failure modes in adverse weather; topics span radar point cloud upsampling, pseudo point cloud generation, semantics-guided fusion, and cross-modal calibration. A sketch of the basic radar-to-image projection and global scale alignment step appears after the list below.
- Semantic-Guided Depth Completion From Monocular Images and 4D Radar Data(Zecheng Li, Yuying Song, Fuyuan Ai, Chunyi Song, Zhiwei Xu, 2024, IEEE Transactions on Intelligent Vehicles)
- RaCalNet: Radar Calibration Network for Sparse-Supervised Metric Depth Estimation(Xingrui Qin, Wentao Zhao, Chuan Cao, Yihe Niu, Houcheng Jiang, Jingchuan Wang, 2025, ArXiv)
- MMCRF: multi-modal camera-radar fusion for 3D object detection via polar-coordinate feature alignment(Tiezhen Jiang, Runjie Kang, Qingzhu Li, 2025, Engineering Research Express)
- CRAB: Camera-Radar Fusion for Reducing Depth Ambiguity in Backward Projection Based View Transformation(I. Lee, Sihwan Hwang, Youngseok Kim, Wonjun Kim, Sanmin Kim, Dongsuk Kum, 2025, 2025 IEEE International Conference on Robotics and Automation (ICRA))
- Depth Estimation Based on MMwave Radar and Camera Fusion with Attention Mechanisms and Multi-Scale Features for Autonomous Driving Vehicles(Zhaohuan Zhu, Feng Wu, Wenqing Sun, Quanying Wu, Feng Liang, Wuhan Zhang, 2025, Electronics)
- R4Dyn: Exploring Radar for Self-Supervised Monocular Depth Estimation of Dynamic Scenes(Stefano Gasperini, Patrick Koch, Vinzenz Dallabetta, Nassir Navab, Benjamin Busam, Federico Tombari, 2021, ArXiv Preprint)
- TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation(Huawei Sun, Zixu Wang, Hao Feng, Julius Ott, Lorenzo Servadei, Robert Wille, 2025, Trans. Mach. Learn. Res.)
- PCGNet: Point Cloud Generation Network for 3D Perception Using Monocular Images and Radar(Zecheng Li, Fuyuan Ai, Yuying Song, Wei Wu, Chunyi Song, Zhiwei Xu, 2025, IEEE Transactions on Intelligent Vehicles)
- Depth Estimation from Monocular Images and Sparse Radar Data(Juan Lin, Dengxin Dai, L. Gool, 2020, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- CaRaFFusion: Improving 2D Semantic Segmentation With Camera-Radar Point Cloud Fusion and Zero-Shot Image Inpainting(Huawei Sun, Bora Kunter Sahin, Georg Stettinger, Maximilian Bernhard, Matthias Schubert, Robert Wille, 2025, IEEE Robotics and Automation Letters)
- FGRFlow: Learning Fine-Grained Rigidity Scene Flow from 4D Radar Point Cloud(Mingliang Zhai, Yiheng Wang, Haidong Hu, Chi-man Pun, Hao Gao, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Concept for an Automatic Annotation of Automotive Radar Data Using AI-segmented Aerial Camera Images(Marcel Hoffmann, Sandro Braun, Oliver Sura, Michael Stelzig, Christian Schüßler, Knut Graichen, Martin Vossiek, 2023, ArXiv Preprint)
- A novel tracking system for human following robots with fusion of MMW radar and monocular vision(Yipeng Zhu, Tao Wang, Shiqiang Zhu, 2021, Ind. Robot)
- UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block(Luoxi Jing, Dian-xi Shi, Zhe Liu, Songchang Jin, Chunping Qiu, Ziteng Qiao, Yuxian Li, Jianqiang Xia, 2025, No journal)
- Implementation of Radar-Camera fusion for Efficient Object Detection and Distance Estimation in Autonomous Vehicles(P. R, Pranav Sharma N, Tejasvi P C, T. P. Mithun, B. Reddy, 2025, 2025 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE))
- Monocular Visual-Inertial Depth Estimation(Diana Wofk, René Ranftl, Matthias Müller, V. Koltun, 2023, 2023 IEEE International Conference on Robotics and Automation (ICRA))
- Research on Intelligent 3D Reconstruction System Integrating Transformer and Adaptive Point Cloud Registration(Chuying Lu, 2025, International Journal of Big Data Intelligent Technology)
- Depth Estimation From Monocular Images And Sparse Radar Using Deep Ordinal Regression Network(Chen-Chou Lo, P. Vandewalle, 2021, 2021 IEEE International Conference on Image Processing (ICIP))
- RadarCam-Depth: Radar-Camera Fusion for Depth Estimation with Learned Metric Scale(Han Li, Yukai Ma, Yaqing Gu, Kewei Hu, Yong Liu, Xingxing Zuo, 2024, 2024 IEEE International Conference on Robotics and Automation (ICRA))
- Metrically Scaled Monocular Depth Estimation through Sparse Priors for Underwater Robots(Luca Ebner, G. Billings, Stefan Williams, 2023, 2024 IEEE International Conference on Robotics and Automation (ICRA))
- MonoComb: A Sparse-to-Dense Combination Approach for Monocular Scene Flow(René Schuster, C. Unger, D. Stricker, 2020, Proceedings of the 4th ACM Computer Science in Cars Symposium)
- RaViDeep: Target Detection Based on Deep Fusion of Radar and Vision in Berthing Scenarios(Yuying Song, Jingxuan Wu, Wei Wu, Chunyi Song, Zhiwei Xu, Ming Zhang, 2025, IEEE Transactions on Intelligent Vehicles)
- Radar-Camera Pixel Depth Association for Depth Completion(Yunfei Long, Daniel Morris, Xiaoming Liu, Marcos Castro, Punarjay Chakravarty, Praveen Narayanan, 2021, ArXiv Preprint)
- Depth-Aware Fusion Method Based on Image and 4D Radar Spectrum for 3D Object Detection(Yue Sun, Yeqiang Qian, Chunxiang Wang, Ming Yang, 2024, 2024 IEEE International Conference on Robotics and Biomimetics (ROBIO))
- Multi-modal On-Device Learning for Monocular Depth Estimation on Ultra-low-power MCUs(Davide Nadalini, Manuele Rusci, Elia Cereda, Luca Benini, Francesco Conti, Daniele Palossi, 2025, ArXiv Preprint)
- VIMD: Monocular Visual-Inertial Motion and Depth Estimation(Saimouli Katragadda, Guoquan Huang, 2025, ArXiv)
- C4RFNet: Camera and 4D-Radar Fusion Network for Point Cloud Enhancement(Wenbo Wang, Wei Wang, Xixin Yu, Weibin Zhang, 2025, IEEE Sensors Journal)
- How Much Depth Information can Radar Contribute to a Depth Estimation Model?(Chen-Chou Lo, Patrick Vandewalle, 2022, ArXiv Preprint)
- SFGFusion: Surface Fitting Guided 3D Object Detection with 4D Radar and Camera Fusion(Xiaozhi Li, Huijun Di, Jian Li, Feng Liu, Wei Liang, 2025, ArXiv)
- The Oxford Radar RobotCar Dataset: A Radar Extension to the Oxford RobotCar Dataset(Dan Barnes, Matthew Gadd, Paul Murcutt, Paul Newman, Ingmar Posner, 2019, ArXiv Preprint)
- Diffusion-Based Point Cloud Super-Resolution for mmWave Radar Data(Kai Luan, Chenghao Shi, Neng Wang, Yuwei Cheng, Huimin Lu, Xieyuanli Chen, 2024, 2024 IEEE International Conference on Robotics and Automation (ICRA))
- A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection(Felix Nobis, Maximilian Geisslinger, Markus Weber, Johannes Betz, Markus Lienkamp, 2020, ArXiv Preprint)
- Supervising radar depth completion using the monocular depth large model.(Jiming Chen, Zili Zhou, Zhu Yu, Fuyi Zhang, Jiacheng Ying, Si-Yuan Cao, Hui Shen, 2025, Applied optics)
- CamRaDepth: Semantic Guided Depth Estimation Using Monocular Camera and Sparse Radar for Automotive Perception(Florian Sauerbeck, Dan Halperin, Lukas Connert, Johannes Betz, 2023, IEEE Sensors Journal)
- 3-D Grid-Based VDBSCAN Clustering and Radar—Monocular Depth Completion(Feipeng Chen, Lihui Wang, Pinzhang Zhao, Jian Wei, 2024, IEEE Sensors Journal)
- Radar Meets Vision: Robustifying Monocular Metric Depth Prediction for Mobile Robotics(Marco Job, Thomas Stastny, Tim Kazik, Roland Siegwart, Michael Pantic, 2024, ArXiv Preprint)
- Expanding Sparse Radar Depth Based on Joint Bilateral Filter for Radar-Guided Monocular Depth Estimation(Chen-Chou Lo, Patrick Vandewalle, 2024, Sensors (Basel, Switzerland))
- mmWave Radar and Image Fusion for Depth Completion: a Two-Stage Fusion Network(Tieshuai Song, Bin Yang, Jun Wang, Guidong He, Zhao Dong, Fengjun Zhong, 2024, 2024 27th International Conference on Information Fusion (FUSION))
- Sparse-to-Dense Hint Guided Stereo-LiDAR Fusion(Ang Li, Dexin Zuo, Anning Hu, Wenxian Yu, Danping Zou, 2025, IEEE Transactions on Circuits and Systems for Video Technology)
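A recurring building block in these radar-camera papers is projecting radar returns into the image plane and using them to put a relative monocular depth map on a metric scale. The sketch below shows that step in its simplest global form; the extrinsics `T_cam_radar`, intrinsics `K`, and the single-scale least-squares fit are illustrative assumptions, not the pipeline of any particular paper.

```python
# Hedged sketch: project radar points into the camera image, then recover a
# single global metric scale for a relative monocular depth map from the hits.
import numpy as np

def project_radar_to_image(points_radar, T_cam_radar, K, image_shape):
    """points_radar: Nx3 in the radar frame. Returns pixel coords and depths."""
    pts_h = np.hstack([points_radar, np.ones((len(points_radar), 1))])
    pts_cam = (T_cam_radar @ pts_h.T).T[:, :3]        # radar frame -> camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0.5]            # keep points in front of camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    h, w = image_shape
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[valid].astype(int), pts_cam[valid, 2]

def align_mono_depth_to_radar(mono_depth, uv, radar_depth):
    """Single least-squares scale so that scale * mono ~= radar at hit pixels."""
    mono_at_hits = mono_depth[uv[:, 1], uv[:, 0]]
    scale = np.sum(mono_at_hits * radar_depth) / np.sum(mono_at_hits ** 2)
    return scale * mono_depth
```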
Point Cloud Registration, Stitching, and Cross-Modal Spatial Alignment
These papers address geometric alignment and stitching between 3D point clouds: cross-source registration, matching under partial overlap, registration optimization driven by reinforcement learning or meta-learning, and cross-modal alignment of images with point clouds, all aimed at spatial consistency across multi-source data. A minimal ICP baseline is sketched after the list below.
- Planning with Learned Dynamic Model for Unsupervised Point Cloud Registration(Haobo Jiang, Jin Xie, Jianjun Qian, Jian Yang, 2021, ArXiv Preprint)
- A Systematic Approach for Cross-source Point Cloud Registration by Preserving Macro and Micro Structures(Xiaoshui Huang, Jian Zhang, Lixin Fan, Qiang Wu, Chun Yuan, 2016, ArXiv Preprint)
- Sparse Point Cloud Registration Network with Semantic Supervision in Wilderness Scenes(Zhichao Zhang, Feng Lu, Youchun Xu, Jinsheng Chen, Yulin Ma, 2024, Elektronika ir Elektrotechnika)
- An Improved Keypoint Detection Method for Radar Point Cloud Registration in Urban Environments(Yongqiang Wang, Di Zhang, Lihua Ni, Qun Wan, 2023, IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium)
- Point Cloud Registration using Representative Overlapping Points(Lifa Zhu, Dongrui Liu, Changwei Lin, Rui Yan, Francisco Gómez-Fernández, Ninghua Yang, Ziyong Feng, 2021, ArXiv Preprint)
- Deep Models with Fusion Strategies for MVP Point Cloud Registration(Lifa Zhu, Changwei Lin, Dongrui Liu, Xin Li, Francisco Gómez-Fernández, 2021, ArXiv Preprint)
- Geometry-to-Image Synthesis-Driven Generative Point Cloud Registration(Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng, 2025, ArXiv Preprint)
- PCRDiffusion: Diffusion Probabilistic Models for Point Cloud Registration(Yue Wu, Yongzhe Yuan, Xiaolong Fan, Xiaoshui Huang, Maoguo Gong, Qiguang Miao, 2023, ArXiv Preprint)
- APR: Online Distant Point Cloud Registration Through Aggregated Point Cloud Reconstruction(Quan Liu, Yunsong Zhou, Hongzi Zhu, Shan Chang, Minyi Guo, 2023, ArXiv Preprint)
- Mining and Transferring Feature-Geometry Coherence for Unsupervised Point Cloud Registration(Kezheng Xiong, Haoen Xiang, Qingshan Xu, Chenglu Wen, Siqi Shen, Jonathan Li, Cheng Wang, 2024, ArXiv)
- MAC-I2P: I2P Registration with Modality Approximation and Cone–Block–Point Matching(Yunda Sun, Lin Zhang, Shengjie Zhao, 2025, Applied Sciences)
- A Novel Multimodal Fusion Framework Based on Point Cloud Registration for Near-Field 3D SAR Perception(Tianjiao Zeng, Wensi Zhang, Xu Zhan, Xiaowo Xu, Ziyang Liu, Baoyou Wang, Xiaoling Zhang, 2024, Remote. Sens.)
- A Dynamical Perspective on Point Cloud Registration(Heng Yang, 2020, ArXiv Preprint)
- ALCReg: Active Label Correction for Partial Point Cloud Registration(Zongyi Xu, Xinqi Jiang, Xinyu Gao, Shanshan Zhao, Qianni Zhang, Weisheng Li, Xinbo Gao, 2025, 2025 IEEE International Conference on Multimedia and Expo (ICME))
- Low-overlap point cloud registration with pseudo-structural features for pose calibration of industrial robots(Xueqi Wang, Yinhua Liu, Yanzheng Li, Yuwei Lu, Dongmei Yang, 2025, Optics & Laser Technology)
- Automatic marker-free registration based on similar tetrahedras for single-tree point clouds(Jing Ren, Pei Wang, Hanlong Li, Yuhan Wu, Yuhang Gao, Wenxin Chen, Mingtai Zhang, Lingyun Zhang, 2024, ArXiv Preprint)
- A Point Cloud Registration Algorithm Based on Weighting Strategy for 3D Indoor Spaces(Wenshan Lv, Haifeng Zhang, Weiren Chen, Xiaoming Li, Shengtian Sang, 2024, Applied Sciences)
- mmReg: Centimeter-Level and Real-Time mmWave Radar Point Cloud Registration for Multivehicle Sensing(Kaikai Deng, Ling Xing, Honghai Wu, Yizong Wang, Leiyang Xu, Yue Ling, Jianping Gao, 2026, IEEE Internet of Things Journal)
- S&CNet: A Enhanced Coarse-to-fine Framework For Monocular Depth Completion(Lei Zhang, Weihai Chen, Chao Hu, 2019, ArXiv)
- Pseudo Label Learning for Partial Point Cloud Registration(Wenping Ma, Yifan Sun, Yue Wu, Yue Zhang, Hao Zhu, Biao Hou, Licheng Jiao, 2025, IEEE Transactions on Visualization and Computer Graphics)
- From pseudo- to non-correspondences: Robust point cloud registration via thickness-guided self-correction(Yifei Tian, Xiangyun Li, Jieming Yin, 2026, Comput. Graph.)
- SDFReg: Learning Signed Distance Functions for Point Cloud Registration(Leida Zhang, Zhengda Lu, Kai Liu, Yiqun Wang, 2023, ArXiv Preprint)
- End-to-End LiDAR optimization for 3D point cloud registration(Siddhant Katyan, Marc-André Gardner, Jean-François Lalonde, 2026, ArXiv Preprint)
- An Improved Registration Method for Radar Point Cloud in Weakly Structured Texture Scenes of Urban Environments(Yongqiang Wang, Lihua Ni, Di Zhang, Qun Wan, 2024, IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium)
- RVGICP: Fast and Accurate 4-D Radar Point Cloud Registration Based on Voxelized GICP(Feipeng Chen, Lihui Wang, Kui Wang, 2026, IEEE Transactions on Aerospace and Electronic Systems)
- PCR-DAT: a new point cloud registration method for lidar inertial odometry via distance and Gauss distributed(Xiaosong Wang, Yuchen He, XianQi Cai, Wei Li, 2024, Intelligent Service Robotics)
- Planar Feature Constrained Point Cloud Registration Method with Minimal Overlap Structure(Hongyu Guo, J. Sha, Cong Tan, Ye Chen, 2024, 2024 IEEE 17th International Conference on Signal Processing (ICSP))
- Indirect Point Cloud Registration: Aligning Distance Fields Using a Pseudo Third Point Set(Yijun Yuan, A. Nüchter, 2022, IEEE Robotics and Automation Letters)
- 3D Meta-Registration: Learning to Learn Registration of 3D Point Clouds(Lingjing Wang, Yu Hao, Xiang Li, Yi Fang, 2020, ArXiv Preprint)
- Dense Point Cloud Mapping by Leveraging Neural-Based Monocular Depth Estimation(Luiz Eugênio Santos Araújo Filho, Kleber M. Cabral, Sidney N. Givigi, C. Nascimento, 2025, 2025 IEEE International systems Conference (SysCon))
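For orientation, a standard point-to-point ICP alignment (the baseline most of the registration papers above compare against or refine) can be run with Open3D in a few lines. The voxel size and correspondence threshold below are illustrative values, not settings from any cited method.

```python
# Hedged sketch of a vanilla point-to-point ICP baseline using Open3D.
import numpy as np
import open3d as o3d

def register_point_clouds(source_pts: np.ndarray, target_pts: np.ndarray,
                          voxel: float = 0.05) -> np.ndarray:
    source = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(source_pts))
    target = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(target_pts))
    source = source.voxel_down_sample(voxel)
    target = target.voxel_down_sample(voxel)

    result = o3d.pipelines.registration.registration_icp(
        source, target,
        max_correspondence_distance=2.0 * voxel,
        init=np.eye(4),
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # 4x4 transform mapping source into target
```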
Video Depth Consistency and SLAM-Integrated Systems
These papers examine how depth estimation behaves in dynamic environments and continuous video streams. The focus is on integrating SLAM systems (e.g., ORB-SLAM3) to provide scale initialization, and on using temporal coherence, optical-flow guidance, or motion awareness to keep depth predictions stable and globally consistent over time. The scale-initialization step is sketched after the list below.
- Real-Time Consistent Monocular Depth Recovery System for Dynamic Environments(Gan Huang, Xiaokun Pan, Hengxu Lin, Ziyang Zhang, Weiwei Xu, Guofeng Zhang, 2025, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- 360ORB-SLAM: A Visual SLAM System for Panoramic Images with Depth Completion Network(Yichen Chen, Yuqi Pan, Ruyu Liu, Haoyu Zhang, Guodao Zhang, Bo Sun, Jianhua Zhang, 2024, 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD))
- CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion(Jinzhou Lin, Jie Zhou, Wenhao Xu, Rongtao Xu, Changwei Wang, Shunpeng Chen, Kexue Fu, Yihua Shao, Li Guo, Shibiao Xu, 2025, ArXiv)
- Depth-Consistent Monocular Visual Trajectory Estimation for AUVs(Yangyang Wang, Xiaokai Liu, Dongbing Gu, Jie Wang, Xianping Fu, 2025, IEEE Internet of Things Journal)
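The scale-initialization step these systems rely on can be illustrated very simply: sparse SLAM map points visible in a frame provide a handful of metric anchors, and a robust median ratio puts the relative monocular depth on that scale. The sketch below assumes the SLAM points have already been projected to pixel coordinates; it is a generic illustration, not the fusion module of any cited system.

```python
# Hedged sketch: robust median scaling of a relative monocular depth map using
# sparse SLAM map points already projected into the frame.
import numpy as np

def init_scale_from_slam(mono_depth, slam_uv, slam_depth):
    """mono_depth: HxW relative depth; slam_uv: Nx2 integer pixel coords (x, y);
    slam_depth: N metric depths of the corresponding SLAM map points."""
    mono_at_pts = mono_depth[slam_uv[:, 1], slam_uv[:, 0]]
    ratios = slam_depth / np.maximum(mono_at_pts, 1e-6)
    scale = np.median(ratios)                 # robust to outlier map points
    return scale * mono_depth, scale
```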
This report synthesizes frontier research in 3D visual perception into a complete technical chain, from low-level depth estimation to high-level spatial alignment. The research focus has evolved from single-task monocular depth prediction into a dual track of multi-modal fusion (radar plus vision) and large-model-driven methods (diffusion models plus foundation models). Depth completion effectively addresses sensor sparsity, while point cloud registration and SLAM integration guarantee spatial consistency across large-scale scenes. Their combined application is markedly improving the perception accuracy and robustness of autonomous driving, robot navigation, and precision medical applications in complex, dynamic, and adverse environments.
A total of 136 related references.
Selected Abstracts
In recent years, radar depth completion has made significant advances in developing backbone networks and high-quality datasets. However, less attention has been paid to optimizing the supervision manner. In this work, we propose a novel supervision method, to the best of our knowledge, using a relative-to-metric conversion (R2MC) module to leverage the generalization capability of the monocular depth large model (MDLM). The R2MC module employs sparse LiDAR data to obtain metric depth scales through pixelwise local mapping while preserving the generalization capability of the MDLM. The experimental results illustrate that our R2MC module can be combined with different backbones and improve their performance compared to their original supervision manners.
Depth completion aims to recover dense depth maps from sparse depth measurements. It is a fundamental challenge in computer vision that is faced in numerous applications, such as robotics, UAVs, and autonomous driving. With the recent advancement in hardware, millimeter-wave (mmWave) radar technology is now commonly employed in high-level perception tasks of autonomous driving. However, compared to LiDAR point clouds, mmWave radar point clouds tend to be sparse and include several ghost targets. To address these challenges, we propose a semantic-guided network architecture to perform depth diffusion by utilizing the consistent relationship between semantics and depth. Our method first utilizes semantic features from images to extend the height measurements, and then transforms the global depth completion task into a series of category-specific depth diffusion tasks to learn the semantic-depth priors and accommodate radar sparsity. In addition, a gradient-guided Smooth-Edge loss is developed to explicitly constrain the semantic regional consistency and border discontinuity. Meanwhile, a 4D dataset that features Camera, 4D Radar, and LiDAR measurement in diverse scenes is collected to conduct experiments. Extensive experiments demonstrate the superior performance of our proposed method over existing monocular and fusion-based approaches.
Accurate depth estimation is crucial in enabling unmanned aerial vehicles (UAVs) to 3-D perception, mapping, and navigation, deciding the successful completion of the flight mission. To achieve accurate perception of dense depth, we propose a novel radar-monocular associated depth completion method. First, we introduce the 3-D grid-based variable DBSCAN (3-D grid-based VDBSCAN) clustering method, which incorporates an adaptive neighbor search radius, variable density threshold, and elliptical cylindrical search region. This approach effectively addresses the problem of high variation in 4-D radar data density and the focus of measurements on the surface of the object. Second, we use Midas to estimate the inverse depth from the monocular and propose a region-search-based nearest neighbor matching approach for the fusion of radar and camera. Finally, we establish an experiment platform and conduct a thorough qualitative and quantitative evaluation. The experimental results demonstrate the robustness of the proposed 3-D grid-based VDBSCAN method in both static and dynamic scenes, with a mean recall of 0.937 and a mean V-Measure of 0.943. Moreover, the proposed data fusion method effectively recovers the depth of the scene, yielding a mean absolute relative (Abs Rel) error of 0.134 in our experiment platform.
Radar data can provide additional depth information for monocular depth estimation. It provides a cost-effective solution and is robust in various weather conditions, particularly when compared with lidar. Given the sparse and limited vertical field of view of radar signals, existing methods employ either a vertical extension of radar points or the training of a preprocessing neural network to extend sparse radar points under lidar supervision. In this work, we present a novel radar expansion technique inspired by the joint bilateral filter, tailored for radar-guided monocular depth estimation. Our approach is motivated by the synergy of spatial and range kernels within the joint bilateral filter. Unlike traditional methods that assign a weighted average of nearby pixels to the current pixel, we expand sparse radar points by calculating a confidence score based on the values of spatial and range kernels. Additionally, we propose the use of a range-aware window size for radar expansion instead of a fixed window size in the image plane. Our proposed method effectively increases the number of radar points from an average of 39 points in a raw radar frame to an average of 100 K points. Notably, the expanded radar exhibits fewer intrinsic errors when compared with raw radar and previous methodologies. To validate our approach, we assess our proposed depth estimation model on the nuScenes dataset. Comparative evaluations with existing radar-guided depth estimation models demonstrate its state-of-the-art performance.
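A much-simplified version of the joint-bilateral expansion described in this abstract is sketched below: a radar hit's depth is propagated only to nearby pixels whose spatial and intensity similarity to the hit pixel is high enough. Window size, sigmas, and the threshold are illustrative values, not the paper's, and the range-aware window size is omitted.

```python
# Simplified, hedged sketch of joint-bilateral radar expansion: propagate a
# radar depth to neighbouring pixels with sufficient spatial + intensity
# (range-kernel) similarity. Parameters are illustrative only.
import numpy as np

def expand_radar_point(gray, u, v, depth, win=15, sigma_s=5.0, sigma_r=10.0, thresh=0.3):
    """gray: HxW image; (u, v): radar hit pixel (x, y); returns (x, y, depth) triples."""
    h, w = gray.shape
    r = win // 2
    out = []
    for y in range(max(0, v - r), min(h, v + r + 1)):
        for x in range(max(0, u - r), min(w, u + r + 1)):
            spatial = np.exp(-((x - u) ** 2 + (y - v) ** 2) / (2 * sigma_s ** 2))
            rng = np.exp(-(float(gray[y, x]) - float(gray[v, u])) ** 2 / (2 * sigma_r ** 2))
            if spatial * rng > thresh:        # confidence from both kernels
                out.append((x, y, depth))
    return out
```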
Depth completion upgrades sparse depth measurements into dense depth maps guided by a conventional image. Existing methods for this highly ill-posed task operate in tightly constrained settings and tend to struggle when applied to images outside the training domain or when the available depth measurements are sparse, irregularly distributed, or of varying density. Inspired by recent advances in monocular depth estimation, we reframe depth completion as an image-conditional depth map generation guided by sparse measurements. Our method, Marigold-DC, builds on a pretrained latent diffusion model for monocular depth estimation and injects the depth observations as test-time guidance via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. The method exhibits excellent zero-shot generalization across a diverse range of environments and handles even extremely sparse guidance effectively. Our results suggest that contemporary monocular depth priors greatly robustify depth completion: it may be better to view the task as recovering dense depth from (dense) image pixels, guided by sparse depth; rather than as inpainting (sparse) depth, guided by an image. Project website: https://MarigoldDepthCompletion.github.io/
Our research aims to generate robust, dense 3-D depth maps for robotics, especially autonomous driving applications. Since cameras output 2-D images and active sensors such as LiDAR or radar produce sparse depth measurements, dense depth maps need to be estimated. Recent methods based on visual transformer networks have outperformed conventional deep learning approaches in various computer vision tasks, including depth prediction, but have focused on the use of a single camera image. This article explores the potential of visual transformers applied to the fusion of monocular images, semantic segmentation, and projected sparse radar reflections for robust monocular depth estimation. The addition of a semantic segmentation branch is used to add object-level understanding and is investigated in a supervised and unsupervised manner. We evaluate our new depth estimation approach on the nuScenes dataset where it outperforms existing state-of-the-art camera-radar depth estimation methods. We show that models can benefit from an additional segmentation branch during the training process by transfer learning even without running segmentation at inference. Further studies are needed to investigate the usage of 4-D-imaging radars and enhanced ground-truth generation in more detail. The related code is available as open-source software under https://github.com/TUMFTM/CamRaDepth.
Remarkable progress has been achieved by current depth completion approaches, which produce dense depth maps from sparse depth maps and corresponding color images. However, the performances of these approaches are limited due to the insufficient feature extractions and fusions. In this work, we propose an efficient multi-modal feature fusion based depth completion framework (MFF-Net), which can efficiently extract and fuse features with different modals in both encoding and decoding processes, thus more depth details with better performance can be obtained. In specific, the encoding process contains three branches where different modals of features from both color and sparse depth input can be extracted, and a multi-feature channel shuffle is utilized to enhance these features thus features with better representation abilities can be obtained. Meanwhile, the decoding process contains two branches to sufficiently fuse the extracted multi-modal features, and a multi-level weighted combination is employed to further enhance and fuse features with different modals, thus leading to more accurate and better refined depth maps. Extensive experiments on different benchmarks demonstrate that we achieve state-of-the-art among online methods. Meanwhile, we further evaluate the predicted dense depth by RGB-D SLAM, which is a commonly used downstream robotic perception task, and higher accuracy on vehicle's trajectory can be obtained in KITTI odometry dataset, which demonstrates the high quality of our depth prediction and the potential of improving the related downstream tasks with depth completion results.
Unsupervised depth completion and estimation methods are trained by minimizing reconstruction error. Block artifacts from resampling, intensity saturation, and occlusions are amongst the many undesirable by-products of common data augmentation schemes that affect image reconstruction quality, and thus the training signal. Hence, typical augmentations on images viewed as essential to training pipelines in other vision tasks have seen limited use beyond small image intensity changes and flipping. The sparse depth modality in depth completion have seen even less use as intensity transformations alter the scale of the 3D scene, and geometric transformations may decimate the sparse points during resampling. We propose a method that unlocks a wide range of previously-infeasible geometric augmentations for unsupervised depth completion and estimation. This is achieved by reversing, or ``undo''-ing, geometric transformations to the coordinates of the output depth, warping the depth map back to the original reference frame. This enables computing the reconstruction losses using the original images and sparse depth maps, eliminating the pitfalls of naive loss computation on the augmented inputs and allowing us to scale up augmentations to boost performance. We demonstrate our method on indoor (VOID) and outdoor (KITTI) datasets, where we consistently improve upon recent methods across both datasets as well as generalization to four other datasets. Code available at: https://github.com/alexklwong/augundo.
We integrate sparse radar data into a monocular depth estimation model and introduce a novel preprocessing method for reducing the sparseness and limited field of view provided by radar. We explore the intrinsic error of different radar modalities and show our proposed method results in more data points with reduced error. We further propose a novel method for estimating dense depth maps from monocular 2D images and sparse radar measurements using deep learning based on the deep ordinal regression network by Fu et al. Radar data are integrated by first converting the sparse 2D points to a height-extended 3D measurement and then including it into the network using a late fusion approach. Experiments are conducted on the nuScenes dataset. Our experiments demonstrate state-of-the-art performance in both day and night scenes.
Depth completion involves predicting dense depth maps from sparse LiDAR inputs. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. In this paper, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images, which distills geometric knowledge to depth completion. Specifically, we simulate LiDAR scans by utilizing monocular depth and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Besides, monocular depth estimation suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage, we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scales when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results demonstrate that models trained with our two-stage distillation framework achieve state-of-the-art performance, ranking first place on the KITTI benchmark. Code is available at https://github.com/Sharpiless/DMD3C
Estimating scene geometry from data obtained with cost-effective sensors is key for robots and self-driving cars. In this paper, we study the problem of predicting dense depth from a single RGB image (monodepth) with optional sparse measurements from low-cost active depth sensors. We introduce Sparse Auxiliary Networks (SANs), a new module enabling monodepth networks to perform both the tasks of depth prediction and completion, depending on whether only RGB images or also sparse point clouds are available at inference time. First, we decouple the image and depth map encoding stages using sparse convolutions to process only the valid depth map pixels. Second, we inject this information, when available, into the skip connections of the depth prediction network, augmenting its features. Through extensive experimental analysis on one indoor (NYUv2) and two outdoor (KITTI and DDAD) benchmarks, we demonstrate that our proposed SAN architecture is able to simultaneously learn both tasks, while achieving a new state of the art in depth prediction by a significant margin.
In this paper, we explore the possibility of achieving a more accurate depth estimation by fusing monocular images and Radar points using a deep neural network. We give a comprehensive study of the fusion between RGB images and Radar measurements from different aspects and proposed a working solution based on the observations. We find that the noise existing in Radar measurements is one of the main key reasons that prevents one from applying the existing fusion methods developed for LiDAR data and images to the new fusion problem between Radar data and images. The experiments are conducted on the nuScenes dataset, which is one of the first datasets which features Camera, Radar, and LiDAR recordings in diverse scenes and weather conditions. Extensive experiments demonstrate that our method outperforms existing fusion methods. We also provide detailed ablation studies to show the effectiveness of each component in our method.
Accurate and dense scene depth perception is critical for applications such as autonomous driving and robotic navigation. However, due to the limited geometric cues provided by inherently sparse depth data acquired from sensors, significant challenges remain in completing depth information by integrating monocular RGB images to reconstruct object depth in a coherent 3D space. Traditional data augmentation strategies lack geometric awareness, often causing depth discontinuities at the foreground-background boundaries, leading to edge blurring and artifacts that distort geometric relationships in mixed regions. Additionally, mainstream depth estimation frameworks focus too much on global features, making it difficult to model the complex spatial relationships between foreground objects and the background, resulting in the loss of foreground details and ambiguity in background depth. To address these challenges, we propose a hierarchical geometric-aware depth completion network (HGAN) that consists of two key modules: the geometric consistency-aware enhancement module (GCAM) and the geometric relationship decomposition modeling module (GRDM). Specifically, the GCAM constructs a geometric consistency map between the foreground and background regions based on depth similarity and employs adaptive weights to guide foreground-background feature fusion. This enhances the boundary modeling capabilities, significantly improving the structural clarity and continuity of the depth map. The GRDM introduces a geometric relationship decomposition mechanism that explicitly separates depth feature mapping into two orthogonal subspaces: Range Space and Null Space. The Range Space models global scene consistency constraints, ensuring the structural coherence of depth estimation, whereas the Null Space focuses on reconstructing local residual details, effectively enhancing the perception of foreground object edges and fine details. The experimental results show that our method outperforms previous approaches in terms of both efficiency and accuracy on the KITTI and NYU-Depth V2 datasets, with HGAN reducing RMSE by 7.5% on KITTI and 2.3% on NYUv2 compared to CompletionFormer.
Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning-based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in the OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released in https://github.com/shenglunch/PSD.
The problem of depth completion involves predicting a dense depth image from a single sparse depth map and an RGB image. Unsupervised depth completion methods have been proposed for various datasets where ground truth depth data is unavailable and supervised methods cannot be applied. However, these models require auxiliary data to estimate depth values, which is far from real scenarios. Monocular depth estimation (MDE) models can produce a plausible relative depth map from a single image, but there is no work to properly combine the sparse depth map with MDE for depth completion; a simple affine transformation to the depth map will yield a high error since MDE are inaccurate at estimating depth difference between objects. We introduce StarryGazer, a domain-agnostic framework that predicts dense depth images from a single sparse depth image and an RGB image without relying on ground-truth depth by leveraging the power of large MDE models. First, we employ a pre-trained MDE model to produce relative depth images. These images are segmented and randomly rescaled to form synthetic pairs for dense pseudo-ground truth and corresponding sparse depths. A refinement network is trained with the synthetic pairs, incorporating the relative depth maps and RGB images to improve the model's accuracy and robustness. StarryGazer shows superior results over existing unsupervised methods and transformed MDE results on various datasets, demonstrating that our framework exploits the power of MDE models while appropriately fixing errors using sparse depth information.
Dense depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, for guiding the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1\% compared to dense-supervised methods. RaCalNet is composed of two key modules. The Radar Recalibration module performs radar point screening and pixel-wise displacement refinement, producing accurate and reliable depth priors from sparse radar inputs. These priors are then used by the Metric Depth Optimization module, which learns to infer scene-level scale priors and fuses them with monocular depth predictions to achieve metrically accurate outputs. This modular design enhances structural consistency and preserves fine-grained geometric details. Despite relying solely on sparse supervision, RaCalNet produces depth maps with clear object contours and fine-grained textures, demonstrating superior visual quality compared to state-of-the-art dense-supervised methods. Quantitatively, it achieves performance comparable to existing methods on the ZJU-4DRadarCam dataset and yields a 34.89\% RMSE reduction in real-world deployment scenarios. We plan to gradually release the code and models in the future at https://github.com/818slam/RaCalNet.git.
We present a novel algorithm for self-supervised monocular depth completion. Our approach is based on training a neural network that requires only sparse depth measurements and corresponding monocular video sequences without dense depth labels. Our self-supervised algorithm is designed for challenging indoor environments with textureless regions, glossy and transparent surfaces, moving people, longer and diverse depth ranges and scenes captured by complex ego-motions. Our novel architecture leverages both deep stacks of sparse convolution blocks to extract sparse depth features and pixel-adaptive convolutions to fuse image and depth features. We compare with existing approaches in NYUv2, KITTI and NAVERLABS indoor datasets, and observe 5-34% reductions in root-mean-square error (RMSE).
Obtaining three-dimensional information, especially the z-axis depth information, is crucial for robotic micromanipulation. Due to the unavailability of depth sensors such as lidars in micromanipulation setups, traditional depth acquisition methods such as depth from focus or depth from defocus directly infer depth from microscopic images and suffer from poor resolution. Alternatively, micromanipulation tasks obtain accurate depth information by detecting the contact between an end-effector and an object (e.g., a cell). Despite its high accuracy, only sparse depth data can be obtained due to its low efficiency. This paper aims to address the challenge of acquiring dense depth information during robotic cell micromanipulation. A weakly-supervised depth completion network is proposed to take cell images and sparse depth data obtained by contact detection as input to generate a dense depth map. A two-stage data augmentation method is proposed to augment the sparse depth data, and the depth map is optimized by a network refinement method. The experimental results show that the MAE value of the depth prediction error is less than 0.3 µm, which proves the accuracy and effectiveness of the method. This deep learning network pipeline can be seamlessly integrated with the robotic micromanipulation tasks to provide accurate depth information.
Over the past few years, monocular depth estimation and completion have been paid more and more attention from the computer vision community because of their widespread applications. In this paper, we introduce novel physics (geometry)-driven deep learning frameworks for these two tasks by assuming that 3D scenes are constituted with piece-wise planes. Instead of directly estimating the depth map or completing the sparse depth map, we propose to estimate the surface normal and plane-to-origin distance maps or complete the sparse surface normal and distance maps as intermediate outputs. To this end, we develop a normal-distance head that outputs pixel-level surface normal and distance. Afterthat, the surface normal and distance maps are regularized by a developed plane-aware consistency constraint, which are then transformed into depth maps. Furthermore, we integrate an additional depth head to strengthen the robustness of the proposed frameworks. Extensive experiments on the NYU-Depth-v2, KITTI and SUN RGB-D datasets demonstrate that our method exceeds in performance prior state-of-the-art monocular depth estimation and completion competitors.
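The conversion this abstract mentions, from per-pixel surface normal and plane-to-origin distance to depth, follows directly from the pinhole model: for a pixel ray K^-1 [u, v, 1]^T lying on a local plane with unit normal n and distance rho, the depth is z = rho / (n . K^-1 [u, v, 1]^T). The sketch below is a generic implementation of that geometry (sign conventions vary), not the paper's code.

```python
# Hedged sketch: convert normal + plane-to-origin distance maps to depth under
# a pinhole camera model (z = rho / (n . K^-1 [u, v, 1]^T)); sign handling is
# simplified by taking the absolute value of the denominator.
import numpy as np

def depth_from_normal_distance(normal, distance, K):
    """normal: HxWx3 unit normals; distance: HxW plane-to-origin distances."""
    h, w = distance.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    rays = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = rays @ np.linalg.inv(K).T            # back-projected rays with z = 1
    denom = np.sum(normal * rays, axis=-1)      # n . K^-1 [u, v, 1]
    return distance / np.clip(np.abs(denom), 1e-6, None)
```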
Monocular depth estimation is essential for applications such as autonomous navigation and 3D reconstruction. However, achieving accurate and temporally consistent depth estimation in dynamic environments remains challenging due to scale ambiguity, sensitivity to dynamic objects, and inconsistent depth predictions. Traditional SLAM-based methods ensure global consistency but perform poorly in dynamic scenes, while deep learning-based approaches suffer from the absence of absolute scale and temporal stability. To address these issues, we propose a Real-Time Consistent Monocular Depth Recovery System that combines ORB-SLAM3 for sparse depth initialization, a ViT-based depth completion network, and a motion segmentation module to improve robustness in dynamic environments. Additionally, we introduce a dual-weight fusion module that adaptively balances RGB semantic features and geometric depth priors, ensuring high accuracy and consistency. Our system jointly optimizes both static and dynamic regions to produce globally scale-consistent dense depth maps with improved temporal stability. Extensive experiments on benchmark datasets demonstrate that our approach outperforms existing methods in terms of depth accuracy, temporal consistency, and robustness in dynamic scenes, while maintaining real-time performance.
With the advent of the Industry 4.0 era and the increasing performance requirements for AR/VR applications and vision assistance and inspection systems in recent years, visual simultaneous localization and mapping (vSLAM) is a fundamental task in computer vision and robotics. However, traditional vSLAM systems are limited by the camera’s narrow field-of-view, resulting in challenges such as sparse feature distribution and lack of dense depth information. To overcome these limitations, this paper proposes a 360ORB-SLAM system for panoramic images that combines with a depth completion network. The system extracts feature points from the panoramic image, utilizes a panoramic triangulation module to generate sparse depth information, and employs a depth completion network to obtain a dense panoramic depth map. Experimental results on our novel panoramic dataset constructed based on Carla demonstrate that the proposed method achieves superior scale accuracy compared to existing monocular SLAM methods and effectively addresses the challenges of feature association and scale ambiguity. The integration of the depth completion network enhances system stability and mitigates the impact of dynamic elements on SLAM performance.
In this paper, we formulate a potentially valuable panoramic depth completion (PDC) task as panoramic 3D cameras often produce 360° depth with missing data in complex scenes. Its goal is to recover dense panoramic depths from raw sparse ones and panoramic RGB images. To deal with the PDC task, we train a deep network that takes both depth and image as inputs for the dense panoramic depth recovery. However, it needs to face a challenging optimization problem of the network parameters due to its non-convex objective function. To address this problem, we propose a simple yet effective approach termed M³PT: multi-modal masked pre-training. Specifically, during pre-training, we simultaneously cover up patches of the panoramic RGB image and sparse depth by shared random mask, then reconstruct the sparse depth in the masked regions. To our best knowledge, it is the first time that we show the effectiveness of masked pre-training in a multi-modal vision task, instead of the single-modal task resolved by masked autoencoders (MAE). Different from MAE where fine-tuning completely discards the decoder part of pre-training, there is no architectural difference between the pre-training and fine-tuning stages in our M³PT as they only differ in the prediction density, which potentially makes the transfer learning more convenient and effective. Extensive experiments verify the effectiveness of M³PT on three panoramic datasets. Notably, we improve the state-of-the-art baselines by averagely 26.2% in RMSE, 51.7% in MRE, 49.7% in MAE, and 37.5% in RMSElog on three benchmark datasets.
We present Consistent3D, a training-free framework for high-fidelity 3D reconstruction from sparse views, leveraging point-cloud-guided video diffusion model to generate geometrically consistent novel views for 3D Gaussian Splatting. To address the severe multi-view inconsistency commonly observed in diffusion-based novel view synthesis, we introduce a geometry-guided pipeline. Specifically, we leverage the point cloud-estimated from sparse input images-as 3D priors to guide the video diffusion process, enabling both geometrically plausible and frame-consistent novel view synthesis. Furthermore, to achieve geometrically consistent depth, we introduce a Local Depth Alignment (LDA) strategy that adjusts monocular depth estimates into a scale-aware representation. This strategy is performed relative to the sparse input point cloud prior, resolving the inherent scale ambiguity in monocular depth prediction. We then propose a consistency evaluation module that computes 2D reprojection error and 3D depth discrepancy across views, yielding confidence scores. These views with high confidence scores are fused into a regularized 3D Gaussian Splatting (3DGS) pipeline, where parameters of Gaussians are optimized under confidenceweighted RGB and depth constraints. By enforcing 3D consistency through point-cloud-guided video diffusion model and a comprehensive confidence-weighted 3DGS optimization-which integrates depth alignment, confidence prediction, incremental fusion, soft constraint optimization, and real-time visual refinement-Consistent3D achieves high-fidelity and visually smooth 3DGS reconstruction from sparse observations without requiring model training. Our framework effectively bridges the generative power of diffusion models with the representational efficiency of 3DGS, delivering geometrically consistent and visually compelling 3D scenes.
We present a novel approach for metric dense depth estimation based on the fusion of a single-view image and a sparse, noisy Radar point cloud. The direct fusion of heterogeneous Radar and image data, or their encodings, tends to yield dense depth maps with significant artifacts, blurred boundaries, and suboptimal accuracy. To circumvent this issue, we learn to augment versatile and robust monocular depth prediction with the dense metric scale induced from sparse and noisy Radar data. We propose a Radar-Camera framework for highly accurate and fine-detailed dense depth estimation with four stages, including monocular depth prediction, global scale alignment of monocular depth with sparse Radar points, quasi-dense scale estimation through learning the association between Radar points and image patches, and local scale refinement of dense depth using a scale map learner. Our proposed method significantly outperforms the state-of-the-art Radar-Camera depth estimation methods by reducing the mean absolute error (MAE) of depth estimation by 25.6% and 40.2% on the challenging nuScenes dataset and our self-collected ZJU-4DRadarCam dataset, respectively. Our code and dataset will be released at https://github.com/MMOCKING/RadarCam-Depth.
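The global scale-alignment stage described above can be pictured as a single robust scale fit between the monocular prediction and the projected radar returns; the sketch below uses a median of per-point ratios, which is an assumption about the estimator, not the paper's exact choice:

```python
import numpy as np

def global_scale_align(pred_depth, radar_uv, radar_z):
    """Scale a monocular depth map so it agrees with sparse radar depths.

    pred_depth: (H, W); radar_uv: (N, 2) integer pixel coords; radar_z: (N,) metric depths.
    """
    u, v = radar_uv[:, 0].astype(int), radar_uv[:, 1].astype(int)
    valid = (radar_z > 0) & (pred_depth[v, u] > 1e-6)
    scale = np.median(radar_z[valid] / pred_depth[v, u][valid])
    return scale * pred_depth
```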
Monocular 3D Semantic Scene Completion (SSC) is a challenging yet promising task that aims to infer dense geometric and semantic descriptions of a scene from a single image. While recent object-centric paradigms significantly improve efficiency by leveraging flexible 3D Gaussian primitives, they still rely heavily on a large number of randomly initialized primitives, which inevitably leads to 1) inefficient primitive initialization and 2) outlier primitives that introduce erroneous artifacts. In this paper, we propose SplatSSC, a novel framework that resolves these limitations with a depth-guided initialization strategy and a principled Gaussian aggregator. Instead of random initialization, SplatSSC utilizes a dedicated depth branch composed of a Group-wise Multi-scale Fusion (GMF) module, which integrates multi-scale image and depth features to generate a sparse yet representative set of initial Gaussian primitives. To mitigate noise from outlier primitives, we develop the Decoupled Gaussian Aggregator (DGA), which enhances robustness by decomposing geometric and semantic predictions during the Gaussian-to-voxel splatting process. Complemented with a specialized Probability Scale Loss, our method achieves state-of-the-art performance on the Occ-ScanNet dataset, outperforming prior approaches by over 6.3% in IoU and 4.1% in mIoU, while reducing both latency and memory cost by more than 9.3%.
Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, serving as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.
Robust three-dimensional scene understanding is now an ever-growing area of research highly relevant in many real-world applications such as autonomous driving and robotic navigation. In this paper, we propose a multi-task learning-based model capable of performing two tasks: sparse depth completion (i.e., generating complete dense scene depth given a sparse depth image as the input) and monocular depth estimation (i.e., predicting scene depth from a single RGB image) via two sub-networks jointly trained end-to-end using data randomly sampled from a publicly available corpus of synthetic and real-world images. The first sub-network generates a sparse depth image by learning lower-level features from the scene, and the second predicts a full dense depth image of the entire scene, leading to a better geometric and contextual understanding of the scene and, as a result, superior performance of the approach. The entire model can be used to infer complete scene depth from a single RGB image, or the second network can be used alone to perform depth completion given a sparse depth input. Using adversarial training, a robust objective function, a deep architecture relying on skip connections, and a blend of synthetic and real-world training data, our approach is capable of producing superior, high-quality scene depth. Extensive experimental evaluation demonstrates the efficacy of our approach compared to contemporary state-of-the-art techniques across both problem domains.
In the context of self-driving cars, Convolutional Neural Networks (CNNs) have improved the Single Image Depth Estimation (SIDE) field by predicting maps with accurate depth information for autonomous navigation. However, these networks are generally trained with sparse samples, generated by LIDAR laser scans, and they run at a high computational cost, demanding powerful GPUs. In this paper, we address the SIDE and depth completion tasks jointly, focusing on the design of a lightweight method to be applied in real self-driving scenarios. We introduce a fast and efficient densification algorithm, based on closing morphology, and we also propose a deep network pipeline that uses the densified reference depth maps for training. Compared to state-of-the-art methods, our network has fewer parameters, higher inference speed, and yet comparable accuracy. We conduct a series of experiments on the widely exploited and publicly available KITTI Depth Benchmark.
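In the spirit of the closing-based densification mentioned above, a sparse projected LiDAR depth map can be densified with standard morphology; the kernel sizes and the hole-filling step below are illustrative assumptions rather than the paper's exact algorithm:

```python
import cv2
import numpy as np

def densify_sparse_depth(sparse_depth, kernel_size=5):
    """Densify a projected LiDAR depth map (0 = no return) via morphological closing."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dense = cv2.morphologyEx(sparse_depth.astype(np.float32), cv2.MORPH_CLOSE, kernel)
    # Roughly fill any pixels still empty with a large-kernel blur of the result.
    still_empty = dense == 0
    dense[still_empty] = cv2.blur(dense, (31, 31))[still_empty]
    return dense
```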
Pixel-wise depth completion using multi-sensor fusion is crucial in areas such as autonomous driving. While LiDAR-image fusion methods are generally reliable, they can face challenges in adverse weather conditions, such as rain and fog. In contrast, mmWave radar, which has emerged in recent years, offers stronger anti-interference capability. However, radar points are typically highly sparse, and mmWave radar has lower resolution in the height dimension, leading to increased errors when projected onto the image plane. To address these problems, this paper proposes a two-stage fusion convolutional neural network. In the first stage, image features are utilized to filter the noisy radar point cloud and learn the mapping of radar points to image regions. In the second stage, we perform multiscale fusion of the image with the coarse depth map generated in the first stage to predict the missing depth values. Experimental results indicate that our improved strategy reduces the error of depth value estimation. Our network shows a 4.5% improvement in RMSE (root-mean-square error) compared to the previous method.
Perceiving the 3D structure of a spacecraft is a prerequisite for successfully executing many on-orbit space missions, and it can provide critical input for many downstream vision algorithms. In this paper, we propose to sense the 3D structure of spacecraft using a light detection and ranging (LIDAR) sensor and a monocular camera. To this end, the Spacecraft Depth Completion Network (SDCNet) is proposed to recover the dense depth map based on a gray image and a sparse depth map. Specifically, SDCNet decomposes the spacecraft depth completion task into a foreground segmentation subtask and a foreground depth completion subtask, which segments the spacecraft region first and then performs depth completion on the segmented foreground area. In this way, background interference with foreground spacecraft depth completion is effectively avoided. Moreover, an attention-based feature fusion module is also proposed to aggregate the complementary information between different inputs, which deduces the correlation between different features along the channel and the spatial dimension sequentially. Besides, four metrics are also proposed to evaluate object-level depth completion performance. Finally, a large-scale satellite depth completion dataset is constructed for training and testing spacecraft depth completion algorithms. Empirical experiments on the dataset demonstrate the effectiveness of the proposed SDCNet, which achieves 0.225 m mean absolute error of interest and 0.778 m mean absolute truncation error, surpassing state-of-the-art methods by a large margin. A pose estimation experiment is also conducted based on the depth completion results, and the experimental results indicate that the predicted dense depth map can meet the needs of downstream vision tasks.
In this work we augment our prior state-of-the-art visual-inertial odometry (VIO) system, OpenVINS [1], to produce accurate dense depth by filling in sparse depth estimates (depth completion) from VIO with image guidance – all while focusing on enabling real-time performance of the full VIO+depth system on embedded devices. We show that noisy depth values with varying sparsity produced from a VIO system can not only hurt the accuracy of predicted dense depth maps, but also make them considerably worse than those from an image-only depth network with the same underlying architecture. We investigate this sensitivity on both an outdoor simulated and indoor handheld RGB-D dataset, and present simple yet effective solutions to address these shortcomings of depth completion networks. The key changes to our state-of-the-art VIO system required to provide high quality sparse depths for the network while still enabling efficient state estimation on embedded devices are discussed. A comprehensive computational analysis is performed over different embedded devices to demonstrate the efficiency and accuracy of the proposed VIO depth completion system.
Current radar-vision fusion techniques struggle to fully leverage the complementary data from sparse radar points and depth-deficient images, impacting their overall effectiveness. This paper proposes RaViDeep, a novel target detection method based on deep fusion of millimeter-wave radar and monocular image. Initially, a Semantic-based Point Cloud Registration (SPCR) module combines image semantics, radar hierarchical features, and Doppler data to enhance target spatial representation, yielding dense and stable semantic radar points. Subsequently, a Radar-Guided Depth Estimation (RGDE) module with Gaussian enhancement is introduced, utilizing accurate radar depth measurements to guide image depth estimation. This approach fosters a comprehensive scene understanding and effectively reduces measurement and calibration errors. Finally, a pseudo point cloud, generated by the estimated image depth, is integrated with the semantic radar points to facilitate target detection. Tailored for autonomous berthing tasks in wharf scenarios, a novel Non-Occupied Overlap (NOO) metric is developed. Experimental results demonstrate that RaViDeep surpasses state-of-the-art methods, achieving a 12.10% improvement in the NOO metric and a 13.90% improvement in recall. These results verify the superior performance and robustness of our method in practical wharf scenarios.
Point cloud registration is crucial for autonomous vehicle Radar-based localization, mapping, and perception. Compared to conventional Radar, 4-D Radar offers enhanced elevation information, facilitating the generation of 3-D point clouds akin to LiDAR. However, the inherent sparsity and high noise level of 4-D Radar data make direct adaptation of LiDAR-based registration methods ineffective. To address these challenges, we propose the Radar voxelized generalized iterative closest point (RVGICP) algorithm for efficient and rapid point cloud registration. First, we introduce a dual-stage initial pose estimation module. To robustly handle large interscan displacements, it leverages Doppler velocity for an initial translation estimate. Complementing this, an efficient rotation estimation method based on point cloud histograms is proposed to accelerate convergence and mitigate the risk of local minima. Second, we extend the voxelized generalized iterative closest point algorithm by performing voxelization in polar coordinates, enabling adaptive resolution for voxel matching. Third, a 3³-neighbors set-based voxel covariance estimation model is proposed, which better characterizes local geometric properties and avoids computationally expensive nearest-neighbor search. Extensive experiments on both open-source datasets and our experimental vehicle demonstrate that RVGICP achieves superior accuracy and robustness across various weather conditions while maintaining high computational efficiency.
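The Doppler-based initial translation estimate mentioned above can be sketched as a least-squares fit of the ego velocity to the radial velocities of (assumed static) radar points, multiplied by the known scan interval; this is an illustrative reconstruction, not the paper's implementation:

```python
import numpy as np

def doppler_initial_translation(points_xyz, doppler_v, dt):
    """Seed the inter-scan translation from Doppler readings.

    points_xyz: (N, 3) radar points in the sensor frame; doppler_v: (N,) radial
    velocities; dt: time between scans. Assumes the scene is mostly static.
    """
    dirs = points_xyz / np.linalg.norm(points_xyz, axis=1, keepdims=True)
    # Static points satisfy  doppler_v ~= -dirs @ v_ego; solve for v_ego.
    v_ego, *_ = np.linalg.lstsq(-dirs, doppler_v, rcond=None)
    return v_ego * dt  # initial translation guess for registration
```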
mmReg: Centimeter-Level and Real-Time mmWave Radar Point Cloud Registration for Multivehicle Sensing
Multivehicle collaborative sensing has emerged as a new paradigm to boost the safety of autonomous vehicles. The cornerstone of this vision is the real-time and accurate registration of mmWave radar point clouds among multiple vehicles. To accomplish this, we design mmReg, an innovative system capable of achieving centimeter-level and real-time sensing fusion between vehicles. mmReg consists of three major components: 1) an SAR imaging-driven point cloud generation component that leverages SAR imaging to turn sparse and disordered radar point clouds into high-quality point clouds; 2) a motion-aware frame synchronization component that achieves the spatiotemporal alignment of point clouds between vehicles, effectively mitigating the impact of asynchronous radar frames; and 3) a shared object-based registration component that captures and understands the unique global position of shared objects, supporting real-time and accurate registration. We implement and evaluate mmReg on CARLA and real-world campus datasets. The results demonstrate that mmReg can improve the vehicle’s sensing range by 117% within an average of 99.91 ms, achieving a 4.82× improvement in accuracy.
The advancement of autonomous driving technology has driven a surge of applications in urban environments, where the precision of keypoint-based radar point cloud registration plays a crucial role in determining the overall performance of these applications. However, numerous inaccurate keypoints can be detected in the presence of multipath, beam spread, and noise, reducing the registration accuracy of the radar point cloud. To deal with this problem, we propose an improved keypoint detection method, in which a novel candidate keypoint selection strategy and a threshold selection strategy are designed. The results of experiments based on a public radar dataset demonstrate that the proposed method can effectively reduce the number of inaccurate keypoints. Moreover, it achieves high precision in radar point cloud registration compared to the state-of-the-art method.
Partial point cloud registration plays a crucial role in computer vision and has widespread applications in 3D map construction, pose estimation, and high-precision localization. However, the collected point clouds often contain missing data due to hardware limitations and complex environments. Various partial registration algorithms have been proposed, most of which rely on estimating overlap regions. However, a significant proportion of these algorithms rely heavily on ground truth labels. Manual labeling is both time-consuming and labor-intensive, whereas algorithmic automatic labeling lacks sufficient accuracy. To tackle this issue, we present PSEudo Label learning for unsupervised partial point cloud registration (PSEL). This method utilizes complementary tasks to learn reliable pseudo labels for overlap regions and correspondences without depending on ground truth labels. The key idea is to use the complementarity between overlap estimation and registration to generate two types of pseudo labels based on the nearest points in pairs of aligned point clouds. These pseudo labels are then employed to supervise the learning of overlap regions and correspondences, gradually enhancing their accuracy throughout the learning process and ultimately establishing an unsupervised learning framework. PSEL consists of an overlap estimation module and a correspondence filtering module. The pseudo labels generated after registration are used to supervise both modules. Notably, the correspondence filtering module has two pipelines. The similarity and difference of the corresponding point features are used to eliminate false correspondences during the training and inference stages, respectively, with only the latter being optimized with pseudo labels. To validate the effectiveness of our registration method, we conducted experiments using the synthetic dataset ModelNet40, the indoor dataset 3DMatch, and the outdoor dataset KITTI.
No abstract available
Radar point cloud registration is vital for various applications in autonomous driving, robotics, and localization. However, existing methods have not adequately accounted for weakly structured texture scenes that are prevalent in urban environments, such as long-distance monotonous walls and open roads. Failures of point cloud registration in these scenarios can significantly affect the performance of the aforementioned applications. In order to improve registration accuracy in weakly structured texture scenes of urban environments, we propose a novel mismatch elimination method and an improved sample consensus registration method. The experimental results demonstrate that the proposed method achieves higher registration accuracy in weakly structured texture scenes than the compared methods.
No abstract available
Segmenting objects in an environment is a crucial task for autonomous driving and robotics, as it enables a better understanding of the surroundings of each agent. Although camera sensors provide rich visual details, they are vulnerable to adverse weather conditions. In contrast, radar sensors remain robust under such conditions, but often produce sparse and noisy data. Therefore, a promising approach is to fuse information from both sensors. In this work, we propose a novel framework to enhance camera-only baselines by integrating a diffusion model into a camera-radar fusion architecture. We leverage radar point features to create pseudo-masks using the Segment-Anything model, treating the projected radar points as point prompts. Additionally, we propose a noise reduction unit to denoise these pseudo-masks, which are further used to generate inpainted images that complete the missing information in the original images. Our method improves the camera-only segmentation baseline by 2.63% in mIoU and enhances our camera-radar fusion architecture by 1.48% in mIoU on the Waterscenes dataset. This demonstrates the effectiveness of our approach for semantic segmentation using camera-radar fusion under adverse weather conditions.
Scene flow estimation using 4D millimeter-wave radar has emerged as a prominent research focus for 3D dynamic perception. However, compared to LiDAR point clouds, the drastic sparsity of radar point clouds poses challenges in enforcing local rigidity constraints, which are crucial for accurate 3D motion estimation. To address this issue, we propose a novel Gaussian-based pseudo-point generation method that fully leverages two distinct yet complementary data modalities, 3D coordinates and Doppler velocity, to support multi-body rigidity assumptions, effectively capturing fine-grained and structured motion patterns from highly sparse radar point clouds. Furthermore, a velocity calibration mechanism is designed to improve the reliability of fine-grained rigid motion velocity estimation. In addition, a progressive fusion strategy is introduced to systematically integrate fine-grained rigid motion priors at multiple levels, enhancing the robustness of matching costs and motion features while effectively compensating for coarse flows. Experimental results on real-world radar scans from the View-of-Delft (VoD) dataset demonstrate the promising performance of our FGRFlow compared to other leading 4D radar-based approaches, validating the advantages of our design choices.
Deep point cloud registration methods encounter challenges due to partial overlaps and are heavily reliant on labeled data. In this paper, we propose ALCReg, an active label correction method for partial point cloud registration learning. ALCReg utilises a multimodal approach to generate pseudo labels, mitigating the cold-start issue in active learning. To ensure the diversity and representativeness of selected samples, we propose an inlier ratio based query strategy for manual correction. Furthermore, an innovative self-correction mechanism based on consistency is introduced, allowing the model to refine pseudo labels autonomously and further improve model performance. Experimental results on the 3DMatch and 3DLoMatch datasets demonstrate that ALCReg achieves comparable performance with the fully-supervised registration methods, even with only 5% of labeled samples, making it the first active learning method tailored for partial point cloud registration. Code is available at https://github.com/Jiang0903/ALCReg.
With the rapid development of intelligent vehicle technology, 3D object detection and tracking play a crucial role in the field of autonomous driving. This article studies in depth the point cloud and image fusion 3D object detection and tracking algorithms used in autonomous driving. LiDAR and surround-view cameras are used as perception sensors, and, based on deep learning theory and methods, the focus is on overcoming the difficulties of multi-modal data fusion for 3D detection and tracking under complex traffic conditions. This article proposes a series of innovative algorithms, including the MaskSensing algorithm based on image instance segmentation, the DeformFusion algorithm based on the Transformer architecture, the MixFusion algorithm with a hybrid fusion strategy, and the DeepTrack3D algorithm. These algorithms have achieved significant results on datasets such as nuScenes, effectively improving the accuracy and robustness of 3D object detection and tracking. In the future, further research is needed in areas such as temporal fusion, interactive fusion, and unsupervised learning to enhance the performance of autonomous driving technology.
In recent years, implicit functions have drawn attention in the field of 3D reconstruction and have successfully been applied with Deep Learning. However, for incremental reconstruction, implicit function-based registration has rarely been explored. Inspired by the high precision of deep learning global feature registration, we propose to combine it with distance fields. We generalize the algorithm to a non-Deep Learning setting while retaining the accuracy. Our algorithm is more accurate than conventional models and, without any training, achieves competitive performance and faster speed compared to Deep Learning-based registration models. The implementation is available on GitHub for the research community.
Point cloud registration, a fundamental task in 3D vision, has achieved remarkable success with learning-based methods in outdoor environments. Unsupervised outdoor point cloud registration methods have recently emerged to circumvent the need for costly pose annotations. However, they fail to establish reliable optimization objectives for unsupervised training, either relying on overly strong geometric assumptions, or suffering from poor-quality pseudo-labels due to inadequate integration of low-level geometric and high-level contextual information. We have observed that in the feature space, latent new inlier correspondences tend to cluster around respective positive anchors that summarize features of existing inliers. Motivated by this observation, we propose a novel unsupervised registration method termed INTEGER to incorporate high-level contextual information for reliable pseudo-label mining. Specifically, we propose the Feature-Geometry Coherence Mining module to dynamically adapt the teacher for each mini-batch of data during training and discover reliable pseudo-labels by considering both high-level feature representations and low-level geometric cues. Furthermore, we propose Anchor-Based Contrastive Learning to facilitate contrastive learning with anchors for a robust feature space. Lastly, we introduce a Mixed-Density Student to learn density-invariant features, addressing challenges related to density variation and low overlap in the outdoor scenario. Extensive experiments on KITTI and nuScenes datasets demonstrate that our INTEGER achieves competitive performance in terms of accuracy and generalizability.
Based on the spatial feature association of point clouds from different 3D laser scanning data and the characteristics of regular plane features in urban structured scenes, a low-overlap structured point cloud registration method based on plane feature constraints is proposed in this paper. Firstly, a planar feature descriptor is proposed based on the distribution and density of points in the plane feature extraction, to reduce the influence of broken planes and pseudo planes on registration accuracy. Furthermore, a registration method based on the planar feature constraints is designed. The method incorporates a geometric feature descriptor based on planar information and the mathematical equation of the feature plane to solve the spatial transformation matrix iteratively. To verify its correctness and effectiveness, the proposed method is compared with three classic registration algorithms on structured scene data. The experimental results demonstrate that the proposed registration method achieves satisfactory results and is superior to the other algorithms in both accuracy and efficiency.
This study introduces a pioneering multimodal fusion framework to enhance near-field 3D Synthetic Aperture Radar (SAR) imaging, crucial for applications like radar cross-section measurement and concealed object detection. Traditional near-field 3D SAR imaging struggles with issues like target–background confusion due to clutter and multipath interference, shape distortion from high sidelobes, and lack of color and texture information, all of which impede effective target recognition and scattering diagnosis. The proposed approach presents the first known application of multimodal fusion in near-field 3D SAR imaging, integrating LiDAR and optical camera data to overcome its inherent limitations. The framework comprises data preprocessing, point cloud registration, and data fusion, where registration between multi-sensor data is the core of effective integration. Recognizing the inadequacy of traditional registration methods in handling varying data formats, noise, and resolution differences, particularly between near-field 3D SAR and other sensors, this work introduces a novel three-stage registration process to effectively address these challenges. First, the approach designs a structure–intensity-constrained centroid distance detector, enabling key point extraction that reduces heterogeneity and accelerates the process. Second, a sample consensus initial alignment algorithm with SHOT features and geometric relationship constraints is proposed for enhanced coarse registration. Finally, the fine registration phase employs adaptive thresholding in the iterative closest point algorithm for precise and efficient data alignment. Both visual and quantitative analyses of measured data demonstrate the effectiveness of our method. The experimental results show significant improvements in registration accuracy and efficiency, laying the groundwork for future multimodal fusion advancements in near-field 3D SAR imaging.
The registration of laser point clouds in complex conditions in wilderness scenes is an important aspect in the research field of autonomous vehicle navigation. It serves as the foundation for solving problems such as environment reconstruction, map construction, navigation and positioning, and pose estimation during the motion process of autonomous vehicles using laser radar sensors. Due to the sparse structured features, uneven point cloud density, and high noise levels in wilderness scenes, achieving reliable and accurate point cloud registration is challenging. In this paper, we propose a semantic-supervised sparse point cloud registration network (S3PCRNet) aiming to achieve effective registration of laser point clouds in wilderness large-scale scenes. Firstly, a local feature aggregation module is designed to extract the local structural features of the point cloud. Then, based on rotation position encoding, a randomly grouped self-attention mechanism is proposed to obtain the global features of the point cloud through learning. A semantic information weight matrix is calculated to filter out negligible points. Subsequently, a semantic fusion feature module is utilised to find reliable correspondences between point clouds. Finally, the proposed method is trained and evaluated on both the RELLIS-3D dataset and a self-made Off-road-3D dataset.
Point cloud registration is a technology that aligns point cloud data from different viewpoints by computing coordinate transformations to integrate them into a specified coordinate system. Many cutting-edge fields, including autonomous driving, industrial automation, and augmented reality, require the registration of point cloud data generated by millimeter-wave radar for map reconstruction and path planning in 3D environments. This paper proposes a novel point cloud registration algorithm based on a weighting strategy to enhance the accuracy and efficiency of point cloud registration in 3D environments. This method combines a statistical weighting strategy with a point cloud registration algorithm, which can improve registration accuracy while also increasing computational efficiency. First, in 3D indoor spaces, we apply PointNet to the semantic segmentation of the point cloud. We then propose an objective weighting strategy to assign different weights to the segmented parts of the point cloud. The Iterative Closest Point (ICP) algorithm uses these weights as reference values to register the entire 3D indoor space’s point cloud. We also show a new way to perform nonlinear calculations that yield exact closed-form answers for the ICP algorithm in generalized 3D measurements. We test the proposed algorithm’s accuracy and efficiency by registering point clouds on public datasets of 3D indoor spaces. The results show that it works better in both qualitative and quantitative assessments.
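One way to picture how per-class weights enter the alignment is a weighted closed-form rigid fit on the current correspondences, as in the sketch below (the weighting strategy and the paper's generalized closed-form solution are not reproduced here):

```python
import numpy as np

def weighted_rigid_align(src, dst, w):
    """One weighted alignment step given correspondences src[i] <-> dst[i].

    src, dst: (N, 3) corresponding points; w: (N,) per-point weights,
    e.g. derived from semantic classes.
    """
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    # Weighted cross-covariance; its SVD gives the optimal rotation.
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t
```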
No abstract available
In the landscape of recent technological advancements, the advent of 4-D millimeter-wave radar has ushered in a new era of data quality improvements, showcasing the potential to rival, or even surpass, Lidar systems. Despite its innovative prowess, the lower density and accuracy of 4-D millimeter-wave radar’s point clouds, in comparison to those generated by Lidar, pose significant limitations to the technology’s broader application. Addressing these constraints, our research introduces a comprehensive, end-to-end methodology for augmenting point cloud data through a fusion of monocular camera imagery and 4-D millimeter-wave radar. First, the monocular image is transformed into a pseudo-point cloud. Subsequently, features from both the radar-generated point clouds and the pseudo-point clouds are independently extracted and merged using two distinct feature extraction modules. To refine this process further, a novel loss function is designed, taking into account the global and local feature consistency between the reconstructed point cloud and the raw Lidar point cloud. The experimental results, particularly within the realm of object detection, illustrate a marked enhancement in point cloud quality over the baseline provided by native 4-D millimeter-wave radar outputs. Additionally, the application of this method to simultaneous localization and mapping (SLAM) demonstrates a significant improvement in accuracy, achieving a level of performance that is competitive with Lidar. Notably, the proposed method has a low computational demand, enabling real-time inference within a mere 30 ms on resource-constrained platforms such as the NVIDIA Jetson Nano 2G.
The millimeter-wave radar sensor maintains stable performance under adverse environmental conditions, making it a promising solution for all-weather perception tasks, such as outdoor mobile robotics. However, the radar point clouds are relatively sparse and contain massive ghost points, which greatly limits the development of mmWave radar technology. In this paper, we propose a novel point cloud super-resolution approach for 3D mmWave radar data, named Radar-diffusion. Our approach employs the diffusion model defined by mean-reverting stochastic differential equations (SDE). Using our proposed new objective function with supervision from corresponding LiDAR point clouds, our approach efficiently handles radar ghost points and enhances the sparse mmWave radar point clouds to dense LiDAR-like point clouds. We evaluate our approach on two different datasets, and the experimental results show that our method outperforms the state-of-the-art baseline methods in 3D radar super-resolution tasks. Furthermore, we demonstrate that our enhanced radar point cloud is capable of downstream radar point-based registration tasks.
The misaligned geometric representation between images and point clouds and the different data densities limit the performance of I2P registration. The former hinders the learning of cross-modal features, and the latter leads to low-quality 2D–3D matching. To address these challenges, we propose a novel I2P registration framework called MAC-I2P, which is composed of a modality approximation module and a cone–block–point matching strategy. By generating pseudo-RGBD images, the module mitigates geometrical misalignment and converts 2D images into 3D space. In addition, it voxelizes the point cloud so that the features of the image and the point cloud can be processed in a similar way, thereby enhancing the repeatability of cross-modal features. Taking into account the different data densities and perception ranges between images and point clouds, the cone–block–point matching relaxes the strict one-to-one matching criterion by gradually refining the matching candidates. As a result, it effectively improves the 2D–3D matching quality. Notably, MAC-I2P is supervised by multiple matching objectives and optimized in an end-to-end manner, which further strengthens the cross-modal representation capability of the model. Extensive experiments conducted on KITTI Odometry and Oxford Robotcar demonstrate the superior performance of our MAC-I2P. Our approach surpasses the current state-of-the-art (SOTA) by 8∼63.2% in relative translation error (RTE) and 19.3∼38.5% in relative rotation error (RRE). The ablation experiments also confirm the effectiveness of each proposed component.
Accurate and efficient 3D mapping is essential for applications such as robotics, autonomous navigation, and augmented reality. Traditional mapping methods often rely on expensive depth sensors or sparse monocular Simultaneous Localization and Mapping (SLAM) techniques, both of which face limitations in certain environments. This paper proposes a novel pipeline that integrates neural network-based monocular depth estimation (MDE) into an RGB-D SLAM framework, enabling dense 3D mapping using only RGB images. The pipeline utilizes a neural network to infer depth maps from monocular RGB input, followed by a filtering module that ensures depth consistency based on visual odometry. The resulting dense depth maps are used alongside RGB data to generate detailed point clouds. Experimental evaluations conducted in indoor environments demonstrate that the proposed approach significantly enhances the volumetric density and geometric fidelity of 3D point cloud maps compared to traditional RGB-D SLAM systems, while maintaining a near real-time inference rate. The method also addresses key challenges such as sparsity and limited sensor range, while being modular and easy to adapt to other models and subsystems, laying the foundations for robust and cost-effective mapping solutions. Future work will explore extending the framework to outdoor environments and improving real-time performance as well as robustness to out-of-distribution scenarios.
No abstract available
In this work, we address the problem of real-time dense depth estimation from monocular images for mobile underwater vehicles. We formulate a deep learning model that fuses sparse depth measurements from triangulated features to improve the depth predictions and solve the problem of scale ambiguity. To allow prior inputs of arbitrary sparsity, we apply a dense parameterization method. Our model extends recent state-of-the-art approaches to monocular image based depth estimation, using an efficient encoder-decoder backbone and a modern lightweight transformer optimization stage to encode global context. The network is trained in a supervised fashion on the forward-looking underwater dataset, FLSea. Evaluation results on this dataset demonstrate significant improvement in depth prediction accuracy by the fusion of the sparse feature priors. In addition, without any retraining, our method achieves similar depth prediction accuracy on a downward looking dataset we collected with a diver operated camera rig, conducting a survey of a coral reef. The method achieves real-time performance, running at 24 FPS on an NVIDIA Jetson Xavier NX, 160 FPS on an NVIDIA RTX 2080 GPU, and 7 FPS on a single Intel i9-9900K CPU core, making it suitable for direct deployment on embedded GPU systems. The implementation of this work is made publicly available at https://github.com/ebnerluca/uw_depth.
Underwater infrastructure requires frequent inspection and maintenance due to harsh marine conditions. Current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Enhancing the spatial awareness of underwater vehicles is key to reducing piloting risks and enabling greater autonomy. To address these challenges, we present SPADE: SParsity Adaptive Depth Estimator, a monocular depth estimation pipeline that combines pre-trained relative depth estimator with sparse depth priors to produce dense, metric scale depth maps. Our two-stage approach first scales the relative depth map with the sparse depth points, then refines the final metric prediction with our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves improved accuracy and generalisation over state-of-the-art baselines and runs efficiently at over 15 FPS on embedded hardware, promising to support practical underwater inspection and intervention. This work has been submitted to IEEE Journal of Oceanic Engineering Special Issue of AUV 2026.
Monocular depth estimation (MDE) provides a useful tool for robotic perception, but its predictions are often uncertain and inaccurate in challenging environments such as surgical scenes where textureless surfaces, specular reflections, and occlusions are common. To address this, we propose ProbeMDE, a cost-aware active sensing framework that combines RGB images with sparse proprioceptive measurements for MDE. Our approach utilizes an ensemble of MDE models to predict dense depth maps conditioned on both RGB images and on a sparse set of known depth measurements obtained via proprioception, where the robot has touched the environment in a known configuration. We quantify predictive uncertainty via the ensemble's variance and measure the gradient of the uncertainty with respect to candidate measurement locations. To prevent mode collapse while selecting maximally informative locations to propriocept (touch), we leverage Stein Variational Gradient Descent (SVGD) over this gradient map. We validate our method in both simulated and physical experiments on central airway obstruction surgical phantoms. Our results demonstrate that our approach outperforms baseline methods across standard depth estimation metrics, achieving higher accuracy while minimizing the number of required proprioceptive measurements. Project page: https://brittonjordan.github.io/probe_mde/
Monocular depth estimation is a fundamental problem for robotic perception systems and downstream applications. However, depth estimation from a single image is an inherently ill-posed problem due to the information loss in projecting from 3D to 2D. Recent studies address the discrepancy between camera parameters by using learning-based methods and unifying the camera model to a canonical camera space or bipolar representations, thus addressing the problem of training a metric depth model over different datasets with different camera parameters. In addition, our previous study, OrchardDepth, introduced the sparse-dense depth consistency loss function to learn the dense depth distribution from city autonomous driving scenes and improve model performance in the orchard. Instead of enforcing strict consistency between the sparse and dense depth, this work introduces the KL divergence to encourage the network to adapt to the depth distributions of different sensors and to penalize deviations from reliable regions while tolerating errors in unreliable areas. Furthermore, we enhance the depth consistency loss by integrating bins into the supervised discretised depth distribution. This method significantly improves the robustness and performance of our previous approach, improving the absolute relative error on the orchard dataset by 17.3% and 16.2% compared to the SILog loss and the OrchardDepth baseline, respectively, and establishing an improved training paradigm for depth estimation in the orchard scene.
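A hedged sketch of a binned KL-divergence term between sparse reference depths and the dense prediction at the same pixels; the bin count, depth range, and soft assignment are illustrative assumptions, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def binned_depth_kl(pred_depth, sparse_depth, n_bins=64, max_depth=80.0):
    """KL divergence between predicted and sparse-reference depth distributions."""
    mask = sparse_depth > 0
    edges = torch.linspace(0.0, max_depth, n_bins + 1, device=pred_depth.device)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Soft-assign predicted depths to bins so the term stays differentiable.
    logits = -(pred_depth[mask][:, None] - centers[None, :]) ** 2
    p = F.softmax(logits, dim=1).mean(dim=0)                       # predicted distribution
    q = torch.histc(sparse_depth[mask], bins=n_bins, min=0.0, max=max_depth)
    q = (q / q.sum()).clamp_min(1e-8)                              # reference distribution
    return F.kl_div(p.clamp_min(1e-8).log(), q, reduction="sum")
```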
Monocular self-supervised depth estimation with a low-cost sensor is the mainstream solution to gathering dense depth maps for robots and autonomous driving. In this paper, based on the philosophy "less is more" (i.e., focusing only on valid pixels in sparse LiDAR), we propose a novel framework, Efficient Sparse Depth (EffisDepth), for predicting dense depth. The Sparse Feature Extractor (SFE) embedded in the proposed framework effectively handles sparse LiDAR by forming sparse tensors. The Slender Group Block (SGB) is the main building block in SFE, which extracts features from sparse tensors via a structure of two branches. Extensive experiments show that our method achieves state-of-the-art performance on the KITTI benchmark, demonstrating the effectiveness of each proposed component and the self-supervised learning framework.
No abstract available
Surface reconstruction is an essential way to expand the surgical field of view during endoscopic surgery, but it requires dense depth estimation of endoscopic video sequences. Unfortunately, such dense depth recovery suffers from illumination variation, weak texture, and occlusion. To address these problems, this work proposes a new triple-supervision self-learning strategy that uses unannotated endoscopic video data to predict monocular endoscopic dense depth information. This strategy first employs an effective conventional method to estimate camera poses and sparse depth maps, establishing sparse-data self-supervision. Our strategy further combines two consistency measures to supervise dense depth and photometric information. We evaluated our method on collected colonoscopic videos, with the experimental results showing that our triple-supervision learning framework performs more effectively and accurately than current self-supervised and unsupervised learning methods.
Due to the existence of low-textured areas in indoor scenes, some self-supervised depth estimation methods have specifically designed sparse photometric consistency losses and geometry-based losses. However, some of the loss terms cannot supervise all the pixels, which limits the performance of these methods. Some approaches introduce an additional optical flow network to provide dense correspondences supervision, but overload the loss function. In this paper, we propose to perform depth self-propagation based on feature self-similarities, where high-accuracy depths are propagated from supervised pixels to unsupervised ones. The enhanced self-supervised indoor monocular depth estimation network is called SPDepth. Since depth self-similarities are significant in a local range, a local window self-attention module is embedded at the end of the network to propagate depths in a window. The depth of a pixel is weighted using the feature correlation scores with other pixels in the same window. The effectiveness of self-propagation mechanism is demonstrated in the experiments on the NYU Depth V2 dataset. The root-mean-squared error of SPDepth is 0.585 and the δ1 accuracy is 77.6%. Zero-shot generalization studies are also conducted on the 7-Scenes dataset and provide a more comprehensive analysis about the application characteristics of SPDepth.
We consider the problem of dense depth prediction from a sparse set of depth measurements and a single RGB image. Since depth estimation from monocular images alone is inherently ambiguous and unreliable, to attain a higher level of robustness and accuracy, we introduce additional sparse depth samples, which are either acquired with a low-resolution depth sensor or computed via visual Simultaneous Localization and Mapping (SLAM) algorithms. We propose the use of a single deep regression network to learn directly from the RGB-D raw data, and explore the impact of the number of depth samples on prediction accuracy. Our experiments show that, compared to using only RGB images, the addition of 100 spatially random depth samples reduces the prediction root-mean-square error by 50% on the NYU-Depth-v2 indoor dataset. It also boosts the percentage of reliable predictions from 59% to 92% on the KITTI dataset. We demonstrate two applications of the proposed algorithm: a plug-in module in SLAM to convert sparse maps to dense maps, and super-resolution for LiDARs. Software (https://github.com/fangchangma/sparse-to-dense) and a video demonstration (https://www.youtube.com/watch?v=vNIIT_M7x7Y) are publicly available.
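The sparse-sample input described above can be pictured as stacking a randomly sampled depth channel onto the RGB image before it enters the regression network; the sampling below is an illustrative sketch (the network itself is not shown):

```python
import torch

def make_rgbd_input(rgb, gt_depth, num_samples=100):
    """Build a 4-channel RGB-D input with `num_samples` random sparse depths.

    rgb: (3, H, W); gt_depth: (H, W) with 0 at invalid pixels.
    """
    sparse = torch.zeros_like(gt_depth)
    valid = torch.nonzero(gt_depth > 0)                      # (M, 2) valid pixel coords
    pick = valid[torch.randperm(valid.shape[0], device=valid.device)[:num_samples]]
    sparse[pick[:, 0], pick[:, 1]] = gt_depth[pick[:, 0], pick[:, 1]]
    return torch.cat([rgb, sparse.unsqueeze(0)], dim=0)      # (4, H, W)
```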
This paper addresses the problem of single image depth estimation (SIDE), focusing on improving the quality of deep neural network predictions. In a supervised learning scenario, the quality of predictions is intrinsically related to the training labels, which guide the optimization process. For indoor scenes, structured-light-based depth sensors (e.g. Kinect) are able to provide dense, albeit short-range, depth maps. On the other hand, for outdoor scenes, LiDARs are considered the standard sensor, which comparatively provides much sparser measurements, especially in areas further away. Rather than modifying the neural network architecture to deal with sparse depth maps, this article introduces a novel densification method for depth maps, using the Hilbert Maps framework. A continuous occupancy map is produced based on 3D points from LiDAR scans, and the resulting reconstructed surface is projected into a 2D depth map with arbitrary resolution. Experiments conducted with various subsets of the KITTI dataset show a significant improvement produced by the proposed Sparse-to-Continuous technique, without the introduction of extra information into the training stage.
Monocular depth estimation is a fundamental task in robotic perception. Recently, with the development of more accurate and robust neural network models and different types of datasets, monocular depth estimation has significantly improved in performance and efficiency. However, most of the research in this area focuses on very concentrated domains. In particular, most of the benchmarks for outdoor scenarios belong to urban environments aimed at improving autonomous driving, and these benchmarks differ greatly from the orchard/vineyard environment, which makes them of limited use for research in the primary industry. Therefore, we propose OrchardDepth, which fills the gap in metric depth estimation from a monocular camera in the orchard/vineyard environment. In addition, we present a new retraining method that improves the training result by monitoring the consistent regularization between dense depth maps and sparse points. Our method improves the RMSE of depth estimation in the orchard environment from 1.5337 to 0.6738, demonstrating the validity of our approach.
Accurate and efficient dense metric depth estimation is crucial for 3D visual perception in robotics and XR. In this paper, we develop a monocular visual-inertial motion and depth (VIMD) learning framework to estimate dense metric depth by leveraging accurate and efficient MSCKF-based monocular visual-inertial motion tracking. At the core of the proposed VIMD is the exploitation of multi-view information to iteratively refine per-pixel scale, instead of globally fitting an invariant affine model as in prior work. The VIMD framework is highly modular, making it compatible with a variety of existing depth estimation backbones. We conduct extensive evaluations on the TartanAir and VOID datasets and demonstrate its zero-shot generalization capabilities on the AR Table dataset. Our results show that VIMD achieves exceptional accuracy and robustness, even with extremely sparse points, as few as 10-20 metric depth points per image. This makes the proposed VIMD a practical solution for deployment in resource-constrained settings, while its robust performance and strong generalization capabilities offer significant potential across a wide range of scenarios.
Dense depth maps play an important role in Computer Vision and AR (Augmented Reality). For CV applications, a dense depth map is the cornerstone of 3D reconstruction, allowing real objects to be precisely displayed in the computer, and in AR, dense depth maps can handle correct occlusion relationships between virtual content and real objects for a better user experience. However, the heavy computation involved limits the adoption of dense depth maps. We present a novel algorithm that produces low-latency, spatio-temporally smooth dense depth maps using only a CPU. The depth maps exhibit sharp discontinuities at depth edges at low computational complexity. Our algorithm obtains the sparse SLAM reconstruction first, then extracts coarse depth edges from a down-sampled RGB image by morphology operations. Next, we thin the depth edges and align them with image edges. Finally, an effective initialization scheme and an improved optimization solver are adopted to accelerate convergence. We evaluate our proposal quantitatively, and the results show improvements in depth map accuracy with respect to other state-of-the-art and baseline techniques.
We present a self-supervised approach to training convolutional neural networks for dense depth estimation from monocular endoscopy data without a priori modeling of anatomy or shading. Our method only requires monocular endoscopic videos and a multi-view stereo method, e.g., structure from motion, to supervise learning in a sparse manner. Consequently, our method requires neither manual labeling nor patient computed tomography (CT) scan in the training and application phases. In a cross-patient experiment using CT scans as groundtruth, the proposed method achieved submillimeter mean residual error. In a comparison study to recent self-supervised depth estimation methods designed for natural video on in vivo sinus endoscopy data, we demonstrate that the proposed approach outperforms the previous methods by a large margin. The source code for this work is publicly available online at https://github.com/lppllppl920/EndoscopyDepthEstimation-Pytorch.
This letter introduces a novel framework for dense Visual Simultaneous Localization and Mapping (VSLAM) based on Gaussian Splatting. Recently, SLAM based on Gaussian Splatting has shown promising results. However, in monocular scenarios, the reconstructed Gaussian maps lack geometric accuracy and the system exhibits weaker tracking capability. To address these limitations, we jointly optimize sparse visual odometry tracking and the 3D Gaussian Splatting scene representation for the first time. We obtain depth maps on visual odometry keyframe windows using a fast Multi-View Stereo (MVS) network for the geometric supervision of Gaussian maps. Furthermore, we propose a depth smooth loss and a Sparse-Dense Adjustment Ring (SDAR) to reduce the negative effect of estimated depth maps and preserve the consistency in scale between the visual odometry and Gaussian maps. We have evaluated our system across various synthetic and real-world datasets. The accuracy of our pose estimation surpasses existing methods and achieves the state of the art. Additionally, it outperforms previous monocular methods in terms of novel view synthesis and geometric reconstruction fidelity.
Visual trajectory estimation can endow autonomous underwater vehicles (AUVs) with environmental perception capabilities and has broad application prospects in fields such as oceanographic surveys, underwater construction, and marine ranching. However, due to featureless images and depth ambiguity issues caused by underwater multi-medium environments, monocular visual trajectory estimation in complex underwater environments remains a challenging problem. In this article, we propose a monocular visual trajectory estimation method for AUVs, which addresses the challenges of featureless imagery and depth ambiguity by leveraging deep image representations and multiview geometry. Specifically, we design a bidirectional optical flow consistency scheme that selects sparse correspondences from monocular dense predictions to deal with featureless images, and then achieve AUV trajectory estimation through epipolar constraints. Furthermore, we propose an iterative depth-consistent method, which solves the problem of depth ambiguity by aligning geometrically triangulated depths to the scale-consistent deep depths. We also develop a low-cost, agile, and portable AUV picking system with real-time trajectory estimation capabilities, and carry out extensive experiments in the Yellow Sea to test its performance. The experimental results demonstrate the effectiveness of the proposed method.
Estimating a dense depth map from one image is a challenging task for computer vision, because the same image can correspond to an infinite variety of 3D scenes. Neural networks have gradually achieved reasonable results on this task with the continuous development of deep learning, but depth estimation methods based on monocular cameras still have a gap in accuracy compared with multi-view or sensor-based methods. Thus, this paper proposes to supplement a limited number of sparse 3D point clouds, combined with transformer processing, to increase the accuracy of the monocular depth estimation model. The sparse 3D point clouds are used as supplementary geometric information and are input into the network together with the RGB image. After five rounds of integration, the multi-scale features are extracted, and a Swin Transformer block is then used to process the output feature map of the main network, further improving the accuracy. Experiments demonstrate that our network achieves better quantitative results than the best existing method on NYU Depth V2, the most commonly used dataset for monocular depth estimation, and the qualitative results are also better.
We present a self-supervised approach to training convolutional neural networks for dense depth estimation from monocular endoscopy data without a priori modeling of anatomy or shading. Our method only requires sequential data from monocular endoscopic videos and a multi-view stereo reconstruction method, e.g. structure from motion, that supervises learning in a sparse but accurate manner. Consequently, our method requires neither manual interaction, such as scaling or labeling, nor patient CT in the training and application phases. We demonstrate the performance of our method on sinus endoscopy data from two patients and validate depth prediction quantitatively using corresponding patient CT scans where we found submillimeter residual errors. (Link to the supplementary video: https://camp.lcsr.jhu.edu/miccai-2018-demonstration-videos/)
We present a visual-inertial depth estimation pipeline that integrates monocular depth estimation and visual-inertial odometry to produce dense depth estimates with metric scale. Our approach performs global scale and shift alignment against sparse metric depth, followed by learning-based dense alignment. We evaluate on the TartanAir and VOID datasets, observing up to 30% reduction in inverse RMSE with dense scale alignment relative to global alignment alone. Our approach is especially competitive at low density; with just 150 sparse metric depth points, our dense-to-dense depth alignment method achieves over 50% lower iRMSE than sparse-to-dense depth completion by KBNet, currently the state of the art on VOID. We demonstrate successful zero-shot transfer from synthetic TartanAir to real-world VOID data and perform generalization tests on NYUv2 and VCU-RVI. Our approach is modular and is compatible with a variety of monocular depth estimation models.
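The global scale-and-shift alignment step mentioned above can be sketched as a two-parameter least-squares fit of predicted inverse depth to the sparse metric values (the subsequent learning-based dense alignment is omitted; names and the inverse-depth parameterization are assumptions):

```python
import numpy as np

def global_scale_shift_align(pred_inv_depth, sparse_inv_depth, mask):
    """Fit one scale and one shift so the prediction matches sparse metric depth.

    pred_inv_depth: (H, W) relative inverse depth from a monocular network.
    sparse_inv_depth: (H, W) metric inverse depth, valid only where mask is True.
    Returns the globally aligned inverse-depth map.
    """
    x = pred_inv_depth[mask].ravel()
    y = sparse_inv_depth[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)          # columns: [prediction, 1]
    (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)
    return scale * pred_inv_depth + shift
```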
Contrary to the ongoing trend in automotive applications toward using more, and more diverse, sensors, this work tries to solve the complex scene flow problem under a monocular camera setup, i.e. using a single sensor. Towards this end, we exploit the latest achievements in single-image depth estimation, optical flow, and sparse-to-dense interpolation and propose a monocular combination approach (MonoComb) to compute dense scene flow. MonoComb uses optical flow to relate reconstructed 3D positions over time and interpolates occluded areas. In this way, existing monocular methods are outperformed in dynamic foreground regions, which leads to the second-best result among competitors on the challenging KITTI 2015 scene flow benchmark.
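A hedged sketch of the basic monocular combination: back-project two depth maps with the camera intrinsics and use optical flow to associate points over time, which yields dense scene flow (MonoComb's sparse-to-dense interpolation of occluded areas is not shown; names are illustrative):

```python
import numpy as np

def monocular_scene_flow(depth0, depth1, flow01, K):
    """Dense 3D scene flow from two depth maps and optical flow (pinhole model).

    depth0, depth1: (H, W) metric depth at t and t+1; flow01: (H, W, 2) flow t -> t+1;
    K: (3, 3) camera intrinsics.
    """
    h, w = depth0.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)

    def backproject(u, v, z):
        return np.stack([(u - cx) / fx * z, (v - cy) / fy * z, z], axis=-1)

    # 3D position of every pixel at time t.
    p0 = backproject(xs, ys, depth0)
    # Optical flow tells us where the same surface point is observed at t+1.
    u1 = np.clip(xs + flow01[..., 0], 0, w - 1)
    v1 = np.clip(ys + flow01[..., 1], 0, h - 1)
    z1 = depth1[np.round(v1).astype(int), np.round(u1).astype(int)]
    p1 = backproject(u1, v1, z1)
    # Scene flow = 3D displacement between corresponding points.
    return p1 - p0
```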
Dense depth estimation from a single image is a key problem in computer vision, with exciting applications in a multitude of robotic tasks. Initially viewed as a direct regression problem, requiring annotated labels as supervision at training time, in the past few years a substantial amount of work has been done in self-supervised depth training based on strong geometric cues, both from stereo cameras and more recently from monocular video sequences. In this paper we investigate how these two approaches (supervised & self-supervised) can be effectively combined, so that a depth model can learn to encode true scale from sparse supervision while achieving high fidelity local accuracy by leveraging geometric cues. To this end, we propose a novel supervised loss term that complements the widely used photometric loss, and show how it can be used to train robust semi-supervised monocular depth estimation models. Furthermore, we evaluate how much supervision is actually necessary to train accurate scale-aware monocular depth models, showing that with our proposed framework, very sparse LiDAR information, with as few as 4 beams (less than 100 valid depth values per image), is enough to achieve results competitive with the current state-of-the-art.
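To make the semi-supervised setup concrete, the sketch below adds a sparse supervised term over the few valid LiDAR pixels to a (precomputed) photometric loss. The paper's supervised term is based on reprojected distances rather than the plain L1 used here, so this is only a placeholder showing how sparse supervision anchors metric scale:

```python
import torch

def semi_supervised_depth_loss(pred_depth, lidar_depth, photometric_loss, w_sup=0.1):
    """Combine a self-supervised photometric loss with very sparse LiDAR supervision.

    pred_depth, lidar_depth: (B, 1, H, W); lidar_depth is zero where no return exists
    (with 4 beams, often fewer than 100 valid pixels per image).
    photometric_loss: scalar tensor computed elsewhere from view synthesis.
    """
    valid = lidar_depth > 0
    sup = torch.abs(pred_depth[valid] - lidar_depth[valid]).mean()
    return photometric_loss + w_sup * sup
```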
Dense and accurate reconstruction plays a fundamental role in a mobile robot's environment perception and navigation, and is also necessary for obstacle avoidance and path planning. We propose a method to incrementally reconstruct the scene from a monocular sequence by fusing the depth from geometric computation and generative adversarial network (GAN) prediction. The depth from geometric triangulation is precise but sparse, while the depth from the GAN is dense but unscaled. In this paper, we combine the advantages of the two with a linear model optimized over a graph structure. Experiments show that the proposed method gives precise dense reconstruction in real time.
No abstract available
Depth estimation plays a crucial role in 3D scene understanding and is extensively used in a wide range of vision tasks. Image-based methods struggle in challenging scenarios, while event cameras offer high dynamic range and temporal resolution but face difficulties with sparse data. Combining event and image data provides significant advantages, yet effective integration remains challenging. Existing CNN-based fusion methods struggle with occlusions and depth disparities due to limited receptive fields, while Transformer-based fusion methods often lack deep modality interaction. To address these issues, we propose UniCT Depth, an event-image fusion method that unifies CNNs and Transformers to model local and global features. We propose the Convolution-compensated ViT Dual SA (CcViT-DA) Block, designed for the encoder, which integrates Context Modeling Self-Attention (CMSA) to capture spatial dependencies and Modal Fusion Self-Attention (MFSA) for effective cross-modal fusion. Furthermore, we design a tailored Detail Compensation Convolution (DCC) Block to improve texture details and enhance edge representations. Extensive experiments show that UniCT Depth outperforms existing image, event, and fusion-based monocular depth estimation methods across key metrics.
Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE. Code: https://github.com/harborsarah/TRIDE
Millimeter-wave (mmWave) radar is widely used in autonomous driving thanks to its robustness under harsh weather conditions. However, compared to LiDAR point clouds, mmWave radar point clouds tend to be sparse and include several “ghost” targets. Existing methods attempt to optimize radar signal processing to generate superior radar point clouds. Nonetheless, these methods are primarily designed for simple scenarios. How to obtain superior radar point clouds in complex autonomous driving scenarios is still unexplored. In this article, we propose a novel point cloud generation network that generates superior radar point clouds by utilizing the complementary advantages of radar and camera. Technically, our approach comprises three key steps. First, to obtain precise relationships between neighboring pixels, the mmWave range-azimuth map (RAM) is enhanced to extract the radar affinity matrix. Second, to obtain dense image bird's-eye-view (BEV) features, image features are diffused by learning local details from the radar affinity matrix. Third, to obtain superior radar point clouds, radar features and diffused image features are combined in a unified BEV representation and optimized by a K-nearest-neighbor (KNN) graph. To verify the effectiveness of the proposed method, we collect a real-world dataset under autonomous driving scenarios to conduct experiments. The results demonstrate that our method outperforms existing ones in producing high-quality radar point clouds. As a consequence, the radar point clouds generated by our method significantly improve the performance of the object detection task.
Safety and reliability are crucial for the public acceptance of autonomous driving. To ensure accurate and reliable environmental perception, intelligent vehicles must exhibit accuracy and robustness in various environments. Millimeter-wave radar, known for its high penetration capability, can operate effectively in adverse weather conditions such as rain, snow, and fog. Traditional 3D millimeter-wave radars can only provide range, Doppler, and azimuth information for objects. Although the recent emergence of 4D millimeter-wave radars has added elevation resolution, the radar point clouds remain sparse due to Constant False Alarm Rate (CFAR) operations. In contrast, cameras offer rich semantic details but are sensitive to lighting and weather conditions. Hence, this paper leverages these two highly complementary and cost-effective sensors, 4D millimeter-wave radar and camera. By integrating 4D radar spectra with depth-aware camera images and employing attention mechanisms, we fuse texture-rich images with depth-rich radar data in the Bird's Eye View (BEV) perspective, enhancing 3D object detection. Additionally, we propose using GAN-based networks to generate depth images from radar spectra in the absence of depth sensors, further improving detection accuracy.
One challenge in stereo-LiDAR fusion arises from the sparsity and non-uniform distribution of LiDAR data. Existing methods expand sparse LiDAR data to produce semi-dense hints as guidance for fusion. However, the absence of depth cues beyond the expanded areas may still limit performance. To address this challenge, we propose a novel sparse-to-dense hint guided stereo-LiDAR fusion method. The key idea is to use a dense hint map generated by a lightweight network as guidance, with sparse LiDAR points and a monocular image as inputs. The dense hints are then employed to construct and explicitly regularize a multi-modal cost volume via integrating the geometric cues from the hints and the visual information from the images to produce better stereo prediction. The construction and aggregation of cost volume follow a well-designed coarse-to-fine strategy along with a pixel-wise search range adjustment module, facilitating fast computation while preserving fine details. Finally, a confidence-based fusion module is performed to adaptively produce the ultimate prediction based on the monocular and stereo estimations. The experimental results show that our method significantly outperforms existing methods with high inference efficiency across multiple benchmark datasets. To contribute to the community, we will release the code at: https://github.com/LiAngLA66/DG-Fusion
Autonomous driving vehicles have strong path planning and obstacle avoidance capabilities, which provide great support to avoid traffic accidents. Autonomous driving has become a research hotspot worldwide. Depth estimation is a key technology in autonomous driving as it provides an important basis for accurately detecting traffic objects and avoiding collisions in advance. However, the current difficulties in depth estimation include insufficient estimation accuracy, difficulty in acquiring depth information using monocular vision, and an important challenge of fusing multiple sensors for depth estimation. To enhance depth estimation performance in complex traffic environments, this study proposes a depth estimation method in which point clouds and images obtained from MMwave radar and cameras are fused. Firstly, a residual network is established to extract the multi-scale features of the MMwave radar point clouds and the corresponding image obtained simultaneously from the same location. Correlations between the radar points and the image are established by fusing the extracted multi-scale features. A semi-dense depth estimation is achieved by assigning the depth value of the radar point to the most relevant image region. Secondly, a bidirectional feature fusion structure with additional fusion branches is designed to enhance the richness of the feature information. The information loss during the feature fusion process is reduced, and the robustness of the model is enhanced. Finally, parallel channel and position attention mechanisms are used to enhance the feature representation of the key areas in the fused feature map, the interference of irrelevant areas is suppressed, and the depth estimation accuracy is enhanced. The experimental results on the public dataset nuScenes show that, compared with the baseline model, the proposed method reduces the mean absolute error (MAE) by 4.7–6.3% and the root mean square error (RMSE) by 4.2–5.2%.
3D object detection is essential for autonomous driving. As an emerging sensor, 4D imaging radar offers advantages such as low cost, long-range detection, and accurate velocity measurement, making it highly suitable for object detection. However, its sparse point clouds and low resolution limit object geometric representation and hinder multi-modal fusion. In this study, we introduce SFGFusion, a novel camera-4D imaging radar detection network guided by surface fitting. By estimating quadratic surface parameters of objects from image and radar data, the explicit surface fitting model enhances spatial representation and cross-modal interaction, enabling more reliable prediction of fine-grained dense depth. The predicted depth serves two purposes: 1) in an image branch to guide the transformation of image features from perspective view (PV) to a unified bird's-eye view (BEV) for multi-modal fusion, improving spatial mapping accuracy; and 2) in a surface pseudo-point branch to generate a dense pseudo-point cloud, mitigating the radar point sparsity. The original radar point cloud is also encoded in a separate radar branch. These two point cloud branches adopt a pillar-based method and subsequently transform the features into the BEV space. Finally, a standard 2D backbone and detection head are used to predict object labels and bounding boxes from BEV features. Experimental results show that SFGFusion effectively fuses camera and 4D radar features, achieving superior performance on the TJ4DRadSet and View-of-Delft (VoD) object detection benchmarks.
Recently, camera-radar fusion-based 3D object detection methods in bird's eye view (BEV) have gained attention due to the complementary characteristics and cost-effectiveness of these sensors. Previous approaches using forward projection struggle with sparse BEV feature generation, while those employing backward projection overlook depth ambiguity, leading to false positives. In this paper, to address the aforementioned limitations, we propose a novel camera-radar fusion-based 3D object detection and segmentation model named CRAB (Camera-Radar fusion for reducing depth Ambiguity in Backward projection-based view transformation), using a backward projection that leverages radar to mitigate depth ambiguity. During the view transformation, CRAB aggregates perspective view image context features into BEV queries. It improves depth distinction among queries along the same ray by combining the dense but unreliable depth distribution from images with the sparse yet precise depth information from radar occupancy. We further introduce spatial cross-attention with a feature map containing radar context information to enhance the comprehension of the 3D scene. When evaluated on the nuScenes open dataset, our proposed approach achieves a state-of-the-art performance among backward projection-based camera-radar fusion methods with 62.4% NDS and 54.0% mAP in 3D object detection.
Autonomous perception systems demand robust performance across diverse conditions. While visual sensors provide rich semantic information, their performance degrades significantly under adverse weather and lighting. Conversely, millimeter-wave radar sensors offer strong all-weather robustness and direct velocity sensing but produce sparse, low-resolution data with limited semantics, hindering precise object detection. To address this issue, this paper proposes the Multi-Modal Camera-Radar Fusion (MMCRF) method. This approach skips traditional signal processing by directly utilizing raw radar Range-Doppler (RD) data. Meanwhile, an independent image processing network is responsible for handling camera data. The first step involves the projection of the images onto a polar coordinate grid within a Bird's Eye View (BEV) perspective. Then, depth features are extracted through a specifically designed encoder-decoder network. These visual features are then deeply fused with Range-Azimuth (RA) features from the radar RD spectrum for object detection. In terms of accuracy, this method significantly outperforms existing fusion detection frameworks in distance and azimuth error metrics. The distance error (RE) reaches 0.11 m, 8.3% lower than the current best method; the azimuth error (AE) is 0.09, 18.2% lower than that of the suboptimal method. The detector also achieves an AP of 96.12%, an AR of 92.23%, and real-time inference at 58.91 FPS.
Autonomous vehicles depend on various sensors to accurately perceive their surroundings, ensuring safe and efficient navigation. These sensors include radar, lidar, cameras, and ultrasonic sensors, each offering unique strengths. Radar provides precise distance and velocity measurements in diverse conditions, while cameras offer detailed visual information for object recognition. State-of-the-art perception systems increasingly utilize sensor fusion to combine these strengths, addressing the limitations of individual sensors and enhancing overall vehicle perception capabilities. Challenges in radar-camera fusion include misalignment of sensor data due to differing spatial resolutions and difficulties in effectively integrating radar's sparse data with dense image data. Existing models like CDMC and RODNet struggle with limited precision or recall due to suboptimal fusion strategies. This work employs the Radar Multiple-perspectives Convolutional Neural Network (RAMP-CNN) architecture, which fuses radar and camera data through a Convolutional Neural Network (CNN) to improve perception. Radar data preprocessing involves steps like the 3D Fast Fourier Transform, while image preprocessing includes gamma correction and noise reduction. The fusion process combines 2D image proposals with radar data, significantly enhancing object detection and distance estimation accuracy. Simulation results demonstrate the efficacy of this fusion approach. Performance metrics such as Precision, Recall, F1-Score, Mean Squared Error (MSE), Mean Absolute Error (MAE), R2 Score, and Mean Intersection over Union (Mean IoU) highlight the model's effectiveness. While precision is high, recall indicates room for improvement. Low error metrics suggest accurate distance estimations, and the R2 score confirms the model's strong explanatory power. This fusion method represents a significant advancement in autonomous vehicle perception, increasing Average Recall (AR) by 3% and Average Precision (AP) by 16%, enabling more reliable and accurate navigation in complex environments.
Depth estimation represents a prevalent research focus within the realm of computer vision. Existing depth estimation methodologies utilizing LiDAR (Light Detection and Ranging) technology typically obtain sparse depth data and are associated with elevated hardware expenses. Multi-view image-matching techniques necessitate prior knowledge of camera intrinsic parameters and frequently encounter challenges such as depth inconsistency, loss of details, and the blurring of edges. To tackle these challenges, the present study introduces a monocular depth estimation approach based on an end-to-end convolutional neural network. Specifically, a DNET backbone has been developed, incorporating dilated convolution and feature fusion mechanisms within the network architecture. By integrating semantic information from various receptive fields and levels, the model’s capacity for feature extraction is augmented, thereby enhancing its sensitivity to nuanced depth variations within the image. Furthermore, we introduce a loss function optimization algorithm specifically designed to address class imbalance, thereby enhancing the overall predictive accuracy of the model. Training and validation conducted on the NYU Depth-v2 (New York University Depth Dataset Version 2) and KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) datasets demonstrate that our approach outperforms other algorithms in terms of various evaluation metrics.
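The DNET backbone itself is not spelled out in the abstract; purely as a generic illustration of dilated convolutions with feature fusion (names and dilation rates are assumptions), such a block could look like:

```python
import torch
import torch.nn as nn

class DilatedFusionBlock(nn.Module):
    """Parallel dilated convolutions over the same feature map, fused by a 1x1 conv.

    Larger dilation rates enlarge the receptive field without adding parameters,
    letting the block mix fine detail with wider semantic context.
    """
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        feats = [torch.relu(b(x)) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```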
Purpose: This paper aims to develop a robust person tracking method for human-following robots. The tracking system adopts the multimodal fusion results of millimeter wave (MMW) radars and monocular cameras for perception. A prototype human-following robot is developed and evaluated using the proposed tracking system. Design/methodology/approach: Limited by angular resolution, point clouds from MMW radars are too sparse to form features for human detection. Monocular cameras can provide semantic information for objects in view, but cannot provide spatial locations. Considering the complementarity of the two sensors, a sensor fusion algorithm based on multimodal data combination is proposed to identify and localize the target person under challenging conditions. In addition, a closed-loop controller is designed for the robot to follow the target person at the expected distance. Findings: A series of experiments under different circumstances is carried out to validate the fusion-based tracking method. Experimental results show that the average tracking errors are around 0.1 m. It is also found that the robot can handle different situations, overcome short-term interference, and continually track and follow the target person. Originality/value: This paper proposes a robust tracking system based on the fusion of MMW radars and cameras. Interference such as occlusion and overlap is handled well with the help of the velocity information from the radars. Compared with other state-of-the-art solutions, the sensor fusion method is cost-effective and requires no additional tags on people. Its stable performance shows good application prospects for human-following robots.
Despite advancements in self-supervised monocular depth estimation, challenges persist in dynamic scenarios due to the dependence on assumptions about a static world. In this paper, we present Manydepth2, which achieves precise depth estimation for both dynamic objects and static backgrounds while maintaining computational efficiency. To tackle the challenges posed by dynamic content, we incorporate optical flow and coarse monocular depth to create a pseudo-static reference frame. This frame is then utilized to build a motion-aware cost volume in collaboration with the vanilla target frame. Furthermore, to improve the accuracy and robustness of the network architecture, we propose an attention-based depth network that effectively integrates information from feature maps at different resolutions by incorporating both channel and non-local attention mechanisms. Compared to methods with similar computational costs, Manydepth2 achieves a significant reduction of approximately five percent in root-mean-square error for self-supervised monocular depth estimation on the KITTI-2015 dataset. The code can be found at https://github.com/kaichen-z/Manydepth2.
Depth completion, predicting dense depth maps from sparse depth measurements, is an ill-posed problem requiring prior knowledge. Recent methods adopt learning-based approaches to implicitly capture priors, but the priors primarily fit in-domain data and do not generalize well to out-of-domain scenarios. To address this, we propose a zero-shot depth completion method composed of an affine-invariant depth diffusion model and test-time alignment. We use pre-trained depth diffusion models as depth prior knowledge, which implicitly understand how to fill in depth for scenes. Our approach aligns the affine-invariant depth prior with metric-scale sparse measurements, enforcing them as hard constraints via an optimization loop at test-time. Our zero-shot depth completion method demonstrates generalization across various domain datasets, achieving up to a 21% average performance improvement over the previous state-of-the-art methods while enhancing spatial understanding by sharpening scene details. We demonstrate that aligning a monocular affine-invariant depth prior with sparse metric measurements is a proven strategy to achieve domain-generalizable depth completion without relying on extensive training data. Project page: https://hyoseok1223.github.io/zero-shot-depth-completion/.
Self-supervised monocular depth estimation has seen significant progress in recent years, especially in outdoor environments. However, depth prediction results are not satisfying in indoor scenes, where most of the existing data are captured with hand-held devices. Compared to outdoor environments, estimating the depth of monocular videos in indoor environments with self-supervised methods poses two additional challenges: (i) the depth range of indoor video sequences varies a lot across different frames, making it difficult for the depth network to induce consistent depth cues for training; (ii) the indoor sequences recorded with handheld devices often contain much more rotational motion, which causes difficulties for the pose network to predict accurate relative camera poses. In this work, we propose a novel framework, MonoIndoor++, which gives special consideration to these challenges and consolidates a set of good practices for improving the performance of self-supervised monocular depth estimation in indoor environments. First, a depth factorization module with a transformer-based scale regression network is proposed to estimate a global depth scale factor explicitly, and the predicted scale factor can indicate the maximum depth values. Second, rather than using a single-stage pose estimation strategy as in previous methods, we propose to utilize a residual pose estimation module to estimate relative camera poses across consecutive frames iteratively. Third, to incorporate extensive coordinates guidance for our residual pose estimation module, we propose to perform coordinate convolutional encoding directly over the inputs to the pose networks. The proposed method is validated on a variety of benchmark indoor datasets, i.e., EuRoC MAV, NYUv2, ScanNet and 7-Scenes, demonstrating state-of-the-art performance.
Monocular depth estimation (MDE) plays a crucial role in enabling spatially-aware applications on Ultra-low-power (ULP) Internet-of-Things (IoT) platforms. However, the limited number of parameters of the Deep Neural Networks designed for the MDE task on IoT nodes results in severe accuracy drops when the sensor data observed in the field shifts significantly from the training dataset. To address this domain shift problem, we present a multi-modal On-Device Learning (ODL) technique, deployed on an IoT device integrating a Greenwaves GAP9 MicroController Unit (MCU), an 80 mW monocular camera and an 8 x 8 pixel depth sensor, consuming approximately 300 mW. In its normal operation, this setup feeds a tiny 107k-parameter μPyD-Net model with monocular images for inference. The depth sensor, usually deactivated to minimize energy consumption, is only activated alongside the camera to collect pseudo-labels when the system is placed in a new environment. Then, the fine-tuning task is performed entirely on the MCU, using the new data. To optimize our backpropagation-based on-device training, we introduce a novel memory-driven sparse update scheme, which minimizes the fine-tuning memory to 1.2 MB, 2.2x less than a full update, while preserving accuracy (i.e., only 2% and 1.5% drops on the KITTI and NYUv2 datasets). Our in-field tests demonstrate, for the first time, that ODL for MDE can be performed in 17.8 minutes on the IoT node, reducing the root mean squared error from 4.9 m to 0.6 m with only 3k self-labeled samples collected in a real-life deployment scenario.
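A simplified PyTorch sketch of a sparse-update configuration: only a small named subset of layers is left trainable, so optimizer state and gradient buffers stay small. The layer-selection patterns below are hypothetical; the paper's memory-driven scheme derives them from a per-layer memory budget rather than fixed names:

```python
import torch

def configure_sparse_update(model, trainable_substrings=("decoder", "head")):
    """Freeze everything except a few layers to bound on-device fine-tuning memory."""
    n_trainable = 0
    for name, p in model.named_parameters():
        p.requires_grad = any(s in name for s in trainable_substrings)
        if p.requires_grad:
            n_trainable += p.numel()
    # Only trainable parameters need gradients and optimizer state.
    opt = torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3
    )
    return opt, n_trainable
```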
Depth information is helpful for 3D recognition. Commodity-grade depth cameras can capture depth and color images in real time. However, glossy, transparent, or distant surfaces cannot be scanned properly by the sensor. As a result, enhancement and restoration of sensed depth is an important task. Depth completion aims at filling the holes that sensors fail to detect, which is still a complex task for machines to learn. Traditional hand-tuned methods have reached their limits, while neural network-based methods tend to copy and interpolate the output from surrounding depth values, which leads to blurred boundaries and loss of depth-map structure. Consequently, our main work is to design an end-to-end network that improves completed depth maps while maintaining edge clarity. We utilize a self-attention mechanism, previously used in image inpainting, to extract more useful information in each convolution layer so that the completed depth map is enhanced. In addition, we propose a boundary consistency concept to enhance depth map quality and structure. Experimental results validate the effectiveness of our self-attention and boundary consistency schema, which outperforms previous state-of-the-art depth completion work on the Matterport3D dataset. Our code is publicly available at https://github.com/tsunghan-wu/Depth-Completion.
We present an approach which takes advantage of both structure and semantics for unsupervised monocular learning of depth and ego-motion. More specifically, we model the motion of individual objects and learn their 3D motion vector jointly with depth and ego-motion. We obtain more accurate results, especially for challenging dynamic scenes not addressed by previous approaches. This is an extended version of Casser et al. [AAAI'19]. Code and models have been open sourced at https://sites.google.com/corp/view/struct2depth.
Zero-shot depth completion has gained attention for its ability to generalize across environments without sensor-specific datasets or retraining. However, most existing approaches rely on diffusion-based test-time optimization, which is computationally expensive due to iterative denoising. Recent visual-prompt-based methods reduce training cost but still require repeated forward-backward passes through the full frozen network to optimize input-level prompts, resulting in slow inference. In this work, we show that adapting only the decoder is sufficient for effective test-time optimization, as depth foundation models concentrate depth-relevant information within a low-dimensional decoder subspace. Based on this insight, we propose a lightweight test-time adaptation method that updates only this low-dimensional subspace using sparse depth supervision. Our approach achieves state-of-the-art performance, establishing a new Pareto frontier between accuracy and efficiency for test-time adaptation. Extensive experiments on five indoor and outdoor datasets demonstrate consistent improvements over prior methods, highlighting the practicality of fast zero-shot depth completion.
Indoor monocular depth estimation has attracted increasing research interest. Most previous works have focused on methodology, primarily experimenting with the NYU-Depth-V2 (NYUv2) dataset, and only considered overall performance over the test set. However, little is known regarding robustness and generalization when it comes to applying monocular depth estimation methods to real-world scenarios where highly varying and diverse functional space types, such as libraries or kitchens, are present. A breakdown of performance by space type is essential to understand a pretrained model's performance variance. To facilitate our investigation of robustness and address limitations of previous works, we collect InSpaceType, a high-quality and high-resolution RGBD dataset for general indoor environments. We benchmark 12 recent methods on InSpaceType and find they severely suffer from performance imbalance concerning space types, which reveals their underlying bias. We extend our analysis to 4 other datasets, 3 mitigation approaches, and the ability to generalize to unseen space types. Our work marks the first in-depth investigation of performance imbalance across space types for indoor monocular depth estimation, drawing attention to potential safety concerns for model deployment without considering space types, and further shedding light on potential ways to improve robustness. See https://depthcomputation.github.io/DepthPublic for data and the supplementary document. The benchmark list on the GitHub project page is kept updated with the latest monocular depth estimation methods.
Recent advances in zero-shot monocular depth estimation (MDE) have significantly improved generalization by unifying depth distributions through normalized depth representations and by leveraging large-scale unlabeled data via pseudo-label distillation. However, existing methods that rely on global depth normalization treat all depth values equally, which can amplify noise in pseudo-labels and reduce distillation effectiveness. In this paper, we present a systematic analysis of depth normalization strategies in the context of pseudo-label distillation. Our study shows that, under recent distillation paradigms (e.g., shared-context distillation), normalization is not always necessary, as omitting it can help mitigate the impact of noisy supervision. Furthermore, rather than focusing solely on how depth information is represented, we propose Cross-Context Distillation, which integrates both global and local depth cues to enhance pseudo-label quality. We also introduce an assistant-guided distillation strategy that incorporates complementary depth priors from a diffusion-based teacher model, enhancing supervision diversity and robustness. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, both quantitatively and qualitatively.
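For context, the global normalization under discussion is typically a median/scale normalization of the kind sketched below, while normalizing over a local crop is one way to obtain local depth cues; this is an illustration of the two notions, not the paper's Cross-Context Distillation code:

```python
import torch

def normalize_depth_global(d, eps=1e-6):
    """Scale-and-shift invariant (global) normalization of a depth map.

    Every pixel shares the same median/scale statistics, so a few noisy
    pseudo-label pixels perturb the normalization of the whole image.
    """
    t = d.median()
    s = (d - t).abs().mean()
    return (d - t) / (s + eps)

def normalize_depth_local(d, crop):
    """Normalize only inside a local crop (y0, y1, x0, x1) to obtain local cues."""
    y0, y1, x0, x1 = crop
    return normalize_depth_global(d[..., y0:y1, x0:x1])
```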
Recent monocular foundation models excel at zero-shot depth estimation, yet their outputs are inherently relative rather than metric, limiting direct use in robotics and autonomous driving. We leverage the fact that relative depth preserves global layout and boundaries: by calibrating it with sparse range measurements, we transform it into a pseudo metric depth prior. Building on this prior, we design a refinement network that follows the prior where reliable and deviates where necessary, enabling accurate metric predictions from very few labeled samples. The resulting system is particularly effective when curated validation data are unavailable, sustaining stable scale and sharp edges across few-shot regimes. These findings suggest that coupling foundation priors with sparse anchors is a practical route to robust, deployment-ready depth completion under real-world label scarcity.
While radar and video data can be readily fused at the detection level, fusing them at the pixel level is potentially more beneficial. This is also more challenging, in part due to the sparsity of radar, but also because automotive radar beams are much wider than a typical pixel and, combined with the large baseline between camera and radar, this results in poor association between radar returns and color pixels. A consequence is that depth completion methods designed for LiDAR and video fare poorly for radar and video. Here we propose a radar-to-pixel association stage which learns a mapping from radar returns to pixels. This mapping also serves to densify radar returns. Using this as a first stage, followed by a more traditional depth completion method, we are able to achieve image-guided depth completion with radar and video. We demonstrate performance superior to camera and radar alone on the nuScenes dataset. Our source code is available at https://github.com/longyunf/rc-pda.
While self-supervised monocular depth estimation in driving scenarios has achieved comparable performance to supervised approaches, violations of the static world assumption can still lead to erroneous depth predictions of traffic participants, posing a potential safety issue. In this paper, we present R4Dyn, a novel set of techniques to use cost-efficient radar data on top of a self-supervised depth estimation framework. In particular, we show how radar can be used during training as a weak supervision signal, as well as an extra input to enhance the estimation robustness at inference time. Since automotive radars are readily available, this allows training data to be collected from a variety of existing vehicles. Moreover, by filtering and expanding the signal to make it compatible with learning-based approaches, we address radar-inherent issues, such as noise and sparsity. With R4Dyn we are able to overcome a major limitation of self-supervised depth estimation, i.e. the prediction of traffic participants. We substantially improve the estimation on dynamic objects, such as cars, by 37% on the challenging nuScenes dataset, hence demonstrating that radar is a valuable additional sensor for monocular depth estimation in autonomous vehicles.
This work considers the problem of depth completion, with or without image data, where an algorithm may measure the depth of a prescribed limited number of pixels. The algorithmic challenge is to choose pixel positions strategically and dynamically to maximally reduce overall depth estimation error. This setting is realized in daytime or nighttime depth completion for autonomous vehicles with a programmable LiDAR. Our method uses an ensemble of predictors to define a sampling probability over pixels. This probability is proportional to the variance of the predictions of ensemble members, thus highlighting pixels that are difficult to predict. By additionally proceeding in several prediction phases, we effectively reduce redundant sampling of similar pixels. Our ensemble-based method may be implemented using any depth-completion learning algorithm, such as a state-of-the-art neural network, treated as a black box. In particular, we also present a simple and effective Random Forest-based algorithm, and similarly use its internal ensemble in our design. We conduct experiments on the KITTI dataset, using the neural network algorithm of Ma et al. and our Random Forest based learner for implementing our method. The accuracy of both implementations exceeds the state of the art. Compared with a random or grid sampling pattern, our method allows a reduction by a factor of 4-10 in the number of measurements required to attain the same accuracy.
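A minimal single-phase sketch of the ensemble-variance sampling rule (the multi-phase procedure that suppresses redundant sampling of similar pixels is omitted; names are illustrative):

```python
import numpy as np

def sample_pixels_by_ensemble_variance(ensemble_preds, n_samples, rng=None):
    """Choose measurement pixels with probability proportional to ensemble variance.

    ensemble_preds: (K, H, W) depth predictions from K ensemble members.
    Returns (rows, cols) of the selected pixels, drawn without replacement.
    """
    rng = np.random.default_rng() if rng is None else rng
    var = ensemble_preds.var(axis=0).ravel()
    p = var / var.sum()
    idx = rng.choice(var.size, size=n_samples, replace=False, p=p)
    return np.unravel_index(idx, ensemble_preds.shape[1:])
```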
Mobile robots require accurate and robust depth measurements to understand and interact with the environment. While existing sensing modalities address this problem to some extent, recent research on monocular depth estimation has leveraged the information richness, yet low cost and simplicity of monocular cameras. These works have shown significant generalization capabilities, mainly in automotive and indoor settings. However, robots often operate in environments with limited scale cues, self-similar appearances, and low texture. In this work, we encode measurements from a low-cost mmWave radar into the input space of a state-of-the-art monocular depth estimation model. Despite the radar's extreme point cloud sparsity, our method demonstrates generalization and robustness across industrial and outdoor experiments. Our approach reduces the absolute relative error of depth predictions by 9-64% across a range of unseen, real-world validation datasets. Importantly, we maintain consistency of all performance metrics across all experiments and scene depths where current vision-only approaches fail. We further address the present deficit of training data in mobile robotics environments by introducing a novel methodology for synthesizing rendered, realistic learning datasets based on photogrammetric data that simulate the radar sensor observations for training. Our code, datasets, and pre-trained networks are made available at https://github.com/ethz-asl/radarmeetsvision.
The matching formulation makes it inherently hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from the vision foundation model (VFM) to improve generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first problem is the misalignment between affine-invariant relative monocular depth and the absolute depth of disparity. In addition, when the monocular feature is used in an iterative update structure, over-confidence in the disparity update leads to local optima. A direct fusion of a monocular depth map could alleviate the local-optima problem, but noisy disparity results computed at the first several iterations will misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representations. The computed local ordering map is also used to re-weight the initial disparity update, resolving the local-optima and noise problems. In addition, we formulate the final direct fusion of monocular depth into the disparity as a registration problem, where a pixel-wise linear regression module can globally and adaptively align them. Our method fully exploits the monocular prior to support stereo matching effectively and efficiently. Experiments show significant performance improvements when generalizing from SceneFlow to the Middlebury and Booster datasets, while barely reducing efficiency.
Recently, several works have proposed fusing radar data as an additional perceptual signal into monocular depth estimation models because radar data is robust against varying light and weather conditions. Although improved performances were reported in prior works, it is still hard to tell how much depth information radar can contribute to a depth estimation model. In this paper, we propose radar inference and supervision experiments to investigate the intrinsic depth potential of radar data using state-of-the-art depth estimation models on the nuScenes dataset. In the inference experiment, the model predicts depth by taking only radar as input to demonstrate the inference capability using radar data. In the supervision experiment, a monocular depth estimation model is trained under radar supervision to show the intrinsic depth information that radar can contribute. Our experiments demonstrate that the model using only sparse radar as input can detect the shape of surroundings to a certain extent in the predicted depth. Furthermore, the monocular depth estimation model supervised by preprocessed radar achieves a good performance compared to the baseline model trained with sparse lidar supervision.
We propose a systematic approach for registering cross-source point clouds. The compelling need for cross-source point cloud registration is motivated by the rapid development of a variety of 3D sensing techniques, but many existing registration methods face critical challenges as a result of the large variations in cross-source point clouds. This paper therefore presents a novel registration method which successfully aligns two cross-source point clouds in the presence of significant missing data, large variations in point density, scale differences, and so on. The robustness of the method is attributed to the extraction of macro and micro structures. Our work has three main contributions: (1) a systematic pipeline to deal with cross-source point cloud registration; (2) a graph construction method to maintain macro and micro structures; (3) a new graph matching method that considers the global geometric constraint to robustly register these variable graphs. Compared to most of the related methods, the experiments show that the proposed method successfully registers cross-source datasets, while other methods have difficulty achieving satisfactory results. The proposed method also shows great ability on same-source datasets.
The main goal of point cloud registration in the Multi-View Partial (MVP) Challenge 2021 is to estimate a rigid transformation to align a point cloud pair. The pairs in this competition have the characteristics of low overlap, non-uniform density, unrestricted rotations, and ambiguity, which pose a huge challenge to the registration task. In this report, we introduce our solution to the registration task, which fuses two deep learning models, ROPNet and PREDATOR, with customized ensemble strategies. Finally, we achieved second place in the registration track with 2.96546, 0.02632, and 0.07808 under the metrics of Rot_Error, Trans_Error, and MSE, respectively.
In recent years, terrestrial laser scanning technology has been widely used to collect tree point cloud data, aiding in measurements of diameter at breast height, biomass, and other forestry survey data. Since a single scan from terrestrial laser systems captures data from only one angle, multiple scans must be registered and fused to obtain complete tree point cloud data. This paper proposes a marker-free automatic registration method for single-tree point clouds based on similar tetrahedra. First, two point clouds from two scans of the same tree are used to generate tree skeletons, and key point sets are constructed from these skeletons. Tetrahedra are then filtered and matched according to similarity principles, with the vertices of the two matched tetrahedra selected as matching point pairs, thus completing the coarse registration of the point clouds from the two scans. Subsequently, the ICP method is applied to the coarse-registered leaf point clouds to obtain fine registration parameters, completing the precise registration of the two tree point clouds. Experiments were conducted using terrestrial laser scanning data from eight trees, each from different species and with varying shapes. The proposed method was evaluated using RMSE and Hausdorff distance, and compared against the traditional ICP and NDT methods. The experimental results demonstrate that the proposed method significantly outperforms both ICP and NDT in registration accuracy, achieving speeds up to 593 times and 113 times faster than ICP and NDT, respectively. In summary, the proposed method shows good robustness in single-tree point cloud registration, with significant advantages in accuracy and speed compared to traditional ICP and NDT methods, indicating excellent application prospects in practical registration scenarios.
In this paper, we propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks to enhance registration performance. Our key idea is to generate cross-view consistent image pairs that are well-aligned with the source and target point clouds, enabling geometry-color feature fusion to facilitate robust matching. To ensure high-quality matching, the generated image pair should feature both 2D-3D geometric consistency and cross-view texture consistency. To this end, we introduce DepthMatch-ControlNet and LiDARMatch-ControlNet, two matching-specific, controllable 2D generative models. Specifically, for depth camera-based 3D registration with point clouds derived from the depth maps, DepthMatch-ControlNet leverages the depth-conditioned generation capabilities of ControlNet to synthesize perspective-view RGB images that are geometrically consistent with depth maps, ensuring accurate 2D-3D alignment. Additionally, by incorporating a coupled conditional denoising scheme and coupled prompt guidance, it further promotes cross-view feature interaction, guiding texture consistency generation. To address LiDAR-based 3D registration with point clouds captured by LiDAR sensors, LiDARMatch-ControlNet extends this framework by conditioning on paired equirectangular range maps projected from 360-degree LiDAR point clouds, generating corresponding panoramic RGB images. Our generative 3D registration paradigm is general and can be seamlessly integrated into a wide range of existing registration methods to improve their performance. Extensive experiments on the 3DMatch and ScanNet datasets (for depth-camera settings), as well as the Dur360BEV dataset (for LiDAR settings), demonstrate the effectiveness of our approach.
LiDAR sensors are a key modality for 3D perception, yet they are typically designed independently of downstream tasks such as point cloud registration. Conventional registration operates on pre-acquired datasets with fixed LiDAR configurations, leading to suboptimal data collection and significant computational overhead for sampling, noise filtering, and parameter tuning. In this work, we propose an adaptive LiDAR sensing framework that dynamically adjusts sensor parameters, jointly optimizing LiDAR acquisition and registration hyperparameters. By integrating registration feedback into the sensing loop, our approach optimally balances point density, noise, and sparsity, improving registration accuracy and efficiency. Evaluations in the CARLA simulation demonstrate that our method outperforms fixed-parameter baselines while retaining generalization abilities, highlighting the potential of adaptive LiDAR for autonomous perception and robotic applications.
Learning-based point cloud registration methods can handle clean point clouds well, but it remains challenging for them to generalize to noisy, partial, and density-varying point clouds. To this end, we propose a novel point cloud registration framework for these imperfect point clouds. By introducing a neural implicit representation, we replace the problem of rigid registration between point clouds with a registration problem between the point cloud and the neural implicit function. We then propose to alternately optimize the implicit function and the registration between the implicit function and the point cloud. In this way, point cloud registration can be performed in a coarse-to-fine manner. By fully capitalizing on the capabilities of the neural implicit function without computing point correspondences, our method showcases remarkable robustness in the face of challenges such as noise, incompleteness, and density changes of point clouds.
We provide a dynamical perspective on the classical problem of 3D point cloud registration with correspondences. A point cloud is considered as a rigid body consisting of particles. The problem of registering two point clouds is formulated as a dynamical system, where the dynamic model point cloud translates and rotates in a viscous environment towards the static scene point cloud, under forces and torques induced by virtual springs placed between each pair of corresponding points. We first show that the potential energy of the system recovers the objective function of the maximum likelihood estimation. We then adopt Lyapunov analysis, particularly the invariant set theorem, to analyze the rigid body dynamics and show that the system globally asymptotically tends towards the set of equilibrium points, where the globally optimal registration solution lies in. We conjecture that, besides the globally optimal equilibrium point, the system has either three or infinite "spurious" equilibrium points, and these spurious equilibria are all locally unstable. The case of three spurious equilibria corresponds to generic shape of the point cloud, while the case of infinite spurious equilibria happens when the point cloud exhibits symmetry. Therefore, simulating the dynamics with random perturbations guarantees to obtain the globally optimal registration solution. Numerical experiments support our analysis and conjecture.
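In the spring analogy, the potential energy stored in the virtual springs is the weighted sum of squared extensions, which coincides with the least-squares objective of maximum-likelihood registration under isotropic Gaussian noise (a standard identity, stated here to make the abstract's claim concrete):

E(R, t) = \frac{1}{2} \sum_{i=1}^{N} k_i \left\lVert R p_i + t - q_i \right\rVert^2,

where p_i and q_i are corresponding model and scene points, k_i are the spring stiffnesses (per-correspondence weights), and (R, t) is the rigid transformation applied to the model point cloud. Minimizing E over rotations and translations therefore recovers the maximum-likelihood estimate.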
For many driving safety applications, it is of great importance to accurately register LiDAR point clouds generated on distant moving vehicles. However, such point clouds have extremely different point density and sensor perspective on the same object, making registration on such point clouds very hard. In this paper, we propose a novel feature extraction framework, called APR, for online distant point cloud registration. Specifically, APR leverages an autoencoder design, where the autoencoder reconstructs a denser aggregated point cloud with several frames instead of the original single input point cloud. Our design forces the encoder to extract features with rich local geometry information based on one single input point cloud. Such features are then used for online distant point cloud registration. We conduct extensive experiments against state-of-the-art (SOTA) feature extractors on KITTI and nuScenes datasets. Results show that APR outperforms all other extractors by a large margin, increasing average registration recall of SOTA extractors by 7.1% on LoKITTI and 4.6% on LoNuScenes. Code is available at https://github.com/liuQuan98/APR.
Point cloud registration is a fundamental problem in 3D computer vision. In this paper, we cast point cloud registration into a planning problem in reinforcement learning, which can seek the transformation between the source and target point clouds through trial and error. By modeling the point cloud registration process as a Markov decision process (MDP), we develop a latent dynamic model of point clouds, consisting of a transformation network and evaluation network. The transformation network aims to predict the new transformed feature of the point cloud after performing a rigid transformation (i.e., action) on it while the evaluation network aims to predict the alignment precision between the transformed source point cloud and target point cloud as the reward signal. Once the dynamic model of the point cloud is trained, we employ the cross-entropy method (CEM) to iteratively update the planning policy by maximizing the rewards in the point cloud registration process. Thus, the optimal policy, i.e., the transformation between the source and target point clouds, can be obtained via gradually narrowing the search space of the transformation. Experimental results on ModelNet40 and 7Scene benchmark datasets demonstrate that our method can yield good registration performance in an unsupervised manner.
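To illustrate how CEM iteratively narrows the search over transformations, here is a hedged sketch that samples 6-DoF transforms (rotation vector plus translation) and refits a Gaussian to the elite samples. The reward here is a simple nearest-neighbour distance standing in for the paper's learned evaluation network, and all names are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.transform import Rotation

def cem_register(src, tgt, iters=20, pop=256, n_elite=32, rng=None):
    """Cross-entropy method over 6-DoF rigid transforms for registration."""
    rng = np.random.default_rng() if rng is None else rng
    tree = cKDTree(tgt)
    mu, sigma = np.zeros(6), np.full(6, 0.5)

    def reward(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        d, _ = tree.query(src @ R.T + x[3:])
        return -d.mean()          # higher reward = better alignment

    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, 6))
        scores = np.array([reward(x) for x in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]
        # Refit the sampling distribution to the best candidates.
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-4
    return Rotation.from_rotvec(mu[:3]).as_matrix(), mu[3:]
```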
We propose a new framework that formulates point cloud registration as a denoising diffusion process from noisy transformation to object transformation. During the training stage, the object transformation diffuses from the ground-truth transformation to a random distribution, and the model learns to reverse this noising process. In the sampling stage, the model progressively refines a randomly generated transformation into the output result. We derive the variational bound in closed form for training and provide implementations of the model. Our work provides the following crucial findings: (i) In contrast to most existing methods, our framework, Diffusion Probabilistic Models for Point Cloud Registration (PCRDiffusion), does not require repeatedly updating the source point cloud to refine the predicted transformation. (ii) Point cloud registration, one of the representative discriminative tasks, can be solved in a generative way with a unified probabilistic formulation. Finally, we discuss and provide an outlook on the application of diffusion models to different point cloud registration scenarios. Experimental results demonstrate that our model achieves competitive performance in point cloud registration. In both correspondence-free and correspondence-based scenarios, PCRDiffusion achieves performance improvements exceeding 50%.
3D point cloud registration is a fundamental task in robotics and computer vision. Recently, many learning-based point cloud registration methods based on correspondences have emerged. However, these methods heavily rely on such correspondences and meet great challenges with partial overlap. In this paper, we propose ROPNet, a new deep learning model using Representative Overlapping Points with discriminative features for registration that transforms partial-to-partial registration into partial-to-complete registration. Specifically, we propose a context-guided module which uses an encoder to extract global features for predicting point overlap score. To better find representative overlapping points, we use the extracted global features for coarse alignment. Then, we introduce a Transformer to enrich point features and remove non-representative points based on point overlap score and feature matching. A similarity matrix is built in a partial-to-complete mode, and finally, weighted SVD is adopted to estimate a transformation matrix. Extensive experiments over ModelNet40 using noisy and partially overlapping point clouds show that the proposed method outperforms traditional and learning-based methods, achieving state-of-the-art performance. The code is available at https://github.com/zhulf0804/ROPNet.
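The final weighted-SVD step is the standard weighted Procrustes/Kabsch solve; a self-contained version (illustrative, not ROPNet's code) is:

```python
import numpy as np

def weighted_svd_transform(src, tgt, w):
    """Closed-form rigid transform from weighted correspondences.

    src, tgt: (N, 3) matched points; w: (N,) non-negative weights.
    Returns R (3x3) and t (3,) minimizing sum_i w_i ||R src_i + t - tgt_i||^2.
    """
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_t = (w[:, None] * tgt).sum(axis=0)
    S, T = src - mu_s, tgt - mu_t
    H = (w[:, None] * S).T @ T                 # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t
```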
Deep learning-based point cloud registration models are often generalized from extensive training over a large volume of data to learn the ability to predict the desired geometric transformation to register 3D point clouds. In this paper, we propose a meta-learning based 3D registration model, named 3D Meta-Registration, that is capable of rapidly adapting and generalizing well to new 3D registration tasks for unseen 3D point clouds. Our 3D Meta-Registration gains a competitive advantage by training over a variety of 3D registration tasks, which leads to an optimized model for the best performance on the distribution of registration tasks, including potentially unseen tasks. Specifically, the proposed 3D Meta-Registration model consists of two modules: a 3D registration learner and a 3D registration meta-learner. During training, the 3D registration learner is trained to complete a specific registration task, aiming to determine the desired geometric transformation that aligns the source point cloud with the target one. In the meantime, the 3D registration meta-learner is trained to provide the optimal parameters to update the 3D registration learner based on the learned task distribution. After training, the 3D registration meta-learner, which is learned with the optimized coverage of the distribution of 3D registration tasks, is able to dynamically update 3D registration learners with the desired parameters to rapidly adapt to new registration tasks. We tested our model on the synthesized datasets ModelNet and FlyingThings3D, as well as the real-world dataset KITTI. Experimental results demonstrate that 3D Meta-Registration achieves superior performance over previous techniques (e.g., FlowNet3D).
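The learner/meta-learner interplay can be sketched as a generic MAML-style inner/outer loop. This is an illustration under assumptions, not the paper's exact procedure: `task_loss(model, task, params)` is a hypothetical helper that runs the registration learner functionally with the supplied parameter set on one sampled registration task.

```python
# Hedged MAML-style sketch: adapt the registration learner to each sampled task with
# one inner gradient step, then update the meta-parameters from the post-adaptation loss.
import torch

def maml_outer_step(model, tasks, task_loss, meta_opt, inner_lr=1e-2):
    meta_loss = 0.0
    for task in tasks:
        params = dict(model.named_parameters())
        # Inner step: adapt to this specific registration task.
        loss = task_loss(model, task, params)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
        # Outer objective: evaluate the adapted learner on held-out pairs of the same task.
        meta_loss = meta_loss + task_loss(model, task, adapted)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
    return meta_loss.item()
```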
3D visual illusion is a perceptual phenomenon where a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, making a flat artwork or object look three-dimensional in the human visual system. In this paper, we reveal that the machine visual system is also seriously fooled by 3D visual illusions, including monocular and binocular depth estimation. In order to explore and analyze the impact of 3D visual illusion on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images to train and evaluate SOTA monocular and binocular depth estimation methods. We also propose a 3D visual illusion depth estimation framework that uses common sense from the vision language model to adaptively fuse depth from binocular disparity and monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.
Accurately estimating depth in 360-degree imagery is crucial for virtual reality, autonomous navigation, and immersive media applications. Existing depth estimation methods designed for perspective-view imagery fail when applied to 360-degree images due to different camera projections and distortions, whereas 360-degree methods perform worse due to the lack of labeled data pairs. We propose a new depth estimation framework that utilizes unlabeled 360-degree data effectively. Our approach uses state-of-the-art perspective depth estimation models as teacher models to generate pseudo labels through a six-face cube projection technique, enabling efficient labeling of depth in 360-degree images. This method leverages the increasing availability of large datasets. Our approach includes two main stages: offline mask generation for invalid regions and an online semi-supervised joint training regime. We tested our approach on benchmark datasets such as Matterport3D and Stanford2D3D, showing significant improvements in depth estimation accuracy, particularly in zero-shot scenarios. Our proposed training pipeline can enhance any 360 monocular depth estimator and demonstrates effective knowledge transfer across different camera projections and data types. See our project page for results: https://albert100121.github.io/Depth-Anywhere/
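The six-face cube projection that feeds the perspective teacher can be sketched as follows; the face orientations and the nearest-neighbour sampling are simplifying assumptions for illustration, not the paper's exact implementation.

```python
# Sketch: render one 90-degree-FOV cube face from an equirectangular image so a
# perspective depth model can label it. equi: [H, W, C] NumPy array.
import numpy as np

def equirect_to_cube_face(equi, face, size=256):
    u = (np.arange(size) + 0.5) / size * 2 - 1
    x, y = np.meshgrid(u, u)
    ones = np.ones_like(x)
    # Assumed ray directions for each of the six axis-aligned faces.
    dx, dy, dz = {
        "front": ( x, -y,  ones), "back":  (-x, -y, -ones),
        "right": ( ones, -y, -x), "left":  (-ones, -y,  x),
        "up":    ( x,  ones,  y), "down":  ( x, -ones, -y),
    }[face]
    lon = np.arctan2(dx, dz)                                        # [-pi, pi]
    lat = np.arcsin(dy / np.sqrt(dx**2 + dy**2 + dz**2))            # [-pi/2, pi/2]
    H, W = equi.shape[:2]
    col = ((lon / np.pi + 1) / 2 * (W - 1)).astype(int)
    row = ((0.5 - lat / np.pi) * (H - 1)).astype(int)
    return equi[row, col]                                           # [size, size, C] face image
```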
Robot swarms to date are not prepared for low-cost autonomous navigation, such as path planning and obstacle detection, on the forest floor. Advances in depth sensing and embedded computing hardware pave the way for swarms of terrestrial robots. The goal of this research is to improve this situation by developing a low-cost vision system that lets small ground robots rapidly perceive terrain. We develop two depth estimation models and evaluate them on the Raspberry Pi 4 and Jetson Nano in terms of accuracy, runtime, and model size, as well as the memory consumption, power draw, temperature, and cost of the two embedded on-board computers. Our research demonstrates that the auto-encoder network deployed on the Raspberry Pi 4 runs with a power consumption of 3.4 W, memory consumption of about 200 MB, and mean runtime of 13 ms, which meets our requirements for a low-cost robot swarm. Moreover, our analysis also indicates that the multi-scale deep network performs better at predicting depth maps from RGB images blurred by camera motion. This paper mainly describes depth estimation models trained on our own dataset recorded in a forest and their performance on embedded on-board computers.
Multi-frame depth estimation generally achieves high accuracy relying on the multi-view geometric consistency. When applied in dynamic scenes, e.g., autonomous driving, this consistency is usually violated in the dynamic areas, leading to corrupted estimations. Many multi-frame methods handle dynamic areas by identifying them with explicit masks and compensating the multi-view cues with monocular cues represented as local monocular depth or features. The improvements are limited due to the uncontrolled quality of the masks and the underutilized benefits of the fusion of the two types of cues. In this paper, we propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing the heuristically crafted masks. As unveiled in our analyses, the multi-view cues capture more accurate geometric information in static areas, and the monocular cues capture more useful contexts in dynamic areas. To let the geometric perception learned from multi-view cues in static areas propagate to the monocular representation in dynamic areas and let monocular cues enhance the representation of multi-view cost volume, we propose a cross-cue fusion (CCF) module, which includes the cross-cue attention (CCA) to encode the spatially non-local relative intra-relations from each source to enhance the representation of the other. Experiments on real-world datasets prove the significant effectiveness and generalization ability of the proposed method.
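A minimal sketch of what a cross-cue fusion block could look like, assuming both cues have been flattened to token sequences of the same channel width; the dimensions and the final linear fusion are illustrative assumptions rather than the paper's exact CCF/CCA design.

```python
# Hedged cross-cue fusion sketch: the multi-view cost-volume features attend to the
# monocular features and vice versa, then the two enhanced representations are fused.
import torch
import torch.nn as nn

class CrossCueFusion(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn_mv = nn.MultiheadAttention(dim, heads, batch_first=True)    # mono enhances multi-view
        self.attn_mono = nn.MultiheadAttention(dim, heads, batch_first=True)  # multi-view enhances mono
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, mv, mono):
        # mv, mono: [B, N, C] flattened spatial features from the two cues.
        mv_enh, _ = self.attn_mv(query=mv, key=mono, value=mono)
        mono_enh, _ = self.attn_mono(query=mono, key=mv, value=mv)
        return self.fuse(torch.cat([mv + mv_enh, mono + mono_enh], dim=-1))
```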
Depth estimation plays a pivotal role in advancing human-robot interactions, especially in indoor environments where accurate 3D scene reconstruction is essential for tasks like navigation and object handling. Monocular depth estimation, which relies on a single RGB camera, offers a more affordable solution compared to traditional methods that use stereo cameras or LiDAR. However, despite recent progress, many monocular approaches struggle with accurately defining depth boundaries, leading to less precise reconstructions. In response to these challenges, this study introduces a novel depth estimation framework that leverages latent space features within a deep convolutional neural network to enhance the precision of monocular depth maps. The proposed model features a dual encoder-decoder architecture, enabling both color-to-depth and depth-to-depth transformations. This structure allows for refined depth estimation through latent space encoding. To further improve the accuracy of depth boundaries and local features, a new loss function is introduced. This function combines a latent loss with a gradient loss, helping the model maintain the integrity of depth boundaries. The framework is thoroughly tested using the NYU Depth V2 dataset, where it sets a new benchmark, particularly excelling in complex indoor scenarios. The results clearly show that this approach effectively reduces depth ambiguities and blurring, making it a promising solution for applications in human-robot interaction and 3D scene reconstruction.
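The combined objective can be sketched as an L1 depth term plus latent-consistency and gradient terms; the loss weights and the specific latent codes `z_pred`/`z_gt` (outputs of the two encoder branches) are assumptions for illustration.

```python
# Minimal sketch of a combined latent + gradient depth loss in the spirit described above.
import torch
import torch.nn.functional as F

def depth_loss(pred, gt, z_pred, z_gt, w_latent=0.1, w_grad=1.0):
    l1 = F.l1_loss(pred, gt)
    latent = F.mse_loss(z_pred, z_gt)                       # align latent codes of the two branches
    # Gradient term: penalize differences of horizontal/vertical depth gradients,
    # which encourages sharp, well-placed depth boundaries.
    dx_p, dx_g = pred[..., :, 1:] - pred[..., :, :-1], gt[..., :, 1:] - gt[..., :, :-1]
    dy_p, dy_g = pred[..., 1:, :] - pred[..., :-1, :], gt[..., 1:, :] - gt[..., :-1, :]
    grad = F.l1_loss(dx_p, dx_g) + F.l1_loss(dy_p, dy_g)
    return l1 + w_latent * latent + w_grad * grad
```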
We present an algorithm for reconstructing dense, geometrically consistent depth for all pixels in a monocular video. We leverage a conventional structure-from-motion reconstruction to establish geometric constraints on pixels in the video. Unlike the ad-hoc priors in classical reconstruction, we use a learning-based prior, i.e., a convolutional neural network trained for single-image depth estimation. At test time, we fine-tune this network to satisfy the geometric constraints of a particular input video, while retaining its ability to synthesize plausible depth details in parts of the video that are less constrained. We show through quantitative validation that our method achieves higher accuracy and a higher degree of geometric consistency than previous monocular reconstruction methods. Visually, our results appear more stable. Our algorithm is able to handle challenging hand-held captured input videos with a moderate degree of dynamic motion. The improved quality of the reconstruction enables several applications, such as scene reconstruction and advanced video-based visual effects.
We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at https://github.com/apple/ml-depth-pro
The perception of vehicles and pedestrians in urban scenarios is crucial for autonomous driving. This process typically involves complicated data collection and imposes high computational and hardware demands. To address these limitations, we first develop a highly efficient method for generating virtual datasets, which enables the creation of task- and scenario-specific datasets in a short time. Leveraging this method, we construct the virtual depth estimation dataset VirDepth, a large-scale, multi-task autonomous driving dataset. Subsequently, we propose CenterDepth, a lightweight architecture for monocular depth estimation that ensures high operational efficiency and exhibits superior performance in depth estimation tasks with highly imbalanced height-scale distributions. CenterDepth integrates global semantic information through the innovative Center FC-CRFs algorithm, aggregates multi-scale features based on object key points, and enables detection-based depth estimation of targets. Experiments demonstrate that our proposed method achieves superior performance in terms of both computational speed and prediction accuracy.
Recent advancements in neural networks have led to reliable monocular depth estimation. Monocular depth estimation techniques have the upper hand over traditional depth estimation techniques as they only need one image during inference. Depth estimation is one of the essential tasks in robotics, and monocular depth estimation has a wide variety of safety-critical applications, such as self-driving cars and surgical devices. Thus, the robustness of such techniques is crucial. Recent works have shown that deep neural networks are highly vulnerable to adversarial samples for tasks like classification, detection, and segmentation. These adversarial samples can completely ruin the output of the system, making their credibility in real-time deployment questionable. In this paper, we investigate the robustness of state-of-the-art monocular depth estimation networks against adversarial attacks. Our experiments show that tiny perturbations on an image that are invisible to the naked eye (perturbation attack) and corruption of less than about 1% of an image (patch attack) can affect the depth estimation drastically. We introduce a novel deep feature annihilation loss that corrupts the hidden feature space representation, forcing the decoder of the network to output poor depth maps. White-box and black-box tests complement the effectiveness of the proposed attack. We also perform adversarial example transferability tests, mainly cross-data transferability.
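In the spirit of the feature-annihilation idea, a perturbation attack can be sketched as a bounded iterative step that drives an intermediate feature map toward zero; `encoder` is a hypothetical stand-in for the depth network's feature extractor, and the bound and step sizes are illustrative assumptions.

```python
# Hedged sketch of an imperceptible perturbation attack that shrinks hidden features,
# so the downstream depth decoder receives degenerate representations.
import torch

def feature_annihilation_attack(encoder, image, epsilon=2/255, alpha=0.5/255, steps=10):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        feat = encoder(image + delta)
        loss = feat.pow(2).mean()               # energy of the hidden features
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend on feature energy (annihilation)
            delta.clamp_(-epsilon, epsilon)     # keep the perturbation imperceptible
        delta.grad.zero_()
    return (image + delta).detach()
```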
Recent developments in monocular depth estimation methods enable high-quality depth estimation of single-view images but fail to estimate consistent video depth across different frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is expensive to train and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video-depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. First, we fine-tune the DUSt3R model with additional estimated monocular depth as inputs for dynamic scenes. Then, we apply optimization to reconstruct both depth maps and camera poses. Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video, outperforming baseline methods.
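One common building block for this kind of alignment is a per-frame least-squares scale-and-shift fit of the monocular depth to a reference depth (e.g. depth from a pointmap); the sketch below shows that generic step, not Align3R's full joint optimization over depth maps and camera poses.

```python
# Hedged sketch: align a monocular depth map to a reference depth with a closed-form
# least-squares scale and shift over the valid pixels.
import numpy as np

def scale_shift_align(mono_depth, ref_depth, valid):
    d = mono_depth[valid].reshape(-1)
    r = ref_depth[valid].reshape(-1)
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)   # solve min ||s*d + t - r||^2
    return s * mono_depth + t                        # aligned depth map
```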
Depth estimation is an important task in various robotics systems and applications. In mobile robotics systems, monocular depth estimation is desirable since a single RGB camera can be deployed at low cost and in a compact size. Due to this significant and growing need, many lightweight monocular depth estimation networks have been proposed for mobile robotics systems. While most lightweight monocular depth estimation methods have been developed using convolutional neural networks, the Transformer has recently been adopted for monocular depth estimation. However, the Transformer's massive parameter count and large computational cost hinder deployment on embedded devices. In this paper, we present the Token-Sharing Transformer (TST), an architecture using the Transformer for monocular depth estimation, optimized especially for embedded devices. The proposed TST utilizes global token sharing, which enables the model to obtain accurate depth predictions with high throughput on embedded devices. Experimental results show that TST outperforms existing lightweight monocular depth estimation methods. On the NYU Depth v2 dataset, TST can deliver depth maps at up to 63.4 FPS on the NVIDIA Jetson Nano and 142.6 FPS on the NVIDIA Jetson TX2, with lower errors than existing methods. Furthermore, TST achieves real-time depth estimation of high-resolution images on the Jetson TX2 with competitive results.
Self-supervised monocular depth estimation has become an appealing solution to the lack of ground truth labels, but its reconstruction loss often produces over-smoothed results across object boundaries and is incapable of handling occlusion explicitly. In this paper, we propose a new approach to leverage pseudo ground truth depth maps of stereo images generated from self-supervised stereo matching methods. The confidence map of the pseudo ground truth depth map is estimated to mitigate performance degeneration by inaccurate pseudo depth maps. To cope with the prediction error of the confidence map itself, we also leverage the threshold network that learns the threshold dynamically conditioned on the pseudo depth maps. The pseudo depth labels filtered out by the thresholded confidence map are used to supervise the monocular depth network. Furthermore, we propose the probabilistic framework that refines the monocular depth map with the help of its uncertainty map through the pixel-adaptive convolution (PAC) layer. Experimental results demonstrate superior performance to state-of-the-art monocular depth estimation methods. Lastly, we exhibit that the proposed threshold learning can also be used to improve the performance of existing confidence estimation approaches.
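The core supervision rule, filtering pseudo depth by a confidence threshold, can be sketched in a few lines; here `tau` is a scalar placeholder for the output of the threshold network, which in the paper is predicted dynamically from the pseudo depth maps.

```python
# Minimal sketch of confidence-thresholded pseudo supervision: pseudo stereo depth
# only supervises pixels whose confidence exceeds the threshold.
import torch

def thresholded_pseudo_loss(pred_depth, pseudo_depth, confidence, tau=0.7):
    mask = (confidence > tau).float()                   # keep only trusted pseudo labels
    valid = mask.sum().clamp(min=1.0)
    return (mask * (pred_depth - pseudo_depth).abs()).sum() / valid
```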
Concept for an Automatic Annotation of Automotive Radar Data Using AI-segmented Aerial Camera Images
This paper presents an approach to automatically annotate automotive radar data with AI-segmented aerial camera images. For this, the images of an unmanned aerial vehicle (UAV) above a radar vehicle are panoptically segmented and mapped in the ground plane onto the radar images. The detected instances and segments in the camera image can then be applied directly as labels for the radar data. Owing to the advantageous bird's eye position, the UAV camera does not suffer from optical occlusion and is capable of creating annotations within the complete field of view of the radar. The effectiveness and scalability are demonstrated in measurements, where 589 pedestrians in the radar data were automatically labeled within 2 minutes.
Object detection in camera images using deep learning has proven successful in recent years. Rising detection rates and computationally efficient network structures are pushing this technique towards application in production vehicles. Nevertheless, the sensor quality of the camera is limited in severe weather conditions and by increased sensor noise in sparsely lit areas and at night. Our approach enhances current 2D object detection networks by fusing camera data and projected sparse radar data in the network layers. The proposed CameraRadarFusionNet (CRF-Net) automatically learns at which level the fusion of the sensor data is most beneficial for the detection result. Additionally, we introduce BlackIn, a training strategy inspired by Dropout, which focuses the learning on a specific sensor type. We show that the fusion network is able to outperform a state-of-the-art image-only network on two different datasets. The code for this research will be made available to the public at: https://github.com/TUMFTM/CameraRadarFusionNet.
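A minimal sketch of layer-level fusion with a BlackIn-style sensor dropout is shown below; the channel sizes and the simple concatenation-based fusion are assumptions, since CRF-Net learns at which depth the fusion happens rather than fixing it.

```python
# Hedged sketch: concatenate projected sparse radar channels with image features and,
# with some probability during training, zero the camera branch (BlackIn-style) so the
# network learns to exploit the radar input.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, img_ch=64, radar_ch=2, out_ch=64, p_blackin=0.2):
        super().__init__()
        self.p_blackin = p_blackin
        self.conv = nn.Sequential(
            nn.Conv2d(img_ch + radar_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, img_feat, radar_proj):
        # radar_proj: sparse radar returns projected into the image plane, [B, radar_ch, H, W].
        if self.training and torch.rand(()) < self.p_blackin:
            img_feat = torch.zeros_like(img_feat)        # drop the camera branch
        return self.conv(torch.cat([img_feat, radar_proj], dim=1))
```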
In this paper we present The Oxford Radar RobotCar Dataset, a new dataset for researching scene understanding using Millimetre-Wave FMCW scanning radar data. The target application is autonomous vehicles, where this modality is robust to environmental conditions such as fog, rain, snow, or lens flare, which typically challenge other sensor modalities such as vision and LIDAR. The data were gathered in January 2019 over thirty-two traversals of a central Oxford route spanning a total of 280 km of urban driving. It encompasses a variety of weather, traffic, and lighting conditions. This 4.7 TB dataset consists of over 240,000 scans from a Navtech CTS350-X radar and 2.4 million scans from two Velodyne HDL-32E 3D LIDARs, along with six cameras, two 2D LIDARs, and a GPS/INS receiver. In addition, we release ground-truth optimised radar odometry to provide an additional impetus to research in this domain. The full dataset is available for download at: ori.ox.ac.uk/datasets/radar-robotcar-dataset
This report synthesizes frontier research in 3D visual perception, forming a complete technical chain from low-level depth estimation to high-level spatial alignment. The core of the field has evolved from single-modality monocular depth prediction into a dual-track paradigm of multi-modal fusion (radar + vision) and large-model-driven approaches (diffusion models + foundation models). Depth completion techniques effectively address sensor sparsity, while point cloud registration and SLAM integration ensure spatial consistency across large-scale scenes. The combined application of these techniques is significantly improving the perception accuracy and robustness of autonomous driving, robot navigation, and precision medicine in complex, dynamic, and adverse environments.