Classification and Recognition/Localization Methods
Object Detection Method Taxonomies and Task Surveys/Benchmarks (YOLO Evolution; Real-Time, Semi-Supervised, and Rotated Detection)
Systematizing work dominated by surveys and paradigm reviews, covering two-stage and one-stage detection frameworks, the evolution of the YOLO family, the state of real-time detection research, and task directions such as semi-supervised and rotated detection. The group also includes systematic surveys of object detection and localization methods, benchmark evaluations, and out-of-domain generalization studies, establishing an overall knowledge map for classification and recognition/localization methods.
- Architecture review: Two-stage and one-stage object detection(Sara A. Mohammed, 2025, Franklin Open)
- A Review of YOLO Algorithm and Its Applications in Autonomous Driving Object Detection(Jiapei Wei, A. As’arry, Khairil Anas Md Rezali, Mohd Zuhri Mohamed Yusoff, Haohao Ma, Kunlun Zhang, 2025, IEEE Access)
- A Survey on Real-Time Object Detection on FPGAs(Seyed Hani Hozhabr, Roberto Giorgi, 2025, IEEE Access)
- Semi-Supervised Object Detection: A Survey on Progress from CNN to Transformer(Tahira Shehzadi, Ifza Ifza, Marcus Liwicki, Didier Stricker, Muhammad Zeshan Afzal, 2024, Sensors)
- Oriented object detection in optical remote sensing images using deep learning: a survey(Kunlin Wang, Zi Wang, Zhang Li, Ang Su, Xichao Teng, Minhao Liu, Qifeng Yu, 2023, Artificial Intelligence Review)
- A Benchmark Review of YOLO Algorithm Developments for Object Detection(Zhengmao Hua, K. Aranganadin, Cheng-Cheng Yeh, Xinhe Hai, Chen-Yun Huang, T. Leung, Hua-Yi Hsu, Yung-Chiang Lan, Ming-Chieh Lin, 2025, IEEE Access)
- Delving into YOLO Object Detection Models: Insights into Adversarial Robustness(Kyriakos D. Apostolidis, G. Papakostas, 2025, Electronics)
- A Decade of You Only Look Once (YOLO) for Object Detection: A Review(L. T. Ramos, A. Sappa, 2025, IEEE Access)
- Context in object detection: a systematic literature review(Mahtab Jamali, Paul Davidsson, Reza Khoshkangini, M. Ljungqvist, R. Mihailescu, 2025, Artificial Intelligence Review)
- Advances in Object Detection and Localization Techniques for Fruit Harvesting Robots(Xiaojie Shi, Shaowei Wang, Bo Zhang, Xinbing Ding, Peng Qi, Huixing Qu, Ning Li, Jie Wu, Huawei Yang, 2025, Agronomy)
- Research on object detection and recognition in remote sensing images based on YOLOv11(Lu-hao He, Yong-zhang Zhou, Lei Liu, Wei Cao, Jianhua Ma, 2025, Scientific Reports)
- A Systematic Review of YOLO-Based Object Detection in Medical Imaging: Advances, Challenges, and Future Directions(Zhenhui Cai, Kaiqing Zhou, Zhouhua Liao, 2025, Computers, Materials & Continua)
- YOLO Object Detection for Real-Time Fabric Defect Inspection in the Textile Industry: A Review of YOLOv1 to YOLOv11(Makara Mao, Min Hong, 2025, Sensors)
- Object Detection with Multimodal Large Vision-Language Models: An In-depth Review(Ranjan Sapkota, Manoj Karkee, 2025, ArXiv Preprint)
- YOLO-IOD: Towards Real Time Incremental Object Detection(Shizhou Zhang, Xueqiang Lv, Yinghui Xing, Qirui Wu, Di Xu, Chen Zhao, Yanning Zhang, 2025, ArXiv Preprint)
Open-Vocabulary and Open-World: Instance Recognition/Tracking with Class-Agnostic Masks
Focuses on open-vocabulary/open-world video instance segmentation: class-agnostic mask proposals combined with instance-token/CLIP-guided association and classification enable recognition and tracking of video instances from categories unseen during training, in contrast to the "box + category" paradigm of closed-set detectors.
- OpenVIS: Open-vocabulary Video Instance Segmentation(Pinxue Guo, Hao Huang, Peiyang He, Xuefeng Liu, Tianjun Xiao, Wenqiang Zhang, 2025, Proceedings of the AAAI Conference on Artificial Intelligence)
3D Detection and Geometric Representation/Encoding: Geometric Constraints for Stable Localization
These works share an emphasis on localization stability under 3D/spatial geometric constraints: adjusting bounding-box parameterization and encoding, exploiting multimodal semantics and physical/geometric priors to improve 2D-3D alignment, or introducing geometry-constrained diffusion/association in multi-view radar space to strengthen the robustness of spatial localization. A minimal sketch of the standard center/size/yaw box encoding follows the list below.
- Rethinking the Encoding and Annotating of 3D Bounding Box: Corner-Aware 3D Object Detection from Point Clouds(Qinghao Meng, Junbo Yin, Jianbing Shen, Yunde Jia, 2025, ArXiv Preprint)
- From Dataset to Real-world: General 3D Object Detection via Generalized Cross-domain Few-shot Learning(Shuangzhi Li, Junlong Shen, Lei Ma, Xingyu Li, 2025, ArXiv Preprint)
- REXO: Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion(Ryoma Yataka, Pu Perry Wang, Petros Boufounos, Ryuhei Takahashi, 2025, ArXiv Preprint)
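To make the encoding question concrete, the following is a minimal sketch (illustrative only, not taken from any paper above) of the standard conversion from a (center, size, yaw) parameterization to the eight box corners that corner-aware encodings regress directly; all names are ours.

```python
import numpy as np

def box3d_to_corners(cx, cy, cz, l, w, h, yaw):
    """Convert a (center, size, yaw) 3D box to its 8 corners.

    Corner order: the 4 bottom corners counter-clockwise, then the
    4 top corners. Yaw rotates about the vertical (z) axis.
    """
    # Axis-aligned corner offsets from the box center.
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * (l / 2.0)
    y = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * (w / 2.0)
    z = np.array([-1, -1, -1, -1,  1,  1,  1,  1]) * (h / 2.0)
    # Rotate about z by the yaw angle, then translate to the center.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    corners = rot @ np.stack([x, y, z])                # (3, 8)
    return (corners + np.array([[cx], [cy], [cz]])).T  # (8, 3)

corners = box3d_to_corners(0.0, 0.0, 0.0, 4.0, 2.0, 1.5, np.pi / 6)
print(corners.shape)  # (8, 3)
```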
Geometric Modeling and Boundary/Box Regression Enhancement (Strip/Oriented/Boundary-Guided Designs and IoU-Family Losses)
Centered on the key chain of box regression and geometric modeling: anisotropic/strip convolutions and adaptive box scaling handle strip-shaped, oriented, or elongated targets, while boundary guidance and reworked IoU-family regression losses improve convergence and localization accuracy, emphasizing regression modeling rather than merely stacking network capacity. A sketch of one IoU-family loss follows the list below.
- Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection(Xinbin Yuan, Zhaohui Zheng, Yuxuan Li, Xialei Liu, Li Liu, Xiang Li, Qibin Hou, Ming-Ming Cheng, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects(Woojin Lee, Hyugjae Chang, Jaeho Moon, Jaehyup Lee, Munchurl Kim, 2025, ArXiv Preprint)
- BGSNet: A boundary-guided Siamese multitask network for semantic change detection from high-resolution remote sensing images(Jiang Long, Sicong Liu, Mengmeng Li, Hang Zhao, Yanmin Jin, 2025, ISPRS Journal of Photogrammetry and Remote Sensing)
- Enhancing Bounding Box Regression for Object Detection: Dimensional Angle Precision IoU-Loss(Hilmy Aliy Andra Putra, Aniati Murni Arymurthy, D. Chahyati, 2025, IEEE Access)
- Rethinking the Encoding and Annotating of 3D Bounding Box: Corner-Aware 3D Object Detection from Point Clouds(Qinghao Meng, Junbo Yin, Jianbing Shen, Yunde Jia, 2025, ArXiv Preprint)
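As one concrete member of the IoU-loss family discussed above (generalized IoU rather than the DAPIoU formulation of the cited paper), here is a minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss for axis-aligned boxes given as (x1, y1, x2, y2), shape (N, 4)."""
    # Intersection rectangle.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    # Areas and union.
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # The smallest enclosing box adds a penalty when boxes are disjoint,
    # giving non-zero gradients even at zero IoU.
    lt_c = torch.min(pred[:, :2], target[:, :2])
    rb_c = torch.max(pred[:, 2:], target[:, 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    area_c = wh_c[:, 0] * wh_c[:, 1]
    giou = iou - (area_c - union) / (area_c + eps)
    return (1.0 - giou).mean()
```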
Rotated (Arbitrary-Pose) Object Classification and Localization: Spatial Alignment and Rotation Losses/Representations
Dedicated to the classification and localization of rotated/arbitrary-pose targets: spatial transformer networks or specialized rotated-box representations with anisotropic modeling address the feature-alignment and regression difficulties that arise across orientations, alongside systematic discussion of the rotated-detection problem and improved losses/representations. A Gaussian-box sketch follows the list below.
- Adaptive YOLOv6 with spatial Transformer Networks for accurate object detection and Multi-Angle classification in remote sensing images(G. Rajendran, G. Srinivasan, Niruban Rathakrishnan, 2025, Expert Systems with Applications)
- Oriented object detection in optical remote sensing images using deep learning: a survey(Kunlin Wang, Zi Wang, Zhang Li, Ang Su, Xichao Teng, Minhao Liu, Qifeng Yu, 2023, Artificial Intelligence Review)
- Enhancing Rotated Object Detection via Anisotropic Gaussian Bounding Box and Bhattacharyya Distance(Chien Thai, Mai Xuan Trang, Huong Ninh, Hoang Hiep Ly, Anh Son Le, 2025, ArXiv Preprint)
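To illustrate the Gaussian-box representation behind the anisotropic-Gaussian line of work, here is a minimal sketch assuming the common mapping of an oriented box (cx, cy, w, h, θ) to a 2D Gaussian; this is our own illustration, not the cited authors' code.

```python
import numpy as np

def obb_to_gaussian(cx, cy, w, h, theta):
    """Map an oriented box to a 2D Gaussian (mean, covariance)."""
    mu = np.array([cx, cy])
    r = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # Half-extents define the principal standard deviations.
    sigma = r @ np.diag([(w / 2) ** 2, (h / 2) ** 2]) @ r.T
    return mu, sigma

def bhattacharyya(mu1, s1, mu2, s2):
    """Bhattacharyya distance between two 2D Gaussians."""
    s = (s1 + s2) / 2
    d = mu1 - mu2
    term1 = d @ np.linalg.solve(s, d) / 8
    term2 = 0.5 * np.log(np.linalg.det(s) /
                         np.sqrt(np.linalg.det(s1) * np.linalg.det(s2)))
    return term1 + term2

m1, c1 = obb_to_gaussian(0, 0, 4, 2, 0.0)
m2, c2 = obb_to_gaussian(0, 0, 4, 2, np.pi / 4)
print(bhattacharyya(m1, c1, m2, c2))  # > 0: rotation alone is penalized
```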
Small-Object and Scale-Adaptive Detection/Localization (Multi-Scale Fusion, Zoom-In, and Dynamic Detection Heads)
Addresses scale adaptation, small objects, and multi-scale localization under difficult conditions: zoom-in, adaptive upsampling, multi-scale feature fusion, and improved detection heads raise small-object recall, validated in complex scenes and combined with lightweight designs and attention/aggregation strategies to preserve localization accuracy across scales and resource budgets. A minimal top-down fusion sketch follows the list below.
- Adaptive Image Zoom-in with Bounding Box Transformation for UAV Object Detection(Tao Wang, Chenyu Lin, Chenwei Tang, Jizhe Zhou, Deng Xiong, Jianan Li, Jian Zhao, Jiancheng Lv, 2026, ArXiv Preprint)
- YOLOv8 with Post-Processing for Small Object Detection Enhancement(Jin-Kyu Ryu, Dong-Sik Kwak, Seungmin Choi, 2025, Applied Sciences)
- SOD-YOLO: Enhancing YOLO-Based Detection of Small Objects in UAV Imagery(Peijun Wang, Jinhua Zhao, 2025, ArXiv Preprint)
- Small Object Detection with YOLO: A Performance Analysis Across Model Versions and Hardware(Muhammad Fasih Tariq, Muhammad Azeem Javed, 2025, ArXiv Preprint)
- ESO-DETR: An Improved Real-Time Detection Transformer Model for Enhanced Small Object Detection in UAV Imagery(Yingfan Liu, Miao He, Bin Hui, 2025, Drones)
- RLRD-YOLO: An Improved YOLOv8 Algorithm for Small Object Detection from an Unmanned Aerial Vehicle (UAV) Perspective(Hanyun Li, Yi Li, Linsong Xiao, Yunfeng Zhang, Lihua Cao, Di Wu, 2025, Drones)
- HierLight-YOLO: A Hierarchical and Lightweight Object Detection Network for UAV Photography(Defan Chen, Yaohua Hu, Luchan Zhang, 2025, ArXiv Preprint)
- Small Object Detection for Birds with Swin Transformer(Da Huo, Marc A. Kastner, Tingwei Liu, Yasutomo Kawanishi, Takatsugu Hirayama, Takahiro Komamizu, Ichiro Ide, 2025, ArXiv Preprint)
- Small Object Detection Method for UAV Remote Sensing Images Based on αS-YOLO(Wei Hou, Haomeng Wu, Di Wu, Yulin Shen, Ze Liu, Lili Zhang, Jicai Li, 2025, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing)
- LRDS-YOLO enhances small object detection in UAV aerial images with a lightweight and efficient design(Yuqi Han, Chengcheng Wang, Hui Luo, Huihua Wang, Zaiqing Chen, Yuelong Xia, Lijun Yun, 2025, Scientific Reports)
- Optimized YOLOv8 for multi-scale object detection(Areeg Fahad Rasheed, M. Zarkoosh, 2024, Journal of Real-Time Image Processing)
- YOLOv11-based multi-task learning for enhanced bone fracture detection and classification in X-ray images(W. Wei, Yan Huang, Junchi Zheng, Yuanyong Rao, Yongping Wei, X. Tan, H. Ouyang, 2025, Journal of Radiation Research and Applied Sciences)
- SMR-Net:Robot Snap Detection Based on Multi-Scale Features and Self-Attention Network(Kuanxu Hou, 2026, ArXiv Preprint)
- HAF-YOLO: Dynamic Feature Aggregation Network for Object Detection in Remote-Sensing Images(Pengfei Zhang, Jian Liu, Jianqiang Zhang, Yiping Liu, Jiahao Shi, 2025, Remote Sensing)
- FLRNet: A bio-inspired three-stage network for Camouflaged Object Detection via filtering, localization and refinement(Yilin Zhao, Qing Zhang, Yuetong Li, 2025, Neurocomputing)
- Lightweight oriented object detection with Dynamic Smooth Feature Fusion Network(Iftikhar Ahmad, Wei Lu, Sibao Chen, Jin Tang, Bin Luo, 2025, Neurocomputing)
- Precision and speed: LSOD-YOLO for lightweight small object detection(Hezheng Wang, Jiahui Liu, Jian Zhao, Jianzhong Zhang, Dong Zhao, 2025, Expert Systems with Applications)
- Small object detection using hybrid evaluation metric with context decoupling(Kang Tong, Yiquan Wu, 2025, Multimedia Systems)
- Progressive class-aware instance enhancement for aircraft detection in remote sensing imagery(Tianjun Shi, J. Gong, Jianming Hu, Yu Sun, Guangzheng Bao, Pengfei Zhang, Junjie Wang, Xiyang Zhi, Wei Zhang, 2025, Pattern Recognition)
- REXO: Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion(Ryoma Yataka, Pu Perry Wang, Petros Boufounos, Ryuhei Takahashi, 2025, ArXiv Preprint)
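Most of the multi-scale methods above build on some variant of top-down feature fusion. A minimal FPN-style sketch in PyTorch; channel counts and class names are illustrative assumptions, not any cited paper's design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Top-down feature fusion: deep semantics flow into shallow,
    high-resolution maps, where small objects live."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats ordered high-res -> low-res
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Upsample deeper maps and add them into shallower ones.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]

fpn = TinyFPN()
feats = [torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40),
         torch.randn(1, 1024, 20, 20)]
print([o.shape[-1] for o in fpn(feats)])  # [80, 40, 20]
```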
Real-Time and Deployable Recognition/Localization: End-to-End Multi-Task/Instance Segmentation and Streaming Latency
Follows the main line of YOLO/one-stage detectors and engineered end-to-end pipelines, emphasizing real-time performance and deployability: lightweight designs, attention/dynamic modules, and on-device inference on edge, in-vehicle, and UAV platforms, together with real-time multi-task learning, instance-segmentation fusion, streaming-latency compensation, and end-to-end system integration. This group embodies the transition from detection/localization methods to running systems; a minimal sketch of the NMS post-processing step shared by most of these pipelines follows the list below.
- SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling(Guanghao Liao, Zhen Liu, Liyuan Cao, Yonghui Yang, Qi Li, 2026, ArXiv Preprint)
- MS-YOLO: Infrared Object Detection for Edge Deployment via MobileNetV4 and SlideLoss(Jiali Zhang, Thomas S. White, Haoliang Zhang, Wenqing Hu, Donald C. Wunsch, Jian Liu, 2025, ArXiv Preprint)
- Enhancing the YOLOv8 model for realtime object detection to ensure online platform safety(M. Jahan, Fokrul Islam Bhuiyan, Al Amin, M. Mridha, Mejdl S. Safran, Sultan Alfarhood, Dunren Che, 2025, Scientific Reports)
- Real-Time Object Detection and Classification using YOLO for Edge FPGAs(Rashed Al Amin, Roman Obermaisser, 2025, ArXiv Preprint)
- Object detection in real-time video surveillance using attention based transformer-YOLOv8 model(Divya Nimma, Omaia Al-Omari, Rahul Pradhan, Zoirov Ulmas, R. Krishna, Ts. Yousef A.Baker El-Ebiary, Vuda Sreenivasa Rao, 2025, Alexandria Engineering Journal)
- Real-time object detection using improvised YOLOv4 and feature mapping technique for autonomous driving(Kishore Kumar Anguchamy, V. Palanisamy, 2025, Expert Systems with Applications)
- Advanced vehicle monitoring in smart port utilizing deep denoising real-time object detectors integrated multi-resolution attention-augmented CRNN(A. Ta, L. Le, Linh Bui-Duy, 2025, Ain Shams Engineering Journal)
- A Multi-task Supervised Compression Model for Split Computing(Yoshitomo Matsubara, Matteo Mendula, Marco Levorato, 2025, ArXiv Preprint)
- CB-YOLO: Dense Object Detection of YOLO for Crowded Wheat Head Identification and Localization(Wenzhuo Chen, Qinxiu Gao, Shaohuang Bian, Baoxia Li, Junwei Guo, Dan Zhang, Cheng Yang, Wenzhuo Hu, F. Huang, 2024, Journal of Circuits, Systems and Computers)
- Cassava Crop Disease Prediction and Localization Using Object Detection(J. Kalezhi, Langtone Shumba, 2024, Crop Protection)
- GINSER: Geographic Information System Based Optimal Route Recommendation via Optimized Faster R-CNN(S. D. A. Selvasofia, B. SivaSankari, R. Dinesh, N. Muthukumaran, 2025, International Journal of Computational Intelligence Systems)
- Adaptive and soft constrained vision-map vehicle localization using Gaussian processes and instance segmentation(Bruno Henrique Groenner Barbosa, N. Bhatt, A. Khajepour, Ehsan Hashemi, 2025, Expert Systems with Applications)
- A deep learning framework for real-time multi-task recognition and measurement of concrete cracks(Gang Xu, Yingshui Zhang, Qingrui Yue, Xiaogang Liu, 2025, Advanced Engineering Informatics)
- Object Detection and Localization in Real-Time Using Image Processing and Deep Learning(Gaurav Bhakuni, Srikanth Srinivas, Sathish Rao, Ganesh Kumar Ayyalusamy, Saideep Nakka, Sandeep Kumar, 2025, 2025 International Conference on Engineering, Technology & Management (ICETM))
- YOLO-SRMX: A Lightweight Model for Real-Time Object Detection on Unmanned Aerial Vehicles(Shimin Weng, Han Wang, Jiashu Wang, Changming Xu, Ende Zhang, 2025, Remote Sensing)
- Improved real-time object detection method based on YOLOv8: a refined approach(Jiaqi Zhong, Huaming Qian, Huilin Wang, Wenna Wang, Yipeng Zhou, 2024, Journal of Real-Time Image Processing)
- Research on Automatic Driving Road Object Detection Algorithm Integrating Multi Scale Detection and Boundary Box Regression Optimization(L. Hao, 2025, 2025 4th International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE))
- Drone Image Localization by Faster R-CNN Algorithm and Detection Accuracy(Maysoon . Khazaal . Maaroof, M. Bouhlel, 2025, Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications)
- Resistance Spot Welding Defect Detection Based on Visual Inspection: Improved Faster R-CNN Model(Weijie Liu, Jie Hu, Jin Qi, 2024, Machines)
- HOMEY: Heuristic Object Masking with Enhanced YOLO for Property Insurance Risk Detection(Teerapong Panboonyuen, 2026, ArXiv Preprint)
- An Underwater Object Recognition System Based on Improved YOLOv11(Shun Cheng, Yan Han, Zhiqian Wang, Shaojin Liu, Bo Yang, Jianrong Li, 2025, Electronics)
- YOLO-DAFS: A Composite-Enhanced Underwater Object Detection Algorithm(Shengfu Luo, Chao Dong, Guixin Dong, Rongmin Chen, Bing Zheng, Ming Xiang, Peng Zhang, Zhanwei Li, 2025, Journal of Marine Science and Engineering)
- SU-YOLO: Spiking Neural Network for Efficient Underwater Object Detection(Chenyang Li, Wenxuan Liu, Guoqiang Gong, Xiaobo Ding, Xian Zhong, 2025, ArXiv Preprint)
- CIDNet: Cross-Scale Interference Mining Detection Network for underwater object detection(Gaoli Zhao, Kefei Zhang, Liangzhi Wang, Wenyi Zhao, Weidong Zhang, 2025, Knowledge-Based Systems)
- GMS-YOLO: A Lightweight Real-Time Object Detection Algorithm for Pedestrians and Vehicles Under Foggy Conditions(Yafei Chen, Yong Wang, Zhengming Zou, Wenxiu Dan, 2025, IEEE Internet of Things Journal)
- Assessing YOLO models for real-time object detection in urban environments for advanced driver-assistance systems (ADAS)(R. Ayachi, Yahia Said, M. Afif, Aadil Alshammari, Manel Hleili, Abdessalem Ben Abdelali, 2025, Alexandria Engineering Journal)
- Real-Time Multi-Task Deep Learning Model for Polyp Detection, Characterization, and Size Estimation(Phanukorn Sunthornwetchapong, Kasichon Hombubpha, K. Tiankanon, S. Aniwan, Pasit Jakkrawankul, N. Nupairoj, P. Vateekul, R. Rerknimitr, 2025, IEEE Access)
- DRespNeT: A UAV Dataset and YOLOv8-DRN Model for Aerial Instance Segmentation of Building Access Points for Post-Earthquake Search-and-Rescue Missions(Aykut Sirma, Angelos Plastropoulos, Gilbert Tang, Argyrios Zolotas, 2025, ArXiv Preprint)
- YOLOv8-TEA: Recognition Method of Tender Shoots of Tea Based on Instance Segmentation Algorithm(Wenbo Wang, Yidan Xi, Jinan Gu, Qiuyue Yang, Zhiyao Pan, Xinzhou Zhang, Gongyue Xu, Man Zhou, 2025, Agronomy)
- CorrDiff: Adaptive Delay-aware Detector with Temporal Cue Inputs for Real-time Object Detection(Xiang Zhang, Chenchen Fu, Yufei Cui, Lan Yi, Yuyang Sun, Weiwei Wu, Xue Liu, 2025, ArXiv Preprint)
- Enhancing Object Detection with Privileged Information: A Model-Agnostic Teacher-Student Approach(Matthias Bartolo, Dylan Seychell, Gabriel Hili, Matthew Montebello, Carl James Debono, Saviour Formosa, Konstantinos Makantasis, 2026, ArXiv Preprint)
- Model compression for real-time object detection using rigorous gradation pruning(Defu Yang, M. I. Solihin, Yawen Zhao, Bingyu Cai, Chaoran Chen, Andika Aji Wijaya, C. Ang, Wei Hong Lim, 2024, iScience)
- You Sense Only Once Beneath: Ultra-Light Real-Time Underwater Object Detection(Jun Dong, Wenli Wu, Jintao Cheng, Xiaoyu Tang, 2025, ArXiv Preprint)
- HierLight-YOLO: A Hierarchical and Lightweight Object Detection Network for UAV Photography(Defan Chen, Yaohua Hu, Luchan Zhang, 2025, ArXiv Preprint)
- Advancing autonomous SLAM systems: Integrating YOLO object detection and enhanced loop closure techniques for robust environment mapping(Qamar Ul Islam, Fatemeh Khozaei, E. M. Barhoumi, Imran Baig, D. Ignatyev, 2024, Robotics and Autonomous Systems)
- D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment(Argo Saakyan, Dmitry Solntsev, 2026, ArXiv Preprint)
- A Unified CNN-Based Instance Segmentation Architecture for Blood Cell Classification and Early Cancer Abnormality Recognition(Nathaniel H. Dumayas, Rhomwell Ace C. Merced, Kenniniah A. Rit, Gabriel Marc B. Verzosa, Lysa V. Comia, 2026, 2026 7th International Conference on Mobile Computing and Sustainable Informatics (ICMCSI))
- An Embedded Feature Pyramid Network Enables Bidirectional Information Flow for Object Detection and Instance Segmentation(Chunning Meng, Zequn Sun, T. Li, Lianzhi Huo, Shengjiang Chang, Zhiqing Zhang, 2024, Neurocomputing)
- Detection of Corals, Seagrass, and Seaweeds Using YOLOv9 Instance Segmentation with Image Augmentation(Ken D. Gorro, 2025, Journal of Image and Graphics)
- KongNet: A Multi-headed Deep Learning Model for Detection and Classification of Nuclei in Histopathology Images(Jiaqi Lv, Esha Sadia Nasir, Kesi Xu, Mostafa Jahanifar, Brinder Singh Chohan, Behnaz Elhaminia, Shan E Ahmed Raza, 2025, ArXiv Preprint)
- CelloType: a unified model for segmentation and classification of tissue images(Minxing Pang, Tarun Kanti Roy, Xiaodong Wu, Kai Tan, 2024, Nature Methods)
- Zero-Shot Tree Detection and Segmentation from Aerial Forest Imagery(Michelle Chen, David Russell, Amritha Pallavoor, Derek Young, Jane Wu, 2025, ArXiv Preprint)
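Nearly every real-time one-stage pipeline above ends in non-maximum suppression. As a reference point, a minimal greedy NMS sketch in NumPy (the threshold value is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression. boxes: (N, 4) as (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]  # indices, highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with all remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]  # drop heavily overlapping boxes
    return keep
```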
Resource-Constrained and Edge Deployment: Real-Time Detection Driven by Energy Efficiency, Compression, and Federated Learning
Shares resource constraints and deployment energy efficiency as the governing objective: lightweight architectures, on-device/FPGA implementations, and accuracy-latency-compute trade-offs, with federated learning and self-distillation folded into real-time detection frameworks to improve privacy and continual adaptation. Efficiency and deployability are the primary goals of this group; a minimal pruning sketch follows the list below.
- VINO_EffiFedAV: VINO with efficient federated learning through selective client updates for real-time autonomous vehicle object detection(K. Vinoth, P. Sasikumar, 2025, Results in Engineering)
- Teach YOLO to Remember: A Self-Distillation Approach for Continual Object Detection(Riccardo De Monte, Davide Dalle Pezze, Gian Antonio Susto, 2025, ArXiv Preprint)
- MS-YOLO: Infrared Object Detection for Edge Deployment via MobileNetV4 and SlideLoss(Jiali Zhang, Thomas S. White, Haoliang Zhang, Wenqing Hu, Donald C. Wunsch, Jian Liu, 2025, ArXiv Preprint)
- A Benchmark Review of YOLO Algorithm Developments for Object Detection(Zhengmao Hua, K. Aranganadin, Cheng-Cheng Yeh, Xinhe Hai, Chen-Yun Huang, T. Leung, Hua-Yi Hsu, Yung-Chiang Lan, Ming-Chieh Lin, 2025, IEEE Access)
- Model compression for real-time object detection using rigorous gradation pruning(Defu Yang, M. I. Solihin, Yawen Zhao, Bingyu Cai, Chaoran Chen, Andika Aji Wijaya, C. Ang, Wei Hong Lim, 2024, iScience)
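As a concrete instance of the compression techniques in this group, here is a minimal magnitude-pruning sketch using PyTorch's built-in pruning utilities; the toy head and the 30% sparsity level are illustrative assumptions, not any cited paper's recipe:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy detector head; any Conv2d/Linear layers can be pruned the same way.
head = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(64, 85, 1))

# Zero out the 30% of weights with the smallest L1 magnitude in each conv.
for module in head.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask into the tensor

zeros = sum((m.weight == 0).sum().item()
            for m in head.modules() if isinstance(m, nn.Conv2d))
total = sum(m.weight.numel()
            for m in head.modules() if isinstance(m, nn.Conv2d))
print(f"sparsity: {zeros / total:.2%}")
```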
Few-Shot, Domain Generalization, and Incremental/Continual Learning Adaptation (few-shot / domain gap / continual)
Jointly targets few-shot, domain-generalization, and continual-learning settings: efficient fine-tuning (e.g., LoRA), domain-gap handling, prototype/3D-reasoning generalization strategies, and self-distillation/replay against catastrophic forgetting improve the reliability of recognition and localization under data scarcity or distribution shift. A minimal LoRA sketch follows the list below.
- Efficient Fine-Tuning of DINOv3 Pretrained on Natural Images for Atypical Mitotic Figure Classification (MIDOG 2025 Task 2 Winner)(Guillaume Balezo, Hana Feki, Raphaël Bourgade, Lily Monnier, Matthieu Blons, Alice Blondel, Etienne Decencière, Albert Pla Planas, Thomas Walter, 2025, ArXiv Preprint)
- From Dataset to Real-world: General 3D Object Detection via Generalized Cross-domain Few-shot Learning(Shuangzhi Li, Junlong Shen, Lei Ma, Xingyu Li, 2025, ArXiv Preprint)
- DAM-Faster RCNN: few-shot defect detection method for wood based on dual attention mechanism(Xingyu Tong, Zhihong Liang, Mingming Qin, Fangrong Liu, Jiayu Yang, Hengjiang Xiao, Wei Dai, 2025, Scientific Reports)
- Teach YOLO to Remember: A Self-Distillation Approach for Continual Object Detection(Riccardo De Monte, Davide Dalle Pezze, Gian Antonio Susto, 2025, ArXiv Preprint)
- Object Detection and Localization in Real-Time Using Image Processing and Deep Learning(Gaurav Bhakuni, Srikanth Srinivas, Sathish Rao, Ganesh Kumar Ayyalusamy, Saideep Nakka, Sandeep Kumar, 2025, 2025 International Conference on Engineering, Technology & Management (ICETM))
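To ground the efficient-fine-tuning theme (the LoRA-style adaptation mentioned above), here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer; rank, scaling, and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank            # standard LoRA scaling

    def forward(self, x):
        # y = x W^T + scale * x A^T B^T; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # ~ rank * (in + out) parameters instead of 768 * 768
```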
Data Synthesis and Prompt-Driven Localization/Segmentation (Synthetic Annotations and Automatic Prompts)
Improves localization and category recognition through data synthesis and prompt-driven pipelines: object-centric compositing with geometric/camera configurations generates scalable annotations for training, and automatically generated, more precise prompts with injected knowledge improve nuclear instance segmentation and localization within the SAM framework. A minimal compositing sketch follows the list below.
- Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding(Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Winson Han, Ranjay Krishna, 2025, ArXiv Preprint)
- APSeg: Auto-Prompt Model with Acquired and Injected Knowledge for Nuclear Instance Segmentation and Classification(Liying Xu, Hongliang He, Wei Han, Hanbin Huang, Siwei Feng, Guohong Fu, 2025, ArXiv Preprint)
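A minimal sketch of the object-centric compositing idea: pasting a masked object crop onto a background yields a training image whose box annotation is known by construction. Purely illustrative; the cited pipelines add geometry- and camera-aware placement on top of this:

```python
import numpy as np

def paste_object(background, crop, mask, top, left):
    """Paste a masked object crop onto a background and return the image
    plus its (x1, y1, x2, y2) box annotation -- the label comes for free."""
    img = background.copy()
    h, w = crop.shape[:2]
    region = img[top:top + h, left:left + w]
    # Alpha-blend using the binary object mask.
    img[top:top + h, left:left + w] = np.where(mask[..., None] > 0, crop, region)
    ys, xs = np.nonzero(mask)
    box = (left + xs.min(), top + ys.min(), left + xs.max(), top + ys.max())
    return img, box

bg = np.zeros((256, 256, 3), dtype=np.uint8)
obj = np.full((40, 60, 3), 255, dtype=np.uint8)
m = np.ones((40, 60), dtype=np.uint8)
image, box = paste_object(bg, obj, m, top=100, left=80)
print(box)  # (80, 100, 139, 139)
```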
Spatio-Temporal and Cross-Domain Representation Enhancement: Multi-Scale, Attention, Global Context, and Feature Fusion
Focuses on gains from representation-level enhancement: multi-scale feature aggregation, global context modeling, attention, and cross-view association recast the key difficulties of localization and classification (small, dense, low-visibility, and cross-domain targets) as problems of representation and fusion capacity.
- YOLOv8 with Post-Processing for Small Object Detection Enhancement(Jin-Kyu Ryu, Dong-Sik Kwak, Seungmin Choi, 2025, Applied Sciences)
- Progressive class-aware instance enhancement for aircraft detection in remote sensing imagery(Tianjun Shi, J. Gong, Jianming Hu, Yu Sun, Guangzheng Bao, Pengfei Zhang, Junjie Wang, Xiyang Zhi, Wei Zhang, 2025, Pattern Recognition)
- Lightweight oriented object detection with Dynamic Smooth Feature Fusion Network(Iftikhar Ahmad, Wei Lu, Sibao Chen, Jin Tang, Bin Luo, 2025, Neurocomputing)
- Global Recurrent Mask R-CNN: Marine ship instance segmentation(Ming Yuan, Hao Meng, Junbao Wu, Shouwen Cai, 2025, Computers & Graphics)
- REXO: Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion(Ryoma Yataka, Pu Perry Wang, Petros Boufounos, Ryuhei Takahashi, 2025, ArXiv Preprint)
Generative Detection: A Conditional-Generation Paradigm for Bounding-Box Prediction
Reframes object detection as a generative task: bounding-box regression becomes conditional generation that directly produces outputs with categories and boxes, marking a paradigm shift from discriminative detectors toward controllable generative detection.
Domain/Application-Customized Detection and Localization: Medical, Security Screening, Interaction, and Industrial Quality Inspection
Task customization for specific scenarios such as medical imaging, security screening, interaction, and industrial quality inspection: detector adaptations to domain data and domain shift, backbone/attention/Transformer or Mamba-style architectural substitutions, and alignment with concrete application goals (early diagnosis, illicit-object detection, gesture/interaction, rock-core/defect recognition, etc.).
- Efficient diagnostic model for iron deficiency anaemia detection: a comparison of CNN and object detection algorithms in peripheral blood smear images(N. K. T., Seemitr Verma, Keerthana Prasad, Brij Mohan Kumar Singh, 2024, Automatika)
- Deep learning object detection-based early detection of lung cancer(Kuo-Yang Huang, Che-Liang Chung, Jia-Lang Xu, 2025, Frontiers in Medicine)
- Gesture Object Detection and Recognition Based on YOLOv11(Jian Xu, Heyao Chen, Xingpeng Xiao, Mengyuan Zhao, Bo Liu, 2025, Applied and Computational Engineering)
- Att-YOLO: A Real-Time Rock Core Classification and Localization Deep Learning Model(Sihao Yu, Louis Ngai Yuen Wong, 2025, Rock Mechanics and Rock Engineering)
- Object Detection Based on Improved YOLOv10 for Electrical Equipment Image Classification(Xiang Gao, Jiaxuan Du, Xinghua Liu, Duowei Jia, Jinhong Wang, 2025, Processes)
- Mamba YOLO: A Simple Baseline for Object Detection with State Space Model(Zeyu Wang, Chen Li, Huiying Xu, Xinzhong Zhu, Hongbo Li, 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
- X-ray illicit object detection using hybrid CNN-transformer neural network architectures(Jorgen Cani, Christos Diou, Spyridon Evangelatos, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos, 2025, ArXiv Preprint)
- Image Augmentation Approaches for Building Dimension Estimation in Street View Images Using Object Detection and Instance Segmentation Based on Deep Learning(Dongjin Hwang, Jae-jun Kim, Sungkon Moon, Seunghyeon Wang, 2025, Applied Sciences)
- Application of AI in Date Fruit Detection - Performance Analysis of YOLO and Faster R-CNN Models(S. Lipiński, Szymon Sadkowski, Paweł Chwietczuk, 2025, Computation)
- Intelligent GD&T symbol detection in mechanical drawings: a comparative study of YOLOv11, Faster R-CNN, and RetinaNet for quality assurance(T. N. Reddy, Nitesh Kumar, Nachappa Pemmanda Ponnappa, N. Mohana, Prakash Vinod, M. Herbert, S. S. Rao, 2025, Journal of Intelligent Manufacturing)
- Resistance Spot Welding Defect Detection Based on Visual Inspection: Improved Faster R-CNN Model(Weijie Liu, Jie Hu, Jin Qi, 2024, Machines)
- Detection of Corals, Seagrass, and Seaweeds Using YOLOv9 Instance Segmentation with Image Augmentation(Ken D. Gorro, 2025, Journal of Image and Graphics)
Faster R-CNN and Its Adaptations: Medical/Engineering/Agricultural Recognition and Temporal Enhancement
Centered on Faster R-CNN and its adaptations: practice in medical, agricultural, and engineering recognition, backbone replacement (Swin, etc.), temporal context (Bi-LSTM), and combinations with data-augmentation strategies to improve recognition and localization in cluttered backgrounds and small-object or noisy scenes.
- Faster R-CNN in Healthcare and Disease Detection: A Comprehensive Review(Jiawei Tian, Seungho Lee, Kyungtae Kang, 2025, 2025 International Conference on Electronics, Information, and Communication (ICEIC))
- Automatic Extraction of Discolored Tree Crowns Based on an Improved Faster-RCNN Algorithm(Haoyang Ma, Banghui Yang, Ruirui Wang, Qiang Yu, Yaoyao Yang, Jiahao Wei, 2025, Forests)
- Optimized Faster R-CNN with Swintransformer for Robust Multi-Class Wildfire Detection(Sugi Choi, Sunghwan Kim, Haiyoung Jung, 2025, Fire)
- Ensemble of Fast R-CNN with Bi-LSTM for Object Detection(Sasirekha R, Surya V, N. P, Preethy Jemima P, Bhanushree T, Hanitha G, 2025, 2025 6th International Conference on Mobile Computing and Sustainable Informatics (ICMCSI))
- Hybrid CNN Architecture for Hot Spot Detection in Photovoltaic Panels Using Fast R-CNN and GoogleNet(Carlos Quiterio Gómez Muñoz, F. Márquez, J. Sanjuán, 2025, Computer Modeling in Engineering & Sciences)
- Intelligent GD&T symbol detection in mechanical drawings: a comparative study of YOLOv11, Faster R-CNN, and RetinaNet for quality assurance(T. N. Reddy, Nitesh Kumar, Nachappa Pemmanda Ponnappa, N. Mohana, Prakash Vinod, M. Herbert, S. S. Rao, 2025, Journal of Intelligent Manufacturing)
- Image Augmentation Approaches for Building Dimension Estimation in Street View Images Using Object Detection and Instance Segmentation Based on Deep Learning(Dongjin Hwang, Jae-jun Kim, Sungkon Moon, Seunghyeon Wang, 2025, Applied Sciences)
- Application of AI in Date Fruit Detection - Performance Analysis of YOLO and Faster R-CNN Models(S. Lipiński, Szymon Sadkowski, Paweł Chwietczuk, 2025, Computation)
- Drone Image Localization by Faster R-CNN Algorithm and Detection Accuracy(Maysoon . Khazaal . Maaroof, M. Bouhlel, 2025, Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications)
- DAM-Faster RCNN: few-shot defect detection method for wood based on dual attention mechanism(Xingyu Tong, Zhihong Liang, Mingming Qin, Fangrong Liu, Jiayu Yang, Hengjiang Xiao, Wei Dai, 2025, Scientific Reports)
- Object Detection and Localization in Real-Time Using Image Processing and Deep Learning(Gaurav Bhakuni, Srikanth Srinivas, Sathish Rao, Ganesh Kumar Ayyalusamy, Saideep Nakka, Sandeep Kumar, 2025, 2025 International Conference on Engineering, Technology & Management (ICETM))
Multimodal/Language/VLM and Input Enhancement: Super-Resolution, Diffusion Synthesis, and Cross-Modal Fusion
Centered on multimodal and language/synthetic input enhancement: super-resolution and controllable diffusion synthesis augment low-quality and remote-sensing scenes; voting or CNN-based fusion exploits complementary multi-source information; and VLMs/LVLMs enable language-conditioned reasoning and cross-modal object understanding, improving the generalization of category recognition and localization.
- Adaptive Object Detection with ESRGAN-Enhanced Resolution & Faster R-CNN(Divya Swetha K, Ziaul Haque Choudhury, Hemanta Kumar Bhuyan, Biswajit Brahma, Nilayam Kumar Kamila, 2025, ArXiv Preprint)
- AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation(Datao Tang, Xiangyong Cao, Xuan Wu, Jialin Li, Jing Yao, Xueru Bai, Deyu Meng, 2024, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Detection of Corals, Seagrass, and Seaweeds Using YOLOv9 Instance Segmentation with Image Augmentation(Ken D. Gorro, 2025, Journal of Image and Graphics)
- Multimodal fusion via voting network for 3D object detection in indoors(Jianxin Li, Guannan Si, Xinyu Liang, Zhaoliang An, Pengxin Tian, Fengyu Zhou, Xiaoliang Wang, 2025, Pattern Recognition)
- MDF: Multi-Modal Data Fusion with CNN-Based Object Detection for Enhanced Indoor Localization Using LiDAR-SLAM(Saqi Hussain Kalan, Boon Giin Lee, Wan-Young Chung, 2025, ArXiv Preprint)
- Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation(Yongchao Feng, Yajie Liu, Shuai Yang, Wenrui Cai, Jinqing Zhang, Qiqi Zhan, Ziyue Huang, Hongxi Yan, Qiao Wan, Chenguang Liu, Junzhe Wang, Jiahui Lv, Ziqi Liu, Tengyuan Shi, Qingjie Liu, Yunhong Wang, 2025, ArXiv Preprint)
- Object Detection with Multimodal Large Vision-Language Models: An In-depth Review(Ranjan Sapkota, Manoj Karkee, 2025, ArXiv Preprint)
Instance-Level Tasks: Joint Detection-Segmentation and Polygon/Unified Instance Representations
Focuses on instance-level representation: extending detection to instance segmentation (unified architectures, mask heads, pyramid features) and exploring polygon vertices or unified instance representations as lighter alternatives to pixel-wise masks, while covering cooperative localization between segmentation and detection.
- Towards Instance Segmentation with Polygon Detection Transformers(Jiacheng Sun, Jiaqi Lin, Wenlong Hu, Haoyang Li, Xinghong Zhou, Chenghai Mao, Yan Peng, Xiaomao Li, 2026, ArXiv Preprint)
- D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment(Argo Saakyan, Dmitry Solntsev, 2026, ArXiv Preprint)
- A Unified CNN-Based Instance Segmentation Architecture for Blood Cell Classification and Early Cancer Abnormality Recognition(Nathaniel H. Dumayas, Rhomwell Ace C. Merced, Kenniniah A. Rit, Gabriel Marc B. Verzosa, Lysa V. Comia, 2026, 2026 7th International Conference on Mobile Computing and Sustainable Informatics (ICMCSI))
- An Embedded Feature Pyramid Network Enables Bidirectional Information Flow for Object Detection and Instance Segmentation(Chunning Meng, Zequn Sun, T. Li, Lianzhi Huo, Shengjiang Chang, Zhiqing Zhang, 2024, Neurocomputing)
- Detection of Corals, Seagrass, and Seaweeds Using YOLOv9 Instance Segmentation with Image Augmentation(Ken D. Gorro, 2025, Journal of Image and Graphics)
Extending Localization Output Forms: Classification-Regression, Monocular 3D Detection, and Keypoint Localization
Extends localization outputs beyond "category + box" to broader regression/3D/keypoint representations: classification-regression prediction of continuous attributes, monocular 3D object detection, and 3D keypoint (landmark) localization, reflecting extensions at the level of task modeling.
- FastCAR: Fast Classification And Regression for Task Consolidation in Multi-Task Learning to Model a Continuous Property Variable of Detected Object Class(Anoop Kini, Andreas Jansche, Timo Bernthaler, Gerhard Schneider, 2025, ArXiv Preprint)
- Dp-M3D: Monocular 3D object detection algorithm with depth perception capability(Peicheng Shi, Xinlong Dong, Runshuai Ge, Zhiqiang Liu, Aixi Yang, 2025, Knowledge-Based Systems)
- PAL-Net: A Point-Wise CNN with Patch-Attention for 3D Facial Landmark Localization(Ali Shadman Yazdi, Annalisa Cappella, Benedetta Baldini, Riccardo Solazzo, Gianluca Tartaglia, Chiarella Sforza, Giuseppe Baselli, 2025, ArXiv Preprint)
(Single-Paper Supplement) A Lightweight Real-Time Detection Representative (GMS-YOLO)
In the initial grouping this paper belonged to the lightweight real-time deployment direction, but after merging it also spans the real-time/end-to-end and lightweight deployment main lines. To avoid cross-group overlap, it is retained here as the lightweight real-time representative, supporting that direction's evidence chain.
- GMS-YOLO: A Lightweight Real-Time Object Detection Algorithm for Pedestrians and Vehicles Under Foggy Conditions(Yafei Chen, Yong Wang, Zhengming Zou, Wenxiu Dan, 2025, IEEE Internet of Things Journal)
The merged, unified grouping splits "classification and recognition/localization methods" into parallel research main lines: method taxonomies, surveys, and benchmarks; open-vocabulary video instance segmentation; 3D geometry and encoding constraints; rotated-object localization representations; small objects and scale adaptation; geometric modeling and boundary/box regression enhancement; real-time, deployable end-to-end multi-task/instance segmentation with streaming latency; energy efficiency, compression, and federated learning under resource constraints; few-shot, domain-generalization, and continual-learning adaptation; data synthesis and prompt-driven pipelines; representation-level spatio-temporal and cross-domain enhancement; the generative detection paradigm; domain/application customization; Faster R-CNN adaptations; multimodal/language input enhancement; instance-level representations (joint detection-segmentation and polygons); and localization outputs extended to classification-regression, monocular 3D, and keypoints. Overall, the field is moving from iterating on conventional detectors toward stronger geometric modeling, scale/context enhancement, real-time edge deployment, and the fusion of open-vocabulary, multimodal, and generative paradigms.
A total of 146 related references.
Because fruit and vegetable harvesting must be completed in a short window and is labor-intensive with a heavy workload, robotic harvesting in place of manual operations is the future. The accuracy of object detection and localization directly determines the picking efficiency, quality, and speed of fruit-harvesting robots. With its low recognition accuracy, slow recognition speed, and poor localization accuracy, the traditional algorithmic approach cannot meet the requirements of automatic harvesting robots. Rapidly evolving and increasingly powerful deep learning technology can effectively solve these problems and has been widely applied in recent years. This work systematically summarizes and analyzes roughly 120 publications from the last 10 years on object detection and three-dimensional positioning algorithms for harvesting robots, and reviews several significant methods. The difficulties and challenges faced by current fruit detection and localization algorithms are analyzed, including the lack of large-scale high-quality datasets and the high complexity of the agricultural environment. In response to these challenges, corresponding solutions and future development trends are constructively proposed. Future research and technological development should first address these challenges using weakly supervised learning, efficient and lightweight model construction, multisensor fusion, and so on.
Common underwater target recognition systems suffer from low accuracy, high energy consumption, and low levels of automation. This paper introduces an underwater target recognition system based on the Jetson Xavier NX platform, which deploys an improved YOLOv11 recognition algorithm. During operation, the Jetson Xavier NX invokes an industrial camera to capture underwater target images, which are then processed by the improved YOLOv11 network for inference. The recognized information is transmitted via a serial port to an STM32 control board, which adaptively adjusts the lighting system to enhance image clarity based on the target information. Finally, the system controls an actuator to release a buoyant ball with positioning capabilities and communicates with the shore. On the ROUD dataset, the improved YOLOv11 algorithm achieves an accuracy of 87.5%, with a parameter size of 2.58M and a floating-point operation count of 6.3G, outperforming all current models. Compared to the original YOLOv11, the parameter size is reduced by 5% and the floating-point operation count by 0.3G. The improved DD-YOLOv11 also shows good performance on the URPC2020 dataset. After on-site experiments and hardware–software integration tests, all functions operate normally. The system is capable of identifying a specific underwater target with an accuracy rate of over 85%, simultaneously releasing communication buoys and successfully establishing communication with the shore base. This indicates that the underwater target recognition system meets the requirements of being lightweight, high-precision, and highly automated.
Object detection and localization are critical in computer vision, with applications ranging from security surveillance to self-driving automobiles. Traditional techniques involve complex and time-consuming steps such as manual feature engineering. However, with the emergence of deep learning, a branch of artificial intelligence, the performance of these tasks has improved substantially. We propose a real-time technique for object detection and localization using image processing and deep learning. The system uses convolutional neural networks (CNNs) to extract features and identify objects in the scene. The CNN model is trained on a large dataset of labeled images to learn features and their associated spatial locations, making it capable of detecting and localizing objects accurately. The proposed approach is designed to run in real time, making it suitable for applications that require fast and accurate object detection and localization, such as autonomous vehicles and surveillance systems. The experimental results show that our approach achieves high accuracy and robustness in real-world scenarios. Our method provides a reliable and efficient solution to the challenging task of object detection and localization, with potential for diverse practical applications.
This study applies the YOLOv11 model to train and detect ground object targets in high-resolution remote sensing images, aiming to evaluate its potential in enhancing detection accuracy and efficiency. The model was trained on 70,389 samples across 20 target categories. After 496 training epochs, the loss functions (Box_Loss, Cls_Loss, and DFL_Loss) demonstrated rapid convergence, indicating effective optimization in target localization, classification, and detail refinement. The evaluation metrics yielded a precision of 0.8861, a recall of 0.8563, a map50 of 0.8920, a map50–95 of 0.8646, and an F1 score of 0.8709, highlighting the model’s high accuracy and robustness in addressing complex detection tasks. Furthermore, 80% of the test samples achieved confidence scores exceeding 85%, confirming the reliability of YOLOv11 in multiclass and multiobject detection scenarios. These findings suggest that YOLOv11 holds significant promise for remote sensing image target detection, demonstrating exceptional detection performance while offering robust technical support for intelligent remote sensing image analysis. Future studies will focus on expanding the dataset, refining the model architecture, and improving its performance in detecting small targets and processing complex scenes, paving the way for its broader applications in environmental protection, urban planning, and multiobject detection.
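The reported F1 score is consistent with the stated precision and recall, as a quick check shows:

```python
p, r = 0.8861, 0.8563
f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
print(round(f1, 4))       # 0.8709, matching the reported F1 score
```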
Context is an important factor in computer vision as it offers valuable information to clarify and analyze visual data. Utilizing the contextual information inherent in an image or a video can improve the precision and effectiveness of object detectors. For example, where recognizing an isolated object might be challenging, context information can improve comprehension of the scene. This study explores the impact of various context-based approaches to object detection. Initially, we investigate the role of context in object detection and survey it from several perspectives. We then review and discuss the most recent context-based object detection approaches and compare them. Finally, we conclude by addressing research questions and identifying gaps for further studies. More than 265 publications are included in this survey, covering different aspects of context in different categories of object detection, including general object detection, video object detection, small object detection, camouflaged object detection, zero-shot, one-shot, and few-shot object detection. This literature review presents a comprehensive overview of the latest advancements in context-based object detection, providing valuable contributions such as a thorough understanding of contextual information and effective methods for integrating various context types into object detection, thus benefiting researchers.
In this paper, we show that current approaches using large square kernels or transformer-based global modeling aggregate contextual information uniformly across spatial dimensions, leading to feature dilution and localization errors for elongated targets. To mitigate this issue, we propose Strip R-CNN, the first work to systematically explore large strip convolutions for remote sensing object detection. Our key insight is that strip convolutions enable directional feature aggregation along the dominant spatial dimension of slender objects, reducing background interference while preserving essential geometric information. We design two core components: (i) StripNet, a backbone network employing sequential orthogonal large strip convolutions to capture anisotropic spatial patterns, and (ii) Strip Head, which enhances localization precision by incorporating strip convolutions into the detection head. Unlike previous large-kernel approaches that suffer from computational redundancy and isotropic limitations, our method achieves superior performance with remarkable efficiency. Extensive experiments on multiple benchmarks (DOTA, FAIR1M, HRSC2016, and DIOR) demonstrate significant improvements, with our 30M parameter model achieving 82.75% mAP on DOTA-v1.0, establishing a new state-of-the-art record while providing new insights into anisotropic feature learning for remote sensing applications.
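A schematic of the sequential orthogonal strip convolutions described in this abstract; this is our own minimal PyTorch sketch, not the authors' implementation, and the kernel length and residual wiring are assumptions:

```python
import torch
import torch.nn as nn

class StripBlock(nn.Module):
    """Sequential orthogonal strip convolutions: a 1 x k convolution
    aggregates context along one axis and a k x 1 convolution along the
    other, so aggregation follows the dominant direction of elongated
    objects instead of mixing context isotropically."""
    def __init__(self, channels, k=11):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2), groups=channels)
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0), groups=channels)

    def forward(self, x):
        return x + self.vertical(self.horizontal(x))  # residual refinement

x = torch.randn(1, 64, 128, 128)
print(StripBlock(64)(x).shape)  # torch.Size([1, 64, 128, 128])
```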
Fine-grained object detection (FGOD) in remote sensing images is an emerging and challenging task in the field of image intelligent interpretation. It aims to localize objects while classifying them into different fine-grained categories. Modern FGOD methods are mainly derived from well-developed detectors and have made compelling progress. Despite this, these methods struggle to perform well in classifying objects at the subordinate level due to the limitations of their representation manners. In this paper, we propose a network capable of learning discriminative representation (DR) for fine-grained object detection in remote sensing images, named DRNet. First, a fine-grained branch that works in parallel with other task branches is introduced, where objects’ features are re-encoded with dual refinement to generate discriminative representation, enabling accurate fine-grained classification. Second, we design a confusion-minimized loss that automatically scales loss contributions according to the separability of samples to train the fine-grained branch, further boosting discriminative ability of the representation and better addressing hard-to-distinguish objects. Moreover, we devise an interaction verification strategy that empowers the network to fully utilize the results of fine-grained classification and coarse classification for achieving robust inference. On large-scale FAIR1M-1.0 and FAIR1M-2.0 datasets, our DRNet with ResNet50 and $1\times $ training schedule obtains 40.87% mAP and 47.04% mAP, respectively, establishing new state-of-the-arts for fine-grained object detection in remote sensing images. The source code is available at https://github.com//54wb//DRNet.
The early diagnosis and accurate classification of lung cancer have a critical impact on clinical treatment and patient survival. The rise of artificial intelligence technology has led to breakthroughs in medical image analysis. The Lung-PET-CT-Dx public dataset was used for the model training and evaluation. The performance of the You Only Look Once (YOLO) series of models in the lung CT image object detection task is compared in terms of algorithms, and different versions of YOLOv5, YOLOv8, YOLOv9, YOLOv10, and YOLOv11 are examined for lung cancer detection and classification. The experimental results indicate that the prediction results of YOLOv8 are better than those of the other YOLO versions, with a precision rate of 90.32% and a recall rate of 84.91%, which proves that the model can effectively assist physicians in lung cancer diagnosis and improve the accuracy of disease localization and identification.
Small-object detection in images, a core task in unstructured big-data analysis, remains challenging due to low resolution, background noise, and occlusion. Despite advancements in object detection models like You Only Look Once (YOLO) v8 and EfficientDet, small object detection still faces limitations. This study proposes an enhanced approach combining the content-aware reassembly of features (CARAFE) upsampling module and a confidence-based re-detection (CR) technique integrated with the YOLOv8n model to address these challenges. The CARAFE module is applied to the neck architecture of YOLOv8n to minimize information loss and enhance feature restoration by adaptively generating upsampling kernels based on the input feature map. Furthermore, the CR process involves cropping bounding boxes of small objects with low confidence scores from the original image and re-detecting them using the YOLOv8n-CARAFE model to improve detection performance. Experimental results demonstrate that the proposed approach significantly outperforms the baseline YOLOv8n model in detecting small objects. These findings highlight the effectiveness of combining advanced upsampling and post-processing techniques for improved small object detection. The proposed method holds promise for practical applications, including surveillance systems, autonomous driving, and medical image analysis.
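The confidence-based re-detection step described above can be sketched as follows; `detect` is a stand-in for any detector callable, and all names and thresholds are illustrative assumptions rather than the paper's exact procedure:

```python
def redetect_low_confidence(image, detections, detect, conf_thr=0.4, pad=16):
    """Crop low-confidence boxes, re-run the detector on each enlarged
    crop, and map the results back to full-image coordinates.

    `detect(crop) -> list of (x1, y1, x2, y2, score, cls)` is a stand-in
    for any detector; `image` is an H x W x 3 array.
    """
    refined = []
    h, w = image.shape[:2]
    for (x1, y1, x2, y2, score, cls) in detections:
        if score >= conf_thr:
            refined.append((x1, y1, x2, y2, score, cls))
            continue
        # Enlarge the suspect region so the small object fills more pixels.
        cx1, cy1 = max(0, int(x1) - pad), max(0, int(y1) - pad)
        cx2, cy2 = min(w, int(x2) + pad), min(h, int(y2) + pad)
        for (rx1, ry1, rx2, ry2, rs, rc) in detect(image[cy1:cy2, cx1:cx2]):
            # Shift crop-local coordinates back into the full image.
            refined.append((rx1 + cx1, ry1 + cy1, rx2 + cx1, ry2 + cy1, rs, rc))
    return refined
```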
This article explores the application of the YOLOv11 algorithm in the field of gesture recognition to evaluate its performance in human-computer interaction (HCI). By introducing the YOLOv8 model and a corresponding dataset for training, we obtained the confusion matrix predicted by the model, which shows that the model can accurately recognize most gestures, although there are a few cases of misidentification. When the IoU threshold is 0.5, the mean average precision (mAP) of the model steadily improves over the course of training, indicating that the overall performance of the model in gesture detection tasks is enhanced. In addition, even under the stricter evaluation condition where the IoU threshold is raised from 0.5 to 0.95, mAP still shows an upward trend, although the growth is less pronounced than for mAP50, again demonstrating improved model performance. Through detection on test-set images, we found that the YOLOv11 model can effectively recognize gestures and accurately interpret their meanings, demonstrating high accuracy. This study not only demonstrates the potential of YOLOv11 in gesture recognition tasks but also provides a new technological path for the future development of the HCI field. Overall, the YOLOv11 algorithm has demonstrated strong performance and accuracy in gesture recognition, enabling more natural and intuitive interaction between smart devices and humans.
In this paper, the Efficient Channel Attention (ECA) mechanism is incorporated at the terminal layer of the YOLOv10 backbone network to enhance the feature expression capability. In addition, Transformer is introduced into the C3 module in the feature extraction process to construct the C3TR module to replace the original C2F module as the deepening network extraction module. In this study, both the ECA mechanism and the self-attention mechanism of Transformer are thoroughly analyzed and integrated into YOLOv10. The C3TR module is used as an important part to deepen the effect of network extraction in backbone network feature extraction. The self-attention mechanism is used to model the long-distance dependency relationship, capture the global contextual information, make up for the limitation of the local sensory field, and enhance the feature expression capability. The ECA module is added to the end of the backbone to globally model the channels of the feature map, distribute channel weights more equitably, and enhance feature expression capability. Extensive experiments on the electrical equipment dataset have demonstrated the high accuracy of the method, with a mAP of 89.4% compared to the original model, representing an improvement of 3.2%. Additionally, the mAP@[0.5, 0.95] reaches 61.8%, which is 5.2% higher than that of the original model.
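For reference, here is a minimal sketch of the ECA mechanism described in this abstract (our own illustration; the kernel size is an assumption):

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: a global channel descriptor is
    refined by a 1D convolution over neighboring channels (no
    dimensionality reduction), then used to rescale the feature map."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                     # global average pool -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # local cross-channel mixing
        return x * torch.sigmoid(y)[:, :, None, None]

x = torch.randn(2, 64, 32, 32)
print(ECA()(x).shape)  # torch.Size([2, 64, 32, 32])
```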
Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse in rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e., AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on the DIOR, DIOR-R, and HRSC datasets are improved by 3.7%, 4.3%, and 2.43%, respectively. The code is available here.
Bounding Box Regression (BBR) plays a critical role in object detection by refining the predicted location and size of objects to enhance model accuracy. This process involves adjusting the coordinates of the proposed bounding boxes to improve their precision. The Intersection over Union (IoU) loss was introduced to bring the IoU metric into the model training process, measuring discrepancies between the model's predictions and the ground truth while ensuring meaningful gradient updates during training. In practice, IoU loss has improved object detection performance by enhancing the localization accuracy of bounding boxes. Despite significant technological advancements and the various advantages and disadvantages of IoU loss, improving the accuracy and efficiency of BBR remains an active research area in computer vision. Numerous IoU loss variants have evolved, with new formulations and methods to improve accuracy and convergence speed. A new loss function, Dimensional Angle Precision IoU (DAPIoU) loss, is introduced in this research to enhance BBR and serve as a new object detection loss function that addresses the limitations identified in previous loss-function research. This study conducts three types of experiments: single-group BBR simulation experiments on synthetic data, simulation experiments on synthetic data, and experiments on real-world datasets. The datasets used are MS-COCO and PASCAL VOC, and the object detection models are YOLOv7, YOLOv9, and Faster R-CNN. Results on the real-world datasets are evaluated using mean Average Precision (mAP), including object-size metrics, and compared against several previous IoU-based loss functions.
Oriented object detection is a fundamental yet challenging task in remote sensing (RS), aiming to locate and classify objects with arbitrary orientations. Recent advancements in deep learning have significantly enhanced the capabilities of oriented object detection methods. Given the rapid development of this field, a comprehensive survey of the recent advances in oriented object detection is presented in this paper. Specifically, we begin by tracing the technical evolution from horizontal object detection to oriented object detection and highlighting the specific related challenges, including feature misalignment, spatial misalignment, oriented bounding box (OBB) regression problems, and common issues encountered in RS. Subsequently, we further categorize the existing methods into detection frameworks, OBB regression techniques, feature representation approaches, and solutions to common issues and provide an in-depth discussion of how these methods address the above challenges. In addition, we cover several publicly available datasets and evaluation protocols. Furthermore, we provide a comprehensive comparison and analysis involving the state-of-the-art methods. Toward the end of this paper, we identify several future directions for oriented object detection research.
Railway object intrusion poses a significant threat to railway safety, so it is vital to monitor the obstacles within the track area in real-time to prevent accidents, which can be achieved by vision-based technologies. However, most existing vision-based railway obstacle detection algorithms are limited to stationary cameras, which restricts the monitoring range. In contrast, onboard approaches enable full-line monitoring but face more challenges in achieving high accuracy and real-time performance. To address these challenges, we developed an onboard end-to-end multitask perception model (MTP-Rail), which includes variants of different scales (S, M, L) to realize railway obstacle detection in real-time with high accuracy. Our model consists of two decoders, which can simultaneously implement the tasks of object of interest detection and track segmentation. In addition, we designed a post-processing scheme to analyze the relationship between the detected object and the track and assess the obstacle risk level. Experiments on our dataset show that one of the proposed model MTP-rail-M achieved an 89.6% classification accuracy of obstacle risk level and 54.3% on mAP, 78.6% on mPA and 64.9% on mIoU with an inference speed of 181 FPS at an input size of $640\times 640$ on RTX 3060Ti. Our model is easy to install and deploy, highlighting its potential in engineering applications. The code is available on https://github.com/ccl-1/obstacle_detection.
While performing a colonoscopy, there are many tasks to be done: finding polyps, classifying them, and deciding the next procedure for the polyps, whether to excise them or not. Such tasks are challenging for fellow doctors. All three tasks are subject to intrapersonal error, which varies among endoscopists. A proven method for enhancing performance is a computer-aided detection and diagnosis system for endoscopists, which needs to run in real time. In this work, we present a modified convolutional neural network (CNN) based deep learning (DL) model to perform these tasks in real time, building on existing object detection models: YOLOv5 and YOLOv8. For the various tasks, the models are trained using datasets with incomplete labels, leading to a comparison of different training strategies. Our YOLOv8-based model achieved an F1-score of 95.96% for the polyp detection task, an 85.24% F1-score for the polyp classification task, and a 78.41% macro F1-score for the polyp size estimation task. These results, when compared with fellow doctors' findings, proved superior in both accuracy and macro F1-score, while maintaining real-time inference speed.
Multi-task perception technology for autonomous driving significantly improves the ability of autonomous vehicles to understand complex traffic environments by integrating multiple perception tasks, such as traffic object detection, drivable area segmentation, and lane detection. The collaborative processing of these tasks not only improves the overall performance of the perception system but also enhances the robustness and real-time performance of the system. In this paper, we review the research progress in the field of vision-based multi-task perception for autonomous driving and introduce the methods of traffic object detection, drivable area segmentation, and lane detection in detail. Moreover, we discuss the definition, role, and classification of multi-task learning. In addition, we analyze the design of classical network architectures and loss functions for multi-task perception, introduce commonly used datasets and evaluation metrics, and discuss the current challenges and development prospects of multi-task perception. By analyzing these contents, this paper aims to provide a comprehensive reference framework for researchers in the field of autonomous driving and encourage more research work on multi-task perception for autonomous driving.
… novel multi-task YOLO algorithm is enhanced by c2f and anchor-free modules. The multi-task YOLO satisfies multi-task … of single-task. Experimental results demonstrate the enhanced …
… detection and type identification in optical remote sensing … In response, this paper proposes a novel anchor-free detection … Through multi-task learning, the auxiliary branches implicitly …
Pedestrian detection plays a crucial role in automotive driving systems, directly affecting driving safety and comfort. However, in complex driving environments, pedestrian detection faces many challenges, such as scale changes caused by distance differences between pedestrians and cameras, lighting variations, and background interference. In crowded scenes, pedestrians are often occluded by other objects, leading to false alarms and missed detections. To address these challenges, this study improves pedestrian detection technology based on deep learning. The main contributions include building a pedestrian dataset suitable for various driving scenarios and expanding the nighttime image data to enhance diversity. A multi-scale feature fusion method is proposed to address missed detections of small-scale pedestrians in the SSD algorithm: the input method is optimized and the prior box sizes are adjusted to improve detection accuracy for small-scale pedestrians, while redundant deep features are removed to improve detection efficiency. A YOLOX-Tiny-based algorithm with an integrated attention mechanism is also proposed: a channel attention module increases attention to occluded areas, and an IoU-based bounding box regression loss function improves pedestrian localization accuracy while reducing computational burden. The experimental results show that the algorithm effectively reduces missed detections and false positives while improving detection accuracy, achieving good performance in complex environments.
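The abstract names an IoU-based bounding-box regression loss without specifying the exact variant; as a concrete reference point, here is a minimal GIoU-style loss in PyTorch (boxes given as (x1, y1, x2, y2); the paper's actual loss may differ):

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """Generalized IoU loss for axis-aligned boxes of shape (N, 4)."""
    # Intersection area.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    # Union area.
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter + eps
    iou = inter / union
    # Smallest box enclosing both; penalizes non-overlapping pairs.
    lt_c = torch.min(pred[:, :2], target[:, :2])
    rb_c = torch.max(pred[:, 2:], target[:, 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    area_c = wh_c[:, 0] * wh_c[:, 1] + eps
    giou = iou - (area_c - union) / area_c
    return (1 - giou).mean()
```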
Highlights: the architectures of two-stage and one-stage object detectors are reviewed; the development of two-stage detectors such as Fast R-CNN has improved detection accuracy; the evolution of one-stage detectors like YOLO makes them suitable for real-time work. Object detection has received significant attention as a fundamental and challenging task in computer vision over the past two decades. When tracing the evolution of object detection architectures, clear structural differences can be distinguished between two-stage and one-stage detectors, each significantly shaped by advances in convolutional neural networks (CNNs). Two-stage detectors, including R-CNN and its later developed models, utilize a sequential methodology that initially produces region proposals and subsequently classifies and further refines them. This methodology, as illustrated by models such as Faster R-CNN and Mask R-CNN, incorporates potent feature extraction strategies, such as Feature Pyramid Networks (FPN), thereby improving performance on metrics such as mean average precision (mAP): the R-CNN design reached a high mAP of 53.3% on the PASCAL VOC dataset, over 30% better than older methods. On the flip side, one-stage detectors, represented by the YOLO series, RetinaNet, and SSD, embrace a more integrated architecture that compresses detection into a single stage, attaining notable speed at the cost of some localization accuracy. Both paradigms are fundamentally rooted in CNN architectures, signifying persistent advancements in harmonizing accuracy, speed, and computational efficiency within contemporary object detection systems. This paper reviews the architecture of prominent two-stage object detectors, starting with R-CNN and its successors, and one-stage detectors including the YOLO family. It aims to provide an understanding of the architecture of two-stage and one-stage object detectors and the architectural evolution that has improved their accuracy and speed.
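For readers who want to probe the two-stage pipeline described above, torchvision ships a reference Faster R-CNN; a minimal inference sketch (the `weights="DEFAULT"` flag assumes torchvision ≥ 0.13):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Reference two-stage detector: RPN proposals followed by RoI heads.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)          # dummy RGB image scaled to [0, 1]
with torch.no_grad():
    (pred,) = model([image])             # one dict per input image

keep = pred["scores"] > 0.5              # drop low-confidence detections
print(pred["boxes"][keep], pred["labels"][keep])
```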
In the field of remote sensing object detection (RSOD), significant challenges remain, including the vast field of view in remote sensing images, the diverse array of target categories, and complex backgrounds. Traditional methods for processing remote sensing images face limitations in this context. While convolutional neural networks (CNNs) can expand the receptive field by utilizing kernels of different sizes, larger kernels increase the number of parameters and introduce noise. Vision Transformers (ViT) achieve global receptive fields through their global attention mechanism, but their quadratic computational complexity struggles with high-resolution images. Recently, Mamba has gained prominence in image processing. Its unique four-directional scanning mechanism allows focusing on regions of interest from multiple angles while maintaining linear model complexity and achieving global receptive fields. In this work, we propose a new CNN-Mamba network (CMNet) that synergistically exploits the advantages of both architectures. Specifically, we employ VMamba (VM) to extract global semantic features from images. Moreover, we design a multi-scale local feature extraction (MLFE) module, which captures local texture information and edge details through the local feature extraction (LFE) module and the global attention module (GAM). The synergy between VMamba and MLFE creates complementary global-local features. To address the representational differences between these two kinds of features, we further design a feature cross-complementary (FCC) module. This module achieves cross-complementarity of features, resolving feature disparity issues. Our CMNet achieves 79.38% mAP50 on the DOTA v1.0 dataset and 90.60% mAP50 on the HRSC dataset, outperforming existing state-of-the-art approaches.
… with explicit boundaries by information filtering, global localization, and progressive … a global localization module (GLM) to determine positions of potential camouflaged objects by …
Iron Deficiency Anaemia (IDA) is the most prevalent form of anaemia, affecting 24.8% of the global population. An examination of the complete blood count (CBC) is performed to determine general health and the presence of illnesses. Accurate and timely diagnosis of IDA is essential for proper treatment, yet traditional methods can be time-consuming and costly. This study uses machine learning and computer vision techniques for the automatic identification of hypochromic microcytes from Peripheral Blood Smear (PBS) images to improve IDA diagnosis. Two approaches were implemented: first, a ResNet50 model was used to classify PBS images as Normal or IDA; second, the YOLOv7 object detection model was employed to localize hypochromic microcytes within the images. The YOLOv7 model was tested on 17 images containing 425 instances of hypochromic microcytes and demonstrated superior performance, achieving a test mean Average Precision (mAP) of 89% with faster inference times than ResNet50. By providing localized detection of hypochromic microcytes, YOLOv7 enhances diagnostic accuracy and speed compared to image-level classification. This study highlights the potential of object detection models for improving automated anaemia diagnosis, with implications for faster and more cost-effective healthcare solutions.
… object detection? Therefore, we propose a novel hybrid metric and classification-localization … We compare ten advanced approaches with our method, including Faster R-CNN w/ADAS-…
… object detection (OD) has become essential in computer vision for identifying and localizing objects … The findings of this work indicate that the HT-YOLOv4- and PSO-based CNN models …
The impressive advancements in semi-supervised learning have driven researchers to explore its potential in object detection tasks within the field of computer vision. Semi-Supervised Object Detection (SSOD) leverages a combination of a small labeled dataset and a larger, unlabeled dataset. This approach effectively reduces the dependence on large labeled datasets, which are often expensive and time-consuming to obtain. Initially, SSOD models encountered challenges in effectively leveraging unlabeled data and managing noise in generated pseudo-labels for unlabeled data. However, numerous recent advancements have addressed these issues, resulting in substantial improvements in SSOD performance. This paper presents a comprehensive review of 28 cutting-edge developments in SSOD methodologies, from Convolutional Neural Networks (CNNs) to Transformers. We delve into the core components of semi-supervised learning and its integration into object detection frameworks, covering data augmentation techniques, pseudo-labeling strategies, consistency regularization, and adversarial training methods. Furthermore, we conduct a comparative analysis of various SSOD models, evaluating their performance and architectural differences. We aim to ignite further research interest in overcoming existing challenges and exploring new directions in semi-supervised learning for object detection.
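At the core of most pseudo-labeling strategies surveyed here is a confidence filter on a frozen teacher's predictions; a minimal sketch (the 0.9 threshold and the torchvision-style output dicts are illustrative assumptions, not a specific surveyed method):

```python
import torch

def make_pseudo_labels(teacher, images, score_thr=0.9):
    """Run a frozen teacher on unlabeled images and keep only
    high-confidence detections as pseudo ground truth for the student."""
    teacher.eval()
    with torch.no_grad():
        preds = teacher(images)          # list of {boxes, labels, scores}
    targets = []
    for p in preds:
        keep = p["scores"] >= score_thr  # noise control via thresholding
        targets.append({"boxes": p["boxes"][keep],
                        "labels": p["labels"][keep]})
    return targets                       # used by the student as annotations
```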
Since unmanned aerial vehicles (UAVs) provide real-time monitoring of vast areas, their rapid development has been crucial to the advancement of surveillance applications. However, in complex environments, present surveillance systems frequently suffer from a lack of efficiency, scalability, and adaptability. To detect and track security threats in real time, this study develops an AI-based aerial surveillance framework that makes use of CNNs and Fast R-CNNs. It trains and validates object detection models on publicly accessible UAV datasets with respect to key parameters such as robustness, processing speed, and accuracy. The proposed augmented-intelligence object detection framework thus applies to contemporary surveillance systems, which are designed to be reliable, resilient, and able to satisfy modern security requirements. The study presents a new, highly effective Faster R-CNN designed specifically to tackle the difficult object localization problem in aerial photos; the algorithm performs very well at pinpointing the precise location of objects of interest. According to the results, the average accuracy increased significantly to above 70%, and the Fast R-CNN model achieved an F1-score of 92.7%, with precision and recall of 93.1% and 92.4%, respectively, while performing within an average of 94.7%.
This review examines the applications, challenges, and prospects of Faster Region-based Convolutional Neural Networks (Faster R-CNN) in healthcare and disease detection. Through a meta-analysis of Web of Science literature from 2017 to 2024, we provide insights into the evolving landscape of Faster R-CNN in medical contexts. The algorithm can be applied to medical image analysis across various modalities, including radiography, computed tomography, magnetic resonance imaging, ultrasound, microscopy, and endoscopy. Its applications extend to optical RGB images for dermatological and surgical uses, as well as broader healthcare areas such as posture detection, medication recognition, and assistive technologies. Despite its success, Faster R-CNN faces challenges in handling subtle abnormalities, addressing class imbalance in medical datasets, ensuring result interpretability, and managing patient privacy. Our analysis reveals dominant application areas, technological advancements, and integration trends with other artificial intelligence technologies. The review highlights Faster R-CNN's significant impact on improving diagnostic accuracy and healthcare delivery while acknowledging the need for continued research to address limitations. Emerging directions include real-time disease detection and advancements in personalized medicine, underscoring Faster R-CNN's potential to further transform healthcare practices and patient outcomes.
In the Internet of Audio Things, communication security of the audio control terminal is vulnerable to copy-move threats, and detecting and locating audio copy-move forgery remains challenging. Forgery detection methods based on deep learning achieve higher detection accuracy but fail to localize forged regions. To address this issue, this article proposes an S-Faster R-CNN model for audio copy-move forgery detection and localization. We integrate a novel similarity computation module (SCM) into the Faster R-CNN framework, forming the S-Faster R-CNN model; the integration of the SCM allows the S-Faster R-CNN to precisely localize forged regions within the spectrogram. Finally, an image coordinate transformation algorithm maps these forged regions to the corresponding locations in the original audio waveform, completing audio copy-move forgery detection and localization. Evaluated on three datasets, our method achieves an average recall of 90%, an average precision of 84%, and an average F1-score of 87%. Experimental results indicate that the S-Faster R-CNN outperforms state-of-the-art methods in forgery detection accuracy and especially in localization. Moreover, the proposed method shows good robustness under multiple post-processing operations.
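The final coordinate-mapping step is standard STFT bookkeeping: a detected region's horizontal extent in frames maps back to waveform samples through the hop length. A minimal sketch (hop length and sample rate are illustrative, not the paper's settings):

```python
def spec_region_to_samples(x1, x2, hop_length=512, sr=16000):
    """Map a detected region's frame range on a spectrogram back to
    sample indices and seconds in the original waveform."""
    s1, s2 = x1 * hop_length, x2 * hop_length
    return (s1, s2), (s1 / sr, s2 / sr)

# Frames 100..160 at hop 512 and 16 kHz -> samples 51200..81920 (3.2-5.12 s).
print(spec_region_to_samples(100, 160))
```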
Object detection is a critical task in computer vision, with applications ranging from autonomous driving to medical imaging. Traditional object detection models, such as Fast R-CNN, have shown remarkable performance by leveraging Convolutional Neural Networks (CNNs) for feature extraction and region proposal generation. However, these models often face challenges in scenarios where contextual understanding or the relationship between multiple objects in an image plays a key role in accurate detection. To address these limitations, we propose an ensemble framework that combines Fast R-CNN with a Bidirectional Long Short-Term Memory (Bi-LSTM) network for improved object detection performance. The proposed ensemble leverages the strengths of both Fast R-CNN and Bi-LSTM in a complementary manner. Experimental results demonstrate that the ensemble of Fast R-CNN with Bi-LSTM outperforms traditional object detection methods on standard benchmark datasets, offering improvements in both detection accuracy and localization precision. This framework is particularly beneficial in real-world applications such as autonomous navigation and surveillance, where contextual understanding and spatial reasoning are essential for reliable object detection.
Welding spot defect detection using deep learning methods provides an effective way of monitoring body-in-white quality. Building on the existing Faster R-CNN model, this paper proposes an improved Faster R-CNN model for resistance welding spot surface defect inspection that improves inspection efficiency and accuracy. The model contains the following improvements. First, the improved algorithm uses the anchor boxes with the highest confidence output by the RPN to locate welding spots; when a defect is detected and the detection system is in a suspended state, the Fast R-CNN network is used to confirm the defect category and details. Second, a new pruned model is proposed to replace the entire backbone network, in which unnecessary convolutional and connection layers are deleted and some parameters of each hidden layer are further reduced. While ensuring detection accuracy, the parameter count is greatly reduced and the speed is improved. Experiments show that the proposed model took about 15 ms per image, and both detection accuracy and recall exceeded 90% on our dataset. This deep learning model meets the requirements of welding spot defect detection.
Traffic congestion and accident-prone zones present significant challenges to urban transportation by causing delays, pollution, and safety hazards. However, existing techniques do not provide real-time recommendations for minimizing congestion, leaving travelers and planners with suboptimal solutions; these systems often fail to integrate accident-prone zone detection with traffic congestion management and cannot provide real-time congestion-free routes. This research addresses these challenges by developing a geographic information system (GIS)-based optimal route recommendation model, GINSER, using advanced deep learning techniques. The primary objectives are to detect accident-prone zones, classify traffic congestion levels, and recommend efficient routes using GIS. The proposed GINSER model takes CCTV camera images as input, which are preprocessed using an adaptive Gaussian bilateral filter (AGBF) to remove noise and enhance image quality. Faster R-CNN is used to identify and localize objects in accident-prone areas, and particle swarm optimization (PSO) is used for hyperparameter tuning to improve accuracy. A CNN-BiGRU model classifies traffic congestion levels into low, moderate, high, and congestion-free categories, and GIS analyzes spatial data and traffic patterns to recommend the most efficient congestion-free routes. The effectiveness of the proposed GINSER approach was assessed using F1 score, accuracy, precision, recall, and specificity. The AGBF effectively enhances image quality by reducing noise, leading to improved classification accuracy, and PSO-based hyperparameter tuning achieves a high accuracy of 95.24%. The GINSER model achieved a classification accuracy of 99.16%, improving overall accuracy by 3.90%, 6.71%, 4.13%, and 0.70% over TSANet, TCEVis, Ising-traffic, and AID, respectively. The proposed GINSER model offers a novel solution to urban transportation challenges by integrating deep learning and GIS technologies; its ability to detect accident-prone zones, classify congestion levels, and recommend optimal routes ensures safer and more efficient mobility.
… analysis of YOLOv11, Faster R-CNN, and RetinaNet for the … that the Faster R-CNN model proficiently detected and localized … , Faster R-CNN demonstrated superior localization accuracy …
In wood defect detection, factors such as few-shot sample scarcity, diverse defect types, and complex background interference severely limit a model's recognition accuracy and generalization ability. To address these issues, this paper proposes an improved Faster R-CNN model based on a dual attention mechanism (DAM). The model integrates cross-attention and spatial attention modules to enhance the expression of key region features and suppress texture noise interference; the improved Wood-Region Proposal Network (WRPs) module uses feature mean pooling and cross-layer fusion strategies to significantly improve the quality and robustness of candidate box generation; in addition, the Wood-Feature Reconstruction Head (WFRH) module effectively enhances adaptability to new classes and few-shot defects through multi-branch classification and weighted fusion mechanisms. After synergistic optimization of all modules, the model demonstrates superior detection accuracy and category discrimination capability. Experimental results show that the proposed method achieves state-of-the-art performance on the PASCAL VOC and FSOD datasets, particularly in the identification of 17 types of wood defects, where AP50 and AP75 are improved by 25% and 7.9%, respectively, validating the significant advantages of the proposed DAM mechanism under few-shot and complex-background conditions. The findings provide practical technical references for intelligent and efficient few-shot detection in real-world wood quality inspection tasks.
Pest infestations remain a critical threat to global agriculture, significantly compromising crop yield and quality. While accurate pest detection forms the foundation of precision pest management, current approaches face two primary challenges: (1) the scarcity of comprehensive multi-scale, multi-category pest datasets and (2) performance limitations in detection models caused by substantial target scale variations and high inter-class morphological similarity. To address these issues, we present three key contributions. First, we introduce Insect25, a novel agricultural pest detection dataset containing 25 distinct pest categories and comprising 18,349 high-resolution images. This dataset specifically addresses scale diversity through multi-resolution acquisition protocols, significantly enriching feature distribution for robust model training. Second, we propose GC-Faster RCNN, an enhanced detection framework integrating a hybrid attention mechanism that synergistically combines channel-wise correlations and spatial dependencies. This dual attention design enables more discriminative feature extraction, which is particularly effective for distinguishing morphologically similar pest species. Third, we implement an optimized training strategy featuring a cosine annealing scheduler with linear warm-up, accelerating model convergence while maintaining training stability. Experiments show that, compared with the original Faster RCNN model, GC-Faster RCNN improves mAP@0.5 on the Insect25 dataset by 4.5 percentage points, mAP@0.75 by 20.4 percentage points, and mAP@0.5:0.95 by 20.8 percentage points, while recall increases by 16.6 percentage points. The experiments also show that GC-Faster RCNN reduces interference from large scale variations and high inter-class similarity, improving detection performance.
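The training-strategy contribution (cosine annealing with linear warm-up) can be reproduced with stock PyTorch schedulers; a minimal sketch (warm-up length, learning rate, and the stand-in model are illustrative):

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(10, 2)   # stand-in for the detector
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

warmup_epochs, total_epochs = 5, 100
scheduler = SequentialLR(
    opt,
    schedulers=[
        LinearLR(opt, start_factor=0.1, total_iters=warmup_epochs),  # warm-up
        CosineAnnealingLR(opt, T_max=total_epochs - warmup_epochs),  # anneal
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    # ... one training epoch ...
    scheduler.step()
```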
The economic value of a bale of ginned cotton decreases significantly if it contains plastic. One potential source of plastic is trash found in cotton fields. Detecting and removing plastic …
… algorithm employed in R-CNN and fast R-CNN. Although fast R-CNN utilizes 2000 regions and faster R-CNN utilizes 300 … , especially for small object detection and object localization. …
The scientific and technological principles applied in agrotechnology improve the effectiveness, productivity, and sustainability of the agricultural system. Four high-performance object detection models, namely You Only Look Once version 8 (YOLOv8), YOLOv12S, Detection Transformers (DETR), and Faster Region-based Convolutional Neural Network (R-CNN), were compared using key performance metrics, including mean Average Precision (mAP) and F1-score, to classify rotten and fresh tomatoes and apples. YOLOv12S stands out among the four models on a custom dataset of 11,675 images, with an mAP of 99.31% and an F1-score of 98.28%. YOLOv12S is well-suited for real-time applications as it optimizes the trade-off between computational cost and accuracy. The architectural analysis highlights YOLOv12S's CNN backbone for fast inference versus DETR's transformer-based global context modeling. Unlike past efforts focusing on single models, the comparison spans architectures, revealing their strengths, weaknesses, and practical roles in farming. Future work should address computational demands, hybrid models, and adaptability.
The presented study evaluates and compares two deep learning models, YOLOv8n and Faster R-CNN, for automated detection of date fruits in natural orchard environments. Both models were trained and tested on a publicly available annotated dataset. YOLO, a single-stage detector, achieved an mAP@0.5 of 0.942 with a training time of approximately 2 h. It demonstrated strong generalization, especially in simpler conditions, and is well-suited for real-time applications due to its speed and lower computational requirements. Faster R-CNN, a two-stage detector with a ResNet-50 backbone, reached comparable accuracy (mAP@0.5 = 0.94) with slightly higher precision and recall. However, its training required significantly more time (approximately 19 h) and resources. Analysis of the deep learning metrics confirmed that both models performed reliably, with YOLO favoring inference speed and Faster R-CNN offering improved robustness under occlusion and variable lighting. Practical recommendations are provided for model selection based on application needs: YOLO for mobile or field robotics and Faster R-CNN for high-accuracy offline tasks. Additional conclusions highlight the benefits of GPU acceleration and high-resolution inputs. The study contributes to the growing body of research on AI deployment in precision agriculture and provides insights into the development of intelligent harvesting and crop monitoring systems.
The precise prevention and control of forest pests and diseases has always been a research hotspot in ecological environmental protection. With the continuous advancement of sensor technology, the fine-grained identification of discolored tree crowns based on UAV technology has become increasingly important in forest monitoring. Existing deep learning models face challenges such as prolonged training time and low recognition accuracy when identifying discolored tree crowns caused by pests or diseases in airborne images. To address these issues, this study improves the Faster-RCNN model by using Inception-ResNet-V2 as the feature extractor, replacing the traditional VGG16, to enhance the accuracy of discolored tree crown recognition. Experiments and analyses were conducted using UAV aerial imagery data from Jilin Changbai Mountain. The improved model effectively identified discolored tree crowns caused by pine wood nematodes, achieving a precision of 90.22%, a mean average precision (mAP) of 83.63%, and a recall of 92.33%. Compared to the original Faster-RCNN model, the mAP of the improved model increased by 4.68%, precision improved by 10.11%, and recall improved by 5.23%, significantly enhancing recognition performance on discolored tree crowns. This method provides crucial technical support and a scientific basis for the prevention and control of forest pests and diseases, facilitating early detection and precise management of forest pest outbreaks.
Accurate schematic detection in Power Distribution Networks (PDNs) is critical for effective fault detection, asset management, and predictive maintenance. Conventional edge detection methods often struggle with the complexity and scale of modern PDNs, while standalone deep learning approaches face challenges in balancing real-time performance and high precision. To address these limitations, this paper introduces a hybrid detection framework that synergizes YOLOv8 and Fast R-CNN, leveraging their complementary strengths: YOLOv8 enables rapid initial detections with real-time capabilities, while Fast R-CNN refines these outputs to enhance contextual accuracy. A key contribution of this work is the integration of ensemble techniques (Soft Voting, Hard Voting, and Weighted Average Voting), which further optimize detection performance by effectively aggregating predictions from both models. Using a curated dataset of 3,304 schematic images, the proposed method achieves state-of-the-art results, including a precision of 96.59%, a recall of 97.27%, and a mean Average Precision at an Intersection over Union (IoU) threshold of 0.50 (mAP@50) of 99.06% with the Hard Voting ensemble. These findings underscore the robustness, scalability, and applicability of the proposed framework for automating schematic analysis. Furthermore, the method demonstrates strong potential for practical deployment within PLN Indonesia, particularly in real-time fault detection, technician training, and predictive maintenance, contributing to enhanced reliability and operational efficiency in national power distribution systems.
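Hard voting between two detectors can be approximated by keeping only the detections both models agree on; a minimal numpy sketch (the IoU threshold and score averaging are illustrative; the paper's ensembles would also reconcile class labels):

```python
import numpy as np

def iou(a, b):
    """IoU between one box a and an array of boxes b, all (x1, y1, x2, y2)."""
    lt = np.maximum(a[:2], b[:, :2])
    rb = np.minimum(a[2:], b[:, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[:, 0] * wh[:, 1]
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-7)

def hard_vote(boxes_a, scores_a, boxes_b, scores_b, iou_thr=0.5):
    """Keep model A's boxes that model B confirms; average the two scores."""
    if len(boxes_b) == 0:
        return np.empty((0, 4)), np.empty(0)
    kept_boxes, kept_scores = [], []
    for box, score in zip(boxes_a, scores_a):
        overlaps = iou(box, boxes_b)
        j = overlaps.argmax()
        if overlaps[j] >= iou_thr:
            kept_boxes.append(box)
            kept_scores.append((score + scores_b[j]) / 2)
    return np.array(kept_boxes), np.array(kept_scores)
```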
Wildfires are a critical global threat, emphasizing the need for efficient detection systems capable of identifying fires and distinguishing fire-related from non-fire events in their early stages. This study integrates the Swin Transformer into the Faster R-CNN backbone to overcome challenges in detecting small flames and smoke and in distinguishing complex scenarios like fog/haze and chimney smoke. The proposed model was evaluated on a dataset comprising five classes: flames, smoke, clouds, fog/haze, and chimney smoke. Experimental results demonstrate that Swin-Transformer-based models outperform ResNet-based Faster R-CNN models, achieving a maximum mAP50 of 0.841. The model exhibited superior performance in detecting small and dynamic objects while reducing misclassification between similar classes, such as smoke and chimney smoke. Precision-recall analysis further validated the model's robustness across diverse scenarios. However, slightly lower recall for specific classes and a lower FPS compared to ResNet models suggest a need for further optimization for real-time applications. This study highlights the Swin Transformer's potential to enhance wildfire detection systems by addressing fire and non-fire events effectively. Future research will focus on optimizing its real-time performance and improving its recall in challenging scenarios, contributing to the development of robust and reliable wildfire detection systems.
Melanoma accounts for only 1% of skin cancer diagnoses yet causes the majority of skin cancer-related deaths due to its rapid progression and high metastatic potential. Early and accurate detection is crucial for improving patient outcomes; however, existing deep learning models often struggle to balance diagnostic precision with real-time efficiency. This study presents the Melano Hybrid Model, a novel architecture that integrates the rapid detection capabilities of YOLOv9 with the boundary localization accuracy of Faster R-CNN through an adaptive feature fusion mechanism. The model was rigorously evaluated on three benchmark datasets (ISIC 2019, HAM10000, and ISIC 2020) using 5-fold cross-validation. On ISIC 2020, the hybrid model achieved a 96.2% classification accuracy (95% CI: 95.8-96.6%) and a 95.1% F1-score (95% CI: 94.7-95.5%), significantly outperforming standalone models ($p \lt 0.001$). The architecture delivers an average inference speed of 31.3 frames per second (FPS), surpassing clinical real-time thresholds. Additionally, computational profiling confirms its practical feasibility with 78.3 million parameters, 134.8 GFLOPs, and a 324 MB memory footprint. These results support the hybrid framework as a robust AI-assisted tool for real-world melanoma screening, offering an optimal trade-off between speed and diagnostic performance.
Hybrid CNN Architecture for Hot Spot Detection in Photovoltaic Panels Using Fast R-CNN and GoogleNet
… Fault classification and localization methods have also … Fast R-CNN mitigates this by processing the full image once and performing detection through RoI mapping and classification…
Eye-gaze writing technology holds significant promise but faces several limitations. Existing eye-gaze-based systems often suffer from slow performance, particularly under challenging conditions such as low-light environments, user fatigue, or excessive head movement and blinking. These factors negatively impact the accuracy and reliability of eye-tracking technology, limiting the user’s ability to control the cursor or make selections. To address these challenges and enhance accessibility, we created a comprehensive dataset by integrating multiple publicly available datasets, including the Eyes Dataset, Dataset-Pupil, Pupil Detection Computer Vision Project, Pupils Computer Vision Project, and MPIIGaze dataset. This combined dataset provides diverse training data for eye images under various conditions, including open and closed eyes and diverse lighting environments. Using this dataset, we evaluated the performance of several computer vision algorithms across three key areas. For object detection, we implemented YOLOv8, SSD, and Faster R-CNN. For image segmentation, we employed DeepLab and U-Net. Finally, for self-supervised learning, we utilized the SimCLR algorithm. Our results indicate that the Haar classifier achieves the highest accuracy (0.85) with a model size of 97.358 KB, while YOLOv8 demonstrates competitive accuracy (0.83) alongside an exceptional processing speed and the smallest model size (6.083 KB), making it particularly suitable for cost-effective real-time eye-gaze applications.
This review marks the tenth anniversary of You Only Look Once (YOLO), one of the most influential frameworks in real-time object detection. Over the past decade, YOLO has evolved from a streamlined detector into a diverse family of architectures characterized by efficient design, modular scalability, and cross-domain adaptability. The paper presents a technical overview of the main versions (from YOLOv1 to YOLOv13), highlights key architectural trends, and surveys the principal application areas in which YOLO has been adopted. It also addresses evaluation practices, ethical considerations, and potential future directions for the framework’s continued development. The analysis aims to provide a comprehensive and critical perspective on YOLO’s trajectory and ongoing transformation.
The YOLO (You Only Look Once) series, a leading single-stage object detection framework, has gained significant prominence in medical-image analysis due to its real-time efficiency and robust performance. Recent iterations of YOLO have further enhanced its accuracy and reliability in critical clinical tasks such as tumor detection, lesion segmentation, and microscopic image analysis, thereby accelerating the development of clinical decision support systems. This paper systematically reviews advances in YOLO-based medical object detection from 2018 to 2024. It compares YOLO’s performance with other models (e.g., Faster R-CNN, RetinaNet) in medical contexts, summarizes standard evaluation metrics (e.g., mean Average Precision (mAP), sensitivity), and analyzes hardware deployment strategies using public datasets such as LUNA16, BraTS, and CheXpert. The review highlights the impressive performance of YOLO models, particularly from YOLOv5 to YOLOv8, in achieving high precision (up to 99.17%), sensitivity (up to 97.5%), and mAP exceeding 95% in tasks such as lung nodule, breast cancer, and polyp detection. These results demonstrate the significant potential of YOLO models for early disease detection and real-time clinical applications, indicating their ability to enhance clinical workflows. However, the study also identifies key challenges, including high small-object miss rates, limited generalization in low-contrast images, scarcity of annotated data, and model interpretability issues. Finally, the potential future research directions are also proposed to address these challenges and further advance the application of YOLO models in healthcare.
You Only Look Once (YOLO) has established itself as a prominent object detection framework due to its excellent balance between speed and accuracy. This article provides a thorough review of the YOLO series, from YOLOv1 to YOLOv10, including YOLOX, emphasizing their architectural advancements, loss function improvements, and performance enhancements. We benchmarked the officially released versions from YOLOv3 to YOLOv10 and YOLOX, using the widely recognized datasets VOC07+12 and COCO2017, on diverse hardware platforms: NVIDIA GTX Titan X, RTX 3060, and Tesla V100. The benchmark yields significant insights, such as YOLOv9-E achieving the highest mean average precision (mAP) of 76.0% on VOC07+12 and also showing superior detection accuracy on COCO2017 with an mAP of 56.6%, which is 1.2% higher than that of the latest YOLOv10-X. YOLOv9-E stands out for its superior detection accuracy, making it more suitable for tasks that require high accuracy, such as the analysis of medical images, while lightweight versions like YOLOv5-S, YOLOv7-S, YOLOv8-S, and YOLOv10-S offer a strong balance of accuracy and speed, making them ideal for real-time applications; among them, YOLOv7-S has the highest mAP of the lightweight models. Inference benchmarks highlight lightweight YOLO models such as YOLOv10-S for their exceptional inference speed on all GPUs, and the training-time results indicate that YOLOv9-E takes the longest to converge among all versions on both datasets. This study provides researchers and developers with strategies for choosing appropriate YOLO models based on accuracy, resource availability, and application-specific needs.
… This research paper introduces an enhanced method for visual Simultaneous Localization … with a lightweight object identification network known as You Only Look Once (YOLO). This …
In the domain of remote sensing imagery, minimizing the missed and false detection of small objects holds significant importance for small-target detection with unmanned aerial vehicles. In small object detection, accurately localizing objects is challenging due to the varying motion amplitudes required in different directions; in addition, capturing the global dependency of features while keeping the network light poses a challenge. To tackle these challenges, we propose a novel object detection framework built upon the YOLO architecture, called αS-YOLO, which simultaneously tackles the precise localization of small targets and keeps the network lightweight while capturing long-range dependencies. First, we design a new module combining cross-convolution with two filters, global context, and efficient channel attention, which aggregates global contextual features into the features of each pixel, enhancing the ability to model long-range dependencies while reducing network parameters to keep the network lightweight. Finally, to address the challenge of accurate small object localization, we propose a novel loss function, α-SIOU, which includes an adaptive angular control coefficient. This coefficient adjusts the distance loss in different directions based on the angular variation between the predicted and ground-truth boxes, adaptively converging the distance with the highest gradient. Experiments show that, compared with the most advanced YOLO models, the proposed αS-YOLO method improves accuracy in detecting small targets such as vehicles: on the VisDrone-DET2019 dataset, mAP and mAP50 improved by 1% and 1.1% over YOLOv8.
Object detection in autonomous driving scenarios represents a significant research direction within artificial intelligence. Real-time and accurate object detection and recognition are crucial in ensuring autonomous vehicles’ safe and stable operation. In recent years, the continuous introduction of the YOLO series of algorithms and their enhanced models has led to remarkable performance in autonomous driving object detection. From YOLOv1 to YOLOv12, detection accuracy has improved significantly, with mAP increasing from approximately 63.4% to over 80% and inference speed exceeding 100 FPS in lightweight versions such as YOLOv8n and YOLOv10. This paper reviews the YOLO algorithm and its application in object detection in autonomous driving scenarios. Firstly, the development and distinctions among the YOLO series of detection algorithms are explained, and their performance is analyzed. Secondly, the strategies for improving YOLO-based models across the input, feature extraction, and prediction stages are summarized. Thirdly, the research status and application of the YOLO algorithm in autonomous driving object detection are elaborated upon from the perspectives of traffic vehicles, pedestrians, traffic signs, traffic lights, and lane lines, with comparisons and analyses of performance metrics such as accuracy and real-time performance. Finally, considering the current challenges in autonomous driving object detection, the development trajectory and prospects of the YOLO algorithm are summarized and discussed.
Automated fabric defect detection is crucial for improving quality control, reducing manual labor, and optimizing efficiency in the textile industry. Traditional inspection methods rely heavily on human oversight, which makes them prone to subjectivity, inefficiency, and inconsistency in high-speed manufacturing environments. This review systematically examines the evolution of the You Only Look Once (YOLO) object detection framework from YOLO-v1 to YOLO-v11, emphasizing architectural advancements such as attention-based feature refinement and Transformer integration and their impact on fabric defect detection. Unlike prior studies focusing on specific YOLO variants, this work comprehensively compares the entire YOLO family, highlighting key innovations and their practical implications. We also discuss the challenges, including dataset limitations, domain generalization, and computational constraints, proposing future solutions such as synthetic data generation, federated learning, and edge AI deployment. By bridging the gap between academic advancements and industrial applications, this review is a practical guide for selecting and optimizing YOLO models for fabric inspection, paving the way for intelligent quality control systems.
… LSOD-YOLO, a lightweight small object detection algorithm … ) module enhances small object detection by strengthening the … dataset demonstrate that LSOD-YOLO achieves a 3.2 % …
Driven by the rapid development of deep learning technology, the YOLO series has set a new benchmark for real-time object detectors. Transformer-based structures have also emerged as a powerful solution in the field, greatly extending the model's receptive field and achieving significant performance improvements. However, this improvement comes at a cost, as the quadratic complexity of the self-attention mechanism increases the computational burden of the model. To address this problem, we introduce a simple yet effective baseline called Mamba YOLO. Our contributions are as follows: 1) We propose the ODMamba backbone, which introduces a State Space Model (SSM) with linear complexity to address the quadratic complexity of self-attention; unlike other Transformer-based and SSM-based methods, ODMamba is simple to train without pretraining. 2) To meet real-time requirements, we design the macro structure of ODMamba and determine the optimal stage ratio and scaling size. 3) We design the RG Block, which employs a multi-branch structure to model the channel dimensions, addressing possible limitations of SSMs in sequence modeling, such as insufficient receptive fields and weak image localization; this design captures localized image dependencies more accurately. Extensive experiments on the publicly available COCO benchmark show that Mamba YOLO achieves state-of-the-art performance compared with previous methods. Specifically, a tiny version of Mamba YOLO achieves a 7.5% improvement in mAP on a single 4090 GPU with an inference time of 1.5 ms.
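To make the linear-complexity claim concrete, a toy (non-selective, single-channel) linear state-space recurrence is sketched below; this is not the paper's ODMamba, just the underlying O(N) scan that contrasts with the O(N²) pairwise interactions of self-attention:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """y_t = C h_t with h_t = A h_{t-1} + B x_t: one pass over the sequence."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                  # N steps, constant work per step -> O(N)
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
d = 8                               # state dimension
A = 0.9 * np.eye(d)                 # a stable (contractive) state matrix
B, C = rng.normal(size=d), rng.normal(size=d)
print(ssm_scan(rng.normal(size=1000), A, B, C).shape)   # (1000,)
```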
… Another approach is to use object detection algorithm to count. Traditional methods rely on … In this paper, object detection as a preliminary method was applied to count wheat heads in …
In computer vision applications, the primary task of object detection is to answer the following question: “What object is present and where is it located?” However, underwater environments introduce challenges such as poor lighting, high scene complexity, and diverse marine organism shapes, leading to missed detections or false positives in deep learning-based algorithms. To improve detection accuracy and robustness, this paper proposes an enhanced YOLOv11-based algorithm for underwater object detection that strengthens the ability to capture both local and global details as well as global contextual information in complex underwater environments. The proposed method introduces several enhancements. The backbone incorporates a DualBottleneck module, replacing the standard bottleneck structure in C3k, to enhance feature extraction and channel aggregation. The detection head adopts DyHead-GDC, integrating ghost depthwise separable convolution with DyHead for greater efficiency. Furthermore, the ADown module replaces conventional feature extraction and downsampling convolutions, reducing parameters and FLOPs by 14%. The C2PSF module, combining focal modulation and C2, strengthens local feature extraction and global context processing. Additionally, an SCSA module is inserted before the detection head to fully utilize multi-semantic information, improving detection performance in complex underwater scenes. Experimental results confirm the effectiveness of these improvements: the model achieves 84.2% mAP50 on UTDAC2020, 84.4% on DUO, and 86.7% on RUOD, surpassing the baseline by 2.5%, 1.6%, and 1.2%, respectively. It remains lightweight, with 6.5 M parameters and a computational cost of 7.1 GFLOPs.
This paper provides a comprehensive study of the security of the YOLO (You Only Look Once) model series for object detection, emphasizing their evolution, technical innovations, and performance on the COCO dataset. The robustness of YOLO models under adversarial attacks and image corruption is analyzed in depth, offering insights into their resilience and adaptability. As real-time object detection plays an increasingly vital role in applications such as autonomous driving, security, and surveillance, this review aims to clarify the strengths and limitations of each YOLO iteration. The results reveal that YOLOX models, particularly their large variants, exhibit superior robustness compared to other YOLO versions, maintaining higher accuracy under challenging conditions. Our findings serve as a valuable resource for researchers and practitioners aiming to optimize YOLO model selection and deployment for dynamic and adversarial real-world environments, while guiding future research toward developing more resilient object detection systems.
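The adversarial attacks used in such robustness studies are typically gradient-based; a minimal FGSM sketch follows (a generic classifier and cross-entropy loss stand in for a detector's composite loss, so this illustrates the attack family rather than the paper's exact evaluation):

```python
import torch

def fgsm(model, x, y, loss_fn, eps=8 / 255):
    """Fast Gradient Sign Method: one gradient ascent step in input space."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # Perturb each pixel by +/- eps along the sign of the loss gradient.
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()

# Toy stand-in model and batch.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = fgsm(model, x, y, torch.nn.CrossEntropyLoss())
```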
… object detection model called Att-YOLO, to automate rock core classification and localization … The performance of Att-YOLO model is evaluated on a test set, achieving the highest mean …
In Unmanned Aerial Vehicle (UAV) target detection tasks, missed and erroneous detections frequently occur owing to the small size of the targets and the complexity of the image background. To address these issues, an improved target detection algorithm named RLRD-YOLO, based on You Only Look Once version 8 (YOLOv8), is proposed. First, the backbone network integrates the Receptive Field Attention Convolution (RFCBAMConv) module, which combines the Convolutional Block Attention Module (CBAM) and Receptive Field Attention Convolution (RFAConv). This integration mitigates the issue of shared attention weights in receptive field features and combines attention mechanisms across both channel and spatial dimensions, enhancing feature extraction capability. Subsequently, Large-Scale Kernel Attention (LSKA) is integrated to further optimize the Spatial Pyramid Pooling Fast (SPPF) layer. This enhancement employs a large-scale convolutional kernel to better capture intricate small-target features and minimize background interference. To enhance feature fusion and effectively integrate low-level details with high-level semantic information, the Reparameterized Generalized Feature Pyramid Network (RepGFPN) replaces the original architecture in the neck network. Additionally, a small-target detection layer is added to enhance the model’s ability to perceive small targets. Finally, the detection head is replaced with the Dynamic Head, designed to improve the localization accuracy of small targets in complex scenarios by optimizing for scale awareness, spatial awareness, and task awareness. The experimental results showed that RLRD-YOLO outperformed YOLOv8 on the VisDrone2019 dataset, achieving improvements of 12.2% in mAP@0.5 and 8.4% in mAP@0.5:0.95, and also surpassed other widely used object detection methods. Furthermore, experimental results on the HIT-HAV dataset demonstrate that RLRD-YOLO sustains excellent precision on infrared UAV imagery, validating its generalizability across diverse scenarios. Finally, RLRD-YOLO was deployed and validated on a typical airborne platform, the Jetson Nano, providing reliable technical support for the improvement of detection algorithms in aerial scenarios and their practical applications.
Small object detection in UAV aerial images is challenging due to low contrast, complex backgrounds, and limited computational resources. Traditional methods struggle with high miss detection rates and poor localization accuracy caused by information loss, weak cross-layer feature interaction, and rigid detection heads. To address these issues, we propose LRDS-YOLO, a lightweight and efficient model tailored for UAV applications. The model incorporates a Light Adaptive-weight Downsampling (LAD) module to retain fine-grained small object features and reduce information loss. A Re-Calibration Feature Pyramid Network (Re-Calibration FPN) enhances multi-scale feature fusion using bidirectional interactions and resolution-aware hybrid attention. The SegNext Attention mechanism improves target focus while suppressing background noise, and the dynamic detection head (DyHead) optimizes multi-dimensional feature weighting for robust detection. Experiments show that LRDS-YOLO achieves 43.6% mAP50 on VisDrone2019, 11.4% higher than the baseline, with only 4.17M parameters and 24.1 GFLOPs, striking a balance between accuracy and efficiency. On the HIT-UAV infrared dataset, it reaches 84.5% mAP50, demonstrating strong generalization. With its lightweight design and high precision, LRDS-YOLO offers an effective real-time solution for UAV-based small object detection.
The growing use of remote-sensing technologies has placed greater demands on object-detection algorithms, which still face challenges. This study proposes a hierarchical adaptive feature aggregation network (HAF-YOLO) to improve detection precision in remote-sensing images. It addresses issues such as small object size, complex backgrounds, scale variation, and dense object distributions by incorporating three core modules: dynamic-cooperative multimodal fusion architecture (DyCoMF-Arch), multiscale wavelet-enhanced aggregation network (MWA-Net), and spatial-deformable dynamic enhancement module (SDDE-Module). DyCoMF-Arch builds a hierarchical feature pyramid using multistage spatial compression and expansion, with dynamic weight allocation to extract salient features. MWA-Net applies wavelet-transform-based convolution to decompose features, preserving high-frequency detail and enhancing representation of small-scale objects. SDDE-Module integrates spatial coordinate encoding and multidirectional convolution to reduce localization interference and overcome fixed sampling limitations for geometric deformations. Experiments on the NWPU VHR-10 and DIOR datasets show that HAF-YOLO achieved mAP50 scores of 85.0% and 78.1%, improving on YOLOv8 by 4.8% and 3.1%, respectively. HAF-YOLO also maintained a low computational cost of 11.8 GFLOPs, outperforming other YOLO models. Ablation studies validated the effectiveness of each module and their combined optimization. This study presents a novel approach for remote-sensing object detection, with theoretical and practical value.
This article focuses on designing and programming an application implementing YOLOv8 for the detection and subsequent classification of objects in video recordings. The application, developed in Python, allows the insertion of video recordings in mp4 format; it then splits them into frames, enabling the use of the YOLO method for object detection and classification while simultaneously determining the time and frame at which each identified object appears. Compared to previous versions, YOLOv8, released in 2023, allows the use of up to 53 convolutional layers. As part of the YOLO implementation, a convolutional network with 5 layers and 5 object classes was therefore devised. A total of 10 epochs was configured for the training process, resulting in an accuracy of 94.79%. Notably, after the 7th epoch, the error rate exhibited a declining trend, reaching a value of 0.15. These values signify a sufficiently trained network with no need for further retraining.
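The described pipeline maps closely onto the ultralytics Python API; a minimal sketch (the weights file and video path are placeholders, and dividing the frame index by the video's FPS would recover the timestamp):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # pretrained YOLOv8 nano weights

# stream=True yields one result per frame of the mp4 input.
for frame_idx, result in enumerate(model.predict("input.mp4", stream=True)):
    for box in result.boxes:
        cls_name = model.names[int(box.cls)]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"frame {frame_idx}: {cls_name} ({box.conf.item():.2f}) "
              f"at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")
```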
… via object detection and instance segmentation algorithms. … of rebar detection and instance segmentation algorithms. It is … Six object detection methods and four instance segmentation …
This study presents a YOLOv11/YOLOv12-based deep learning framework for automated detection, segmentation, and classification of five hematologic cell types (Basophil, Erythroblast, Monocyte, Myeloblast, and Segmented Neutrophil) from microscopic blood images. A dataset of annotated cell images was preprocessed and augmented to enhance model generalization, and transfer learning was applied to optimize performance on unseen samples. Quantitative evaluations demonstrated exceptional accuracy, with mAP50 scores exceeding 0.99 for both bounding box and mask predictions, an overall mAP@0.5 of 0.992, and F1-scores peaking at 0.98. Precision-recall, precision-confidence, and recall-confidence curves further confirmed stable high-confidence behavior across all classes. Qualitative assessments showed accurate delineation of cell morphology under varying staining conditions, with minor misclassification observed only in morphologically similar cells. The system also achieved real-time inference speeds of 0.2-0.4 seconds per image and was deployed via a Gradio web interface to enable immediate visualization of detection outputs. These findings demonstrate that the proposed YOLO-based framework delivers fast, accurate, and robust performance, establishing a viable foundation for AI-assisted hematological screening and supporting future integration into clinical diagnostic workflows.
… YOLOv9e-instance segmentation model classifies and … to improve the detection and classification of objects using … as box loss, segmentation loss, and classification loss, …
… demand for rapid and cost-effective detection. This study introduces a novel approach that amalgamates instance segmentation and monocular depth estimation, enabling fastener …
There are numerous applications for building dimension data, including building performance simulation and urban heat island investigations. In this context, object detection and instance segmentation methods—based on deep learning—are often used with Street View Images (SVIs) to estimate building dimensions. However, these methods typically depend on large and diverse datasets. Image augmentation can artificially boost dataset diversity, yet its role in building dimension estimation from SVIs remains under-studied. This research presents a methodology that applies eight distinct augmentation techniques—brightness, contrast, perspective, rotation, scale, shearing, translation augmentation, and a combined “sum of all” approach—to train models in two tasks: object detection with Faster Region-Based Convolutional Neural Networks (Faster R-CNNs) and instance segmentation with You Only Look Once (YOLO)v10. Comparing the performance with and without augmentation revealed that contrast augmentation consistently provided the greatest improvement in both bounding-box detection and instance segmentation. Using all augmentations at once rarely outperformed the single most effective method, and sometimes degraded the accuracy; shearing augmentation ranked as the second-best approach. Notably, the validation and test findings were closely aligned. These results, alongside the potential applications and the method’s current limitations, underscore the importance of carefully selected augmentations for reliable building dimension estimation.
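The two best-performing single augmentations correspond to stock torchvision transforms; a minimal sketch (parameter ranges are illustrative; for detection, geometric transforms such as shearing must also be applied to the box coordinates, which these classification-style transforms do not handle):

```python
from torchvision import transforms

# Contrast-only augmentation (the study's most effective single technique).
contrast_aug = transforms.Compose([
    transforms.ColorJitter(contrast=0.4),
    transforms.ToTensor(),
])

# Shearing-only augmentation (the reported runner-up).
shear_aug = transforms.Compose([
    transforms.RandomAffine(degrees=0, shear=15),
    transforms.ToTensor(),
])
```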
A work treating object detection and instance segmentation as fundamental tasks and evaluating methods for both object detection and instance segmentation.
Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect, segment, and track arbitrary object categories in a video, without being constrained to categories seen during training. In this work, we propose InstFormer, a carefully designed framework for the OpenVIS task that achieves powerful open-vocabulary capabilities through lightweight fine-tuning with limited-category data. InstFormer begins with the open-world mask proposal network, encouraged to propose all potential instance class-agnostic masks by the contrastive instance margin loss. Next, we introduce InstCLIP, adapted from pre-trained CLIP with Instance Guidance Attention, which encodes open-vocabulary instance tokens efficiently. These instance tokens not only enable open-vocabulary classification but also offer strong universal tracking capabilities. Furthermore, to prevent the tracking module from being constrained by the training data with limited categories, we propose the universal rollout association, which transforms the tracking problem into predicting the next frame’s instance tracking token. The experimental results demonstrate that the proposed InstFormer achieves state-of-the-art performance on a comprehensive OpenVIS evaluation benchmark, while also achieving competitive performance on the fully supervised VIS task.
Cell segmentation and classification are critical tasks in spatial omics data analysis. Here we introduce CelloType, an end-to-end model designed for cell segmentation and classification for image-based spatial omics data. Unlike the traditional two-stage approach of segmentation followed by classification, CelloType adopts a multitask learning strategy that integrates these tasks, simultaneously enhancing the performance of both. CelloType leverages transformer-based deep learning techniques for improved accuracy in object detection, segmentation and classification. It outperforms existing segmentation methods on a variety of multiplexed fluorescence and spatial transcriptomic images. In terms of cell type classification, CelloType surpasses a model composed of state-of-the-art methods for individual tasks and a high-performance instance segmentation model. Using multiplexed tissue images, we further demonstrate the utility of CelloType for multiscale segmentation and classification of both cellular and noncellular elements in a tissue. The enhanced accuracy and multitask learning ability of CelloType facilitate automated annotation of rapidly growing spatial omics data. CelloType is an end-to-end method for spatial omics data analysis that uses a transformer-based deep neural network for concurrent object detection, segmentation and classification and performs with high accuracy on diverse datasets.
With the continuous development of artificial intelligence technology, the transformation of traditional agriculture into intelligent agriculture is quickly accelerating. However, due to the diverse growth postures of tender shoots and complex growth environments in tea plants, traditional tea picking machines are unable to precisely select the tender shoots, and the picking of high-end and premium tea still relies on manual labor, resulting in low efficiency and high costs. To address these issues, an instance segmentation algorithm named YOLOv8-TEA is proposed. Firstly, this algorithm is based on the single-stage instance segmentation algorithm YOLOv8-seg, replacing some C2f modules in the original feature extraction network with MVB, combining the advantages of convolutional neural networks (CNN) and Transformers, and adding a C2PSA module following spatial pyramid pooling (SPPF) to integrate convolution and attention mechanisms. Secondly, a learnable dynamic upsampling method is used to replace the traditional upsampling, and the CoTAttention module is added, along with the fusion of dilated convolutions in the segmentation head, to enhance the learning ability of the feature fusion network. Finally, ablation and comparative experiments show that the improved algorithm significantly improves segmentation accuracy while effectively reducing model parameters, with mAP (Box) and mAP (Mask) reaching 86.9% and 86.8%, respectively, and GFLOPs reduced to 52.7.
In intelligent ship navigation, marine scenes pose significant challenges for instance segmentation, particularly regarding the scale-dependent segmentation performance of ship instances.
This work casts row detection as an instance segmentation problem, where each rice row forms its own instance, and customizes a Two-Pathway Instance Segmentation Network (TP-ISN).
A study for autonomous vehicles in which segmented map features, such as light poles, are extracted from instance-based segmented images acquired from onboard sensing.
This paper introduces an enhanced YOLOv8 object detection model that maintains detection accuracy and real-time performance, positioning it as a competitive solution in the field of real-time detection.
A work on a redesigned detection head that significantly enhances real-time object detection performance and demonstrates superior classification performance on real-time data.
This paper focuses on real-time object detection systems, analyzing existing Field-Programmable Gate Array (FPGA) implementations that aim to achieve the best efficiency, performance, and accuracy at the same time. These three metrics are typically crucial for domains such as autonomous driving and robotics. Fortunately, recent advancements in object detection models, particularly those based on Convolutional Neural Networks (CNNs), have significantly improved object detection accuracy and speed. When these models are combined with FPGAs, it is possible to achieve even greater power efficiency and more easily satisfy real-time constraints. FPGAs can deliver low latency and high throughput by leveraging true parallelism, making them suitable platforms for developing real-time object detection systems. This paper reviews existing literature on FPGA-based real-time object detection, discussing commonly used algorithms, acceleration techniques, and optimization strategies. Evaluation metrics and typical datasets for assessing real-time systems are also examined. We have compared the performance of these implementations by using pixel throughput as a fair metric across different systems while processing video streams or images. Insights into state-of-the-art works, comparative analysis, challenges, and future research directions are provided to guide researchers interested in leveraging FPGA devices for real-time object detection applications.
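The review's comparison metric, pixel throughput, is straightforward to compute: pixels processed per second equals frame width times height times sustained FPS. A small sketch with hypothetical numbers:

```python
# Pixel throughput lets systems running at different resolutions be compared fairly.
def pixel_throughput(width: int, height: int, fps: float) -> float:
    return width * height * fps

print(pixel_throughput(640, 480, 60))    # 18,432,000 pix/s (~18.4 Mpix/s)
print(pixel_throughput(1920, 1080, 12))  # 24,883,200 pix/s: higher despite lower FPS
```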
Achieving lightweight real-time object detection necessitates balancing model compression with detection accuracy, a difficulty exacerbated by low redundancy and uneven contributions from convolutional layers. As an alternative to traditional methods, we propose Rigorous Gradation Pruning (RGP), which uses a desensitized first-order Taylor approximation to assess filter importance, enabling precise pruning of redundant kernels. This approach includes the iterative reassessment of layer significance to protect essential layers, ensuring effective detection performance. We applied RGP to YOLOv8 detectors and tested it on the GTSDB, Seaships, and COCO datasets. On GTSDB, RGP achieved 80% compression of YOLOv8n with only a 0.11% drop in mAP@0.5, while increasing frames per second (FPS) by 43.84%. For YOLOv8x, RGP achieved 90% compression, a 1.26% mAP@0.5:0.95 increase, and a 112.66% FPS boost. Significant compression was also achieved on the Seaships and COCO datasets, demonstrating RGP’s robustness across diverse object detection tasks and its potential for advancing efficient, high-speed detection models.
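A minimal sketch of the first-order Taylor filter importance that RGP builds on; the paper's desensitized variant and iterative layer reassessment are not reproduced. It assumes a PyTorch convolution whose gradients have been populated by a backward pass:

```python
# Score each conv filter by |w * dL/dw| summed over its weights: a first-order
# Taylor estimate of how much the loss changes if the filter is removed.
import torch
import torch.nn as nn

def taylor_filter_importance(conv: nn.Conv2d) -> torch.Tensor:
    # conv.weight: (out_channels, in_channels, kH, kW); .grad must be populated
    score = (conv.weight * conv.weight.grad).abs()
    return score.sum(dim=(1, 2, 3))  # one importance value per output filter

# Lowest-scoring filters are pruning candidates; RGP additionally re-assesses
# layer-level significance iteratively to protect essential layers.
```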
In conditions of foggy weather, challenges such as low light, blurred imagery, and dense fog that obscures target objects are prevalent. Moreover, computing resources are limited on edge devices. To tackle these challenges, a novel real-time detection algorithm, GMS-YOLO, for pedestrians and vehicles is proposed based on YOLOv10, which overcomes the semantic bottleneck of the model and enhances its detection performance. A novel ghost multi-scale convolution (GMSConv) module is constructed, serving as the ghost multi-scale feature extraction backbone network (GMS-Net). The Shape Consistent Intersection over Union (SCIoU) is introduced as the localization loss function, which takes into account the influence of the attributes of the regression box in the loss computation. Additionally, a compensatory consistency matching metric (CCMM) formula is designed to reduce the sensitivity of the original metric to IoU and regression scores. The GMS-YOLO algorithm has a lightweight structure, achieving FPS of 94 and 92 during the detection phase at the “n” and “s” sizes, respectively. Furthermore, we have deployed the model on Jetson Nano hardware, and the inference speed is also quite encouraging. We validated the effectiveness of the algorithm on the Foggy Cityscapes, RTTS, VOC2007-fog, and VOC2012-fog datasets. Experimental results indicate that GMS-YOLO outperforms the baseline model, with a mean average precision (mAP) improvement of 6.3% and 5.5% for the “n” and “s” scales, respectively. Consequently, the proposed GMS-YOLO algorithm not only demonstrates superior detection performance but also maintains a relatively low model complexity, significantly enhancing the efficiency and accuracy of object detection tasks in foggy environments. The source code for our algorithm is available at: https://github.com/Fwdchina/GMS-YOLO.
Unmanned Aerial Vehicles (UAVs) face a significant challenge in balancing high accuracy and high efficiency when performing real-time object detection tasks, especially amidst intricate backgrounds, diverse target scales, and stringent onboard computational resource constraints. To tackle these difficulties, this study introduces YOLO-SRMX, a lightweight real-time object detection framework specifically designed for infrared imagery captured by UAVs. Firstly, the model utilizes ShuffleNetV2 as an efficient lightweight backbone and integrates the novel Multi-Scale Dilated Attention (MSDA) module. This strategy not only facilitates a substantial 46.4% reduction in parameter volume but also, through the flexible adaptation of receptive fields, boosts the model’s robustness and precision in multi-scale object recognition tasks. Secondly, within the neck network, multi-scale feature extraction is facilitated through the design of novel composite convolutions, ConvX and MConv, based on a “split–differentiate–concatenate” paradigm. Furthermore, the lightweight GhostConv is incorporated to reduce model complexity. By synthesizing these principles, a novel composite receptive field lightweight convolution, DRFAConvP, is proposed to further optimize multi-scale feature fusion efficiency and promote model lightweighting. Finally, the Wise-IoU loss function is adopted to replace the traditional bounding box loss. This is coupled with a dynamic non-monotonic focusing mechanism formulated using the concept of outlier degrees. This mechanism intelligently assigns elevated gradient weights to anchor boxes of moderate quality by assessing their relative outlier degree, while concurrently diminishing the gradient contributions from both high-quality and low-quality anchor boxes. Consequently, this approach enhances the model’s localization accuracy for small targets in complex scenes. Experimental evaluations on the HIT-UAV dataset corroborate that YOLO-SRMX achieves an mAP50 of 82.8%, representing a 7.81% improvement over the baseline YOLOv8s model; an F1 score of 80%, marking a 3.9% increase; and a substantial 65.3% reduction in computational cost (GFLOPs). YOLO-SRMX demonstrates an exceptional trade-off between detection accuracy and operational efficiency, thereby underscoring its considerable potential for efficient and precise object detection on resource-constrained UAV platforms.
For object detection, this research meticulously trained and evaluated several YOLO versions, focusing on their detection performance and accuracy, positioning the strongest as a viable model for real-time ADAS applications.
In today’s digital environment, effectively detecting and censoring harmful and offensive objects such as weapons, addictive substances, and violent content on online platforms is increasingly important for user safety. This study introduces an Enhanced Object Detection (EOD) model that builds upon the YOLOv8-m architecture to improve the identification of such harmful objects in complex scenarios. Our key contributions include enhancing the cross-stage partial fusion blocks and incorporating three additional convolutional blocks into the model head, leading to better feature extraction and detection capabilities. Utilizing a public dataset covering six categories of harmful objects, our EOD model achieves superior performance with precision, recall, and mAP50 scores of 0.88, 0.89, and 0.92 on standard test data, and 0.84, 0.74, and 0.82 on challenging test cases, surpassing existing deep learning approaches. Furthermore, we employ explainable AI techniques to validate the model’s confidence and decision-making process. These advancements not only enhance detection accuracy but also set a new benchmark for harmful object detection, significantly contributing to safety measures across various online platforms.
Object detection is a fundamental capability that enables drones to perform various tasks. However, achieving a suitable equilibrium between performance, efficiency, and lightweight design continues to be a significant challenge for current algorithms. To address this issue, we propose an enhanced small object detection transformer model called ESO-DETR. First, we present a gated single-head attention backbone block, known as the GSHA block, which enhances the extraction of local details. In addition, ESO-DETR utilizes the multiscale multihead self-attention mechanism (MMSA) to efficiently manage complex features within its backbone network. We also introduce a novel and efficient feature fusion pyramid network for enhanced small object detection, termed ESO-FPN. This network integrates large convolutional kernels with dual-domain attention mechanisms. Lastly, we introduce the EMASlideVariFocal loss (ESVF Loss), which dynamically adjusts the weights to improve the model’s focus on more challenging samples. In comparison with the baseline model, ESO-DETR demonstrates enhancements of 3.9% and 4.0% in the mAP50 metric on the VisDrone and HIT-UAV datasets, respectively, while also reducing parameters by 25%. These results highlight the capability of ESO-DETR to improve detection accuracy while maintaining a lightweight and efficient structure.
A real-time object detection technique (YOLOv12) is employed to detect, recognize, and localize vehicle license plates (LPs) and container IDs in images.
A work on deep learning models that facilitate real-time object detection and classification, leveraging YOLOv8's superior real-time performance together with localized processing for enhanced data privacy and security.
Detection tasks are the primary actors in any autonomous driving technology; the solution proposed in this article contemplates a model for real-time object detection.
A real-time underwater object detection method divided into two subtasks: weakly supervised object localization and real-time object detection.
The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This in-depth review presents a structured exploration of the state-of-the-art in LVLMs, systematically organized through a three-step research review process. First, we discuss the functioning of vision language models (VLMs) for object detection, describing how these models harness natural language processing (NLP) and computer vision (CV) techniques to revolutionize object detection and localization. We then explain the architectural innovations, training paradigms, and output flexibility of recent LVLMs for object detection, highlighting how they achieve advanced contextual understanding. The review thoroughly examines the approaches used to integrate visual and textual information, demonstrating the progress made in object detection using VLMs that facilitate more sophisticated detection and localization strategies. This review presents comprehensive visualizations demonstrating LVLMs' effectiveness in diverse scenarios, including localization and segmentation, and then compares their real-time performance, adaptability, and complexity to traditional deep learning systems. Based on the review, it is expected that LVLMs will soon meet or surpass the performance of conventional methods in object detection. The review also identifies a few major limitations of current LVLM models, proposes solutions to address those challenges, and presents a clear roadmap for future advancement in this field. We conclude, based on this study, that recent advancements in LVLMs have made and will continue to make a transformative impact on object detection and robotic applications in the future.
This paper investigates the integration of the Learning Using Privileged Information (LUPI) paradigm in object detection to exploit fine-grained, descriptive information available during training but not at inference. We introduce a general, model-agnostic methodology for injecting privileged information, such as bounding box masks, saliency maps, and depth cues, into deep learning-based object detectors through a teacher-student architecture. Experiments are conducted across five state-of-the-art object detection models and multiple public benchmarks, including UAV-based litter detection datasets and Pascal VOC 2012, to assess the impact on accuracy, generalization, and computational efficiency. Our results demonstrate that LUPI-trained students consistently outperform their baseline counterparts, achieving significant boosts in detection accuracy with no increase in inference complexity or model size. Performance improvements are especially marked for medium and large objects, while ablation studies reveal that intermediate weighting of teacher guidance optimally balances learning from privileged and standard inputs. The findings affirm that the LUPI framework provides an effective and practical strategy for advancing object detection systems in both resource-constrained and real-world settings.
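A schematic sketch of the teacher-student LUPI setup described above, with placeholder module names; the teacher consumes the image plus a privileged map available only at training time, and the student, which sees the image alone, is pulled toward the teacher's intermediate features:

```python
# Teacher-student distillation with privileged inputs: only a sketch under the
# assumption of feature-matching guidance; names and shapes are illustrative.
import torch
import torch.nn.functional as F

def lupi_loss(student_feats, teacher_feats, det_loss, alpha=0.5):
    # alpha: intermediate weighting of teacher guidance, which the ablations
    # above found to best balance privileged and standard supervision
    distill = F.mse_loss(student_feats, teacher_feats.detach())
    return det_loss + alpha * distill

# training step (pseudocode level):
# teacher_feats = teacher(torch.cat([image, privileged_map], dim=1))  # train only
# student_feats = student(image)                                      # inference path
# loss = lupi_loss(student_feats, teacher_feats, detection_loss)
```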
Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness on conventional vision tasks has thus far gone unevaluated. In this work, we present a systematic review of VLM-based detection and segmentation, viewing the VLM as a foundational model and conducting comprehensive evaluations across multiple downstream tasks for the first time: 1) The evaluation spans eight detection scenarios (closed-set detection, domain adaptation, crowded objects, etc.) and eight segmentation scenarios (few-shot, open-world, small object, etc.), revealing distinct performance advantages and limitations of various VLM architectures across tasks. 2) For detection tasks, we evaluate VLMs under three fine-tuning granularities: zero prediction, visual fine-tuning, and text prompt, and further analyze how different fine-tuning strategies impact performance under varied tasks. 3) Based on empirical findings, we provide an in-depth analysis of the correlations between task characteristics, model architectures, and training methodologies, offering insights for future VLM design. 4) We believe this work will be valuable to pattern recognition experts working in computer vision, multimodal learning, and vision foundation models by introducing them to the problem, familiarizing them with the current status of progress, and providing promising directions for future research. A project associated with this review and evaluation has been created at https://github.com/better-chao/perceptual_abilities_evaluation.
Object detection is the task of detecting objects in an image, and the detection of small objects is particularly difficult: beyond their small size, it is also complicated by blur, occlusion, and so on. Current small object detection methods are tailored to small and dense situations, such as pedestrians in a crowd or distant objects in remote sensing scenarios. However, when the target object is small and sparse, there is a lack of objects available for training, making it more difficult to learn effective features. In this paper, we propose a specialized method for detecting a specific category of small objects: birds. In particular, we improve the features learned by the neck (the sub-network between the backbone and the prediction head) to learn more effective features with a hierarchical design. We employ a Swin Transformer to upsample the image features, and we change the shifted window size to adapt to small objects. Experiments show that the proposed Swin Transformer-based neck combined with CenterNet can lead to good performance by changing the window sizes. We further find that smaller window sizes (default 2) benefit mAPs for small object detection.
Infrared imaging has emerged as a robust solution for urban object detection under low-light and adverse weather conditions, offering significant advantages over traditional visible-light cameras. However, challenges such as class imbalance, thermal noise, and computational constraints can significantly hinder model performance in practical settings. To address these issues, we evaluate multiple YOLO variants on the FLIR ADAS V2 dataset, ultimately selecting YOLOv8 as our baseline due to its balanced accuracy and efficiency. Building on this foundation, we present MS-YOLO (MobileNetV4 and SlideLoss based on YOLO), which replaces YOLOv8's CSPDarknet backbone with the more efficient MobileNetV4, reducing computational overhead by 1.5% while sustaining high accuracy. In addition, we introduce SlideLoss, a novel loss function that dynamically emphasizes under-represented and occluded samples, boosting precision without sacrificing recall. Experiments on the FLIR ADAS V2 benchmark show that MS-YOLO attains competitive mAP and superior precision while operating at only 6.7 GFLOPs. These results demonstrate that MS-YOLO effectively addresses the dual challenge of maintaining high detection quality while minimizing computational costs, making it well-suited for real-time edge deployment in urban environments.
Underwater object detection is a critical yet challenging research problem owing to severe light attenuation, color distortion, background clutter, and the small scale of underwater targets. To address these challenges, we propose SPMamba-YOLO, a novel underwater object detection network that integrates multi-scale feature enhancement with global context modeling. Specifically, a Spatial Pyramid Pooling Enhanced Layer Aggregation Network (SPPELAN) module is introduced to strengthen multi-scale feature aggregation and expand the receptive field, while a Pyramid Split Attention (PSA) mechanism enhances feature discrimination by emphasizing informative regions and suppressing background interference. In addition, a Mamba-based state space modeling module is incorporated to efficiently capture long-range dependencies and global contextual information, thereby improving detection robustness in complex underwater environments. Extensive experiments on the URPC2022 dataset demonstrate that SPMamba-YOLO outperforms the YOLOv8n baseline by more than 4.9% in mAP@0.5, particularly for small and densely distributed underwater objects, while maintaining a favorable balance between detection accuracy and computational cost.
Detecting rotated objects accurately and efficiently is a significant challenge in computer vision, particularly in applications such as aerial imagery, remote sensing, and autonomous driving. Although traditional object detection frameworks are effective for axis-aligned objects, they often underperform in scenarios involving rotated objects due to their limitations in capturing orientation variations. This paper introduces an improved loss function aimed at enhancing detection accuracy and robustness by leveraging the Gaussian bounding box representation and the Bhattacharyya distance. In addition, we advocate the use of an anisotropic Gaussian representation to address the issues associated with isotropic variance in square-like objects. Our proposed method addresses these challenges by incorporating a rotation-invariant loss function that effectively captures the geometric properties of rotated objects. We integrate this loss function into state-of-the-art deep learning-based rotated object detectors, and extensive experiments demonstrate significant improvements in mean Average Precision metrics compared to existing methods. The results highlight the potential of our approach to establish a new benchmark in rotated object detection, with implications for a wide range of applications requiring precise and reliable object localization irrespective of orientation.
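For context, a hedged sketch of the underlying Gaussian box representation and the Bhattacharyya distance between two Gaussians; the paper's anisotropic-variance handling and loss shaping are omitted. A rotated box (cx, cy, w, h, theta) maps to N(mu, Sigma) with Sigma = R diag(w^2/4, h^2/4) R^T:

```python
import numpy as np

def box_to_gaussian(cx, cy, w, h, theta):
    # Rotated box -> 2D Gaussian: mean at the center, covariance from size/angle.
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([w**2 / 4.0, h**2 / 4.0])
    return np.array([cx, cy]), R @ S @ R.T

def bhattacharyya(mu1, S1, mu2, S2):
    # Standard Bhattacharyya distance between N(mu1, S1) and N(mu2, S2).
    S = 0.5 * (S1 + S2)
    d = mu1 - mu2
    term1 = 0.125 * d @ np.linalg.solve(S, d)
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2
```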
In robot automated assembly, snap assembly precision and efficiency directly determine overall production quality. As a core prerequisite, snap detection and localization critically affect subsequent assembly success. Traditional visual methods suffer from poor robustness and large localization errors when handling complex scenarios (e.g., transparent or low-contrast snaps), failing to meet high-precision assembly demands. To address this, this paper designs a dedicated sensor and proposes SMR-Net, a self-attention-based multi-scale object detection algorithm, to synergistically enhance detection and localization performance. SMR-Net adopts an attention-enhanced multi-scale feature fusion architecture: raw sensor data is encoded via an attention-embedded feature extractor to strengthen key snap features and suppress noise; three multi-scale feature maps are processed in parallel with standard and dilated convolutions for dimension unification while preserving resolution; and an adaptive reweighting network dynamically assigns weights to fused features, generating fine representations that integrate details and global semantics. Experimental results on Type A and Type B snap datasets show that SMR-Net significantly outperforms traditional Faster R-CNN: Intersection over Union (IoU) improves by 6.52% and 5.8%, and mean Average Precision (mAP) increases by 2.8% and 1.5%, respectively. This demonstrates the method's superiority in complex snap detection and localization tasks.
LiDAR-based 3D object detection models often struggle to generalize to real-world environments due to limited object diversity in existing datasets. To tackle it, we introduce the first generalized cross-domain few-shot (GCFS) task in 3D object detection, aiming to adapt a source-pretrained model to both common and novel classes in a new domain with only few-shot annotations. We propose a unified framework that learns stable target semantics under limited supervision by bridging 2D open-set semantics with 3D spatial reasoning. Specifically, an image-guided multi-modal fusion injects transferable 2D semantic cues into the 3D pipeline via vision-language models, while a physically-aware box search enhances 2D-to-3D alignment via LiDAR priors. To capture class-specific semantics from sparse data, we further introduce contrastive-enhanced prototype learning, which encodes few-shot instances into discriminative semantic anchors and stabilizes representation learning. Extensive experiments on GCFS benchmarks demonstrate the effectiveness and generality of our approach in realistic deployment settings.
Detecting objects from UAV-captured images is challenging due to the small object size. In this work, a simple and efficient adaptive zoom-in framework is explored for object detection on UAV images. The main motivation is that the foreground objects are generally smaller and sparser than those in common scene images, which hinders the optimization of effective object detectors. We thus aim to zoom in adaptively on the objects to better capture object features for the detection task. To achieve the goal, two core designs are required: i) How to conduct non-uniform zooming on each image efficiently? ii) How to enable object detection training and inference with the zoomed image space? Correspondingly, a lightweight offset prediction scheme coupled with a novel box-based zooming objective is introduced to learn non-uniform zooming on the input image. Based on the learned zooming transformation, a corner-aligned bounding box transformation method is proposed. The method warps the ground-truth bounding boxes to the zoomed space to learn object detection, and warps the predicted bounding boxes back to the original space during inference. We conduct extensive experiments on three representative UAV object detection datasets, including VisDrone, UAVDT, and SeaDronesSee. The proposed ZoomDet is architecture-independent and can be applied to an arbitrary object detection architecture. Remarkably, on the SeaDronesSee dataset, ZoomDet offers more than 8.4 absolute gain of mAP with a Faster R-CNN model, with only about 3 ms additional latency. The code is available at https://github.com/twangnh/zoomdet_code.
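A schematic sketch of the corner-aligned box transformation idea: boxes move between the original and zoomed spaces by warping their corners through the zoom map and re-taking min/max. The `warp` callable is an assumption standing in for the paper's learned offset prediction:

```python
import numpy as np

def warp_box(box, warp):
    # box: (x1, y1, x2, y2); warp: callable (x, y) -> (x', y') in the other space
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1], [x2, y1], [x1, y2], [x2, y2]], dtype=float)
    warped = np.array([warp(x, y) for x, y in corners])
    return [warped[:, 0].min(), warped[:, 1].min(),
            warped[:, 0].max(), warped[:, 1].max()]

# training:  gt_zoomed  = warp_box(gt_box,   forward_warp)
# inference: pred_orig  = warp_box(pred_box, inverse_warp)
```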
This paper presents GenDet, a novel framework that redefines object detection as an image generation task. In contrast to traditional approaches, GenDet adopts a pioneering approach by leveraging generative modeling: it conditions on the input image and directly generates bounding boxes with semantic annotations in the original image space. GenDet establishes a conditional generation architecture built upon the large-scale pre-trained Stable Diffusion model, formulating the detection task as semantic constraints within the latent space. It enables precise control over bounding box positions and category attributes, while preserving the flexibility of the generative model. This novel methodology effectively bridges the gap between generative models and discriminative tasks, providing a fresh perspective for constructing unified visual understanding systems. Systematic experiments demonstrate that GenDet achieves competitive accuracy compared to discriminative detectors, while retaining the flexibility characteristic of generative methods.
Accurate detection and classification of nuclei in histopathology images are critical for diagnostic and research applications. We present KongNet, a multi-headed deep learning architecture featuring a shared encoder and parallel, cell-type-specialised decoders. Through multi-task learning, each decoder jointly predicts nuclei centroids, segmentation masks, and contours, aided by Spatial and Channel Squeeze-and-Excitation (SCSE) attention modules and a composite loss function. We validate KongNet in three Grand Challenges. The proposed model achieved first place on track 1 and second place on track 2 during the MONKEY Challenge. Its lightweight variant (KongNet-Det) secured first place in the 2025 MIDOG Challenge. KongNet pre-trained on the MONKEY dataset and fine-tuned on the PUMA dataset ranked among the top three in the PUMA Challenge without further optimisation. Furthermore, KongNet established state-of-the-art performance on the publicly available PanNuke and CoNIC datasets. Our results demonstrate that the specialised multi-decoder design is highly effective for nuclei detection and classification across diverse tissue and stain types. The pre-trained model weights along with the inference code have been publicly released to support future research.
Center-aligned regression remains dominant in LiDAR-based 3D object detection, yet it suffers from fundamental instability: object centers often fall in sparse or empty regions of the bird's-eye-view (BEV) due to the front-surface-biased nature of LiDAR point clouds, leading to noisy and inaccurate bounding box predictions. To circumvent this limitation, we revisit bounding box representation and propose corner-aligned regression, which shifts the prediction target from unstable centers to geometrically informative corners that reside in dense, observable regions. Leveraging the inherent geometric constraints among corners and image 2D boxes, partial parameters of 3D bounding boxes can be recovered from corner annotations, enabling a weakly supervised paradigm without requiring complete 3D labels. We design a simple yet effective corner-aware detection head that can be plugged into existing detectors. Experiments on KITTI show our method improves performance by 3.5% AP over center-based baseline, and achieves 83% of fully supervised accuracy using only BEV corner clicks, demonstrating the effectiveness of our corner-aware regression strategy.
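As a geometric aside, recovering BEV box parameters from four ordered corners is direct, which is the sense in which corners carry the full box; a sketch follows (the paper's weakly supervised recovery from partial corner annotations is more involved and not reproduced):

```python
import numpy as np

def corners_to_box(corners):
    # corners: (4, 2) array, ordered around the rectangle in the BEV plane
    center = corners.mean(axis=0)
    e1 = corners[1] - corners[0]   # one edge of the rectangle
    e2 = corners[3] - corners[0]   # the adjacent edge
    yaw = np.arctan2(e1[1], e1[0])
    return center[0], center[1], np.linalg.norm(e1), np.linalg.norm(e2), yaw
```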
Weakly supervised oriented object detection (WS-OOD) has gained attention as a cost-effective alternative to fully supervised methods, providing both efficiency and high accuracy. Among weakly supervised approaches, horizontal bounding box (HBox)-supervised OOD stands out for its ability to directly leverage existing HBox annotations while achieving the highest accuracy under weak supervision settings. This paper introduces ABBSPO, a WS-OOD framework based on adaptive bounding box scaling and symmetry-prior-based orientation prediction. ABBSPO addresses a limitation of previous HBox-supervised OOD methods, which compare ground-truth (GT) HBoxes directly with the minimum circumscribed rectangles of predicted RBoxes and thus often estimate scale inaccurately. To overcome this, we propose: (i) Adaptive Bounding Box Scaling (ABBS), which appropriately scales GT HBoxes to optimize for the size of each predicted RBox, ensuring more accurate scale prediction; and (ii) a Symmetric Prior Angle (SPA) loss that exploits the inherent symmetry of aerial objects for self-supervised learning, resolving the failure mode of previous methods in which learning collapses when predictions for all three augmented views (original, rotated, and flipped) are consistently incorrect. Extensive experimental results demonstrate that ABBSPO achieves state-of-the-art performance, outperforming existing methods.
Multi-view indoor radar perception has drawn attention due to its cost-effectiveness and low privacy risks. Existing methods often rely on implicit cross-view radar feature association, such as proposal pairing in RFMask or query-to-feature cross-attention in RETR, which can lead to ambiguous feature matches and degraded detection in complex indoor scenes. To address these limitations, we propose REXO (multi-view Radar object dEtection with 3D bounding boX diffusiOn), which lifts the 2D bounding box (BBox) diffusion process of DiffusionDet into the 3D radar space. REXO utilizes these noisy 3D BBoxes to guide an explicit cross-view radar feature association, enhancing the cross-view radar-conditioned denoising process. By accounting for the prior knowledge that the person is in contact with the ground, REXO reduces the number of diffusion parameters by determining them from this prior. Evaluated on two open indoor radar datasets, our approach surpasses state-of-the-art methods by a margin of +4.22 AP on the HIBER dataset and +11.02 AP on the MMVR dataset. The REXO implementation is available at https://github.com/merlresearch/radar-bbox-diffusion.
Split computing (not to be confused with split learning) is a promising approach to deploying deep learning models in resource-constrained edge computing systems, where weak sensor (mobile) devices are wirelessly connected to stronger edge servers through channels with limited communication capacity. State-of-the-art work on split computing presents methods for single tasks such as image classification, object detection, or semantic segmentation; applying existing methods to multi-task problems degrades model accuracy and/or significantly increases runtime latency. In this study, we propose Ladon, the first multi-task-head supervised compression model for multi-task split computing. Experimental results show that the multi-task supervised compression model either outperformed or rivaled strong lightweight baseline models in predictive performance on the ILSVRC 2012, COCO 2017, and PASCAL VOC 2012 datasets while learning compressed representations at its early layers. Furthermore, our models reduced end-to-end latency (by up to 95.4%) and the energy consumption of mobile devices (by up to 88.2%) in multi-task split computing scenarios.
Atypical mitotic figures (AMFs) represent abnormal cell division associated with poor prognosis. Yet their detection remains difficult due to low prevalence, subtle morphology, and inter-observer variability. The MIDOG 2025 challenge introduces a benchmark for AMF classification across multiple domains. In this work, we fine-tuned the recently published DINOv3-H+ vision transformer, pretrained on natural images, using low-rank adaptation (LoRA), training only ~1.3M parameters in combination with extensive augmentation and a domain-weighted Focal Loss to handle domain heterogeneity. Despite the domain gap, our fine-tuned DINOv3 transfers effectively to histopathology, reaching first place on the final test set. These results highlight the advantages of DINOv3 pretraining and underline the efficiency and robustness of our fine-tuning strategy, yielding state-of-the-art results for the atypical mitosis classification challenge in MIDOG 2025.
FastCAR is a novel task consolidation approach in Multi-Task Learning (MTL) for a classification task and a regression task, despite the non-triviality of task heterogeneity with only a subtle correlation between them. The approach addresses the classification of a detected object (occupying the entire image frame) and regression for modeling a continuous property variable (for instances of an object class), a crucial use case in science and engineering. FastCAR involves a label transformation approach amenable for use with a single-task regression network architecture. FastCAR outperforms traditional MTL model families, parametrized over the landscape of architectures and loss-weighting schemes, when learning of both tasks is considered collectively (classification accuracy of 99.54%, regression mean absolute percentage error of 2.4%). The experiments used the "Advanced Steel Property Dataset" contributed by us (https://github.com/fastcandr/AdvancedSteel-Property-Dataset). The dataset comprises 4536 images of 224x224 pixels, annotated with discrete object classes and a hardness property that takes continuous values. Our proposed FastCAR approach to task consolidation achieves training time efficiency (2.52x quicker) and reduced inference latency (55% faster) compared with benchmark MTL networks.
Indoor localization faces persistent challenges in achieving high accuracy, particularly in GPS-deprived environments. This study unveils a cutting-edge handheld indoor localization system that integrates 2D LiDAR and IMU sensors, delivering enhanced high-velocity precision mapping, computational efficiency, and real-time adaptability. Unlike 3D LiDAR systems, it excels with rapid processing, low-cost scalability, and robust performance, setting new standards for emergency response, autonomous navigation, and industrial automation. Enhanced with a CNN-driven object detection framework and optimized through Cartographer SLAM (simultaneous localization and mapping) in ROS, the system significantly reduces Absolute Trajectory Error (ATE) by 21.03%, achieving exceptional precision compared to state-of-the-art approaches like SC-ALOAM, with a mean x-position error of -0.884 meters (1.976 meters). The integration of CNN-based object detection ensures robustness in mapping and localization, even in cluttered or dynamic environments, outperforming existing methods by 26.09%. These advancements establish the system as a reliable, scalable solution for high-precision localization in challenging indoor scenarios.
Manual annotation of anatomical landmarks on 3D facial scans is a time-consuming and expertise-dependent task, yet it remains critical for clinical assessments, morphometric analysis, and craniofacial research. While several deep learning methods have been proposed for facial landmark localization, most focus on pseudo-landmarks or require complex input representations, limiting their clinical applicability. This study presents a fully automated deep learning pipeline (PAL-Net) for localizing 50 anatomical landmarks on stereo-photogrammetry facial models. The method combines coarse alignment, region-of-interest filtering, and an initial approximation of landmarks with a patch-based pointwise CNN enhanced by attention mechanisms. Trained and evaluated on 214 annotated scans from healthy adults, PAL-Net achieved a mean localization error of 3.686 mm and preserves relevant anatomical distances with a 2.822 mm average error, comparable to intra-observer variability. To assess generalization, the model was further evaluated on 700 subjects from the FaceScape dataset, achieving a point-wise error of 0.41 mm and a distance-wise error of 0.38 mm. Compared to existing methods, PAL-Net offers a favorable trade-off between accuracy and computational cost. While performance degrades in regions with poor mesh quality (e.g., ears, hairline), the method demonstrates consistent accuracy across most anatomical regions. PAL-Net generalizes effectively across datasets and facial regions, outperforming existing methods in both point-wise and structural evaluations. It provides a lightweight, scalable solution for high-throughput 3D anthropometric analysis, with potential to support clinical workflows and reduce reliance on manual annotation. Source code can be found at https://github.com/Ali5hadman/PAL-Net-A-Point-Wise-CNN-with-Patch-Attention
In the field of X-ray security applications, even the smallest details can significantly impact outcomes. Objects that are heavily occluded or intentionally concealed pose a great challenge for detection, whether by human observation or through advanced technological applications. While certain Deep Learning (DL) architectures demonstrate strong performance in processing local information, such as Convolutional Neural Networks (CNNs), others excel in handling distant information, e.g., transformers. In X-ray security imaging the literature has been dominated by the use of CNN-based methods, while the integration of the two aforementioned leading architectures has not been sufficiently explored. In this paper, various hybrid CNN-transformer architectures are evaluated against a common CNN object detection baseline, namely YOLOv8. In particular, a CNN (HGNetV2) and a hybrid CNN-transformer (Next-ViT-S) backbone are combined with different CNN/transformer detection heads (YOLOv8 and RT-DETR). The resulting architectures are comparatively evaluated on three challenging public X-ray inspection datasets, namely EDS, HiXray, and PIDray. Interestingly, while the YOLOv8 detector with its default backbone (CSP-DarkNet53) is generally shown to be advantageous on the HiXray and PIDray datasets, when a domain distribution shift is incorporated in the X-ray images (as happens in the EDS datasets), hybrid CNN-transformer architectures exhibit increased robustness. Detailed comparative evaluation results, including object-level detection performance and object-size error analysis, demonstrate the strengths and weaknesses of each architectural combination and suggest guidelines for future research. The source code and network weights of the models employed in this study are available at https://github.com/jgenc/xray-comparative-evaluation.
This study proposes a method for improved object detection on low-resolution images by integrating Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN) and Faster Region-based Convolutional Neural Networks (Faster R-CNN). ESRGAN enhances low-quality images, restoring details and improving clarity, while Faster R-CNN performs accurate object detection on the enhanced images. The combination of these techniques ensures better detection performance even with poor-quality inputs, offering an effective solution for applications where image resolution is inconsistent. ESRGAN is employed as a pre-processing step to enhance the low-resolution input image, effectively restoring lost details and improving overall image quality. Subsequently, the enhanced image is fed into the Faster R-CNN model for accurate object detection and localization. Experimental results demonstrate that this integrated approach yields superior performance compared to traditional methods applied directly to low-resolution images. The proposed framework provides a promising solution for applications where image quality is variable or limited, enabling more robust and reliable object detection in challenging scenarios, and achieves a balance between improved image quality and efficient object detection.
We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.
Current methods for incremental object detection (IOD) primarily rely on Faster R-CNN or DETR series detectors; however, these approaches do not accommodate the real-time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO-based incremental detectors: foreground-background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO-IOD, a real-time IOD framework constructed upon the pretrained YOLO-World model, facilitating incremental learning via a stage-wise parameter-efficient fine-tuning process. Specifically, YOLO-IOD encompasses three principal components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates foreground-background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importance-based Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO-IOD achieves superior performance with minimal forgetting.
Real-time object detectors like YOLO achieve exceptional performance when trained on large datasets for multiple epochs. However, in real-world scenarios where data arrives incrementally, neural networks suffer from catastrophic forgetting, leading to a loss of previously learned knowledge. To address this, prior research has explored strategies for Class Incremental Learning (CIL) in Continual Learning for Object Detection (CLOD), with most approaches focusing on two-stage object detectors. However, existing work suggests that Learning without Forgetting (LwF) may be ineffective for one-stage anchor-free detectors like YOLO due to noisy regression outputs, which risk transferring corrupted knowledge. In this work, we introduce YOLO LwF, a self-distillation approach tailored for YOLO-based continual object detection. We demonstrate that when coupled with a replay memory, YOLO LwF significantly mitigates forgetting. Compared to previous approaches, it achieves state-of-the-art performance, improving mAP by +2.1% and +2.9% on the VOC and COCO benchmarks, respectively.
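A minimal sketch of the self-distillation component of such an approach: the frozen previous model acts as teacher on new and replayed images, and the student's per-anchor class scores are pulled toward it. The handling of noisy regression outputs highlighted above is not reproduced:

```python
# Soft-target distillation on classification outputs, the LwF-style term that,
# combined with a replay memory, mitigates forgetting.
import torch
import torch.nn.functional as F

def lwf_cls_loss(student_cls_logits, teacher_cls_logits, T=2.0):
    # logits: (num_anchors, num_classes); T is the distillation temperature
    p_t = F.softmax(teacher_cls_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_cls_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

# total = task_loss(new_batch + replay_batch) + lambda_ * lwf_cls_loss(...)
```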
This paper provides an extensive evaluation of YOLO object detection models (v5, v8, v9, v10, v11) by comparing their performance across various hardware platforms and optimization libraries. Our study investigates inference speed and detection accuracy on Intel and AMD CPUs using popular libraries such as ONNX and OpenVINO, as well as on GPUs through TensorRT and other GPU-optimized frameworks. Furthermore, we analyze the sensitivity of these YOLO models to object size within the image, examining performance when detecting objects that occupy 1%, 2.5%, and 5% of the total area of the image. By identifying the trade-offs in efficiency, accuracy, and object size adaptability, this paper offers insights for optimal model selection based on specific hardware constraints and detection requirements, aiding practitioners in deploying YOLO models effectively for real-world applications.
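A hedged sketch of the kind of cross-runtime timing such an evaluation performs, assuming the Ultralytics export API and onnxruntime; the input shape and iteration count below are illustrative:

```python
# Export a YOLO model to ONNX and measure raw inference throughput on CPU.
import time
import numpy as np
import onnxruntime as ort
from ultralytics import YOLO

YOLO("yolov8n.pt").export(format="onnx")  # writes yolov8n.onnx

sess = ort.InferenceSession("yolov8n.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(1, 3, 640, 640).astype(np.float32)
inp = sess.get_inputs()[0].name

start = time.perf_counter()
for _ in range(100):
    sess.run(None, {inp: x})
print(f"{100 / (time.perf_counter() - start):.1f} inferences/s")
```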
Automated property risk detection is a high-impact yet underexplored frontier in computer vision with direct implications for real estate, underwriting, and insurance operations. We introduce HOMEY (Heuristic Object Masking with Enhanced YOLO), a novel detection framework that combines YOLO with a domain-specific masking mechanism and a custom-designed loss function. HOMEY is trained to detect 17 risk-related property classes, including structural damages (e.g., cracked foundations, roof issues), maintenance neglect (e.g., dead yards, overgrown bushes), and liability hazards (e.g., falling gutters, garbage, hazard signs). Our approach introduces heuristic object masking to amplify weak signals in cluttered backgrounds and risk-aware loss calibration to balance class skew and severity weighting. Experiments on real-world property imagery demonstrate that HOMEY achieves superior detection accuracy and reliability compared to baseline YOLO models, while retaining fast inference. Beyond detection, HOMEY enables interpretable and cost-efficient risk analysis, laying the foundation for scalable AI-driven property insurance workflows.
Small object detection remains a challenging problem in the field of object detection. To address this challenge, we propose an enhanced YOLOv8-based model, SOD-YOLO. This model integrates an ASF mechanism in the neck to enhance multi-scale feature fusion, adds a Small Object Detection Layer (named P2) to provide higher-resolution feature maps for better small object detection, and employs Soft-NMS to refine confidence scores and retain true positives. Experimental results demonstrate that SOD-YOLO significantly improves detection performance, achieving a 36.1% increase in mAP50:95 and a 20.6% increase in mAP50 on the VisDrone2019-DET dataset compared to the baseline model. These enhancements make SOD-YOLO a practical and efficient solution for small object detection in UAV imagery. Our source code, hyper-parameters, and model weights are available at https://github.com/iamwangxiaobai/SOD-YOLO.
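Soft-NMS, the confidence-refinement step SOD-YOLO adopts, decays the scores of boxes that overlap a higher-scoring box instead of discarding them; a self-contained sketch of the Gaussian variant:

```python
import numpy as np

def iou(box, boxes):
    # box: (4,), boxes: (N, 4), all in x1, y1, x2, y2 format
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a = (box[2] - box[0]) * (box[3] - box[1])
    b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (a + b - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    keep = []
    while scores.size > 0:
        i = int(scores.argmax())
        keep.append(boxes[i])
        decay = np.exp(-iou(boxes[i], boxes) ** 2 / sigma)  # Gaussian penalty
        scores = scores * decay        # decay overlapping scores, no hard removal
        mask = scores > score_thresh   # drop boxes whose score fell too low
        mask[i] = False                # remove the box just kept
        boxes, scores = boxes[mask], scores[mask]
    return np.array(keep)
```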
Underwater object detection (UOD) remains a critical challenge in computer vision due to underwater distortions which degrade low-level features and compromise the reliability of even state-of-the-art detectors. While YOLO models have become the backbone of real-time object detection, little work has systematically examined their robustness under these uniquely challenging conditions. This raises a critical question: Are YOLO models genuinely robust when operating under the chaotic and unpredictable conditions of underwater environments? In this study, we present one of the first comprehensive evaluations of recent YOLO variants (YOLOv8-YOLOv12) across six simulated underwater environments. Using a unified dataset of 10,000 annotated images from DUO and Roboflow100, we not only benchmark model robustness but also analyze how distortions affect key low-level features such as texture, edges, and color. Our findings show that (1) YOLOv12 delivers the strongest overall performance but is highly vulnerable to noise, and (2) noise disrupts edge and texture features, explaining the poor detection performance in noisy images. Class imbalance is a persistent challenge in UOD. Experiments revealed that (3) image counts and instance frequency primarily drive detection performance, while object appearance exerts only a secondary influence. Finally, we evaluated lightweight training-aware strategies: noise-aware sample injection, which improves robustness in both noisy and real-world conditions, and fine-tuning with advanced enhancement, which boosts accuracy in enhanced domains but slightly lowers performance on the original data while demonstrating strong potential for domain adaptation. Together, these insights provide practical guidance for building resilient and cost-efficient UOD systems.
Underwater object detection is critical for oceanic research and industrial safety inspections. However, the complex optical environment and the limited resources of underwater equipment pose significant challenges to achieving high accuracy and low power consumption. To address these issues, we propose Spiking Underwater YOLO (SU-YOLO), a Spiking Neural Network (SNN) model. Leveraging the lightweight and energy-efficient properties of SNNs, SU-YOLO incorporates a novel spike-based underwater image denoising method based solely on integer addition, which enhances the quality of feature maps with minimal computational overhead. In addition, we introduce Separated Batch Normalization (SeBN), a technique that normalizes feature maps independently across multiple time steps and is optimized for integration with residual structures to capture the temporal dynamics of SNNs more effectively. The redesigned spiking residual blocks integrate the Cross Stage Partial Network (CSPNet) with the YOLO architecture to mitigate spike degradation and enhance the model's feature extraction capabilities. Experimental results on the URPC2019 underwater dataset demonstrate that SU-YOLO achieves mAP of 78.8% with 6.97M parameters and an energy consumption of 2.98 mJ, surpassing mainstream SNN models in both detection accuracy and computational efficiency. These results underscore the potential of SNNs for engineering applications. The code is available at https://github.com/lwxfight/snn-underwater.
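A schematic sketch of the Separated Batch Normalization (SeBN) idea as described above: one BatchNorm2d, with its own statistics, per SNN time step. The integration with SU-YOLO's residual blocks is more involved and not reproduced here:

```python
import torch
import torch.nn as nn

class SeBN(nn.Module):
    """Normalize feature maps independently at each time step."""
    def __init__(self, channels: int, time_steps: int):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(time_steps))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, B, C, H, W); a separate BN, with its own statistics, per step t
        return torch.stack([bn(x[t]) for t, bn in enumerate(self.bns)], dim=0)
```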
Object detection and classification are crucial tasks across various application domains, particularly in the development of safe and reliable Advanced Driver Assistance Systems (ADAS). Existing deep learning-based methods such as Convolutional Neural Networks (CNNs), Single Shot Detectors (SSDs), and You Only Look Once (YOLO) have demonstrated high performance in terms of accuracy and computational speed when deployed on Field-Programmable Gate Arrays (FPGAs). However, despite these advances, state-of-the-art YOLO-based object detection and classification systems continue to face challenges in achieving resource efficiency suitable for edge FPGA platforms. To address this limitation, this paper presents a resource-efficient real-time object detection and classification system based on YOLOv5 optimized for FPGA deployment. The proposed system is trained on the COCO and GTSRD datasets and implemented on the Xilinx Kria KV260 FPGA board. Experimental results demonstrate a classification accuracy of 99%, with a power consumption of 3.5W and a processing speed of 9 frames per second (FPS). These findings highlight the effectiveness of the proposed approach in enabling real-time, resource-efficient object detection and classification for edge computing applications.
The real-time detection of small objects in complex scenes, such as unmanned aerial vehicle (UAV) photography captured by drones, poses the dual challenges of detecting small targets (<32 pixels) and maintaining real-time efficiency on resource-constrained platforms. While YOLO-series detectors have achieved remarkable success in real-time large object detection, they suffer from significantly higher false negative rates in drone-based detection, where small objects dominate, compared to large object scenarios. This paper proposes HierLight-YOLO, a hierarchical feature fusion and lightweight model that enhances real-time small object detection, based on the YOLOv8 architecture. We propose the Hierarchical Extended Path Aggregation Network (HEPAN), a multi-scale feature fusion method using hierarchical cross-level connections, enhancing small object detection accuracy. HierLight-YOLO includes two innovative lightweight modules: the Inverted Residual Depthwise Convolution Block (IRDCB) and the Lightweight Downsample (LDown) module, which significantly reduce the model's parameters and computational complexity without sacrificing detection capability. A small object detection head is designed to further enhance spatial resolution and feature fusion for tiny object (4 pixels) detection. Comparison experiments and ablation studies on the VisDrone2019 benchmark demonstrate the state-of-the-art performance of HierLight-YOLO.
Recent advancements in computer vision and deep learning have enhanced disaster-response capabilities, particularly in the rapid assessment of earthquake-affected urban environments. Timely identification of accessible entry points and structural obstacles is essential for effective search-and-rescue (SAR) operations. To address this need, we introduce DRespNeT, a high-resolution dataset specifically developed for aerial instance segmentation of post-earthquake structural environments. Unlike existing datasets, which rely heavily on satellite imagery or coarse semantic labeling, DRespNeT provides detailed polygon-level instance segmentation annotations derived from high-definition (1080p) aerial footage captured in disaster zones, including the 2023 Turkiye earthquake and other impacted regions. The dataset comprises 28 operationally critical classes, including structurally compromised buildings, access points such as doors, windows, and gaps, multiple debris levels, rescue personnel, vehicles, and civilian visibility. A distinctive feature of DRespNeT is its fine-grained annotation detail, enabling differentiation between accessible and obstructed areas, thereby improving operational planning and response efficiency. Performance evaluations using YOLO-based instance segmentation models, specifically YOLOv8-seg, demonstrate significant gains in real-time situational awareness and decision-making. Our optimized YOLOv8-DRN model achieves 92.7% mAP50 with an inference speed of 27 FPS on an RTX-4090 GPU for multi-target detection, meeting real-time operational requirements. The dataset and models support SAR teams and robotic systems, providing a foundation for enhancing human-robot collaboration, streamlining emergency response, and improving survivor outcomes.
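For orientation, a minimal usage sketch of running a YOLOv8 instance-segmentation checkpoint with the public ultralytics API is shown below; the weights filename and image path are hypothetical placeholders, not the released YOLOv8-DRN model.

```python
# Usage sketch: YOLOv8-seg inference via the public `ultralytics` package.
# "yolov8_drn.pt" and "aerial_frame.jpg" are assumed placeholder names.
from ultralytics import YOLO

model = YOLO("yolov8_drn.pt")   # a fine-tuned YOLOv8-seg checkpoint (assumed)
results = model.predict("aerial_frame.jpg", conf=0.25, verbose=False)

for r in results:
    # r.boxes holds per-instance boxes and class ids; r.masks holds the
    # polygon-level instance masks used for accessibility assessment.
    for box, cls in zip(r.boxes.xyxy, r.boxes.cls):
        print(int(cls), [round(float(v), 1) for v in box])
```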
Visual grouping, operationalized through tasks such as instance segmentation, visual grounding, and object detection, enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large-scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce Synthetic Object Compositions (SOC), an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. It composes high-quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask-area-weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just 100K of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy-Paste, X-Paste, SynGround, SegGen) by 24-36%, achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Beyond the general open-vocabulary setup, SOC also enables controllable dataset construction for different use cases and boosts performance in both low-data and closed-vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real-data scales and yields even greater improvements under extremely limited real-data conditions, including +6.59 AP on a 1% COCO data setup. Furthermore, this controllability enables targeted data generation for intra-class referring, a diagnostic grounding task we propose that requires fine-grained attribute discrimination.
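The compositing step can be illustrated with a small sketch: soft-mask alpha blending of an object segment onto a background, one ingredient of a SOC-style pipeline. The exact mask-area weighting, layout augmentation, and generative harmonization are not reproduced here.

```python
# Minimal sketch of object-centric compositing via soft-mask alpha blending.
import numpy as np

def composite(background: np.ndarray, segment: np.ndarray,
              mask: np.ndarray, top: int, left: int) -> np.ndarray:
    """Alpha-blend an RGB object segment onto a background at (top, left)."""
    out = background.astype(np.float32).copy()
    h, w = mask.shape
    alpha = mask[..., None].astype(np.float32)      # soft mask in [0, 1]
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * segment + (1 - alpha) * region
    return out.astype(np.uint8)

bg = np.zeros((256, 256, 3), np.uint8)
obj = np.full((64, 64, 3), 200, np.uint8)           # toy object cutout
m = np.ones((64, 64), np.float32)                    # its (soft) mask
print(composite(bg, obj, m, 96, 96).shape)           # (256, 256, 3)
```

The mask used for pasting doubles as the ground-truth instance annotation, which is what makes this kind of synthesis labels-for-free.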
Nuclear instance segmentation and classification provide critical quantitative foundations for digital pathology diagnosis. With the advent of the foundational Segment Anything Model (SAM), the accuracy and efficiency of nuclear segmentation have improved significantly. However, SAM relies strongly on precise prompts, and its class-agnostic design makes its classification results entirely dependent on the provided prompts. We therefore focus on generating prompts with more accurate localization and classification and propose APSeg, an Auto-Prompt model with acquired and injected knowledge for nuclear instance Segmentation and classification. APSeg incorporates two knowledge-aware modules: (1) the Distribution-Guided Proposal Offset Module (DG-POM), which learns distribution knowledge through density-map guidance, and (2) the Category Knowledge Semantic Injection Module (CK-SIM), which injects morphological knowledge derived from category descriptions. We conducted extensive experiments on the PanNuke and CoNSeP datasets, demonstrating the effectiveness of our approach. The code will be released upon acceptance.
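Since APSeg's contribution is the prompt generation rather than the segmenter itself, the sketch below shows only the downstream half: feeding class-tagged point prompts to SAM through the public segment_anything API. The prompt coordinates, class names, and checkpoint path are hypothetical; DG-POM and CK-SIM are not reproduced.

```python
# Sketch: prompt-driven nuclear segmentation with the public `segment_anything`
# API. SAM is class-agnostic, so the class label rides alongside each prompt.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # local weights (assumed path)
predictor = SamPredictor(sam)
tile = np.array(Image.open("pannuke_tile.png").convert("RGB"))        # assumed input
predictor.set_image(tile)

# One (x, y) point per predicted nucleus; in APSeg these would come from the
# learned prompt generator rather than being hard-coded.
prompts = [((120, 88), "neoplastic"), ((64, 200), "inflammatory")]    # hypothetical
for (x, y), cls in prompts:
    masks, scores, _ = predictor.predict(point_coords=np.array([[x, y]]),
                                         point_labels=np.array([1]),
                                         multimask_output=False)
    print(cls, masks[0].shape, float(scores[0]))
```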
One of the bottlenecks for instance segmentation today lies in the conflicting requirements of high-resolution inputs and lightweight, real-time inference. To address this bottleneck, we present the Polygon Detection Transformer (Poly-DETR), which reformulates instance segmentation as sparse vertex regression via a polar representation, thereby eliminating the reliance on dense pixel-wise mask prediction. To handle the box-to-polygon reference shift in Detection Transformers, we propose Polar Deformable Attention and a Position-Aware Training Scheme that dynamically update supervision and focus attention on boundary cues. Compared with state-of-the-art polar-based methods, Poly-DETR achieves a 4.7 mAP improvement on MS COCO test-dev. Moreover, we construct a parallel mask-based counterpart to support a systematic comparison between polar and mask representations. Experimental results show that Poly-DETR is more lightweight in high-resolution scenarios, reducing memory consumption by almost half on the Cityscapes dataset. Notably, on the PanNuke (cell segmentation) and SpaceNet (building footprints) datasets, Poly-DETR surpasses its mask-based counterpart on all metrics, validating its advantage on regular-shaped instances in domain-specific settings.
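To make the polar representation concrete, here is a rough NumPy sketch that encodes a polygon as a center plus K radii at fixed angles and decodes it back; the number of rays and the nearest-vertex lookup (a crude stand-in for proper ray casting) are assumptions, not Poly-DETR's parameterization.

```python
# Sketch of a polar contour encoding: instance = center + K radii at fixed angles.
import numpy as np

def polygon_to_polar(vertices: np.ndarray, k: int = 36):
    """vertices: [N, 2] closed contour -> (center, radii at k fixed angles)."""
    center = vertices.mean(axis=0)
    angles = np.linspace(0.0, 2 * np.pi, k, endpoint=False)
    rel = vertices - center
    theta = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
    dist = np.hypot(rel[:, 0], rel[:, 1])
    # nearest-vertex lookup per target angle (crude stand-in for ray casting)
    radii = np.array([dist[np.argmin(np.abs((theta - a + np.pi) % (2 * np.pi) - np.pi))]
                      for a in angles])
    return center, radii

def polar_to_polygon(center: np.ndarray, radii: np.ndarray) -> np.ndarray:
    angles = np.linspace(0.0, 2 * np.pi, len(radii), endpoint=False)
    return center + np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)

square = np.array([[0, 0], [4, 0], [4, 4], [0, 4]], float)
c, r = polygon_to_polar(square, k=8)
print(c, polar_to_polygon(c, r).shape)   # [2. 2.] (8, 2)
```

The appeal of this encoding is that K scalars replace a dense H x W mask, which is where the high-resolution memory savings come from.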
Transformer-based real-time object detectors achieve strong accuracy-latency trade-offs, and D-FINE is among the top-performing recent architectures. However, real-time instance segmentation with transformers is still less common. We present D-FINE-seg, an instance segmentation extension of D-FINE that adds a lightweight mask head; segmentation-aware training, including box-cropped BCE and Dice mask losses; auxiliary and denoising mask supervision; and an adapted Hungarian matching cost. On the TACO dataset, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol while maintaining competitive latency. A second contribution is an end-to-end pipeline for training, exporting, and optimized inference across ONNX, TensorRT, and OpenVINO for both object detection and instance segmentation tasks. This framework is released as open source under the Apache-2.0 license. GitHub repository: https://github.com/ArgoHA/D-FINE-seg.
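A minimal sketch of the named mask losses, BCE and Dice restricted to the matched ground-truth box crop, is given below; crop handling, loss weighting, and the Hungarian matching itself are assumptions rather than the released training code.

```python
# Sketch: BCE + Dice mask loss evaluated only inside the ground-truth box crop,
# so loss signal concentrates on the instance rather than the full image.
import torch
import torch.nn.functional as F

def box_cropped_mask_loss(logits: torch.Tensor, target: torch.Tensor,
                          box: tuple, eps: float = 1e-6) -> torch.Tensor:
    """logits/target: [H, W]; box: (x1, y1, x2, y2) integer pixel crop."""
    x1, y1, x2, y2 = box
    l, t = logits[y1:y2, x1:x2], target[y1:y2, x1:x2]
    bce = F.binary_cross_entropy_with_logits(l, t)
    p = l.sigmoid()
    dice = 1 - (2 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps)
    return bce + dice

logits = torch.randn(64, 64)
target = (torch.rand(64, 64) > 0.5).float()
print(float(box_cropped_mask_loss(logits, target, (8, 8, 40, 40))))
```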
Large-scale delineation of individual trees from remote sensing imagery is crucial to the advancement of ecological research, particularly as climate change and other environmental factors rapidly transform forest landscapes across the world. Current RGB tree segmentation methods rely on training specialized machine learning models with labeled tree datasets. While these learning-based approaches can outperform manual data collection when accurate, the existing models still depend on training data that is hard to scale. In this paper, we investigate the efficacy of using a state-of-the-art image segmentation model, Segment Anything Model 2 (SAM2), in a zero-shot manner for individual tree detection and segmentation. We evaluate a pretrained SAM2 model on two tasks in this domain: (1) zero-shot segmentation and (2) zero-shot transfer, using predictions from an existing tree detection model as prompts. Our results suggest that SAM2 not only has impressive generalization capabilities but can also form a natural synergy with specialized methods trained on in-domain labeled data. We find that applying large pretrained models to problems in remote sensing is a promising avenue for future progress. We make our code available at https://github.com/open-forest-observatory/tree-detection-framework.
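The zero-shot-transfer recipe can be sketched in a few lines with the public sam2 package: boxes from an existing tree detector become prompts for SAM2. The checkpoint name, image path, and detector output below are illustrative placeholders.

```python
# Sketch: detector boxes as SAM2 prompts (zero-shot transfer), using the
# public `sam2` package. Paths and boxes are assumed placeholders.
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
image = np.array(Image.open("orthomosaic_tile.jpg").convert("RGB"))
predictor.set_image(image)

tree_boxes = np.array([[120, 80, 260, 230]])   # hypothetical detector output, xyxy
for box in tree_boxes:
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    print(masks.shape, float(scores[0]))        # one crown mask per prompt box
```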
Accurate building instance segmentation and height classification are critical for urban planning, 3D city modeling, and infrastructure monitoring. This paper presents a detailed analysis of YOLOv11, the recent advancement in the YOLO series of deep learning models, focusing on its application to joint building extraction and discrete height classification from satellite imagery. YOLOv11 builds on the strengths of earlier YOLO models by introducing a more efficient architecture that better combines features at different scales, improves object localization accuracy, and enhances performance in complex urban scenes. Using the DFC2023 Track 2 dataset, which includes over 125,000 annotated buildings across 12 cities, we evaluate YOLOv11's performance using metrics such as precision, recall, F1 score, and mean average precision (mAP). Our findings demonstrate that YOLOv11 achieves strong instance segmentation performance, with 60.4% mAP@50 and 38.3% mAP@50-95, while maintaining robust classification accuracy across five predefined height tiers. The model excels in handling occlusions, complex building shapes, and class imbalance, particularly for rare high-rise structures. Comparative analysis confirms that YOLOv11 outperforms earlier multitask frameworks in both detection accuracy and inference speed, making it well-suited for real-time, large-scale urban mapping. This research highlights YOLOv11's potential to advance semantic urban reconstruction through streamlined categorical height modeling, offering actionable insights for future developments in remote sensing and geospatial intelligence.
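For readers less familiar with the reported metrics, the sketch below works through the IoU matching behind mAP@50: a prediction counts as a true positive when its best IoU with a still-unmatched ground truth is at least 0.5, with predictions matched greedily by confidence. This illustrates the metric only, not the paper's evaluator.

```python
# Worked sketch of IoU matching at the 0.5 threshold underlying mAP@50.
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def tp_at_50(preds, gts):
    """preds: [(score, box)], gts: [box] -> TP flags in confidence order."""
    matched, flags = set(), []
    for score, box in sorted(preds, key=lambda p: -p[0]):
        best = max(((iou(box, g), i) for i, g in enumerate(gts) if i not in matched),
                   default=(0.0, -1))
        ok = best[0] >= 0.5
        if ok:
            matched.add(best[1])   # each ground truth matches at most once
        flags.append(ok)
    return flags

print(tp_at_50([(0.9, (0, 0, 10, 10)), (0.7, (20, 20, 30, 30))],
               [(1, 1, 10, 10)]))   # [True, False]
```

mAP@50-95 repeats this matching at IoU thresholds from 0.5 to 0.95 in steps of 0.05 and averages the resulting APs, which is why it is the stricter of the two numbers.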
Real-time object detection plays an essential part in the decision-making process of numerous real-world applications, including collision avoidance and path planning in autonomous driving systems. This paper presents a novel real-time streaming perception method named CorrDiff, designed to tackle the challenge of delays in real-time detection systems. The main contribution of CorrDiff lies in its adaptive delay-aware detector, which utilizes runtime-estimated temporal cues to predict objects' locations for multiple future frames and selectively produces the predictions that match real-world time, effectively compensating for communication and computational delays. By leveraging motion estimation and feature enhancement, the proposed model outperforms current state-of-the-art methods both on (1) single-frame detection for the current or next frame, in terms of mAP, and (2) prediction for one or more future frames, in terms of sAP (a metric that evaluates object detection in streaming scenarios, factoring in both latency and accuracy). It demonstrates robust performance across a range of devices, from the powerful Tesla V100 to the modest RTX 2080Ti, achieving the highest perceptual accuracy on all platforms. Unlike most state-of-the-art methods, which struggle to complete computation within a single frame interval on less powerful devices, CorrDiff meets stringent real-time processing requirements on all kinds of devices. The experimental results emphasize the system's adaptability and its potential to significantly improve the safety and reliability of many real-world systems, such as autonomous driving. Our code is fully open source and available at https://anonymous.4open.science/r/CorrDiff.
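The delay-aware selection can be illustrated with a toy scheduler: the detector emits predictions for several future horizons, and the runtime picks the horizon that best covers the measured end-to-end delay. The selection rule and frame rate below are assumptions, not CorrDiff's actual mechanism.

```python
# Toy sketch of delay-aware output selection for streaming perception.
import time

def select_horizon(predictions: dict, delay_s: float, frame_dt: float = 1 / 30):
    """predictions: {horizon_in_frames: boxes}; pick the horizon covering the delay."""
    needed = round(delay_s / frame_dt)
    horizon = min(predictions, key=lambda h: abs(h - needed))
    return horizon, predictions[horizon]

t0 = time.perf_counter()
preds = {0: ["boxes@t"], 1: ["boxes@t+1"], 2: ["boxes@t+2"]}   # placeholder outputs
measured_delay = time.perf_counter() - t0 + 0.045              # e.g. ~45 ms pipeline lag
print(select_horizon(preds, measured_delay))                   # -> (1, ['boxes@t+1'])
```

The underlying idea is that by the time a detection is emitted the world has moved on, so the prediction that is acted upon should describe the frame that is current at emission time, not at capture time.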
Despite the remarkable achievements in object detection, model accuracy and efficiency still require further improvement under challenging underwater conditions, such as low image quality and limited computational resources. To address this, we propose an ultra-light real-time underwater object detection framework, You Sense Only Once Beneath (YSOOB). Specifically, we utilize a Multi-Spectrum Wavelet Encoder (MSWE) to perform frequency-domain encoding on the input image, minimizing the semantic loss caused by underwater optical color distortion. Furthermore, we revisit the unique characteristics of even-sized and transposed convolutions, allowing the model to dynamically select and enhance key information during the resampling process, thereby improving its generalization ability. Finally, we eliminate model redundancy through a simple yet effective channel compression and a reconstructed large-kernel convolution (RLKC), achieving a lightweight model. The result is a high-performance underwater object detector, YSOOB, with only 1.2 million parameters. Extensive experimental results demonstrate that, with the fewest parameters, YSOOB achieves mAP50 scores of 83.1% and 82.9% on the URPC2020 and DUO datasets, respectively, comparable to current SOTA detectors. Its inference speed reaches 781.3 FPS on a T4 GPU (TensorRT FP16) and 57.8 FPS on the edge computing device Jetson Xavier NX (TensorRT FP16), surpassing YOLOv12-N by 28.1% and 22.5%, respectively.
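As a small illustration of the kind of frequency-domain encoding an MSWE-style front end builds on, below is a single-level 2D Haar decomposition in PyTorch; the paper's multi-spectrum design and channel handling are not reproduced.

```python
# Sketch: single-level 2D Haar wavelet transform, splitting an image into one
# low-frequency band and three half-resolution detail bands.
import torch

def haar_dwt2(x: torch.Tensor):
    """x: [B, C, H, W] with even H, W -> (LL, LH, HL, HH), each half resolution."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]   # even rows: even/odd columns
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]   # odd rows: even/odd columns
    ll = (a + b + c + d) / 2                         # low-frequency average
    lh = (a - b + c - d) / 2                         # detail across columns
    hl = (a + b - c - d) / 2                         # detail across rows
    hh = (a - b - c + d) / 2                         # diagonal detail
    return ll, lh, hl, hh

x = torch.randn(1, 3, 640, 640)
print([t.shape for t in haar_dwt2(x)])   # 4 x torch.Size([1, 3, 320, 320])
```

Working on such bands lets a network treat color-distorted low-frequency content and edge-carrying high-frequency content separately, which is the intuition behind wavelet front ends for underwater imagery.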
After merging, the unified grouping splits "classification and recognition/localization methods" into parallel research threads: methodology surveys and benchmarks; open-vocabulary video instance segmentation; 3D geometry and encoding constraints; oriented-object localization representations; small objects and scale adaptation; geometric modeling and boundary/box-regression enhancement; real-time, deployable end-to-end multi-task/instance segmentation and streaming latency; energy efficiency, compression, and federated learning under resource constraints; few-shot, domain-generalization, and continual-learning adaptation; data synthesis and prompt-driven approaches; spatio-temporal cross-domain enhancement at the representation level; generative detection paradigms; domain-specific application customization; Faster R-CNN adaptations; multimodal/language-input enhancement; instance-level representation (joint detection-segmentation and polygon forms); and localization outputs extended to classification-regression, monocular 3D, and keypoint localization. Overall, this reflects the field's trajectory from iterating on traditional detectors toward stronger geometric modeling, scale/context enhancement, real-time on-device deployment, and the convergence of open-vocabulary, multimodal, and generative paradigms.