Multimodal Industrial Anomaly Detection
Multi-Source Modality Fusion and 3D Geometric Feature Enhancement
This group of studies focuses on the deep integration of RGB images with 3D point clouds, depth maps, or surface normals. Through mechanisms such as feature-level fusion, bidirectional reconstruction, cross-modal distillation, and frequency alignment, they exploit the complementarity between modalities to capture anomalies involving subtle deformations or complex spatial structures, addressing the insufficiency of any single modality. A minimal fusion-and-scoring sketch follows the reference list below.
- FMFR: Feature-level Multistage Fusion and Remapping for Multimodal Industrial Anomaly Detection(Chunshui Wang, Hengran Zhang, 2026, Journal of Computational Design and Engineering)
- Multimodal multiscale industrial anomaly detection via flows(Haicheng Qu, Junjie Lin, 2025, Journal of Image and Graphics)
- Multimodal Industrial Anomaly Detection via Attention-Enhanced Memory-Guided Network(Shuaibo Liu, Xiaoli Luan, Yueyang Li, 2026, IEEE Transactions on Multimedia)
- Masked Cross-modal Reconstruction Network (MCR-Net) for Multi-modal Industrial Anomaly Detection(Li Mai, Chen Dai, Hongji Ma, Xin Lin, Shiwei Guo, Guang Yan, 2025, 2025 IEEE 3rd International Conference on Computer, Vision and Intelligent Technology (ICCVIT))
- Unsupervised Visual-to-Geometric Feature Reconstruction for Vision-Based Industrial Anomaly Detection(Dinh-Cuong Hoang, Phan Xuan Tan, Anh-Nhat Nguyen, Duc-Thanh Tran, van-Hiep Duong, Anh-Truong Mai, D. Pham, Khanh-Toan Phan, Minh-Quang Do, Ta Huu Anh Duong, Tuan-Minh Huynh, Son-Anh Bui, Duc-Manh Nguyen, Viet-Anh Trinh, Khanh-Duong Tran, Thu-Uyen Nguyen, 2025, IEEE Access)
- Multimodal Industrial Anomaly Detection via Uni-Modal and Cross-Modal Fusion(Hao Cheng, Jiaxiang Luo, Xianyong Zhang, 2025, IEEE Transactions on Industrial Informatics)
- VLDFNet: Views-Graph and Latent Feature Disentangled Fusion Network for Multimodal Industrial Anomaly Detection(Chenxing Xia, Chaofan Liu, Yicong Zhou, Kuan Ching Li, 2025, IEEE Transactions on Instrumentation and Measurement)
- Auxiliary Information Flow for 3D Industrial Defect Detection on IC Ceramic Package Substrate Surfaces: Dataset and Benchmark(Ruiyun Yu, Ziming Zhao, Shi Zhen, 2026, IEEE Transactions on Circuits and Systems for Video Technology)
- Enhancing Multimodal Anomaly Detection via Asymmetric Dual-Branch Reverse Distillation(Zihe Chen, Bin Chen, Jianfeng Yang, Yichi Chen, Yuan Zhang, 2025, The Visual Computer)
- Unified Unsupervised Anomaly Detection via Matching Cost Filtering(Zhe Zhang, Mingxiu Cai, Gao‐Song Wu, Jing Zhang, Lingqiao Liu, Dacheng Tao, Tianyou Chai, Xiatian Zhu, 2025, ArXiv)
- Inter-modality feature prediction through multimodal fusion for 3D shape defect detection(Mujtaba Asad, Waqar Azeem, Hafiz Tayyab Mustafa, Yuming Fang, Jie Yang, Yifan Zuo, Wei Liu, 2025, Neural networks : the official journal of the International Neural Network Society)
- FAMRD: Frequency-Aware Multimodal Reverse Distillation for Industrial Anomaly Detection(Qiyin Zhong, Xianglin Qiu, Xiaolei Wang, Zhen Zhang, Gang Liu, Jimin Xiao, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- BridgeNet: A Unified Multimodal Framework for Bridging 2D and 3D Industrial Anomaly Detection(An Xiang, Zixuan Huang, Xitong Gao, Kejiang Ye, Cheng-zhong Xu, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- HGCF: Hierarchical Geometry-Color Fusion for Multimodal Industrial Anomaly Detection(Min Li, Jinghui He, Jiachen Li, Delong Han, Jin Wan, Gang Li, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- A multimodal industrial anomaly detection method based on mask training and teacher-student joint memory(Yi Liu, Changsheng Zhang, Xingjun Dong, Yufei Yang, 2025, Eng. Appl. Artif. Intell.)
- CPIR: Multimodal Industrial Anomaly Detection via Latent Bridged Cross-modal Prediction and Intra-modal Reconstruction(Shangguan Wen, Hongqiang Wu, Yanchang Niu, Haonan Yin, Jiawei Yu, Bokui Chen, Biqing Huang, 2025, Adv. Eng. Informatics)
- DFRF-MIAD: Multimodal Industrial Anomaly Detection via Feature Reconstruction and Fusion(Feng Wu, Zhaojing Wang, Li Li, 2026, No journal)
- A multi-expert framework for enhancing multimodal large language models in industrial anomaly detection(Zhiling Chen, Farhad Imani, 2026, Pattern Recognit.)
- Zero-shot Anomaly Detection Algorithm Based on Adaptive Feature Fusion(Xiaoquan Tang, Hongjie Liu, Zhen Wang, Tao Liu, 2025, 2025 5th International Conference on Artificial Intelligence, Virtual Reality and Visualization (AIVRV))
- A hierarchical framework for three‐dimensional pavement crack detection on point clouds with multi‐scale abnormal region filtering and multimodal interaction fusion(Jiayv Jing, Ling Ding, Xu Yang, Hang Cheng, Yazhen Qiu, Hainian Wang, Rauno Heikkilä, 2025, Computer‐Aided Civil and Infrastructure Engineering)
- DCRDF-Net: A Dual-Channel Reverse-Distillation Fusion Network for 3D Industrial Anomaly Detection(Chunshui Wang, Jianbo Chen, Heng Zhang, 2026, Sensors (Basel, Switzerland))
- Unsupervised Feature Metric-Based Multimodal Anomaly Detection Method(Liu Li, 2025, 2025 5th International Conference on Artificial Intelligence, Big Data and Algorithms (CAIBDA))
- 2M3DF: Advancing 3D Industrial Defect Detection With Multi-Perspective Multimodal Fusion Network(Mujtaba Asad, Waqar Azeem, He Jiang, Hafiz Tayyab Mustafa, Jie Yang, Wei Liu, 2025, IEEE Transactions on Circuits and Systems for Video Technology)
- Multimodal Industrial Anomaly Detection via Geometric Prior(Min Li, Jinghui He, Gang Li, Jiachen Li, Jin Wan, Delong Han, 2026, IEEE Transactions on Circuits and Systems for Video Technology)
- MambaAlign: Alignment-Aware State-Space Fusion for RGB-X Industrial Anomaly Detection(Dinh-Cuong Hoang, Phan Xuan Tan, Anh-Nhat Nguyen, D. Ngo, Minh-Duc Cao, Minh-Quang Vu, Hoang-Nam Duong, S. Nguyen, Thi-Hong Le, Van-Viet Dang, Xuan-Tung Dinh, Minh-Anh Nguyen, Minh-Quang Do, Van-Khanh Giap, van-Hiep Duong, 2025, Journal of Computational Design and Engineering)
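To make the fusion-then-score idea above concrete, here is a minimal, hedged sketch of feature-level fusion with a nearest-neighbor memory of normal samples. The backbones, patch alignment, tensor shapes, and the memory-bank scoring are illustrative assumptions and do not reproduce any specific paper listed here.

```python
# Illustrative sketch only: fuse spatially aligned RGB and point-cloud patch
# features, then score test patches by distance to a memory of normal patches.
import torch
import torch.nn.functional as F

def fuse_patch_features(rgb_feats, geom_feats):
    """rgb_feats: (N, C_rgb), geom_feats: (N, C_geom), spatially aligned per patch."""
    rgb_feats = F.normalize(rgb_feats, dim=-1)
    geom_feats = F.normalize(geom_feats, dim=-1)
    return torch.cat([rgb_feats, geom_feats], dim=-1)

def build_memory(normal_samples):
    """Stack fused features of all anomaly-free training patches into one memory bank."""
    return torch.cat([fuse_patch_features(r, g) for r, g in normal_samples], dim=0)

def anomaly_map(rgb_feats, geom_feats, memory, h, w):
    """Score each test patch by its distance to the nearest normal patch."""
    fused = fuse_patch_features(rgb_feats, geom_feats)   # (h*w, C)
    dists = torch.cdist(fused, memory)                   # (h*w, |memory|)
    scores = dists.min(dim=-1).values                    # nearest-neighbor distance
    return scores.reshape(h, w)                          # patch-level anomaly map
```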
Zero-Shot and Few-Shot Detection Based on Vision-Language Models (VLMs)
These works exploit the cross-modal alignment capability of pre-trained models such as CLIP, using prompt engineering, multi-scale perception, attribute awareness, or feature disentanglement to rapidly classify and localize industrial defects with little or no training on target data. A minimal CLIP prompt-scoring sketch follows the reference list below.
- MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples(Xurui Li, Feng Xue, Yu Zhou, 2025, ArXiv)
- Supad: a superordinary zero-shot industrial anomaly detection network based on gated-agnostic multimodal adaptive learning prompts(Xinying Li, Junfeng Jing, Tong Wu, Xin Zhang, Wei Liu, 2026, Journal of Intelligent Manufacturing)
- V2TCASA: Vision to text class-agnostic state-agnostic for industrial zero-shot anomaly detection(Cheng Jiang, Lingxi Peng, Haohuai Liu, 2025, Signal, Image and Video Processing)
- Towards Zero-Shot Anomaly Detection via Adaptive Prompting and Multi-Scale Cross-Modal Interaction(Guo Tang, Weidong Zhao, Ning Jia, Xianhui Liu, 2025, 2025 7th International Conference on Robotics and Computer Vision (ICRCV))
- An efficient and scale-aware zero-shot industrial anomaly detection technique based on optimized CLIP(Yahui Cheng, Guojun Wen, Aoshuang Luo, Shuang Mei, Hongbo Dong, Xingyue Liu, 2025, Measurement)
- Toward Zero-Shot Point Cloud Anomaly Detection: A Multiview Projection Framework(Yuqi Cheng, Yunkang Cao, Guoyang Xie, Zhichao Lu, Weiming Shen, 2026, IEEE Transactions on Systems, Man, and Cybernetics: Systems)
- H2SP-AD: hierarchical hybrid softened prompt learning for instance-aware zero-shot industrial anomaly detection(Qishuo Yang, Ying Chen, 2026, Journal of Intelligent Manufacturing)
- StackCLIP: Clustering-Driven Stacked Prompt in Zero-Shot Industrial Anomaly Detection(Yanning Hou, Yanran Ruan, Junfa Li, Shanshan Wang, Jianfeng Qiu, Ke Xu, 2025, ArXiv)
- Human-Guided Zero-Shot Surface Defect Semantic Segmentation(Yuxin Jin, Yunzhou Zhang, Dexing Shan, Zhifei Wu, 2025, IEEE Transactions on Instrumentation and Measurement)
- SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection(Chenhao Fu, Han Fang, Xiuzheng Zheng, Wenbo Wei, Yonghua Li, Hao Sun, Xuelong Li, 2026, ArXiv)
- Multimodal zero-shot anomaly detection using dual-experts for electrical power equipment inspection images(Hua Wu, Donghao Jia, Tingting Zhang, Xiaojing Bai, Li Sun, Mengyang Pu, 2025, Journal of Image and Graphics)
- ZUMA: Training-free Zero-shot Unified Multimodal Anomaly Detection.(Yunfeng Ma, Min Liu, Shuai Jiang, Jingyu Zhou, Yuan Bian, Xueping Wang, Yaonan Wang, 2026, IEEE transactions on pattern analysis and machine intelligence)
- ZSDD: Zero-Shot Detection and Segmentation of Surface Defects Using Pre-Trained Models(Mohammad Sadeghpoor, M. Nahvi, 2025, 2025 7th International Conference on Pattern Recognition and Image Analysis (IPRIA))
- AnomalyNLP: Noisy-Label Prompt Learning for Few-Shot Industrial Anomaly Detection(L. Hua, Jin Qian, 2025, Electronics)
- A Training-Free Correlation-Weighted Model for Zero-/Few-Shot Industrial Anomaly Detection with Retrieval Augmentation(Wei Ran, Zefang Yu, Suncheng Xiang, Ting Liu, Yuzhuo Fu, 2025, ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- MGFD-CLIP: Multi-Granularity Feature Decoupling for Zero-Shot Industrial Anomaly Detection(Zichun Zhang, Jiehao Chen, 2025, 2025 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA))
- MCL-AD: Multimodal Collaboration Learning for Zero-Shot 3D Anomaly Detection(Gang Li, Tianjiao Chen, Mingle Zhou, Min Li, Delong Han, Jin Wan, 2025, ArXiv)
- MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning(Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj, Stefan Schmid, Steffen Staab, Claudia Plant, 2025, ArXiv)
- DHR-CLIP: Dynamic High-Resolution Object-Agnostic Prompt Learning for Zero-shot Anomaly Segmentation(Jiyul Ham, Jun-Geol Baek, 2025, 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC))
- WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation(Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, O. Dabeer, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Zero-Shot Industrial Anomaly Detection via CLIP-DINOv2 Multimodal Fusion and Stabilized Attention Pooling(Junjie Jiang, Zongxiang He, Anping Wan, Khalil Al-Bukhaiti, Kaiyang Wang, Peiyi Zhu, Xiaomin Cheng, 2025, Electronics)
- Zero-Shot Defect Detection With Anomaly Attribute Awareness via Textual Domain Bridge(Zhe Zhang, Shu Chen, Jian Huang, Jie Ma, 2025, IEEE Sensors Journal)
- Local Enhancement and Semantic Alignment for Zero-Shot Anomaly Detection(Xiaohong Qiu, Jing Huang, Jun Hu, Yangfen Wang, 2025, 2025 10th International Conference on Computer and Information Processing Technology (ISCIPT))
- Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection(Zhen Qu, Xian Tao, Xinyi Gong, Shichen Qu, Qiyu Chen, Zhengtao Zhang, Xingang Wang, Guiguang Ding, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Adapting OpenAI's CLIP Model for Few-Shot Image Inspection in Manufacturing Quality Control: An Expository Case Study with Multiple Application Examples(F. Megahed, Ying-Ju Chen, B. Colosimo, Marco Luigi, G. Grasso, L. A. Jones‐Farmer, Sven Knoth, Hongyue Sun, I. Zwetsloot, 2025, ArXiv)
- Zero-Shot Industrial Anomaly Segmentation with Image-Aware Prompt Generation(SoYoung Park, Hyewon Lee, Mingyu Choi, Seunghoon Han, Jong-Ryul Lee, Sungsu Lim, Tae-Ho Kim, 2025, No journal)
- DNPR: Zero-shot industrial anomaly detection via dynamic normal prototype refinement(Shuyun Li, Zhi Li, Weidong Wang, Long Zheng, Yu Lu, 2026, Expert Syst. Appl.)
- Accurate industrial anomaly detection with efficient multimodal fusion(Dinh-Cuong Hoang, Phan Xuan Tan, Anh-Nhat Nguyen, Ta Huu Anh Duong, Tuan-Minh Huynh, Duc-Manh Nguyen, Minh-Duc Cao, D. Ngo, Thu-Uyen Nguyen, Khanh-Toan Phan, Minh-Quang Do, Xuan-Tung Dinh, van-Hiep Duong, Ngoc-Anh Hoang, van-Thiep Nguyen, 2025, Array)
- InspectVLM: Unified in Theory, Unreliable in Practice(Conor Wallace, I. Corley, Jonathan Lwowski, 2025, 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW))
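As a concrete illustration of the prompt-based scoring these works build on, the following sketch compares an image against paired normal/anomalous CLIP text prompts in the spirit of WinCLIP. The checkpoint name and prompt wording are placeholder assumptions; real methods ensemble many prompt templates and add patch-level scoring for localization.

```python
# Illustrative sketch only: zero-shot anomaly scoring with normal vs. anomalous text prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def zero_shot_anomaly_score(image: Image.Image, object_name: str) -> float:
    prompts = [
        f"a photo of a flawless {object_name}",   # normal-state prompt
        f"a photo of a damaged {object_name}",    # anomalous-state prompt
    ]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    probs = out.logits_per_image.softmax(dim=-1)  # (1, 2) image-text similarity
    return probs[0, 1].item()                     # probability mass on the anomalous prompt
```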
Logical Reasoning and Explainable Detection Driven by Multimodal Large Language Models (MLLMs)
This group explores end-to-end anomaly analysis with large language models (LLMs) or multimodal large models such as GPT-4V and InternVL. By introducing chain-of-thought (CoT) reasoning, multi-agent collaboration, or retrieval-augmented generation (RAG), the models not only localize anomalies but also provide logical explanations and defect descriptions, handling complex logical anomalies. A minimal prompting sketch follows the reference list below.
- Towards Training-free Anomaly Detection with Vision and Language Foundation Models(Jinjin Zhang, Guodong Wang, Yizhou Jin, Di Huang, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection(Yuhao Chao, Jie Liu, Jie Tang, Gangshan Wu, 2025, ArXiv)
- OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning(Shifang Zhao, Yiheng Lin, Lu Han, Yao Zhao, Yunchao Wei, 2025, ArXiv)
- IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning(Mengyang Zhao, Teng Fu, Haiyang Yu, Ke Niu, Bin Li, 2025, ArXiv)
- The Amazon Nova Family of Models: Technical Report and Model Card(Amazon AGI, Aaron Langford, et al., 2025, ArXiv)
- Intern-S1: A Scientific Multimodal Foundation Model(Lei Bai, Zhongrui Cai, et al., 2025, ArXiv)
- Towards VLM-based Hybrid Explainable Prompt Enhancement for Zero-Shot Industrial Anomaly Detection(Weichao Cai, Weiliang Huang, Yunkang Cao, Chao Huang, Fei Yuan, Bob Zhang, Jie Wen, 2025, No journal)
- Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models(Jiacong Xu, Shao-Yuan Lo, Bardia Safaei, Vishal M. Patel, Isht Dwivedi, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Detect, Classify, Act: Categorizing Industrial Anomalies with Multi-Modal Large Language Models(Sassan Mokhtar, Arian Mousakhan, Silvio Galesso, Jawad Tayyub, Thomas Brox, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO(Wei Guan, Jun Lan, Jian Cao, Hao Tan, Huijia Zhu, Weiqiang Wang, 2025, ArXiv)
- LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction(Er Jin, Qihui Feng, Yongli Mou, Stefan Decker, G. Lakemeyer, Oliver Simons, Johannes Stegmaier, 2025, ArXiv)
- PB-IAD: Utilizing multimodal foundation models for semantic industrial anomaly detection in dynamic manufacturing environments(Bernd Hofmann, Albert Scheck, Joerg Franke, Patrick Bruendl, 2025, ArXiv)
- AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection(Junwen Miao, Penghui Du, Yi Liu, Yu Wang, Yan Wang, 2025, ArXiv)
- IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection(Zewen Li, Zitong Yu, Qilang Ye, Weicheng Xie, Wei Zhuo, Linlin Shen, 2025, IEEE Transactions on Instrumentation and Measurement)
- Can Multimodal Large Language Models be Guided to Improve Industrial Anomaly Detection?(Zhiling Chen, Hanning Chen, Mohsen Imani, Farhad Imani, 2025, ArXiv)
- Think-to-Detect: Rationale-Driven Vision–Language Anomaly Detection(Mahmoud Abdalla, M. Kasem, Mohamed Mahmoud, Mostafa Farouk Senussi, Abdelrahman Abdallah, Hyun-Soo Kang, 2025, Mathematics)
- LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning(Peijian Zeng, Feiyan Pang, Zhanbo Wang, Aimin Yang, 2025, 2025 IEEE International Conference on Data Mining (ICDM))
- Zero-Shot Anomaly Detection in Laser Powder Bed Fusion Using Multimodal RAG and Large Language Models(Kiarash Naghavi Khanghah, Zhiling Chen, Lela Romeo, Qian Yang, R. Malhotra, Farhad Imani, Hongyi Xu, 2025, Journal of Mechanical Design)
- MALM-CLIP: A generative multi-agent framework for multimodal fusion in few-shot industrial anomaly detection(Hanzhi Chen, Jingbin Que, Kexin Zhu, Zhide Chen, F. Zhu, Wencheng Yang, Xu Yang, Xuechao Yang, 2025, Inf. Fusion)
- ID-RAG: industrial defect retrieval-augmented generation for industrial surface defect detection(Mingyu Lee, Jongwon Choi, 2026, Machine Vision and Applications)
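A minimal, hedged sketch of the prompting pattern these works rely on: a multimodal LLM is shown a defect-free reference and a query image and asked for a structured verdict plus an explanation. The OpenAI-style client, the model name "gpt-4o", and the prompt wording are assumptions for illustration only, not any listed method's pipeline.

```python
# Illustrative sketch only: reference-guided anomaly reasoning with a multimodal LLM.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def inspect(query_path: str, reference_path: str, object_name: str) -> str:
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                f"The first image is a defect-free reference {object_name}; "
                f"the second is the part under inspection. Think step by step, "
                f'then answer with JSON: {{"anomaly": true/false, "reason": "..."}}')},
            {"type": "image_url", "image_url": {"url": to_data_url(reference_path)}},
            {"type": "image_url", "image_url": {"url": to_data_url(query_path)}},
        ],
    }]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```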
Frontier Architectures: Mamba, Diffusion Models, and Efficient Fine-Tuning
These studies introduce state-space models (Mamba) to improve long-sequence processing efficiency, or leverage the generative capacity of diffusion models to capture complex semantics. They also cover parameter-efficient fine-tuning (PEFT) of industrial foundation models and cross-domain adaptation methods. A minimal LoRA fine-tuning sketch follows the reference list below.
- HFMM-Net: A Hybrid Fusion Mamba Network for Efficient Multimodal Industrial Defect Detection(Guo Zhao, Liang Tan, Musong He, Qi Wu, 2025, Inf.)
- DZAD: Diffusion-based Zero-shot Anomaly Detection(Tianrui Zhang, Liang Gao, Xinyu Li, Yiping Gao, 2025, No journal)
- Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset(TsaiChing Ni, Zhen-Qi Chen, YuanFu Yang, 2025, ArXiv)
- Zoom-Anomaly: Multimodal vision-Language fusion industrial anomaly detection with synthetic data(Jiaqi Li, Shuhuan Wen, Hamid Reza Karimi, 2026, Inf. Fusion)
- LScAD: A Large–Small Model Collaboration Framework for Unsupervised Industrial Anomaly Detection(Shichen Qu, Xian Tao, Xinyi Gong, Zhen Qu, Mukesh Prasad, Fei Shen, Zhengtao Zhang, Guiguang Ding, 2025, IEEE Transactions on Instrumentation and Measurement)
- Parameter-efficient Tuning of Large-scale Multimodal Foundation Model(Haixin Wang, Xinlong Yang, Jianlong Chang, Di Jin, Jinan Sun, Shikun Zhang, Xiao Luo, Qi Tian, 2023, Advances in Neural Information Processing Systems 36)
- Source-Free Domain Adaptation with Frozen Multimodal Foundation Model(Song Tang, Wenxin Su, Mao Ye, Xiatian Zhu, 2023, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Industrial Foundation Model(Lei Ren, Haiteng Wang, Jiabao Dong, Zidi Jia, Shixiang Li, Yuqing Wang, Y. Laili, Di-Wei Huang, Lin Zhang, Bohu Li, 2025, IEEE Transactions on Cybernetics)
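For the PEFT direction mentioned above, here is a hedged sketch of attaching LoRA adapters to a pre-trained vision-language backbone with the `peft` library. The backbone choice and the `target_modules` names are assumptions that depend on the actual model being adapted.

```python
# Illustrative sketch only: parameter-efficient fine-tuning via LoRA adapters.
from peft import LoraConfig, get_peft_model
from transformers import CLIPModel

base = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
lora_cfg = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (model-dependent assumption)
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # typically a small fraction of the full parameter count
# The wrapped model is then fine-tuned on industrial data as usual,
# while the frozen backbone keeps its pre-trained representations.
```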
Industrial Deployment in Practice: Digital Twins, Embodied Intelligence, and Robustness
These works address practical deployment challenges, including background clutter removal, robust handling of missing modalities, and automated inspection systems that integrate digital twins, AR, and robotic platforms. They also include customized solutions and dataset construction for specific industries such as power grids, PCBA, and photovoltaics.
- Industrial Anomaly Detection Under Background Clutter: A Foreground Extraction Study with RGB and 3D Data(GiBeom Kim, Hyejin Kim, 2025, 2025 16th International Conference on Information and Communication Technology Convergence (ICTC))
- Modality-Resilient Multimodal Industrial Anomaly Detection via Cross-Modal Knowledge Transfer and Dynamic Edge-Preserving Voxelization(Jiahui Xu, Jian Yuan, Mingrui Yang, Weishu Yan, 2025, Sensors (Basel, Switzerland))
- Enhanced Crack Segmentation Using Meta’s Segment Anything Model with Low-Cost Ground Truths and Multimodal Prompts(T. Muturi, Y. Adu-Gyamfi, 2025, Transportation Research Record)
- Real-time robotic teleoperation for pavement pothole segmentation, quantification, and localization using multimodal sensing and efficient multi-scale attention-enhanced edge deep learning(Xi Hu, Rayan H. Assaad, 2026, Automation in Construction)
- Three-dimensional inspection method for striped steel stockpiles(Kunpeng Wang, Lin Xu, 2025, Proceedings of the 2025 2nd International Conference on Modeling, Natural Language Processing and Machine Learning)
- UniPCB: A Unified Vision-Language Benchmark for Open-Ended PCB Quality Inspection(Fuxiang Sun, Xi Jiang, Jiansheng Wu, Haigang Zhang, Feng Zheng, Jinfeng Yang, 2026, ArXiv)
- A Method for 3D Printing Defect Detection Based on Multimodal Large Language Models(Bin Li, Yuzhong Cao, Runqi Chen, Yanzhu Chen, Yulin Ma, Haotian Cui, 2025, 2025 3rd International Conference on Intelligent Perception and Computer Vision (CIPCV))
- Bayesian network-based multimodal large model optimization of speech text and its fault prediction capability in power industry(Haitao Yu, Xuqiang Wang, J. Zheng, Tianyi Liu, Yongdi Bao, 2025, Journal of Combinatorial Mathematics and Combinatorial Computing)
- Adaptive Digital Twin Systems with AR Interaction for Resilient and Sustainable Industrial Operations(G. Gayathri, G. Fathima, 2025, 2025 6th International Conference on Electronics and Sustainable Communication Systems (ICESC))
- Hybrid Rule-Based Classification and Defect Detection System Using Insert Steel Multi-3D Matching(Soon-Woo Kwon, H. Park, Seungmin Baek, Min Young Kim, 2025, Electronics)
- A Streamlined System for Multimodal Industrial Anomaly Detection via 2D and 3D Feature Fusion(Wenbing Zhu, Mingmin Chi, Bo Peng, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Multimodal Segmentation for Photovoltaic Module Defect Detection(Xinyi He, Jianjun Tan, Tao Hu, Li Zhu, 2025, IEEE Access)
- PCAD: A Real-World Dataset for 6D Pose Industrial Anomaly Detection(Robert F. Maack, Lars Thun, Thomas Liang, Hasan Tercan, Tobias Meisen, 2025, 2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW))
- Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection(Wenbing Zhu, Lidong Wang, Ziqing Zhou, Chengjie Wang, Yurui Pan, Ruoyi Zhang, Zhuhao Chen, Linjie Cheng, Bin-Bin Gao, Jiangning Zhang, Zhenye Gan, Yuxie Wang, Yulong Chen, Shuguang Qian, Mingmin Chi, Bo Peng, Lizhuang Ma, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- A Comprehensive Survey for Real-World Industrial Defect Detection: Challenges, Approaches, and Prospects(Yuqi Cheng, Yunkang Cao, Haiming Yao, Wei Luo, Cheng Jiang, Hui Zhang, Weiming Shen, 2025, ArXiv)
- Digital Twins for Defect Detection in FDM 3D Printing Process(Chao Xu, Shengbin Lu, Yulin Zhang, Lu Zhang, Zhengyi Song, Huili Liu, Qingping Liu, Luquan Ren, 2025, Machines)
- Zero-Shot Multi-Criteria Visual Quality Inspection for Semi-Controlled Industrial Environments via Real-Time 3D Digital Twin Simulation(Jose Moises Araya-Martinez, Gautham Mohan, Kenichi Hayakawa Bolanos, Roberto Mendieta, Sarvenaz Sardari, Jens Lambrecht, Jörg Krüger, 2025, ArXiv)
- An Industrial-Grade Robotic Platform for PCBA Optical Inspection Integrating Convolutional Neural Networks and Photogrammetry(Julio Hiago de Souza, Ilmar Duarte dos Reis, 2025, Revista ft)
- Embodied Intelligence Toward Future Smart Manufacturing in the Era of AI Foundation Model(Lei Ren, Jiabao Dong, Shuai Liu, Lin Zhang, Lihui Wang, 2025, IEEE/ASME Transactions on Mechatronics)
- Remote Human-Robot Interaction in Industrial Inspection System Based on Vision-Language Models(X. Lan, Litao Zhang, Ping Huang, Haojie Huang, Zhezhuang Xu, 2025, 2025 40th Youth Academic Annual Conference of Chinese Association of Automation (YAC))
- Commonality in Few: Few-Shot Multimodal Anomaly Detection via Hypergraph-Enhanced Memory(Yuxuan Lin, Hanjing Yan, Xuan Tong, Yang Chang, Huanzhen Wang, Ziheng Zhou, Shuyong Gao, Yan Wang, Wenqiang Zhang, 2025, ArXiv)
- Concurrent historical data clustering and common feature learning for new-mode zero-shot industrial anomaly detection(Kai Wang, Xinlong Yuan, Xun Lang, Xiaofeng Yuan, Jie Han, Yalin Wang, 2026, Eng. Appl. Artif. Intell.)
- Advances in Electrical Grid Assets Inspection: Exploring Multimodal Large Language Models(P. Rocha, Fernando Lopes, Luís A. da Silva Cruz, 2025, 2025 25th International Conference on Digital Signal Processing (DSP))
Multimodal industrial anomaly detection is undergoing a paradigm shift from "perceptual fusion" to "cognitive reasoning." The research focus has moved from pure RGB-D feature reconstruction toward zero-shot generalization and explainable logical analysis with VLMs/MLLMs. Emerging architectures such as Mamba and diffusion models further improve detection efficiency and generative quality, while the integration of digital twins and embodied intelligence marks an accelerating transition toward practical deployment on automated production lines.
105 related references in total. Abstracts of the listed works follow, where available.
Constructing comprehensive multimodal feature representations from RGB images (RGB) and point clouds (PT) in 2D–3D multimodal anomaly detection (MAD) methods is very important to reveal various types of industrial anomalies. For multimodal representations, most of the existing MAD methods often consider the explicit spatial correspondence between the modality-specific features extracted from RGB and PT through space-aligned fusion, while overlook the implicit interaction relationships between them. In this study, we propose a uni-modal and cross-modal fusion (UCF) method, which comprehensively incorporates the implicit relationships within and between modalities in multimodal representations. Specifically, UCF first establishes uni-modal and cross-modal embeddings to capture intramodal and intermodal relationships through uni-modal reconstruction and cross-modal mapping. Then, an adaptive nonequal fusion method is proposed to develop fusion embeddings, with the aim of preserving the primary features and reducing interference of the uni-modal and cross-modal embeddings. Finally, uni-modal, cross-modal, and fusion embeddings are all collaborated to reveal anomalies existing in different modalities. Experiments conducted on the MVTec 3D-AD benchmark and the real-world surface mount inspection demonstrate that the proposed UCF outperforms existing approaches, particularly in precise anomaly localization.
Accurate detection and precise localization of anomalies during precision component manufacturing are essential to maintaining high product quality. Multimodal industrial anomaly detection (MIAD) harnesses data from diverse sensors to effectively identify and pinpoint defects in industrial products. Recent MIAD approaches have made significant progress but often ignore point cloud data global contextual semantics and modality-specific information, resulting in an incomplete representation of point cloud and inadequate multimodal fusion. To confront these issues head-on, we propose a robust feature representation and comprehensive multimodal feature fusion network [views-graph and latent feature disentangled fusion network (VLDFNet)] for anomaly detection in industrial high-precision components. VLDFNet mainly consists of a point cloud views-graph representation model and a multimodal disentangled feature latent space fusion module. Specifically, the point cloud views-graph representation model explores spatial locations and semantic relationships between views using multilevel graph fusion. The multimodal disentangled feature latent space fusion module disentangles multimodal features into shared and specific representations to mitigate the omission of modality-specific information. VLDFNet introduces a cross-modal shared feature interaction (CSFI) strategy to extract coherent semantic information by aligning and integrating cross-modal features. Comprehensive experimental results on multiple datasets demonstrate that our method significantly outperforms existing approaches in detection accuracy.
No abstract available
No abstract available
We demonstrate an end-to-end system for real-time, multimodal industrial anomaly detection (IAD), built upon a custom hardware platform for synchronized 2D and 3D data acquisition. Our core contribution is a novel cross-modal residual mechanism that identifies defects by quantifying predictive errors between visual and geometric feature spaces. Instead of traditional concatenation, our dual-stream architecture mutually predicts features across modalities, leveraging the prediction residual's magnitude as a direct and robust anomaly indicator. The entire system achieves sub-second inference from acquisition to decision, enabled by efficient depth map analysis that circumvents the complexity of direct point cloud processing, offering a deployable solution for high-speed inspection.
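A minimal sketch of the cross-modal prediction-residual idea described in this abstract, assuming patch features have already been extracted for both modalities. The MLP heads, hidden sizes, and training details are illustrative placeholders rather than the authors' implementation.

```python
# Illustrative sketch only: two heads mutually predict each modality's features
# from the other; the residual magnitude acts as the anomaly score.
import torch
import torch.nn as nn

class CrossModalPredictor(nn.Module):
    def __init__(self, c_rgb: int, c_geom: int, hidden: int = 256):
        super().__init__()
        self.rgb_to_geom = nn.Sequential(nn.Linear(c_rgb, hidden), nn.GELU(), nn.Linear(hidden, c_geom))
        self.geom_to_rgb = nn.Sequential(nn.Linear(c_geom, hidden), nn.GELU(), nn.Linear(hidden, c_rgb))

    def training_loss(self, f_rgb, f_geom):
        # Trained on normal samples only, so each branch learns the "normal" mapping.
        loss = nn.functional.mse_loss(self.rgb_to_geom(f_rgb), f_geom)
        loss = loss + nn.functional.mse_loss(self.geom_to_rgb(f_geom), f_rgb)
        return loss

    @torch.no_grad()
    def anomaly_scores(self, f_rgb, f_geom):
        # A large residual in either prediction direction marks a patch as anomalous.
        r1 = (self.rgb_to_geom(f_rgb) - f_geom).norm(dim=-1)
        r2 = (self.geom_to_rgb(f_geom) - f_rgb).norm(dim=-1)
        return r1 + r2
```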
Multimodal industrial anomaly detection (IAD), which integrates RGB and 3D information, has become one of the key technical directions for improving detection robustness and accuracy. Although prevailing cross-modal feature-mapping methods are efficient and lightweight, they still suffer from two major limitations. First, they typically adopt a one-way modeling paradigm that regresses one modality from another and lack explicit interaction within a unified representation space, making it difficult to detect local, small-magnitude anomalies that appear only in a single modality. Second, fusion-reconstruction methods derived from this paradigm rely on a single fusion stream optimized with a reconstruction loss. When trained solely on normal samples, this design can overgeneralize and lacks a parallel branch to enforce consistency constraints on the fused representations, which in turn limits reliable discrimination between normal and anomalous patterns in complex multimodal scenarios. To address these issues, we propose FMFR, a feature-level multistage fusion and remapping framework that jointly models multistage feature fusion and cross-modal remapping. The framework consists of a fusion-reconstruction branch and a remapping-fusion branch, which are jointly constrained by a multi-order consistency loss. In the fusion-reconstruction branch, a reconstruction loss supervises the intermediate fusion layers, encouraging them to learn joint representations that retain complete information and to reconstruct features without losing critical details. In the remapping-fusion branch, the network learns bidirectional mappings between modalities and re-fuses the remapped features, while the multi-order consistency loss is used to align its fused representations with those of the fusion-reconstruction branch. During inference, FMFR jointly leverages intra-modal reconstruction residuals, cross-modal remapping residuals, and the consistency deviation between the fused embeddings of the two branches to construct multi-source anomaly maps. This design forces anomalies to simultaneously violate both intra-modal and cross-modal priors, thereby suppressing the overgeneralization of a single fusion stream and enhancing the visibility of local anomaly structures that exist only in a single modality as well as the overall robustness of anomaly detection. Experimental results on the MVTec 3D-AD dataset demonstrate that FMFR achieves competitive state-of-the-art performance on both anomaly detection and anomaly segmentation tasks.
The purpose of multimodal industrial anomaly detection is to detect complex geometric shape defects such as subtle surface deformations and irregular contours that are difficult to detect in 2D-based methods. However, current multimodal industrial anomaly detection lacks the effective use of crucial geometric information like surface normal vectors and 3D shape topology, resulting in low detection accuracy. In this paper, we propose a novel Geometric Prior-based Anomaly Detection network (GPAD). Firstly, we propose a point cloud expert model to perform fine-grained geometric feature extraction, employing differential normal vector computation to enhance the geometric details of the extracted features and generate geometric prior. Secondly, we propose a two-stage fusion strategy to efficiently leverage the complementarity of multimodal data as well as the geometric prior inherent in 3D points. We further propose attention fusion and anomaly regions segmentation based on geometric prior, which enhance the model’s ability to perceive geometric defects. Extensive experiments show that our multimodal industrial anomaly detection model outperforms the State-of-the-art (SOTA) methods in detection accuracy on both MVTec-3D AD and Eyecandies datasets.
While current multimodal anomaly detection methods predominantly employ intermediate fusion strategies, they often suffer from inadequate cross-modal interaction and irreversible information loss during feature alignment processes. To overcome these limitations, we propose Hierarchical Geometry-Color Fusion (HGCF), a novel framework that establishes deep synergistic relationships between RGB texture features and point cloud geometric representations. Firstly, we propose a bidirectional cross-modal early fusion mechanism that enables complementary information exchange between point cloud and RGB modalities at the input level. Secondly, we introduce a local self-supervised geometric color reconstruction network with group-wise feature alignment, enhancing fine-grained feature extraction through joint color-geometry reconstruction tasks. Finally, we propose a local window spatial-consistent attention fusion, which achieves semantic consistency and spatial consistency by emphasizing local mutation features to improve the detection of subtle anomalies. Extensive experiments show our model achieves 99.1% I-AUROC on MVTec 3D-AD and 91.7% on Eyecandies, both surpassing state-of-the-art methods.
Achieving high-precision anomaly detection with incomplete sensor data is a critical challenge in industrial automation and intelligent manufacturing. This incompleteness often results from sensor failures, environmental interference, occlusions, or acquisition cost constraints. This study explicitly targets both types of incompleteness commonly encountered in industrial multimodal inspection: (i) incomplete sensor data within a given modality, such as partial point cloud loss or image degradation, and (ii) incomplete modalities, where one sensing channel (RGB or 3D) is entirely unavailable. By jointly addressing intra-modal incompleteness and cross-modal absence within a unified cross-distillation framework, our approach enhances anomaly detection robustness under both conditions. First, a teacher–student cross-modal distillation mechanism enables robust feature learning from both RGB and 3D modalities, allowing the student network to accurately detect anomalies even when a modality is missing during inference. Second, a dynamic voxel resolution adjustment with edge-retention strategy alleviates the computational burden of 3D point cloud processing while preserving crucial geometric features. By jointly enhancing robustness to missing modalities and improving computational efficiency, our method offers a resilient and practical solution for anomaly detection in real-world manufacturing scenarios. Extensive experiments demonstrate that the proposed method achieves both high robustness and efficiency across multiple industrial scenarios, establishing new state-of-the-art performance that surpasses existing approaches in both accuracy and speed. This method provides a robust solution for high-precision perception under complex detection conditions, significantly enhancing the feasibility of deploying anomaly detection systems in real industrial environments.
Anomaly detection is a key technology in quality control for automated production lines. Currently, 2D-based anomaly detection methods fail to identify geometric structure anomalies in products. To address this limitation, this paper proposes a multimodal anomaly detection model using 3D point clouds and RGB images. To ensure the single-domain inference capability of each modality, we design an attention-enhanced dual memory bank to separately store local point cloud features and RGB features. The attention mechanism enhances the informativeness and discriminability of the feature descriptors, significantly improving the data quality in the memory bank. During the inference phase, the local point cloud features in the dual memory bank guide the RGB features in calculating anomaly scores in the 2D modality. This memory-guided approach strengthens the correlation between information across different modalities. Moreover, to improve the overall segmentation precision of the model, we propose an anomaly scoring scheme based on a weight map of signed distance values. The final anomaly detection results are obtained by integrating the advantages of point cloud data in geometric structure anomaly detection and RGB data in color anomaly detection. Extensive experiments demonstrate that the proposed method achieves superior segmentation precision compared to other advanced methods on the MVTec 3D-AD and Eyecandies datasets.
No abstract available
Industrial environments demand accurate detection of anomalies to maintain product quality and ensure operational safety. Traditional industrial anomaly detection (IAD) methods often lack the flexibility and adaptability needed in dynamic production settings, where new defect types and operational changes continually emerge. Recent advancements in multimodal large language models (MLLMs) have shown promise by combining visual and textual processing capabilities, yet they are often limited by their lack of domain-specific expertise, particularly regarding industry-standard defect tolerances. To overcome limitations, we introduce Echo, a novel multi-expert framework designed to enhance MLLM performance for IAD. Echo integrates four specialized modules: the Reference Extractor retrieves similar normal images to establish contextual baselines; the Knowledge Guide provides critical, industry-specific insights; the Reasoning Expert enables structured, stepwise analysis for complex queries; and the Decision Maker synthesizes information from the preceding modules to deliver precise, context-aware responses. Evaluations on the MMAD benchmark reveal that Echo significantly improves adaptability, precision, and robustness compared to conventional approaches. Our results demonstrate that guided MLLMs, when augmented with expert modules, can effectively bridge the gap between general visual understanding and the specialized requirements of industrial anomaly detection, paving the way for more reliable and interpretable inspection systems.
Industrial anomaly detection for 2D objects has gained significant attention and achieved progress in anomaly detection (AD) methods. However, identifying 3D depth anomalies using only 2D information is insufficient. Despite explicitly fusing depth information into RGB images or using point cloud backbone networks to extract depth features, both approaches struggle to adequately represent 3D information in multimodal scenarios due to the disparities among different modal information. Additionally, due to the scarcity of abnormal samples in industrial data, especially in multimodal scenarios, it is necessary to perform anomaly generation to simulate real-world abnormal samples. Therefore, we propose a novel unified multimodal anomaly detection framework to address these issues. Our contributions consist of 3 key aspects. (1) We extract visible depth information from 3D point cloud data simply and use 2D RGB images to represent appearance, which disentangles depth and appearance to support unified anomaly generation. (2) Benefiting from the flexible input representation, the proposed Multi-Scale Gaussian Anomaly Generator and Unified Texture Anomaly Generator can generate richer anomalies in RGB and depth. (3) All modules share parameters for both RGB and depth data, effectively bridging 2D and 3D anomaly detection. Subsequent modules can directly leverage features from both modalities without complex fusion. Experiments show our method outperforms state-of-the-art (SOTA) on MVTec-3D AD and Eyecandies datasets. Code available at: https://github.com/Xantastic/BridgeNet
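To illustrate the kind of depth-side anomaly synthesis discussed here, the following hedged sketch perturbs a depth map with a localized Gaussian bump and returns a pixel-level mask. The amplitude, size, and mask threshold are arbitrary assumptions, not BridgeNet's anomaly generators.

```python
# Illustrative sketch only: synthesize a simple geometric anomaly on a depth map.
import numpy as np

def add_gaussian_bump(depth: np.ndarray, center, sigma: float = 6.0, amp: float = 2.0):
    """depth: HxW depth map; center: (row, col); a negative amp produces a dent."""
    h, w = depth.shape
    yy, xx = np.mgrid[0:h, 0:w]
    bump = amp * np.exp(-(((yy - center[0]) ** 2 + (xx - center[1]) ** 2) / (2 * sigma ** 2)))
    mask = np.abs(bump) > 0.05 * abs(amp)   # pixel-level ground-truth anomaly mask
    return depth + bump, mask
```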
The detection of anomalies in manufacturing processes is crucial to ensure product quality and identify process deviations. Statistical and data-driven approaches remain the standard in industrial anomaly detection, yet their adaptability and usability are constrained by the dependence on extensive annotated datasets and limited flexibility under dynamic production conditions. Recent advances in the perception capabilities of foundation models provide promising opportunities for their adaptation to this downstream task. This paper presents PB-IAD (Prompt-based Industrial Anomaly Detection), a novel framework that leverages the multimodal and reasoning capabilities of foundation models for industrial anomaly detection. Specifically, PB-IAD addresses three key requirements of dynamic production environments: data sparsity, agile adaptability, and domain user centricity. In addition to the anomaly detection, the framework includes a prompt template that is specifically designed for iteratively implementing domain-specific process knowledge, as well as a pre-processing module that translates domain user inputs into effective system prompts. This user-centric design allows domain experts to customise the system flexibly without requiring data science expertise. The proposed framework is evaluated by utilizing GPT-4.1 across three distinct manufacturing scenarios, two data modalities, and an ablation study to systematically assess the contribution of semantic instructions. Furthermore, PB-IAD is benchmarked to state-of-the-art methods for anomaly detection such as PatchCore. The results demonstrate superior performance, particularly in data-sparse scenarios and low-shot settings, achieved solely through semantic instructions.
The robust causal capability of multimodal large language models (MLLMs) holds the potential of detecting defective objects in industrial anomaly detection (IAD). However, most traditional IAD methods lack the ability to provide multiturn human–machine dialogs and detailed descriptions, such as the color of objects, the shape of an anomaly, or specific types of anomalies. At the same time, methods based on large pretrained models have not fully stimulated the ability of large models in anomaly detection tasks. In this article, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ abnormal prompt generator (APG) to generate detailed anomaly prompts for specific objects. These specific prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pretrained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose text-guided enhancer (TGE), wherein image features interact with normal and abnormal text prompts to dynamically select enhancement pathways, which enables language models to focus on the specific aspects of visual data, enhancing their ability to accurately interpret and respond to anomalies within images. Moreover, we design a multimask fusion (MMF) module to incorporate mask as expert knowledge, which enhances the LLM’s perception of pixel-level anomalies. Extensive experiments on MVTec-AD and VisA datasets demonstrate our state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks. The codes are available at https://github.com/LiZeWen1225/IAD-GPT
Industrial visual inspection demands high-precision anomaly detection amid scarce annotations and unseen defects. This paper introduces a zero-shot framework leveraging multimodal feature fusion and stabilized attention pooling. CLIP’s global semantic embeddings are hierarchically aligned with DINOv2’s multi-scale structural features via a Dual-Modality Attention (DMA) mechanism, enabling effective cross-modal knowledge transfer for capturing macro- and micro-anomalies. A Stabilized Attention-based Pooling (SAP) module adaptively aggregates discriminative representations using self-generated anomaly heatmaps, enhancing localization accuracy and mitigating feature dilution. Trained solely in auxiliary datasets with multi-task segmentation and contrastive losses, the approach requires no target-domain samples. Extensive evaluation across seven benchmarks (MVTec AD, VisA, BTAD, MPDD, KSDD, DAGM, DTD-Synthetic) demonstrates state-of-the-art performance, achieving 93.4% image-level AUROC, 94.3% AP, 96.9% pixel-level AUROC, and 92.4% AUPRO on average. Ablation studies confirm the efficacy of DMA and SAP, while qualitative results highlight superior boundary precision and noise suppression. The framework offers a scalable, annotation-efficient solution for real-world industrial anomaly detection.
Multimodal Anomaly Detection (MMAD) has attracted significant attention in industrial defect inspection as it can simultaneously leverage the complementary information from different modalities to achieve higher-precision detection. Among existing MMAD approaches, dual-branch reverse distillation is widely adopted because of its efficiency in avoiding large-scale data storage. However, it suffers from two key issues. First, the alignment of cross-modal features can lead to a loss of modality-specific characteristics. Second, when one modality indicates normal while another shows anomalies, anomaly detection may be misled by that modality ambiguity. To address these challenges, we propose a Frequency-Aware Multimodal Reverse Distillation (FAMRD) framework from the frequency domain perspective. Specifically, we introduce a frequency spectral feature alignment module that aligns the low- and medium-frequency components across modalities to preserve global shape consistency, while maintaining high-frequency modality-specific details. In addition, we design a frequency spectral anomaly synthesis module. It perturbs the normal feature of one modality to create modality consistent anomalies, fuses it with another modality normal feature to mimic modality ambiguous anomalies, and adds them to the reverse distillation process for decision boundary optimization. Extensive experiments on standard MMAD benchmarks demonstrate that FAMRD achieves competitive performance in both anomaly detection and localization, outperforming state-of-the-art methods.
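A hedged sketch of the frequency decomposition that a frequency-aware scheme such as the one above operates on: a feature map is split into low- and high-frequency components with a radial FFT mask. The cutoff radius is an arbitrary assumption, and this is not the FAMRD alignment module itself.

```python
# Illustrative sketch only: radial FFT mask separating low- and high-frequency
# components of a feature map.
import torch

def frequency_split(feat: torch.Tensor, cutoff: float = 0.25):
    """feat: (C, H, W) real-valued feature map. Returns (low_freq, high_freq)."""
    C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    radius = torch.sqrt(xx ** 2 + yy ** 2)
    low_mask = (radius <= cutoff).float()          # keep the centre of the spectrum
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    high = feat - low                              # residual = high-frequency detail
    return low, high
```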
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability, to revolutionize IAD. By integrating MLLM with Group Relative Policy Optimization (GRPO), enhanced by our novel Reasoned Outcome Alignment Metric (ROAM), AnomalyR1 achieves a fully end-to-end solution that autonomously processes inputs of image and domain knowledge, reasons through analysis, and generates precise anomaly localizations and masks. Based on the latest multimodal IAD benchmark, our compact 3-billion-parameter model outperforms existing methods, establishing state-of-the-art results. As MLLM capabilities continue to advance, this study is the first to deliver an end-to-end VLM-based IAD solution that demonstrates the transformative potential of ROAM-enhanced GRPO, positioning our framework as a forward-looking cornerstone for next-generation intelligent anomaly detection systems in industrial applications with limited defective data.
No abstract available
Industrial Anomaly Detection (IAD) is critical for ensuring product quality by identifying defects. Traditional methods such as feature embedding and reconstruction-based approaches require large datasets and struggle with scalability. Existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) address some limitations but rely on mask annotations, leading to high implementation costs and false positives. Additionally, industrial datasets like MVTec-AD and VisA suffer from severe class imbalance, with defect samples constituting only 23.8 % and 11.1 % of total data respectively. To address these challenges, we propose a reward function that dynamically prioritizes rare defect patterns during training to handle class imbalance. We also introduce a mask-free reasoning framework using Chain of Thought (CoT) and Group Relative Policy Optimization (GRPO) mechanisms, enabling anomaly detection directly from raw images without annotated masks. This approach generates interpretable step-by-step explanations for defect localization. Our method achieves state-of-the-art performance, outperforming prior approaches by 36% in accuracy on MVTec-AD and 16% on VisA. By eliminating mask dependency and reducing costs while providing explainable outputs, this work advances industrial anomaly detection and supports scalable quality control in manufacturing.
The increasing complexity of industrial anomaly detection (IAD) has positioned multimodal detection methods as a focal area of machine vision research. However, dedicated multimodal datasets specifically tailored for IAD remain limited. Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with real industrial environments due to limitations in scale and resolution. To address these challenges, we introduce Real-IAD D3, a high-precision multimodal dataset that uniquely incorporates an additional pseudo-3D modality generated through photometric stereo, alongside high-resolution RGB images and micrometer-level 3D point clouds. Real-IAD D3 features finer defects, diverse anomalies, and greater scale across 20 categories, providing a challenging benchmark for multimodal IAD. Additionally, we introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality, enhancing detection performance. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance. The dataset and code are publicly accessible for research purposes at https://realiad4ad.github.io/Real-IAD_D3.
Industrial anomaly detection involves identifying abnormal regions in products and plays a crucial role in quality inspection. While 2D image-based anomaly detection has been extensively explored, combining two-dimensional (2D) images with three-dimensional (3D) point clouds remains less studied. Existing multimodal methods often combine features from different modalities, leading to feature interference and degraded performance. To overcome this, we propose a novel framework for unsupervised industrial anomaly detection that leverages both visual and geometric information. Specifically, we use pre-trained 2D and 3D models to extract visual features from color images and geometric features from 3D point clouds. Instead of directly fusing these features, we propose a geometric feature reconstruction network that predicts 3D geometric features from the 2D visual features. During training, we minimize the difference between the predicted geometric features and the extracted geometric features, enabling the model to learn how 2D appearance correlates with 3D structure in anomaly-free images. During inference, this learned relationship allows the model to detect anomalies: significant discrepancies between the reconstructed and actual geometric features indicate abnormal regions. Evaluated on the MVTec 3D-AD dataset, our method achieves state-of-the-art performance with an average image-level AUROC score of 0.968, surpassing previous approaches. Additionally, it provides fast inference at 8.2 frames per second with a memory footprint of only 1045 MB, making it highly efficient for industrial applications.
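The scoring rule described above, a discrepancy between predicted and observed geometric features, can be sketched roughly as follows; the predictor architecture, feature dimensions, and function names are placeholders rather than the paper's actual network.

```python
import torch
import torch.nn as nn

class Visual2GeomPredictor(nn.Module):
    """Toy stand-in for the geometric feature reconstruction network: maps
    per-patch 2D visual features to the corresponding 3D geometric features."""
    def __init__(self, dim_2d=768, dim_3d=384):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_2d, 512), nn.GELU(),
                                 nn.Linear(512, dim_3d))

    def forward(self, feat_2d):                      # (N_patches, dim_2d)
        return self.net(feat_2d)

def anomaly_map(predictor, feat_2d, feat_3d):
    """Per-patch anomaly score: L2 distance between predicted and actually
    extracted geometric features (large on defective regions)."""
    with torch.no_grad():
        pred_3d = predictor(feat_2d)
    return torch.norm(pred_3d - feat_3d, dim=-1)     # (N_patches,)

# Training on anomaly-free data minimizes the same distance:
# loss = nn.functional.mse_loss(predictor(feat_2d), feat_3d)
```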
Industrial anomaly detection (IAD) plays a crucial role in maintaining the safety and reliability of manufacturing systems. While multimodal large language models (MLLMs) show strong vision-language reasoning abilities, their effectiveness in IAD remains limited without domain-specific adaptation. In this work, we propose EMIT, a unified framework that enhances MLLMs for IAD via difficulty-aware group relative policy optimization (GRPO). EMIT constructs a multi-task IAD dataset and utilizes GPT-generated object text descriptions to compensate for missing defective images. For few-shot anomaly detection, it integrates a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons. To better handle difficult data samples, i.e., cases where the MLLM struggles to generate correct answers, we propose a difficulty-aware GRPO that extends the original GRPO by incorporating a response resampling strategy to ensure the inclusion of correct answers in the sampled responses, as well as an advantage reweighting mechanism to strengthen learning from such difficult data samples. Extensive experiments on the MMAD benchmark demonstrate that EMIT significantly enhances the IAD performance of MLLMs, achieving an average improvement of 7.77% over the base model (InternVL3-8B) across seven tasks.
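The abstract does not spell out the reweighting rule; the sketch below shows one way group-relative advantages could be scaled up on difficult prompts, i.e. those where few sampled responses are correct. The correctness proxy and the exponent are assumptions for illustration only.

```python
import torch

def difficulty_aware_advantages(rewards, gamma=1.0, eps=1e-6):
    """Group-relative advantages for one prompt's G sampled responses,
    up-weighted when the group contains few correct answers."""
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)   # plain GRPO advantage
    correct_rate = (rewards > 0).float().mean()                # proxy for difficulty
    weight = (1.0 - correct_rate + eps) ** gamma               # harder prompt -> larger weight
    return weight * adv

rewards = torch.tensor([1.0, 0.0, 0.0, 0.0])   # only 1 of 4 sampled answers is correct
print(difficulty_aware_advantages(rewards))
```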
Currently, in the field of anomaly detection, most existing methods rely on small models that focus on specific industrial scenarios; such models exhibit strong task orientation but lack generalization capabilities. The Segment Anything Model (SAM), a vision foundation model designed for semantic segmentation tasks, has demonstrated remarkable performance in natural scene segmentation. However, SAM lacks domain-specific knowledge of industrial defects and relies on an interactive inference framework, which restricts its application in industrial anomaly detection. In this article, we propose a novel large–small model collaboration framework for unsupervised industrial anomaly detection (LScAD), aiming to use task-oriented small models to guide SAM for precise anomaly localization. Specifically, the small model generates the initial guidance information, which serves as the input for a multimodal prompt module. This module consists of two modalities: image and text prompts. These prompts are then used to guide SAM for accurate anomaly segmentation. Moreover, we design a dual-branch adapter to enhance SAM’s domain-specific capability through a color-domain branch and a frequency-domain branch, aiming to improve its performance in anomaly detection tasks. Extensive experiments on the MVTec AD benchmark and other real-world industrial datasets demonstrate that our method achieves state-of-the-art performance, with an image-level area under the receiver operating characteristic curve (AUROC) of 99.6%, pixel-level AUROC of 98.4%, and average precision (AP) of 74.1% on the MVTec AD dataset. Our proposed method can also effortlessly adapt to multiclass anomaly detection without any modifications and achieve remarkable performance. Our code is available at: https://github.com/qsc1103/LScAD
In the context of Industrial Anomaly Detection (IAD), ensuring the quality of manufactured products is critical. Traditional 2D-based methods often fail to capture anomalies present in complex 3D shapes. For effective anomaly detection in 3D shapes, it is essential to incorporate global semantic context, local geometric structure, and color information of the object. To fully leverage these features, we propose a network named 2M3DF that leverages knowledge from multi-view RGB images and corresponding point cloud information for enhanced anomaly detection performance. Our model initially employs pre-trained feature extractors that generate local features from multi-view RGB images and corresponding point clouds. The novel inter-modality feature representation and fusion module first adapts these inter-modality features and then effectively aligns and aggregates these multimodality features on a pixel-to-point basis. To learn the normality from point-wise fused multimodal features, we fit a multivariate Gaussian distribution to model the normal feature distribution. Comprehensive experimental evaluations using the MVTec3D-AD and Eyecandies datasets validate the effectiveness of our proposed model and demonstrate significant improvements in comparison to existing state-of-the-art methods. Our model achieves a 96.6% mean I-AUROC while delivering real-time results.
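The normality model described above, a multivariate Gaussian fitted to point-wise fused features, admits a compact sketch; the feature dimension and regularization constant here are arbitrary choices, not the paper's settings.

```python
import numpy as np

def fit_gaussian(normal_feats, reg=1e-3):
    """Fit a multivariate Gaussian to fused features of anomaly-free samples.
    normal_feats: (N, D) array of point-wise features."""
    mu = normal_feats.mean(axis=0)
    cov = np.cov(normal_feats, rowvar=False) + reg * np.eye(normal_feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_scores(test_feats, mu, cov_inv):
    """Per-point anomaly score: Mahalanobis distance to the normal distribution."""
    diff = test_feats - mu
    return np.sqrt(np.einsum("nd,dk,nk->n", diff, cov_inv, diff))

rng = np.random.default_rng(0)                      # stand-in features for the demo
mu, cov_inv = fit_gaussian(rng.normal(size=(500, 64)))
scores = mahalanobis_scores(rng.normal(size=(10, 64)), mu, cov_inv)
```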
With the increasing demand for higher precision and real-time performance in industrial surface defect detection, multimodal detection methods integrating RGB images and 3D point clouds have drawn considerable attention. However, current mainstream methods typically employ computationally expensive Transformer-based models for capturing global features, resulting in significant inference delays that hinder their practical deployment for online inspection tasks. Furthermore, existing approaches exhibit limited capability in deep cross-modal interactions, negatively impacting defect detection and segmentation accuracy. In this paper, we propose a novel multimodal anomaly detection framework based on a bidirectional Mamba network to enhance cross-modal feature interaction and fusion. Specifically, we introduce an anomaly-aware parallel feature extraction network, leveraging a hybrid scanning state space model (SSM) to efficiently capture global and long-range dependencies with linear computational complexity. Additionally, we develop a cross-enhanced feature fusion module to facilitate dynamic interaction and adaptive fusion of multimodal features at multiple scales. Extensive experiments conducted on two publicly available benchmark datasets, MVTec 3D-AD and Eyecandies, demonstrate that the proposed method consistently outperforms existing approaches in both defect detection and segmentation tasks.
3D shape defect detection plays an important role in autonomous industrial inspection. However, accurate detection of anomalies remains challenging due to the complexity of multimodal sensor data, especially when both color and structural information are required. In this work, we propose a lightweight inter-modality feature prediction framework that effectively utilizes multimodal fused features from the inputs of RGB, depth and point clouds for efficient 3D shape defect detection. Our proposed framework consists of three main key components: 1) Modality-specific pre-trained feature extractor networks, 2) Multi-level Adaptive Dual-Modal Gated Fusion (ADMGF) module that effectively combines the RGB and depth features to obtain rich spatial and contextual information. 3) A lightweight inter-modal feature prediction network that utilizes the fused RGB-Depth features to predict the corresponding point cloud features and vice versa, forming a bidirectional learning mechanism through tri-modal inputs. Our model eliminates the need for large memory banks or pixel-level reconstructions. Comprehensive experiments on the MVTec3D-AD and Eyecandies datasets showed significant improvements in performance over the state-of-the-art methods.
Early-stage visual quality inspection is vital for achieving Zero-Defect Manufacturing and minimizing production waste in modern industrial environments. However, the complexity of robust visual inspection systems and their extensive data requirements hinder widespread adoption in semi-controlled industrial settings. In this context, we propose a pose-agnostic, zero-shot quality inspection framework that compares real scenes against real-time Digital Twins (DT) in the RGB-D space. Our approach enables efficient real-time DT rendering by semantically describing industrial scenes through object detection and pose estimation of known Computer-Aided Design models. We benchmark tools for real-time, multimodal RGB-D DT creation while tracking consumption of computational resources. Additionally, we provide an extensible and hierarchical annotation strategy for multi-criteria defect detection, unifying pose labelling with logical and structural defect annotations. Based on an automotive use case featuring the quality inspection of an axial flux motor, we demonstrate the effectiveness of our framework. Our results demonstrate solid detection performance, achieving intersection-over-union (IoU) scores of up to 63.3% against ground-truth masks, even when using simple distance measurements under semi-controlled industrial conditions. Our findings lay the groundwork for future research on generalizable, low-data defect detection methods in dynamic manufacturing settings.
Industrial surface defect detection is essential for ensuring product quality, but real-world production lines often provide only a limited number of defective samples, making supervised training difficult. Multimodal anomaly detection with aligned RGB and depth data is a promising solution, yet existing fusion schemes tend to overlook modality-specific characteristics and cross-modal inconsistencies, so that defects visible in only one modality may be suppressed or diluted. In this work, we propose DCRDF-Net, a dual-channel reverse-distillation fusion network for unsupervised RGB–depth industrial anomaly detection. The framework learns modality-specific normal manifolds from nominal RGB and depth data and detects defects as deviations from these learned manifolds. It consists of three collaborative components: a Perlin-guided pseudo-anomaly generator that injects appearance–geometry-consistent perturbations into both modalities to enrich training signals; a dual-channel reverse-distillation architecture with guided feature refinement that denoises teacher features and constrains RGB and depth students towards clean, defect-free representations; and a cross-modal squeeze–excitation gated fusion module that adaptively combines RGB and depth anomaly evidence based on their reliability and agreement. Extensive experiments on the MVTec 3D-AD dataset show that DCRDF-Net achieves 97.1% image-level I-AUROC and 98.8% pixel-level PRO, surpassing current state-of-the-art multimodal methods on this benchmark.
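The Perlin-guided pseudo-anomaly idea can be approximated in a few lines of NumPy; the value-noise generator below is a cheap stand-in for true Perlin noise and the perturbation strengths are arbitrary, so this is a sketch of the concept rather than the paper's generator. It assumes float RGB and depth arrays scaled to [0, 1].

```python
import numpy as np

def fractal_noise(h, w, octaves=4, rng=None):
    """Cheap value-noise stand-in for a Perlin mask: sum of nearest-neighbour
    upsampled random grids at increasing frequencies."""
    if rng is None:
        rng = np.random.default_rng()
    noise = np.zeros((h, w))
    for o in range(octaves):
        g = 2 ** (o + 2)
        grid = rng.random((g, g))
        ys = np.linspace(0, g - 1, h).astype(int)
        xs = np.linspace(0, g - 1, w).astype(int)
        noise += grid[ys][:, xs] / (2 ** o)
    return (noise - noise.min()) / (np.ptp(noise) + 1e-8)

def inject_pseudo_anomaly(rgb, depth, threshold=0.7, strength=0.3, rng=None):
    """Perturb RGB and depth with the *same* mask so the synthetic defect stays
    appearance-geometry consistent, as the abstract requires."""
    if rng is None:
        rng = np.random.default_rng()
    mask = fractal_noise(*rgb.shape[:2], rng=rng) > threshold
    rgb_aug, depth_aug = rgb.copy(), depth.copy()
    rgb_aug[mask] = (1 - strength) * rgb[mask] + strength * rng.random(3)
    depth_aug[mask] += strength * (rng.random(mask.sum()) - 0.5)
    return rgb_aug, depth_aug, mask
```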
Industrial Anomaly Detection (IAD) has received increasing attention because undetected flaws on power-grid hardware, turbine blades or production lines can trigger blackouts, costly shutdowns or safety accidents. Since single-modality RGB or LiDAR data alone are easily invalidated by darkness, fog or specular reflection, this paper studies multi-modal detection and proposes the Masked Cross-modal Reconstruction Network (MCR-Net). The core idea is to learn bidirectional RGB-to-3D correspondence by reconstructing randomly masked tokens from the complementary modality; defects are located where reconstruction residuals deviate from the normal manifold learned without any anomalous training samples. The proposed method is evaluated on defect detection and segmentation tasks on the largest and most challenging multi-modal IAD benchmark, the MVTec 3D-AD dataset, to demonstrate its superiority over the state of the art.
In industrial assembly line manufacturing, visual inspection is crucial for maintaining high-quality standards. The detection of defects can be accomplished using anomaly detection, by identifying irregularities in RGB images. However, currently available datasets mostly contain only pose-invariant samples. Lately, this has resulted in the emergence of datasets that supplement RGB image data with depth information, enabling defect detection with respect to the geometrical surface of objects. Nevertheless, the acquisition costs for high-resolution 3D point cloud data are considerably higher than the costs of conventional RGB cameras. Other datasets contain only synthetic 3D representations of the objects for model training, which leaves a considerable gap in applicability to real-world industrial applications. Conversely, multi-component CAD models are already an abundantly available source of 3D information, yet they are not fully leveraged for anomaly detection. In this paper, we introduce the dataset PCAD - a dataset acquired using a specialized hardware setup designed to capture high-quality images of assembly groups with varying complexity and different poses in a systematic and reproducible way. The dataset includes variations in structural and semantic anomalies, poses, and lighting conditions, providing a comprehensive real-world scenario for defect detection. We show the usability and efficacy of our dataset regarding various state-of-the-art anomaly detection models. Furthermore, we demonstrate its applicability to visual quality inspection and its potential to support future research in this field. The code and dataset are publicly available: https://github.com/tmdt-buw/pcad-dataset
Printed Circuit Board Assembly (PCBA) inspection remains a critical step in electronics manufacturing, yet conventional approaches—either purely manual or based on rule-driven Automated Optical Inspection (AOI)—struggle with the increasing miniaturization of SMD components, variable lighting conditions, and the need for rapid adaptation to new product designs. This work introduces an industrial-grade robotic platform that integrates high-resolution imaging, adaptive RGB illumination, robotic actuation, YOLOv8-based defect detection, and photogrammetric 3D reconstruction. The system enables 360° multi-view acquisition through a custom rotation mechanism and provides detailed component-level analysis with high accuracy, outperforming traditional camera-only and commercial smart-sensor approaches. Validated in a smartphone production line in Manaus, Brazil, the platform demonstrates strong robustness, reduced false detections, enhanced traceability through 3D modeling, and a cost–performance ratio suitable for large-scale industrial deployment.
Automated quality inspection in real-world factories must contend with complex backgrounds that can obscure subtle product defects. We investigate whether explicitly removing background regions benefits multimodal anomaly detection on RGB + 3D point-cloud data. Focusing on washing-drum assembly, we isolate the drum via 3D spatial filtering to create Foreground-Only inputs and compare them with the unprocessed Original scenes. Three state-of-the-art unsupervised models, Asymmetric Student–Teacher (AST) [2], Shape-Guided Dual-Memory [3], and 3DSR [4], are trained solely on normal samples and evaluated on a balanced test set. Image-level AUROC rises consistently for all models when using foreground data: from 0.973 to 0.991 for 3DSR, 0.983 to 1.000 for AST, and 0.949 to 0.999 for Shape-Guided, yielding a mean gain of approximately 3 percentage points and a perfect score for AST. Qualitative inspection shows that background removal eliminates false positives and concentrates anomaly heatmaps on genuine defects. These results demonstrate that foreground extraction is a simple yet powerful preprocessing step for RGB + 3D anomaly detection in cluttered industrial environments and should be considered a standard component of deployment pipelines.
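The preprocessing step itself is simple; a minimal sketch, assuming a registered RGB image, a per-pixel point map from the 3D sensor, and a known axis-aligned box around the drum (all assumptions made here for illustration):

```python
import numpy as np

def foreground_only(rgb, xyz, box_min, box_max, fill=0):
    """Keep only pixels whose 3D point falls inside an axis-aligned box around
    the part of interest; everything else is blanked out.

    rgb : (H, W, 3) image registered with the point cloud
    xyz : (H, W, 3) per-pixel 3D coordinates from a structured scanner
    box_min, box_max : (3,) corners of the region containing the drum
    """
    inside = np.all((xyz >= box_min) & (xyz <= box_max), axis=-1)   # (H, W) mask
    rgb_fg = np.where(inside[..., None], rgb, fill)
    xyz_fg = np.where(inside[..., None], xyz, np.nan)               # drop removed points
    return rgb_fg, xyz_fg, inside
```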
No abstract available
Industrial visual inspection increasingly incorporates complementary sensors, including depth, thermal, and surface normals, to capture defects that RGB imagery alone cannot reveal. Current fusion approaches face three limitations that hinder reliable, deployable inspection: convolutional neural networks exhibit limited local receptive fields that impede aggregation of long-range and orientation-dependent context necessary for elongated or subtle defect detection; Vision Transformers deliver global interactions but incur quadratic compute and memory costs that scale poorly with high-resolution multimodal inputs common in industrial settings; and modest sensor misregistration together with modality-specific noise lead to cross-modal contamination and degraded pixel-level localization. To address these gaps, we propose MambaAlign, an alignment-aware state-space fusion framework that refines each modality with Per-Modal Mamba Modules (PMMs) built on state-space models and QuadSnake scanning to capture long-range, orientation-aware context while preserving spatial coherence. MambaAlign enables semantic, content-conditioned cross-modal exchange through a lightweight Cross Mamba Interaction (CMI) applied at deep semantic stages, which provides cross-modal guidance with near-linear complexity and reduced sensitivity to spatial offsets. A top-down Alignment-Aware Fusion (AAF) reconstitutes low-level channels via local fusion and channel reconstruction, tolerating small spatial misalignments and preserving precise localization. Extensive evaluation on multiple multimodal anomaly detection benchmarks demonstrates large, consistent gains in both image-level detection and pixel-level localization; averaged across three datasets, MambaAlign improves I-AUROC by 4.8%, P-AUROC by 5.0%, and AUPRO by 6.5% while maintaining a competitive runtime of 30 FPS.
This paper presents an integrated three-dimensional (3D) quality inspection system for mold manufacturing that addresses critical industrial constraints, including zero-shot generalization without retraining, complete decision traceability for regulatory compliance, and robustness under severe data shortages (<2% defect rate). Dual optical sensors (Photoneo MotionCam 3D and SICK Ruler) are integrated via affine transformation-based registration, followed by computer-aided design (CAD)-based classification using geometric feature matching to CAD specifications. Unsupervised defect detection combines density-based spatial clustering of applications with noise (DBSCAN) clustering, curvature analysis, and alpha shape boundary estimation to identify surface anomalies without labeled training data. Industrial validation on 38 product classes (3000 samples) yielded 99.00% classification accuracy and 99.12% macroscopic precision, outperforming Point-MAE (93.24%) trained under the same limited-data conditions. The CAD-based architecture enables immediate deployment via CAD reference registration, eliminating the five-day retraining cycle required for deep learning, essential for agile manufacturing. Processing time stability (0.47 s compared to 43.68 s for Point-MAE) ensures predictable production throughput. Defect detection achieved 98.00% accuracy on a synthetic validation dataset (scratches: 97.25% F1; dents: 98.15% F1).
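As a rough illustration of the unsupervised pipeline (deviation from the registered CAD reference, then DBSCAN clustering of high-deviation points), the sketch below uses SciPy and scikit-learn; the thresholds and units are placeholders, and curvature and alpha-shape analysis are omitted, so it simplifies the described system.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import DBSCAN

def surface_defect_candidates(scan_pts, cad_pts, dev_thresh=0.2,
                              eps=1.0, min_samples=20):
    """Cluster points that deviate from the CAD reference beyond dev_thresh;
    each DBSCAN cluster is a candidate surface anomaly.

    scan_pts, cad_pts : (N, 3) / (M, 3) registered point sets in the same units.
    """
    dists, _ = cKDTree(cad_pts).query(scan_pts)      # point-to-reference deviation
    outliers = scan_pts[dists > dev_thresh]
    if len(outliers) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(outliers)
    return [outliers[labels == k] for k in set(labels) if k != -1]
```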
Additive manufacturing (AM, also known as 3D printing) is a bottom–up process where variations in process conditions can significantly influence the quality and performance of the printed parts. Digital twin (DT) technology can measure process parameters and printed part characteristics in real-time, achieving online monitoring, analysis, and optimization of the AM process. Existing DT research on AM focuses on simulating the printing process and lacks real-time defect detection and twinning of actual printed objects, which hinders the timely detection and correction of defects. This study developed a DT system for fused deposition modeling (FDM) AM technology that not only accurately simulates the printing process but also performs real-time quality monitoring of the printed parts. A laser profilometer and industrial camera were integrated into the printer to detect and collect real-time morphological data on the printed object. The custom-developed DT software could convert the morphological data of the printed parts into a DT model. By comparing the DT model of the printed object with its three-dimensional model, defect detection of the printed parts was achieved, where the quality of the printed parts was evaluated using a defect percentage index. This study combines DT and AM to achieve process quality monitoring, demonstrating the potential of DT technology in reducing printing defects and improving the quality of printed parts.
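The defect percentage index is not defined in the abstract; one natural reading, sketched below, is the fraction of scanned surface points lying farther than a tolerance from the nominal 3D model (the tolerance value is an assumption).

```python
import numpy as np
from scipy.spatial import cKDTree

def defect_percentage(scan_pts, reference_pts, tolerance=0.1):
    """Percentage of scanned points whose distance to the sampled nominal
    model exceeds the tolerance (both point sets must be registered)."""
    dists, _ = cKDTree(reference_pts).query(scan_pts)
    return 100.0 * np.mean(dists > tolerance)
```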
With the widespread application of 3D printing in industries such as industrial manufacturing, aerospace, and construction, increasing attention is being paid to defects that arise during the printing process. To address defect detection, academia and industry have proposed numerous solutions, yet these still exhibit several limitations: 1) Traditional defect detection methods often require the collection, fine-tuning, and training of various types of defects. Given the vast amount of data in the 3D printing process, this demands significant time and memory consumption; 2) When dealing with complex defect issues, traditional models struggle to achieve desired results due to limitations in their structure and parameter scale; 3) For different printing tasks, traditional defect detection methods necessitate extensive fine-tuning and processing before defect recognition knowledge can be transferred. To tackle these issues, this study introduces a 3D printing defect detection method based on Multimodal Large Language Models (MLLM) and Retrieval Augmented Generation (RAG). Leveraging the vast parameter count and computational power of the large model, coupled with Prompt Engineering technology, a knowledge base pertaining to 3D printing defects is constructed. Examples composed of partial printed defect data are created to train the model, enabling it to precisely articulate defect issues. Ultimately, it continuously learns from existing data to accurately describe unknown defect problems. Through database verification and manual evaluation of the large model, we have confirmed the effectiveness of its question-answering results in 3D printing defect detection.
No abstract available
To address the complexity and diversity of abnormal data and the difficulty of collecting and annotating it in real industrial scenarios, a multimodal anomaly detection method based on unsupervised feature measurement is proposed. First, a rotation-invariant 3D point cloud feature extraction network is constructed to handle pose differences in the data and strengthen the model's ability to capture 3D features; second, an unsupervised multimodal feature modeling module is designed to extract richer feature information; finally, a feature measurement module is introduced to improve the accuracy of anomaly detection and localization. Comparisons with existing advanced methods on the MVTec-3D dataset show that the method performs excellently in anomaly detection, localization, and robustness to interference from abnormal data. Ablation experiments further verify the effectiveness of the proposed 3D feature extractor and multimodal fusion method.
Early crack detection enables timely maintenance actions, which in turn help extend pavement life and reduce maintenance costs. Traditional 2D detection lacks detail, while 3D detection faces accuracy and efficiency challenges. This paper proposes a hierarchical crack detection framework—F2CrackDet-PCD (crack detection based on point cloud data with filtering and fusion). The framework adopts a pre-filtering and fine-segmentation strategy built on multi-scale anomaly region filtering (MARF). First, MARF uses point cloud characteristics to quickly identify potential crack regions. Then, an orthogonal projection converts 3D data into RGB, depth, and normal images, which are combined by MIF-CrackNet (multimodal interaction fusion) to enhance detection accuracy and robustness. Two datasets were developed: RoadScan-2228, capturing realistic road scenes, and CrackNet-1187, emphasizing densely cracked pavement. Experimental results show that MARF achieves a recall of about 98% on both datasets, while F2CrackDet-PCD achieves F1-scores of 75.0 on RoadScan-2228 and 78.2 on CrackNet-1187. F2CrackDet-PCD thus provides a lane-level 3D point cloud crack detection solution for large-scale road inspection.
Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB-3D and RGB-Text, enabled by point cloud sensing and vision-language models. Despite shared challenges, these lines remain largely isolated, hindering a comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings in the matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining the anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, followed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB-3D, RGB-Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
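The matching step that produces the cost volume can be sketched compactly; the cosine-distance construction and the min-reduction below are a generic illustration of matching a test sample against normal references, not UCF's specific formulation or its learnable filtering module.

```python
import torch
import torch.nn.functional as F

def anomaly_cost_volume(test_feats, normal_feats):
    """Cost volume for one test image: cosine distance between each test patch
    feature and every reference patch feature pooled from normal samples.

    test_feats   : (P, D) patch features of the test image
    normal_feats : (R, D) patch features from normal references
    returns      : cost volume (P, R) and a raw anomaly map (P,)
    """
    t = F.normalize(test_feats, dim=-1)
    n = F.normalize(normal_feats, dim=-1)
    cost = 1.0 - t @ n.T                  # 0 = perfect match, 2 = opposite
    raw_map = cost.min(dim=-1).values     # best match per test patch
    return cost, raw_map                  # a filtering module would refine `cost`
```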
The lack of defect annotations in industrial settings due to commercial confidentiality or privacy concerns, has driven the development of zero-shot anomaly detection methods. Traditional 2D-based algorithms are sensitive to variations in lighting and viewpoint, making it challenging to capture geometric shape anomalies, while point cloud modalities tend to overlook color-based defects. To harness the complementary strengths of RGB images and point clouds, this paper proposes a zero-shot anomaly detection algorithm based on adaptive feature fusion (AFF). The algorithm fuses RGB images with depth maps derived from point clouds using adaptive feature fusion, and then evaluates anomaly scores through a cross-sample Patch Affinity Quantification (CPAQ) mechanism. Experiments on the MVTec3D-AD dataset demonstrate that the proposed method achieves an image-level AUROC of 84.5%, a pixel-level AUROC of 99.2%, and a PRO score of 97%, establishing a promising new paradigm for multimodal industrial anomaly detection.
The integration of Augmented Reality (AR) with Digital Twin (DT) technology is increasingly recognized as a transformative approach for achieving resilient and sustainable industrial operations. This research presents an Adaptive Digital Twin (DT) framework integrated with Augmented Reality (AR) to enhance resilience, predictive intelligence, and sustainability in industrial operations. The proposed Immersive Sensing Digital Twin (ImmersiSense-Twin) algorithm combines real-time 3D point cloud acquisition, multimodal sensor fusion, and edge-deployed artificial intelligence to deliver immersive, context-aware operational insights. The system architecture leverages low-latency edge processing and AR-based visualization to facilitate rapid anomaly detection, predictive maintenance, and resource optimization in dynamic industrial environments. Simulation experiments were conducted using industrial time-series datasets in a controlled digital twin simulation testbed. Performance was evaluated against three benchmark algorithms: Deep Embedded Clustering (DEC), LSTM-Autoencoder for Industrial Time-Series (LSTM-AE), and Federated Learning-Based Predictive Maintenance Model (FedPM). The comparison was based on prediction accuracy, anomaly detection rate, latency, resource utilization efficiency, and AR rendering time. The ImmersiSense-Twin algorithm achieved notable improvements, including up to 9.3% higher prediction accuracy, 12.5% faster anomaly detection, and 14.8% better resource utilization efficiency compared to the best-performing baseline. The results validate the suitability of the proposed framework for Industry 4.0 and emerging Industry 5.0 environments, offering a pathway toward more resilient, intelligent, and sustainable industrial ecosystems. The findings demonstrate that the integration of adaptive DT systems with immersive AR interaction provides significant operational advantages, enabling proactive decision-making and reducing downtime in mission-critical industrial applications.
Detecting anomalies within point clouds is crucial for various industrial applications, but traditional unsupervised methods face challenges due to data acquisition costs, early stage production constraints, and limited generalization across product categories. To overcome these challenges, we introduce the multiview projection (MVP) framework, leveraging pretrained vision-language models (VLMs) to detect anomalies. Specifically, MVP projects point cloud data into multiview depth images, thereby translating point cloud anomaly detection into image anomaly detection. Following zero-shot image anomaly detection methods, pretrained VLMs are utilized to detect anomalies on these depth images. Given that pretrained VLMs are not inherently tailored for zero-shot point cloud anomaly detection and may lack specificity, we propose the integration of learnable visual and adaptive text prompting techniques to fine-tune these VLMs, thereby enhancing their detection performance. Extensive experiments on the MVTec 3-D-AD and Real3D-AD demonstrate our proposed MVP framework’s superior zero-shot anomaly detection performance and the prompting techniques’ effectiveness. Real-world evaluations on automotive plastic part inspection further showcase that the proposed method can also be generalized to practical, unseen scenarios.
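The projection step can be sketched with a simple orthographic renderer; the rotation sweep, resolution, and the assumption of a centred, unit-scaled point cloud are simplifications made here for illustration, not the paper's rendering pipeline.

```python
import numpy as np

def orthographic_depth(points, res=224):
    """Project a centred, unit-scaled point cloud to a depth image along +z;
    far points are drawn first so nearer points overwrite them."""
    img = np.full((res, res), np.nan)
    uv = np.clip(((points[:, :2] + 1.0) / 2.0 * (res - 1)).astype(int), 0, res - 1)
    order = np.argsort(-points[:, 2])
    img[uv[order, 1], uv[order, 0]] = points[order, 2]
    return img

def multiview_depths(points, n_views=6):
    """Depth maps from n_views rotations about the y-axis; each map can then be
    fed to a zero-shot image anomaly detector."""
    views = []
    for theta in np.linspace(0, 2 * np.pi, n_views, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        views.append(orthographic_depth(points @ rot.T))
    return views
```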
No abstract available
Multimodal Large Language Models (MLLMs) show promise for general industrial quality inspection, but fall short in complex scenarios, such as Printed Circuit Board (PCB) inspection. PCB inspection poses unique challenges due to densely packed components, complex wiring structures, and subtle defect patterns that require specialized domain expertise. However, a high-quality, unified vision-language benchmark for quantitatively evaluating MLLMs across PCB inspection tasks remains absent, stemming not only from limited data availability but also from fragmented datasets and inconsistent standardization. To fill this gap, we propose UniPCB, the first unified vision-language benchmark for open-ended PCB quality inspection. UniPCB is built via a systematic pipeline that curates and standardizes data from disparate sources across three annotated scenarios. Furthermore, we introduce PCB-GPT, an MLLM trained on a new instruction dataset generated by this pipeline, utilizing a novel progressive curriculum that mimics the learning process of human experts. Evaluations on the UniPCB benchmark show that while existing MLLMs falter on domain-specific tasks, PCB-GPT establishes a new baseline. Notably, it more than doubles the performance on fine-grained defect localization compared to the strongest competitors, with significant advantages in localization and analysis. We will release the instruction data, benchmark, and model to facilitate future research.
Large vision–language models (VLMs) can describe images fluently, yet their anomaly decisions often rely on opaque heuristics and manual thresholds. We present ThinkAnomaly, a rationale-first vision–language framework for industrial anomaly detection. The model generates a concise structured rationale and then issues a calibrated yes/no decision, eliminating per-class thresholds. To supervise reasoning, we construct chain-of-thought annotations for MVTec-AD and VisA via synthesis, automatic filtering, and human validation. We fine-tune Llama-3.2-Vision with a two-stage objective and a rationale–label consistency loss, yielding state-of-the-art classification accuracy while maintaining a competitive detection AUC: MVTec-AD—93.9% accuracy and 93.8 Image-AUC; VisA—90.3% accuracy and 85.0 Image-AUC. This improves classification accuracy over AnomalyGPT by +7.8 (MVTec-AD) and +12.9 (VisA) percentage points. The explicit reasoning and calibrated decisions make ThinkAnomaly transparent and deployment-ready for industrial inspection.
Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
Electrical grid asset inspection is crucial for ensuring infrastructure reliability, preventing failures, and optimizing maintenance strategies. Traditional methods rely heavily on manual labor, making inspections time-consuming, costly, and prone to inconsistencies. Recent advances in deep learning, particularly CNNs and ViTs, have shown potential for automating visual inspection tasks. Vision models are evolving towards multimodal, large-scale architectures. While MLLMs trained on diverse internet-sourced datasets show strong performance, many industrial tasks require models fine-tuned on proprietary data to capture the nuanced features necessary for specialization. This work explores the capabilities and limitations of multi-modal large language models for electrical grid asset inspection, comparing pre-trained and fine-tuned versions of the Florence-VL model across tasks like detailed captioning, object detection, and segmentation. Ground truth labels for caption-related tasks were generated using GPT-4, while annotations from a dataset of insulator defects were used for region-related tasks. Results indicate that fine-tuning improves class recognition and object localization, though captions often remain simplistic and lack domain-specific detail. Despite GPT-4 providing accurate captions, it lacks batch-processing capabilities, limiting its scalability. These findings underscore the importance of domain-specific fine-tuning and high-quality data to enhance MLLM performance.
Recent advances in visual industrial anomaly detection have demonstrated exceptional performance in identifying and segmenting anomalous regions while maintaining fast inference speeds. However, anomaly classification, i.e., distinguishing different types of anomalies, remains largely unexplored despite its critical importance in real-world inspection tasks. To address this gap, we propose VELM, a novel LLM-based pipeline for anomaly classification. Given the critical importance of inference speed, we first apply an unsupervised anomaly detection method as a vision expert to assess the normality of an observation. If an anomaly is detected, the LLM then classifies its type. A key challenge in developing and evaluating anomaly classification models is the lack of precise annotations of anomaly classes in existing datasets. To address this limitation, we introduce MVTec-AC and VisA-AC, refined versions of the widely used MVTec-AD and VisA datasets, which include accurate anomaly class labels for rigorous evaluation. Our approach achieves a state-of-the-art anomaly classification accuracy of 80.4% on MVTec-AD, exceeding the prior baselines by 5%, and 84% on MVTec-AC, demonstrating the effectiveness of VELM in understanding and categorizing anomalies. We hope our methodology and benchmark inspire further research in anomaly classification, helping bridge the gap between detection and comprehensive anomaly characterization.
This expository paper introduces a simplified approach to image-based quality inspection in manufacturing using OpenAI's CLIP (Contrastive Language-Image Pretraining) model adapted for few-shot learning. While CLIP has demonstrated impressive capabilities in general computer vision tasks, its direct application to manufacturing inspection presents challenges due to the domain gap between its training data and industrial applications. We evaluate CLIP's effectiveness through five case studies: metallic pan surface inspection, 3D printing extrusion profile analysis, stochastic textured surface evaluation, automotive assembly inspection, and microstructure image classification. Our results show that CLIP can achieve high classification accuracy with relatively small learning sets (50-100 examples per class) for single-component and texture-based applications. However, the performance degrades with complex multi-component scenes. We provide a practical implementation framework that enables quality engineers to quickly assess CLIP's suitability for their specific applications before pursuing more complex solutions. This work establishes CLIP-based few-shot learning as an effective baseline approach that balances implementation simplicity with robust performance, demonstrated in several manufacturing quality control applications.
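A minimal sketch of the few-shot idea, assuming Hugging Face's CLIP implementation and a nearest-centroid decision rule; the checkpoint name and the centroid classifier are choices made here for illustration, not necessarily the paper's exact setup.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(images):
    """L2-normalized CLIP image embeddings for a list of PIL images."""
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def fit_centroids(examples_per_class):
    """examples_per_class: dict class_name -> list of images (e.g. 50-100 each)."""
    return {c: embed(imgs).mean(dim=0) for c, imgs in examples_per_class.items()}

def classify(image, centroids):
    """Nearest-centroid decision in CLIP embedding space."""
    q = embed([image])[0]
    return max(centroids, key=lambda c: float(q @ centroids[c]))
```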
As important material handling equipment in heavy industries such as steel and metallurgy, gripper machines are widely used for gripping and transferring heavy objects such as steel. However, traditional grippers rely mainly on manual control or simple automation procedures, which falls short of the demand for unmanned, intelligent grippers in smart factories. This paper therefore leverages the powerful feature learning and representation capability of deep convolutional neural networks and focuses on 3D attitude estimation of bar steel in industrial scenarios, taking stereo vision as the basis and combining it with LiDAR point cloud information of the target. To realize 3D attitude estimation of strip steel, a multimodal feature fusion network based on stereo vision and LiDAR point clouds is proposed. The experimental results show that the proposed multimodal interaction method significantly improves the accuracy of strip-steel 3D detection, with overall accuracy improved by 2.91% compared with similar fusion methods.
Few-Shot Industrial Anomaly Detection (FS-IAD) has important applications in automating industrial quality inspection. Recently, some FS-IAD methods based on Large Vision-Language Models (LVLMs) have been proposed, achieving some success through prompt learning or fine-tuning. However, existing LVLMs focus on general tasks but lack basic industrial knowledge and reasoning capabilities related to FS-IAD, making these methods far from specialized human quality inspectors. To address these challenges, we propose a unified framework, IADGPT, designed to perform FS-IAD in a human-like manner, while also handling associated localization and reasoning tasks, even for diverse and novel industrial products. To this end, we introduce a three-stage progressive training strategy inspired by humans. Specifically, the first two stages gradually guide IADGPT in acquiring fundamental industrial knowledge and discrepancy awareness. In the third stage, we design an in-context learning-based training paradigm, enabling IADGPT to leverage a few images as exemplars for improved generalization to novel products. In addition, we design a strategy that enables IADGPT to output image-level and pixel-level anomaly scores using the logits output and the attention map, respectively, in conjunction with the language output to accomplish anomaly reasoning. To support our training, we present a new dataset comprising 100K images across 400 diverse industrial product categories with extensive attribute-level textual annotations. Experiments indicate IADGPT achieves considerable performance gains in anomaly detection and demonstrates competitiveness in anomaly localization and reasoning. We will release our dataset with the camera-ready version.
We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence. Additional details and resources can be found at this URL: https://ninaneon.github.io/projectpage/
Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
Few-Shot Industrial Anomaly Detection (FSIAD) is an essential yet challenging problem in practical scenarios such as industrial quality inspection. Its objective is to identify previously unseen anomalous regions using only a limited number of normal support images from the same category. Recently, large pre-trained vision-language models (VLMs), such as CLIP, have exhibited remarkable few-shot image-text representation abilities across a range of visual tasks, including anomaly detection. Despite their promise, real-world industrial anomaly datasets often contain noisy labels, which can degrade prompt learning and detection performance. In this paper, we propose AnomalyNLP, a new Noisy-Label Prompt Learning approach designed to tackle the challenge of few-shot anomaly detection. This framework offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of VLMs for industrial anomaly detection. First, we design a Noisy-Label Prompt Learning (NLPL) strategy. This strategy utilizes feature learning principles to suppress the influence of noisy samples via Mean Absolute Error (MAE) loss, thereby improving the signal-to-noise ratio and enhancing overall model robustness. Furthermore, we introduce a prompt-driven optimal transport feature purification method to accurately partition datasets into clean and noisy subsets. For both image-level and pixel-level anomaly detection, AnomalyNLP achieves state-of-the-art performance across various few-shot settings on the MVTecAD and VisA public datasets. Qualitative and quantitative results on two datasets demonstrate that our method achieves the largest average AUC improvement over baseline methods across 1-, 2-, and 4-shot settings, with gains of up to 10.60%, 10.11%, and 9.55% in practical anomaly detection scenarios.
Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to precisely anchor to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
No abstract available
Visual anomaly classification and segmentation are vital for automating industrial quality inspection. The focus of prior research in the field has been on training custom models for each quality inspection task, which requires task-specific images and annotation. In this paper we move away from this regime, addressing zero-shot and few-normal-shot anomaly classification and segmentation. Recently CLIP, a vision-language model, has shown revolutionary generality with competitive zero-/few-shot performance in comparison to full supervision. But CLIP falls short on anomaly classification and segmentation tasks. Hence, we propose window-based CLIP (WinCLIP) with (1) a compositional ensemble on state words and prompt templates and (2) efficient extraction and aggregation of window/patch/image-level features aligned with text. We also propose its few-normal-shot extension WinCLIP+, which uses complementary information from normal images. In MVTec-AD (and VisA), without further tuning, WinCLIP achieves 91.8%/85.1% (78.1%/79.6%) AUROC in zero-shot anomaly classification and segmentation, while WinCLIP+ achieves 93.1%/95.2% (83.8%/96.4%) in the 1-normal-shot setting, surpassing the state of the art by large margins.
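The prompt-ensemble scoring at image level can be illustrated as below, again with Hugging Face's CLIP; the prompt templates, object name, and temperature are placeholders, and the window/patch aggregation that gives WinCLIP its segmentation ability is omitted.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

NORMAL  = ["a photo of a flawless {}", "a photo of a perfect {}"]
ANOMALY = ["a photo of a damaged {}", "a photo of a {} with a defect"]

@torch.no_grad()
def anomaly_score(image, obj="metal nut"):
    """Image-level zero-shot score: probability mass on the averaged anomalous
    prompt embedding versus the averaged normal one."""
    texts = [t.format(obj) for t in NORMAL + ANOMALY]
    inputs = processor(text=texts, images=[image], return_tensors="pt", padding=True)
    out = model(**inputs)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    v = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    normal_t, anom_t = t[: len(NORMAL)].mean(0), t[len(NORMAL):].mean(0)
    logits = 100.0 * v[0] @ torch.stack([normal_t, anom_t]).T
    return torch.softmax(logits, dim=-1)[1].item()
```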
Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image's visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining high-quality standards and minimizing costly recalls. Previous research in anomaly detection (AD) has relied on prior knowledge for designing algorithms, which often requires extensive manual annotations, significant computing power, and large amounts of data for training. Autoregressive, multimodal Vision Language Models (AVLMs) offer a promising alternative due to their exceptional performance in visual reasoning across various domains. Despite this, their application to logical AD remains unexplored. In this work, we investigate using AVLMs for logical AD and demonstrate that they are well-suited to the task. Combining AVLMs with format embedding and a logic reasoner, we achieve SOTA performance on the public MVTec LOCO AD benchmark, with an AUROC of 86.0% and an F1-max of 83.7%, along with explanations of the anomalies. This significantly outperforms the existing SOTA method by 18.1% in AUROC and 4.6% in F1-max score.
Visual defect detection is crucial for industrial quality control in intelligent manufacturing. Previous research requires target-specific data to train the model for each inspection task. However, due to the challenges of collecting proprietary data and model-training time costs, zero-shot defect detection (ZSDD) has become an emerging topic in the field. ZSDD, which requires models trained with auxiliary data, can detect defects on different products without target-data training. Recently, large pretrained vision-language models (VLMs), such as contrastive language-image pre-training model (CLIP), have demonstrated revolutionary generality with competitive zero-shot performance across various downstream tasks. However, VLMs have limitations in defect detection, which are designed to focus on identifying category semantics of the objects rather than sensing object attributes (defective/nondefective). The current VLMs-based ZSDD methods require manually crafted text prompts to guide the discovery of anomaly attributes. In this article, we propose a novel ZSDD method, namely attribute-aware CLIP, to adapt CLIP for anomaly attribute discovery without designing specific textual prompts. The core is designing a textual domain bridge, which transforms simple general textual prompt features into prompt embeddings better aligned with the attribute awareness. This enables the model to perceive the attributes of objects by text-image feature matching, bridging the gap between object semantic recognition and attribute discovery. Additionally, we perform component clustering on the images to break down the overall object semantics, encouraging the model to focus on attribute awareness. Extensive experiments on 16 real-world defect datasets demonstrate that our method achieves state-of-the-art (SOTA) ZSDD performance in diverse class-semantic datasets.
Surface defect detection is pivotal for industrial quality control, yet existing deep learning methods face challenges such as data dependency, annotation complexity, and limited generalization. This paper introduces ZSDD (Zero-Shot Surface Defect Detection and Segmentation), a novel framework that synergizes pre-trained vision-language models (CLIP, Grounding-DINO) and segmentation models (SAM) to detect and segment defects without task-specific training or annotated data. ZSDD operates in three stages: (1) CLIP-based zero-shot classification using compositional text prompts, (2) text-guided defect localization via Grounding-DINO, and (3) SAM-powered pixel-level segmentation. Evaluated on the MVTec-AD benchmark, ZSDD achieves state-of-the-art performance, with 100% AUROC on tile, wood, and leather surfaces, while significantly reducing deployment costs. The framework’s modular design addresses key challenges in industrial inspection, including defect diversity and annotation complexity, offering a practical, scalable solution for real-world quality control.
Unified vision-language models (VLMs) promise to streamline computer vision pipelines by reframing multiple visual tasks—such as classification, detection, and keypoint localization—within a single language-driven interface. This architecture is particularly appealing in industrial inspection, where managing disjoint task-specific models introduces complexity, inefficiency, and maintenance overhead. In this paper, we critically evaluate the viability of this unified paradigm using InspectVLM, a Florence-2–based VLM trained on InspectMM, our new large-scale multimodal, multitask inspection dataset. While InspectVLM performs competitively on image-level classification and structured keypoint tasks, we find that it fails to match traditional ResNet-based models in core inspection metrics. Notably, the model exhibits brittle behavior under low prompt variability, produces degenerate outputs for fine-grained object detection, and frequently defaults to memorized language responses regardless of visual input. Our findings suggest that while language-driven unification offers conceptual elegance, current VLMs lack the visual grounding and robustness necessary for deployment in precision-critical industrial inspections.
DHR-CLIP: Dynamic High-Resolution Object-Agnostic Prompt Learning for Zero-shot Anomaly Segmentation
Zero-shot anomaly segmentation (ZSAS) is crucial for detecting and localizing defects in target datasets without the need for training samples. This approach is particularly valuable in industrial quality control, where there are distributional shifts between training and operational environments or when data access is restricted. Recent vision-language models have demonstrated strong zero-shot performance across various visual tasks. However, the variations in the granularity of local anomaly regions due to resolution changes and their focus on class semantics make it challenging to directly apply them to ZSAS. To address these issues, we propose DHR-CLIP, a novel approach that incorporates dynamic high-resolution processing to enhance ZSAS in industrial inspection tasks. Additionally, we adapt object-agnostic prompt design to detect normal and anomalous patterns without relying on specific object semantics. Finally, we implement deep-text prompt tuning in the text encoder for refined textual representations and employ V-V attention layers in the vision encoder to capture detailed local features. Our integrated framework enables effective identification of fine-grained anomalies through refinement of image and text prompt design, providing precise localization of defects. The effectiveness of DHR-CLIP has been demonstrated through comprehensive experiments on real-world industrial datasets, MVTecAD and VisA, achieving strong performance and generalization capabilities across diverse industrial scenarios.
Zero-shot anomaly detection (ZSAD) leverages vision-language models to achieve anomaly classification and segmentation without training on target datasets, showing strong potential for industrial inspection. However, existing methods face two limitations. First, they rely heavily on handcrafted text prompts and predefined object priors, restricting adaptability. Second, the alignment between modalities is overly simplistic, leading to insufficient integration between high-level semantics and fine-grained visual details. To address these issues, we propose an adaptive multi-scale CLIP-based framework (AMSCLIP) for ZSAD. Our approach introduces Adaptive Context Prompting (ACP), which dynamically refines prompt representations with visual priors in an object-agnostic manner, improving flexibility and generalization. Furthermore, we design the Multi-scale Perception Cross-modal Interaction (MPCI) module that combines multi-scale feature aggregation with attention mechanisms to enhance sensitivity to subtle anomalies, and employs an adapter to capture global semantics. Experimental results demonstrate our framework’s outstanding performance across seven industrial anomaly detection datasets.
Zero-Shot Anomaly Detection (ZSAD) aims to identify anomalies without task-specific training samples and is widely applied in industrial defect inspection and medical image analysis. Recently, the vision-language model CLIP has shown strong performance in ZSAD due to its cross-modal alignment and generalization ability. However, CLIP is designed for natural image classification, relying on global semantics while being insensitive to local details. Its static text-image fusion also limits the use of textual prompts, reducing performance on small-scale, complex-texture, or weakly semantic anomalies. To address this, we propose LACLIP, an enhanced vision-language anomaly detection framework. We incorporate Local Self-Correlation Attention (LSC-Attention) to improve local context modeling and introduce an Anomaly Semantic Aggregation (ASA) module to guide text-image matching with more discriminative semantics. Experiments show that LACLIP achieves state-of-the-art results on MVTec AD and VisA for both image-level classification and pixel-level localization, highlighting its potential for real-world applications.
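To make the local self-correlation idea concrete, here is a rough, self-contained sketch: each patch is re-aggregated from its k x k spatial neighbours, weighted by feature correlation. The kernel size, cosine weighting, and tensor layout are assumptions for illustration, not LACLIP's exact LSC-Attention module.

```python
import torch
import torch.nn.functional as F

def local_self_correlation(feat: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Aggregate each patch with its k x k spatial neighbours, weighted by
    feature correlation (a rough reading of local self-correlation attention).

    feat: (B, C, H, W) patch feature map.
    """
    B, C, H, W = feat.shape
    pad = k // 2
    # Gather the k*k neighbours of every location: (B, C*k*k, H*W) -> (B, H*W, k*k, C)
    nbrs = F.unfold(feat, kernel_size=k, padding=pad)
    nbrs = nbrs.view(B, C, k * k, H * W).permute(0, 3, 2, 1)
    centre = feat.flatten(2).transpose(1, 2).unsqueeze(2)              # (B, H*W, 1, C)
    sim = F.cosine_similarity(centre.expand_as(nbrs), nbrs, dim=-1)    # (B, H*W, k*k)
    w = F.softmax(sim, dim=-1).unsqueeze(-1)                           # (B, H*W, k*k, 1)
    out = (w * nbrs).sum(dim=2)                                        # (B, H*W, C)
    return out.transpose(1, 2).view(B, C, H, W)

feat = torch.randn(1, 512, 14, 14)
print(local_self_correlation(feat).shape)  # torch.Size([1, 512, 14, 14])
```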
Accurate photovoltaic module defect detection is hindered by limited single-modality information and weak feature representation under complex environmental conditions. To enhance defect identification in photovoltaic modules, a cross-modal attention and semantic-aware segmentation network is proposed herein. The framework introduces a cross-modal attention interaction module that adaptively balances visible and thermal features to strengthen inter-modal complementarity. A semantic-aware fusion module is also developed to refine semantic representation and improve feature fusion consistency. In addition, a dedicated multimodal dataset is constructed to support model evaluation. The proposed method demonstrates superior robustness and generalization capability across diverse scenarios. This work provides an effective and scalable approach for improving defect detection performance in photovoltaic systems via multimodal feature integration and attention-guided semantic learning.
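A hedged sketch of the cross-modal attention interaction described above: visible and thermal tokens attend to each other and are merged by a learned gate. The dimensions, single-layer design, and gating choice are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Bidirectional cross-attention between visible (RGB) and thermal tokens,
    followed by a simple gated sum (illustrative design only)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.rgb_from_ir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ir_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # rgb, ir: (B, N, dim) tokenised feature maps of the two modalities.
        rgb_enh, _ = self.rgb_from_ir(rgb, ir, ir)   # RGB queries thermal
        ir_enh, _ = self.ir_from_rgb(ir, rgb, rgb)   # thermal queries RGB
        g = self.gate(torch.cat([rgb_enh, ir_enh], dim=-1))
        return g * rgb_enh + (1 - g) * ir_enh        # fused tokens (B, N, dim)

fusion = CrossModalAttentionFusion()
rgb, ir = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
print(fusion(rgb, ir).shape)  # torch.Size([2, 196, 256])
```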
Crack detection and rehabilitation are critical components of a pavement’s life cycle. Various detection methods have been developed, among which classification, object detection, and segmentation deep-learning approaches have been revolutionary. Segmentation models enable the pixel-wise delineation of crack networks, which are used in quantifying severity, defect type, and condition index of distress. However, supervised segmentation algorithms require a substantial amount of pixel-accurate ground truth labels, which are challenging to obtain. Additionally, current models exhibit limited generalizability to unseen data, with state-of-the-art models performing inadequately in detecting low-severity cracks. This article therefore presents a novel crack segmentation approach leveraging Meta’s Segment Anything Model (SAM) and low-cost ground truths. We fine-tune the SAM model using box, points, and text prompts, enhancing the model’s generalizability and improving crack fidelity. The model achieves a 94% F1 score on the authors’ dataset and 91% and 77% F1 scores on the Fully Convolutional Network (FCN) dataset and Crack Forest Dataset (CFD), respectively. Our approach outperforms the U-Net, DeepLabV3+, and TransUNet models on the FCN dataset and achieves comparable performance on the CFD dataset. Exploring different loss combinations during training reveals that a dice and binary cross-entropy loss combination does not significantly outperform a dice and focal loss combination. The use of text prompts in querying the images is also examined. Although initial results look promising, their segmentation and classification accuracies are relatively lower.
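The abstract compares a dice + binary cross-entropy combination against dice + focal loss during fine-tuning; a minimal dice + BCE loss for binary crack masks looks roughly like this (the 0.5/0.5 weighting is an assumed default, not the authors' setting).

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits: torch.Tensor, target: torch.Tensor,
                  dice_weight: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """Combined dice + binary cross-entropy loss for binary crack masks.

    logits, target: (B, 1, H, W); target values in {0, 1}.
    """
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return dice_weight * dice.mean() + (1 - dice_weight) * bce.mean()

logits = torch.randn(4, 1, 128, 128)
target = (torch.rand(4, 1, 128, 128) > 0.95).float()
print(dice_bce_loss(logits, target).item())
```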
Existing surface defect semantic segmentation methods are limited by costly annotated data and are unable to cope with new or rare defect types. Zero-shot learning offers a new possibility for addressing this issue by reducing reliance on extensive annotated data. However, methods that solely rely on image information waste the valuable experience that humans have accumulated in the field of defect detection. In this work, we propose a human-guided segmentation network (HGNet) based on CLIP, introducing human guidance to address the data scarcity and effectively leverage expert knowledge, leading to more accurate and reliable surface defect segmentation. HGNet, guided by the human-provided text, consists of two novel modules: 1) attention-based multilevel feature fusion (AMFF) which effectively integrates multilevel features using attention mechanisms to enhance the fine-grained information capture and 2) multimodal feature adaptive balancing (MFAB) which aligns and balances multimodal features through dynamic adjustment and optimization. Moreover, we extend HGNet to HGNet+ by incorporating interactive learning to correct segmentation errors with human-provided points. Our proposed method can generalize to unseen classes without additional training samples for retraining, meeting the practical needs of industrial defect detection. Extensive experiments on Defect-4^i (and MVTec-ZSS) demonstrate that our method outperforms the state-of-the-art zero-shot methods by 5.7%/7.81% (6.57%/8.06%) and is even comparable to the performance of existing few-shot methods.
Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines. With the aid of this framework, MuSc-V2 achieves significant performance improvements: a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at https://github.com/HUST-SLOW/MuSc-V2.
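The Mutual Scoring Mechanism is the most code-friendly part of this pipeline; a much-simplified, single-modality sketch is given below, where every patch is scored by how well it matches patches from the other unlabeled images. The k value and cosine scoring are assumptions, and IPG, SNAMD, CAE, and RsCon are omitted entirely.

```python
import torch

def mutual_score(patch_feats: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Simplified mutual scoring over a set of unlabeled images.

    patch_feats: (N_images, N_patches, dim), L2-normalised patch features.
    Each patch is scored by the mean cosine distance to its k most similar
    patches drawn from all *other* images: normal patches find many close
    matches elsewhere, anomalies do not.
    """
    n, p, _ = patch_feats.shape
    scores = torch.zeros(n, p)
    for i in range(n):
        others = torch.cat([patch_feats[j] for j in range(n) if j != i])  # ((n-1)*p, d)
        sim = patch_feats[i] @ others.T                                   # (p, (n-1)*p)
        topk = sim.topk(k, dim=-1).values                                 # (p, k)
        scores[i] = 1.0 - topk.mean(dim=-1)                               # higher = more anomalous
    return scores

feats = torch.nn.functional.normalize(torch.randn(8, 196, 512), dim=-1)
print(mutual_score(feats).shape)  # torch.Size([8, 196])
```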
Obtaining labeled data in the field of industrial anomaly detection is challenging, which necessitates the development of label-free frameworks. However, current methods mainly focus on the unsupervised paradigm, which uses a large number of normal samples of the same category to train the model, and distinguish anomalies during testing. This training approach necessitates retraining when new datasets or object categories are encountered. Recently, studies have suggested using large pre-trained multimodal vision-language models, such as CLIP, for zero-shot and few-shot anomaly detection, yielding promising outcomes. However, the lack of spatial awareness of these models results in less effectiveness in dense prediction tasks such as anomaly localization. To mitigate this issue, various fine-tuning methods using additional labeled anomaly data have been employed. In other words, substantial data and extensive training efforts are still necessary to ensure optimal model performance on specific datasets. In this paper, we introduce a training-free, CLIP-based model that utilizes patch correlations and prototype guidance to enable zero-shot and few-shot anomaly detection. Specifically, we first use a self-supervised pre-trained model to capture patch correlations within a single image, enhancing the model's regional awareness of defects. Then, we dynamically construct prototypes using a retrieval-enhanced method to alleviate domain gap in general domain models for anomaly detection. Extensive experiments on the popular benchmarks MVTec and VisA demonstrate that our approach achieves state-of-the-art performance across nearly all metrics. Furthermore, we validate the generalization of our method on collected real industrial data.
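A rough sketch of the prototype-guided scoring step: once a bank of normal prototypes has been assembled (e.g. by retrieval or from a few reference images), the training-free anomaly map is simply one minus the best cosine match. How the bank is built and how patch correlations refine it are method-specific and omitted; all shapes here are illustrative.

```python
import torch
import torch.nn.functional as F

def prototype_anomaly_map(test_patches: torch.Tensor,
                          prototypes: torch.Tensor,
                          hw: tuple = (14, 14)) -> torch.Tensor:
    """Score test patches against a bank of normal prototypes (training-free).

    test_patches: (N, dim) patch features of one test image.
    prototypes:   (M, dim) prototype features pooled from reference data.
    Returns an (H, W) anomaly map: 1 - max cosine similarity to any prototype.
    """
    t = F.normalize(test_patches, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    score = 1.0 - (t @ p.T).max(dim=-1).values       # (N,)
    return score.view(*hw)

test = torch.randn(196, 512)
bank = torch.randn(50, 512)
print(prototype_anomaly_map(test, bank).shape)  # torch.Size([14, 14])
```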
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: https://xujiacong.github.io/Anomaly-OV/
Additive manufacturing enables the fabrication of complex designs while minimizing waste, but faces challenges related to defects and process anomalies. This study presents a novel multimodal Retrieval-Augmented Generation-based framework that automates anomaly detection across various Additive Manufacturing processes leveraging retrieved information from literature, including images and descriptive text, rather than training datasets. This framework integrates text and image retrieval from scientific literature and multimodal generation models to perform zero-shot anomaly identification, classification, and explanation generation in a Laser Powder Bed Fusion setting. The proposed framework is evaluated on four L-PBF manufacturing datasets from Oak Ridge National Laboratory, featuring various printer makes, models, and materials. This evaluation demonstrates the framework's adaptability and generalizability across diverse images without requiring additional training. Comparative analysis using Qwen2-VL-2B and GPT-4o-mini as MLLM within the proposed framework highlights that GPT-4o-mini outperforms Qwen2-VL-2B and proportional random baseline in manufacturing anomalies classification. Additionally, the evaluation of the RAG system confirms that incorporating retrieval mechanisms improves average accuracy by 12% by reducing the risk of hallucination and providing additional information. The proposed framework can be continuously updated by integrating emerging research, allowing seamless adaptation to the evolving landscape of AM technologies. This scalable, automated, and zero-shot-capable framework streamlines AM anomaly analysis, enhancing efficiency and accuracy.
Zero-shot 3D (ZS-3D) anomaly detection aims to identify defects in 3D objects without relying on labeled training data, making it especially valuable in scenarios constrained by data scarcity, privacy, or high annotation cost. However, most existing methods focus exclusively on point clouds, neglecting the rich semantic cues available from complementary modalities such as RGB images and text priors. This paper introduces MCL-AD, a novel framework that leverages multimodal collaboration learning across point clouds, RGB images, and text semantics to achieve superior zero-shot 3D anomaly detection. Specifically, we propose a Multimodal Prompt Learning Mechanism (MPLM) that enhances the intra-modal representation capability and inter-modal collaborative learning by introducing an object-agnostic decoupled text prompt and a multimodal contrastive loss. In addition, a collaborative modulation mechanism (CMM) is proposed to fully leverage the complementary representations of point clouds and RGB images by jointly modulating the RGB image-guided and point cloud-guided branches. Extensive experiments demonstrate that the proposed MCL-AD framework achieves state-of-the-art performance in ZS-3D anomaly detection.
Zero-Shot Industrial Anomaly Detection (ZSIAD) aims to identify and localize anomalies in industrial images from unseen categories. Owing to their powerful generalization capabilities, Vision-Language Models (VLMs) have attracted growing interest in ZSIAD. To guide the model toward understanding and localizing semantically complex industrial anomalies, existing VLM-based methods have attempted to provide additional prompts to the model through learnable text prompt templates. However, these zero-shot methods lack detailed descriptions of specific anomalies, making it difficult to classify and segment the diverse range of industrial anomalies accurately. To address this issue, we first propose a multi-stage prompt generation agent for ZSIAD. Specifically, we leverage a Multimodal Large Language Model (MLLM) to articulate the detailed differential information between normal and test samples, which provides detailed text prompts to the model through further refinement and an anti-false-alarm constraint. Moreover, we introduce a Visual Foundation Model (VFM) to generate anomaly-related attention prompts for more accurate localization of anomalies with varying sizes and shapes. Extensive experiments on seven real-world industrial anomaly detection datasets show that the proposed method not only outperforms recent SOTA methods, but also provides, through its explainable prompts, a more intuitive basis for anomaly identification.
Enhancing the alignment between text and image features in the CLIP model is a critical challenge in zero-shot industrial anomaly detection tasks. Recent studies predominantly utilize specific category prompts during pretraining, which can cause overfitting to the training categories and limit model generalization. To address this, we propose a method that transforms category names through multicategory name stacking to create stacked prompts, forming the basis of our StackCLIP model. Our approach introduces two key components. The Clustering-Driven Stacked Prompts (CSP) module constructs generic prompts by stacking semantically analogous categories, while utilizing multi-object textual feature fusion to amplify discriminative anomalies among similar objects. The Ensemble Feature Alignment (EFA) module trains knowledge-specific linear layers tailored for each stack cluster and adaptively integrates them based on the attributes of test categories. These modules work together to deliver superior training speed, stability, and convergence, significantly boosting anomaly segmentation performance. Additionally, our stacked prompt framework offers robust generalization across classification tasks. To further improve performance, we introduce the Regulating Prompt Learning (RPL) module, which leverages the generalization power of stacked prompts to refine prompt learning, elevating results in anomaly detection classification tasks. Extensive testing on seven industrial anomaly detection datasets demonstrates that our method achieves state-of-the-art performance in both zero-shot anomaly detection and segmentation tasks.
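To illustrate the stacked-prompt idea, the sketch below concatenates the category names of one semantic cluster into a single prompt and scores a patch feature against normal/abnormal stacked prompts. The prompt template and the stub text encoder are placeholders, not StackCLIP's actual components, and the clustering, EFA, and RPL modules are omitted.

```python
import torch
import torch.nn.functional as F

def build_stacked_prompt(categories, state):
    """Stack the names of one semantic cluster into a single prompt, e.g.
    'a photo of a damaged bolt, screw or nut'. Template wording is illustrative."""
    if len(categories) > 1:
        names = ", ".join(categories[:-1]) + " or " + categories[-1]
    else:
        names = categories[0]
    return f"a photo of a {state} {names}"

def encode_text(prompt, dim=512):
    # Stub standing in for CLIP's text encoder: returns a unit-norm embedding.
    return F.normalize(torch.randn(dim), dim=0)

cluster = ["bolt", "screw", "nut"]                       # one clustered prompt group
normal_t = encode_text(build_stacked_prompt(cluster, "flawless"))
abnormal_t = encode_text(build_stacked_prompt(cluster, "damaged"))
patch = F.normalize(torch.randn(512), dim=0)             # one aligned patch feature
anomaly_logit = (patch @ abnormal_t - patch @ normal_t).item()
print(build_stacked_prompt(cluster, "damaged"), anomaly_logit)
```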
Anomaly segmentation is essential for industrial quality, maintenance, and stability. Existing text-guided zero-shot anomaly segmentation models are effective but rely on fixed prompts, limiting adaptability in diverse industrial scenarios. This highlights the need for flexible, context-aware prompting strategies. We propose Image-Aware Prompt Anomaly Segmentation (IAP-AS), which enhances anomaly segmentation by generating dynamic, context-aware prompts using an image tagging model and a large language model (LLM). IAP-AS extracts object attributes from images to generate context-aware prompts, improving adaptability and generalization in dynamic and unstructured industrial environments. In our experiments, IAP-AS improves the F1-max metric by up to 10%, demonstrating superior adaptability and generalization. It provides a scalable solution for anomaly segmentation across industries.
Multimodal anomaly detection (MAD) aims to exploit both texture and spatial attributes to identify deviations from normal patterns in complex scenarios. However, zero-shot (ZS) settings arising from privacy concerns or confidentiality constraints present significant challenges to existing MAD methods. To address this issue, we introduce ZUMA, a training-free, Zero-shot Unified Multimodal Anomaly detection framework that unleashes CLIP's cross-modal potential to perform ZS MAD. To mitigate the domain gap between CLIP's pretraining space and point clouds, we propose cross-domain calibration (CDC), which efficiently bridges the manifold misalignment through source-domain semantic transfer and establishes a hybrid semantic space, enabling a joint embedding of 2D and 3D representations. Subsequently, ZUMA performs dynamic semantic interaction (DSI) to enable structural decoupling of anomaly regions in the high-dimensional embedding space constructed by CDC, where natural languages serve as semantic anchors to help DSI establish discriminative hyperplanes within hybrid modality representations. Within this framework, ZUMA enables plug-and-play detection of 2D, 3D or multimodal anomalies, without training or fine-tuning even for cross-dataset or incomplete-modality scenarios. Additionally, to further investigate the potential of the training-free ZUMA within the training-based paradigm, we develop ZUMA-FT, a fine-tuned variant that achieves notable improvements with minimal parameter trade-off. Extensive experiments are conducted on two MAD benchmarks, MVTec 3D-AD and Eyecandies. Notably, the training-free ZUMA achieves state-of-the-art (SOTA) performance on both datasets, outperforming existing ZS MAD methods, including training-based approaches. Moreover, ZUMA-FT further extends the performance boundary of ZUMA with only 6.75 M learnable parameters. Code is available at: https://github.com/yif-ma/ZUMA.
Recently, vision-language models (e.g. CLIP) have demonstrated remarkable performance in zero-shot anomaly detection (ZSAD). By leveraging auxiliary data during training, these models can directly perform cross-category anomaly detection on target datasets, such as detecting defects on industrial product surfaces or identifying tumors in organ tissues. Existing approaches typically construct text prompts through either manual design or the optimization of learnable prompt vectors. However, these methods face several challenges: 1) handcrafted prompts require extensive expert knowledge and trial-and-error; 2) single-form learnable prompts struggle to capture complex anomaly semantics; and 3) an unconstrained prompt space limits generalization to unseen categories. To address these issues, we propose Bayesian Prompt Flow Learning (Bayes-PFL), which models the prompt space as a learnable probability distribution from a Bayesian perspective. Specifically, a prompt flow module is designed to learn both image-specific and image-agnostic distributions, which are jointly utilized to regularize the text prompt space and improve the model's generalization on unseen categories. These learned distributions are then sampled to generate diverse text prompts, effectively covering the prompt space. Additionally, a residual cross-modal attention (RCA) module is introduced to better align dynamic text embeddings with fine-grained image features. Extensive experiments on 15 industrial and medical datasets demonstrate our method's superior performance. The code is available at https://github.com/xiaozhen228/Bayes-PFL.
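A minimal sketch of the probabilistic-prompt idea: the prompt embedding is modelled as a learnable Gaussian, several prompts are sampled via the reparameterisation trick, and their similarities to the image feature are averaged. This compresses the prompt flow module into a single Gaussian and omits the RCA module; all names and shapes below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptDistribution(nn.Module):
    """Prompt space modelled as a learnable Gaussian (reparameterisation trick),
    a much-reduced sketch of probabilistic prompt learning."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))
        self.log_var = nn.Parameter(torch.zeros(dim))

    def sample(self, k: int) -> torch.Tensor:
        eps = torch.randn(k, self.mu.shape[0])
        return self.mu + eps * torch.exp(0.5 * self.log_var)   # (k, dim)

def anomaly_logit(image_feat: torch.Tensor, dist: PromptDistribution, k: int = 8):
    prompts = F.normalize(dist.sample(k), dim=-1)              # k sampled text prompts
    img = F.normalize(image_feat, dim=-1)
    return (img @ prompts.T).mean()                            # average similarity

dist = PromptDistribution()
print(anomaly_logit(torch.randn(512), dist).item())
```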
While anomaly detection has made significant progress, generating detailed analyses that incorporate industrial knowledge remains a challenge. To address this gap, we introduce OmniAD, a novel framework that unifies anomaly detection and understanding for fine-grained analysis. OmniAD is a multimodal reasoner that combines visual and textual reasoning processes. The visual reasoning provides detailed inspection by leveraging Text-as-Mask Encoding to perform anomaly detection through text generation without manually selected thresholds. Following this, Visual Guided Textual Reasoning conducts comprehensive analysis by integrating visual perception. To enhance few-shot generalization, we employ an integrated training strategy that combines supervised fine-tuning (SFT) with reinforcement learning (GRPO), incorporating three sophisticated reward functions. Experimental results demonstrate that OmniAD achieves a performance of 79.1 on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. It also shows strong results across multiple anomaly detection benchmarks. These results highlight the importance of enhancing visual perception for effective reasoning in anomaly understanding. All codes and models will be publicly available.
Few-shot multimodal industrial anomaly detection is a critical yet underexplored task, offering the ability to quickly adapt to complex industrial scenarios. In few-shot settings, insufficient training samples often fail to cover the diverse patterns present in test samples. This challenge can be mitigated by extracting structural commonality from a small number of training samples. In this paper, we propose a novel few-shot unsupervised multimodal industrial anomaly detection method based on structural commonality, CIF (Commonality In Few). To extract intra-class structural information, we employ hypergraphs, which are capable of modeling higher-order correlations, to capture the structural commonality within training samples, and use a memory bank to store this intra-class structural prior. Firstly, we design a semantic-aware hypergraph construction module tailored for single-semantic industrial images, from which we extract common structures to guide the construction of the memory bank. Secondly, we use a training-free hypergraph message passing module to update the visual features of test samples, reducing the distribution gap between test features and features in the memory bank. We further propose a hyperedge-guided memory search module, which utilizes structural information to assist the memory search process and reduce the false positive rate. Experimental results on the MVTec 3D-AD dataset and the Eyecandies dataset show that our method outperforms the state-of-the-art (SOTA) methods in few-shot settings. Code is available at https://github.com/Sunny5250/CIF.
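The hypergraph message-passing step lends itself to a short sketch: one standard HGNN-style propagation over a binary incidence matrix, shown here only to illustrate how hyperedges could smooth a test sample's patch features toward stored structural commonality. The normalisation scheme and toy incidence matrix are illustrative, not CIF's exact training-free module.

```python
import torch

def hypergraph_message_passing(x: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """One training-free hypergraph propagation step (HGNN-style normalisation).

    x: (N, d) node (patch) features.
    H: (N, E) binary incidence matrix marking which nodes join which hyperedge.
    """
    dv = H.sum(dim=1).clamp(min=1)                 # node degrees (N,)
    de = H.sum(dim=0).clamp(min=1)                 # hyperedge degrees (E,)
    Dv = torch.diag(dv.pow(-0.5))
    De = torch.diag(de.pow(-1.0))
    return Dv @ H @ De @ H.T @ Dv @ x              # smoothed features (N, d)

x = torch.randn(6, 16)
H = torch.tensor([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 1]], dtype=torch.float)
print(hypergraph_message_passing(x, H).shape)  # torch.Size([6, 16])
```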
Precise optical inspection in industrial applications is crucial for minimizing scrap rates and reducing the associated costs. Besides merely detecting if a product is anomalous or not, it is crucial to know the distinct type of defect, such as a bent, cut, or scratch. The ability to recognize the "exact" defect type enables automated treatments of the anomalies in modern production lines. Current methods are limited to solely detecting whether a product is defective or not without providing any insights on the defect type, let alone detecting and identifying multiple defects. We propose MultiADS, a zero-shot learning approach able to perform Multi-type Anomaly Detection and Segmentation. The architecture of MultiADS comprises CLIP and extra linear layers to align the visual and textual representations in a joint feature space. To the best of our knowledge, our proposal is the first approach to perform a multi-type anomaly segmentation task in zero-shot learning. Contrary to the other baselines, our approach i) generates specific anomaly masks for each distinct defect type, ii) learns to distinguish defect types, and iii) simultaneously identifies multiple defect types present in an anomalous product. Additionally, our approach outperforms zero/few-shot learning SoTA methods on image-level and pixel-level anomaly detection and segmentation tasks on five commonly used datasets: MVTec-AD, VisA, MPDD, MAD and Real-IAD.
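A hedged sketch of how per-defect-type masks can fall out of patch-text similarity once visual and textual features are aligned: one text embedding per defect type plus one for "normal", with a softmax over types at every patch. The alignment layers and prompt construction are not shown, and the temperature is an assumed value.

```python
import torch
import torch.nn.functional as F

def multi_type_masks(patch_feats: torch.Tensor, text_feats: torch.Tensor,
                     hw: tuple = (14, 14), temperature: float = 0.07):
    """Per-defect-type segmentation from patch-text similarity.

    patch_feats: (N, dim) aligned patch features of one image.
    text_feats:  (K, dim) text embeddings, index 0 = 'normal', the rest one
                 per defect type (bent, cut, scratch, ...).
    Returns (K, H, W) softmax maps; argmax over K gives the defect-type mask.
    """
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = p @ t.T / temperature          # (N, K)
    return F.softmax(logits, dim=-1).T.reshape(text_feats.shape[0], *hw)

masks = multi_type_masks(torch.randn(196, 512), torch.randn(4, 512))
print(masks.shape, masks.argmax(dim=0).shape)  # (4, 14, 14) and (14, 14)
```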
Zero-shot anomaly detection (ZSAD) aims to identify anomalies in new classes of images and is vital in industry and other fields. Most current methods are based on the multimodal models CLIP and SAM, which bring prior knowledge to assist model training, but they are highly dependent on the input prompts and their accuracy. We found that some diffusion model-based anomaly detection methods generate a large amount of semantic information that is very valuable for the ZSAD task. Therefore, we propose a diffusion model-based zero-shot anomaly detection method, DZAD, which requires no additional prompt input. First, we propose the first diffusion-based zero-shot anomaly detection framework, which uses the proposed multi-timestep noise feature extraction method to perform anomaly detection in the denoising process of a latent-space diffusion model with a semantic-guided (SG) network. Second, based on the detection results, we propose a two-branch feature extractor for anomaly maps at different scales. Third, based on the difference between anomaly detection and other general image detection tasks, we propose a noise feature weight function for the diffusion model in the zero-shot setting. We compare against 7 recent state-of-the-art (SOTA) methods on the MVTec AD and VisA datasets and analyze the role of each component in ablation studies. The experiments demonstrate that the method outperforms existing approaches.
Speech-text multimodal large models are a key tool in power-industry operations, and their fault prediction performance directly affects the operational safety of mechanical equipment; this paper designs a detailed scheme for optimizing that performance. First, the structural design of the unimodal models is discussed: an audio classifier based on Wav2Vec2 and a text classifier based on BERT are used to pre-train the model. On this foundation, a multimodal model is introduced with a cross-attention mechanism as the fusion strategy, so that information from the different modalities is fused within the deep neural network, improving the accuracy and robustness of the recognition task. After completing fault feature extraction, and drawing on the relevant theory of BNN, the BBN structure is optimized; by combining the HC algorithm, BIC, and an annealing strategy with the fault feature extraction method for the power industry and the optimized BBN, a fault diagnosis method based on the improved BBN network is constructed. The effectiveness of the method is verified through simulation experiments. The prediction accuracy of the proposed method across nine categories of fault data remains above 90%, and for some categories it reaches 100%. The proposed multimodal fusion strategy significantly improves fault feature recognition; in addition, the fault diagnosis method based on the improved BBN reduces the computational cost of the model and improves its fault prediction ability.
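A compact sketch of the cross-attention fusion stage under stated assumptions: BERT-style text tokens attend to Wav2Vec2-style audio tokens, and the fused sequence is mean-pooled and classified into nine fault categories. Hidden sizes, pooling, and the single-layer design are simplifications; the real encoders and the BBN-based diagnosis stage are not modelled here.

```python
import torch
import torch.nn as nn

class SpeechTextFusion(nn.Module):
    """Cross-attention fusion of text and audio token sequences for fault
    classification (illustrative single-layer design)."""

    def __init__(self, dim: int = 768, heads: int = 8, num_classes: int = 9):
        super().__init__()
        self.text_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_seq: torch.Tensor, audio_seq: torch.Tensor):
        # text_seq: (B, Lt, dim) BERT-style embeddings; audio_seq: (B, La, dim).
        fused, _ = self.text_from_audio(text_seq, audio_seq, audio_seq)
        pooled = fused.mean(dim=1)                   # simple mean pooling
        return self.classifier(pooled)               # (B, num_classes) fault logits

model = SpeechTextFusion()
logits = model(torch.randn(2, 32, 768), torch.randn(2, 200, 768))
print(logits.shape)  # torch.Size([2, 9])
```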
Source-Free Domain Adaptation (SFDA) aims to adapt a source model for a target domain, with only access to unlabeled target training data and the source model pretrained on a supervised source domain. Relying on pseudo labeling and/or auxiliary supervision, conventional methods are inevitably error-prone. To mitigate this limitation, in this work we for the first time explore the potentials of off-the-shelf vision-language (ViL) multimodal models (e.g., CLIP) with rich whilst heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory, as it is not specialized for this particular task but largely generic. To make it task specific, we propose a novel Distilling multImodal Foundation mOdel (DIFO) approach. Specifically, DIFO alternates between two steps during adaptation: (i) Customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner, (ii) Distilling the knowledge of this customized ViL model to the target model. For more fine-grained and reliable distillation, we further introduce two effective regularization terms, namely most-likely category encouragement and predictive consistency. Extensive experiments show that DIFO significantly outperforms the state-of-the-art alternatives. Code is here.
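Only the distillation half of the alternation is easy to sketch: a temperature-scaled KL term pulling the target model's predictions toward the customised vision-language model's predictions on unlabeled target data. The prompt-learning customisation step and the two regularization terms are omitted; the temperature and class count below are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(target_logits: torch.Tensor,
                      vil_logits: torch.Tensor,
                      tau: float = 2.0) -> torch.Tensor:
    """Distil the (customised) ViL model's predictions into the target model.

    target_logits: (B, C) logits from the adapting target model.
    vil_logits:    (B, C) logits from the customised vision-language model.
    """
    student = F.log_softmax(target_logits / tau, dim=-1)
    teacher = F.softmax(vil_logits / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * tau * tau

t = torch.randn(16, 65)   # batch of 16, C = 65 classes (placeholder sizes)
v = torch.randn(16, 65)
print(distillation_loss(t, v).item())
```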
Driven by the progress of large-scale pre-training, parameter-efficient transfer learning has gained immense popularity across different subfields of Artificial Intelligence. The core idea is to adapt the model to downstream tasks with only a small set of parameters. Recently, researchers have leveraged such proven techniques in multimodal tasks and achieved promising results. However, two critical issues remain unresolved: how to further reduce complexity with a lightweight design and how to boost alignment between modalities with extremely few parameters. In this paper, we propose a graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges. Considering the redundancy in existing architectures, we first utilize mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning, which explores the low intrinsic dimension with only 0.04% of the parameters of the pre-trained model. Then, for better modality alignment, we propose the Informative Context Enhancement and Gated Query Transformation modules for extremely low-parameter settings. A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach. Our code is available at: https://github.com/WillDreamer/Aurora.
Industrial defect detection is vital for upholding product quality across contemporary manufacturing systems. As the expectations for precision, automation, and scalability intensify, conventional inspection approaches are increasingly found wanting in addressing real-world demands. Notable progress in computer vision and deep learning has substantially bolstered defect detection capabilities across both 2D and 3D modalities. A significant development has been the pivot from closed-set to open-set defect detection frameworks, which diminishes the necessity for extensive defect annotations and facilitates the recognition of novel anomalies. Despite such strides, a cohesive and contemporary understanding of industrial defect detection remains elusive. Consequently, this survey delivers an in-depth analysis of both closed-set and open-set defect detection strategies within 2D and 3D modalities, charting their evolution in recent years and underscoring the rising prominence of open-set techniques. We distill critical challenges inherent in practical detection environments and illuminate emerging trends, thereby providing a current and comprehensive vista of this swiftly progressing field.
Embodied intelligence has always been regarded as the ultimate form of artificial intelligence (AI) and an ideal concept for smart manufacturing. With the development of AI foundation models, remarkable generalization capabilities have been achieved in various fields, such as natural language processing and computer vision. In the era of AI foundation models, embodied intelligence will be capable of continuous evolution for unlimited tasks with multimodal physical interaction in the open world. It is therefore envisioned that embodied intelligence should be integrated into smart manufacturing to upgrade the industry toward more intelligent, flexible, and human-centric manufacturing in the future. To this end, this article proposes the definition and components of embodied intelligence together with its novel characteristics. In addition, the capabilities of embodied intelligence in the era of AI foundation models are discussed. Moreover, typical innovative applications of embodied intelligence throughout the whole product life cycle for smart manufacturing are presented with insights. However, embodied intelligence still faces some challenges for implementation. Finally, looking toward the future, the challenges and outlooks of embodied intelligence are discussed for further research and applications.
We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.
Multimodal industrial anomaly detection is undergoing a paradigm shift from "perception-level fusion" to "cognitive reasoning." The research focus has moved from pure RGB-D feature reconstruction toward leveraging VLMs/MLLMs for zero-shot generalization and interpretable logical analysis. The introduction of new architectures such as Mamba and diffusion models further improves detection efficiency and generation quality, while the integration of digital twins and embodied intelligence signals that the technology is rapidly moving toward practical deployment on automated production lines.