Multimodal Industrial Anomaly Detection: Achieving One-for-All
Logical Reasoning and Explainable Detection Based on Multimodal Large Models (MLLM/VLM)
This group of papers leverages the strong semantic understanding and reasoning abilities of large vision-language models (LVLMs) to lift anomaly detection from pure pixel-level segmentation to the level of anomaly explanation, causal analysis, and logical verification. Research focuses include chain-of-thought (CoT) prompting, reinforcement learning (GRPO), instruction tuning, and rule engines for tackling logical anomalies that traditional methods struggle to identify (e.g., missing components, wrong ordering), along with producing human-readable inspection reports; a minimal chain-of-thought query sketch follows the list below.
- LLM-Based Hybrid Framework for Industrial Anomaly Detection for Smart Manufacturing(Fu Swee Tee, Lau Bee Theng, Mark Tee Kit Tsun, Deron Foo Yijia, 2025, 2025 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET))
- Customizing Visual-Language Foundation Models for Multi-Modal Anomaly Detection and Reasoning(Xiaohao Xu, Yunkang Cao, Yongqi Chen, Weiming Shen, Xiaonan Huang, 2024, 2025 28th International Conference on Computer Supported Cooperative Work in Design (CSCWD))
- Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models(Jiacong Xu, Shao-Yuan Lo, Bardia Safaei, Vishal M. Patel, Isht Dwivedi, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction(Er Jin, Qihui Feng, Yongli Mou, Stefan Decker, G. Lakemeyer, Oliver Simons, Johannes Stegmaier, 2025, ArXiv)
- LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning(Peijian Zeng, Feiyan Pang, Zhanbo Wang, Aimin Yang, 2025, 2025 IEEE International Conference on Data Mining (ICDM))
- OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning(Shifang Zhao, Yiheng Lin, Lu Han, Yao Zhao, Yunchao Wei, 2025, ArXiv)
- PB-IAD: Utilizing multimodal foundation models for semantic industrial anomaly detection in dynamic manufacturing environments(Bernd Hofmann, Albert Scheck, Joerg Franke, Patrick Bruendl, 2025, ArXiv)
- SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment(Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan, Yang Song, Lei Fan, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Towards VLM-based Hybrid Explainable Prompt Enhancement for Zero-Shot Industrial Anomaly Detection(Weichao Cai, Weiliang Huang, Yunkang Cao, Chao Huang, Fei Yuan, Bob Zhang, Jie Wen, 2025, No journal)
- AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection(Yuhao Chao, Jie Liu, Jie Tang, Gangshan Wu, 2025, ArXiv)
- Can Multimodal Large Language Models be Guided to Improve Industrial Anomaly Detection?(Zhiling Chen, Hanning Chen, Mohsen Imani, Farhad Imani, 2025, ArXiv)
- EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO(Wei Guan, Jun Lan, Jian Cao, Hao Tan, Huijia Zhu, Weiqiang Wang, 2025, ArXiv)
- Towards Training-free Anomaly Detection with Vision and Language Foundation Models(Jinjin Zhang, Guodong Wang, Yizhou Jin, Di Huang, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- RIASA: Enhancing Reasoning Industrial Anomaly Segmentation via Large Vision-Language Models(Zongyun Zhang, Xian Gao, Jiacheng Ruan, Ting Liu, Yuzhuo Fu, 2025, 2025 IEEE International Conference on Multimedia and Expo Workshops (ICMEW))
- IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning(Mengyang Zhao, Teng Fu, Haiyang Yu, Ke Niu, Bin Li, 2025, ArXiv)
- Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection(Yuanze Li, Haolin Wang, Shihao Yuan, Ming Liu, Debin Zhao, Yiwen Guo, Chen Xu, Guangming Shi, Wangmeng Zuo, 2023, ArXiv)
- IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection(Zewen Li, Zitong Yu, Qilang Ye, Weicheng Xie, Wei Zhuo, Linlin Shen, 2025, IEEE Transactions on Instrumentation and Measurement)
- IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection(Yanhui Li, Yunkang Cao, Chengliang Liu, Yuan Xiong, Xinghui Dong, Chao Huang, 2025, ArXiv)
- EIAD: Explainable Industrial Anomaly Detection Via Multi-Modal Large Language Models(Zongyun Zhang, Jiacheng Ruan, Xian Gao, Ting Liu, Yuzhuo Fu, 2025, 2025 IEEE International Conference on Multimedia and Expo (ICME))
- Collaborative Anomaly Detection Using Agent-Based Modeling in Smart Manufacturing(Shuai Yuan, Haoxu Nong, Yunbo Rao, 2025, 2025 IEEE 8th International Conference on Pattern Recognition and Artificial Intelligence (PRAI))
- Referring Industrial Anomaly Segmentation(Pengfei Yue, Xiaokang Jiang, Yilin Lu, Jianghang Lin, Shengchuan Zhang, Liujuan Cao, 2026, No journal)
- LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection(Weijia Li, Guanglei Chu, Jionghuan Chen, Guo-Sen Xie, Caifeng Shan, Fang Zhao, 2025, ArXiv)
- RuleText-AD: Logical Anomaly Detection via Textual Rule Engines Generated by MLLMs(Leandro Silva, Mauricio Schiezaro, D. Oliveira, 2025, 2025 38th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI))
- Logical Anomaly Detection with Text-based Logic via Component-Aware Contrastive Language-Image Training(Seung-eon Lee, Soopil Kim, Sion An, Sang-Chul Lee, Sanghyun Park, 2025, Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2)
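As a rough illustration of the shared mechanism behind this group (not any single paper's pipeline), the sketch below sends an image plus a chain-of-thought inspection instruction to a generic MLLM endpoint; the OpenAI client, model name, and prompt wording are illustrative assumptions.

```python
# Minimal chain-of-thought inspection query against a generic MLLM endpoint
# (client, model choice, and prompt wording are illustrative assumptions).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def inspect(image_path: str) -> str:
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Inspect this industrial part step by step: list its "
                         "components, check count/position/order against a "
                         "normal part, then conclude 'normal' or 'anomalous' "
                         "with reasons."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```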
Zero-Shot and Prompt-Learning Strategies Based on Vision-Language Pre-training (CLIP)
These papers focus on exploiting the cross-modal alignment of pre-trained models such as CLIP, using prompt learning, adapter fine-tuning, or feature decoupling to achieve zero-shot or few-shot anomaly detection without category-specific training. The aim is to cope with the scarcity of labeled samples in industrial settings and to improve generalization and transfer; a minimal zero-shot scoring sketch follows the list below.
- PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection(Xiaofan Li, Zhizhong Zhang, Xin Tan, Chengwei Chen, Yanyun Qu, Yuan Xie, Lizhuang Ma, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- MGFD-CLIP: Multi-Granularity Feature Decoupling for Zero-Shot Industrial Anomaly Detection(Zichun Zhang, Jiehao Chen, 2025, 2025 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA))
- YOLOSAM: A unified and efficient anomaly detection model based on auto mask prompt(Ruizhi Yu, Weiting Chen, Jiahao Fan, Xiang Li, Zheming Fan, Qing Zhang, 2025, Signal, Image and Video Processing)
- MALM-CLIP: A generative multi-agent framework for multimodal fusion in few-shot industrial anomaly detection(Hanzhi Chen, Jingbin Que, Kexin Zhu, Zhide Chen, F. Zhu, Wencheng Yang, Xu Yang, Xuechao Yang, 2025, Inf. Fusion)
- Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images(Chaoqin Huang, Aofan Jiang, Jinghao Feng, Ya Zhang, Xinchao Wang, Yanfeng Wang, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection(Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, Jiming Chen, 2023, ArXiv)
- AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection(Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, Giacomo Boracchi, 2024, ArXiv)
- FocusPatch AD: Few-Shot Multi-Class Anomaly Detection With Unified Keywords Patch Prompts(Xicheng Ding, Xiaofan Li, Mingang Chen, Jingyu Gong, Yuan Xie, 2025, IEEE Transactions on Image Processing)
- tGARD: Text-Guided Adversarial Reconstruction for Industrial Anomaly Detection(Yuchen Qiang, Jiuxin Cao, Shiwei Zhou, Junyang Yang, Lijia Yu, Bo Liu, 2025, IEEE Transactions on Industrial Informatics)
- Supad: a superordinary zero-shot industrial anomaly detection network based on gated-agnostic multimodal adaptive learning prompts(Xinying Li, Junfeng Jing, Tong Wu, Xin Zhang, Wei Liu, 2026, Journal of Intelligent Manufacturing)
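The shared core of these CLIP-based methods is comparing image features against textual descriptions of normality and abnormality. The sketch below scores one image against handcrafted prompt ensembles via CLIP similarity; the checkpoint and prompt wording are assumptions, not taken from any cited paper.

```python
# Minimal zero-shot anomaly scoring with CLIP (illustrative; prompt wording
# and model checkpoint are assumptions, not any cited paper's exact setup).
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def anomaly_score(image: Image.Image, obj: str = "metal part") -> float:
    normal = [f"a photo of a flawless {obj}", f"a photo of a perfect {obj}"]
    abnormal = [f"a photo of a damaged {obj}", f"a photo of a defective {obj}"]
    inputs = processor(text=normal + abnormal, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (1, num_prompts) temperature-scaled cosine similarities
    sims = out.logits_per_image.softmax(dim=-1).squeeze(0)
    return sims[len(normal):].sum().item()  # mass on the "abnormal" ensemble
```

Pixel-level variants of this idea compare patch tokens, rather than the global image feature, against the same text embeddings.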
Feature Fusion and Geometric Modeling for Multi-Source Sensors (RGB/3D/Process Variables)
This group addresses complementary multimodal information at the physical level, in particular combining RGB images with 3D point clouds, depth maps, and industrial process variables (e.g., current, pressure). Topics covered include cross-modal feature remapping, geometric-prior enhancement, graph-neural-network representations, and robust detection under missing modalities or noise; a minimal fusion-and-scoring sketch follows the list below.
- M3DM-NR: RGB-3D Noisy-Resistant Industrial Anomaly Detection via Multimodal Denoising(Chengjie Wang, Haokun Zhu, Jinlong Peng, Yue Wang, Ran Yi, Yunsheng Wu, Lizhuang Ma, Jiangning Zhang, 2024, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- FMFR: Feature-level Multistage Fusion and Remapping for Multimodal Industrial Anomaly Detection(Chunshui Wang, Hengran Zhang, 2026, Journal of Computational Design and Engineering)
- A Streamlined System for Multimodal Industrial Anomaly Detection via 2D and 3D Feature Fusion(Wenbing Zhu, Mingmin Chi, Bo Peng, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Unsupervised Visual-to-Geometric Feature Reconstruction for Vision-Based Industrial Anomaly Detection(Dinh-Cuong Hoang, Phan Xuan Tan, Anh-Nhat Nguyen, Duc-Thanh Tran, van-Hiep Duong, Anh-Truong Mai, D. Pham, Khanh-Toan Phan, Minh-Quang Do, Ta Huu Anh Duong, Tuan-Minh Huynh, Son-Anh Bui, Duc-Manh Nguyen, Viet-Anh Trinh, Khanh-Duong Tran, Thu-Uyen Nguyen, 2025, IEEE Access)
- Multimodal Industrial Anomaly Detection via Hybrid Fusion(Yue Wang, Jinlong Peng, Jiangning Zhang, Ran Yi, Yabiao Wang, Chengjie Wang, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Multimodal Industrial Anomaly Detection via Geometric Prior(Min Li, Jinghui He, Gang Li, Jiachen Li, Jin Wan, Delong Han, 2026, IEEE Transactions on Circuits and Systems for Video Technology)
- Incomplete multimodal industrial anomaly detection via cross-modal distillation(Wenbo Sui, Daniel Lichau, Josselin Lefèvre, Harold Phelippeau, 2024, Inf. Fusion)
- A multimodal industrial anomaly detection method based on mask training and teacher-student joint memory(Yi Liu, Changsheng Zhang, Xingjun Dong, Yufei Yang, 2025, Eng. Appl. Artif. Intell.)
- CPIR: Multimodal Industrial Anomaly Detection via Latent Bridged Cross-modal Prediction and Intra-modal Reconstruction(Shangguan Wen, Hongqiang Wu, Yanchang Niu, Haonan Yin, Jiawei Yu, Bokui Chen, Biqing Huang, 2025, Adv. Eng. Informatics)
- MFGAN: Multimodal Fusion for Industrial Anomaly Detection Using Attention-Based Autoencoder and Generative Adversarial Network(Xinji Qu, Zhuo Liu, C. Wu, Aiqin Hou, Xiaoyan Yin, Zhulian Chen, 2024, Sensors (Basel, Switzerland))
- VLDFNet: Views-Graph and Latent Feature Disentangled Fusion Network for Multimodal Industrial Anomaly Detection(Chenxing Xia, Chaofan Liu, Yicong Zhou, Kuan Ching Li, 2025, IEEE Transactions on Instrumentation and Measurement)
- Multimodal Industrial Anomaly Detection via Uni-Modal and Cross-Modal Fusion(Hao Cheng, Jiaxiang Luo, Xianyong Zhang, 2025, IEEE Transactions on Industrial Informatics)
- Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping(Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano, 2023, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Cross-Modal Learning for Anomaly Detection in Complex Industrial Process: Methodology and Benchmark(Gaochang Wu, Yapeng Zhang, Lan Deng, Jingxin Zhang, Tianyou Chai, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Zoom-Anomaly: Multimodal vision-Language fusion industrial anomaly detection with synthetic data(Jiaqi Li, Shuhuan Wen, Hamid Reza Karimi, 2026, Inf. Fusion)
- Multimodal Industrial Anomaly Detection by Crossmodal Reverse Distillation(Xinyue Liu, Jianyuan Wang, Biao Leng, Shuo Zhang, 2024, ArXiv)
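A recurring backbone of these works is late fusion of aligned RGB and 3D patch features followed by nearest-neighbor scoring against a memory of normal samples. The concat-then-kNN scheme below is a simplified assumption inspired by hybrid-fusion pipelines such as M3DM; real systems first project point clouds onto pixels and use learned fusion blocks.

```python
# Minimal sketch of RGB+3D late fusion with a normal-sample memory bank
# (concat-then-kNN is an assumption; real pipelines align and fuse learnably).
import torch
import torch.nn.functional as F

def fuse(rgb_feats: torch.Tensor, pc_feats: torch.Tensor) -> torch.Tensor:
    """rgb_feats: (N, C1), pc_feats: (N, C2), spatially aligned per patch."""
    return F.normalize(torch.cat([rgb_feats, pc_feats], dim=-1), dim=-1)

def build_memory(normals: list[tuple[torch.Tensor, torch.Tensor]]) -> torch.Tensor:
    return torch.cat([fuse(r, p) for r, p in normals], dim=0)  # (M, C1+C2)

def patch_anomaly_map(memory: torch.Tensor, rgb: torch.Tensor,
                      pc: torch.Tensor) -> torch.Tensor:
    q = fuse(rgb, pc)                  # (N, C1+C2) fused test-patch features
    dists = torch.cdist(q, memory)     # (N, M) distances to all normal patches
    return dists.min(dim=1).values     # nearest-neighbor distance per patch
```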
Unified Framework Construction, Cross-Domain Generalization, and Benchmark Evaluation
These papers work to break the one-class-one-model constraint and build unified (one-for-all) detection frameworks that handle multiple categories and multiple domains (industrial, medical, etc.) at once. In parallel, researchers have built large-scale multimodal benchmark datasets (e.g., MMAD, AnoVox) that provide standards for evaluating generalization, pose-agnostic detection, and commonsense reasoning.
- MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection(Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, Feng Zheng, 2024, No journal)
- PAD: A Dataset and Benchmark for Pose-agnostic Anomaly Detection(Qiang Zhou, Weize Li, Lihan Jiang, Guoliang Wang, Guyue Zhou, Shanghang Zhang, Hao Zhao, 2023, ArXiv)
- AnoVox: A Benchmark for Multimodal Anomaly Detection in Autonomous Driving(Daniel Bogdoll, Iramm Hamdard, Lukas Namgyu Rößler, Felix Geisler, M. Bayram, F. Wang, Jan Imhof, M. Campos, Anushervon Tabarov, Yitian Yang, Hanno Gottschalk, J. M. Zöllner, 2024, No journal)
- Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection(Wenqiao Li, Yao Gu, Xintao Chen, Xiaohao Xu, Ming Hu, Xiaonan Huang, Yingna Wu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark(Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, A. Oliva, Giuseppe Lisanti, Luigi Di Stefano, 2025, ArXiv)
- UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection(Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang, 2024, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Omni-LLaMA-AD: A Unified Model for Open-Set Visual Anomaly Detection(Rongyu Zhang, Zhanbin Hu, Jiamu Wang, Qiang Zhu, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Unified Unsupervised Anomaly Detection via Matching Cost Filtering(Zhe Zhang, Mingxiu Cai, Gao‐Song Wu, Jing Zhang, Lingqiao Liu, Dacheng Tao, Tianyou Chai, Xiatian Zhu, 2025, ArXiv)
- PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection(Po-Han Huang, Jeng-Lin Li, Po-Hsuan Huang, Ming-Ching Chang, Wei-Chao Chen, 2025, ArXiv)
- CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments(Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish, Antoine Bosselut, 2025, ArXiv)
- Accurate industrial anomaly detection with efficient multimodal fusion(Dinh-Cuong Hoang, Phan Xuan Tan, Anh-Nhat Nguyen, Ta Huu Anh Duong, Tuan-Minh Huynh, Duc-Manh Nguyen, Minh-Duc Cao, D. Ngo, Thu-Uyen Nguyen, Khanh-Toan Phan, Minh-Quang Do, Xuan-Tung Dinh, van-Hiep Duong, Ngoc-Anh Hoang, van-Thiep Nguyen, 2025, Array)
- HFAD: Fair Federated Learning and Hybrid Fusion Multimodal Industrial Anomaly Detection(Dohyoung Kim, Kyoungsu Oh, Youngho Lee, 2024, Journal of the Korea Institute of Information and Communication Engineering)
- UniAD: A Real-World Multi-Category Industrial Anomaly Detection Dataset with a Unified CLIP-Based Framework(Junyang Yang, Jiuxin Cao, Chengge Duan, 2025, Inf.)
- A unified vision-language model for cross-product defect detection in glove manufacturing(Yusen Zhao, Liang Tian, Yonggang Wang, 2026, PLOS One)
Text-Guided Industrial Anomaly Synthesis and Data Augmentation
These papers use generative methods to address the scarcity of industrial anomaly samples. The focus is on guiding diffusion models or variational autoencoders with textual information to synthesize high-quality pseudo-anomalies, and on using diverse synthetic samples to boost downstream unsupervised detectors; a minimal text-guided inpainting sketch follows the list below.
- A Unified Anomaly Synthesis Strategy with Gradient Ascent for Industrial Anomaly Detection and Localization(Qiyu Chen, Huiyuan Luo, Chengkan Lv, Zhengtao Zhang, 2024, ArXiv)
- Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation(Mingyu Lee, Jongwon Choi, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
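The text-guided synthesis idea can be approximated with an off-the-shelf inpainting diffusion pipeline: mask a local region of a normal image and inpaint it conditioned on a defect description. The checkpoint, prompt, and random mask placement below are assumptions, not the cited papers' actual procedures.

```python
# Minimal text-guided pseudo-anomaly synthesis via diffusion inpainting
# (checkpoint, prompt, and mask placement are illustrative assumptions).
import numpy as np
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting")

def synthesize_defect(normal_img: Image.Image,
                      prompt: str = "a deep scratch on metal") -> Image.Image:
    w, h = normal_img.size
    mask = np.zeros((h, w), dtype=np.uint8)   # white pixels = region to inpaint
    y, x = np.random.randint(h // 4, h // 2), np.random.randint(w // 4, w // 2)
    mask[y:y + h // 8, x:x + w // 8] = 255    # one random local defect region
    return pipe(prompt=prompt, image=normal_img,
                mask_image=Image.fromarray(mask)).images[0]
```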
This report synthesizes five core routes toward the one-for-all goal in multimodal industrial anomaly detection: 1) introducing MLLMs/VLMs for in-depth detection with logical reasoning and explanation; 2) using pre-trained models such as CLIP with prompt learning to tackle zero-shot generalization; 3) deeply fusing RGB, 3D point clouds, and process variables to strengthen physical representations; 4) building unified frameworks and large-scale multimodal benchmarks to drive cross-domain generality; and 5) exploring text-guided generative data augmentation. The overall trend shows industrial anomaly detection evolving from isolated pixel-level discrimination toward an intelligent system built on multimodal fusion, semantic reasoning, and all-scenario generality.
A total of 66 related references
Industrial environments demand accurate detection of anomalies to maintain product quality and ensure operational safety. Traditional industrial anomaly detection (IAD) methods often lack the flexibility and adaptability needed in dynamic production settings, where new defect types and operational changes continually emerge. Recent advancements in multimodal large language models (MLLMs) have shown promise by combining visual and textual processing capabilities, yet they are often limited by their lack of domain-specific expertise, particularly regarding industry-standard defect tolerances. To overcome these limitations, we introduce Echo, a novel multi-expert framework designed to enhance MLLM performance for IAD. Echo integrates four specialized modules: the Reference Extractor retrieves similar normal images to establish contextual baselines; the Knowledge Guide provides critical, industry-specific insights; the Reasoning Expert enables structured, stepwise analysis for complex queries; and the Decision Maker synthesizes information from the preceding modules to deliver precise, context-aware responses. Evaluations on the MMAD benchmark reveal that Echo significantly improves adaptability, precision, and robustness compared to conventional approaches. Our results demonstrate that guided MLLMs, when augmented with expert modules, can effectively bridge the gap between general visual understanding and the specialized requirements of industrial anomaly detection, paving the way for more reliable and interpretable inspection systems.
MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection
In the field of industrial inspection, Multimodal Large Language Models (MLLMs) have a high potential to renew the paradigms in practical applications due to their robust language capabilities and generalization abilities. However, despite their impressive problem-solving skills in many domains, MLLMs' ability in industrial anomaly detection has not been systematically studied. To bridge this gap, we present MMAD, the first-ever full-spectrum MLLMs benchmark in industrial Anomaly Detection. We defined seven key subtasks of MLLMs in industrial inspection and designed a novel pipeline to generate the MMAD dataset with 39,672 questions for 8,366 industrial images. With MMAD, we have conducted a comprehensive, quantitative evaluation of various state-of-the-art MLLMs. The commercial models performed the best, with the average accuracy of GPT-4o models reaching 74.9%. However, this result falls far short of industrial requirements. Our analysis reveals that current MLLMs still have significant room for improvement in answering questions related to industrial anomalies and defects. We further explore two training-free performance enhancement strategies to help models improve in industrial scenarios, highlighting their promising potential for future research.
The robust causal capability of multimodal large language models (MLLMs) holds the potential of detecting defective objects in industrial anomaly detection (IAD). However, most traditional IAD methods lack the ability to provide multiturn human–machine dialogs and detailed descriptions, such as the color of objects, the shape of an anomaly, or specific types of anomalies. At the same time, methods based on large pretrained models have not fully stimulated the ability of large models in anomaly detection tasks. In this article, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ an abnormal prompt generator (APG) to generate detailed anomaly prompts for specific objects. These specific prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pretrained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose a text-guided enhancer (TGE), wherein image features interact with normal and abnormal text prompts to dynamically select enhancement pathways, which enables language models to focus on the specific aspects of visual data, enhancing their ability to accurately interpret and respond to anomalies within images. Moreover, we design a multimask fusion (MMF) module to incorporate masks as expert knowledge, which enhances the LLM's perception of pixel-level anomalies. Extensive experiments on the MVTec-AD and VisA datasets demonstrate our state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks. The code is available at https://github.com/LiZeWen1225/IAD-GPT.
The detection of anomalies in manufacturing processes is crucial to ensure product quality and identify process deviations. Statistical and data-driven approaches remain the standard in industrial anomaly detection, yet their adaptability and usability are constrained by the dependence on extensive annotated datasets and limited flexibility under dynamic production conditions. Recent advances in the perception capabilities of foundation models provide promising opportunities for their adaptation to this downstream task. This paper presents PB-IAD (Prompt-based Industrial Anomaly Detection), a novel framework that leverages the multimodal and reasoning capabilities of foundation models for industrial anomaly detection. Specifically, PB-IAD addresses three key requirements of dynamic production environments: data sparsity, agile adaptability, and domain user centricity. In addition to anomaly detection, the framework includes a prompt template that is specifically designed for iteratively implementing domain-specific process knowledge, as well as a pre-processing module that translates domain user inputs into effective system prompts. This user-centric design allows domain experts to customise the system flexibly without requiring data science expertise. The proposed framework is evaluated by utilizing GPT-4.1 across three distinct manufacturing scenarios, two data modalities, and an ablation study to systematically assess the contribution of semantic instructions. Furthermore, PB-IAD is benchmarked against state-of-the-art methods for anomaly detection such as PatchCore. The results demonstrate superior performance, particularly in data-sparse scenarios and low-shot settings, achieved solely through semantic instructions.
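The abstract describes a prompt template that injects domain knowledge into the system prompt, but the template itself is not given. The sketch below is a hypothetical illustration of how such a prompt might be assembled; the field names and wording are assumptions.

```python
# Hypothetical prompt assembly in the spirit of PB-IAD's template (field
# names and wording are assumptions; the paper's actual template is not shown).
def build_system_prompt(product: str, process_knowledge: str, tolerances: str) -> str:
    return (
        f"You are a quality inspector for {product}.\n"
        f"Process knowledge: {process_knowledge}\n"
        f"Acceptable tolerances: {tolerances}\n"
        "Inspect the attached image. Answer with 'normal' or 'anomalous', "
        "then justify your decision step by step."
    )

prompt = build_system_prompt(
    product="injection-molded housings",
    process_knowledge="flash may appear along the parting line after tool wear",
    tolerances="flash up to 0.2 mm is acceptable; cracks are never acceptable",
)
```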
No abstract available
No abstract available
No abstract available
Due to the training configuration, traditional industrial anomaly detection (IAD) methods have to train a specific model for each deployment scenario, which is insufficient to meet the requirements of modern design and manufacturing. On the contrary, large multimodal models (LMMs) have shown eminent generalization ability on various vision tasks, and their perception and comprehension capabilities imply the potential of applying LMMs to IAD tasks. However, we observe that even though LMMs have abundant knowledge about industrial anomaly detection in the textual domain, they are unable to leverage that knowledge due to the modality gap between the textual and visual domains. To stimulate the relevant knowledge in LMMs and adapt them to anomaly detection tasks, we introduce existing IAD methods as vision experts and present a novel large multimodal model applying vision experts for industrial anomaly detection (abbreviated to Myriad). Specifically, we utilize the anomaly map generated by the vision experts as guidance for LMMs, such that the vision model is guided to pay more attention to anomalous regions. Then, the visual features are modulated via an adapter to fit the anomaly detection tasks, and fed into the language model together with the vision expert guidance and human instructions to generate the final outputs. Extensive experiments on the MVTec-AD, VisA, and PCB Bank benchmarks demonstrate that our proposed method not only performs favorably against state-of-the-art methods, but also inherits the flexibility and instruction-following ability of LMMs in the field of IAD. Source code and pre-trained models are publicly available at https://github.com/tzjtatata/Myriad.
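The "vision expert as guidance" idea can be pictured as patch-embedding the expert's anomaly map and projecting it into the language model's token space so it can be prepended to the visual and text tokens. The sketch below is a rough approximation with assumed dimensions; Myriad's actual adapter design differs.

```python
# Rough sketch of projecting an expert anomaly map into LLM token space
# (patch size, pooling, and dimensions are assumptions, not Myriad's design).
import torch
import torch.nn as nn

class ExpertGuidanceAdapter(nn.Module):
    def __init__(self, patch: int = 16, llm_dim: int = 4096):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch, llm_dim)

    def forward(self, anomaly_map: torch.Tensor) -> torch.Tensor:
        """anomaly_map: (B, H, W) expert output in [0, 1] -> (B, N, llm_dim)."""
        B, H, W = anomaly_map.shape
        p = self.patch
        tokens = (anomaly_map
                  .unfold(1, p, p).unfold(2, p, p)   # (B, H/p, W/p, p, p)
                  .reshape(B, -1, p * p))            # one token per patch
        return self.proj(tokens)                     # guidance tokens for the LLM

guidance = ExpertGuidanceAdapter()(torch.rand(1, 224, 224))  # -> (1, 196, 4096)
```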
No abstract available
No abstract available
Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability, to revolutionize IAD. By integrating MLLM with Group Relative Policy Optimization (GRPO), enhanced by our novel Reasoned Outcome Alignment Metric (ROAM), AnomalyR1 achieves a fully end-to-end solution that autonomously processes inputs of image and domain knowledge, reasons through analysis, and generates precise anomaly localizations and masks. Based on the latest multimodal IAD benchmark, our compact 3-billion-parameter model outperforms existing methods, establishing state-of-the-art results. As MLLM capabilities continue to advance, this study is the first to deliver an end-to-end VLM-based IAD solution that demonstrates the transformative potential of ROAM-enhanced GRPO, positioning our framework as a forward-looking cornerstone for next-generation intelligent anomaly detection systems in industrial applications with limited defective data.
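GRPO recurs across several of these works (AnomalyR1, LR-IAD, EMIT, IAD-R1). Its core step, computing advantages relative to a group of responses sampled from the same prompt, can be sketched in a few lines; reward functions such as ROAM are paper-specific and not reproduced here.

```python
# Core GRPO step: group-relative advantage normalization (reward design,
# e.g. AnomalyR1's ROAM, is paper-specific and omitted).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G responses sampled from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

adv = group_relative_advantages(torch.tensor([0.0, 1.0, 0.5, 1.0]))
# Each response's tokens are then reinforced with a PPO-style clipped
# objective weighted by its group-relative advantage.
```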
No abstract available
Industrial Anomaly Detection (IAD) is critical for ensuring product quality by identifying defects. Traditional methods such as feature embedding and reconstruction-based approaches require large datasets and struggle with scalability. Existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) address some limitations but rely on mask annotations, leading to high implementation costs and false positives. Additionally, industrial datasets like MVTec-AD and VisA suffer from severe class imbalance, with defect samples constituting only 23.8 % and 11.1 % of total data respectively. To address these challenges, we propose a reward function that dynamically prioritizes rare defect patterns during training to handle class imbalance. We also introduce a mask-free reasoning framework using Chain of Thought (CoT) and Group Relative Policy Optimization (GRPO) mechanisms, enabling anomaly detection directly from raw images without annotated masks. This approach generates interpretable step-by-step explanations for defect localization. Our method achieves state-of-the-art performance, outperforming prior approaches by 36% in accuracy on MVTec-AD and 16% on VisA. By eliminating mask dependency and reducing costs while providing explainable outputs, this work advances industrial anomaly detection and supports scalable quality control in manufacturing.
Industrial anomaly detection (IAD) plays a crucial role in maintaining the safety and reliability of manufacturing systems. While multimodal large language models (MLLMs) show strong vision-language reasoning abilities, their effectiveness in IAD remains limited without domain-specific adaptation. In this work, we propose EMIT, a unified framework that enhances MLLMs for IAD via difficulty-aware group relative policy optimization (GRPO). EMIT constructs a multi-task IAD dataset and utilizes GPT-generated object text descriptions to compensate for missing defective images. For few-shot anomaly detection, it integrates a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons. To better handle difficult data samples, i.e., cases where the MLLM struggles to generate correct answers, we propose a difficulty-aware GRPO that extends the original GRPO by incorporating a response resampling strategy to ensure the inclusion of correct answers in the sampled responses, as well as an advantage reweighting mechanism to strengthen learning from such difficult data samples. Extensive experiments on the MMAD benchmark demonstrate that EMIT significantly enhances the IAD performance of MLLMs, achieving an average improvement of 7.77% over the base model (InternVL3-8B) across seven tasks.
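Building on the plain GRPO step sketched earlier, EMIT's two modifications (resampling until a correct answer enters the group, and upweighting advantages on difficult samples) might look roughly like the following; the weighting form is an assumption, not EMIT's exact formula.

```python
# Rough sketch of difficulty-aware GRPO: resample so at least one correct
# response is in the group, then upweight hard samples. The concrete
# weighting function is an assumption, not EMIT's exact formula.
import torch

def resample_until_correct(sample_fn, is_correct, group_size=8, max_tries=4):
    responses = [sample_fn() for _ in range(group_size)]
    tries = 0
    while not any(is_correct(r) for r in responses) and tries < max_tries:
        responses[0] = sample_fn()   # swap one slot for a fresh sample, retry
        tries += 1
    return responses

def reweighted_advantages(rewards: torch.Tensor) -> torch.Tensor:
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    difficulty = 1.0 - rewards.mean()   # low mean reward -> hard sample
    return adv * (1.0 + difficulty)     # strengthen learning on hard cases
```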
Few-Shot Industrial Anomaly Detection (FS-IAD) has important applications in automating industrial quality inspection. Recently, some FS-IAD methods based on Large Vision-Language Models (LVLMs) have been proposed with some achievements through prompt learning or fine-tuning. However, existing LVLMs focus on general tasks but lack basic industrial knowledge and reasoning capabilities related to FS-IAD, making these methods far from specialized human quality inspectors. To address these challenges, we propose a unified framework, IADGPT, designed to perform FS-IAD in a human-like manner, while also handling associated localization and reasoning tasks, even for diverse and novel industrial products. To this end, we introduce a three-stage progressive training strategy inspired by humans. Specifically, the first two stages gradually guide IADGPT in acquiring fundamental industrial knowledge and discrepancy awareness. In the third stage, we design an in-context learning-based training paradigm, enabling IADGPT to leverage few-shot images as exemplars for improved generalization to novel products. In addition, we design a strategy that enables IADGPT to output image-level and pixel-level anomaly scores using the logits output and the attention map, respectively, in conjunction with the language output to accomplish anomaly reasoning. To support our training, we present a new dataset comprising 100K images across 400 diverse industrial product categories with extensive attribute-level textual annotations. Experiments indicate IADGPT achieves considerable performance gains in anomaly detection and demonstrates competitiveness in anomaly localization and reasoning. We will release our dataset with the camera-ready version.
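The abstract mentions deriving an image-level score from the logits output. One common way to do this (an assumption here, not necessarily IADGPT's exact rule) is to compare the probabilities of "Yes"/"No" answer tokens at the first generated position of an "Is this image anomalous?" query.

```python
# One plausible logits-to-score scheme (an assumption; the paper's exact rule
# is not given in the abstract): compare 'Yes'/'No' token probabilities.
import torch

def yes_no_anomaly_score(first_step_logits: torch.Tensor,
                         yes_id: int, no_id: int) -> float:
    """first_step_logits: (vocab_size,) logits for the first generated token."""
    pair = torch.stack([first_step_logits[yes_id], first_step_logits[no_id]])
    p_yes, _ = torch.softmax(pair, dim=0)
    return p_yes.item()   # in [0, 1]; higher means more likely anomalous
```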
Industrial image anomaly detection is critical for automated manufacturing. However, most existing methods rely on single-category training paradigms, resulting in poor scalability and limited cross-category generalization. These approaches require separate models for each product type and fail to model the complex multi-modal distribution of normal samples in multi-category scenarios. To overcome these limitations, we propose UniCLIP-AD, a unified anomaly detection framework that leverages the general semantic knowledge of CLIP and adapts it to the industrial domain using Low-Rank Adaptation (LoRA). This design enables a single model to effectively handle diverse industrial parts. In addition, we introduce UniAD, a large-scale industrial anomaly detection dataset collected from real production lines. It contains over 25,000 high-resolution images across 7 categories of electronic components, with both pixel-level and image-level annotations. UniAD captures fine-grained, diverse, and realistic defects, making it a strong benchmark for unified anomaly detection. Experiments show that UniCLIP-AD achieves superior performance on UniAD, with an AU-ROC of 92.1% and F1-score of 89.8% in cross-category tasks, outperforming the strongest baselines (CFA and DSR) by 3% AU-ROC and 23.9% F1-score.
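UniCLIP-AD adapts CLIP to the industrial domain with LoRA. The core LoRA mechanism, a frozen linear layer plus a trainable low-rank residual, can be sketched as below; rank and scaling are illustrative defaults, not the paper's reported hyperparameters.

```python
# Minimal LoRA wrapper around a frozen linear layer (rank/alpha are
# illustrative defaults, not UniCLIP-AD's reported hyperparameters).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False   # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # B starts at zero, so training begins from the unmodified CLIP output
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```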
Visual Anomaly Detection (VAD) aims to identify abnormal samples in images that deviate from normal patterns, covering multiple domains, including industrial, logical, and medical fields. Due to the domain gaps between these fields, existing VAD methods are typically tailored to each domain, with specialized detection techniques and model architectures that are difficult to generalize across different domains. Moreover, even within the same domain, current VAD approaches often require large amounts of normal samples to train class-specific models, resulting in poor generalizability and hindering unified evaluation across domains. To address this issue, we propose a generalized few-shot VAD method, UniVAD, capable of detecting anomalies across various domains, with a training-free unified model. UniVAD only needs few normal samples as references during testing to detect anomalies in previously unseen objects, without training on the specific domain. Specifically, UniVAD employs a Contextual Component Clustering (C3) module based on clustering and vision foundation models to segment components within the image accurately, and leverages Component-Aware Patch Matching (CAPM) and Graph-Enhanced Component Modeling (GECM) modules to detect anomalies at different semantic levels, which are aggregated to produce the final detection result. We conduct experiments on nine datasets spanning industrial, logical, and medical fields, and the results demonstrate that UniVAD achieves state-of-the-art performance in few-shot anomaly detection tasks across multiple domains, outperforming domain-specific anomaly detection models. Code is available at https://github.com/FantasticGNU/UniVAD.
Industrial anomaly detection is increasingly relying on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.
Visual anomaly detection (VAD) aims to identify image regions that deviate from established normal patterns. Existing methods often rely on domain-specific training and follow a ''one-class-one-model'' paradigm, limiting scalability. We propose Omni-LLaMA-AD, the first unified multimodal large language model for open-set anomaly detection, capable of handling diverse domains with minimal supervision. Built on a pretrained LLaMA backbone, the model uses a VQGAN-based tokenizer and supports joint vision-language generation. Trained via vision-language alignment and instruction tuning, it achieves effective anomaly detection with only a few normal samples and no domain-specific fine-tuning. Our demo showcases the model's ability to generate high-quality anomaly masks across industrial, medical, and logical datasets, highlighting its strong cross-domain generalization and interactive dialogue-based user experience.
No abstract available
Industrial few-shot anomaly detection (FSAD) requires identifying various abnormal states by leveraging as few normal samples as possible (abnormal samples are unavailable during training). However, current methods often require training a separate model for each category, leading to increased computation and storage overhead. Thus, designing a unified anomaly detection model that supports multiple categories remains a challenging task, as such a model must recognize anomalous patterns across diverse objects and domains. To tackle these challenges, this paper introduces FocusPatch AD, a unified anomaly detection framework based on vision-language models, achieving anomaly detection under few-shot multi-class settings. FocusPatch AD links anomaly state keywords to highly relevant discrete local regions within the image, guiding the model to focus on cross-category anomalies while filtering out background interference. This approach mitigates the false detection issues caused by global semantic alignment in vision-language models. We evaluate the proposed method on the MVTec, VisA, and Real-IAD datasets, comparing it against several prevailing anomaly detection methods. In both image-level and pixel-level anomaly detection tasks, FocusPatch AD achieves significant gains in classification and localization performance, demonstrating excellent generalization and adaptability.
Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB-3D and RGB-Text, enabled by point cloud sensing and vision-language models. Despite shared challenges, these lines remain largely isolated, hindering a comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings in the matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining the anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, followed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB-3D, RGB-Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
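The matching view described above can be made concrete as a patch-feature cost volume followed by a small learnable filter. The plain convolution below is an illustrative stand-in for UCF's attention-guided filtering module.

```python
# Sketch of the matching view: build a cost volume between test patches and
# normal references, then refine it with a small learnable filter (a plain
# conv here; UCF's actual attention-guided filter is more elaborate).
import torch
import torch.nn as nn
import torch.nn.functional as F

def cost_volume(test: torch.Tensor, normal: torch.Tensor) -> torch.Tensor:
    """test: (H*W, C) patch features; normal: (M, C) reference features."""
    t = F.normalize(test, dim=-1)
    n = F.normalize(normal, dim=-1)
    return 1.0 - t @ n.T                  # (H*W, M) cosine matching cost

class CostFilter(nn.Module):
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, k, padding=k // 2)

    def forward(self, cost: torch.Tensor, hw: tuple[int, int]) -> torch.Tensor:
        amap = cost.min(dim=1).values.reshape(1, 1, *hw)  # raw anomaly map
        return self.conv(amap).squeeze()                   # filtered map
```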
Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
While anomaly detection has made significant progress, generating detailed analyses that incorporate industrial knowledge remains a challenge. To address this gap, we introduce OmniAD, a novel framework that unifies anomaly detection and understanding for fine-grained analysis. OmniAD is a multimodal reasoner that combines visual and textual reasoning processes. The visual reasoning provides detailed inspection by leveraging Text-as-Mask Encoding to perform anomaly detection through text generation without manually selected thresholds. Following this, Visual Guided Textual Reasoning conducts comprehensive analysis by integrating visual perception. To enhance few-shot generalization, we employ an integrated training strategy that combines supervised fine-tuning (SFT) with reinforcement learning (GRPO), incorporating three sophisticated reward functions. Experimental results demonstrate that OmniAD achieves a performance of 79.1 on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. It also shows strong results across multiple anomaly detection benchmarks. These results highlight the importance of enhancing visual perception for effective reasoning in anomaly understanding. All codes and models will be publicly available.
Zero-shot anomaly detection (ZSAD) targets the identification of anomalies within images from arbitrary novel categories. This study introduces AdaCLIP for the ZSAD task, leveraging a pre-trained vision-language model (VLM), CLIP. AdaCLIP incorporates learnable prompts into CLIP and optimizes them through training on auxiliary annotated anomaly detection data. Two types of learnable prompts are proposed: static and dynamic. Static prompts are shared across all images, serving to preliminarily adapt CLIP for ZSAD. In contrast, dynamic prompts are generated for each test image, providing CLIP with dynamic adaptation capabilities. The combination of static and dynamic prompts is referred to as hybrid prompts, and yields enhanced ZSAD performance. Extensive experiments conducted across 14 real-world anomaly detection datasets from industrial and medical domains indicate that AdaCLIP outperforms other ZSAD methods and can generalize better to different categories and even domains. Finally, our analysis highlights the importance of diverse auxiliary data and optimized prompts for enhanced generalization capacity. Code is available at https://github.com/caoyunkang/AdaCLIP.
Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is a crucial task when training data is not accessible due to various concerns, e.g., data privacy, yet it is challenging since the models need to generalize to anomalies across different domains where the appearance of foreground objects, abnormal regions, and background features, such as defects/tumors on different products/organs, can vary significantly. Recently large pre-trained vision-language models (VLMs), such as CLIP, have demonstrated strong zero-shot recognition ability in various vision tasks, including anomaly detection. However, their ZSAD performance is weak since the VLMs focus more on modeling the class semantics of the foreground objects rather than the abnormality/normality in the images. In this paper we introduce a novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across different domains. The key insight of AnomalyCLIP is to learn object-agnostic text prompts that capture generic normality and abnormality in an image regardless of its foreground objects. This allows our model to focus on the abnormal image regions rather than the object semantics, enabling generalized normality and abnormality recognition on diverse types of objects. Large-scale experiments on 17 real-world anomaly detection datasets show that AnomalyCLIP achieves superior zero-shot performance of detecting and segmenting anomalies in datasets of highly diverse class semantics from various defect inspection and medical imaging domains. Code will be made available at https://github.com/zqhang/AnomalyCLIP.
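The object-agnostic prompt idea can be pictured as learnable context embeddings shared across all object categories, one context per state (normal/abnormal). The sketch below is a generic CoOp-style illustration with assumed dimensions, not AnomalyCLIP's exact prompt design.

```python
# Generic learnable-prompt sketch (CoOp-style): learned context vectors are
# prepended to a fixed state-word embedding; dimensions are assumptions and
# this is not AnomalyCLIP's exact design.
import torch
import torch.nn as nn

class ObjectAgnosticPrompt(nn.Module):
    def __init__(self, ctx_len: int = 12, dim: int = 512):
        super().__init__()
        # one learned context per state, shared across all object categories
        self.normal_ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)
        self.abnormal_ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)

    def forward(self, state_tok: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """state_tok: (1, dim) embedding of a generic state word like 'object'."""
        normal = torch.cat([self.normal_ctx, state_tok], dim=0)
        abnormal = torch.cat([self.abnormal_ctx, state_tok], dim=0)
        return normal, abnormal   # then fed through CLIP's text transformer
```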
The vision-language model has brought great improvement to few-shot industrial anomaly detection, which usually requires designing hundreds of prompts through prompt engineering. For automated scenarios, we first use conventional prompt learning with the many-class paradigm as the baseline to automatically learn prompts but found that it cannot work well in one-class anomaly detection. To address the above problem, this paper proposes a one-class prompt learning method for few-shot anomaly detection, termed PromptAD. First, we propose semantic concatenation which can transpose normal prompts into anomaly prompts by concatenating normal prompts with anomaly suffixes, thus constructing a large number of negative samples used to guide prompt learning in the one-class setting. Furthermore, to mitigate the training challenge caused by the absence of anomaly images, we introduce the concept of explicit anomaly margin, which is used to explicitly control the margin between normal prompt features and anomaly prompt features through a hyper-parameter. For image-level/pixel-level anomaly detection, PromptAD achieves first place in 11/12 few-shot settings on MVTec and VisA. Code is available at https://github.com/FuNz-0/PromptAD.git
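PromptAD's two ideas, semantic concatenation (normal prompt + anomaly suffix → anomaly prompt) and an explicit anomaly margin, can be sketched as below; the hinge form of the loss is an assumption, not the paper's exact formula.

```python
# Sketch of semantic concatenation and an explicit anomaly margin
# (the hinge loss form below is an assumption, not PromptAD's exact loss).
import torch
import torch.nn.functional as F

def semantic_concat(normal_prompt: str, suffixes: list[str]) -> list[str]:
    return [f"{normal_prompt} with {s}" for s in suffixes]  # e.g. "with a crack"

def explicit_margin_loss(img: torch.Tensor, t_norm: torch.Tensor,
                         t_anom: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """img: (C,), t_norm: (C,), t_anom: (K, C); all L2-normalized."""
    d_norm = 1.0 - img @ t_norm            # distance to the normal prompt
    d_anom = (1.0 - t_anom @ img).min()    # distance to closest anomaly prompt
    # a normal image must sit at least `margin` closer to the normal prompt
    return F.relu(d_norm - d_anom + margin)
```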
Recent advancements in large-scale visual-language pre-trained models have led to significant progress in zero/few-shot anomaly detection within natural image domains. However, the substantial domain divergence between natural and medical images limits the effectiveness of these methodologies in medical anomaly detection. This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels. This multi-level adaptation is guided by multi-level, pixel-wise visual-language feature alignment loss functions, which recalibrate the model's focus from object semantics in natural imagery to anomaly identification in medical images. The adapted features exhibit improved generalization across various medical data types, even in zero-shot scenarios where the model encounters unseen medical modalities and anatomical regions during training. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models, with an average AUC improvement of 6.24% and 7.33% for anomaly classification, 2.03% and 2.37% for anomaly segmentation, under the zero-shot and few-shot settings, respectively. Source code is available at: https://github.com/MediaBrain-SJTU/MVFA-AD
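The multi-level residual adapters described above can be sketched as bottleneck MLPs inserted at several encoder stages; the hidden size and placement below are illustrative assumptions.

```python
# Bottleneck residual adapter of the kind inserted at multiple CLIP encoder
# stages (hidden size and placement are illustrative assumptions).
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as identity: adapter adds zero
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))   # frozen-backbone residual

# One adapter per chosen stage; only adapter weights are trained, each guided
# in the paper's setup by a pixel-wise visual-language alignment loss.
```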
Industrial anomaly detection (IAD) is crucial for maintaining product quality in manufacturing, but it faces challenges due to limited data and diverse defect semantics. Zero-shot IAD has seen rapid development, while current methods mainly rely on pixel-level anomaly scores, lacking thorough anomaly region definition. The explanatory details related to domain knowledge, such as the anomaly categories, visual semantics, and potential causes and consequences, are often overlooked. Recently, Large Vision-Language Models (LVLMs) have exhibited remarkable perception and generalization abilities across various visual tasks. In this paper, we introduce RIASA, a Reasoning Industrial Anomaly Segmentation Assistant, which integrates reasoning segmentation into IAD via LVLMs. Building on the foundation of reasoning segmentation framework LISA, RIASA has made domain alignment enhancements through instruction-supervised fine-tuning on: (1) semantic segmentation and (2) visual question answering regarding anomaly details. We construct a multi-modal question-answering instruction dataset using GPT-4V, containing 1,890 high-quality question-answer pairs. Extensive experiments on several public 2D and 3D IAD benchmarks demonstrate that RIASA achieves superior performance in zero-shot anomaly segmentation and generalization. Additionally, RIASA’s multi-modal interaction capability provides professional visual descriptions within the IAD domain.
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes. Our project is available at https://guyao2023.github.io/Phys-AD/.
Anomaly detection is vital in various industrial scenarios, including the identification of unusual patterns in production lines and the detection of manufacturing defects for quality control. Existing techniques tend to be specialized in individual scenarios and lack generalization capacities. In this study, our objective is to develop a generic anomaly detection model that can be applied in multiple scenarios. To achieve this, we custom-build generic visual language foundation models that possess extensive knowledge and robust reasoning abilities as anomaly detectors and reasoners. Specifically, we introduce a multi-modal prompting strategy that incorporates domain knowledge from experts as conditions to guide the models. Our approach considers diverse prompt types, including task descriptions, class context, normality rules, and reference images. In addition, we unify the input representation of multi-modality into a 2D image format, enabling multi-modal anomaly detection and reasoning. Our preliminary studies demonstrate that combining visual and language prompts as conditions for customizing the models enhances anomaly detection performance. The customized models showcase the ability to detect anomalies across different data modalities such as images, point clouds, and videos. Qualitative case studies further highlight the anomaly detection and reasoning capabilities, particularly for multi-object scenes and temporal data. Our code is publicly available at https://github.com/Xiaohac-Xu/Customizable-VLM. More insights into customized foundation models for broader anomaly detection settings are available at the GitHub repo: https://github.com/caoyunkang/GPT4V-for-Generic-Anomaly-Detection.
SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle with industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model, and dataset are available at https://github.com/amoreZgx1n/SAGE.
Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image's visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining high-quality standards and minimizing costly recalls. Previous research in anomaly detection (AD) has relied on prior knowledge for designing algorithms, which often requires extensive manual annotations, significant computing power, and large amounts of data for training. Autoregressive, multimodal Vision Language Models (AVLMs) offer a promising alternative due to their exceptional performance in visual reasoning across various domains. Despite this, their application to logical AD remains unexplored. In this work, we investigate using AVLMs for logical AD and demonstrate that they are well-suited to the task. Combining AVLMs with format embedding and a logic reasoner, we achieve SOTA performance on the public benchmark MVTec LOCO AD, with an AUROC of 86.0% and an F1-max of 83.7%, along with explanations of the anomalies. This significantly outperforms the existing SOTA method by 18.1% in AUROC and 4.6% in F1-max score.
Recent advances in industrial anomaly detection have highlighted the need for deeper logical anomaly analysis, where unexpected relationships among objects, counts, and spatial configurations must be identified and explained. Existing approaches often rely on large-scale external reasoning modules or elaborate pipeline designs, hindering practical deployment and interpretability. To address these limitations, we introduce a new task, Reasoning Logical Anomaly Detection (RLAD), which extends traditional anomaly detection by incorporating logical reasoning. We propose a new framework, LAD-Reasoner, a customized tiny multimodal language model built on Qwen2.5-VL 3B. Our approach leverages a two-stage training paradigm that first employs Supervised Fine-Tuning (SFT) for fine-grained visual understanding, followed by Group Relative Policy Optimization (GRPO) to refine logical anomaly detection and enforce coherent, human-readable reasoning. Crucially, reward signals are derived from both the detection accuracy and the structural quality of the outputs, obviating the need for building chain of thought (CoT) reasoning data. Experiments on the MVTec LOCO AD dataset show that LAD-Reasoner, though significantly smaller, matches the performance of Qwen2.5-VL-72B in accuracy and F1 score, and further excels in producing concise and interpretable rationales. This unified design reduces reliance on large models and complex pipelines, while offering transparent and interpretable insights into logical anomaly detection. Code and data will be released.
AI-based automatic visual inspection systems have been extensively researched to streamline various industrial products' labor-intensive anomaly detection processes. Despite significant advancements, detecting logical anomalies remains challenging due to the multitude of rules governing the assembly of multiple components to create a normal product. Existing methods have relied solely on image information for anomaly detection, resulting in limited accuracy as they fail to account for these diverse complex rules. Instead, humans detect anomalies by comparing the image with pre-defined logic which can be clearly expressed with natural language. Inspired by the human decision process, we propose a logical anomaly detection model that leverages text-based logic like human reasoning. With user-defined rules (i.e., positive rules) and logically distinct negative rules, we train the model using component-aware contrastive learning that increases the similarity between images and positive rules while decreasing the similarity with negative rules. However, accurately comparing textual and visual features is challenging due to multiple components, each governed by different rules, within a single image. To address this, we developed a zero-shot related region detection technique, which guides the model's focus on components relevant to each rule. We evaluated the proposed model on three public datasets and achieved state-of-the-art results in a few-shot logical anomaly detection task. Our findings highlight the potential of integrating vision-language models to enhance logical anomaly detection and utilizing text-based logic in complex industrial settings.
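The positive/negative-rule contrastive objective described above can be sketched as an InfoNCE-style loss between an image (or component-region) feature and rule-text features; the temperature and multi-positive form are assumptions.

```python
# Sketch of contrastive training against textual rules: pull image (component)
# features toward positive-rule texts and push away logically negated rules.
# The InfoNCE form and temperature are assumptions.
import torch
import torch.nn.functional as F

def rule_contrastive_loss(img: torch.Tensor, pos_rules: torch.Tensor,
                          neg_rules: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """img: (C,); pos_rules: (P, C); neg_rules: (N, C); all L2-normalized."""
    pos = (pos_rules @ img) / tau     # similarities to positive rules
    neg = (neg_rules @ img) / tau     # similarities to negative rules
    logits = torch.cat([pos, neg])
    labels = torch.zeros_like(logits)
    labels[: pos.numel()] = 1.0 / pos.numel()   # mass spread over positives
    return -(labels * F.log_softmax(logits, dim=0)).sum()
```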
Visual anomaly detection is an important tool in industrial quality control that may impact production efficiency and product quality. Traditional supervised methods struggle due to the difficulty and cost of obtaining labeled anomaly data, particularly for logical anomalies that violate global constraints yet appear as normal samples. This study investigates multimodal large language models (MLLMs) in few-shot prompting configurations to address this issue. We introduce RuleText-AD, a lightweight anomaly detection framework based on textual rule engines automatically generated by MLLMs from minimal visual examples and concise textual schemas. RuleText-AD eliminates the need for external segmentation tools, extensive prompt engineering, or heavy visual processing backbones, enabling efficient anomaly detection via high-level attribute reasoning. Experiments are conducted on the MVTec-LOCO logical anomaly benchmark, demonstrating that Gemini 2.5 Flash provides the best balance between accuracy and computational cost, achieving competitive performance with significantly reduced deployment complexity. This work highlights the potential for scalable, interpretable, and cost-effective anomaly detection solutions suitable for diverse industrial environments. Our method targets logical (global, rule-violation) anomalies and does not address structural, pixel-level defects.
Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks (anomaly description, explanation, and justification) with fine-grained annotations for visual grounding, and categorizes anomalies based on their visual manifestations, complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
Automated anomaly detection is vital to industrial quality control, yet conventional deep learning detectors often struggle with scalability. These models, typically following a rigid “one-model-per-task” paradigm, require separate systems for each product line, increasing operational complexity and cost in diverse manufacturing environments. To address this limitation, we propose a unified defect detection framework based on a Multimodal Large Language Model (MLLM). Our approach utilizes a two-stage fine-tuning strategy: Supervised Fine-Tuning (SFT) to impart domain-specific knowledge, followed by a novel Reinforcement Fine-Tuning (RFT) process that refines visual reasoning. This RFT stage is guided by a multi-faceted verifiable reward function designed to optimize localization accuracy, classification correctness, and output structure. On a challenging real-world glove manufacturing dataset, our RFT-enhanced MLLM achieves a mean Average Precision (mAP) of 0.63, which is comparable to a highly specialized YOLO baseline (0.62). More importantly, a single, unified MLLM trained on a mixed-product dataset maintains competitive performance (mAP 0.61), demonstrating its ability to dynamically handle different products and defect types via natural language prompts. This study validates the feasibility of using a single, flexible MLLM to replace multiple rigid models in complex industrial inspection, offering a scalable and cost-effective paradigm for future intelligent quality control systems. The open-source code will be released at https://github.com/GloamXun/Glove-MLLM.
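A multi-faceted verifiable reward like the one described could, in its simplest form, sum a graded localization term (IoU), an exact-match classification term, and an output-structure term. The weights below are placeholders, not the paper's tuned values.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def verifiable_reward(pred_box, pred_cls, gt_box, gt_cls, well_formed: bool):
    """Illustrative RFT reward: localization + classification + structure."""
    r_loc = iou(pred_box, gt_box)               # graded localization reward
    r_cls = 1.0 if pred_cls == gt_cls else 0.0  # exact-match class reward
    r_fmt = 0.5 if well_formed else 0.0         # parsable output structure
    return r_loc + r_cls + r_fmt
```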
Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from "Anomaly Perception" to "Anomaly Interpretation". Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs; the largest improvement, on the DAGM dataset, raises average accuracy 43.3% above the 0.5B baseline. Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at https://github.com/Yanhui-Lee/IAD-R1.
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: https://xujiacong.github.io/Anomaly-OV/
Industrial Anomaly Detection (IAD) is increasingly critical in advanced manufacturing due to the high cost and operational disruptions caused by machine faults and process deviations. IAD supports early issue detection across the entire manufacturing lifecycle, from raw material handling to final assembly, thereby ensuring efficiency and minimizing unplanned downtime. However, the scarcity of labelled industrial anomaly data poses a major challenge, limiting the effectiveness and generalizability of traditional AI and machine learning models, which also tend to lack interpretability and delay critical decision-making. Recent advancements in Conversational AI, particularly Large Language Models (LLMs), offer promising opportunities to enhance explainability and operate under limited data conditions through their generative and reasoning capabilities. Despite this, LLMs still face notable challenges in object hallucination and limited comprehension of physical and temporal dynamics in complex industrial settings. This research explores the current landscape of IAD and proposes a conceptual hybrid anomaly detection architecture that integrates discriminative visual analysis models with LLMs. The proposed framework leverages scene understanding, anomaly feature detection, and motion abnormality analysis to produce enriched contextual embeddings, enhancing the interpretive capacity of the LLM for more human-readable explanations. While this paper focuses on reviewing existing methods and conceptualizing the proposed framework, future work will involve its implementation, experimentation, and validation to assess performance, interpretability, and practical applicability in real-world smart manufacturing environments.
We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.
Anomaly detection in complex industrial processes plays a pivotal role in ensuring efficient, stable, and secure operation. Existing anomaly detection methods primarily focus on analyzing dominant anomalies using the process variables (such as arc current) or constructing neural networks based on abnormal visual features, while overlooking the intrinsic correlation of cross-modal information. This paper proposes a cross-modal Transformer (dubbed FmFormer), designed to facilitate anomaly detection by exploring the correlation between visual features (video) and process variables (current) in the context of the fused magnesium smelting process. Our approach introduces a novel tokenization paradigm to effectively bridge the substantial dimensionality gap between the 3D video modality and the 1D current modality in a multiscale manner, enabling a hierarchical reconstruction of pixel-level anomaly detection. Subsequently, the FmFormer leverages self-attention to learn internal features within each modality and bidirectional cross-attention to capture correlations across modalities. By decoding the bidirectional correlation features, we obtain the final detection result and even locate the specific anomaly region. To validate the effectiveness of the proposed method, we also present a pioneering cross-modal benchmark of the fused magnesium smelting process, featuring synchronously acquired video and current data for over 2.2 million samples. Leveraging cross-modal learning, the proposed FmFormer achieves state-of-the-art performance in detecting anomalies, particularly under extreme interferences such as current fluctuations and visual occlusion caused by heavy water mist. The presented methodology and benchmark may be applicable to other industrial applications with some amendments. The benchmark will be released at https://github.com/GaochangWu/FMF-Benchmark.
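The bidirectional cross-attention at FmFormer's core can be sketched with standard attention layers. Token counts and dimensions below are illustrative, and the paper's multiscale tokenization of the 3D video and 1D current signals is assumed to happen upstream.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Minimal sketch of bidirectional cross-attention between video tokens
    and process-variable (current) tokens; not FmFormer's full architecture."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.v2c = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.c2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, current_tokens):
        # Video queries current: which electrical states explain what is seen.
        v_out, _ = self.v2c(video_tokens, current_tokens, current_tokens)
        # Current queries video: which visual regions explain the signal.
        c_out, _ = self.c2v(current_tokens, video_tokens, video_tokens)
        return v_out, c_out

video = torch.randn(2, 196, 256)   # (batch, video patch tokens, dim)
current = torch.randn(2, 32, 256)  # (batch, current-signal tokens, dim)
v, c = BidirectionalCrossAttention()(video, current)
```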
The scale-up of autonomous vehicles depends heavily on their ability to deal with anomalies, such as rare objects on the road. In order to handle such situations, it is necessary to detect anomalies in the first place. Anomaly detection for autonomous driving has made great progress in the past years but suffers from poorly designed benchmarks with a strong focus on camera data. In this work, we propose AnoVox, the largest benchmark for ANOmaly detection in autonomous driving to date. AnoVox incorporates large-scale multimodal sensor data and spatial VOXel ground truth, allowing for the comparison of methods independent of their used sensor. We propose a formal definition of normality and provide a compliant training dataset. AnoVox is the first benchmark to contain both content and temporal anomalies.
Object anomaly detection is an important problem in the field of machine vision and has seen remarkable progress recently. However, two significant challenges hinder its research and application. First, existing datasets lack comprehensive visual information from various pose angles. They usually make the unrealistic assumption that the anomaly-free training dataset is pose-aligned and that the testing samples have the same pose as the training data. However, in practice, anomalies may exist in any region of an object, and the training and query samples may have different poses, calling for the study of pose-agnostic anomaly detection. Second, the absence of a consensus on experimental protocols for pose-agnostic anomaly detection leads to unfair comparisons of different methods, hindering research in this direction. To address these issues, we develop the Multi-pose Anomaly Detection (MAD) dataset and the Pose-agnostic Anomaly Detection (PAD) benchmark, which take the first step toward addressing the pose-agnostic anomaly detection problem. Specifically, we build MAD using 20 complex-shaped LEGO toys, including 4K views with various poses, and high-quality and diverse 3D anomalies in both simulated and real environments. Additionally, we propose a novel method, OmniposeAD, trained using MAD and specifically designed for pose-agnostic anomaly detection. Through comprehensive evaluations, we demonstrate the relevance of our dataset and method. Furthermore, we provide an open-source benchmark library, including the dataset and baseline methods covering 8 anomaly detection paradigms, to facilitate future research and application in this domain. Code, data, and models are publicly available at https://github.com/EricLee0224/PAD.
Industrial Anomaly Detection (IAD) is vital for manufacturing, yet traditional methods face significant challenges: unsupervised approaches yield rough localizations requiring manual thresholds, while supervised methods overfit due to scarce, imbalanced data. Both suffer from the "One Anomaly Class, One Model" limitation. To address this, we propose Referring Industrial Anomaly Segmentation (RIAS), a paradigm leveraging language to guide detection. RIAS generates precise masks from text descriptions without manual thresholds and uses universal prompts to detect diverse anomalies with a single model. We introduce the MVTec-Ref dataset to support this, designed with diverse referring expressions and focusing on anomaly patterns, notably with 95% small anomalies. We also propose the Dual Query Token with Mask Group Transformer (DQFormer) benchmark, enhanced by Language-Gated Multi-Level Aggregation (LMA) to improve multi-scale segmentation. Unlike traditional methods using redundant queries, DQFormer employs only "Anomaly" and "Background" tokens for efficient visual-textual integration. Experiments demonstrate RIAS's effectiveness in advancing IAD toward open-set capabilities. Code: https://github.com/swagger-coder/RIAS-MVTec-Ref.
Industrial anomaly detection is critical for ensuring product quality, preventing equipment failure, and maintaining efficient manufacturing processes. Traditional methods often struggle with the diversity and complexity of potential anomalies and require extensive labeled data for specific tasks. Recent advancements in large pre-trained vision models and large language models, such as Grounding DINO for flexible object detection, Segment Anything Model V2 (SAM V2) for precise segmentation, and Llama 3 for large-corpus language reasoning, offer new possibilities for more adaptable anomaly detection systems. This paper proposes an agent-based framework that leverages the power of Grounding DINO, SAM V2, and Llama 3 for zero-shot industrial anomaly detection. Our method employs an intelligent agent to first localize potential anomalies using text prompts with Grounding DINO. These prompts are carefully designed to highlight common industrial anomalies to guide the model in detecting specific deviations from the norm. The identified regions are then segmented precisely using SAM V2, which generates pixel-level masks based on the bounding boxes detected in the localization step. Once the anomaly regions are segmented, the agent generates an anomaly mask that highlights the anomalous regions. Finally, the agent visualizes the anomaly detection results by overlaying the mask onto the original image, allowing human inspectors to identify and address potential defects. We evaluate our approach on the MVTec and VisA anomaly detection datasets. Compared with mainstream methods, our method improves accuracy by at least 1.9%, demonstrating its capability to effectively localize and segment various types of industrial anomalies across different object categories without requiring task-specific training data, thus significantly reducing costs.
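The agent flow reduces to a prompt-driven detect-then-segment loop. In the sketch below, `detect_regions` and `segment_boxes` are hypothetical stand-ins for Grounding DINO and SAM V2 wrappers (their real APIs differ), and the defect prompts are examples only.

```python
import numpy as np

def zero_shot_anomaly_pipeline(image, detect_regions, segment_boxes,
                               prompts=("scratch", "crack", "contamination")):
    """Sketch of the agent flow: text-prompted localization, per-box
    segmentation, then a mask overlay for human inspectors. `image` is an
    (H, W, 3) uint8 array; the two callables are placeholder wrappers."""
    h, w = image.shape[:2]
    anomaly_mask = np.zeros((h, w), dtype=bool)
    for prompt in prompts:
        boxes = detect_regions(image, prompt)          # text-prompted boxes
        for box in boxes:
            anomaly_mask |= segment_boxes(image, box)  # boolean mask per box
    overlay = image.copy()
    overlay[anomaly_mask] = [255, 0, 0]                # highlight in red
    return anomaly_mask, overlay
```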
Industrial Anomaly Detection (IAD) is critical to ensure product quality during manufacturing. Although existing zero-shot defect segmentation and detection methods have shown effectiveness, they cannot provide detailed descriptions of the defects. Furthermore, the application of large multi-modal models in IAD remains in its infancy, facing challenges in balancing question-answering (QA) performance and mask-based grounding capabilities, often owing to overfitting during the fine-tuning process. To address these challenges, we propose a novel approach that introduces a dedicated multi-modal defect localization module to decouple the dialog functionality from the core feature extraction. This decoupling is achieved through independent optimization objectives and tailored learning strategies. Additionally, we contribute the first multi-modal industrial anomaly detection training dataset, named Defect Detection Question Answering (DDQA), encompassing a wide range of defect types and industrial scenarios. Unlike conventional datasets that rely on GPT-generated data, DDQA ensures authenticity and reliability and offers a robust foundation for model training. Experimental results demonstrate that our proposed method, Explainable Industrial Anomaly Detection Assistant (EIAD), achieves outstanding performance in defect detection and localization tasks. It not only significantly enhances accuracy but also improves interpretability. These advancements highlight the potential of EIAD for practical applications in industrial settings.
Zero-Shot Industrial Anomaly Detection (ZSIAD) aims to identify and localize anomalies in industrial images from unseen categories. Owing to their powerful generalization capabilities, Vision-Language Models (VLMs) have attracted growing interest in ZSIAD. To guide the model toward understanding and localizing semantically complex industrial anomalies, existing VLM-based methods have attempted to provide additional prompts to the model through learnable text prompt templates. However, these zero-shot methods lack detailed descriptions of specific anomalies, making it difficult to classify and segment the diverse range of industrial anomalies accurately. To address this issue, we propose the first multi-stage prompt generation agent for ZSIAD. Specifically, we leverage a Multi-modal Large Language Model (MLLM) to articulate the detailed differential information between normal and test samples, which provides detailed text prompts to the model through further refinement and an anti-false-alarm constraint. Moreover, we introduce a Vision Foundation Model (VFM) to generate anomaly-related attention prompts for more accurate localization of anomalies with varying sizes and shapes. Extensive experiments on seven real-world industrial anomaly detection datasets show that the proposed method not only outperforms recent SOTA methods, but its explainable prompts also provide the model with a more intuitive basis for anomaly identification.
No abstract available
Industrial anomaly detection aims to identify and localize defective regions in images. Among various architectures, reconstruction-based methods have demonstrated exceptional performance. These methods reconstruct anomalous samples into normal ones and identify anomalies through a comparison between them. However, the reconstruction process within these methods often focuses on pixel-level similarity, overlooking the high-frequency consistency between the input and output, which constrains the model's accuracy. This article proposes tGARD, a novel text-guided adversarial reconstruction method for anomaly detection. Specifically, we introduce a feature aggregation module, using non-local blocks and dilated convolutions to handle complex anomaly patterns. Subsequently, a text-guided reconstruction module is meticulously designed to harness CLIP's multimodal alignment capabilities, allowing for a semantically controllable reconstruction process. This control is achieved by incorporating dynamic text embeddings derived from the CLIP encoder within the discriminator. Meanwhile, during the reconstruction process, high-frequency details are preserved through a convolutional adversarial discriminator. Finally, a category-aware loss weighting strategy is conceived to balance similarity and adversarial losses. Experiments demonstrate that our model achieves significant improvements in anomaly localization, surpassing all reconstruction-based models on MVTec-AD. It also establishes a new state-of-the-art on the VisA dataset, outperforming all existing architectures.
We propose a text-guided variational image generation method to address the challenge of getting clean data for anomaly detection in industrial manufacturing. Our method utilizes text information about the target object, learned from extensive text library documents, to generate non-defective data images resembling the input image. The proposed framework ensures that the generated non-defective images align with anticipated distributions derived from textual and image-based knowledge, ensuring stability and generality. Experimental results demonstrate the effectiveness of our approach, surpassing previous methods even with limited non-defective data. Our approach is validated through generalization tests across four baseline models and three distinct datasets. We present an additional analysis to enhance the effectiveness of anomaly detection models by utilizing the generated images.
2D-based Industrial Anomaly Detection has been widely discussed, however, multimodal industrial anomaly detection based on 3D point clouds and RGB images still has many untouched fields. Existing multimodal industrial anomaly detection methods directly concatenate the multimodal features, which leads to a strong disturbance between features and harms the detection performance. In this paper, we propose Multi-3D-Memory (M3DM), a novel multimodal anomaly detection method with hybrid fusion scheme: firstly, we design an unsupervised feature fusion with patch-wise contrastive learning to encourage the interaction of different modal features; secondly, we use a decision layer fusion with multiple memory banks to avoid loss of information and additional novelty classifiers to make the final decision. We further propose a point feature alignment operation to better align the point cloud and RGB features. Extensive experiments show that our multi-modal industrial anomaly detection model outperforms the state-of-the-art (SOTA) methods on both detection and segmentation precision on MVTec-3D AD dataset. Code at github.com/nomewang/M3DM.
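The decision-layer side of memory-bank methods such as M3DM rests on nearest-neighbor distances between test patches and a bank of nominal patch features. A minimal scoring sketch follows; shapes and dimensions are illustrative, and the bank is assumed to hold fused RGB/point-cloud patch features.

```python
import torch

def memory_bank_score(test_patches, memory_bank):
    """Score each test patch by its distance to the nearest nominal patch.
    Shapes: test_patches (M, D), memory_bank (N, D)."""
    dists = torch.cdist(test_patches, memory_bank)  # (M, N) pairwise distances
    patch_scores, _ = dists.min(dim=1)              # nearest-neighbor distance
    image_score = patch_scores.max()                # worst patch drives the image score
    return patch_scores, image_score

scores, s = memory_bank_score(torch.randn(784, 1024), torch.randn(10000, 1024))
```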
Recent advancements have shown the potential of leveraging both point clouds and images to localize anomalies. Nevertheless, their applicability in industrial manufacturing is often constrained by significant drawbacks, such as the use of memory banks, which leads to a substantial increase in memory footprint and inference times. We propose a novel light and fast framework that learns to map features from one modality to the other on nominal samples and detects anomalies by pinpointing inconsistencies between observed and mapped features. Extensive experiments show that our approach achieves state-of-the-art detection and segmentation performance in both the standard and few-shot settings on the MVTec 3D-AD dataset while achieving faster inference and occupying less memory than previous multimodal AD methods. Furthermore, we propose a layer-pruning technique to improve memory and time efficiency with a marginal sacrifice in performance.
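One direction of this feature-mapping idea can be written down in a few lines: learn to predict point-cloud patch features from RGB patch features on nominal samples, then read the residual as an anomaly map at test time. The layer sizes and feature dimensions below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

# Map 2D (RGB) patch features to 3D (point-cloud) patch features; trained
# only on nominal samples, so anomalies break the learned correspondence.
rgb_to_3d = nn.Sequential(nn.Linear(768, 1024), nn.GELU(), nn.Linear(1024, 1152))

def train_step(rgb_feats, pc_feats, opt):
    """rgb_feats: (P, 768) RGB patch features; pc_feats: (P, 1152) spatially
    aligned point-cloud patch features from the same nominal sample."""
    loss = nn.functional.mse_loss(rgb_to_3d(rgb_feats), pc_feats)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def anomaly_map(rgb_feats, pc_feats):
    with torch.no_grad():
        residual = (rgb_to_3d(rgb_feats) - pc_feats).norm(dim=-1)  # (P,)
    return residual  # high residual = feature the mapping could not explain
```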
Constructing comprehensive multimodal feature representations from RGB images (RGB) and point clouds (PT) in 2D–3D multimodal anomaly detection (MAD) methods is very important to reveal various types of industrial anomalies. For multimodal representations, most of the existing MAD methods often consider the explicit spatial correspondence between the modality-specific features extracted from RGB and PT through space-aligned fusion, while overlook the implicit interaction relationships between them. In this study, we propose a uni-modal and cross-modal fusion (UCF) method, which comprehensively incorporates the implicit relationships within and between modalities in multimodal representations. Specifically, UCF first establishes uni-modal and cross-modal embeddings to capture intramodal and intermodal relationships through uni-modal reconstruction and cross-modal mapping. Then, an adaptive nonequal fusion method is proposed to develop fusion embeddings, with the aim of preserving the primary features and reducing interference of the uni-modal and cross-modal embeddings. Finally, uni-modal, cross-modal, and fusion embeddings are all collaborated to reveal anomalies existing in different modalities. Experiments conducted on the MVTec 3D-AD benchmark and the real-world surface mount inspection demonstrate that the proposed UCF outperforms existing approaches, particularly in precise anomaly localization.
Accurate detection and precise localization of anomalies during precision component manufacturing are essential to maintaining high product quality. Multimodal industrial anomaly detection (MIAD) harnesses data from diverse sensors to effectively identify and pinpoint defects in industrial products. Recent MIAD approaches have made significant progress but often ignore point cloud data global contextual semantics and modality-specific information, resulting in an incomplete representation of point cloud and inadequate multimodal fusion. To confront these issues head-on, we propose a robust feature representation and comprehensive multimodal feature fusion network [views-graph and latent feature disentangled fusion network (VLDFNet)] for anomaly detection in industrial high-precision components. VLDFNet mainly consists of a point cloud views-graph representation model and a multimodal disentangled feature latent space fusion module. Specifically, the point cloud views-graph representation model explores spatial locations and semantic relationships between views using multilevel graph fusion. The multimodal disentangled feature latent space fusion module disentangles multimodal features into shared and specific representations to mitigate the omission of modality-specific information. VLDFNet introduces a cross-modal shared feature interaction (CSFI) strategy to extract coherent semantic information by aligning and integrating cross-modal features. Comprehensive experimental results on multiple datasets demonstrate that our method significantly outperforms existing approaches in detection accuracy.
No abstract available
Recent studies of multimodal industrial anomaly detection (IAD) based on 3D point clouds and RGB images have highlighted the importance of exploiting the redundancy and complementarity among modalities for accurate classification and segmentation. However, achieving multimodal IAD in practical production lines remains a work in progress. It is essential to consider the trade-offs between the costs and benefits associated with the introduction of new modalities while ensuring compatibility with current processes. Existing quality control processes combine rapid in-line inspections, such as optical and infrared imaging with high-resolution but time-consuming near-line characterization techniques, including industrial CT and electron microscopy to manually or semi-automatically locate and analyze defects in the production of Li-ion batteries and composite materials. Given the cost and time limitations, only a subset of the samples can be inspected by all in-line and near-line methods, and the remaining samples are only evaluated through one or two forms of in-line inspection. To fully exploit data for deep learning-driven automatic defect detection, the models must have the ability to leverage multimodal training and handle incomplete modalities during inference. In this paper, we propose CMDIAD, a Cross-Modal Distillation framework for IAD to demonstrate the feasibility of a Multi-modal Training, Few-modal Inference (MTFI) pipeline. Our findings show that the MTFI pipeline can more effectively utilize incomplete multimodal information compared to applying only a single modality for training and inference. Moreover, we investigate the reasons behind the asymmetric performance improvement using point clouds or RGB images as the main modality of inference. This provides a foundation for our future multimodal dataset construction with additional modalities from manufacturing scenarios.
No abstract available
No abstract available
We demonstrate an end-to-end system for real-time, multimodal industrial anomaly detection (IAD), built upon a custom hardware platform for synchronized 2D and 3D data acquisition. Our core contribution is a novel cross-modal residual mechanism that identifies defects by quantifying predictive errors between visual and geometric feature spaces. Instead of traditional concatenation, our dual-stream architecture mutually predicts features across modalities, leveraging the prediction residual's magnitude as a direct and robust anomaly indicator. The entire system achieves sub-second inference from acquisition to decision, enabled by efficient depth map analysis that circumvents the complexity of direct point cloud processing, offering a deployable solution for high-speed inspection.
The purpose of multimodal industrial anomaly detection is to detect complex geometric shape defects, such as subtle surface deformations and irregular contours, that are difficult to detect with 2D-based methods. However, current multimodal industrial anomaly detection lacks effective use of crucial geometric information like surface normal vectors and 3D shape topology, resulting in low detection accuracy. In this paper, we propose a novel Geometric Prior-based Anomaly Detection network (GPAD). Firstly, we propose a point cloud expert model to perform fine-grained geometric feature extraction, employing differential normal vector computation to enhance the geometric details of the extracted features and generate a geometric prior. Secondly, we propose a two-stage fusion strategy to efficiently leverage the complementarity of multimodal data as well as the geometric prior inherent in 3D points. We further propose attention fusion and anomaly region segmentation based on the geometric prior, which enhance the model's ability to perceive geometric defects. Extensive experiments show that our multimodal industrial anomaly detection model outperforms state-of-the-art (SOTA) methods in detection accuracy on both the MVTec-3D AD and Eyecandies datasets.
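As background on the kind of geometric prior (surface normals) such methods build on, the classic PCA normal estimator is shown below. GPAD's own extractor is learned, so this is context rather than its implementation; the neighborhood size is arbitrary.

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_normals(points, k=16):
    """PCA normal estimation for a point cloud (N, 3): each point's normal
    is the smallest principal axis of its local neighborhood."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)            # k nearest neighbors per point
    normals = np.empty_like(points)
    for i, nbrs in enumerate(idx):
        patch = points[nbrs] - points[nbrs].mean(axis=0)
        # The right-singular vector with the smallest singular value
        # approximates the local surface normal.
        _, _, vt = np.linalg.svd(patch, full_matrices=False)
        normals[i] = vt[-1]
    return normals

normals = estimate_normals(np.random.rand(2048, 3))
```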
Multimodal industrial anomaly detection (IAD), which integrates RGB and 3D information, has become one of the key technical directions for improving detection robustness and accuracy. Although prevailing cross-modal feature-mapping methods are efficient and lightweight, they still suffer from two major limitations. First, they typically adopt a one-way modeling paradigm that regresses one modality from another and lack explicit interaction within a unified representation space, making it difficult to detect local, small-magnitude anomalies that appear only in a single modality. Second, fusion-reconstruction methods derived from this paradigm rely on a single fusion stream optimized with a reconstruction loss. When trained solely on normal samples, this design can overgeneralize and lacks a parallel branch to enforce consistency constraints on the fused representations, which in turn limits reliable discrimination between normal and anomalous patterns in complex multimodal scenarios. To address these issues, we propose FMFR, a feature-level multistage fusion and remapping framework that jointly models multistage feature fusion and cross-modal remapping. The framework consists of a fusion-reconstruction branch and a remapping-fusion branch, which are jointly constrained by a multi-order consistency loss. In the fusion-reconstruction branch, a reconstruction loss supervises the intermediate fusion layers, encouraging them to learn joint representations that retain complete information and to reconstruct features without losing critical details. In the remapping-fusion branch, the network learns bidirectional mappings between modalities and re-fuses the remapped features, while the multi-order consistency loss is used to align its fused representations with those of the fusion-reconstruction branch. During inference, FMFR jointly leverages intra-modal reconstruction residuals, cross-modal remapping residuals, and the consistency deviation between the fused embeddings of the two branches to construct multi-source anomaly maps. This design forces anomalies to simultaneously violate both intra-modal and cross-modal priors, thereby suppressing the overgeneralization of a single fusion stream and enhancing the visibility of local anomaly structures that exist only in a single modality as well as the overall robustness of anomaly detection. Experimental results on the MVTec 3D-AD dataset demonstrate that FMFR achieves competitive state-of-the-art performance on both anomaly detection and anomaly segmentation tasks.
No abstract available
Anomaly detection plays a critical role in ensuring safe, smooth, and efficient operation of machinery and equipment in industrial environments. With the wide deployment of multimodal sensors and the rapid development of Internet of Things (IoT), the data generated in modern industrial production has become increasingly diverse and complex. However, traditional methods for anomaly detection based on a single data source cannot fully utilize multimodal data to capture anomalies in industrial systems. To address this challenge, we propose a new model for anomaly detection in industrial environments using multimodal temporal data. This model integrates an attention-based autoencoder (AAE) and a generative adversarial network (GAN) to capture and fuse rich information from different data sources. Specifically, the AAE captures time-series dependencies and relevant features in each modality, and the GAN introduces adversarial regularization to enhance the model’s ability to reconstruct normal time-series data. We conduct extensive experiments on real industrial data containing both measurements from a distributed control system (DCS) and acoustic signals, and the results demonstrate the performance superiority of the proposed model over the state-of-the-art TimesNet for anomaly detection, with an improvement of 5.6% in F1 score.
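Stripped of the GAN regularizer, the reconstruction-based core of such a model can be sketched as an attention autoencoder scored by reconstruction error. Dimensions, layer counts, and the feature makeup of the window are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttnAutoencoder(nn.Module):
    """Minimal stand-in for an attention-based autoencoder over multimodal
    time-series windows; the paper's adversarial branch is omitted here."""
    def __init__(self, n_features, dim=64):
        super().__init__()
        self.embed = nn.Linear(n_features, dim)
        self.attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.decode = nn.Linear(dim, n_features)

    def forward(self, x):  # x: (batch, time, features)
        return self.decode(self.attn(self.embed(x)))

model = AttnAutoencoder(n_features=12)
window = torch.randn(8, 100, 12)  # e.g. DCS variables plus acoustic features
# Per-window anomaly score: mean squared reconstruction error.
score = (model(window) - window).pow(2).mean(dim=(1, 2))
```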
Existing industrial anomaly detection methods primarily concentrate on unsupervised learning with pristine RGB images. Yet both RGB and 3D data are crucial for anomaly detection, and datasets are seldom completely clean in practical scenarios. To address these challenges, this paper takes an initial step into RGB-3D multi-modal noisy anomaly detection, proposing a novel noise-resistant M3DM-NR framework that leverages the strong multi-modal discriminative capabilities of CLIP. M3DM-NR consists of three stages: Stage-I introduces a Suspected References Selection module to filter a few normal samples from the training dataset, using the multimodal features extracted by the Initial Feature Extraction step, and a Suspected Anomaly Map Computation module to generate a suspected anomaly map that focuses on abnormal regions as reference. Stage-II uses the suspected anomaly maps of the reference samples as reference and inputs image, point cloud, and text information to denoise the training samples through intra-modal comparison and multi-scale aggregation operations. Finally, Stage-III proposes the Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion modules to learn the pattern of the training dataset, enabling anomaly detection and segmentation while filtering out noise. Extensive experiments show that M3DM-NR outperforms state-of-the-art methods in 3D-RGB multi-modal noisy anomaly detection.
Industrial anomaly detection involves identifying abnormal regions in products and plays a crucial role in quality inspection. While 2D image-based anomaly detection has been extensively explored, combining two-dimensional (2D) images with three-dimensional (3D) point clouds remains less studied. Existing multimodal methods often combine features from different modalities, leading to feature interference and degraded performance. To overcome this, we propose a novel framework for unsupervised industrial anomaly detection that leverages both visual and geometric information. Specifically, we use pre-trained 2D and 3D models to extract visual features from color images and geometric features from 3D point clouds. Instead of directly fusing these features, we propose a geometric feature reconstruction network that predicts 3D geometric features from the 2D visual features. During training, we minimize the difference between the predicted geometric features and the extracted geometric features, enabling the model to learn how 2D appearance correlates with 3D structure in anomaly-free images. During inference, this learned relationship allows the model to detect anomalies: significant discrepancies between the reconstructed and actual geometric features indicate abnormal regions. Evaluated on the MVTec 3D-AD dataset, our method achieves state-of-the-art performance with an average image-level AUROC score of 0.968, surpassing previous approaches. Additionally, it provides fast inference at 8.2 frames per second with a memory footprint of only 1045 MB, making it highly efficient for industrial applications.
Anomaly synthesis strategies can effectively enhance unsupervised anomaly detection. However, existing strategies have limitations in the coverage and controllability of anomaly synthesis, particularly for weak defects that are very similar to normal regions. In this paper, we propose the Global and Local Anomaly co-Synthesis Strategy (GLASS), a novel unified framework designed to synthesize a broader coverage of anomalies under the manifold and hypersphere distribution constraints of Global Anomaly Synthesis (GAS) at the feature level and Local Anomaly Synthesis (LAS) at the image level. Our method synthesizes near-in-distribution anomalies in a controllable way using Gaussian noise guided by gradient ascent and truncated projection. GLASS achieves state-of-the-art results on the MVTec AD (detection AUROC of 99.9%), VisA, and MPDD datasets and excels in weak defect detection. The effectiveness and efficiency have been further validated in industrial applications for woven fabric defect detection. The code and dataset are available at: https://github.com/cqylunlun/GLASS.
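At the feature level, the gradient-guided, truncated synthesis of near-in-distribution anomalies might look roughly like the following. The step size, step count, projection radius, and the loss callable are all assumptions, not GLASS's exact formulation.

```python
import torch

def synthesize_feature_anomaly(feats, model_loss, eps=0.05, steps=3, radius=0.1):
    """Sketch of gradient-ascent anomaly synthesis: start from Gaussian noise,
    push features in the direction that increases a discriminator's loss, and
    truncate the total perturbation so the result stays near the normal
    manifold (a 'weak' defect)."""
    delta = 0.01 * torch.randn_like(feats)
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = model_loss(feats + delta)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta = delta + eps * grad.sign()  # gradient ascent step
            norm = delta.norm()
            if norm > radius:                  # truncated projection
                delta = delta * (radius / norm)
    return feats + delta.detach()

anomalous = synthesize_feature_anomaly(
    torch.randn(64, 256), model_loss=lambda f: f.pow(2).mean())
```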
This report synthesizes five core paths toward the "one-for-all" goal in multimodal industrial anomaly detection: 1) introducing MLLMs/VLMs to enable deep detection with logical reasoning and explanation capabilities; 2) leveraging pre-trained models such as CLIP with prompt learning to tackle zero-shot generalization; 3) deeply fusing RGB, 3D point clouds, and process variables to strengthen physical representations; 4) building unified frameworks and large-scale multimodal benchmarks to drive cross-domain generality; and 5) exploring text-guided generative data augmentation. The overall trend shows industrial anomaly detection evolving from isolated pixel-level discrimination toward an intelligent paradigm of multimodal fusion, semantic reasoning, and all-scenario generality.