Multimodal Industrial Anomaly Detection: Achieving One-for-All
Logical Reasoning and Explainable Detection Based on Multimodal Large Models (MLLM/VLM)
This group of papers leverages the strong semantic understanding and reasoning abilities of large vision-language models (LVLMs) to lift anomaly detection from pure pixel-level segmentation to the level of anomaly explanation, causal analysis, and logical verification. Research focuses include chain-of-thought (CoT) prompting, reinforcement learning (GRPO), instruction tuning, and rule engines for tackling logical anomalies that traditional methods struggle to identify (e.g., missing components, wrong ordering), along with producing human-readable inspection reports; a minimal chain-of-thought query sketch follows the list below.
- LLM-Based Hybrid Framework for Industrial Anomaly Detection for Smart Manufacturing(Fu Swee Tee, Lau Bee Theng, Mark Tee Kit Tsun, Deron Foo Yijia, 2025, 2025 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET))
- Customizing Visual-Language Foundation Models for Multi-Modal Anomaly Detection and Reasoning(Xiaohao Xu, Yunkang Cao, Yongqi Chen, Weiming Shen, Xiaonan Huang, 2024, 2025 28th International Conference on Computer Supported Cooperative Work in Design (CSCWD))
- Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models(Jiacong Xu, Shao-Yuan Lo, Bardia Safaei, Vishal M. Patel, Isht Dwivedi, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction(Er Jin, Qihui Feng, Yongli Mou, Stefan Decker, G. Lakemeyer, Oliver Simons, Johannes Stegmaier, 2025, ArXiv)
- LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning(Peijian Zeng, Feiyan Pang, Zhanbo Wang, Aimin Yang, 2025, 2025 IEEE International Conference on Data Mining (ICDM))
- OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning(Shifang Zhao, Yiheng Lin, Lu Han, Yao Zhao, Yunchao Wei, 2025, ArXiv)
- PB-IAD: Utilizing multimodal foundation models for semantic industrial anomaly detection in dynamic manufacturing environments(Bernd Hofmann, Albert Scheck, Joerg Franke, Patrick Bruendl, 2025, ArXiv)
- SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment(Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan, Yang Song, Lei Fan, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Towards VLM-based Hybrid Explainable Prompt Enhancement for Zero-Shot Industrial Anomaly Detection(Weichao Cai, Weiliang Huang, Yunkang Cao, Chao Huang, Fei Yuan, Bob Zhang, Jie Wen, 2025, No journal)
- AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection(Yuhao Chao, Jie Liu, Jie Tang, Gangshan Wu, 2025, ArXiv)
- Can Multimodal Large Language Models be Guided to Improve Industrial Anomaly Detection?(Zhiling Chen, Hanning Chen, Mohsen Imani, Farhad Imani, 2025, ArXiv)
- EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO(Wei Guan, Jun Lan, Jian Cao, Hao Tan, Huijia Zhu, Weiqiang Wang, 2025, ArXiv)
- Towards Training-free Anomaly Detection with Vision and Language Foundation Models(Jinjin Zhang, Guodong Wang, Yizhou Jin, Di Huang, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- RIASA: Enhancing Reasoning Industrial Anomaly Segmentation via Large Vision-Language Models(Zongyun Zhang, Xian Gao, Jiacheng Ruan, Ting Liu, Yuzhuo Fu, 2025, 2025 IEEE International Conference on Multimedia and Expo Workshops (ICMEW))
- IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning(Mengyang Zhao, Teng Fu, Haiyang Yu, Ke Niu, Bin Li, 2025, ArXiv)
- Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection(Yuanze Li, Haolin Wang, Shihao Yuan, Ming Liu, Debin Zhao, Yiwen Guo, Chen Xu, Guangming Shi, Wangmeng Zuo, 2023, ArXiv)
- IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection(Zewen Li, Zitong Yu, Qilang Ye, Weicheng Xie, Wei Zhuo, Linlin Shen, 2025, IEEE Transactions on Instrumentation and Measurement)
- IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection(Yanhui Li, Yunkang Cao, Chengliang Liu, Yuan Xiong, Xinghui Dong, Chao Huang, 2025, ArXiv)
- EIAD: Explainable Industrial Anomaly Detection Via Multi-Modal Large Language Models(Zongyun Zhang, Jiacheng Ruan, Xian Gao, Ting Liu, Yuzhuo Fu, 2025, 2025 IEEE International Conference on Multimedia and Expo (ICME))
- Collaborative Anomaly Detection Using Agent-Based Modeling in Smart Manufacturing(Shuai Yuan, Haoxu Nong, Yunbo Rao, 2025, 2025 IEEE 8th International Conference on Pattern Recognition and Artificial Intelligence (PRAI))
- Referring Industrial Anomaly Segmentation(Pengfei Yue, Xiaokang Jiang, Yilin Lu, Jianghang Lin, Shengchuan Zhang, Liujuan Cao, 2026, No journal)
- LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection(Weijia Li, Guanglei Chu, Jionghuan Chen, Guo-Sen Xie, Caifeng Shan, Fang Zhao, 2025, ArXiv)
- RuleText-AD: Logical Anomaly Detection via Textual Rule Engines Generated by MLLMs(Leandro Silva, Mauricio Schiezaro, D. Oliveira, 2025, 2025 38th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI))
- Logical Anomaly Detection with Text-based Logic via Component-Aware Contrastive Language-Image Training(Seung-eon Lee, Soopil Kim, Sion An, Sang-Chul Lee, Sanghyun Park, 2025, Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2)
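As a rough illustration of the shared mechanism behind this group (not any single paper's pipeline), the sketch below sends an image plus a chain-of-thought inspection instruction to a generic MLLM endpoint; the OpenAI client, model name, and prompt wording are illustrative assumptions.

```python
# Minimal chain-of-thought inspection query against a generic MLLM endpoint
# (client, model choice, and prompt wording are illustrative assumptions).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def inspect(image_path: str) -> str:
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Inspect this industrial part step by step: list its "
                         "components, check count/position/order against a "
                         "normal part, then conclude 'normal' or 'anomalous' "
                         "with reasons."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```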
Zero-Shot and Prompt-Learning Strategies Based on Vision-Language Pre-training (CLIP)
These papers focus on exploiting the cross-modal alignment of pre-trained models such as CLIP, using prompt learning, adapter fine-tuning, or feature decoupling to achieve zero-shot or few-shot anomaly detection without category-specific training. The aim is to cope with the scarcity of labeled samples in industrial settings and to improve generalization and transfer; a minimal zero-shot scoring sketch follows the list below.
- PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection(Xiaofan Li, Zhizhong Zhang, Xin Tan, Chengwei Chen, Yanyun Qu, Yuan Xie, Lizhuang Ma, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- MGFD-CLIP: Multi-Granularity Feature Decoupling for Zero-Shot Industrial Anomaly Detection(Zichun Zhang, Jiehao Chen, 2025, 2025 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA))
- YOLOSAM: A unified and efficient anomaly detection model based on auto mask prompt(Ruizhi Yu, Weiting Chen, Jiahao Fan, Xiang Li, Zheming Fan, Qing Zhang, 2025, Signal, Image and Video Processing)
- MALM-CLIP: A generative multi-agent framework for multimodal fusion in few-shot industrial anomaly detection(Hanzhi Chen, Jingbin Que, Kexin Zhu, Zhide Chen, F. Zhu, Wencheng Yang, Xu Yang, Xuechao Yang, 2025, Inf. Fusion)
- Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images(Chaoqin Huang, Aofan Jiang, Jinghao Feng, Ya Zhang, Xinchao Wang, Yanfeng Wang, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection(Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, Jiming Chen, 2023, ArXiv)
- AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection(Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, Giacomo Boracchi, 2024, ArXiv)
- FocusPatch AD: Few-Shot Multi-Class Anomaly Detection With Unified Keywords Patch Prompts(Xicheng Ding, Xiaofan Li, Mingang Chen, Jingyu Gong, Yuan Xie, 2025, IEEE Transactions on Image Processing)
- tGARD: Text-Guided Adversarial Reconstruction for Industrial Anomaly Detection(Yuchen Qiang, Jiuxin Cao, Shiwei Zhou, Junyang Yang, Lijia Yu, Bo Liu, 2025, IEEE Transactions on Industrial Informatics)
- Supad: a superordinary zero-shot industrial anomaly detection network based on gated-agnostic multimodal adaptive learning prompts(Xinying Li, Junfeng Jing, Tong Wu, Xin Zhang, Wei Liu, 2026, Journal of Intelligent Manufacturing)
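The shared core of these CLIP-based methods is comparing image features against textual descriptions of normality and abnormality. The sketch below scores one image against handcrafted prompt ensembles via CLIP similarity; the checkpoint and prompt wording are assumptions, not taken from any cited paper.

```python
# Minimal zero-shot anomaly scoring with CLIP (illustrative; prompt wording
# and model checkpoint are assumptions, not any cited paper's exact setup).
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def anomaly_score(image: Image.Image, obj: str = "metal part") -> float:
    normal = [f"a photo of a flawless {obj}", f"a photo of a perfect {obj}"]
    abnormal = [f"a photo of a damaged {obj}", f"a photo of a defective {obj}"]
    inputs = processor(text=normal + abnormal, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (1, num_prompts) temperature-scaled cosine similarities
    sims = out.logits_per_image.softmax(dim=-1).squeeze(0)
    return sims[len(normal):].sum().item()  # mass on the "abnormal" ensemble
```

Pixel-level variants of this idea compare patch tokens, rather than the global image feature, against the same text embeddings.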
Feature Fusion and Geometric Modeling for Multi-Source Sensors (RGB/3D/Process Variables)
This group addresses complementary multimodal information at the physical level, in particular combining RGB images with 3D point clouds, depth maps, and industrial process variables (e.g., current, pressure). Topics covered include cross-modal feature remapping, geometric-prior enhancement, graph-neural-network representations, and robust detection under missing modalities or noise; a minimal fusion-and-scoring sketch follows the list below.
- M3DM-NR: RGB-3D Noisy-Resistant Industrial Anomaly Detection via Multimodal Denoising(Chengjie Wang, Haokun Zhu, Jinlong Peng, Yue Wang, Ran Yi, Yunsheng Wu, Lizhuang Ma, Jiangning Zhang, 2024, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- FMFR: Feature-level Multistage Fusion and Remapping for Multimodal Industrial Anomaly Detection(Chunshui Wang, Hengran Zhang, 2026, Journal of Computational Design and Engineering)
- A Streamlined System for Multimodal Industrial Anomaly Detection via 2D and 3D Feature Fusion(Wenbing Zhu, Mingmin Chi, Bo Peng, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Unsupervised Visual-to-Geometric Feature Reconstruction for Vision-Based Industrial Anomaly Detection(Dinh-Cuong Hoang, Phan Xuan Tan, Anh-Nhat Nguyen, Duc-Thanh Tran, van-Hiep Duong, Anh-Truong Mai, D. Pham, Khanh-Toan Phan, Minh-Quang Do, Ta Huu Anh Duong, Tuan-Minh Huynh, Son-Anh Bui, Duc-Manh Nguyen, Viet-Anh Trinh, Khanh-Duong Tran, Thu-Uyen Nguyen, 2025, IEEE Access)
- Multimodal Industrial Anomaly Detection via Hybrid Fusion(Yue Wang, Jinlong Peng, Jiangning Zhang, Ran Yi, Yabiao Wang, Chengjie Wang, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Multimodal Industrial Anomaly Detection via Geometric Prior(Min Li, Jinghui He, Gang Li, Jiachen Li, Jin Wan, Delong Han, 2026, IEEE Transactions on Circuits and Systems for Video Technology)
- Incomplete multimodal industrial anomaly detection via cross-modal distillation(Wenbo Sui, Daniel Lichau, Josselin Lefèvre, Harold Phelippeau, 2024, Inf. Fusion)
- A multimodal industrial anomaly detection method based on mask training and teacher-student joint memory(Yi Liu, Changsheng Zhang, Xingjun Dong, Yufei Yang, 2025, Eng. Appl. Artif. Intell.)
- CPIR: Multimodal Industrial Anomaly Detection via Latent Bridged Cross-modal Prediction and Intra-modal Reconstruction(Shangguan Wen, Hongqiang Wu, Yanchang Niu, Haonan Yin, Jiawei Yu, Bokui Chen, Biqing Huang, 2025, Adv. Eng. Informatics)
- MFGAN: Multimodal Fusion for Industrial Anomaly Detection Using Attention-Based Autoencoder and Generative Adversarial Network(Xinji Qu, Zhuo Liu, C. Wu, Aiqin Hou, Xiaoyan Yin, Zhulian Chen, 2024, Sensors (Basel, Switzerland))
- VLDFNet: Views-Graph and Latent Feature Disentangled Fusion Network for Multimodal Industrial Anomaly Detection(Chenxing Xia, Chaofan Liu, Yicong Zhou, Kuan Ching Li, 2025, IEEE Transactions on Instrumentation and Measurement)
- Multimodal Industrial Anomaly Detection via Uni-Modal and Cross-Modal Fusion(Hao Cheng, Jiaxiang Luo, Xianyong Zhang, 2025, IEEE Transactions on Industrial Informatics)
- Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping(Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano, 2023, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Cross-Modal Learning for Anomaly Detection in Complex Industrial Process: Methodology and Benchmark(Gaochang Wu, Yapeng Zhang, Lan Deng, Jingxin Zhang, Tianyou Chai, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Zoom-Anomaly: Multimodal vision-Language fusion industrial anomaly detection with synthetic data(Jiaqi Li, Shuhuan Wen, Hamid Reza Karimi, 2026, Inf. Fusion)
- Multimodal Industrial Anomaly Detection by Crossmodal Reverse Distillation(Xinyue Liu, Jianyuan Wang, Biao Leng, Shuo Zhang, 2024, ArXiv)
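A recurring backbone of these works is late fusion of aligned RGB and 3D patch features followed by nearest-neighbor scoring against a memory of normal samples. The concat-then-kNN scheme below is a simplified assumption inspired by hybrid-fusion pipelines such as M3DM; real systems first project point clouds onto pixels and use learned fusion blocks.

```python
# Minimal sketch of RGB+3D late fusion with a normal-sample memory bank
# (concat-then-kNN is an assumption; real pipelines align and fuse learnably).
import torch
import torch.nn.functional as F

def fuse(rgb_feats: torch.Tensor, pc_feats: torch.Tensor) -> torch.Tensor:
    """rgb_feats: (N, C1), pc_feats: (N, C2), spatially aligned per patch."""
    return F.normalize(torch.cat([rgb_feats, pc_feats], dim=-1), dim=-1)

def build_memory(normals: list[tuple[torch.Tensor, torch.Tensor]]) -> torch.Tensor:
    return torch.cat([fuse(r, p) for r, p in normals], dim=0)  # (M, C1+C2)

def patch_anomaly_map(memory: torch.Tensor, rgb: torch.Tensor,
                      pc: torch.Tensor) -> torch.Tensor:
    q = fuse(rgb, pc)                  # (N, C1+C2) fused test-patch features
    dists = torch.cdist(q, memory)     # (N, M) distances to all normal patches
    return dists.min(dim=1).values     # nearest-neighbor distance per patch
```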
Unified Framework Construction, Cross-Domain Generalization, and Benchmark Evaluation
These papers work to break the one-class-one-model constraint and build unified (one-for-all) detection frameworks that handle multiple categories and multiple domains (industrial, medical, etc.) at once. In parallel, researchers have built large-scale multimodal benchmark datasets (e.g., MMAD, AnoVox) that provide standards for evaluating generalization, pose-agnostic detection, and commonsense reasoning.
- MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection(Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, Feng Zheng, 2024, No journal)
- PAD: A Dataset and Benchmark for Pose-agnostic Anomaly Detection(Qiang Zhou, Weize Li, Lihan Jiang, Guoliang Wang, Guyue Zhou, Shanghang Zhang, Hao Zhao, 2023, ArXiv)
- AnoVox: A Benchmark for Multimodal Anomaly Detection in Autonomous Driving(Daniel Bogdoll, Iramm Hamdard, Lukas Namgyu Rößler, Felix Geisler, M. Bayram, F. Wang, Jan Imhof, M. Campos, Anushervon Tabarov, Yitian Yang, Hanno Gottschalk, J. M. Zöllner, 2024, No journal)
- Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection(Wenqiao Li, Yao Gu, Xintao Chen, Xiaohao Xu, Ming Hu, Xiaonan Huang, Yingna Wu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark(Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, A. Oliva, Giuseppe Lisanti, Luigi Di Stefano, 2025, ArXiv)
- UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection(Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang, 2024, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Omni-LLaMA-AD: A Unified Model for Open-Set Visual Anomaly Detection(Rongyu Zhang, Zhanbin Hu, Jiamu Wang, Qiang Zhu, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Unified Unsupervised Anomaly Detection via Matching Cost Filtering(Zhe Zhang, Mingxiu Cai, Gao‐Song Wu, Jing Zhang, Lingqiao Liu, Dacheng Tao, Tianyou Chai, Xiatian Zhu, 2025, ArXiv)
- PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection(Po-Han Huang, Jeng-Lin Li, Po-Hsuan Huang, Ming-Ching Chang, Wei-Chao Chen, 2025, ArXiv)
- CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments(Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish, Antoine Bosselut, 2025, ArXiv)
- Accurate industrial anomaly detection with efficient multimodal fusion(Dinh-Cuong Hoang, Phan Xuan Tan, Anh-Nhat Nguyen, Ta Huu Anh Duong, Tuan-Minh Huynh, Duc-Manh Nguyen, Minh-Duc Cao, D. Ngo, Thu-Uyen Nguyen, Khanh-Toan Phan, Minh-Quang Do, Xuan-Tung Dinh, van-Hiep Duong, Ngoc-Anh Hoang, van-Thiep Nguyen, 2025, Array)
- HFAD: Fair Federated Learning and Hybrid Fusion Multimodal Industrial Anomaly Detection(Dohyoung Kim, Kyoungsu Oh, Youngho Lee, 2024, Journal of the Korea Institute of Information and Communication Engineering)
- UniAD: A Real-World Multi-Category Industrial Anomaly Detection Dataset with a Unified CLIP-Based Framework(Junyang Yang, Jiuxin Cao, Chengge Duan, 2025, Inf.)
- A unified vision-language model for cross-product defect detection in glove manufacturing(Yusen Zhao, Liang Tian, Yonggang Wang, 2026, PLOS One)
Text-Guided Industrial Anomaly Synthesis and Data Augmentation
These papers use generative methods to address the scarcity of industrial anomaly samples. The focus is on guiding diffusion models or variational autoencoders with textual information to synthesize high-quality pseudo-anomalies, and on using diverse synthetic samples to boost downstream unsupervised detectors; a minimal text-guided inpainting sketch follows the list below.
- A Unified Anomaly Synthesis Strategy with Gradient Ascent for Industrial Anomaly Detection and Localization(Qiyu Chen, Huiyuan Luo, Chengkan Lv, Zhengtao Zhang, 2024, ArXiv)
- Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation(Mingyu Lee, Jongwon Choi, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
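The text-guided synthesis idea can be approximated with an off-the-shelf inpainting diffusion pipeline: mask a local region of a normal image and inpaint it conditioned on a defect description. The checkpoint, prompt, and random mask placement below are assumptions, not the cited papers' actual procedures.

```python
# Minimal text-guided pseudo-anomaly synthesis via diffusion inpainting
# (checkpoint, prompt, and mask placement are illustrative assumptions).
import numpy as np
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting")

def synthesize_defect(normal_img: Image.Image,
                      prompt: str = "a deep scratch on metal") -> Image.Image:
    w, h = normal_img.size
    mask = np.zeros((h, w), dtype=np.uint8)   # white pixels = region to inpaint
    y, x = np.random.randint(h // 4, h // 2), np.random.randint(w // 4, w // 2)
    mask[y:y + h // 8, x:x + w // 8] = 255    # one random local defect region
    return pipe(prompt=prompt, image=normal_img,
                mask_image=Image.fromarray(mask)).images[0]
```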
This report synthesizes five core routes toward the one-for-all goal in multimodal industrial anomaly detection: 1) introducing MLLMs/VLMs for in-depth detection with logical reasoning and explanation; 2) using pre-trained models such as CLIP with prompt learning to tackle zero-shot generalization; 3) deeply fusing RGB, 3D point clouds, and process variables to strengthen physical representations; 4) building unified frameworks and large-scale multimodal benchmarks to drive cross-domain generality; and 5) exploring text-guided generative data augmentation. The overall trend shows industrial anomaly detection evolving from isolated pixel-level discrimination toward an intelligent system built on multimodal fusion, semantic reasoning, and all-scenario generality.
A total of 66 related references
Industrial environments demand accurate detection of anomalies to maintain product quality and ensure operational safety. Traditional industrial anomaly detection (IAD) methods often lack the flexibility and adaptability needed in dynamic production settings, where new defect types and operational changes continually emerge. Recent advancements in multimodal large language models (MLLMs) have shown promise by combining visual and textual processing capabilities, yet they are often limited by their lack of domain-specific expertise, particularly regarding industry-standard defect tolerances. To overcome these limitations, we introduce Echo, a novel multi-expert framework designed to enhance MLLM performance for IAD. Echo integrates four specialized modules: the Reference Extractor retrieves similar normal images to establish contextual baselines; the Knowledge Guide provides critical, industry-specific insights; the Reasoning Expert enables structured, stepwise analysis for complex queries; and the Decision Maker synthesizes information from the preceding modules to deliver precise, context-aware responses. Evaluations on the MMAD benchmark reveal that Echo significantly improves adaptability, precision, and robustness compared to conventional approaches. Our results demonstrate that guided MLLMs, when augmented with expert modules, can effectively bridge the gap between general visual understanding and the specialized requirements of industrial anomaly detection, paving the way for more reliable and interpretable inspection systems.
MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection
In the field of industrial inspection, Multimodal Large Language Models (MLLMs) have a high potential to renew the paradigms in practical applications due to their robust language capabilities and generalization abilities. However, despite their impressive problem-solving skills in many domains, MLLMs' ability in industrial anomaly detection has not been systematically studied. To bridge this gap, we present MMAD, the first-ever full-spectrum MLLMs benchmark in industrial Anomaly Detection. We defined seven key subtasks of MLLMs in industrial inspection and designed a novel pipeline to generate the MMAD dataset with 39,672 questions for 8,366 industrial images. With MMAD, we have conducted a comprehensive, quantitative evaluation of various state-of-the-art MLLMs. The commercial models performed the best, with the average accuracy of GPT-4o models reaching 74.9%. However, this result falls far short of industrial requirements. Our analysis reveals that current MLLMs still have significant room for improvement in answering questions related to industrial anomalies and defects. We further explore two training-free performance enhancement strategies to help models improve in industrial scenarios, highlighting their promising potential for future research.
The robust causal capability of multimodal large language models (MLLMs) holds the potential of detecting defective objects in industrial anomaly detection (IAD). However, most traditional IAD methods lack the ability to provide multiturn human–machine dialogs and detailed descriptions, such as the color of objects, the shape of an anomaly, or specific types of anomalies. At the same time, methods based on large pretrained models have not fully stimulated the ability of large models in anomaly detection tasks. In this article, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ an abnormal prompt generator (APG) to generate detailed anomaly prompts for specific objects. These specific prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pretrained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose a text-guided enhancer (TGE), wherein image features interact with normal and abnormal text prompts to dynamically select enhancement pathways, which enables language models to focus on the specific aspects of visual data, enhancing their ability to accurately interpret and respond to anomalies within images. Moreover, we design a multimask fusion (MMF) module to incorporate masks as expert knowledge, which enhances the LLM's perception of pixel-level anomalies. Extensive experiments on the MVTec-AD and VisA datasets demonstrate our state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks. The code is available at https://github.com/LiZeWen1225/IAD-GPT.
The detection of anomalies in manufacturing processes is crucial to ensure product quality and identify process deviations. Statistical and data-driven approaches remain the standard in industrial anomaly detection, yet their adaptability and usability are constrained by the dependence on extensive annotated datasets and limited flexibility under dynamic production conditions. Recent advances in the perception capabilities of foundation models provide promising opportunities for their adaptation to this downstream task. This paper presents PB-IAD (Prompt-based Industrial Anomaly Detection), a novel framework that leverages the multimodal and reasoning capabilities of foundation models for industrial anomaly detection. Specifically, PB-IAD addresses three key requirements of dynamic production environments: data sparsity, agile adaptability, and domain user centricity. In addition to anomaly detection, the framework includes a prompt template that is specifically designed for iteratively implementing domain-specific process knowledge, as well as a pre-processing module that translates domain user inputs into effective system prompts. This user-centric design allows domain experts to customise the system flexibly without requiring data science expertise. The proposed framework is evaluated by utilizing GPT-4.1 across three distinct manufacturing scenarios, two data modalities, and an ablation study to systematically assess the contribution of semantic instructions. Furthermore, PB-IAD is benchmarked against state-of-the-art methods for anomaly detection such as PatchCore. The results demonstrate superior performance, particularly in data-sparse scenarios and low-shot settings, achieved solely through semantic instructions.
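The abstract describes a prompt template that injects domain knowledge into the system prompt, but the template itself is not given. The sketch below is a hypothetical illustration of how such a prompt might be assembled; the field names and wording are assumptions.

```python
# Hypothetical prompt assembly in the spirit of PB-IAD's template (field
# names and wording are assumptions; the paper's actual template is not shown).
def build_system_prompt(product: str, process_knowledge: str, tolerances: str) -> str:
    return (
        f"You are a quality inspector for {product}.\n"
        f"Process knowledge: {process_knowledge}\n"
        f"Acceptable tolerances: {tolerances}\n"
        "Inspect the attached image. Answer with 'normal' or 'anomalous', "
        "then justify your decision step by step."
    )

prompt = build_system_prompt(
    product="injection-molded housings",
    process_knowledge="flash may appear along the parting line after tool wear",
    tolerances="flash up to 0.2 mm is acceptable; cracks are never acceptable",
)
```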
No abstract available
No abstract available
No abstract available
Due to the training configuration, traditional industrial anomaly detection (IAD) methods have to train a specific model for each deployment scenario, which is insufficient to meet the requirements of modern design and manufacturing. On the contrary, large multimodal models (LMMs) have shown eminent generalization ability on various vision tasks, and their perception and comprehension capabilities imply the potential of applying LMMs to IAD tasks. However, we observe that even though LMMs have abundant knowledge about industrial anomaly detection in the textual domain, they are unable to leverage that knowledge due to the modality gap between the textual and visual domains. To stimulate the relevant knowledge in LMMs and adapt them to anomaly detection tasks, we introduce existing IAD methods as vision experts and present a novel large multimodal model applying vision experts for industrial anomaly detection (abbreviated to Myriad). Specifically, we utilize the anomaly map generated by the vision experts as guidance for LMMs, such that the vision model is guided to pay more attention to anomalous regions. Then, the visual features are modulated via an adapter to fit the anomaly detection tasks, and fed into the language model together with the vision expert guidance and human instructions to generate the final outputs. Extensive experiments on the MVTec-AD, VisA, and PCB Bank benchmarks demonstrate that our proposed method not only performs favorably against state-of-the-art methods, but also inherits the flexibility and instruction-following ability of LMMs in the field of IAD. Source code and pre-trained models are publicly available at https://github.com/tzjtatata/Myriad.
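The "vision expert as guidance" idea can be pictured as patch-embedding the expert's anomaly map and projecting it into the language model's token space so it can be prepended to the visual and text tokens. The sketch below is a rough approximation with assumed dimensions; Myriad's actual adapter design differs.

```python
# Rough sketch of projecting an expert anomaly map into LLM token space
# (patch size, pooling, and dimensions are assumptions, not Myriad's design).
import torch
import torch.nn as nn

class ExpertGuidanceAdapter(nn.Module):
    def __init__(self, patch: int = 16, llm_dim: int = 4096):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch, llm_dim)

    def forward(self, anomaly_map: torch.Tensor) -> torch.Tensor:
        """anomaly_map: (B, H, W) expert output in [0, 1] -> (B, N, llm_dim)."""
        B, H, W = anomaly_map.shape
        p = self.patch
        tokens = (anomaly_map
                  .unfold(1, p, p).unfold(2, p, p)   # (B, H/p, W/p, p, p)
                  .reshape(B, -1, p * p))            # one token per patch
        return self.proj(tokens)                     # guidance tokens for the LLM

guidance = ExpertGuidanceAdapter()(torch.rand(1, 224, 224))  # -> (1, 196, 4096)
```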
No abstract available
No abstract available
Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability, to revolutionize IAD. By integrating MLLM with Group Relative Policy Optimization (GRPO), enhanced by our novel Reasoned Outcome Alignment Metric (ROAM), AnomalyR1 achieves a fully end-to-end solution that autonomously processes inputs of image and domain knowledge, reasons through analysis, and generates precise anomaly localizations and masks. Based on the latest multimodal IAD benchmark, our compact 3-billion-parameter model outperforms existing methods, establishing state-of-the-art results. As MLLM capabilities continue to advance, this study is the first to deliver an end-to-end VLM-based IAD solution that demonstrates the transformative potential of ROAM-enhanced GRPO, positioning our framework as a forward-looking cornerstone for next-generation intelligent anomaly detection systems in industrial applications with limited defective data.
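GRPO recurs across several of these works (AnomalyR1, LR-IAD, EMIT, IAD-R1). Its core step, computing advantages relative to a group of responses sampled from the same prompt, can be sketched in a few lines; reward functions such as ROAM are paper-specific and not reproduced here.

```python
# Core GRPO step: group-relative advantage normalization (reward design,
# e.g. AnomalyR1's ROAM, is paper-specific and omitted).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G responses sampled from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

adv = group_relative_advantages(torch.tensor([0.0, 1.0, 0.5, 1.0]))
# Each response's tokens are then reinforced with a PPO-style clipped
# objective weighted by its group-relative advantage.
```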
No abstract available
Industrial Anomaly Detection (IAD) is critical for ensuring product quality by identifying defects. Traditional methods such as feature embedding and reconstruction-based approaches require large datasets and struggle with scalability. Existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) address some limitations but rely on mask annotations, leading to high implementation costs and false positives. Additionally, industrial datasets like MVTec-AD and VisA suffer from severe class imbalance, with defect samples constituting only 23.8 % and 11.1 % of total data respectively. To address these challenges, we propose a reward function that dynamically prioritizes rare defect patterns during training to handle class imbalance. We also introduce a mask-free reasoning framework using Chain of Thought (CoT) and Group Relative Policy Optimization (GRPO) mechanisms, enabling anomaly detection directly from raw images without annotated masks. This approach generates interpretable step-by-step explanations for defect localization. Our method achieves state-of-the-art performance, outperforming prior approaches by 36% in accuracy on MVTec-AD and 16% on VisA. By eliminating mask dependency and reducing costs while providing explainable outputs, this work advances industrial anomaly detection and supports scalable quality control in manufacturing.
Industrial anomaly detection (IAD) plays a crucial role in maintaining the safety and reliability of manufacturing systems. While multimodal large language models (MLLMs) show strong vision-language reasoning abilities, their effectiveness in IAD remains limited without domain-specific adaptation. In this work, we propose EMIT, a unified framework that enhances MLLMs for IAD via difficulty-aware group relative policy optimization (GRPO). EMIT constructs a multi-task IAD dataset and utilizes GPT-generated object text descriptions to compensate for missing defective images. For few-shot anomaly detection, it integrates a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons. To better handle difficult data samples, i.e., cases where the MLLM struggles to generate correct answers, we propose a difficulty-aware GRPO that extends the original GRPO by incorporating a response resampling strategy to ensure the inclusion of correct answers in the sampled responses, as well as an advantage reweighting mechanism to strengthen learning from such difficult data samples. Extensive experiments on the MMAD benchmark demonstrate that EMIT significantly enhances the IAD performance of MLLMs, achieving an average improvement of 7.77% over the base model (InternVL3-8B) across seven tasks.
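Building on the plain GRPO step sketched earlier, EMIT's two modifications (resampling until a correct answer enters the group, and upweighting advantages on difficult samples) might look roughly like the following; the weighting form is an assumption, not EMIT's exact formula.

```python
# Rough sketch of difficulty-aware GRPO: resample so at least one correct
# response is in the group, then upweight hard samples. The concrete
# weighting function is an assumption, not EMIT's exact formula.
import torch

def resample_until_correct(sample_fn, is_correct, group_size=8, max_tries=4):
    responses = [sample_fn() for _ in range(group_size)]
    tries = 0
    while not any(is_correct(r) for r in responses) and tries < max_tries:
        responses[0] = sample_fn()   # swap one slot for a fresh sample, retry
        tries += 1
    return responses

def reweighted_advantages(rewards: torch.Tensor) -> torch.Tensor:
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    difficulty = 1.0 - rewards.mean()   # low mean reward -> hard sample
    return adv * (1.0 + difficulty)     # strengthen learning on hard cases
```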
Few-Shot Industrial Anomaly Detection (FS-IAD) has important applications in automating industrial quality inspection. Recently, some FS-IAD methods based on Large Vision-Language Models (LVLMs) have been proposed with some achievements through prompt learning or fine-tuning. However, existing LVLMs focus on general tasks but lack basic industrial knowledge and reasoning capabilities related to FS-IAD, making these methods far from specialized human quality inspectors. To address these challenges, we propose a unified framework, IADGPT, designed to perform FS-IAD in a human-like manner, while also handling associated localization and reasoning tasks, even for diverse and novel industrial products. To this end, we introduce a three-stage progressive training strategy inspired by humans. Specifically, the first two stages gradually guide IADGPT in acquiring fundamental industrial knowledge and discrepancy awareness. In the third stage, we design an in-context learning-based training paradigm, enabling IADGPT to leverage few-shot images as exemplars for improved generalization to novel products. In addition, we design a strategy that enables IADGPT to output image-level and pixel-level anomaly scores using the logits output and the attention map, respectively, in conjunction with the language output to accomplish anomaly reasoning. To support our training, we present a new dataset comprising 100K images across 400 diverse industrial product categories with extensive attribute-level textual annotations. Experiments indicate IADGPT achieves considerable performance gains in anomaly detection and demonstrates competitiveness in anomaly localization and reasoning. We will release our dataset with the camera-ready version.
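The abstract mentions deriving an image-level score from the logits output. One common way to do this (an assumption here, not necessarily IADGPT's exact rule) is to compare the probabilities of "Yes"/"No" answer tokens at the first generated position of an "Is this image anomalous?" query.

```python
# One plausible logits-to-score scheme (an assumption; the paper's exact rule
# is not given in the abstract): compare 'Yes'/'No' token probabilities.
import torch

def yes_no_anomaly_score(first_step_logits: torch.Tensor,
                         yes_id: int, no_id: int) -> float:
    """first_step_logits: (vocab_size,) logits for the first generated token."""
    pair = torch.stack([first_step_logits[yes_id], first_step_logits[no_id]])
    p_yes, _ = torch.softmax(pair, dim=0)
    return p_yes.item()   # in [0, 1]; higher means more likely anomalous
```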
Industrial image anomaly detection is critical for automated manufacturing. However, most existing methods rely on single-category training paradigms, resulting in poor scalability and limited cross-category generalization. These approaches require separate models for each product type and fail to model the complex multi-modal distribution of normal samples in multi-category scenarios. To overcome these limitations, we propose UniCLIP-AD, a unified anomaly detection framework that leverages the general semantic knowledge of CLIP and adapts it to the industrial domain using Low-Rank Adaptation (LoRA). This design enables a single model to effectively handle diverse industrial parts. In addition, we introduce UniAD, a large-scale industrial anomaly detection dataset collected from real production lines. It contains over 25,000 high-resolution images across 7 categories of electronic components, with both pixel-level and image-level annotations. UniAD captures fine-grained, diverse, and realistic defects, making it a strong benchmark for unified anomaly detection. Experiments show that UniCLIP-AD achieves superior performance on UniAD, with an AU-ROC of 92.1% and F1-score of 89.8% in cross-category tasks, outperforming the strongest baselines (CFA and DSR) by 3% AU-ROC and 23.9% F1-score.
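UniCLIP-AD adapts CLIP to the industrial domain with LoRA. The core LoRA mechanism, a frozen linear layer plus a trainable low-rank residual, can be sketched as below; rank and scaling are illustrative defaults, not the paper's reported hyperparameters.

```python
# Minimal LoRA wrapper around a frozen linear layer (rank/alpha are
# illustrative defaults, not UniCLIP-AD's reported hyperparameters).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False   # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # B starts at zero, so training begins from the unmodified CLIP output
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```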
Visual Anomaly Detection (VAD) aims to identify abnormal samples in images that deviate from normal patterns, covering multiple domains, including industrial, logical, and medical fields. Due to the domain gaps between these fields, existing VAD methods are typically tailored to each domain, with specialized detection techniques and model architectures that are difficult to generalize across different domains. Moreover, even within the same domain, current VAD approaches often require large amounts of normal samples to train class-specific models, resulting in poor generalizability and hindering unified evaluation across domains. To address this issue, we propose a generalized few-shot VAD method, UniVAD, capable of detecting anomalies across various domains, with a training-free unified model. UniVAD only needs few normal samples as references during testing to detect anomalies in previously unseen objects, without training on the specific domain. Specifically, UniVAD employs a Contextual Component Clustering (C3) module based on clustering and vision foundation models to segment components within the image accurately, and leverages Component-Aware Patch Matching (CAPM) and Graph-Enhanced Component Modeling (GECM) modules to detect anomalies at different semantic levels, which are aggregated to produce the final detection result. We conduct experiments on nine datasets spanning industrial, logical, and medical fields, and the results demonstrate that UniVAD achieves state-of-the-art performance in few-shot anomaly detection tasks across multiple domains, outperforming domain-specific anomaly detection models. Code is available at https://github.com/FantasticGNU/UniVAD.
Industrial anomaly detection is increasingly relying on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.
Visual anomaly detection (VAD) aims to identify image regions that deviate from established normal patterns. Existing methods often rely on domain-specific training and follow a ''one-class-one-model'' paradigm, limiting scalability. We propose Omni-LLaMA-AD, the first unified multimodal large language model for open-set anomaly detection, capable of handling diverse domains with minimal supervision. Built on a pretrained LLaMA backbone, the model uses a VQGAN-based tokenizer and supports joint vision-language generation. Trained via vision-language alignment and instruction tuning, it achieves effective anomaly detection with only a few normal samples and no domain-specific fine-tuning. Our demo showcases the model's ability to generate high-quality anomaly masks across industrial, medical, and logical datasets, highlighting its strong cross-domain generalization and interactive dialogue-based user experience.
No abstract available
Industrial few-shot anomaly detection (FSAD) requires identifying various abnormal states by leveraging as few normal samples as possible (abnormal samples are unavailable during training). However, current methods often require training a separate model for each category, leading to increased computation and storage overhead. Thus, designing a unified anomaly detection model that supports multiple categories remains a challenging task, as such a model must recognize anomalous patterns across diverse objects and domains. To tackle these challenges, this paper introduces FocusPatch AD, a unified anomaly detection framework based on vision-language models, achieving anomaly detection under few-shot multi-class settings. FocusPatch AD links anomaly state keywords to highly relevant discrete local regions within the image, guiding the model to focus on cross-category anomalies while filtering out background interference. This approach mitigates the false detection issues caused by global semantic alignment in vision-language models. We evaluate the proposed method on the MVTec, VisA, and Real-IAD datasets, comparing it against several prevailing anomaly detection methods. In both image-level and pixel-level anomaly detection tasks, FocusPatch AD achieves significant gains in classification and localization performance, demonstrating excellent generalization and adaptability.
Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB-3D and RGB-Text, enabled by point cloud sensing and vision-language models. Despite shared challenges, these lines remain largely isolated, hindering a comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings in the matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining the anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, followed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB-3D, RGB-Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
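The matching view described above can be made concrete as a patch-feature cost volume followed by a small learnable filter. The plain convolution below is an illustrative stand-in for UCF's attention-guided filtering module.

```python
# Sketch of the matching view: build a cost volume between test patches and
# normal references, then refine it with a small learnable filter (a plain
# conv here; UCF's actual attention-guided filter is more elaborate).
import torch
import torch.nn as nn
import torch.nn.functional as F

def cost_volume(test: torch.Tensor, normal: torch.Tensor) -> torch.Tensor:
    """test: (H*W, C) patch features; normal: (M, C) reference features."""
    t = F.normalize(test, dim=-1)
    n = F.normalize(normal, dim=-1)
    return 1.0 - t @ n.T                  # (H*W, M) cosine matching cost

class CostFilter(nn.Module):
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, k, padding=k // 2)

    def forward(self, cost: torch.Tensor, hw: tuple[int, int]) -> torch.Tensor:
        amap = cost.min(dim=1).values.reshape(1, 1, *hw)  # raw anomaly map
        return self.conv(amap).squeeze()                   # filtered map
```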
Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
While anomaly detection has made significant progress, generating detailed analyses that incorporate industrial knowledge remains a challenge. To address this gap, we introduce OmniAD, a novel framework that unifies anomaly detection and understanding for fine-grained analysis. OmniAD is a multimodal reasoner that combines visual and textual reasoning processes. The visual reasoning provides detailed inspection by leveraging Text-as-Mask Encoding to perform anomaly detection through text generation without manually selected thresholds. Following this, Visual Guided Textual Reasoning conducts comprehensive analysis by integrating visual perception. To enhance few-shot generalization, we employ an integrated training strategy that combines supervised fine-tuning (SFT) with reinforcement learning (GRPO), incorporating three sophisticated reward functions. Experimental results demonstrate that OmniAD achieves a performance of 79.1 on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. It also shows strong results across multiple anomaly detection benchmarks. These results highlight the importance of enhancing visual perception for effective reasoning in anomaly understanding. All codes and models will be publicly available.
Zero-shot anomaly detection (ZSAD) targets the identification of anomalies within images from arbitrary novel categories. This study introduces AdaCLIP for the ZSAD task, leveraging a pre-trained vision-language model (VLM), CLIP. AdaCLIP incorporates learnable prompts into CLIP and optimizes them through training on auxiliary annotated anomaly detection data. Two types of learnable prompts are proposed: static and dynamic. Static prompts are shared across all images, serving to preliminarily adapt CLIP for ZSAD. In contrast, dynamic prompts are generated for each test image, providing CLIP with dynamic adaptation capabilities. The combination of static and dynamic prompts is referred to as hybrid prompts, and yields enhanced ZSAD performance. Extensive experiments conducted across 14 real-world anomaly detection datasets from industrial and medical domains indicate that AdaCLIP outperforms other ZSAD methods and can generalize better to different categories and even domains. Finally, our analysis highlights the importance of diverse auxiliary data and optimized prompts for enhanced generalization capacity. Code is available at https://github.com/caoyunkang/AdaCLIP.
Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is a crucial task when training data is not accessible due to various concerns, e.g., data privacy, yet it is challenging since the models need to generalize to anomalies across different domains where the appearance of foreground objects, abnormal regions, and background features, such as defects/tumors on different products/organs, can vary significantly. Recently large pre-trained vision-language models (VLMs), such as CLIP, have demonstrated strong zero-shot recognition ability in various vision tasks, including anomaly detection. However, their ZSAD performance is weak since the VLMs focus more on modeling the class semantics of the foreground objects rather than the abnormality/normality in the images. In this paper we introduce a novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across different domains. The key insight of AnomalyCLIP is to learn object-agnostic text prompts that capture generic normality and abnormality in an image regardless of its foreground objects. This allows our model to focus on the abnormal image regions rather than the object semantics, enabling generalized normality and abnormality recognition on diverse types of objects. Large-scale experiments on 17 real-world anomaly detection datasets show that AnomalyCLIP achieves superior zero-shot performance of detecting and segmenting anomalies in datasets of highly diverse class semantics from various defect inspection and medical imaging domains. Code will be made available at https://github.com/zqhang/AnomalyCLIP.
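The object-agnostic prompt idea can be pictured as learnable context embeddings shared across all object categories, one context per state (normal/abnormal). The sketch below is a generic CoOp-style illustration with assumed dimensions, not AnomalyCLIP's exact prompt design.

```python
# Generic learnable-prompt sketch (CoOp-style): learned context vectors are
# prepended to a fixed state-word embedding; dimensions are assumptions and
# this is not AnomalyCLIP's exact design.
import torch
import torch.nn as nn

class ObjectAgnosticPrompt(nn.Module):
    def __init__(self, ctx_len: int = 12, dim: int = 512):
        super().__init__()
        # one learned context per state, shared across all object categories
        self.normal_ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)
        self.abnormal_ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)

    def forward(self, state_tok: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """state_tok: (1, dim) embedding of a generic state word like 'object'."""
        normal = torch.cat([self.normal_ctx, state_tok], dim=0)
        abnormal = torch.cat([self.abnormal_ctx, state_tok], dim=0)
        return normal, abnormal   # then fed through CLIP's text transformer
```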
The vision-language model has brought great improvement to few-shot industrial anomaly detection, which usually requires designing hundreds of prompts through prompt engineering. For automated scenarios, we first use conventional prompt learning with the many-class paradigm as the baseline to automatically learn prompts but found that it cannot work well in one-class anomaly detection. To address the above problem, this paper proposes a one-class prompt learning method for few-shot anomaly detection, termed PromptAD. First, we propose semantic concatenation which can transpose normal prompts into anomaly prompts by concatenating normal prompts with anomaly suffixes, thus constructing a large number of negative samples used to guide prompt learning in the one-class setting. Furthermore, to mitigate the training challenge caused by the absence of anomaly images, we introduce the concept of explicit anomaly margin, which is used to explicitly control the margin between normal prompt features and anomaly prompt features through a hyper-parameter. For image-level/pixel-level anomaly detection, PromptAD achieves first place in 11/12 few-shot settings on MVTec and VisA. Code is available at https://github.com/FuNz-0/PromptAD.git
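PromptAD's two ideas, semantic concatenation (normal prompt + anomaly suffix → anomaly prompt) and an explicit anomaly margin, can be sketched as below; the hinge form of the loss is an assumption, not the paper's exact formula.

```python
# Sketch of semantic concatenation and an explicit anomaly margin
# (the hinge loss form below is an assumption, not PromptAD's exact loss).
import torch
import torch.nn.functional as F

def semantic_concat(normal_prompt: str, suffixes: list[str]) -> list[str]:
    return [f"{normal_prompt} with {s}" for s in suffixes]  # e.g. "with a crack"

def explicit_margin_loss(img: torch.Tensor, t_norm: torch.Tensor,
                         t_anom: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """img: (C,), t_norm: (C,), t_anom: (K, C); all L2-normalized."""
    d_norm = 1.0 - img @ t_norm            # distance to the normal prompt
    d_anom = (1.0 - t_anom @ img).min()    # distance to closest anomaly prompt
    # a normal image must sit at least `margin` closer to the normal prompt
    return F.relu(d_norm - d_anom + margin)
```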
Recent advancements in large-scale visual-language pre-trained models have led to significant progress in zero/few-shot anomaly detection within natural image domains. However, the substantial domain divergence between natural and medical images limits the effectiveness of these methodologies in medical anomaly detection. This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels. This multi-level adaptation is guided by multi-level, pixel-wise visual-language feature alignment loss functions, which recalibrate the model's focus from object semantics in natural imagery to anomaly identification in medical images. The adapted features exhibit improved generalization across various medical data types, even in zero-shot scenarios where the model encounters unseen medical modalities and anatomical regions during training. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models, with an average AUC improvement of 6.24% and 7.33% for anomaly classification, 2.03% and 2.37% for anomaly segmentation, under the zero-shot and few-shot settings, respectively. Source code is available at: https://github.com/MediaBrain-SJTU/MVFA-AD
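The multi-level residual adapters described above can be sketched as bottleneck MLPs inserted at several encoder stages; the hidden size and placement below are illustrative assumptions.

```python
# Bottleneck residual adapter of the kind inserted at multiple CLIP encoder
# stages (hidden size and placement are illustrative assumptions).
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as identity: adapter adds zero
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))   # frozen-backbone residual

# One adapter per chosen stage; only adapter weights are trained, each guided
# in the paper's setup by a pixel-wise visual-language alignment loss.
```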
Industrial anomaly detection (IAD) is crucial for maintaining product quality in manufacturing, but it faces challenges due to limited data and diverse defect semantics. Zero-shot IAD has seen rapid development, while current methods mainly rely on pixel-level anomaly scores, lacking thorough anomaly region definition. The explanatory details related to domain knowledge, such as the anomaly categories, visual semantics, and potential causes and consequences, are often overlooked. Recently, Large Vision-Language Models (LVLMs) have exhibited remarkable perception and generalization abilities across various visual tasks. In this paper, we introduce RIASA, a Reasoning Industrial Anomaly Segmentation Assistant, which integrates reasoning segmentation into IAD via LVLMs. Building on the foundation of reasoning segmentation framework LISA, RIASA has made domain alignment enhancements through instruction-supervised fine-tuning on: (1) semantic segmentation and (2) visual question answering regarding anomaly details. We construct a multi-modal question-answering instruction dataset using GPT-4V, containing 1,890 high-quality question-answer pairs. Extensive experiments on several public 2D and 3D IAD benchmarks demonstrate that RIASA achieves superior performance in zero-shot anomaly segmentation and generalization. Additionally, RIASA’s multi-modal interaction capability provides professional visual descriptions within the IAD domain.
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes. Our project is available at https://guyao2023.github.io/Phys-AD/.
Anomaly detection is vital in various industrial scenarios, including the identification of unusual patterns in production lines and the detection of manufacturing defects for quality control. Existing techniques tend to be specialized in individual scenarios and lack generalization capacities. In this study, our objective is to develop a generic anomaly detection model that can be applied in multiple scenarios. To achieve this, we custom-build generic visual language foundation models that possess extensive knowledge and robust reasoning abilities as anomaly detectors and reasoners. Specifically, we introduce a multi-modal prompting strategy that incorporates domain knowledge from experts as conditions to guide the models. Our approach considers diverse prompt types, including task descriptions, class context, normality rules, and reference images. In addition, we unify the input representation of multi-modality into a 2D image format, enabling multi-modal anomaly detection and reasoning. Our preliminary studies demonstrate that combining visual and language prompts as conditions for customizing the models enhances anomaly detection performance. The customized models showcase the ability to detect anomalies across different data modalities such as images, point clouds, and videos. Qualitative case studies further highlight the anomaly detection and reasoning capabilities, particularly for multi-object scenes and temporal data. Our code is publicly available at https://github.com/Xiaohac-Xu/Customizable-VLM. More insights into customized foundation models for broader anomaly detection settings are available at the GitHub repo: https://github.com/caoyunkang/GPT4V-for-Generic-Anomaly-Detection.
SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle with industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model, and dataset are available at https://github.com/amoreZgx1n/SAGE.
Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image's visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining high-quality standards and minimizing costly recalls. Previous research in anomaly detection (AD) has relied on prior knowledge for designing algorithms, which often requires extensive manual annotations, significant computing power, and large amounts of data for training. Autoregressive, multimodal Vision Language Models (AVLMs) offer a promising alternative due to their exceptional performance in visual reasoning across various domains. Despite this, their application to logical AD remains unexplored. In this work, we investigate using AVLMs for logical AD and demonstrate that they are well-suited to the task. Combining AVLMs with format embedding and a logic reasoner, we achieve SOTA performance on the public benchmark MVTec LOCO AD, with an AUROC of 86.0% and an F1-max of 83.7%, along with explanations of the anomalies. This significantly outperforms the existing SOTA method by 18.1% in AUROC and 4.6% in F1-max score.
Recent advances in industrial anomaly detection have highlighted the need for deeper logical anomaly analysis, where unexpected relationships among objects, counts, and spatial configurations must be identified and explained. Existing approaches often rely on large-scale external reasoning modules or elaborate pipeline designs, hindering practical deployment and interpretability. To address these limitations, we introduce a new task, Reasoning Logical Anomaly Detection (RLAD), which extends traditional anomaly detection by incorporating logical reasoning. We propose a new framework, LAD-Reasoner, a customized tiny multimodal language model built on Qwen2.5-VL 3B. Our approach leverages a two-stage training paradigm that first employs Supervised Fine-Tuning (SFT) for fine-grained visual understanding, followed by Group Relative Policy Optimization (GRPO) to refine logical anomaly detection and enforce coherent, human-readable reasoning. Crucially, reward signals are derived from both the detection accuracy and the structural quality of the outputs, obviating the need for building chain of thought (CoT) reasoning data. Experiments on the MVTec LOCO AD dataset show that LAD-Reasoner, though significantly smaller, matches the performance of Qwen2.5-VL-72B in accuracy and F1 score, and further excels in producing concise and interpretable rationales. This unified design reduces reliance on large models and complex pipelines, while offering transparent and interpretable insights into logical anomaly detection. Code and data will be released.
AI-based automatic visual inspection systems have been extensively researched to streamline various industrial products' labor-intensive anomaly detection processes. Despite significant advancements, detecting logical anomalies remains challenging due to the multitude of rules governing the assembly of multiple components to create a normal product. Existing methods have relied solely on image information for anomaly detection, resulting in limited accuracy as they fail to account for these diverse complex rules. Instead, humans detect anomalies by comparing the image with pre-defined logic which can be clearly expressed with natural language. Inspired by the human decision process, we propose a logical anomaly detection model that leverages text-based logic like human reasoning. With user-defined rules (i.e., positive rules) and logically distinct negative rules, we train the model using component-aware contrastive learning that increases the similarity between images and positive rules while decreasing the similarity with negative rules. However, accurately comparing textual and visual features is challenging due to multiple components, each governed by different rules, within a single image. To address this, we developed a zero-shot related region detection technique, which guides the model's focus on components relevant to each rule. We evaluated the proposed model on three public datasets and achieved state-of-the-art results in a few-shot logical anomaly detection task. Our findings highlight the potential of integrating vision-language models to enhance logical anomaly detection and utilizing text-based logic in complex industrial settings.
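The positive/negative-rule contrastive objective described above can be sketched as an InfoNCE-style loss between an image (or component-region) feature and rule-text features; the temperature and multi-positive form are assumptions.

```python
# Sketch of contrastive training against textual rules: pull image (component)
# features toward positive-rule texts and push away logically negated rules.
# The InfoNCE form and temperature are assumptions.
import torch
import torch.nn.functional as F

def rule_contrastive_loss(img: torch.Tensor, pos_rules: torch.Tensor,
                          neg_rules: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """img: (C,); pos_rules: (P, C); neg_rules: (N, C); all L2-normalized."""
    pos = (pos_rules @ img) / tau     # similarities to positive rules
    neg = (neg_rules @ img) / tau     # similarities to negative rules
    logits = torch.cat([pos, neg])
    labels = torch.zeros_like(logits)
    labels[: pos.numel()] = 1.0 / pos.numel()   # mass spread over positives
    return -(labels * F.log_softmax(logits, dim=0)).sum()
```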
Visual anomaly detection is an important tool in industrial quality control that may impact production efficiency and product quality. Traditional supervised methods struggle due to the difficulty and cost of obtaining labeled anomaly data, particularly for logical anomalies that violate global constraints yet appear as normal samples. This study investigates multimodal large language models (MLLMs) in few-shot prompting configurations to address this issue. We introduce RuleText-AD, a lightweight anomaly detection framework based on textual rule engines automatically generated by MLLMs from minimal visual examples and concise textual schemas. RuleText-AD eliminates the need for external segmentation tools, extensive prompt engineering, or heavy visual processing backbones, enabling efficient anomaly detection via high-level attribute reasoning. Experiments are conducted on the MVTec-LOCO logical anomaly benchmark, demonstrating that Gemini 2.5 Flash provides the best balance between accuracy and computational cost, achieving competitive performance with significantly reduced deployment complexity. This work highlights the potential for scalable, interpretable, and cost-effective anomaly detection solutions suitable for diverse industrial environments. Our method targets logical (global, rule-violation) anomalies and does not address structural, pixel-level defects.
Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks (anomaly description, explanation, and justification) with fine-grained annotations for visual grounding, and categorizes anomalies based on their visual manifestations, complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
Automated anomaly detection is vital to industrial quality control, yet conventional deep learning detectors often struggle with scalability. These models, typically following a rigid “one-model-per-task” paradigm, require separate systems for each product line, increasing operational complexity and cost in diverse manufacturing environments. To address this limitation, we propose a unified defect detection framework based on a Multimodal Large Language Model (MLLM). Our approach utilizes a two-stage fine-tuning strategy: Supervised Fine-Tuning (SFT) to impart domain-specific knowledge, followed by a novel Reinforcement Fine-Tuning (RFT) process that refines visual reasoning. This RFT stage is guided by a multi-faceted verifiable reward function designed to optimize localization accuracy, classification correctness, and output structure. On a challenging real-world glove manufacturing dataset, our RFT-enhanced MLLM achieves a mean Average Precision (mAP) of 0.63, which is comparable to a highly specialized YOLO baseline (0.62). More importantly, a single, unified MLLM trained on a mixed-product dataset maintains competitive performance (mAP 0.61), demonstrating its ability to dynamically handle different products and defect types via natural language prompts. This study validates the feasibility of using a single, flexible MLLM to replace multiple rigid models in complex industrial inspection, offering a scalable and cost-effective paradigm for future intelligent quality control systems. The open-source code will be released at https://github.com/GloamXun/Glove-MLLM.
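A multi-faceted verifiable reward like the one described could, in its simplest form, sum a graded localization term (IoU), an exact-match classification term, and an output-structure term. The weights below are placeholders, not the paper's tuned values.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def verifiable_reward(pred_box, pred_cls, gt_box, gt_cls, well_formed: bool):
    """Illustrative RFT reward: localization + classification + structure."""
    r_loc = iou(pred_box, gt_box)               # graded localization reward
    r_cls = 1.0 if pred_cls == gt_cls else 0.0  # exact-match class reward
    r_fmt = 0.5 if well_formed else 0.0         # parsable output structure
    return r_loc + r_cls + r_fmt
```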
Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from "Anomaly Perception" to "Anomaly Interpretation". Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs; the largest improvement, on the DAGM dataset, raises average accuracy 43.3% above the 0.5B baseline. Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at https://github.com/Yanhui-Lee/IAD-R1.
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: https://xujiacong.github.io/Anomaly-OV/
Industrial Anomaly Detection (IAD) is increasingly critical in advanced manufacturing due to the high cost and operational disruptions caused by machine faults and process deviations. IAD supports early issue detection across the entire manufacturing lifecycle, from raw material handling to final assembly, thereby ensuring efficiency and minimizing unplanned downtime. However, the scarcity of labelled industrial anomaly data poses a major challenge, limiting the effectiveness and generalizability of traditional AI and machine learning models, which also tend to lack interpretability and delay critical decision-making. Recent advancements in Conversational AI, particularly Large Language Models (LLMs), offer promising opportunities to enhance explainability and operate under limited data conditions through their generative and reasoning capabilities. Despite this, LLMs still face notable challenges in object hallucination and limited comprehension of physical and temporal dynamics in complex industrial settings. This research explores the current landscape of IAD and proposes a conceptual hybrid anomaly detection architecture that integrates discriminative visual analysis models with LLMs. The proposed framework leverages scene understanding, anomaly feature detection, and motion abnormality analysis to produce enriched contextual embeddings, enhancing the interpretive capacity of the LLM for more human-readable explanations. While this paper focuses on reviewing existing methods and conceptualizing the proposed framework, future work will involve its implementation, experimentation, and validation to assess performance, interpretability, and practical applicability in real-world smart manufacturing environments.
We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.
Anomaly detection in complex industrial processes plays a pivotal role in ensuring efficient, stable, and secure operation. Existing anomaly detection methods primarily focus on analyzing dominant anomalies using the process variables (such as arc current) or constructing neural networks based on abnormal visual features, while overlooking the intrinsic correlation of cross-modal information. This paper proposes a cross-modal Transformer (dubbed FmFormer), designed to facilitate anomaly detection by exploring the correlation between visual features (video) and process variables (current) in the context of the fused magnesium smelting process. Our approach introduces a novel tokenization paradigm to effectively bridge the substantial dimensionality gap between the 3D video modality and the 1D current modality in a multiscale manner, enabling a hierarchical reconstruction of pixel-level anomaly detection. Subsequently, the FmFormer leverages self-attention to learn internal features within each modality and bidirectional cross-attention to capture correlations across modalities. By decoding the bidirectional correlation features, we obtain the final detection result and even locate the specific anomaly region. To validate the effectiveness of the proposed method, we also present a pioneering cross-modal benchmark of the fused magnesium smelting process, featuring synchronously acquired video and current data for over 2.2 million samples. Leveraging cross-modal learning, the proposed FmFormer achieves state-of-the-art performance in detecting anomalies, particularly under extreme interferences such as current fluctuations and visual occlusion caused by heavy water mist. The presented methodology and benchmark may be applicable to other industrial applications with some amendments. The benchmark will be released at https://github.com/GaochangWu/FMF-Benchmark.
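The bidirectional cross-attention at FmFormer's core can be sketched with standard attention layers. Token counts and dimensions below are illustrative, and the paper's multiscale tokenization of the 3D video and 1D current signals is assumed to happen upstream.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Minimal sketch of bidirectional cross-attention between video tokens
    and process-variable (current) tokens; not FmFormer's full architecture."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.v2c = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.c2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, current_tokens):
        # Video queries current: which electrical states explain what is seen.
        v_out, _ = self.v2c(video_tokens, current_tokens, current_tokens)
        # Current queries video: which visual regions explain the signal.
        c_out, _ = self.c2v(current_tokens, video_tokens, video_tokens)
        return v_out, c_out

video = torch.randn(2, 196, 256)   # (batch, video patch tokens, dim)
current = torch.randn(2, 32, 256)  # (batch, current-signal tokens, dim)
v, c = BidirectionalCrossAttention()(video, current)
```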
The scale-up of autonomous vehicles depends heavily on their ability to deal with anomalies, such as rare objects on the road. In order to handle such situations, it is necessary to detect anomalies in the first place. Anomaly detection for autonomous driving has made great progress in the past years but suffers from poorly designed benchmarks with a strong focus on camera data. In this work, we propose AnoVox, the largest benchmark for ANOmaly detection in autonomous driving to date. AnoVox incorporates large-scale multimodal sensor data and spatial VOXel ground truth, allowing for the comparison of methods independent of their used sensor. We propose a formal definition of normality and provide a compliant training dataset. AnoVox is the first benchmark to contain both content and temporal anomalies.
Object anomaly detection is an important problem in the field of machine vision and has seen remarkable progress recently. However, two significant challenges hinder its research and application. First, existing datasets lack comprehensive visual information from various pose angles. They usually make the unrealistic assumption that the anomaly-free training dataset is pose-aligned and that the testing samples have the same pose as the training data. However, in practice, anomalies may exist in any region of an object, and the training and query samples may have different poses, calling for the study of pose-agnostic anomaly detection. Second, the absence of a consensus on experimental protocols for pose-agnostic anomaly detection leads to unfair comparisons of different methods, hindering research in this direction. To address these issues, we develop the Multi-pose Anomaly Detection (MAD) dataset and the Pose-agnostic Anomaly Detection (PAD) benchmark, which take the first step toward addressing the pose-agnostic anomaly detection problem. Specifically, we build MAD using 20 complex-shaped LEGO toys, including 4K views with various poses, and high-quality and diverse 3D anomalies in both simulated and real environments. Additionally, we propose a novel method, OmniposeAD, trained using MAD and specifically designed for pose-agnostic anomaly detection. Through comprehensive evaluations, we demonstrate the relevance of our dataset and method. Furthermore, we provide an open-source benchmark library, including the dataset and baseline methods covering 8 anomaly detection paradigms, to facilitate future research and application in this domain. Code, data, and models are publicly available at https://github.com/EricLee0224/PAD.
Industrial Anomaly Detection (IAD) is vital for manufacturing, yet traditional methods face significant challenges: unsupervised approaches yield rough localizations requiring manual thresholds, while supervised methods overfit due to scarce, imbalanced data. Both suffer from the "One Anomaly Class, One Model" limitation. To address this, we propose Referring Industrial Anomaly Segmentation (RIAS), a paradigm leveraging language to guide detection. RIAS generates precise masks from text descriptions without manual thresholds and uses universal prompts to detect diverse anomalies with a single model. We introduce the MVTec-Ref dataset to support this, designed with diverse referring expressions and focusing on anomaly patterns, notably with 95% small anomalies. We also propose the Dual Query Token with Mask Group Transformer (DQFormer) benchmark, enhanced by Language-Gated Multi-Level Aggregation (LMA) to improve multi-scale segmentation. Unlike traditional methods using redundant queries, DQFormer employs only "Anomaly" and "Background" tokens for efficient visual-textual integration. Experiments demonstrate RIAS's effectiveness in advancing IAD toward open-set capabilities. Code: https://github.com/swagger-coder/RIAS-MVTec-Ref.
Industrial anomaly detection is critical for ensuring product quality, preventing equipment failure, and maintaining efficient manufacturing processes. Traditional methods often struggle with the diversity and complexity of potential anomalies and require extensive labeled data for specific tasks. Recent advancements in large pre-trained vision models and large language models, such as Grounding DINO for flexible object detection, Segment Anything Model V2 (SAM V2) for precise segmentation, and Llama 3 for large-corpus language reasoning, offer new possibilities for more adaptable anomaly detection systems. This paper proposes an agent-based framework that leverages the power of Grounding DINO, SAM V2, and Llama 3 for zero-shot industrial anomaly detection. Our method employs an intelligent agent to first localize potential anomalies using text prompts with Grounding DINO. These prompts are carefully designed to highlight common industrial anomalies to guide the model in detecting specific deviations from the norm. The identified regions are then segmented precisely using SAM V2, which generates pixel-level masks based on the bounding boxes detected in the localization step. Once the anomaly regions are segmented, the agent generates an anomaly mask that highlights the anomalous regions. Finally, the agent visualizes the anomaly detection results by overlaying the mask onto the original image, allowing human inspectors to identify and address potential defects. We evaluate our approach on the MVTec and VisA anomaly detection datasets. Compared with mainstream methods, our method improves accuracy by at least 1.9%, demonstrating its capability to effectively localize and segment various types of industrial anomalies across different object categories without requiring task-specific training data, thus significantly reducing costs.
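The agent flow reduces to a prompt-driven detect-then-segment loop. In the sketch below, `detect_regions` and `segment_boxes` are hypothetical stand-ins for Grounding DINO and SAM V2 wrappers (their real APIs differ), and the defect prompts are examples only.

```python
import numpy as np

def zero_shot_anomaly_pipeline(image, detect_regions, segment_boxes,
                               prompts=("scratch", "crack", "contamination")):
    """Sketch of the agent flow: text-prompted localization, per-box
    segmentation, then a mask overlay for human inspectors. `image` is an
    (H, W, 3) uint8 array; the two callables are placeholder wrappers."""
    h, w = image.shape[:2]
    anomaly_mask = np.zeros((h, w), dtype=bool)
    for prompt in prompts:
        boxes = detect_regions(image, prompt)          # text-prompted boxes
        for box in boxes:
            anomaly_mask |= segment_boxes(image, box)  # boolean mask per box
    overlay = image.copy()
    overlay[anomaly_mask] = [255, 0, 0]                # highlight in red
    return anomaly_mask, overlay
```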
Industrial Anomaly Detection (IAD) is critical to ensure product quality during manufacturing. Although existing zero-shot defect segmentation and detection methods have shown effectiveness, they cannot provide detailed descriptions of the defects. Furthermore, the application of large multi-modal models in IAD remains in its infancy, facing challenges in balancing question-answering (QA) performance and mask-based grounding capabilities, often owing to overfitting during the fine-tuning process. To address these challenges, we propose a novel approach that introduces a dedicated multi-modal defect localization module to decouple the dialog functionality from the core feature extraction. This decoupling is achieved through independent optimization objectives and tailored learning strategies. Additionally, we contribute the first multi-modal industrial anomaly detection training dataset, named Defect Detection Question Answering (DDQA), encompassing a wide range of defect types and industrial scenarios. Unlike conventional datasets that rely on GPT-generated data, DDQA ensures authenticity and reliability and offers a robust foundation for model training. Experimental results demonstrate that our proposed method, Explainable Industrial Anomaly Detection Assistant (EIAD), achieves outstanding performance in defect detection and localization tasks. It not only significantly enhances accuracy but also improves interpretability. These advancements highlight the potential of EIAD for practical applications in industrial settings.
Zero-Shot Industrial Anomaly Detection (ZSIAD) aims to identify and localize anomalies in industrial images from unseen categories. Owing to their powerful generalization capabilities, Vision-Language Models (VLMs) have attracted growing interest in ZSIAD. To guide the model toward understanding and localizing semantically complex industrial anomalies, existing VLM-based methods have attempted to provide additional prompts to the model through learnable text prompt templates. However, these zero-shot methods lack detailed descriptions of specific anomalies, making it difficult to classify and segment the diverse range of industrial anomalies accurately. To address this issue, we propose the first multi-stage prompt generation agent for ZSIAD. Specifically, we leverage a Multi-modal Large Language Model (MLLM) to articulate the detailed differential information between normal and test samples, which provides detailed text prompts to the model through further refinement and an anti-false-alarm constraint. Moreover, we introduce a Vision Foundation Model (VFM) to generate anomaly-related attention prompts for more accurate localization of anomalies with varying sizes and shapes. Extensive experiments on seven real-world industrial anomaly detection datasets show that the proposed method not only outperforms recent SOTA methods, but its explainable prompts also provide the model with a more intuitive basis for anomaly identification.
No abstract available
Industrial anomaly detection aims to identify and localize defective regions in images. Among various architectures, reconstruction-based methods have demonstrated exceptional performance. These methods reconstruct anomalous samples into normal ones and identify anomalies through a comparison between them. However, the reconstruction process within these methods often focuses on pixel-level similarity, overlooking the high-frequency consistency between the input and output, which constrains the model's accuracy. This article proposes tGARD, a novel text-guided adversarial reconstruction method for anomaly detection. Specifically, we introduce a feature aggregation module, using non-local blocks and dilated convolutions to handle complex anomaly patterns. Subsequently, a text-guided reconstruction module is meticulously designed to harness CLIP's multimodal alignment capabilities, allowing for a semantically controllable reconstruction process. This control is achieved by incorporating dynamic text embeddings derived from the CLIP encoder within the discriminator. Meanwhile, during the reconstruction process, high-frequency details are preserved through a convolutional adversarial discriminator. Finally, a category-aware loss weighting strategy is conceived to balance similarity and adversarial losses. Experiments demonstrate that our model achieves significant improvements in anomaly localization, surpassing all reconstruction-based models on MVTec-AD. It also establishes a new state-of-the-art on the VisA dataset, outperforming all existing architectures.
We propose a text-guided variational image generation method to address the challenge of getting clean data for anomaly detection in industrial manufacturing. Our method utilizes text information about the target object, learned from extensive text library documents, to generate non-defective data images resembling the input image. The proposed framework ensures that the generated non-defective images align with anticipated distributions derived from textual and image-based knowledge, ensuring stability and generality. Experimental results demonstrate the effectiveness of our approach, surpassing previous methods even with limited non-defective data. Our approach is validated through generalization tests across four baseline models and three distinct datasets. We present an additional analysis to enhance the effectiveness of anomaly detection models by utilizing the generated images.
2D-based Industrial Anomaly Detection has been widely discussed, however, multimodal industrial anomaly detection based on 3D point clouds and RGB images still has many untouched fields. Existing multimodal industrial anomaly detection methods directly concatenate the multimodal features, which leads to a strong disturbance between features and harms the detection performance. In this paper, we propose Multi-3D-Memory (M3DM), a novel multimodal anomaly detection method with hybrid fusion scheme: firstly, we design an unsupervised feature fusion with patch-wise contrastive learning to encourage the interaction of different modal features; secondly, we use a decision layer fusion with multiple memory banks to avoid loss of information and additional novelty classifiers to make the final decision. We further propose a point feature alignment operation to better align the point cloud and RGB features. Extensive experiments show that our multi-modal industrial anomaly detection model outperforms the state-of-the-art (SOTA) methods on both detection and segmentation precision on MVTec-3D AD dataset. Code at github.com/nomewang/M3DM.
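The decision-layer side of memory-bank methods such as M3DM rests on nearest-neighbor distances between test patches and a bank of nominal patch features. A minimal scoring sketch follows; shapes and dimensions are illustrative, and the bank is assumed to hold fused RGB/point-cloud patch features.

```python
import torch

def memory_bank_score(test_patches, memory_bank):
    """Score each test patch by its distance to the nearest nominal patch.
    Shapes: test_patches (M, D), memory_bank (N, D)."""
    dists = torch.cdist(test_patches, memory_bank)  # (M, N) pairwise distances
    patch_scores, _ = dists.min(dim=1)              # nearest-neighbor distance
    image_score = patch_scores.max()                # worst patch drives the image score
    return patch_scores, image_score

scores, s = memory_bank_score(torch.randn(784, 1024), torch.randn(10000, 1024))
```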
Recent advancements have shown the potential of leveraging both point clouds and images to localize anomalies. Nevertheless, their applicability in industrial manufacturing is often constrained by significant drawbacks, such as the use of memory banks, which leads to a substantial increase in memory footprint and inference times. We propose a novel light and fast framework that learns to map features from one modality to the other on nominal samples and detects anomalies by pinpointing inconsistencies between observed and mapped features. Extensive experiments show that our approach achieves state-of-the-art detection and segmentation performance in both the standard and few-shot settings on the MVTec 3D-AD dataset while achieving faster inference and occupying less memory than previous multimodal AD methods. Furthermore, we propose a layer-pruning technique to improve memory and time efficiency with a marginal sacrifice in performance.
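One direction of this feature-mapping idea can be written down in a few lines: learn to predict point-cloud patch features from RGB patch features on nominal samples, then read the residual as an anomaly map at test time. The layer sizes and feature dimensions below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

# Map 2D (RGB) patch features to 3D (point-cloud) patch features; trained
# only on nominal samples, so anomalies break the learned correspondence.
rgb_to_3d = nn.Sequential(nn.Linear(768, 1024), nn.GELU(), nn.Linear(1024, 1152))

def train_step(rgb_feats, pc_feats, opt):
    """rgb_feats: (P, 768) RGB patch features; pc_feats: (P, 1152) spatially
    aligned point-cloud patch features from the same nominal sample."""
    loss = nn.functional.mse_loss(rgb_to_3d(rgb_feats), pc_feats)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def anomaly_map(rgb_feats, pc_feats):
    with torch.no_grad():
        residual = (rgb_to_3d(rgb_feats) - pc_feats).norm(dim=-1)  # (P,)
    return residual  # high residual = feature the mapping could not explain
```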
Constructing comprehensive multimodal feature representations from RGB images (RGB) and point clouds (PT) in 2D–3D multimodal anomaly detection (MAD) methods is very important to reveal various types of industrial anomalies. For multimodal representations, most of the existing MAD methods often consider the explicit spatial correspondence between the modality-specific features extracted from RGB and PT through space-aligned fusion, while overlook the implicit interaction relationships between them. In this study, we propose a uni-modal and cross-modal fusion (UCF) method, which comprehensively incorporates the implicit relationships within and between modalities in multimodal representations. Specifically, UCF first establishes uni-modal and cross-modal embeddings to capture intramodal and intermodal relationships through uni-modal reconstruction and cross-modal mapping. Then, an adaptive nonequal fusion method is proposed to develop fusion embeddings, with the aim of preserving the primary features and reducing interference of the uni-modal and cross-modal embeddings. Finally, uni-modal, cross-modal, and fusion embeddings are all collaborated to reveal anomalies existing in different modalities. Experiments conducted on the MVTec 3D-AD benchmark and the real-world surface mount inspection demonstrate that the proposed UCF outperforms existing approaches, particularly in precise anomaly localization.
Accurate detection and precise localization of anomalies during precision component manufacturing are essential to maintaining high product quality. Multimodal industrial anomaly detection (MIAD) harnesses data from diverse sensors to effectively identify and pinpoint defects in industrial products. Recent MIAD approaches have made significant progress but often ignore point cloud data global contextual semantics and modality-specific information, resulting in an incomplete representation of point cloud and inadequate multimodal fusion. To confront these issues head-on, we propose a robust feature representation and comprehensive multimodal feature fusion network [views-graph and latent feature disentangled fusion network (VLDFNet)] for anomaly detection in industrial high-precision components. VLDFNet mainly consists of a point cloud views-graph representation model and a multimodal disentangled feature latent space fusion module. Specifically, the point cloud views-graph representation model explores spatial locations and semantic relationships between views using multilevel graph fusion. The multimodal disentangled feature latent space fusion module disentangles multimodal features into shared and specific representations to mitigate the omission of modality-specific information. VLDFNet introduces a cross-modal shared feature interaction (CSFI) strategy to extract coherent semantic information by aligning and integrating cross-modal features. Comprehensive experimental results on multiple datasets demonstrate that our method significantly outperforms existing approaches in detection accuracy.
No abstract available
Recent studies of multimodal industrial anomaly detection (IAD) based on 3D point clouds and RGB images have highlighted the importance of exploiting the redundancy and complementarity among modalities for accurate classification and segmentation. However, achieving multimodal IAD in practical production lines remains a work in progress. It is essential to consider the trade-offs between the costs and benefits associated with the introduction of new modalities while ensuring compatibility with current processes. Existing quality control processes combine rapid in-line inspections, such as optical and infrared imaging with high-resolution but time-consuming near-line characterization techniques, including industrial CT and electron microscopy to manually or semi-automatically locate and analyze defects in the production of Li-ion batteries and composite materials. Given the cost and time limitations, only a subset of the samples can be inspected by all in-line and near-line methods, and the remaining samples are only evaluated through one or two forms of in-line inspection. To fully exploit data for deep learning-driven automatic defect detection, the models must have the ability to leverage multimodal training and handle incomplete modalities during inference. In this paper, we propose CMDIAD, a Cross-Modal Distillation framework for IAD to demonstrate the feasibility of a Multi-modal Training, Few-modal Inference (MTFI) pipeline. Our findings show that the MTFI pipeline can more effectively utilize incomplete multimodal information compared to applying only a single modality for training and inference. Moreover, we investigate the reasons behind the asymmetric performance improvement using point clouds or RGB images as the main modality of inference. This provides a foundation for our future multimodal dataset construction with additional modalities from manufacturing scenarios.
No abstract available
No abstract available
We demonstrate an end-to-end system for real-time, multimodal industrial anomaly detection (IAD), built upon a custom hardware platform for synchronized 2D and 3D data acquisition. Our core contribution is a novel cross-modal residual mechanism that identifies defects by quantifying predictive errors between visual and geometric feature spaces. Instead of traditional concatenation, our dual-stream architecture mutually predicts features across modalities, leveraging the prediction residual's magnitude as a direct and robust anomaly indicator. The entire system achieves sub-second inference from acquisition to decision, enabled by efficient depth map analysis that circumvents the complexity of direct point cloud processing, offering a deployable solution for high-speed inspection.
The purpose of multimodal industrial anomaly detection is to detect complex geometric shape defects, such as subtle surface deformations and irregular contours, that are difficult to detect with 2D-based methods. However, current multimodal industrial anomaly detection lacks effective use of crucial geometric information like surface normal vectors and 3D shape topology, resulting in low detection accuracy. In this paper, we propose a novel Geometric Prior-based Anomaly Detection network (GPAD). Firstly, we propose a point cloud expert model to perform fine-grained geometric feature extraction, employing differential normal vector computation to enhance the geometric details of the extracted features and generate a geometric prior. Secondly, we propose a two-stage fusion strategy to efficiently leverage the complementarity of multimodal data as well as the geometric prior inherent in 3D points. We further propose attention fusion and anomaly region segmentation based on the geometric prior, which enhance the model's ability to perceive geometric defects. Extensive experiments show that our multimodal industrial anomaly detection model outperforms state-of-the-art (SOTA) methods in detection accuracy on both the MVTec-3D AD and Eyecandies datasets.
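As background on the kind of geometric prior (surface normals) such methods build on, the classic PCA normal estimator is shown below. GPAD's own extractor is learned, so this is context rather than its implementation; the neighborhood size is arbitrary.

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_normals(points, k=16):
    """PCA normal estimation for a point cloud (N, 3): each point's normal
    is the smallest principal axis of its local neighborhood."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)            # k nearest neighbors per point
    normals = np.empty_like(points)
    for i, nbrs in enumerate(idx):
        patch = points[nbrs] - points[nbrs].mean(axis=0)
        # The right-singular vector with the smallest singular value
        # approximates the local surface normal.
        _, _, vt = np.linalg.svd(patch, full_matrices=False)
        normals[i] = vt[-1]
    return normals

normals = estimate_normals(np.random.rand(2048, 3))
```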
Multimodal industrial anomaly detection (IAD), which integrates RGB and 3D information, has become one of the key technical directions for improving detection robustness and accuracy. Although prevailing cross-modal feature-mapping methods are efficient and lightweight, they still suffer from two major limitations. First, they typically adopt a one-way modeling paradigm that regresses one modality from another and lack explicit interaction within a unified representation space, making it difficult to detect local, small-magnitude anomalies that appear only in a single modality. Second, fusion-reconstruction methods derived from this paradigm rely on a single fusion stream optimized with a reconstruction loss. When trained solely on normal samples, this design can overgeneralize and lacks a parallel branch to enforce consistency constraints on the fused representations, which in turn limits reliable discrimination between normal and anomalous patterns in complex multimodal scenarios. To address these issues, we propose FMFR, a feature-level multistage fusion and remapping framework that jointly models multistage feature fusion and cross-modal remapping. The framework consists of a fusion-reconstruction branch and a remapping-fusion branch, which are jointly constrained by a multi-order consistency loss. In the fusion-reconstruction branch, a reconstruction loss supervises the intermediate fusion layers, encouraging them to learn joint representations that retain complete information and to reconstruct features without losing critical details. In the remapping-fusion branch, the network learns bidirectional mappings between modalities and re-fuses the remapped features, while the multi-order consistency loss is used to align its fused representations with those of the fusion-reconstruction branch. During inference, FMFR jointly leverages intra-modal reconstruction residuals, cross-modal remapping residuals, and the consistency deviation between the fused embeddings of the two branches to construct multi-source anomaly maps. This design forces anomalies to simultaneously violate both intra-modal and cross-modal priors, thereby suppressing the overgeneralization of a single fusion stream and enhancing the visibility of local anomaly structures that exist only in a single modality as well as the overall robustness of anomaly detection. Experimental results on the MVTec 3D-AD dataset demonstrate that FMFR achieves competitive state-of-the-art performance on both anomaly detection and anomaly segmentation tasks.
No abstract available
Anomaly detection plays a critical role in ensuring safe, smooth, and efficient operation of machinery and equipment in industrial environments. With the wide deployment of multimodal sensors and the rapid development of Internet of Things (IoT), the data generated in modern industrial production has become increasingly diverse and complex. However, traditional methods for anomaly detection based on a single data source cannot fully utilize multimodal data to capture anomalies in industrial systems. To address this challenge, we propose a new model for anomaly detection in industrial environments using multimodal temporal data. This model integrates an attention-based autoencoder (AAE) and a generative adversarial network (GAN) to capture and fuse rich information from different data sources. Specifically, the AAE captures time-series dependencies and relevant features in each modality, and the GAN introduces adversarial regularization to enhance the model’s ability to reconstruct normal time-series data. We conduct extensive experiments on real industrial data containing both measurements from a distributed control system (DCS) and acoustic signals, and the results demonstrate the performance superiority of the proposed model over the state-of-the-art TimesNet for anomaly detection, with an improvement of 5.6% in F1 score.
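Stripped of the GAN regularizer, the reconstruction-based core of such a model can be sketched as an attention autoencoder scored by reconstruction error. Dimensions, layer counts, and the feature makeup of the window are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttnAutoencoder(nn.Module):
    """Minimal stand-in for an attention-based autoencoder over multimodal
    time-series windows; the paper's adversarial branch is omitted here."""
    def __init__(self, n_features, dim=64):
        super().__init__()
        self.embed = nn.Linear(n_features, dim)
        self.attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.decode = nn.Linear(dim, n_features)

    def forward(self, x):  # x: (batch, time, features)
        return self.decode(self.attn(self.embed(x)))

model = AttnAutoencoder(n_features=12)
window = torch.randn(8, 100, 12)  # e.g. DCS variables plus acoustic features
# Per-window anomaly score: mean squared reconstruction error.
score = (model(window) - window).pow(2).mean(dim=(1, 2))
```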
Existing industrial anomaly detection methods primarily concentrate on unsupervised learning with pristine RGB images. Yet both RGB and 3D data are crucial for anomaly detection, and datasets are seldom completely clean in practical scenarios. To address these challenges, this paper takes an initial step into RGB-3D multi-modal noisy anomaly detection, proposing a novel noise-resistant M3DM-NR framework that leverages the strong multi-modal discriminative capabilities of CLIP. M3DM-NR consists of three stages: Stage-I introduces a Suspected References Selection module to filter a few normal samples from the training dataset, using the multimodal features extracted by the Initial Feature Extraction step, and a Suspected Anomaly Map Computation module to generate a suspected anomaly map that focuses on abnormal regions as reference. Stage-II uses the suspected anomaly maps of the reference samples as reference and inputs image, point cloud, and text information to denoise the training samples through intra-modal comparison and multi-scale aggregation operations. Finally, Stage-III proposes the Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion modules to learn the pattern of the training dataset, enabling anomaly detection and segmentation while filtering out noise. Extensive experiments show that M3DM-NR outperforms state-of-the-art methods in 3D-RGB multi-modal noisy anomaly detection.
Industrial anomaly detection involves identifying abnormal regions in products and plays a crucial role in quality inspection. While 2D image-based anomaly detection has been extensively explored, combining two-dimensional (2D) images with three-dimensional (3D) point clouds remains less studied. Existing multimodal methods often combine features from different modalities, leading to feature interference and degraded performance. To overcome this, we propose a novel framework for unsupervised industrial anomaly detection that leverages both visual and geometric information. Specifically, we use pre-trained 2D and 3D models to extract visual features from color images and geometric features from 3D point clouds. Instead of directly fusing these features, we propose a geometric feature reconstruction network that predicts 3D geometric features from the 2D visual features. During training, we minimize the difference between the predicted geometric features and the extracted geometric features, enabling the model to learn how 2D appearance correlates with 3D structure in anomaly-free images. During inference, this learned relationship allows the model to detect anomalies: significant discrepancies between the reconstructed and actual geometric features indicate abnormal regions. Evaluated on the MVTec 3D-AD dataset, our method achieves state-of-the-art performance with an average image-level AUROC score of 0.968, surpassing previous approaches. Additionally, it provides fast inference at 8.2 frames per second with a memory footprint of only 1045 MB, making it highly efficient for industrial applications.
Anomaly synthesis strategies can effectively enhance unsupervised anomaly detection. However, existing strategies have limitations in the coverage and controllability of anomaly synthesis, particularly for weak defects that are very similar to normal regions. In this paper, we propose the Global and Local Anomaly co-Synthesis Strategy (GLASS), a novel unified framework designed to synthesize a broader coverage of anomalies under the manifold and hypersphere distribution constraints of Global Anomaly Synthesis (GAS) at the feature level and Local Anomaly Synthesis (LAS) at the image level. Our method synthesizes near-in-distribution anomalies in a controllable way using Gaussian noise guided by gradient ascent and truncated projection. GLASS achieves state-of-the-art results on the MVTec AD (detection AUROC of 99.9%), VisA, and MPDD datasets and excels in weak defect detection. The effectiveness and efficiency have been further validated in industrial applications for woven fabric defect detection. The code and dataset are available at: https://github.com/cqylunlun/GLASS.
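At the feature level, the gradient-guided, truncated synthesis of near-in-distribution anomalies might look roughly like the following. The step size, step count, projection radius, and the loss callable are all assumptions, not GLASS's exact formulation.

```python
import torch

def synthesize_feature_anomaly(feats, model_loss, eps=0.05, steps=3, radius=0.1):
    """Sketch of gradient-ascent anomaly synthesis: start from Gaussian noise,
    push features in the direction that increases a discriminator's loss, and
    truncate the total perturbation so the result stays near the normal
    manifold (a 'weak' defect)."""
    delta = 0.01 * torch.randn_like(feats)
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = model_loss(feats + delta)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta = delta + eps * grad.sign()  # gradient ascent step
            norm = delta.norm()
            if norm > radius:                  # truncated projection
                delta = delta * (radius / norm)
    return feats + delta.detach()

anomalous = synthesize_feature_anomaly(
    torch.randn(64, 256), model_loss=lambda f: f.pow(2).mean())
```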
This report synthesizes five core paths toward the "one-for-all" goal in multimodal industrial anomaly detection: 1) introducing MLLMs/VLMs to enable deep detection with logical reasoning and explanation capabilities; 2) leveraging pre-trained models such as CLIP with prompt learning to tackle zero-shot generalization; 3) deeply fusing RGB, 3D point clouds, and process variables to strengthen physical representations; 4) building unified frameworks and large-scale multimodal benchmarks to drive cross-domain generality; and 5) exploring text-guided generative data augmentation. The overall trend shows industrial anomaly detection evolving from isolated pixel-level discrimination toward an intelligent paradigm of multimodal fusion, semantic reasoning, and all-scenario generality.