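CLIP for Classification and Anomaly Detection
Zero-shot Industrial and Medical Anomaly Detection Frameworks
Focuses on leveraging the general-purpose generalization ability of pre-trained CLIP, designing multimodal interaction or feature-alignment modules to perform industrial defect detection and medical image analysis without any target-domain samples.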
- MVREC: A General Few-shot Defect Classification Model Using Multi-View Region-Context(Shuai Lyu, Rongchen Zhang, Zeqi Ma, Fangjian Liao, Dongmei Mo, Wai Keung Wong, 2025, Proceedings of the AAAI Conference on Artificial Intelligence)
- AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection(Bin-Bin Gao, Yue Zhou, Jiangtao Yan, Yuezhi Cai, Weixi Zhang, Meng Wang, Jun Liu, Yong Liu, Lei Wang, Chengjie Wang, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- A Self-Distilled Vision-Language Model for Industrial Defect Classification(Subin Choi, Daun Jeong, D. Park, Hansang Cho, 2026, 2026 International Conference on Electronics, Information, and Communication (ICEIC))
- WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation(Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, O. Dabeer, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- A Dual-State-Based Surface Anomaly Detection Model for Rail Transit Trains Using Vision-Language Model(Kaiyan Lei, Zhiquan Qi, 2025, IEEE Transactions on Instrumentation and Measurement)
- Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images(Chaoqin Huang, Aofan Jiang, Jinghao Feng, Ya Zhang, Xinchao Wang, Yanfeng Wang, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Sea-CLIP: Mining Semantic-Aware Representations for Few-Shot Anomaly Detection with CLIP(Xiao Guo, Zhimin Chen, C. D. Castillo, Hongcheng Wang, Xiaoming Liu, 2026, 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))
- OV-AS: Zero-Shot/Few-Shot Open-Vocabulary Anomaly Segmentation Based on CLIP(Jingtong Mo, Yuzhuo Fu, Ting Liu, 2024, 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE))
- G-Anomaly: A Pyramid Graph Transformer-Based Vision-Language Model for General Industrial Anomaly Detection(Jiaqi Li, Shuhuan Wen, Bin Fang, 2026, IEEE Transactions on Automation Science and Engineering)
- SEM-CLIP 2.0: Precise Zero-/Few-Shot Learning for Nanoscale Defect Detection in SEM Image(Qian Jin, Ruidong Li, Yuqi Jiang, Yumeng Liu, Xudong Lu, Yining Chen, Dawei Gao, Qi Sun, Cheng Zhuo, 2026, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
- AF-CLIP: Zero-Shot Anomaly Detection via Anomaly-Focused CLIP Adaptation(Qingqing Fang, Wenxi Lv, Qinliang Su, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Zero-Shot Defect Detection With Anomaly Attribute Awareness via Textual Domain Bridge(Zhe Zhang, Shu Chen, Jian Huang, Jie Ma, 2025, IEEE Sensors Journal)
- FP-CLIP: Foreground-panorama prompt learning for zero-shot anomaly detection(Ao Lu, Jincun Liu, Yaoguang Wei, Yan Meng, Dong An, 2025, Digital Signal Processing)
- CLIP-DSA: A CLIP-Based Discriminative and Self-supervised Framework for Few-Shot Anomaly Detection(S. Zeng, Yijun Chen, Muyang Li, Yuqi Wu, Jizhou Tian, 2025, Lecture Notes in Computer Science)
- FE-CLIP: Frequency Enhanced CLIP Model for Zero-Shot Anomaly Detection and Segmentation(Tao Gong, Qi Chu, Bin Liu, Wei Zhou, Nenghai Yu, 2025, 2025 IEEE/CVF International Conference on Computer Vision (ICCV))
- An efficient and scale-aware zero-shot industrial anomaly detection technique based on optimized CLIP(Yahui Cheng, Guojun Wen, Aoshuang Luo, Shuang Mei, Hongbo Dong, Xingyue Liu, 2025, Measurement)
- CoDe-CLIP: Contrastive decoupling for zero-shot industrial anomaly detection(Kaiwen Wang, Hao Liu, Jiuzhen Liang, Yachaona Li, Fan Wu, Yang Cheng, Zhikai Wang, 2026, Information Sciences)
- An Open-Vocabulary Industrial Anomaly Detection method based on CLIP and LLM(Junxiong Wang, Chunrui Li, Yi Zhang, Ziwei Liu, 2026, Advanced Engineering Informatics)
- DCS: A Zero-Shot Anomaly Detection Framework with DINO-CLIP-SAM Integration(Yan Wan, Y. Lang, Li Yao, 2026, Applied Sciences)
- MultiADS: Defect-Aware Supervision for Multi-Type Anomaly Detection and Segmentation in Zero-Shot Learning(Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj, Stefan Schmid, Steffen Staab, Claudia Plant, 2025, 2025 IEEE/CVF International Conference on Computer Vision (ICCV))
- LECLIP: Boosting Zero-Shot Anomaly Detection With Local Enhanced CLIP(Yuyao Liu, Qingyong Li, Zhehong Wang, Jien Kato, Jie Zhang, Wen Wang, 2025, IEEE Transactions on Instrumentation and Measurement)
- RareCLIP: Rarity-Aware Online Zero-Shot Industrial Anomaly Detection(Jianfang He, Minh-Duc Cao, Silong Peng, Qiong Xie, 2025, 2025 IEEE/CVF International Conference on Computer Vision (ICCV))
- Zero-Shot Industrial Anomaly Detection via CLIP-DINOv2 Multimodal Fusion and Stabilized Attention Pooling(Junjie Jiang, Zongxiang He, Anping Wan, Khalil Al-Bukhaiti, Kaiyang Wang, Peiyi Zhu, Xiaomin Cheng, 2025, Electronics)
- HieClip: Hierarchical CLIP with Explicit Alignment for Zero-Shot Anomaly Detection(Liujie Hua, Xiu Su, Yueyi Luo, Shan You, Jun Long, 2025, ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection(Donghyeong Kim, Chaewon Park, Suhwan Cho, Hyeonjeong Lim, Minseok Kang, Jungho Lee, Sangyoun Lee, 2025, Pattern Recognition)
- AA-CLIP: Enhancing Zero-Shot Anomaly Detection via Anomaly-Aware CLIP(Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, S.Kevin Zhou, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- IAD-CLIP: Vision-Language Models for Zero-Shot Industrial Anomaly Detection(Zhuo Li, Yifei Ge, Qi Li, Lin Meng, 2024, 2024 International Conference on Advanced Mechatronic Systems (ICAMechS))
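Few-shot Prompt Learning and Parameter Adaptation
Uses learnable prompt engineering, adapters, or fine-grained feature fine-tuning strategies, leveraging a small number of domain samples to improve recognition accuracy in specific industrial or medical scenarios.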
- Few-Shot Data Augmentation Image Anomaly Detection Based on Clip(Zhaowei Zeng, Dan Li, Xujun Li, 2026, 2026 5th International Symposium on Computer Applications and Information Technology (ISCAIT))
- Reconsidering learnable fine-grained text prompts for few-shot anomaly detection in visual-language models(Delong Han, Luo Xu, Mingle Zhou, Jin Wan, Min Li, Gang Li, 2024, Neural Networks)
- Metal Image Defect Recognition Based on CLIP and Computer Vision(Yanqi Wu, Qinzi Luo, 2023, 2023 IEEE 3rd International Conference on Data Science and Computer Application (ICDSCA))
- Multi-label sewer defect classification based on CLIP with fine-to-coarse contextual representations(Yisu Ge, Jialu Guo, Zhihao Yang, Zhaomin Chen, Liyan Chen, Guodao Zhang, 2026, Advanced Engineering Informatics)
- TGRF-CLIP: CLIP-Based Text-Guided Fusion of Visual Residuals for Few-Shot Anomaly Detection(Hongliang Yan, Xinshun Xu, 2026, Expert Systems with Applications)
- Normal-Variation-Aware Cross-Domain Zero-Shot Anomaly Detection via Multi-Prompt Learning with CLIP(M. Tsuchiya, Tsubasa Hirakawa, Takayoshi Yamashita, H. Fujiyoshi, 2026, IEEE Access)
- CLIP-Core: A Few-Sample Anomaly Detection Method for Surface Defects(Liang Xu, J. Rao, Shuyou Lin, 2026, IEEE Access)
- PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection(Xiaofan Li, Zhizhong Zhang, Xin Tan, Chengwei Chen, Yanyun Qu, Yuan Xie, Lizhuang Ma, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Self-Supervised CLIP-Guided for Few-Shot Industrial Anomaly Detection(Yingwen Chen, Ying Xu, Tianlei Wang, Yikui Zhai, Kanghong Tan, Jianhong Zhou, Pasquale Coscia, A. Genovese, C. L. P. Chen, 2026, IEEE Transactions on Instrumentation and Measurement)
- Pro-CLIP: Residual learning and object-agnostic prompts for few-shot anomaly detection(Yuqing Zhao, Min Meng, Jigang Wu, 2025, Neurocomputing)
- AnomalyNLP: Noisy-Label Prompt Learning for Few-Shot Industrial Anomaly Detection(L. Hua, Jin Qian, 2025, Electronics)
- SEM-CLIP: Precise Few-Shot Learning for Nanoscale Defect Detection in Scanning Electron Microscope Image(Qian Jin, Yuqi Jiang, Xudong Lu, Yumeng Liu, Yining Chen, Dawei Gao, Qi Sun, Cheng Zhuo, 2024, Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design)
- Few-Shot Anomaly Detection via Personalization(Sangkyung Kwak, Jongheon Jeong, Hankook Lee, W. Kim, Dong-Ho Seo, Woojin Yun, Wonjin Lee, Jinwoo Shin, 2024, IEEE Access)
- CLIP-Vision Guided Few-Shot Metal Surface Defect Recognition(Tianlei Wang, Zeliang Li, Ying Xu, Yikui Zhai, Xiaofen Xing, K. Guo, Pasquale Coscia, A. Genovese, Vincenzo Piuri, F. Scotti, 2025, IEEE Transactions on Industrial Informatics)
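Video Anomaly Detection and Temporal Behavior Analysis
Targets temporal dependencies and the localization of anomalous behavior in video, using CLIP's cross-modal semantic alignment to handle the complex logic of video streams.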
- WSVAD-CLIP: Temporally Aware and Prompt Learning with CLIP for Weakly Supervised Video Anomaly Detection(Min Li, Jing Sang, Yuanyao Lu, Li-Fen Du, 2025, Journal of Imaging)
- VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection(Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, Yanning Zhang, 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
- OW-YW-VAD: Open-World YOLO-World-Guided Video Anomaly Detection(Youxi Li, Xiangjun Chen, Liming Wang, Xiaocheng Huang, Qiang Liu, 2026, Research Square)
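Multimodal Large-Model Reasoning and Open-Vocabulary Architectures
Combines the capabilities of large models such as MLLMs and GPT-4V, or open-vocabulary detection techniques, to meet the need for reasoning about and describing complex context in multi-scenario, cross-category anomaly detection.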
- Towards Training-free Anomaly Detection with Vision and Language Foundation Models(Jinjin Zhang, Guodong Wang, Yizhou Jin, Di Huang, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Prompt engineering for zero‐shot and few‐shot defect detection and classification using a visual‐language pretrained model(Gunwoo Yong, Kahyun Jeon, Daeyoung Gil, Ghang Lee, 2022, Computer-Aided Civil and Infrastructure Engineering)
- Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models(Jiacong Xu, Shao-Yuan Lo, Bardia Safaei, Vishal M. Patel, Isht Dwivedi, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Leveraging Large Language Model for Robust Industrial Image Anomaly Detection(Zining Wang, Longquan Dai, 2024, 2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC))
- UniAD: A Real-World Multi-Category Industrial Anomaly Detection Dataset with a Unified CLIP-Based Framework(Junyang Yang, Jiuxin Cao, Chengge Duan, 2025, Information)
- Open-Vocabulary Crack Object Detection Through Attribute-Guided Similarity Probing(Hyemin Yoon, Sangjin Kim, 2025, Applied Sciences)
- VLLM-LAD: Visual Large Language Model for Zero-shot Logical Anomaly Detection(Yun Peng, Xiao Lin, Nachuan Ma, Chengjiu Liu, Qijun Chen, 2026, IEEE Transactions on Circuits and Systems for Video Technology)
- DyC-CLIP: Dynamic context-aware multi-modal prompt learning for zero-shot anomaly detection(Peng Chen, Fangjun Huang, Chao Huang, 2026, Pattern Recognition)
- IA-CLIP: A Single-Source Industrial Anomaly Detection Method for Multi-Target Domain Generalization(Yaohua Guo, Guoai Xu, Jianping Yin, 2026, IEEE Transactions on Automation Science and Engineering)
- PLOVAD: Prompting Vision-Language Models for Open Vocabulary Video Anomaly Detection(Chen Xu, Ke Xu, Xinghao Jiang, Tanfeng Sun, 2025, IEEE Transactions on Circuits and Systems for Video Technology)
- Three-Stage CLIP-Based Approach for Rebar Rust Classification in Precast Concrete Products(Toshiaki Kawagoshi, Naoto Hoshikawa, 2025, 2025 Thirteenth International Symposium on Computing and Networking Workshops (CANDARW))
- Classification of Surface Defects in Hot-Rolled Steel Strips Using Contrastive Learning Image Pre-Trained Models(Hui Tang, Yingjie Zhao, Pengkun Yang, Guangyi Li, Hua Yan, 2025, 2025 5th International Conference on Neural Networks, Information and Communication Engineering (NNICE))
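Surveys and Extended Frontier Applications
Includes foundational surveys of anomaly detection and application-oriented studies for vertical domains such as medical imaging.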
- Enhancing medical anomaly detection via text-adapted few-shot learning with visual-language models(Keming Mao, Shengbin Hou, Haoming Fang, Jianzhe Zhao, Xinlu Xiao, 2026, The Visual Computer)
- A survey of deep learning for industrial visual anomaly detection(Zhuo Li, Yuhao Yan, Xiangheng Wang, Yifei Ge, Lin Meng, 2025, Artificial Intelligence Review)
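CLIP-based anomaly detection research has grown from early semantic transfer into a mature ecosystem spanning zero-shot general-purpose detection, few-shot prompt learning, video temporal analysis, and multimodal reasoning. The research focus is shifting from global feature alignment toward domain-specific architectural adaptation, multimodal logical enhancement, and open-vocabulary scenarios, addressing the generalization bottlenecks and reasoning difficulties of zero-shot and few-shot settings through systematic architectural design.
58 related papers in total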
Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle to design prompt templates, handle complex token interactions, or require fine-tuning on target domains, resulting in limited flexibility. In this work, we present a simple yet effective AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters, visual adapter, textual adapter, and prompt-query adapter, at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and provides a training-free approach on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods.
Anomaly detection (AD) identifies outliers for applications like defect and lesion detection. While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP's anomaly discrimination ability in both text and visual spaces while preserving its generalization capability. AA-CLIP is achieved through a straightforward yet effective two-stage approach: it first creates anomaly-aware text anchors to differentiate normal and abnormal semantics clearly, then aligns patch-level visual features with these anchors for precise anomaly localization. This two-stage strategy, with the help of residual adapters, gradually adapts CLIP in a controlled manner, achieving effective AD while maintaining CLIP's class knowledge. Extensive experiments validate AA-CLIP as a resource-efficient solution for zero-shot AD tasks, achieving state-of-the-art results in industrial and medical applications. The code is available at https://github.com/Mwxinnn/AA-CLIP.
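The patch-to-anchor alignment described in this abstract can be illustrated with a minimal sketch. It assumes that patch features and the two text anchors are already available as plain lists of floats; the function names, the temperature value, and the two-way softmax scoring are illustrative assumptions, not details taken from AA-CLIP.

```python
import math


def unit(v):
    """L2-normalise a vector given as a list of floats."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def patch_anomaly_map(patch_grid, normal_anchor, abnormal_anchor, temperature=0.07):
    """Per-patch probability of 'abnormal' from cosine similarity to two text anchors.

    patch_grid is an H x W grid of feature vectors (hypothetical layout).
    """
    t_norm, t_abn = unit(normal_anchor), unit(abnormal_anchor)
    amap = []
    for row in patch_grid:
        out_row = []
        for patch in row:
            p = unit(patch)
            s_norm = sum(a * b for a, b in zip(p, t_norm)) / temperature
            s_abn = sum(a * b for a, b in zip(p, t_abn)) / temperature
            # softmax([s_norm, s_abn])[1], written as a sigmoid of the difference
            out_row.append(1.0 / (1.0 + math.exp(s_norm - s_abn)))
        amap.append(out_row)
    return amap
```

An image-level score could then be taken as the maximum over the map, which is one common choice rather than something the abstract specifies.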
Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP's zero-shot recognition capability for this task, they often ignore optimizing visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP) by dramatically enhancing its visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios by extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at https://github.com/Faustinaqq/AF-CLIP.
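One way to picture the multi-scale spatial aggregation mentioned here is neighbourhood averaging of patch features at several window sizes. The sketch below is an assumption about the general idea only: the window sizes, the border handling, and the final averaging across scales are hypothetical and may differ from what AF-CLIP actually does.

```python
def neighborhood_mean(grid, k):
    """Average each patch feature over its k x k neighbourhood (k odd), clipped at borders."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            cells = [grid[y][x]
                     for y in range(max(0, i - r), min(h, i + r + 1))
                     for x in range(max(0, j - r), min(w, j + r + 1))]
            row.append([sum(c[t] for c in cells) / len(cells) for t in range(d)])
        out.append(row)
    return out


def multi_scale_aggregate(grid, scales=(1, 3, 5)):
    """Average the neighbourhood means computed at each window size in `scales`."""
    pooled = [neighborhood_mean(grid, k) for k in scales]
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    return [[[sum(p[i][j][t] for p in pooled) / len(pooled) for t in range(d)]
             for j in range(w)] for i in range(h)]
```

Larger windows smooth out small defects while smaller ones preserve them, so combining scales keeps both views in a single feature grid.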
… CLIP and improve the detection capability for anomalies … anomaly localization ability. Experimental results demonstrated that our model achieves a salient zero-shot anomaly detection …
Few-shot Anomaly Detection (FSAD) is a classic computer vision task, and recent FSAD methods utilize the pre-trained Vision-Language model, i.e., CLIP, to achieve remarkable performance. However, existing CLIP-based approaches disregard object semantics, a crucial factor for enhancing FSAD by guiding comparisons between semantically corresponding patches. To address this limitation, we propose Sea-CLIP, a novel method that integrates semantic-aware representations from DINOv2 to enhance FSAD representation learning. Specifically, Sea-CLIP first leverages a Patch Matching module that uses semantic-aware representations to obtain coarse anomaly segmentation masks. These anomaly masks guide a lightweight Anomaly Matching Decoder (AMD) to jointly utilize CLIP and DINOv2 features for FSAD, and AMD innovatively formulates FSAD as a feature-matching task. Also, unlike prior patch-matching works that directly compute anomaly scores, our method utilizes the AMD to refine coarse predictions into a precise anomaly mask. Our Sea-CLIP achieves state-of-the-art performance on MVTec and VisA datasets, and we provide a detailed analysis of contributions from semantic-aware representations in identifying anomaly patterns.
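The coarse patch-matching step can be sketched as a nearest-neighbour comparison between query patches and patches from the few normal reference images. This is a generic illustration, assuming features are given as lists of floats; the function names and the threshold are hypothetical, and the sketch does not reproduce Sea-CLIP's Patch Matching module or its decoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def patch_matching_scores(query_patches, normal_patches):
    """Score each query patch as 1 minus its best cosine similarity to any normal patch."""
    return [1.0 - max(cosine(q, n) for n in normal_patches) for q in query_patches]


def coarse_mask(scores, threshold=0.5):
    """Binarise the scores into a coarse anomaly mask (threshold is an assumed value)."""
    return [s > threshold for s in scores]
```

A patch that closely matches some normal patch scores near 0, while a patch with no good match scores near 1.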
Weakly Supervised Video Anomaly Detection (WSVAD) is a critical task in computer vision. It aims to localize and recognize abnormal behaviors using only video-level labels. Without frame-level annotations, it becomes significantly challenging to model temporal dependencies. Given the diversity of abnormal events, it is also difficult to model semantic representations. Recently, the cross-modal pre-trained model Contrastive Language-Image Pretraining (CLIP) has shown a strong ability to align visual and textual information. This provides new opportunities for video anomaly detection. Inspired by CLIP, WSVAD-CLIP is proposed as a framework that uses its cross-modal knowledge to bridge the semantic gap between text and vision. First, the Axial-Graph (AG) Module is introduced. It combines an Axial Transformer and Lite Graph Attention Networks (LiteGAT) to capture global temporal structures and local abnormal correlations. Second, a Text Prompt mechanism is designed. It fuses a learnable prompt with a knowledge-enhanced prompt to improve the semantic expressiveness of category embeddings. Third, the Abnormal Visual-Guided Text Prompt (AVGTP) mechanism is proposed to aggregate anomalous visual context for adaptively refining textual representations. Extensive experiments on UCF-Crime and XD-Violence datasets show that WSVAD-CLIP notably outperforms existing methods in coarse-grained anomaly detection. It also achieves superior performance in fine-grained anomaly recognition tasks, validating its effectiveness and generalizability.
Zero-shot anomaly detection (ZSAD) is a critical task that detects anomalies without any training samples from the target application, which is crucial for applications in diverse fields such as industrial quality control and medical imaging analysis. Recent advances have seen the application of contrastive language-image pretraining (CLIP) in ZSAD, exploiting its robust visual-linguistic alignment and zero-shot learning capabilities. However, CLIP is primarily designed for natural image classification, emphasizing global visual embeddings, while anomaly detection (AD) requires a more accurate representation of anomalous regions and more precise local visual embeddings. To overcome these limitations, this article proposes the local enhanced CLIP (LECLIP) framework for ZSAD. LECLIP incorporates a local alignment (LA) module that divides images into blocks and aligns them with learnable text embeddings, ensuring precise relevance expression. Furthermore, a training-free echo-attention (EA) is proposed to complement the traditional QKV attention, enabling the model to capture both global and local image details effectively, thus providing a more accurate and detailed image representation. Experimental results show that LECLIP achieves superior performance on 15 challenging datasets, including six industrial datasets and nine medical datasets. Code is available at https://github.com/lyy70/LECLIP
Recent progress in foundation models such as CLIP and SAM has shown great potential for zero-shot anomaly detection. However, existing methods usually rely on generic descriptions such as “abnormal”, whose semantic coverage is insufficient to express fine-grained anomaly semantics. In addition, CLIP primarily performs global-level alignment and struggles to localize minor defects accurately, while the segmentation quality of SAM depends heavily on prompt constraints. To address these problems, we propose DCS, a unified framework that integrates Grounding DINO, CLIP, and SAM through three key innovations. First, we introduce FinePrompt for adaptive learning, which enhances the modeling of anomaly semantics by building a fine-grained anomaly description library and adopting learnable text embeddings. Second, we design an Adaptive Dual-path Cross-modal Interaction (ADCI) module to achieve more effective cross-modal information exchange through dual-path fusion. Finally, we propose a Box-Point Prompt Combiner (BPPC), which combines the box priors provided by DINO with the point prompts generated by CLIP to guide SAM toward finer and more complete segmentation results. Extensive experiments demonstrate the effectiveness of our method: on the MVTec-AD and VisA datasets, DCS achieves state-of-the-art zero-shot anomaly detection results.
Multi-label sewer defect classification based on CLIP with fine-to-coarse contextual representations
… variations and localized defect features, resulting in limited performance in practical sewer defect classification. Therefore, a CLIP based multi-label sewer defect classification method is …
Metal surface defect recognition (MSDR) based on deep learning encounters the challenge of few-shot expert-labeled data. In this study, we propose a CLIP-vision guided self-supervised learning (CVGSSL) framework for representation learning of unlabeled data, completing MSDR using few-shot labeled data. This framework initially generates rich and diverse representation information through multiple CLIP-Vs to ensure effective SSL pretraining, followed by the design of an MLP-adapter to distill knowledge and adapt these representations to recognition tasks. In addition, we construct a self-constrained loss to address the inherent problem of intraclass and interclass distance ambiguity that causes the representation to fall into an equivocal decision margin. Following label-free pretraining of CVGSSL, the downstream model adapts to one-shot to four-shot defect recognition tasks through fine-tuning. Experimental results demonstrate that CVGSSL outperforms state-of-the-art SSL methods across three public metal surface defect datasets, with the efficacy of the approach validated through extensive ablation experiments.
In the field of integrated circuit manufacturing, the detection and classification of nanoscale wafer defects are critical for subsequent root cause analysis and yield enhancement. The complex background patterns observed in scanning electron microscope (SEM) images and the diverse textures of the defects pose significant challenges. Traditional methods usually suffer from insufficient data, labels, and poor transferability. In this paper, we propose a novel few-shot learning approach, SEM-CLIP, for accurate defect classification and segmentation. SEM-CLIP customizes the Contrastive Language-Image Pretraining (CLIP) model to better focus on defect areas and minimize background distractions, thereby enhancing segmentation accuracy. We employ text prompts enriched with domain knowledge as prior information to assist in precise analysis. Additionally, our approach incorporates feature engineering with textual guidance to categorize defects more effectively. SEM-CLIP requires little annotated data, substantially reducing labor demands in the semiconductor industry. Extensive experimental validation demonstrates that our model achieves impressive classification and segmentation results under few-shot learning scenarios.
Few-shot anomaly detection plays a crucial role in automated industrial inspection. With only a minimal number of normal samples, this method can achieve anomaly identification and localization. Recent research has demonstrated that vision-language models such as CLIP, which has undergone contrastive language-image pre-training, exhibit strong zero- and few-shot generalization capabilities. However, existing CLIP-based methods present issues, such as an inability to fully leverage invariant image features, interference from feature noise, reliance on manual prompt design, and unstable image-level scoring. To address these challenges, this paper proposes the CLIP-Core model, which integrates a Mahalanobis distance memory bank and a trainable linear layer to enhance feature measurement and discrimination capabilities. Additionally, we introduce a trainable prompt and text feature reuse mechanism and further propose the CLIP-Core+ model based on CLIP-Core, resolving problems related to prompt design and inadequate image-level scoring. The experimental results on public datasets demonstrate that CLIP-Core+ outperforms existing few-shot anomaly detection methods with only four normal samples, achieving a 12.3% improvement in pixel-level Average Precision (AP). On the CSAD dataset, our method improves the image-level and pixel-level area under the receiver operating characteristic curve (AUROC) by 7% and 10.1%, respectively. On the glass bottle appearance defect dataset, our proposed method improves image-level AUROC by 7.1%.
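To make the idea of a Mahalanobis-distance score over stored normal features concrete, here is a minimal sketch. It assumes a diagonal covariance so that it runs on the standard library alone, which is a simplification introduced for illustration; the abstract does not say how CLIP-Core estimates the covariance, and the function names are hypothetical.

```python
import math


def fit_diagonal_gaussian(features, eps=1e-6):
    """Per-dimension mean and variance of the stored normal features.

    eps keeps the variance strictly positive (an assumed regulariser).
    """
    n, d = len(features), len(features[0])
    mean = [sum(f[i] for f in features) / n for i in range(d)]
    var = [sum((f[i] - mean[i]) ** 2 for f in features) / n + eps for i in range(d)]
    return mean, var


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (simplifying assumption)."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))
```

Unlike a plain Euclidean distance, this score discounts deviations along dimensions where normal samples already vary a lot.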
Visual anomaly classification and segmentation are vital for automating industrial quality inspection. The focus of prior research in the field has been on training custom models for each quality inspection task, which requires task-specific images and annotation. In this paper we move away from this regime, addressing zero-shot and few-normal-shot anomaly classification and segmentation. Recently CLIP, a vision-language model, has shown revolutionary generality with competitive zero-/few-shot performance in comparison to full-supervision. But CLIP falls short on anomaly classification and segmentation tasks. Hence, we propose window-based CLIP (WinCLIP) with (1) a compositional ensemble on state words and prompt templates and (2) efficient extraction and aggregation of window/patch/image-level features aligned with text. We also propose its few-normal-shot extension WinCLIP+, which uses complementary information from normal images. In MVTec-AD (and VisA), without further tuning, WinCLIP achieves 91.8%/85.1% (78.1%/79.6%) AU-ROC in zero-shot anomaly classification and segmentation while WinCLIP+ achieves 93.1%/95.2% (83.8%/96.4%) in 1-normal-shot, surpassing state-of-the-art by large margins.
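The compositional ensemble over state words and prompt templates can be sketched as follows. The state phrases and templates here are hypothetical examples, not the lists used by WinCLIP, and the text-encoder call is left out: `ensemble_embedding` simply assumes the prompt embeddings already exist as lists of floats.

```python
import math
from itertools import product

# Hypothetical state phrases and templates, chosen only for illustration.
NORMAL_STATES = ["flawless {}", "perfect {}"]
ANOMALY_STATES = ["damaged {}", "{} with a defect"]
TEMPLATES = ["a photo of a {}.", "a cropped photo of the {}."]


def compose_prompts(states, templates, obj):
    """Fill every template with every state phrase for one object name."""
    return [t.format(s.format(obj)) for s, t in product(states, templates)]


def ensemble_embedding(embeddings):
    """Average the prompt embeddings of one class and L2-normalise the mean."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]
```

In use, each prompt list would be passed through a CLIP text encoder, the resulting vectors averaged per class with `ensemble_embedding`, and the two class vectors compared with the image embedding.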
Few-shot defect multi-classification (FSDMC) is an emerging trend in quality control within industrial manufacturing. However, current FSDMC research often lacks generalizability due to its focus on specific datasets. Additionally, defect classification heavily relies on contextual information within images, and existing methods fall short of effectively extracting this information. To address these challenges, we propose a general FSDMC framework called MVREC, which offers two primary advantages: (1) MVREC extracts general features for defect instances by incorporating the pre-trained AlphaCLIP model. (2) It utilizes a region-context framework to enhance defect features by leveraging mask region input and multi-view context augmentation. Furthermore, Few-shot Zip-Adapter(-F) classifiers within the model are introduced to cache the visual features of the support set and perform few-shot classification. We also introduce MVTec-FS, a new FSDMC benchmark based on MVTec AD, which includes 1228 defect images with instance-level mask annotations and 46 defect types. Extensive experiments conducted on MVTec-FS and four additional datasets demonstrate its effectiveness in general defect classification and its ability to incorporate contextual information to improve classification performance.
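Caching support-set features and classifying by similarity can be illustrated with a short sketch. The exponential affinity below follows the general pattern of cache-based adapters and is an assumption; the abstract does not describe the internals of the Zip-Adapter, and `beta` is a hypothetical sharpness parameter.

```python
import math


def cache_classify(query, support_feats, support_labels, num_classes, beta=5.0):
    """Similarity-weighted vote over cached support features.

    Each cached feature adds exp(-beta * (1 - cosine_similarity)) to the logit
    of its class, so closer support samples contribute more.
    """
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    q = unit(query)
    logits = [0.0] * num_classes
    for feat, label in zip(support_feats, support_labels):
        sim = sum(a * b for a, b in zip(q, unit(feat)))
        logits[label] += math.exp(-beta * (1.0 - sim))
    return logits
```

The predicted defect type is the index of the largest logit, and no gradient step is needed to add new support samples to the cache.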
In the steel manufacturing industry, detecting surface defects in hot-rolled steel strips is essential for product quality and safety. This paper introduces a classification method for such defects using contrastive learning image pre-training (CLIP) models, leveraging a dataset with various defect types. The approach harnesses CLIP's feature extraction capabilities and contrastive learning to improve defect recognition accuracy. Additionally, gradient-weighted class activation mapping (Grad-CAM) is employed to visualize model attention, offering an unsupervised localization of defects. This technique enhances model interpretability and provides insights into the model's decision-making process. The study demonstrates the CLIP model's effectiveness in defect identification, surpassing existing methods in accuracy and robustness. The integration of Grad-CAM also aids in model optimization by highlighting areas of interest during defect classification. This research not only advances automated defect detection in hot-rolled steel strips but also contributes to the broader field of industrial defect detection and classification.
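For reference, the standard Grad-CAM formulation is given below. The symbols are not from the abstract: $A^k$ denotes the $k$-th feature map of a chosen layer, $y^c$ the score for class $c$, and $Z$ the number of spatial positions. How the paper applies this to the CLIP image encoder is not stated here.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
```

The ReLU keeps only the regions that contribute positively to the class score, which is why the resulting heat map can act as a rough defect localizer without pixel-level labels.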
The classification of metal defect images plays a very important role in industrial production. Most previous work has used fully supervised models for classification, and these methods require a large number of labeled samples. However, it is difficult to collect a large number of specific images, and labeling images is expensive and time-consuming. In this paper, a method of metal defect image classification based on the CLIP model is proposed. The method does not need a training set; it only needs the pre-trained model and the test set, on which it is tested directly. During the test phase, the GEDT module and the OPT module are used. The GEDT module makes full use of the relevant information between features. The OPT module uses a transductive manner to achieve the final classification. Experiments are carried out on a proposed dataset. The results show that this method is significantly better than other existing methods in the one-shot and five-shot settings. In the common setting, the method achieves average improvements of 8.25% and 12.51% on one-shot and five-shot over the next best method. In the cross-domain setting, the method achieves average improvements of 11% and 9% on one-shot and five-shot over the next best method.
Defect detection is a critical task in the manufacturing domain that is essential for ensuring product quality and operational efficiency. However, this task poses significant challenges due to imbalanced datasets, subtle differences between defective and non-defective samples, and the need for generalization across diverse operational conditions. To address these challenges, we propose a domain-specialized CLIP framework that incorporates self-distilled contrastive loss to improve robustness in imbalanced datasets and enhance the detection of subtle defect patterns. Our approach leverages adaptive prompts by integrating domain-specific metadata, such as product categories and management IDs, enabling context-aware classification across diverse products and conditions. Additionally, we formulate defect detection as a binary classification task during inference, where image-text representations are compared using similarity-based classification to determine the most probable class. Experiments on real-world manufacturing datasets demonstrate the effectiveness and adaptability of our approach while consistently outperforming standard classification models.
… set classification, facilitating more effective defect classification and enabling the detection of unknown defect … extensive data annotation, SEM-CLIP 2.0 significantly reduces labor costs. …
Zero‐shot learning, applied with vision‐language pretrained (VLP) models, is expected to be an alternative to existing deep learning models for defect detection when datasets are insufficient. However, the performance of VLP models, including contrastive language‐image pretraining (CLIP), fluctuates with the prompts (inputs), which has motivated research on prompt engineering, the optimization of prompts for improving performance. Therefore, this study aims to identify the features of a prompt that can yield the best performance in classifying and detecting building defects using the zero‐shot and few‐shot capabilities of CLIP. The results reveal the following: (1) domain‐specific definitions are better than general definitions and images; (2) a complete sentence is better than a set of core terms; and (3) multimodal information is better than single‐modal information. The resulting detection performance using the proposed prompting method outperformed that of existing supervised models.
Visual defect detection is crucial for industrial quality control in intelligent manufacturing. Previous research requires target-specific data to train the model for each inspection task. However, due to the challenges of collecting proprietary data and model-training time costs, zero-shot defect detection (ZSDD) has become an emerging topic in the field. ZSDD, which requires models trained with auxiliary data, can detect defects on different products without target-data training. Recently, large pretrained vision-language models (VLMs), such as contrastive language-image pre-training model (CLIP), have demonstrated revolutionary generality with competitive zero-shot performance across various downstream tasks. However, VLMs have limitations in defect detection, which are designed to focus on identifying category semantics of the objects rather than sensing object attributes (defective/nondefective). The current VLMs-based ZSDD methods require manually crafted text prompts to guide the discovery of anomaly attributes. In this article, we propose a novel ZSDD method, namely attribute-aware CLIP, to adapt CLIP for anomaly attribute discovery without designing specific textual prompts. The core is designing a textual domain bridge, which transforms simple general textual prompt features into prompt embeddings better aligned with the attribute awareness. This enables the model to perceive the attributes of objects by text-image feature matching, bridging the gap between object semantic recognition and attribute discovery. Additionally, we perform component clustering on the images to break down the overall object semantics, encouraging the model to focus on attribute awareness. Extensive experiments on 16 real-world defect datasets demonstrate that our method achieves state-of-the-art (SOTA) ZSDD performance in diverse class-semantic datasets.
We propose a three-stage CLIP approach consisting of Reference Sample Method (Pure CLIP: Contrastive Language-Image Pre-training), Concept Extraction Learning Method (ML-Pipeline), and Level Determination Integration Method (ML-Hybrid) to objectify and enhance the accuracy of subjective rebar rust evaluation. The approach achieves efficient learning and high-precision classification in limited data environments, with ML-Hybrid demonstrating optimal performance with an F1 score of 1.000. Pure CLIP reduces development effort by 87.5%, confirming economic feasibility for multi-product deployment. Additionally, robustness against temporal changes has been verified, and practical applicability as an industrial application has been demonstrated through implementation in actual production lines. This method contributes to the standardization of rebar rust determination and suggests applicability to general classification problems categorized by degree.
… defects. To address these limitations, we propose CLIP-Based Text-Guided Residual Fusion(TGRF-CLIP), … residuals framework that enhances CLIP’s defect recognition capability while …
Few-shot industrial anomaly detection (FSAD) aims to identify unseen defects using only a limited number of normal samples. However, most existing approaches still rely heavily on auxiliary industrial datasets for training. In this paper, we propose a novel self-supervised contrastive language-image pretraining (CLIP)-guided framework for FSAD, which eliminates the need for auxiliary industrial data. Specifically, we first introduce a pseudo-anomaly generation strategy that synthesizes both structural and textural anomalies. Then, leveraging the cross-modal semantic understanding capability of CLIP, we contrast the multiscale visual features with learnable textual prompts to achieve anomaly localization grounded in language semantics. Inspired by the human cognitive process of identifying anomalies through reference comparison, we introduce a support set composed of a few normal samples and perform semantic-level feature alignment with the query set via the CLIP visual (CLIP-V) encoder, thereby enhancing anomaly discrimination. Furthermore, we introduce an adapter to alleviate CLIP's semantic offset between the text and image modalities in industrial scenarios and to enhance the model’s robustness to the spatial structure differences between the query set and the support set. Extensive experiments conducted on the MVTec AD, VisA, BTAD, and MPDD datasets demonstrate that our method achieves competitive results under the few-shot setting. Moreover, its effectiveness and deployability are validated through real-world applications in battery spot-welding defect inspection. The code is available at https://github.com/YiKuiZhai/SCF-AD
… that the proposed FE-CLIP has good generalization across different domains and achieves superior zero-shot performance of detecting and segmenting anomalies in 10 datasets of …
Zero-shot anomaly detection (ZSAD) aims to identify anomalies in unseen categories by leveraging CLIP's zero-shot capabilities to match text prompts with visual features. A key challenge in ZSAD is learning general prompts stably and utilizing them effectively, while maintaining both generalizability and category specificity. Although general prompts have been explored in prior works, achieving their stable optimization and effective deployment remains a significant challenge. In this work, we propose GenCLIP, a novel framework that learns and leverages general prompts more effectively through multi-layer prompting and dual-branch inference. Multi-layer prompting integrates category-specific visual cues from different CLIP layers, enriching general prompts with more comprehensive and robust feature representations. By combining general prompts with multi-layer visual features, our method further enhances its generalization capability. To balance specificity and generalization, we introduce a dual-branch inference strategy, where a vision-enhanced branch captures fine-grained category-specific features, while a query-only branch prioritizes generalization. The complementary outputs from both branches improve the stability and reliability of anomaly detection across unseen categories. Additionally, we propose an adaptive text prompt filtering mechanism, which removes irrelevant or atypical class names not encountered during CLIP's training, ensuring that only meaningful textual inputs contribute to the final vision-language alignment.
Large image-language models (LLM) have made significant progress in zero-shot anomaly detection (ZSAD); however, the semantic gap between images and text limits their performance in hierarchical learning. In this paper, we propose the hierarchical alignment CLIP (HieClip) framework to achieve hierarchical alignment between images and text. Specifically, we introduce learnable hierarchical textual (LHT) to reduce the representation differences between various levels of images and text, while performing multi-level comprehensive discrimination. Additionally, dynamically adjusting the weights of features at different levels improves the model’s ability to capture both global and local information. Experiments on public industrial datasets demonstrate HieClip’s effectiveness, showing significant accuracy improvement, and its strong generalization capabilities were further validated on medical datasets. Compared to existing methods, HieClip excels in anomaly detection tasks, particularly in industrial inspection and medical diagnosis scenarios.
Precise optical inspection in industrial applications is crucial for minimizing scrap rates and reducing the associated costs. Besides merely detecting if a product is anomalous or not, it is crucial to know the distinct types of defects, such as a bent, cut, or scratch. The ability to recognize the “exact” defect type enables automated treatments of the anomalies in modern production lines. Current methods are limited to solely detecting whether a product is defective or not, without providing any insights into the defect type, let alone detecting and identifying multiple defects. We propose MultiADS, a zero-shot learning approach, able to perform Multi-type Anomaly Detection and Segmentation. The architecture of MultiADS comprises CLIP and extra linear layers to align the visual and textual representation in a joint feature space. To the best of our knowledge, our proposal is the first approach to perform a multi-type anomaly segmentation task in zero-shot learning. Contrary to the other baselines, our approach i) generates specific anomaly masks for each distinct defect type, ii) learns to distinguish defect types, and iii) simultaneously identifies multiple defect types present in an anomalous product. Additionally, our approach outperforms zero/few-shot learning SoTA methods on image-level and pixel-level anomaly detection and segmentation tasks on five commonly used datasets: MVTec-AD, VisA, MPDD, MAD, and Real-IAD.
… Pro-CLIP, a novel framework for few-shot anomaly detection … capture general normality and abnormality concepts, ensuring … utilizes few-shot normal samples to refine anomaly detection …
The vision-language model has brought great improvement to few-shot industrial anomaly detection, which usually requires designing hundreds of prompts through prompt engineering. For automated scenarios, we first use conventional prompt learning with the many-class paradigm as the baseline to automatically learn prompts, but found that it cannot work well in one-class anomaly detection. To address the above problem, this paper proposes a one-class prompt learning method for few-shot anomaly detection, termed PromptAD. First, we propose semantic concatenation, which can transpose normal prompts into anomaly prompts by concatenating normal prompts with anomaly suffixes, thus constructing a large number of negative samples used to guide prompt learning in the one-class setting. Furthermore, to mitigate the training challenge caused by the absence of anomaly images, we introduce the concept of explicit anomaly margin, which is used to explicitly control the margin between normal prompt features and anomaly prompt features through a hyper-parameter. For image-level/pixel-level anomaly detection, PromptAD achieves first place in 11/12 few-shot settings on MVTec and VisA. Code is available at https://github.com/FuNz-0/PromptAD.git
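The two ideas in this abstract, building anomaly prompts by appending suffixes to normal prompts and enforcing a margin between normal and anomaly prompt features, can be sketched as follows. This is a hinge-style illustration under stated assumptions, not PromptAD's actual loss: the prompt strings, the Euclidean distance, the `margin_loss` helper, and the margin value 0.5 are all hypothetical.

```python
import math

def semantic_concatenation(normal_prompts, anomaly_suffixes):
    """Turn each normal prompt into anomaly prompts by appending a suffix."""
    return [f"{p} {s}" for p in normal_prompts for s in anomaly_suffixes]

def l2(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def margin_loss(image_feat, normal_feat, anomaly_feat, margin=0.5):
    """Hinge loss: a normal image should be closer to the normal prompt
    than to the anomaly prompt by at least `margin` (assumed form)."""
    return max(0.0, l2(image_feat, normal_feat) - l2(image_feat, anomaly_feat) + margin)

prompts = semantic_concatenation(
    ["a photo of a bottle", "a photo of a flawless bottle"],
    ["with a crack", "with a scratch"],
)

image = [1.0, 0.0]                                      # toy normal-image feature
loss_far = margin_loss(image, [1.0, 0.0], [0.0, 1.0])   # anomaly prompt far away
loss_near = margin_loss(image, [1.0, 0.0], [0.9, 0.1])  # anomaly prompt too close
print(prompts, loss_far, loss_near)
```

The loss is zero once the anomaly prompt feature is pushed at least `margin` farther from the normal image than the normal prompt feature, which is how a single hyper-parameter can control the separation without any anomaly images.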
Even with plenty of normal samples, anomaly detection has been considered a challenging machine learning task due to its one-class nature, i.e., the lack of anomalous samples at training time. It is only recently that a few-shot regime of anomaly detection became feasible in this regard, e.g., with help from large vision-language pre-trained models such as CLIP, despite its wide applicability. In this paper, we explore the potential of large text-to-image generative models in performing few-shot industrial anomaly detection. Specifically, recent text-to-image models have shown unprecedented ability to generalize from few images to extract their common and unique concepts, and even encode them into a textual token to “personalize” the model: so-called textual inversion. Here, we question whether this personalization is specific enough to discriminate the given images from their potential anomalies, which are often, e.g., open-ended, local, and hard-to-detect. We observe that standard textual inversion exhibits a weaker understanding of localized details within objects, which is not enough for detecting industrial anomalies accurately. Thus, we explore the utilization of model personalization to address anomaly detection and propose Anomaly Detection via Personalization (ADP). ADP enables extracting fine-grained local details shared in the images with a simple yet effective regularization scheme from the zero-shot transferability of CLIP. We also propose a self-tuning scheme to further optimize the performance of our detection pipeline, leveraging synthetic data generated from the personalized generative model. Our experiments show that the proposed inversion scheme could achieve state-of-the-art results on two industrial anomaly benchmarks, MVTec-AD and VisA, in the regime of few normal samples.
… tasks, including anomaly detection. Capitalizing on CLIP’s powerful cross-modal alignment, WinCLIP [1] introduced a framework for zero-shot/few-shot anomaly classification and …
The Contrastive Language-Image Pretraining (CLIP) model demonstrates remarkable generalization capabilities and exhibits substantial potential for few-shot anomaly detection in industrial applications. Therefore, further enhancing its performance is of significant importance. To address its low efficiency in utilizing text-image pair data, this paper proposes a semantic splicing module that converts normal text into anomalous text and combines it with the original image to form new negative sample pairs, thereby achieving data sample augmentation. Since CLIP inherently lacks the ability to distinguish between positive and negative sample pairs, we design an anomaly boundary module that projects CLIP features onto a single hypersphere, ensuring the distance between normal images and anomalous text is greater than that between normal images and normal text. This strengthens the model's discriminative capacity between normal and abnormal samples, thereby boosting detection accuracy. In anomaly detection experiments on the MVTec and VisA datasets, our method achieves superior metrics compared to PatchCore, WinCLIP and RWDA.
Few-Shot Industrial Anomaly Detection (FSIAD) is an essential yet challenging problem in practical scenarios such as industrial quality inspection. Its objective is to identify previously unseen anomalous regions using only a limited number of normal support images from the same category. Recently, large pre-trained vision-language models (VLMs), such as CLIP, have exhibited remarkable few-shot image-text representation abilities across a range of visual tasks, including anomaly detection. Despite their promise, real-world industrial anomaly datasets often contain noisy labels, which can degrade prompt learning and detection performance. In this paper, we propose AnomalyNLP, a new Noisy-Label Prompt Learning approach designed to tackle the challenge of few-shot anomaly detection. This framework offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of VLMs for industrial anomaly detection. First, we design a Noisy-Label Prompt Learning (NLPL) strategy. This strategy utilizes feature learning principles to suppress the influence of noisy samples via Mean Absolute Error (MAE) loss, thereby improving the signal-to-noise ratio and enhancing overall model robustness. Furthermore, we introduce a prompt-driven optimal transport feature purification method to accurately partition datasets into clean and noisy subsets. For both image-level and pixel-level anomaly detection, AnomalyNLP achieves state-of-the-art performance across various few-shot settings on the MVTecAD and VisA public datasets. Qualitative and quantitative results on two datasets demonstrate that our method achieves the largest average AUC improvement over baseline methods across 1-, 2-, and 4-shot settings, with gains of up to 10.60%, 10.11%, and 9.55% in practical anomaly detection scenarios.
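Why a Mean Absolute Error loss can suppress noisy samples, as this abstract claims, is easiest to see numerically. The sketch below compares cross-entropy with MAE on a single confidently predicted but mislabeled sample; the probability values are made up for illustration, and the helper names are hypothetical rather than part of AnomalyNLP.

```python
import math

def cross_entropy(probs, label):
    """Standard cross-entropy for one sample with an integer label."""
    return -math.log(probs[label])

def mae(probs, label):
    """Mean Absolute Error style loss: L1 distance to the one-hot target."""
    return sum(abs(p - (1.0 if i == label else 0.0)) for i, p in enumerate(probs))

# The model is confident the sample is class 0 (normal),
# but the (noisy) label says class 1 (anomalous).
probs = [0.99, 0.01]
noisy_label = 1

ce_value = cross_entropy(probs, noisy_label)   # grows without bound as p -> 0
mae_value = mae(probs, noisy_label)            # never exceeds 2
print(ce_value, mae_value)
```

Because the MAE term is bounded, a single mislabeled sample contributes a limited penalty, whereas cross-entropy lets it dominate the gradient as the predicted probability of the noisy label approaches zero.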
… CLIP, a novel dynamic context-aware prompt learning method for ZSAD. DyC-CLIP enhances anomaly … into textual prompts, reducing the reliance on product-specific prompts. To further …
… We extend CLIP with learnable text prompts, layer-wise ConvBlocks, and a lightweight … , and visually atypical yet non-anomalous appearances through multi-prompt competition. …
… zero-shot anomaly segmentation approaches is their high rate of false positives, often caused by confusing anomalies … To tackle this issue, we propose an advanced zero-shot anomaly …
This paper presents an efficient zero-shot industrial anomaly detection (IAD) framework based on visual-language models. Industrial anomaly detection usually adopts an unsupervised learning approach, which achieves excellent detection performance. However, it is still difficult to recognize some more complicated anomalies, such as rotational defects, and more detailed features are needed to describe the image. Building on the excellent performance of contrastive language-image pretraining (CLIP), this paper proposes a zero-shot industrial anomaly detection framework, IAD-CLIP, based on visual-language models. The framework contains a pre-trained CLIP model, a training-free adaptation module and a test-time adaptation mechanism. The training-free adaptation module uses a value-value attention mechanism and a state prompt space. The pre-trained CLIP model is used for feature extraction, and the training-free adaptation module processes the extracted features through visual encoders and text encoders for anomaly detection and localization. A test-time adaptation mechanism is used to improve the anomaly localization performance during the testing phase. The experimental results on the industrial anomaly detection dataset MVTec AD show that IAD-CLIP achieves 92.1% AUROC, 94.6% AUPR, and 91.9% F1Max, respectively. This result validates the significant effect of the IAD-CLIP framework proposed in this paper in the industrial anomaly detection task.
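The value-value attention named in this abstract is usually understood as replacing the query-key similarity of self-attention with a value-value similarity, so that each patch token attends to tokens with similar value features. The following pure-Python sketch shows that general form on toy tokens; it is an assumption about the mechanism as commonly described for training-free CLIP adaptation, and the abstract itself does not give IAD-CLIP's exact formulation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def vv_attention(V):
    """Attention whose weights come from V·Vᵀ instead of Q·Kᵀ.

    V is a list of token value vectors (n tokens, each of dimension d).
    Returns the attention weights and the re-aggregated token features.
    """
    d = len(V[0])
    scale = 1.0 / math.sqrt(d)
    scores = [[scale * sum(a * b for a, b in zip(vi, vj)) for vj in V] for vi in V]
    weights = [softmax(r) for r in scores]
    out = [[sum(w * V[j][k] for j, w in enumerate(wr)) for k in range(d)]
           for wr in weights]
    return weights, out

# Three toy patch tokens: tokens 0 and 2 share the same value feature.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, out = vv_attention(V)
print(weights[0])
```

Tokens 0 and 2 receive equal weight from token 0 while the dissimilar token 1 is down-weighted, which is the locality-preserving behavior that makes this variant attractive for patch-level anomaly maps.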
Industrial visual inspection demands high-precision anomaly detection amid scarce annotations and unseen defects. This paper introduces a zero-shot framework leveraging multimodal feature fusion and stabilized attention pooling. CLIP’s global semantic embeddings are hierarchically aligned with DINOv2’s multi-scale structural features via a Dual-Modality Attention (DMA) mechanism, enabling effective cross-modal knowledge transfer for capturing macro- and micro-anomalies. A Stabilized Attention-based Pooling (SAP) module adaptively aggregates discriminative representations using self-generated anomaly heatmaps, enhancing localization accuracy and mitigating feature dilution. Trained solely on auxiliary datasets with multi-task segmentation and contrastive losses, the approach requires no target-domain samples. Extensive evaluation across seven benchmarks (MVTec AD, VisA, BTAD, MPDD, KSDD, DAGM, DTD-Synthetic) demonstrates state-of-the-art performance, achieving 93.4% image-level AUROC, 94.3% AP, 96.9% pixel-level AUROC, and 92.4% AUPRO on average. Ablation studies confirm the efficacy of DMA and SAP, while qualitative results highlight superior boundary precision and noise suppression. The framework offers a scalable, annotation-efficient solution for real-world industrial anomaly detection.
… regions in industrial images without requiring any anomalous training samples. Recent CLIP-… CoDe-CLIP, a practical enhancement of CLIP tailored for industrial surface inspection. Our …
Vision based Industrial anomaly detection (IAD) faces dual challenges of scarce annotated data and generalization of cross production lines in intelligent manufacturing. However, …
Industrial image anomaly detection is critical for automated manufacturing. However, most existing methods rely on single-category training paradigms, resulting in poor scalability and limited cross-category generalization. These approaches require separate models for each product type and fail to model the complex multi-modal distribution of normal samples in multi-category scenarios. To overcome these limitations, we propose UniCLIP-AD, a unified anomaly detection framework that leverages the general semantic knowledge of CLIP and adapts it to the industrial domain using Low-Rank Adaptation (LoRA). This design enables a single model to effectively handle diverse industrial parts. In addition, we introduce UniAD, a large-scale industrial anomaly detection dataset collected from real production lines. It contains over 25,000 high-resolution images across 7 categories of electronic components, with both pixel-level and image-level annotations. UniAD captures fine-grained, diverse, and realistic defects, making it a strong benchmark for unified anomaly detection. Experiments show that UniCLIP-AD achieves superior performance on UniAD, with an AU-ROC of 92.1% and F1-score of 89.8% in cross-category tasks, outperforming the strongest baselines (CFA and DSR) by 3% AU-ROC and 23.9% F1-score.
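Low-Rank Adaptation, which this abstract uses to adapt CLIP, keeps a pretrained weight matrix W frozen and learns a low-rank update so that the effective weight is W + (alpha / r)·B·A. The sketch below shows that forward pass on a tiny layer; the matrix values, the rank r = 1, and alpha = 1.0 are illustrative assumptions, and which CLIP layers UniCLIP-AD adapts is not stated in the abstract.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, alpha):
    """Frozen path W·x plus the scaled low-rank path (alpha / r)·B·(A·x)."""
    r = len(A)                      # rank = number of rows of A
    base = matvec(W, x)             # frozen pretrained projection
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen weight, d_out=2, d_in=3
A = [[0.1, 0.2, 0.3]]                     # trainable, r x d_in
B_zero = [[0.0], [0.0]]                   # usual zero init: no change at start
B_trained = [[1.0], [-1.0]]               # hypothetical values after training
x = [1.0, 2.0, 3.0]

y_frozen = matvec(W, x)
y_init = lora_forward(x, W, A, B_zero, alpha=1.0)
y_adapted = lora_forward(x, W, A, B_trained, alpha=1.0)
print(y_frozen, y_init, y_adapted)
```

With B initialized to zero the adapted layer reproduces the pretrained output exactly, and only r·(d_in + d_out) parameters are trained instead of d_in·d_out, which is what makes a single shared model across many product categories affordable.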
In industrial manufacturing, ensuring product quality is of paramount importance. A key component of this process is anomaly detection, which aims to promptly identify defective products to reduce operational losses. However, practical industrial environments are characterized by complexity, including limited availability of labeled data, a wide variety of defect categories, and frequent changes in these categories. Such factors pose significant challenges to the effective cross-domain generalization of anomaly detection methods. To address this limitation, IA-CLIP, a novel framework that enhances cross-domain generalization for industrial anomaly detection, is proposed. IA-CLIP integrates global and local prompts with contrastive learning to overcome the limitations of existing approaches. The proposed class-agnostic global-local semantic prompts enable the model to capture general patterns of normality and anomaly without relying on object-specific semantics. We further introduce a Similarity-aware Triplet Contrastive Learning strategy to facilitate complementary learning between global and local prompts, and an Adaptive Focal Contrastive Learning scheme to help the model focus more effectively on hard-to-identify anomalous regions. Extensive experiments on nine real-world target-domain datasets, covering 50 categories of industrial products, demonstrate that IA-CLIP achieves impressive cross-domain generalization performance in realistic industrial settings. Code and data will be released upon publication. Note to Practitioners—IA-CLIP tackles the challenge of inaccessible sample data in industrial manufacturing by enabling cross-domain generalization for anomaly detection. It integrates both global and local prompts and leverages image-text contrastive learning to capture fine-grained visual features. This allows IA-CLIP to generalize effectively across diverse industrial scenarios without requiring retraining on each new domain. 
The method supports both large-area anomaly detection and fine-grained localization of defects in complex industrial textures, such as metal nuts, meshes, fabrics, and PCBs. Extensive evaluations on 50 object surface categories across 9 target domains demonstrate its practical value. IA-CLIP offers a promising solution for real-world industrial applications where acquiring labeled anomaly data is costly or infeasible.
… of offline and batch zero-shot paradigms in industrial anomaly detection. • We propose RareCLIP, a pioneering framework that combines CLIP’s zero-shot capability with dynamic rarity …
… introduction to twelve industrial anomaly detection methods, … 2D and 3D datasets for industrial visual anomaly detection. In … in the field of industrial anomaly detection. Beyond analysis, …
Video anomaly detection (VAD) confronts significant challenges arising from data scarcity in real-world open scenarios, encompassing sparse annotations, labeling costs, and limitations on closed-set class definitions, particularly when scene diversity surpasses available training data. Although current weakly-supervised VAD methods offer partial alleviation, their inherent confinement to closed-set paradigms renders them inadequate in open-world contexts. Therefore, this paper explores open vocabulary video anomaly detection (OVVAD), leveraging abundant vision-related language data to detect and categorize both seen and unseen anomalies. To this end, we propose a robust framework, PLOVAD, designed to prompt-tune large-scale pretrained image-based vision-language models (I-VLMs) for the OVVAD task. PLOVAD consists of two main modules: the Prompting Module, featuring a learnable prompt to capture domain-specific knowledge and an anomaly-specific prompt crafted by a large language model (LLM) to capture semantic nuances and enhance generalization; and the Temporal Module, which integrates temporal information using graph attention network (GAT) stacking atop frame-wise visual features to address the transition from static images to videos. Extensive experiments on four benchmarks demonstrate the superior detection and categorization performance of our approach in the OVVAD task without introducing excessive parameters.
Existing anomaly detection methods mainly focus on unsupervised learning, resulting in low generalization and single-type segmentation. Hence, we introduce the open-vocabulary detection pattern into the anomaly segmentation field to achieve multi-semantic segmentation and propose OV-AS, a zero-shot/few-shot open-vocabulary anomaly segmentation method based on CLIP. (1) In the zero-shot setting, we pretrain open-vocabulary segmentation models to learn anomaly-domain knowledge from an extracted Anomaly Domain-General dataset, and then fine-tune the image encoder of CLIP to further narrow the gap with the target domain. (2) In the few-shot setting, we introduce few-anomaly-shot learning on target datasets. Results show that in zero-shot, our model achieves 35.7/13.3 F1-max on MVTec-AD/VisA benchmarks, essentially surpassing state-of-the-art. In few-shot, our model achieves 43.0 mIoU on the VISION benchmark under 10-shot, comparable to supervised learning.
Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
Open-world video anomaly detection must handle unseen anomalies, changing scene rules, and limited labels. Existing methods still rely too much on clip-level scores or closed-set semantics. This study presents OW-YW-VAD, an object-centric framework for open-world video anomaly detection. The framework uses an open-vocabulary detector to extract targets and prompt responses. It then builds trajectories and interaction cues. An object-centric temporal encoder, a normal memory bank, scene dynamics priors, and semantic uncertainty are fused for anomaly decision. The training objective combines classification, localization, memory compactness, weak supervision, and temporal smoothness. Experiments on UBnormal, ShanghaiTech, and UCF-Crime cover open-set recognition, fine-grained localization, and long-video weak supervision. The proposed method achieves 76.10, 98.60, and 90.12 on the three benchmarks. The results show stable detection, accurate localization, and strong robustness in open-world settings. The code is available at https://github.com/1846659840/OW-YW-VAD.
Timely detection of road surface defects such as cracks and potholes is critical for ensuring traffic safety and reducing infrastructure maintenance costs. While recent advances in image-based deep learning techniques have shown promise for automated road defect detection, existing models remain limited to closed-set detection settings, making it difficult to recognize newly emerging or fine-grained defect types. To address this limitation, we propose an attribute-aware open-vocabulary crack detection (AOVCD) framework, which leverages the alignment capability of pretrained vision–language models to generalize beyond fixed class labels. In this framework, crack types are represented as combinations of visual attributes, enabling semantic grounding between image regions and natural language descriptions. To support this, we extend the existing PPDD dataset with attribute-level annotations and incorporate a multi-label attribute recognition task as an auxiliary objective. Experimental results demonstrate that the proposed AOVCD model outperforms existing baselines. In particular, compared to CLIP-based zero-shot inference, the proposed model achieves approximately a 10-fold improvement in average precision (AP) for novel crack categories. Attribute classification performance—covering geometric, spatial, and textural features—also increases by 40% in balanced accuracy (BACC) and 23% in AP. These results indicate that integrating structured attribute information enhances generalization to previously unseen defect types, especially those involving subtle visual cues. Our study suggests that incorporating attribute-level alignment within a vision–language framework can lead to more adaptive and semantically grounded defect recognition systems.
Recent advancements in large-scale visual-language pre-trained models have led to significant progress in zero/few-shot anomaly detection within natural image domains. However, the substantial domain divergence between natural and medical images limits the effectiveness of these methodologies in medical anomaly detection. This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels. This multi-level adaptation is guided by multi-level, pixel-wise visual-language feature alignment loss functions, which recalibrate the model's focus from object semantics in natural imagery to anomaly identification in medical images. The adapted features exhibit improved generalization across various medical data types, even in zero-shot scenarios where the model encounters unseen medical modalities and anatomical regions during training. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models, with an average AUC improvement of 6.24% and 7.33% for anomaly classification, 2.03% and 2.37% for anomaly segmentation, under the zero-shot and few-shot settings, respectively. Source code is available at: https://github.com/MediaBrain-SJTU/MVFA-AD
Visual-language alignment is crucial for enhancing the domain adaptability of industrial anomaly detection models. However, existing methods overlook the importance of structured image representation and fail to distinguish topological differences between anomalies and the inherent textures of products, which reduces the accuracy of semantic matching. To address this problem, we propose a novel industrial anomaly detection model, G-Anomaly, which preserves the topological structure of the sample images and further enhances the model's domain adaptability. We design a Pyramid Graph Transformer as the visual encoder to extract multi-scale visual features; it directly preserves the structural relationships between different regions of the image and mitigates the over-smoothing issue present in deep graph networks, thereby retaining the distinguishability of anomalous nodes. Additionally, we design a Multi-level Domain Adapter that ensures semantic consistency of anomalous features across different scales and contexts by performing visual-language matching at various resolutions and levels of abstraction. This enhances the model's domain adaptability for anomaly detection across a wide range of industrial products. We collect and curate a real-world solar panel dataset, PV_actual AD, and conduct extensive experiments on the public dataset MVTec AD as well as PV_actual AD. The results demonstrate that G-Anomaly not only performs well in standard testing environments but also exhibits robustness and domain adaptability for anomaly detection tasks in real-world scenarios. Note to Practitioners: G-Anomaly extracts image features by identifying spatial topological relationships and aligning them with textual descriptions of anomalies at a semantic level. This approach enables effective anomaly detection across a wide range of products and improves the generalization of anomaly patterns learned from the source domain to other target domains without additional anomaly samples for any fine-tuning. G-Anomaly is not only applicable to large-area anomaly detection but also maintains a high degree of semantic sensitivity to anomalous parts in industrial samples with complex textures or notches, such as metal nuts, grids, and solar panels. It has practical significance for industrial applications.
Few-Shot Anomaly Detection (FSAD) in industrial images aims to identify abnormalities using only a few normal images, which is crucial for industrial scenarios where training samples are limited. Recent advances in large-scale pre-trained visual-language models have brought significant improvements to FSAD, which typically requires hundreds of text prompts to be manually crafted through prompt engineering. However, manually designed text prompts cannot accurately match the informative features of different categories across diverse images, and the domain gap between training and test datasets can severely impact the generalization capability of text prompts. To address these issues, we propose a visual-language model based on fine-grained learnable text prompts as a unified general framework for FSAD in industry. First, we design a Fine-grained Text Prompts Adapter (FTPA) and an associated registration loss to enhance the efficiency of text prompts. The manually designed text prompts are improved and optimized by capturing normal and abnormal semantic information in the image, so that the text prompts can describe the image semantic information at a finer granularity. In addition, we introduce a Dynamic Modulation Mechanism (DMM) to avoid potential errors in the trained text prompts when the target dataset is unknown during cross-dataset detection. This is achieved by explicitly modulating the branch guided by few-shot images and the branch guided by fine-grained text prompts. Extensive experiments demonstrate that our proposed method achieves state-of-the-art few-shot industrial anomaly detection and segmentation performance. In the 4-shot setting, the AUROC of anomaly classification and anomaly segmentation reaches 98.3%, 96.3%, and 93.8%, 97.9% on the MVTec-AD and VisA datasets, respectively.
The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide range of image-level tasks, revealing remarkable ability for learning powerful visual representations with rich semantics. An open and worthwhile problem is efficiently adapting such a strong model to the video domain and designing a robust video anomaly detector. In this work, we propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) that leverages the frozen CLIP model directly without any pre-training or fine-tuning process. Unlike current works that directly feed extracted features into a weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of the fine-grained associations between vision and language provided by CLIP and adopts a dual-branch design. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of the dual branches, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pre-trained knowledge from CLIP to the WSVAD task. We conduct extensive experiments on two commonly used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features are released at https://github.com/nwpu-zxr/VadCLIP.
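The dual-branch idea described above can be illustrated with a small sketch: one branch scores each frame from visual features alone, and the other compares each frame with class text embeddings. This is not the VadCLIP code; the logistic scorer, the temperature, and every name below are assumptions made for illustration, and the features are assumed to come from a frozen CLIP encoder.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]


def coarse_branch(frame_features, weight, bias=0.0):
    """Coarse-grained branch: a per-frame anomaly confidence from visual features only.

    A single logistic unit stands in for whatever binary classifier is used.
    """
    return [1.0 / (1.0 + math.exp(-(dot(weight, f) + bias))) for f in frame_features]


def fine_branch(frame_features, class_text_embeddings, temperature=0.07):
    """Fine-grained branch: per-frame softmax over cosine similarity to class prompts."""
    texts = [normalize(t) for t in class_text_embeddings]
    probabilities = []
    for f in frame_features:
        f = normalize(f)
        logits = [dot(f, t) / temperature for t in texts]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        probabilities.append([e / total for e in exps])
    return probabilities
```

In a weakly supervised setting only video-level labels are available, so frame scores from both branches would still need to be pooled into a video-level prediction for training; the abstract does not specify how, and the sketch leaves that step out.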
Currently, large visual language models (LVLMs), such as MiniGPT-4 and LLaVA, have demonstrated the ability to understand images and have achieved outstanding performance across various visual tasks. Despite this impressive performance in image understanding, they still exhibit limitations in specific domains, such as industrial anomaly detection (IAD). Particularly when dealing with tasks that require specialized knowledge, these models often struggle to capture subtle details in images, resulting in unsatisfactory performance in the IAD field. Moreover, existing IAD solutions typically provide only a score and an approximate location of anomalous objects, relying on manually set thresholds to determine whether an object is anomalous. This approach has limited generalization capabilities and often fails to meet expectations in practical applications. In this study, we explore the potential of integrating LVLMs with traditional IAD methods to address these issues. Specifically, we selected LLaVA as the foundational LVLM and designed a visual decoder with an intra-class adaptive scoring mechanism to address the LVLM's limitations in capturing fine details. Our research successfully eliminates the need for manual threshold settings, achieving a fully automated anomaly detection and localization process. Through extensive experiments on the MVTec-AD and VisA datasets, our method demonstrated outstanding detection performance, thereby validating its effectiveness.
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: https://xujiacong.github.io/Anomaly-OV/
… This large-scale, diverse dataset is designed for logical anomaly detection, … for zero-shot logical anomaly detection, offering a robust … visual-language models to improve logical anomaly …
Anomaly detection on the surface of rail transit train bodies (RTTB-AD) is a challenging task because of the scarcity of anomalies, the complexity and variability of the detection environment, and the exceptionally high identification rate required in practical applications. This article proposes a novel differential-based anomaly detection model (DSE-AD) for the surface of rail train bodies based on a visual-language model. It utilizes the differences between history and current images of the same position on the same train type to achieve anomaly localization, while addressing interference from nonanomalous changes caused by the environment. Specifically, we first propose a normal-abnormal dual-state contrast prompt suitable for rail trains, and align the image features with the prompt features from the pretrained encoder at a fine-grained level to obtain the task-specific dual-state feature representation. Next, we propose the dual-state difference enhancement (DSDE) module, which utilizes a learnable difference attention matrix to enhance the anomaly-specific dual-state information, allowing the model to focus on the anomaly semantics. Finally, an anomaly highlight module (AHM) is designed for the inference process to reduce nonanomalous predictions by improving the discrimination of abnormal features. Experiments show that DSE-AD is able to adapt to the complex and variable detection environment, and outperforms other methods in both same-domain and cross-domain detection, especially for unknown anomalies. It also shows robustness in dealing with the interference of changes between the history and current images, as well as faster convergence and independence from the scale of the pretrained model.
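The starting point of the abstract above, comparing a history image with a current image of the same position, can be illustrated with a fixed per-patch feature distance. This is only a plain baseline under stated assumptions: the paper's DSDE module uses a learnable difference attention matrix, which this sketch does not implement, and the function names and the choice of cosine distance are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def difference_map(history_patches, current_patches):
    """Per-patch change score between spatially aligned history and current features."""
    if len(history_patches) != len(current_patches):
        raise ValueError("history and current must have the same number of patches")
    return [cosine_distance(h, c) for h, c in zip(history_patches, current_patches)]
```

Because cosine distance ignores feature magnitude, a patch whose feature is merely rescaled scores near zero, while a patch whose feature direction changes scores high. Environmental changes that alter feature direction would still be flagged, which is the kind of nonanomalous interference the abstract says the learned modules are meant to suppress.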
Medical image anomaly detection (AD) is crucial for early disease diagnosis, yet it faces challenges such as data heterogeneity and scarcity of annotated samples. This paper …
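CLIP-based anomaly detection research has grown from early semantic transfer into a mature ecosystem that spans zero-shot general-purpose detection, few-shot prompt learning, temporal analysis of video, and multimodal reasoning. The research focus is shifting from global feature alignment toward domain-specific architectural adaptation, enhanced multimodal logical reasoning, and in-depth work on open-vocabulary scenarios, with systematic architectural design used to address the generalization bottlenecks and reasoning difficulties of zero-shot and few-shot settings.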