Research on Intelligent Diagnosis of Laryngeal Diseases Based on Deep Alignment of Laryngoscopy Images and Medical Record Text
Medical Image Enhancement and Clinical Disease Knowledge Alignment
These works target the medical domain specifically (e.g., X-rays and radiology reports), exploring how to achieve deep image-text alignment by integrating anatomical structures, pathological features, and disease knowledge bases, so as to generate accurate diagnostic reports or synthesize high-quality medical images.
- Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable Cyclic Image-Report Generation (Wenting Chen, Linlin Shen, Jingyang Lin, Jiebo Luo, Xiang Li, Yixuan Yuan, 2024, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation (Sang-Jun Park, Keun-Soo Heo, Dong-Hee Shin, Young-Han Son, Ji-Hye Oh, Tae-Eui Kam, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting (Wenting Chen, Pengyu Wang, Hui Ren, Lichao Sun, Quanzheng Li, Yixuan Yuan, Xiang Li, 2024, International Conference on Medical Image Computing and Computer-Assisted Intervention)
- DKA-RG: Disease-Knowledge-Enhanced Fine-Grained Image–Text Alignment for Automatic Radiology Report Generation (Heng Yin, Wei Wu, Yongtao Hao, 2024, Electronics)
Fine-Grained Spatial and Multi-Level Semantic Matching Mechanisms
This group of studies addresses the granularity of the alignment process: by relating local regions (patches/subregions) to specific word items (words/tags), imposing bidirectional consistency constraints, and fusing multi-scale features, they mitigate the information loss caused by coarse-grained alignment.
- Improving Image-Text Matching With Bidirectional Consistency of Cross-Modal Alignment (Zhe Li, Lei Zhang, Kun Zhang, Yongdong Zhang, Zhendong Mao, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training (Longtian Qiu, Shan Ning, Xuming He, 2024, AAAI Conference on Artificial Intelligence)
- Enhancing image–text matching through multi-level semantic consistency alignment (Liqi Zhu, Dezhi Han, Xiang Shen, Chongqing Chen, Kuan Ching Li, 2025, The Visual Computer)
- Global-local prompts guided image-text embedding, alignment and aggregation for multi-label zero-shot learning (Tiecheng Song, Yu Huang, Feng Yang, Anyong Qin, Yue Zhao, Chenqiang Gao, 2025, Journal of Visual Communication and Image Representation)
- Image-Text Alignment using Adaptive Cross-attention with Transformer Encoder for Scene Graphs (Juyong Song, Sunghyun Choi, 2021, Proceedings of the British Machine Vision Conference 2021)
Text-Guided Visual Segmentation and Lesion Localization Techniques
This line of work explores how semantic cues in text can guide pixel- or region-level localization in vision tasks, covering referring image segmentation (RIS), weakly supervised semantic segmentation, and anomaly detection, with the aim of improving spatial recognition of specific entities such as lesions.
- Harnessing Text Insights With Visual Alignment for Medical Image Segmentation (Qingjie Zeng, Huan Luo, Zilin Lu, Yutong Xie, Zhiyong Wang, Yanning Zhang, Yong Xia, 2025, IEEE Transactions on Medical Imaging)
- SimCLIP: Refining Image-Text Alignment with Simple Prompts for Zero-/Few-shot Anomaly Detection (Chenghao Deng, Haote Xu, Xiaolu Chen, Haodi Xu, Xiaotong Tu, Xinghao Ding, Yue Huang, 2024, Proceedings of the 32nd ACM International Conference on Multimedia)
- Extending CLIP’s Image-Text Alignment to Referring Image Segmentation (Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, Suha Kwak, 2024, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers))
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation (Soojin Jang, Jungmin Yun, Junehyoung Kwon, Eunju Lee, YoungBin Kim, 2024, European Conference on Computer Vision)
Domain Adaptation and Generalization of Vision-Language Pre-trained Models
These studies aim to optimize large-scale pre-trained models such as CLIP, tackling single-tag bias, cross-domain distribution shift (domain generalization), and the semantic gap in downstream tasks, and strengthening model adaptability through adapters, distillation, or prompt tuning.
- CLIPCEIL: Domain Generalization through CLIP via Channel rEfinement and Image-text aLignment (Xi Yu, Shinjae Yoo, Yuewei Lin, 2024, Advances in Neural Information Processing Systems 37)
- Beyond General Alignment: Fine-Grained Entity-Centric Image-Text Matching with Multimodal Attentive Experts (Yaxiong Wang, Lianwei Wu, Lechao Cheng, Zhun Zhong, Yujiao Wu, Meng Wang, 2025, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)
- TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias (Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, Kyungsu Kim, 2024, European Conference on Computer Vision)
Cross-Modal Alignment Optimization and Quality Evaluation Frameworks
This group of papers covers general strategies for improving image-text matching performance, such as soft-label alignment, adaptive embeddings, and instruction augmentation, as well as quantitative methods for evaluating the alignment quality of generative models, ensuring the robustness of multimodal systems in both retrieval and evaluation.
- Adaptive Cross-Modal Embeddings for Image-Text Alignment (Jonatas Wehrmann, Rodrigo C. Barros, Camila Kolling, 2020, Proceedings of the AAAI Conference on Artificial Intelligence)
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment (Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Y. Lee, Krishna Kumar Singh, 2024, European Conference on Computer Vision)
- Instruction-Augmented Multimodal Alignment for Image-Text and Element Matching (Xinli Yue, Jianhui Sun, Junda Lu, Liangchao Yao, Fan Xia, Tianyi Wang, Fengyun Rao, Jing Lyu, Yuetang Deng, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Unified learning for image-text alignment via multi-scale feature fusion (Jingfeng Zhou, Meng Wang, 2025, Computer Vision and Image Understanding)
- Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval (Hailang Huang, Zhijie Nie, Ziqiao Wang, Ziyu Shang, 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
- Image-Text Alignment and Retrieval Using Light-Weight Transformer (Wenrui Li, Xiaopeng Fan, 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
Taken together, this collection examines cutting-edge techniques for deep alignment between the visual and language modalities; in medical image analysis in particular, it traces the evolution from general-purpose pre-trained models (e.g., CLIP) toward fine-grained alignment with medical knowledge. The research focuses on strengthening diagnostic report generation with clinical knowledge, improving lesion localization accuracy through fine-grained matching, optimizing cross-modal semantic consistency in segmentation tasks, and establishing more rigorous frameworks for evaluating alignment quality. These techniques provide complete methodological support, from low-level feature matching to high-level reasoning, for intelligent diagnosis of laryngeal diseases based on laryngoscopy images and medical record text.
A total of 22 related references.
The automatic generation of radiology reports has emerged as a promising solution to reduce a time-consuming task and accurately capture critical disease-relevant findings in X-ray images. Previous approaches for radiology report generation have shown impressive performance. However, there remains significant potential to improve accuracy by ensuring that retrieved reports contain disease-relevant findings similar to those in the X-ray images and by refining generated reports. In this study, we propose a Disease-aware image-text Alignment and self-correcting Re-alignment for Trustworthy radiology report generation (DART) framework. In the first stage, we generate initial reports based on image-to-text retrieval with disease-matching, embedding both images and texts in a shared embedding space through contrastive learning. This approach ensures the retrieval of reports with similar disease-relevant findings that closely align with the input X-ray images. In the second stage, we further enhance the initial reports by introducing a self-correction module that re-aligns them with the X-ray images. Our proposed framework achieves state-of-the-art results on two widely used benchmarks, surpassing previous approaches in both report generation and clinical efficacy metrics, thereby enhancing the trustworthiness of radiology reports.
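To make the stage-one mechanism concrete, the following PyTorch sketch illustrates a CLIP-style shared embedding space trained with a symmetric contrastive loss, plus image-to-report retrieval. This is a minimal approximation of the idea described in the abstract, not the DART implementation; the function names, temperature, and retrieval interface are illustrative assumptions.

```python
# Minimal sketch (not the DART authors' code) of stage-one style image-report
# contrastive alignment: images and reports are embedded in a shared space so
# that reports with similar disease-relevant findings can be retrieved for a query image.
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/report embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> report direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # report -> image direction
    return (loss_i2t + loss_t2i) / 2

def retrieve_reports(query_img_emb, report_bank_emb, k=3):
    """Image-to-text retrieval: return indices of the k most similar reports."""
    sims = F.normalize(query_img_emb, dim=-1) @ F.normalize(report_bank_emb, dim=-1).t()
    return sims.topk(k, dim=-1).indices
```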
Domain generalization (DG) is a fundamental yet challenging topic in machine learning. Recently, the remarkable zero-shot capabilities of the large pre-trained vision-language model (e.g., CLIP) have made it popular for various downstream tasks. However, the effectiveness of this capacity often degrades when there are shifts in data distribution during testing compared to the training data. In this paper, we propose a novel method, known as CLIPCEIL, a model that utilizes Channel rEfinement and Image-text aLignment to adapt CLIP to inaccessible out-of-distribution test datasets that exhibit domain shifts. Specifically, we refine the feature channels in the visual domain to ensure they contain domain-invariant and class-relevant features by using a lightweight adapter. This is achieved by minimizing the inter-domain variance while maximizing the inter-class variance. In the meantime, we ensure the image-text alignment by aligning text embeddings of the class descriptions and their corresponding image embedding while further removing the domain-specific features. Moreover, our model integrates multi-scale CLIP features by utilizing a self-attention fusion module, technically implemented through one Transformer layer. Extensive experiments on five widely used benchmark datasets demonstrate that CLIPCEIL outperforms the existing state-of-the-art methods. The source code is available at https://github.com/yuxi120407/CLIPCEIL.
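The channel-refinement objective (minimize inter-domain variance while maximizing inter-class variance) can be sketched as follows. This is a rough, assumption-laden illustration of the idea stated in the abstract; the released CLIPCEIL code may compute these statistics differently.

```python
# Hedged sketch of a variance-based channel-refinement objective in the spirit of
# CLIPCEIL (not the released implementation): refine visual feature channels so
# they are domain-invariant (small spread across domain means) and
# class-discriminative (large spread across class means).
import torch

def channel_refinement_loss(features, domain_ids, class_ids, eps=1e-6):
    """features: (N, C) adapter outputs; domain_ids/class_ids: (N,) integer labels."""
    def spread_of_group_means(labels):
        means = torch.stack([features[labels == g].mean(0) for g in labels.unique()])
        return means.var(dim=0, unbiased=False).mean()
    inter_domain = spread_of_group_means(domain_ids)   # want this small
    inter_class = spread_of_group_means(class_ids)     # want this large
    return inter_domain / (inter_class + eps)
```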
Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis on the CLIP latent space which leads to two findings. Firstly, we observe that the CLIP's visual feature of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. In addition, we show that the modality gap between a paired image-text can be empirically modeled as a zero-mean Gaussian distribution. Motivated by the findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation to leverage local region information, which produces a compact visual representation for matching text representation. Moreover, we incorporate a noise injection and CLIP reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap.
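A small sketch of the text-only training trick implied by the modality-gap finding: since the gap is modeled as zero-mean Gaussian noise, text embeddings can be perturbed during training to stand in for image embeddings at inference. The noise scale below is an assumed placeholder, not the value used in the paper.

```python
# Illustrative sketch (assumptions, not the MacCap code): the abstract models the
# image-text modality gap as zero-mean Gaussian noise, so during text-only training
# the caption decoder can be conditioned on noised CLIP text embeddings that
# approximate image embeddings seen at inference time.
import torch
import torch.nn.functional as F

def noised_text_embedding(text_emb, noise_std=0.016):
    """Add zero-mean Gaussian noise to a CLIP text embedding and re-normalize.
    noise_std is an assumed placeholder hyperparameter."""
    noisy = text_emb + torch.randn_like(text_emb) * noise_std
    return F.normalize(noisy, dim=-1)
```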
Data scarcity and privacy concerns limit the availability of high-quality medical images for public use, which can be mitigated through medical image synthesis. However, current medical image synthesis methods often struggle to accurately capture the complexity of detailed anatomical structures and pathological conditions. To address these challenges, we propose a novel medical image synthesis model that leverages fine-grained image-text alignment and anatomy-pathology prompts to generate highly detailed and accurate synthetic medical images. Our method integrates advanced natural language processing techniques with image generative modeling, enabling precise alignment between descriptive text prompts and the synthesized images' anatomical and pathological details. The proposed approach consists of two key components: an anatomy-pathology prompting module and a fine-grained alignment-based synthesis module. The anatomy-pathology prompting module automatically generates descriptive prompts for high-quality medical images. To further synthesize high-quality medical images from the generated prompts, the fine-grained alignment-based synthesis module pre-defines a visual codebook for the radiology dataset and performs fine-grained alignment between the codebook and generated prompts to obtain key patches as visual clues, facilitating accurate image synthesis. We validate the superiority of our method through experiments on public chest X-ray datasets and demonstrate that our synthetic images preserve accurate semantic information, making them valuable for various medical applications.
No abstract available
In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance between positive and negative captions to ensure that the alignment model does not depend solely on textual information but also considers the associated images for predicting alignment accurately. By creating this enhanced training data, we fine-tune an existing leading visual-language model to boost its capability in understanding alignment. Our model significantly outperforms current top-performing methods across various datasets. We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment. Project page: \url{https://yuheng-li.github.io/LLaVA-score/}
We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels then employs a self-distillation strategy to align combined masks with the text-derived mask. This approach ensures the unbiased image-text alignment of the CLIP-based models using only image-text pairs without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at https://github.com/shjo-april/TTD.
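The tag-relevance step can be illustrated with the following hedged sketch: each tag's relevance is taken as its similarity to the closest dense image embedding, which is the intuition behind counteracting single tag bias. Variable names and the threshold are assumptions rather than the authors' code.

```python
# Rough sketch of the tag-relevance idea described in the TTD abstract (names and
# the threshold are assumptions): a tag is considered image-relevant if it is
# similar to at least one pixel/patch embedding, rather than letting a single tag
# dominate the sentence-level text embedding.
import torch
import torch.nn.functional as F

def image_relevant_tags(tag_embs, pixel_embs, threshold=0.25):
    """tag_embs: (T, D) text-tag embeddings; pixel_embs: (P, D) dense image embeddings."""
    tag_embs = F.normalize(tag_embs, dim=-1)
    pixel_embs = F.normalize(pixel_embs, dim=-1)
    sim = tag_embs @ pixel_embs.t()            # (T, P) tag-to-pixel similarity
    relevance, _ = sim.max(dim=1)              # nearest-pixel similarity per tag
    keep = (relevance > threshold).nonzero(as_tuple=True)[0]
    return keep, relevance
```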
Automatic radiology report generation is a task that combines artificial intelligence and medical information processing, and it fully relies on computer vision and natural language processing techniques. Nowadays, automatic radiology report generation is still a very challenging task because it requires semantically adequate alignment of data from two modalities: radiology images and text. Existing approaches tend to focus on coarse-grained alignment at the global level and do not take into account the disease characteristics of radiology images at fine-grained semantics, which results in the generated reports potentially omitting key disease diagnostic descriptions. In this work, we propose a new approach, disease-knowledge-enhanced fine-grained image–text alignment for automatic radiology report generation (DKA-RG). The method combines global and disease-level alignment, thus facilitating the extraction of fine-grained disease features by the model. Our approach also introduces a knowledge graph to inject medical domain expertise into the model. Our proposed DKA-RG consists of two training steps: the image–report alignment stage and the image-to-report generation stage. In the alignment stage, we use global contrastive learning to align images and texts from a high level and also augment disease contrastive learning with medical knowledge to enhance the disease detection capability. In the report generation stage, the report text generated from the images is more accurate in describing the disease information thanks to sufficient alignment. Through extensive quantitative and qualitative experiments on two widely used datasets, we validate the effectiveness of our DKA-RG on the task of radiology report generation. Our DKA-RG achieves superior performance on multiple types of metrics (natural language generation and clinical efficacy metrics) compared to existing methods, demonstrating that the method can improve the reliability and accuracy of automatic radiology report generation systems.
Weakly supervised semantic segmentation (WSSS) approaches typically rely on class activation maps (CAMs) for initial seed generation, which often fail to capture global context due to limited supervision from image-level labels. To address this issue, we introduce DALNet, Dense Alignment Learning Network that leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity. Our key insight is to employ a dual-level alignment strategy: (1) Global Implicit Alignment (GIA) to capture global semantics by maximizing the similarity between the class token and the corresponding text embeddings while minimizing the similarity with background embeddings, and (2) Local Explicit Alignment (LEA) to improve object localization by utilizing spatial information from patch tokens. Moreover, we propose a cross-contrastive learning approach that aligns foreground features between image and text modalities while separating them from the background, encouraging activation in missing regions and suppressing distractions. Through extensive experiments on the PASCAL VOC and MS COCO datasets, we demonstrate that DALNet significantly outperforms state-of-the-art WSSS methods. Our approach, in particular, allows for more efficient end-to-end process as a single-stage method.
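As a rough illustration of the Global Implicit Alignment (GIA) idea, the sketch below pulls the image class token toward the text embeddings of present classes and away from a background text embedding. The loss form and names are assumptions based only on the abstract, not the DALNet code.

```python
# Hedged sketch of a GIA-style loss: maximize similarity between the class token
# and the text embeddings of classes present in the image, while minimizing
# similarity with a background text embedding.
import torch
import torch.nn.functional as F

def global_implicit_alignment_loss(cls_token, class_text_embs, bg_text_emb, tau=0.07):
    """cls_token: (D,); class_text_embs: (K, D) present classes; bg_text_emb: (D,)."""
    cls_token = F.normalize(cls_token, dim=-1)
    pos = F.normalize(class_text_embs, dim=-1) @ cls_token       # (K,) foreground sims
    neg = F.normalize(bg_text_emb, dim=-1) @ cls_token            # background sim
    logits = torch.cat([pos, neg.unsqueeze(0)]) / tau
    # maximize the probability mass assigned to the foreground class embeddings
    return -torch.log(torch.softmax(logits, dim=0)[:pos.numel()].sum())
```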
Recently, large pre-trained vision-language models, such as CLIP, have demonstrated significant potential in zero-/few-shot anomaly detection tasks. However, existing methods not only rely on expert knowledge to manually craft extensive text prompts but also suffer from a misalignment of high-level language features with fine-level vision features in anomaly segmentation tasks. In this paper, we propose a method, named SimCLIP, which focuses on refining the aforementioned misalignment problem through bidirectional adaptation of both Multi-Hierarchy Vision Adapter (MHVA) and Implicit Prompt Tuning (IPT). In this way, our approach requires only a simple binary prompt to efficiently accomplish anomaly classification and segmentation tasks in zero-shot scenarios. Furthermore, we introduce its few-shot extension, SimCLIP+, integrating the relational information among vision embeddings and skillfully merging the cross-modal synergy information between vision and language to address downstream anomaly detection tasks. Extensive experiments on two challenging datasets prove the more remarkable generalization capacity of our method compared to the current SOTA approaches. Our code is available at https://github.com/CH-ORGI/SimCLIP.
Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP’s inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP’s image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP’s image-text alignment to RIS.
To address these issues, we propose a novel Adaptive patch-word Matching (AdaMatch) model to correlate chest X-ray (CXR) image regions with words in medical reports and apply it to CXR-report generation to provide explainability for the generation process. AdaMatch exploits the fine-grained relation between adaptive patches and words to provide explanations of specific image regions with corresponding words. To capture the abnormal regions of varying sizes and positions, we introduce the Adaptive Patch extraction (AdaPatch) module to acquire the adaptive patches for these regions adaptively. In order to provide explicit explainability for CXR-report generation task, we propose an AdaMatch-based bidirectional large language model for Cyclic CXR-report generation (AdaMatch-Cyclic). It employs the AdaMatch to obtain the keywords for CXR images and `keypatches' for medical reports as hints to guide CXR-report generation. Extensive experiments on two publicly available CXR datasets prove the effectiveness of our method and its superior performance to existing methods.
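A minimal sketch of patch-word matching in this spirit: each word is scored against image patches, and the strongest word-patch pairs serve as keyword/"keypatch" hints for generation. This is an illustrative simplification, not the AdaMatch module itself; the adaptive patch extraction step is omitted.

```python
# Illustrative patch-word alignment sketch (not the AdaMatch implementation):
# score every word against every patch, keep each word's best patch, and return
# the strongest (word, patch) pairs as hints.
import torch
import torch.nn.functional as F

def patch_word_alignment(patch_embs, word_embs, top_k=5):
    """patch_embs: (P, D); word_embs: (W, D). Returns the top (word, patch) index pairs."""
    sim = F.normalize(word_embs, dim=-1) @ F.normalize(patch_embs, dim=-1).t()  # (W, P)
    best_patch_sim, best_patch_idx = sim.max(dim=1)        # best patch per word
    top_words = best_patch_sim.topk(min(top_k, sim.size(0))).indices
    return [(int(w), int(best_patch_idx[w])) for w in top_words]
```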
No abstract available
No abstract available
Neural image and text encoders have been proposed to align the abstract image and symbolic text representation. Global-local and local-local information integration between two modalities is essential for an effective alignment. In this paper, we present RELation-aware Adaptive Cross-attention (RELAX) that achieves state-of-the-art performance in cross-modal retrieval tasks by incorporating several novel improvements. First, cross-attention methods integrate global-local information via weighted global feature of a modality (taken as value) for a local feature of the other modality (taken as query). We can make more accurate alignments if we could also consider the global weights of the query modality. To this end, we introduce adaptive embedding to consider the weights. Second, to enhance the usage of scene-graphs that can capture the high-level relation of local features, we introduce transformer encoders for textual scene graphs to align with visual scene graphs. Lastly, we use NT-XEnt loss that takes the weighted sum of the samples based on their importance. We show that our approach is effective in extensive experiments that outperform other state-of-the-art models.
Recent progress in aligning images with texts has achieved remarkable results, however, existing models tend to serve general queries and often fall short when dealing with detailed query requirements. In this paper, we work towards Entity-centric Image-Text Matching (EITM), a finer-grained image-text matching task that aligns texts and images centered around specific entities. The main challenge in EITM lies in bridging the substantial semantic gap between entity-related information in texts and images, which is more pronounced than in general image-text matching problems. To address this challenge, we adopt CLIP as our foundational model and devise a Multimodal Attentive Experts (MMAE)-based contrastive learning to adapt CLIP into an expert for EITM problem. Particularly, the core of our multimodal attentive experts learning is to generate explanation texts by Large Language Models (LLMs) as bridging clues. In specific, we first employ off-the-shelf LLMs to generate explanatory text. This text, along with the original image and text, is then fed into our Multimodal Attentive Experts module to narrow the semantic gap within a unified semantic space. Upon the enriched feature representations generated by MMAE, we have further developed an effective Gated Integrative Image-text Matching (GI-ITM) strategy. GI-ITM utilizes an adaptive gating mechanism to combine features from MMAE, followed by applying image-text matching constraints to enhance the alignment precision. Our method has been extensively evaluated on three social media news benchmarks: N24News, VisualNews, and GoodNews. The experimental results demonstrate that our approach significantly outperforms competing methods. Our code is available at: https://github.com/wangyxxjtu/ETE.
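The gated integration step (GI-ITM) can be sketched as a learned gate that mixes the original CLIP feature with the explanation-enriched expert feature before matching. Shapes and the gating form below are assumptions made for illustration only.

```python
# Hedged sketch of gated feature integration in the spirit of GI-ITM (a
# simplification, not the released code): a sigmoid gate adaptively blends the
# CLIP feature with the LLM-explanation-enriched expert feature.
import torch
import torch.nn as nn

class GatedIntegration(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, clip_feat, expert_feat):
        g = self.gate(torch.cat([clip_feat, expert_feat], dim=-1))  # (B, D) gate in [0, 1]
        return g * expert_feat + (1 - g) * clip_feat
```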
With the rapid advancement of text-to-image (T2I) generation models, assessing the semantic alignment between generated images and text descriptions has become a significant research challenge. Current methods, including those based on Visual Question Answering (VQA), still struggle with fine-grained assessments and precise quantification of image-text alignment. This paper presents an improved evaluation method named Instruction-augmented Multimodal Alignment for Image-Text and Element Matching (iMatch), which evaluates image-text semantic alignment by fine-tuning multimodal large language models. We introduce four innovative augmentation strategies: First, the QAlign strategy creates a precise probabilistic mapping to convert discrete scores from multimodal large language models into continuous matching scores. Second, a validation set augmentation strategy uses pseudo-labels from model predictions to expand training data, boosting the model's generalization performance. Third, an element augmentation strategy integrates element category labels to refine the model's understanding of image-text matching. Fourth, an image augmentation strategy employs techniques like random lighting to increase the model's robustness. Additionally, we propose prompt type augmentation and score perturbation strategies to further enhance the accuracy of element assessments. Our experimental results show that the iMatch method significantly surpasses existing methods, confirming its effectiveness and practical value. Furthermore, our iMatch won first place in the CVPR NTIRE 2025 Text to Image Generation Model Quality Assessment - Track 1 Image-Text Alignment.
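A hedged sketch of the QAlign idea of mapping discrete scores to a continuous alignment score: the model's probabilities over a small score vocabulary are converted into an expectation. The score range and token handling here are assumptions based only on the abstract.

```python
# Illustrative sketch (not the iMatch implementation): convert a multimodal LLM's
# logits over discrete score tokens (e.g., "1".."5") into a continuous image-text
# matching score via a probability-weighted expectation.
import torch

def expected_alignment_score(score_logits, score_values=(1.0, 2.0, 3.0, 4.0, 5.0)):
    """score_logits: (5,) logits over the discrete score tokens."""
    probs = torch.softmax(score_logits, dim=-1)
    values = torch.tensor(score_values)
    return float((probs * values).sum())        # continuous score in [1, 5]
```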
Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages the power of uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method can consistently improve the performance of image-text retrieval and achieve new state-of-the-art results. Furthermore, our method can also boost the uni-modal retrieval performance of image-text retrieval models, enabling it to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
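The soft-label idea can be sketched as a KL term that pulls the cross-modal similarity distribution toward soft labels produced by a frozen uni-modal teacher, which softens false negatives. This simplification is based only on the abstract, not the released CUSA code.

```python
# Hedged sketch of cross-modal soft-label alignment in the spirit of CUSA: a
# frozen uni-modal teacher provides soft similarity targets over the batch, and
# the student's image-text similarity distribution is pulled toward them.
import torch
import torch.nn.functional as F

def soft_label_alignment_loss(img_emb, txt_emb, teacher_txt_emb, tau=0.05):
    """img_emb, txt_emb: (B, D) student embeddings; teacher_txt_emb: (B, D) frozen teacher."""
    student = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t() / tau
    with torch.no_grad():
        t = F.normalize(teacher_txt_emb, dim=-1)
        teacher = t @ t.t() / tau                       # uni-modal soft labels
    return F.kl_div(F.log_softmax(student, dim=-1),
                    F.softmax(teacher, dim=-1), reduction="batchmean")
```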
Pre-trained vision-language models (VLMs) and language models (LMs) have recently garnered significant attention due to their remarkable ability to represent textual concepts, opening up new avenues in vision tasks. In medical image segmentation, efforts are being made to integrate text and image data using VLMs and LMs. However, current text-enhanced approaches face several challenges. First, using separate pre-trained vision and text models to encode image and text data can result in semantic shifts. Second, while VLMs can establish the correspondence between visual and textual features when pre-trained on paired image-text data, this alignment often deteriorates during segmentation tasks due to misalignment between the text and vision components in ongoing learning. In this paper, we propose TeViA, a novel approach that seamlessly integrates with various vision and text models, irrespective of their pre-training relationships. This integration is achieved through a segmentation-specific text-to-vision alignment design, ensuring both information gain and semantic consistency. Specifically, for each training data, a foreground visual representation is extracted from the segmentation head and used to supervise projection layers, thereby adjusting the textual features to better contribute to the segmentation task. Additionally, a historic visual prototype is created by aggregating target semantics from all training data and is updated using a momentum-based manner. This prototype aims to enhance the visual representation of each data instance by establishing feature-level connections, which in turn refines the textual features. The superiority of TeViA is validated on five public datasets, exhibiting over 6% Dice improvements compared to vision-only methods. Code is available at: https://github.com/jgfiuuuu/TeViA
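A minimal sketch of the momentum-updated visual prototype and the text-to-vision alignment it supervises; variable names, the momentum value, and the cosine loss form are assumptions made for illustration, not the TeViA release.

```python
# Hedged sketch of a momentum-updated visual prototype: foreground visual
# features aggregated per batch update a historic prototype, which then
# supervises the projection of text features toward segmentation-relevant semantics.
import torch
import torch.nn.functional as F

def update_prototype(prototype, fg_features, momentum=0.99):
    """prototype: (D,) running target-semantic prototype; fg_features: (N, D)."""
    batch_proto = F.normalize(fg_features.mean(dim=0), dim=-1)
    new_proto = momentum * prototype + (1.0 - momentum) * batch_proto
    return F.normalize(new_proto, dim=-1)

def text_to_vision_alignment_loss(text_feat, prototype):
    """Pull projected text features toward the visual prototype via cosine distance."""
    text_feat = F.normalize(text_feat, dim=-1)
    return 1.0 - F.cosine_similarity(text_feat, prototype, dim=-1).mean()
```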
No abstract available
Image-text matching is a fundamental task in bridging the semantics between vision and language. The key challenge lies in establishing accurate alignment between two heterogeneous modalities. Existing cross-modal fine-grained matching methods normally include two alignment directions, “word to region” and “region to word”, and the overall image-text similarity is calculated from the alignments. However, the alignment of these two directions is typically independent, that is, the alignment of “word to region” and “region to word” is irrelevant, so the alignment consistency cannot be guaranteed in two directions, which inevitably introduces inconsistent alignments, leading to potential inaccurate image-text matching results. In this paper, we propose a novel Bidirectional cOnsistency netwOrks for cross-Modal alignment (BOOM), which achieves more accurate cross-modal semantic alignments by imposing explicit consistency constraints in both directions. Specifically, according to three aspects reflected by alignment consistency, i.e., significance, wholeness, and alignment orderliness, we design a novel systematic multi-granularity consistency constraints: point-wise consistency, which enforces consistency of the most significant single word item in bidirectional alignments; set-wise consistency, which maintains more comprehensive and accurate bidirectional entire alignment values consistent and order-wise consistency, which ensures order consistency of bidirectional alignment results. Bidirectional cross-modal alignment between words and regions is corrected from three different perspectives: maximum, distribution, and order. Extensive experiments on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that our BOOM achieves state-of-the-art performance.
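The point-wise consistency constraint can be sketched as requiring that the alignment weight assigned to the single most significant word-region pair agree between the two directions. This is a deliberately simplified reading of the abstract, not the BOOM objective as published.

```python
# Hedged sketch of a point-wise bidirectional consistency term: the weight that
# the word->region and region->word alignment distributions assign to the most
# significant (word, region) pair should agree.
import torch
import torch.nn.functional as F

def pointwise_consistency(word_embs, region_embs, tau=0.1):
    """word_embs: (W, D); region_embs: (R, D)."""
    sim = F.normalize(word_embs, dim=-1) @ F.normalize(region_embs, dim=-1).t()  # (W, R)
    w2r = F.softmax(sim / tau, dim=1)    # word -> region: distribution over regions per word
    r2w = F.softmax(sim / tau, dim=0)    # region -> word: distribution over words per region
    flat_idx = sim.flatten().argmax()    # most significant (word, region) pair
    w, r = divmod(int(flat_idx), sim.size(1))
    return (w2r[w, r] - r2w[w, r]).abs()
```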
No abstract available