Research on Intelligent Diagnosis of Laryngeal Diseases Based on Deep Alignment of Laryngoscopy Images and Medical Record Text
Medical Image Enhancement and Clinical Disease Knowledge Alignment
These works target the medical domain specifically (e.g., X-rays and radiology reports), exploring how to achieve deep image-text alignment by integrating anatomical structures, pathological features, and disease knowledge bases, so as to generate accurate diagnostic reports or synthesize high-quality medical images.
- Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable Cyclic Image-Report Generation (Wenting Chen, Linlin Shen, Jingyang Lin, Jiebo Luo, Xiang Li, Yixuan Yuan, 2024, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation (Sang-Jun Park, Keun-Soo Heo, Dong-Hee Shin, Young-Han Son, Ji-Hye Oh, Tae-Eui Kam, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting (Wenting Chen, Pengyu Wang, Hui Ren, Lichao Sun, Quanzheng Li, Yixuan Yuan, Xiang Li, 2024, International Conference on Medical Image Computing and Computer-Assisted Intervention)
- DKA-RG: Disease-Knowledge-Enhanced Fine-Grained Image–Text Alignment for Automatic Radiology Report Generation (Heng Yin, Wei Wu, Yongtao Hao, 2024, Electronics)
Fine-Grained Spatial and Multi-Level Semantic Matching Mechanisms
This group of studies addresses the granularity of the alignment process: by relating local regions (patches/subregions) to specific word items (words/tags), imposing bidirectional consistency constraints, and fusing multi-scale features, they mitigate the information loss caused by coarse-grained alignment.
- Improving Image-Text Matching With Bidirectional Consistency of Cross-Modal Alignment (Zhe Li, Lei Zhang, Kun Zhang, Yongdong Zhang, Zhendong Mao, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training (Longtian Qiu, Shan Ning, Xuming He, 2024, AAAI Conference on Artificial Intelligence)
- Enhancing image–text matching through multi-level semantic consistency alignment (Liqi Zhu, Dezhi Han, Xiang Shen, Chongqing Chen, Kuan Ching Li, 2025, The Visual Computer)
- Global-local prompts guided image-text embedding, alignment and aggregation for multi-label zero-shot learning (Tiecheng Song, Yu Huang, Feng Yang, Anyong Qin, Yue Zhao, Chenqiang Gao, 2025, Journal of Visual Communication and Image Representation)
- Image-Text Alignment using Adaptive Cross-attention with Transformer Encoder for Scene Graphs (Juyong Song, Sunghyun Choi, 2021, Proceedings of the British Machine Vision Conference 2021)
Text-Guided Visual Segmentation and Lesion Localization Techniques
This line of work explores how semantic cues in text can guide pixel- or region-level localization in vision tasks, covering referring image segmentation (RIS), weakly supervised semantic segmentation, and anomaly detection, with the aim of improving spatial recognition of specific entities such as lesions.
- Harnessing Text Insights With Visual Alignment for Medical Image Segmentation (Qingjie Zeng, Huan Luo, Zilin Lu, Yutong Xie, Zhiyong Wang, Yanning Zhang, Yong Xia, 2025, IEEE Transactions on Medical Imaging)
- SimCLIP: Refining Image-Text Alignment with Simple Prompts for Zero-/Few-shot Anomaly Detection (Chenghao Deng, Haote Xu, Xiaolu Chen, Haodi Xu, Xiaotong Tu, Xinghao Ding, Yue Huang, 2024, Proceedings of the 32nd ACM International Conference on Multimedia)
- Extending CLIP’s Image-Text Alignment to Referring Image Segmentation (Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, Suha Kwak, 2024, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers))
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation (Soojin Jang, Jungmin Yun, Junehyoung Kwon, Eunju Lee, YoungBin Kim, 2024, European Conference on Computer Vision)
Domain Adaptation and Generalization of Vision-Language Pre-trained Models
These studies aim to optimize large-scale pre-trained models such as CLIP, tackling single-tag bias, cross-domain distribution shift (domain generalization), and the semantic gap in downstream tasks, and strengthening model adaptability through adapters, distillation, or prompt tuning.
- CLIPCEIL: Domain Generalization through CLIP via Channel rEfinement and Image-text aLignment (Xi Yu, Shinjae Yoo, Yuewei Lin, 2024, Advances in Neural Information Processing Systems 37)
- Beyond General Alignment: Fine-Grained Entity-Centric Image-Text Matching with Multimodal Attentive Experts (Yaxiong Wang, Lianwei Wu, Lechao Cheng, Zhun Zhong, Yujiao Wu, Meng Wang, 2025, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)
- TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias (Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, Kyungsu Kim, 2024, European Conference on Computer Vision)
Cross-Modal Alignment Optimization and Quality Evaluation Frameworks
This group of papers covers general strategies for improving image-text matching performance, such as soft-label alignment, adaptive embeddings, and instruction augmentation, as well as quantitative methods for evaluating the alignment quality of generative models, ensuring the robustness of multimodal systems in both retrieval and evaluation.
- Adaptive Cross-Modal Embeddings for Image-Text Alignment (Jonatas Wehrmann, Rodrigo C. Barros, Camila Kolling, 2020, Proceedings of the AAAI Conference on Artificial Intelligence)
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment (Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Y. Lee, Krishna Kumar Singh, 2024, European Conference on Computer Vision)
- Instruction-Augmented Multimodal Alignment for Image-Text and Element Matching (Xinli Yue, Jianhui Sun, Junda Lu, Liangchao Yao, Fan Xia, Tianyi Wang, Fengyun Rao, Jing Lyu, Yuetang Deng, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Unified learning for image-text alignment via multi-scale feature fusion (Jingfeng Zhou, Meng Wang, 2025, Computer Vision and Image Understanding)
- Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval (Hailang Huang, Zhijie Nie, Ziqiao Wang, Ziyu Shang, 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
- Image-Text Alignment and Retrieval Using Light-Weight Transformer (Wenrui Li, Xiaopeng Fan, 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
Taken together, this collection examines cutting-edge techniques for deep alignment between the visual and language modalities; in medical image analysis in particular, it traces the evolution from general-purpose pre-trained models (e.g., CLIP) toward fine-grained alignment with medical knowledge. The research focuses on strengthening diagnostic report generation with clinical knowledge, improving lesion localization accuracy through fine-grained matching, optimizing cross-modal semantic consistency in segmentation tasks, and establishing more rigorous frameworks for evaluating alignment quality. These techniques provide complete methodological support, from low-level feature matching to high-level reasoning, for intelligent diagnosis of laryngeal diseases based on laryngoscopy images and medical record text.
A total of 22 related references.
The automatic generation of radiology reports has emerged as a promising solution to reduce a time-consuming task and accurately capture critical disease-relevant findings in X-ray images. Previous approaches for radiology report generation have shown impressive performance. However, there remains significant potential to improve accuracy by ensuring that retrieved reports contain disease-relevant findings similar to those in the X-ray images and by refining generated reports. In this study, we propose a Disease-aware image-text Alignment and self-correcting Re-alignment for Trustworthy radiology report generation (DART) framework. In the first stage, we generate initial reports based on image-to-text retrieval with disease-matching, embedding both images and texts in a shared embedding space through contrastive learning. This approach ensures the retrieval of reports with similar disease-relevant findings that closely align with the input X-ray images. In the second stage, we further enhance the initial reports by introducing a self-correction module that re-aligns them with the X-ray images. Our proposed framework achieves state-of-the-art results on two widely used benchmarks, surpassing previous approaches in both report generation and clinical efficacy metrics, thereby enhancing the trustworthiness of radiology reports.
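To make the stage-one mechanism concrete, the following PyTorch sketch illustrates a CLIP-style shared embedding space trained with a symmetric contrastive loss, plus image-to-report retrieval. This is a minimal approximation of the idea described in the abstract, not the DART implementation; the function names, temperature, and retrieval interface are illustrative assumptions.

```python
# Minimal sketch (not the DART authors' code) of stage-one style image-report
# contrastive alignment: images and reports are embedded in a shared space so
# that reports with similar disease-relevant findings can be retrieved for a query image.
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/report embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> report direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # report -> image direction
    return (loss_i2t + loss_t2i) / 2

def retrieve_reports(query_img_emb, report_bank_emb, k=3):
    """Image-to-text retrieval: return indices of the k most similar reports."""
    sims = F.normalize(query_img_emb, dim=-1) @ F.normalize(report_bank_emb, dim=-1).t()
    return sims.topk(k, dim=-1).indices
```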
Domain generalization (DG) is a fundamental yet challenging topic in machine learning. Recently, the remarkable zero-shot capabilities of the large pre-trained vision-language model (e.g., CLIP) have made it popular for various downstream tasks. However, the effectiveness of this capacity often degrades when there are shifts in data distribution during testing compared to the training data. In this paper, we propose a novel method, known as CLIPCEIL, a model that utilizes Channel rEfinement and Image-text aLignment to adapt CLIP to inaccessible out-of-distribution test datasets that exhibit domain shifts. Specifically, we refine the feature channels in the visual domain to ensure they contain domain-invariant and class-relevant features by using a lightweight adapter. This is achieved by minimizing the inter-domain variance while maximizing the inter-class variance. In the meantime, we ensure the image-text alignment by aligning text embeddings of the class descriptions and their corresponding image embedding while further removing the domain-specific features. Moreover, our model integrates multi-scale CLIP features by utilizing a self-attention fusion module, technically implemented through one Transformer layer. Extensive experiments on five widely used benchmark datasets demonstrate that CLIPCEIL outperforms the existing state-of-the-art methods. The source code is available at https://github.com/yuxi120407/CLIPCEIL.
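The channel-refinement objective (minimize inter-domain variance while maximizing inter-class variance) can be sketched as follows. This is a rough, assumption-laden illustration of the idea stated in the abstract; the released CLIPCEIL code may compute these statistics differently.

```python
# Hedged sketch of a variance-based channel-refinement objective in the spirit of
# CLIPCEIL (not the released implementation): refine visual feature channels so
# they are domain-invariant (small spread across domain means) and
# class-discriminative (large spread across class means).
import torch

def channel_refinement_loss(features, domain_ids, class_ids, eps=1e-6):
    """features: (N, C) adapter outputs; domain_ids/class_ids: (N,) integer labels."""
    def spread_of_group_means(labels):
        means = torch.stack([features[labels == g].mean(0) for g in labels.unique()])
        return means.var(dim=0, unbiased=False).mean()
    inter_domain = spread_of_group_means(domain_ids)   # want this small
    inter_class = spread_of_group_means(class_ids)     # want this large
    return inter_domain / (inter_class + eps)
```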
Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis on the CLIP latent space which leads to two findings. Firstly, we observe that the CLIP's visual feature of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. In addition, we show that the modality gap between a paired image-text can be empirically modeled as a zero-mean Gaussian distribution. Motivated by the findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation to leverage local region information, which produces a compact visual representation for matching text representation. Moreover, we incorporate a noise injection and CLIP reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap.
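A small sketch of the text-only training trick implied by the modality-gap finding: since the gap is modeled as zero-mean Gaussian noise, text embeddings can be perturbed during training to stand in for image embeddings at inference. The noise scale below is an assumed placeholder, not the value used in the paper.

```python
# Illustrative sketch (assumptions, not the MacCap code): the abstract models the
# image-text modality gap as zero-mean Gaussian noise, so during text-only training
# the caption decoder can be conditioned on noised CLIP text embeddings that
# approximate image embeddings seen at inference time.
import torch
import torch.nn.functional as F

def noised_text_embedding(text_emb, noise_std=0.016):
    """Add zero-mean Gaussian noise to a CLIP text embedding and re-normalize.
    noise_std is an assumed placeholder hyperparameter."""
    noisy = text_emb + torch.randn_like(text_emb) * noise_std
    return F.normalize(noisy, dim=-1)
```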
Data scarcity and privacy concerns limit the availability of high-quality medical images for public use, which can be mitigated through medical image synthesis. However, current medical image synthesis methods often struggle to accurately capture the complexity of detailed anatomical structures and pathological conditions. To address these challenges, we propose a novel medical image synthesis model that leverages fine-grained image-text alignment and anatomy-pathology prompts to generate highly detailed and accurate synthetic medical images. Our method integrates advanced natural language processing techniques with image generative modeling, enabling precise alignment between descriptive text prompts and the synthesized images' anatomical and pathological details. The proposed approach consists of two key components: an anatomy-pathology prompting module and a fine-grained alignment-based synthesis module. The anatomy-pathology prompting module automatically generates descriptive prompts for high-quality medical images. To further synthesize high-quality medical images from the generated prompts, the fine-grained alignment-based synthesis module pre-defines a visual codebook for the radiology dataset and performs fine-grained alignment between the codebook and generated prompts to obtain key patches as visual clues, facilitating accurate image synthesis. We validate the superiority of our method through experiments on public chest X-ray datasets and demonstrate that our synthetic images preserve accurate semantic information, making them valuable for various medical applications.
No abstract available
In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance between positive and negative captions to ensure that the alignment model does not depend solely on textual information but also considers the associated images for predicting alignment accurately. By creating this enhanced training data, we fine-tune an existing leading visual-language model to boost its capability in understanding alignment. Our model significantly outperforms current top-performing methods across various datasets. We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment. Project page: \url{https://yuheng-li.github.io/LLaVA-score/}
We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels then employs a self-distillation strategy to align combined masks with the text-derived mask. This approach ensures the unbiased image-text alignment of the CLIP-based models using only image-text pairs without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at https://github.com/shjo-april/TTD.
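The tag-relevance step can be illustrated with the following hedged sketch: each tag's relevance is taken as its similarity to the closest dense image embedding, which is the intuition behind counteracting single tag bias. Variable names and the threshold are assumptions rather than the authors' code.

```python
# Rough sketch of the tag-relevance idea described in the TTD abstract (names and
# the threshold are assumptions): a tag is considered image-relevant if it is
# similar to at least one pixel/patch embedding, rather than letting a single tag
# dominate the sentence-level text embedding.
import torch
import torch.nn.functional as F

def image_relevant_tags(tag_embs, pixel_embs, threshold=0.25):
    """tag_embs: (T, D) text-tag embeddings; pixel_embs: (P, D) dense image embeddings."""
    tag_embs = F.normalize(tag_embs, dim=-1)
    pixel_embs = F.normalize(pixel_embs, dim=-1)
    sim = tag_embs @ pixel_embs.t()            # (T, P) tag-to-pixel similarity
    relevance, _ = sim.max(dim=1)              # nearest-pixel similarity per tag
    keep = (relevance > threshold).nonzero(as_tuple=True)[0]
    return keep, relevance
```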
Automatic radiology report generation is a task that combines artificial intelligence and medical information processing, and it fully relies on computer vision and natural language processing techniques. Nowadays, automatic radiology report generation is still a very challenging task because it requires semantically adequate alignment of data from two modalities: radiology images and text. Existing approaches tend to focus on coarse-grained alignment at the global level and do not take into account the disease characteristics of radiology images at fine-grained semantics, which results in the generated reports potentially omitting key disease diagnostic descriptions. In this work, we propose a new approach, disease-knowledge-enhanced fine-grained image–text alignment for automatic radiology report generation (DKA-RG). The method combines global and disease-level alignment, thus facilitating the extraction of fine-grained disease features by the model. Our approach also introduces a knowledge graph to inject medical domain expertise into the model. Our proposed DKA-RG consists of two training steps: the image–report alignment stage and the image-to-report generation stage. In the alignment stage, we use global contrastive learning to align images and texts from a high level and also augment disease contrastive learning with medical knowledge to enhance the disease detection capability. In the report generation stage, the report text generated from the images is more accurate in describing the disease information thanks to sufficient alignment. Through extensive quantitative and qualitative experiments on two widely used datasets, we validate the effectiveness of our DKA-RG on the task of radiology report generation. Our DKA-RG achieves superior performance on multiple types of metrics (natural language generation and clinical efficacy metrics) compared to existing methods, demonstrating that the method can improve the reliability and accuracy of automatic radiology report generation systems.
Weakly supervised semantic segmentation (WSSS) approaches typically rely on class activation maps (CAMs) for initial seed generation, which often fail to capture global context due to limited supervision from image-level labels. To address this issue, we introduce DALNet, Dense Alignment Learning Network that leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity. Our key insight is to employ a dual-level alignment strategy: (1) Global Implicit Alignment (GIA) to capture global semantics by maximizing the similarity between the class token and the corresponding text embeddings while minimizing the similarity with background embeddings, and (2) Local Explicit Alignment (LEA) to improve object localization by utilizing spatial information from patch tokens. Moreover, we propose a cross-contrastive learning approach that aligns foreground features between image and text modalities while separating them from the background, encouraging activation in missing regions and suppressing distractions. Through extensive experiments on the PASCAL VOC and MS COCO datasets, we demonstrate that DALNet significantly outperforms state-of-the-art WSSS methods. Our approach, in particular, allows for more efficient end-to-end process as a single-stage method.
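As a rough illustration of the Global Implicit Alignment (GIA) idea, the sketch below pulls the image class token toward the text embeddings of present classes and away from a background text embedding. The loss form and names are assumptions based only on the abstract, not the DALNet code.

```python
# Hedged sketch of a GIA-style loss: maximize similarity between the class token
# and the text embeddings of classes present in the image, while minimizing
# similarity with a background text embedding.
import torch
import torch.nn.functional as F

def global_implicit_alignment_loss(cls_token, class_text_embs, bg_text_emb, tau=0.07):
    """cls_token: (D,); class_text_embs: (K, D) present classes; bg_text_emb: (D,)."""
    cls_token = F.normalize(cls_token, dim=-1)
    pos = F.normalize(class_text_embs, dim=-1) @ cls_token       # (K,) foreground sims
    neg = F.normalize(bg_text_emb, dim=-1) @ cls_token            # background sim
    logits = torch.cat([pos, neg.unsqueeze(0)]) / tau
    # maximize the probability mass assigned to the foreground class embeddings
    return -torch.log(torch.softmax(logits, dim=0)[:pos.numel()].sum())
```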
Recently, large pre-trained vision-language models, such as CLIP, have demonstrated significant potential in zero-/few-shot anomaly detection tasks. However, existing methods not only rely on expert knowledge to manually craft extensive text prompts but also suffer from a misalignment of high-level language features with fine-level vision features in anomaly segmentation tasks. In this paper, we propose a method, named SimCLIP, which focuses on refining the aforementioned misalignment problem through bidirectional adaptation of both Multi-Hierarchy Vision Adapter (MHVA) and Implicit Prompt Tuning (IPT). In this way, our approach requires only a simple binary prompt to efficiently accomplish anomaly classification and segmentation tasks in zero-shot scenarios. Furthermore, we introduce its few-shot extension, SimCLIP+, integrating the relational information among vision embeddings and skillfully merging the cross-modal synergy information between vision and language to address downstream anomaly detection tasks. Extensive experiments on two challenging datasets prove the more remarkable generalization capacity of our method compared to the current SOTA approaches. Our code is available at https://github.com/CH-ORGI/SimCLIP.
Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP’s inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP’s image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP’s image-text alignment to RIS.
To address these issues, we propose a novel Adaptive patch-word Matching (AdaMatch) model to correlate chest X-ray (CXR) image regions with words in medical reports and apply it to CXR-report generation to provide explainability for the generation process. AdaMatch exploits the fine-grained relation between adaptive patches and words to provide explanations of specific image regions with corresponding words. To capture the abnormal regions of varying sizes and positions, we introduce the Adaptive Patch extraction (AdaPatch) module to acquire the adaptive patches for these regions adaptively. In order to provide explicit explainability for CXR-report generation task, we propose an AdaMatch-based bidirectional large language model for Cyclic CXR-report generation (AdaMatch-Cyclic). It employs the AdaMatch to obtain the keywords for CXR images and `keypatches' for medical reports as hints to guide CXR-report generation. Extensive experiments on two publicly available CXR datasets prove the effectiveness of our method and its superior performance to existing methods.
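A minimal sketch of patch-word matching in this spirit: each word is scored against image patches, and the strongest word-patch pairs serve as keyword/"keypatch" hints for generation. This is an illustrative simplification, not the AdaMatch module itself; the adaptive patch extraction step is omitted.

```python
# Illustrative patch-word alignment sketch (not the AdaMatch implementation):
# score every word against every patch, keep each word's best patch, and return
# the strongest (word, patch) pairs as hints.
import torch
import torch.nn.functional as F

def patch_word_alignment(patch_embs, word_embs, top_k=5):
    """patch_embs: (P, D); word_embs: (W, D). Returns the top (word, patch) index pairs."""
    sim = F.normalize(word_embs, dim=-1) @ F.normalize(patch_embs, dim=-1).t()  # (W, P)
    best_patch_sim, best_patch_idx = sim.max(dim=1)        # best patch per word
    top_words = best_patch_sim.topk(min(top_k, sim.size(0))).indices
    return [(int(w), int(best_patch_idx[w])) for w in top_words]
```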
No abstract available
No abstract available
Neural image and text encoders have been proposed to align the abstract image and symbolic text representation. Global-local and local-local information integration between two modalities is essential for an effective alignment. In this paper, we present RELation-aware Adaptive Cross-attention (RELAX) that achieves state-of-the-art performance in cross-modal retrieval tasks by incorporating several novel improvements. First, cross-attention methods integrate global-local information via weighted global feature of a modality (taken as value) for a local feature of the other modality (taken as query). We can make more accurate alignments if we could also consider the global weights of the query modality. To this end, we introduce adaptive embedding to consider the weights. Second, to enhance the usage of scene-graphs that can capture the high-level relation of local features, we introduce transformer encoders for textual scene graphs to align with visual scene graphs. Lastly, we use NT-XEnt loss that takes the weighted sum of the samples based on their importance. We show that our approach is effective in extensive experiments that outperform other state-of-the-art models.
Recent progress in aligning images with texts has achieved remarkable results, however, existing models tend to serve general queries and often fall short when dealing with detailed query requirements. In this paper, we work towards Entity-centric Image-Text Matching (EITM), a finer-grained image-text matching task that aligns texts and images centered around specific entities. The main challenge in EITM lies in bridging the substantial semantic gap between entity-related information in texts and images, which is more pronounced than in general image-text matching problems. To address this challenge, we adopt CLIP as our foundational model and devise a Multimodal Attentive Experts (MMAE)-based contrastive learning to adapt CLIP into an expert for EITM problem. Particularly, the core of our multimodal attentive experts learning is to generate explanation texts by Large Language Models (LLMs) as bridging clues. In specific, we first employ off-the-shelf LLMs to generate explanatory text. This text, along with the original image and text, is then fed into our Multimodal Attentive Experts module to narrow the semantic gap within a unified semantic space. Upon the enriched feature representations generated by MMAE, we have further developed an effective Gated Integrative Image-text Matching (GI-ITM) strategy. GI-ITM utilizes an adaptive gating mechanism to combine features from MMAE, followed by applying image-text matching constraints to enhance the alignment precision. Our method has been extensively evaluated on three social media news benchmarks: N24News, VisualNews, and GoodNews. The experimental results demonstrate that our approach significantly outperforms competing methods. Our code is available at: https://github.com/wangyxxjtu/ETE.
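The gated integration step (GI-ITM) can be sketched as a learned gate that mixes the original CLIP feature with the explanation-enriched expert feature before matching. Shapes and the gating form below are assumptions made for illustration only.

```python
# Hedged sketch of gated feature integration in the spirit of GI-ITM (a
# simplification, not the released code): a sigmoid gate adaptively blends the
# CLIP feature with the LLM-explanation-enriched expert feature.
import torch
import torch.nn as nn

class GatedIntegration(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, clip_feat, expert_feat):
        g = self.gate(torch.cat([clip_feat, expert_feat], dim=-1))  # (B, D) gate in [0, 1]
        return g * expert_feat + (1 - g) * clip_feat
```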
With the rapid advancement of text-to-image (T2I) generation models, assessing the semantic alignment between generated images and text descriptions has become a significant research challenge. Current methods, including those based on Visual Question Answering (VQA), still struggle with fine-grained assessments and precise quantification of image-text alignment. This paper presents an improved evaluation method named Instruction-augmented Multimodal Alignment for Image-Text and Element Matching (iMatch), which evaluates image-text semantic alignment by fine-tuning multimodal large language models. We introduce four innovative augmentation strategies: First, the QAlign strategy creates a precise probabilistic mapping to convert discrete scores from multimodal large language models into continuous matching scores. Second, a validation set augmentation strategy uses pseudo-labels from model predictions to expand training data, boosting the model's generalization performance. Third, an element augmentation strategy integrates element category labels to refine the model's understanding of image-text matching. Fourth, an image augmentation strategy employs techniques like random lighting to increase the model's robustness. Additionally, we propose prompt type augmentation and score perturbation strategies to further enhance the accuracy of element assessments. Our experimental results show that the iMatch method significantly surpasses existing methods, confirming its effectiveness and practical value. Furthermore, our iMatch won first place in the CVPR NTIRE 2025 Text to Image Generation Model Quality Assessment - Track 1 Image-Text Alignment.
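A hedged sketch of the QAlign idea of mapping discrete scores to a continuous alignment score: the model's probabilities over a small score vocabulary are converted into an expectation. The score range and token handling here are assumptions based only on the abstract.

```python
# Illustrative sketch (not the iMatch implementation): convert a multimodal LLM's
# logits over discrete score tokens (e.g., "1".."5") into a continuous image-text
# matching score via a probability-weighted expectation.
import torch

def expected_alignment_score(score_logits, score_values=(1.0, 2.0, 3.0, 4.0, 5.0)):
    """score_logits: (5,) logits over the discrete score tokens."""
    probs = torch.softmax(score_logits, dim=-1)
    values = torch.tensor(score_values)
    return float((probs * values).sum())        # continuous score in [1, 5]
```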
Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages the power of uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method can consistently improve the performance of image-text retrieval and achieve new state-of-the-art results. Furthermore, our method can also boost the uni-modal retrieval performance of image-text retrieval models, enabling it to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
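The soft-label idea can be sketched as a KL term that pulls the cross-modal similarity distribution toward soft labels produced by a frozen uni-modal teacher, which softens false negatives. This simplification is based only on the abstract, not the released CUSA code.

```python
# Hedged sketch of cross-modal soft-label alignment in the spirit of CUSA: a
# frozen uni-modal teacher provides soft similarity targets over the batch, and
# the student's image-text similarity distribution is pulled toward them.
import torch
import torch.nn.functional as F

def soft_label_alignment_loss(img_emb, txt_emb, teacher_txt_emb, tau=0.05):
    """img_emb, txt_emb: (B, D) student embeddings; teacher_txt_emb: (B, D) frozen teacher."""
    student = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t() / tau
    with torch.no_grad():
        t = F.normalize(teacher_txt_emb, dim=-1)
        teacher = t @ t.t() / tau                       # uni-modal soft labels
    return F.kl_div(F.log_softmax(student, dim=-1),
                    F.softmax(teacher, dim=-1), reduction="batchmean")
```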
Pre-trained vision-language models (VLMs) and language models (LMs) have recently garnered significant attention due to their remarkable ability to represent textual concepts, opening up new avenues in vision tasks. In medical image segmentation, efforts are being made to integrate text and image data using VLMs and LMs. However, current text-enhanced approaches face several challenges. First, using separate pre-trained vision and text models to encode image and text data can result in semantic shifts. Second, while VLMs can establish the correspondence between visual and textual features when pre-trained on paired image-text data, this alignment often deteriorates during segmentation tasks due to misalignment between the text and vision components in ongoing learning. In this paper, we propose TeViA, a novel approach that seamlessly integrates with various vision and text models, irrespective of their pre-training relationships. This integration is achieved through a segmentation-specific text-to-vision alignment design, ensuring both information gain and semantic consistency. Specifically, for each training data, a foreground visual representation is extracted from the segmentation head and used to supervise projection layers, thereby adjusting the textual features to better contribute to the segmentation task. Additionally, a historic visual prototype is created by aggregating target semantics from all training data and is updated using a momentum-based manner. This prototype aims to enhance the visual representation of each data instance by establishing feature-level connections, which in turn refines the textual features. The superiority of TeViA is validated on five public datasets, exhibiting over 6% Dice improvements compared to vision-only methods. Code is available at: https://github.com/jgfiuuuu/TeViA
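A minimal sketch of the momentum-updated visual prototype and the text-to-vision alignment it supervises; variable names, the momentum value, and the cosine loss form are assumptions made for illustration, not the TeViA release.

```python
# Hedged sketch of a momentum-updated visual prototype: foreground visual
# features aggregated per batch update a historic prototype, which then
# supervises the projection of text features toward segmentation-relevant semantics.
import torch
import torch.nn.functional as F

def update_prototype(prototype, fg_features, momentum=0.99):
    """prototype: (D,) running target-semantic prototype; fg_features: (N, D)."""
    batch_proto = F.normalize(fg_features.mean(dim=0), dim=-1)
    new_proto = momentum * prototype + (1.0 - momentum) * batch_proto
    return F.normalize(new_proto, dim=-1)

def text_to_vision_alignment_loss(text_feat, prototype):
    """Pull projected text features toward the visual prototype via cosine distance."""
    text_feat = F.normalize(text_feat, dim=-1)
    return 1.0 - F.cosine_similarity(text_feat, prototype, dim=-1).mean()
```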
No abstract available
Image-text matching is a fundamental task in bridging the semantics between vision and language. The key challenge lies in establishing accurate alignment between two heterogeneous modalities. Existing cross-modal fine-grained matching methods normally include two alignment directions, “word to region” and “region to word”, and the overall image-text similarity is calculated from the alignments. However, the alignment of these two directions is typically independent, that is, the alignment of “word to region” and “region to word” is irrelevant, so the alignment consistency cannot be guaranteed in two directions, which inevitably introduces inconsistent alignments, leading to potential inaccurate image-text matching results. In this paper, we propose a novel Bidirectional cOnsistency netwOrks for cross-Modal alignment (BOOM), which achieves more accurate cross-modal semantic alignments by imposing explicit consistency constraints in both directions. Specifically, according to three aspects reflected by alignment consistency, i.e., significance, wholeness, and alignment orderliness, we design a novel systematic multi-granularity consistency constraints: point-wise consistency, which enforces consistency of the most significant single word item in bidirectional alignments; set-wise consistency, which maintains more comprehensive and accurate bidirectional entire alignment values consistent and order-wise consistency, which ensures order consistency of bidirectional alignment results. Bidirectional cross-modal alignment between words and regions is corrected from three different perspectives: maximum, distribution, and order. Extensive experiments on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that our BOOM achieves state-of-the-art performance.
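The point-wise consistency constraint can be sketched as requiring that the alignment weight assigned to the single most significant word-region pair agree between the two directions. This is a deliberately simplified reading of the abstract, not the BOOM objective as published.

```python
# Hedged sketch of a point-wise bidirectional consistency term: the weight that
# the word->region and region->word alignment distributions assign to the most
# significant (word, region) pair should agree.
import torch
import torch.nn.functional as F

def pointwise_consistency(word_embs, region_embs, tau=0.1):
    """word_embs: (W, D); region_embs: (R, D)."""
    sim = F.normalize(word_embs, dim=-1) @ F.normalize(region_embs, dim=-1).t()  # (W, R)
    w2r = F.softmax(sim / tau, dim=1)    # word -> region: distribution over regions per word
    r2w = F.softmax(sim / tau, dim=0)    # region -> word: distribution over words per region
    flat_idx = sim.flatten().argmax()    # most significant (word, region) pair
    w, r = divmod(int(flat_idx), sim.size(1))
    return (w2r[w, r] - r2w[w, r]).abs()
```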
No abstract available