Cross-Modal Reasoning Jailbreak Attacks on Multimodal Large Language Models
Multimodal Foundation Model Architectures and Cross-Modal Alignment Mechanisms
This group of papers examines the foundational architectures of multimodal models such as CLIP and its variants, training optimizations (e.g., long-text and region-focused training), and the internal mechanics of cross-modal representation spaces. These studies show how models bridge visual and textual semantics, providing the theoretical groundwork for understanding the root causes of cross-modal jailbreaks, such as the safety degradation introduced by vision-language alignment.
- EVA-CLIP: Improved Training Techniques for CLIP at Scale(Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao, 2023, ArXiv Preprint)
- DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training(Yihao Chen, Xianbiao Qi, Jianan Wang, Lei Zhang, 2023, ArXiv Preprint)
- Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations(Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei‐Ying Ma, 2019, No journal)
- Delving into the Openness of CLIP(Shuhuai Ren, Lei Li, Xuancheng Ren, Guangxiang Zhao, Xu Sun, 2022, ArXiv Preprint)
- Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP(Vedant Palit, Rohan Pandey, Aryaman Arora, Paul Pu Liang, 2023, ArXiv Preprint)
- Long-CLIP: Unlocking the Long-Text Capability of CLIP(Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang, 2024, ArXiv Preprint)
- CLIP Model for Images to Textual Prompts Based on Top-k Neighbors(Xin Zhang, Xin Zhang, YeMing Cai, Tianzhi Jia, 2024, ArXiv Preprint)
- Sparks of Artificial General Intelligence: Early experiments with GPT-4(Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio Ribeiro, Yi Zhang, 2023, arXiv (Cornell University))
- Alpha-CLIP: A CLIP Model Focusing on Wherever You Want(Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang, 2023, ArXiv Preprint)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning(Jiarui Yu, Haoran Li, Yanbin Hao, Bin Zhu, Tong Xu, Xiangnan He, 2023, No journal)
- SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis(Azade Farshad, Yousef Yeganeh, Yu Chi, Chengzhi Shen, Björn Ommer, Nassir Navab, 2023, No journal)
- What matters when building vision-language models?(Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh, 2024, arXiv (Cornell University))
- Toward a Holistic Evaluation of Robustness in CLIP Models(Weijie Tu, Weijian Deng, Tom Gedeon, 2024, ArXiv Preprint)
- An implementation of the "Guess who?" game using CLIP(Arnau Martí Sarri, Victor Rodriguez-Fernandez, 2021, ArXiv Preprint)
- Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment(Runqi Wang, Hao Zheng, Xiaoyue Duan, Jianzhuang Liu, Yuning Lu, Tian Wang, Songcen Xu, Baochang Zhang, 2023, No journal)
- Parrot Captions Teach CLIP to Spot Text(Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou, 2023, ArXiv Preprint)
- GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models(Guanxi Shen, 2025, ArXiv Preprint)
- Probing Cross-modal Semantics Alignment Capability from the Textual Perspective(Zheng Ma, Shi Zong, Mianzhi Pan, Jianbing Zhang, Shujian Huang, Xinyu Dai, Jiajun Chen, 2022, ArXiv Preprint)
- How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?(Seongyun Lee, Geewook Kim, Jiyeon Kim, Hyunji Lee, Hoyeon Chang, Sue Hyun Park, Minjoon Seo, 2024, ArXiv Preprint)
- Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models(Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen, 2024, Lecture notes in computer science)
Universal Jailbreak Attacks via Adversarial Perturbation and Black-Box Optimization
This group focuses on generating adversarial image perturbations, visual patches, or textual suffixes through gradient-based optimization, diffusion-model transfer, or black-box search. These methods seek universal attack vectors that exploit model fragility on unnatural, out-of-distribution inputs and achieve high jailbreak success rates across different models.
- Highly Transferable Diffusion-based Unrestricted Adversarial Attack on Pre-trained Vision-Language Models(Wenzhuo Xu, Kai Chen, Ziyi Gao, Zhipeng Wei, Jingjing Chen, Yu–Gang Jiang, 2024, No journal)
- Transferable Multimodal Attack on Vision-Language Pre-training Models(Haodi Wang, Kai Dong, Zhilei Zhu, Haotong Qin, Aishan Liu, Xiaolin Fang, Jiakai Wang, Xianglong Liu, 2024, No journal)
- Gradient-based Jailbreak Images for Multimodal Fusion Models(Javier Rando, Hannah Korevaar, Erik Brinkman, Ivan Evtimov, Florian Tramèr, 2024, arXiv (Cornell University))
- Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts(Yuanwei Wu, Xiang Li, Yixin Liu, Pan Zhou, Lichao Sun, 2023, arXiv (Cornell University))
- ImgTrojan: Jailbreaking Vision-Language Models with ONE Image(Xijia Tao, Shuai Zhong, Lei Li, Qi Liu, Lingpeng Kong, 2025, No journal)
- An Image Is Worth 1000 Lies: Adversarial Transferability across Prompts on Vision-Language Models(Haochen Luo, Jindong Gu, Fengyuan Liu, Philip H. S. Torr, 2024, arXiv (Cornell University))
- Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models(Erfan Shayegani, Yue Dong, Nael Abu‐Ghazaleh, 2023, arXiv (Cornell University))
- PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization(Ruoxi Cheng, Yizhong Ding, Shuirong Cao, S. W. K. Yuan, Zhiqiang Wang, Xiaojun Jia, Simeng Qin, 2024, arXiv (Cornell University))
- White-box Multimodal Jailbreaks Against Large Vision-Language Models(Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu–Gang Jiang, 2024, No journal)
- Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities(Jiahui Geng, Thy Thy Tran, Preslav Nakov, Iryna Gurevych, 2025, No journal)
- Adversarial Prompt Tuning for Vision-Language Models(Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu–Gang Jiang, Jitao Sang, 2024, Lecture notes in computer science)
- Black Box Adversarial Prompting for Foundation Models(Natalie Maus, Patrick Chao, Eric Wong, Jacob R. Gardner, 2023, arXiv (Cornell University))
- Jailbreaking Attack against Multimodal Large Language Model(Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, Rong Jin, 2024, arXiv (Cornell University))
- Align Is Not Enough: Multimodal Universal Jailbreak Attack Against Multimodal Large Language Models(Youze Wang, Wenbo Hu, Yinpeng Dong, Jing Liu, Hanwang Zhang, Richang Hong, 2025, IEEE Transactions on Circuits and Systems for Video Technology)
- Cross-Modality Attack Boosted by Gradient-Evolutionary Multiform Optimization(Yunpeng Gong, Qingyuan Zeng, Dejun Xu, Zhenzhong Wang, Min Jiang, 2024, ArXiv Preprint)
- Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency(Shuo Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Jialing Tao, Yuan-Hsuan Chen, Hui Xue, Xingxing Wei, 2025, arXiv (Cornell University))
- Adversarial Illusions in Multi-Modal Embeddings(Eugene Bagdasaryan, Vitaly Shmatikov, 2023, arXiv (Cornell University))
- Image Hijacks: Adversarial Images can Control Generative Models at Runtime(Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons, 2023, arXiv (Cornell University))
- Visual Adversarial Examples Jailbreak Aligned Large Language Models(Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal, 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
- The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models(Yupeng Chen, Junchi Yu, Aoxi Liu, Philip Torr, Adel Bibi, 2026, ArXiv Preprint)
Advanced Attacks via Reasoning-Chain Induction and Covert Semantic Injection
This group covers more strategic attack techniques, including exploitation of visual chain-of-thought (Visual CoT) reasoning, logical decomposition (e.g., flowcharts, ASCII art), gamified traps, and covert embedding of malicious instructions into images or audio. These attacks exploit the models' higher-order reasoning abilities and cross-modal consistency flaws, making them difficult for safety filters to intercept through simple keyword matching.
- JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering(Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, Qinglin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang, 2025, No journal)
- GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models(Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia, 2026, ArXiv.org)
- FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts(Ziyi Zhang, Zhen Sun, Zongmin Zhang, Jihui Guo, Xinlei He, 2025, ArXiv Preprint)
- PolyJailbreak: Cross-Modal Jailbreaking Attacks on Black-Box Multimodal LLMs(Xinkai Wang, Beibei Li, Zili Shao, Ao Liu, Shouling Ji, 2025, ArXiv.org)
- Jailbreak attack with multimodal virtual scenario hypnosis for vision-language models(Xiayang Shi, Shangfeng Chen, Gang Zhang, Wei Wei, Yinlin Li, Zhaoxin Fan, Jingjing Liu, 2025, Pattern Recognition)
- VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models(Bingrui Sima, L.-J. Cong, Wenxuan Wang, Kun He, 2025, ArXiv.org)
- Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models(Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li, Liangli Zhen, Xiaohua Xu, 2025, ArXiv.org)
- Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks(Yu Yan, Sheng Sun, Shengjia Cheng, T.H. Liu, Mingfeng Li, Min Liu, 2026, arXiv (Cornell University))
- Distraction is All You Need for Multimodal Large Language Model Jailbreaking(Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, Changyu Dong, 2025, No journal)
- Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models(Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin, 2025, ArXiv.org)
- Prompt injection attacks on vision language models in oncology(Jan Clusmann, Dyke Ferber, Isabella C. Wiest, Carolin V. Schneider, Titus J. Brinker, Sebastian Foersch, Daniel Truhn, Jakob Nikolas Kather, 2025, Nature Communications)
- Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs(Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov, 2023, arXiv (Cornell University))
- BadCM: Invisible Backdoor Attack Against Cross-Modal Learning(Zheng Zhang, Xu Yuan, Lei Zhu, Jingkuan Song, Liqiang Nie, 2024, ArXiv Preprint)
- FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts(Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang, 2025, Proceedings of the AAAI Conference on Artificial Intelligence)
- Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts(Hee-Seon Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim, 2025, ArXiv Preprint)
- Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy(Joonhyun Jeong, Sang‐In Bae, Yeonsung Jung, Jaeryong Hwang, Eunho Yang, 2025, No journal)
- Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models(Teng Ma, Xiaojun Jia, Ranjie Duan, Xinfeng Li, Zheng Hong, Yi-hao Li, Zhixuan Chu, Yang Liu, Wenqi Ren, 2024, arXiv (Cornell University))
- ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs(Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran, 2024, arXiv (Cornell University))
Multimodal Safety Benchmarks and Automated Red Teaming
This group builds standardized evaluation frameworks, large-scale benchmarks (e.g., JailBreakV-28K), and risk taxonomies. Using automated red teaming, these works systematically measure how vulnerable commercial and open-source models are to diverse jailbreak attacks and expose the limitations of current safety alignment.
- VLM@school -- Evaluation of AI image understanding on German middle school knowledge(René Peinl, Vincent Tischler, 2025, ArXiv Preprint)
- Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data(Nahema Marchal, Rachel Xu, Rasmi Elasmar, Iason Gabriel, Beth Goldberg, William Isaac, 2024, arXiv (Cornell University))
- JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks(Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao, 2024, arXiv (Cornell University))
- Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks(Zonghao Ying, Aishan Liu, Xianglong Liu, Dacheng Tao, 2024, arXiv (Cornell University))
- Red Teaming Visual Language Models(Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu, 2024, No journal)
- MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models(Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng, 2025, ArXiv Preprint)
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal(Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks, 2024, arXiv (Cornell University))
- IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves(Ruofan Wang, Bo Wang, Xingjun Ma, Yu-Gang Jiang, Xiaosen Wang, Yan Teng, Yingchun Wang, 2024, arXiv (Cornell University))
- MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models(Fenghua Weng, Yue‐Ping Xu, C. Y. Fu, Wenjie Wang, 2025, Proceedings of the AAAI Conference on Artificial Intelligence)
Cross-Modal Safety Alignment and Inference-Time Defense Hardening
This group investigates defenses that mitigate jailbreak risk, including safety preference alignment (SPA-VL), cross-modal unlearning, inference-time token pruning (SafePTR), contrastive decoding (SafeCoDe), and dynamic defense frameworks. The central question is how to strengthen the recognition and interception of malicious cross-modal inputs without degrading the model's general capabilities.
- SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism(Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen, 2025, ArXiv Preprint)
- Safety Alignment for Vision Language Models(Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng, 2024, ArXiv Preprint)
- VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap(Qin Liu, Fei Wang, Chaowei Xiao, Muhao Chen, 2025, ArXiv Preprint)
- Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models(Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss, Lluis Marquez, Miguel Ballesteros, Yassine Benajiba, 2024, ArXiv Preprint)
- Cross-Modal Safety Alignment: Is textual unlearning all you need?(Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael Abu‐Ghazaleh, M. Salman Asif, Yue Dong, Amit K. Roy–Chowdhury, Chengyu Song, 2024, arXiv (Cornell University))
- Steering Multimodal Large Language Models Decoding for Context-Aware Safety(Zheyuan Liu, Zhangchen Xu, Guangyao Dou, Xiangchi Yuan, Zhaoxuan Tan, Radha Poovendran, Meng Jiang, 2025, ArXiv Preprint)
- Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models(Yue Xu, Xiuyuan Qi, Qin Zhan, Wenjie Wang, 2024, arXiv (Cornell University))
- Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models(Fenghua Weng, Chaochao Lu, Xia Hu, Wenqi Shao, Wenjie Wang, 2025, ArXiv Preprint)
- DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt(Yitong Zhang, Jia Li, Liyi Cai, Ge Li, 2025, ArXiv Preprint)
- Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors(Jiachen Sun, Changsheng Wang, Jiongxiao Wang, Yiwei Zhang, Chaowei Xiao, 2024, arXiv (Cornell University))
- SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model(Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, Jing Shao, 2024, ArXiv Preprint)
- From Privacy Chains to ChainShield: Structured Privacy Risks and Defense in Vision-Language Models(Minxing Liu, Minh‐Ha Le, Niklas Carlsson, 2025, No journal)
- ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time(Yijie Ding, Bolian Li, Ruqi Zhang, 2024, arXiv (Cornell University))
- PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment(Zhendong Liu, Yuanbi Nie, Yingshui Tan, Jiaheng Liu, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng, 2024, ArXiv Preprint)
- Probabilistic Modeling of Jailbreak on Multimodal LLMs: From Quantification to Application(Wenzhuo Xu, Wei Zheng, Xiongtao Sun, Zonghao Ying, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang, Quanchen Zou, 2025, ArXiv.org)
- Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models(Shicheng Xu, Liang Pang, Yunchang Zhu, Huawei Shen, Xueqi Cheng, 2024, ArXiv Preprint)
- E$^2$AT: Multimodal Jailbreak Defense via Dynamic Joint Optimization for Multimodal Large Language Models(Liming Lu, Xiang Gu, Shuchao Pang, Siyuan Liang, Haotian Zhu, Xiyu Zeng, Xu Zheng, Yongbin Zhou, 2025, ArXiv Preprint)
- BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks(Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang, 2024, ArXiv Preprint)
Safety and Privacy Risks in Domain-Specific Applications
This group examines the safety of multimodal models in specific vertical domains (e.g., finance, healthcare, robot control) or under specific interaction mechanisms (e.g., long- and short-term memory), revealing new risk surfaces that emerge when domain-specific business logic meets multimodal interaction.
- Preventing Robotic Jailbreaking via Multimodal Domain Adaptation(Francesco Marchiori, Rohan Sinha, Christopher Agia, Alexander Robey, George J. Pappas, Mauro Conti, Marco Pavone, 2025, ArXiv Preprint)
- When Memory Becomes a Vulnerability: Towards Multi-turn Jailbreak Attacks against Text-to-Image Generation Systems(Shiqian Zhao, Jiayang Liu, Yiming Li, Runyi Hu, Xiaojun Jia, Wenshu Fan, Xiao Bao, Xinfeng Li, Jie Zhang, Wei Dong, Tianwei Zhang, Luu Anh Tuan, 2025, ArXiv Preprint)
- FENCE: A Financial and Multimodal Jailbreak Detection Dataset(Mirae Kim, Seonghun Jeong, Youngjun Kwak, 2026, ArXiv Preprint)
- Can Language Models be Instructed to Protect Personal Information?(Yang Chen, Ethan Mendes, Sauvik Das, Wei Xu, Alan Ritter, 2023, arXiv (Cornell University))
- Medical MLLM Is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models(Xijie Huang, Xinyuan Wang, Hantao Zhang, Yinghao Zhu, Jiawen Xi, Jingkun An, Hao Wang, Hao Liang, Chengwei Pan, 2025, Proceedings of the AAAI Conference on Artificial Intelligence)
This report synthesizes recent research on cross-modal reasoning jailbreak attacks against multimodal large language models (MLLMs/VLMs). The field has progressed from early, purely pixel-level adversarial perturbations to sophisticated attacks that exploit deep reasoning capabilities, cross-modal logical decomposition, and covert semantic injection. The report spans failure analysis of the underlying alignment mechanisms, development of diverse attack techniques, construction of systematic safety benchmarks, and defense hardening based on safety fine-tuning and inference-time intervention, forming a closed research loop of attack, defense, evaluation, and analysis, with particular attention to application safety in high-risk verticals such as finance and healthcare.
A total of 92 related papers are covered. Abstracts of selected cited works are reproduced below.
Security concerns related to Large Language Models (LLMs) have been extensively explored; however, the safety implications for Multimodal Large Language Models (MLLMs), particularly in medical contexts (MedMLLMs), remain inadequately addressed. This paper investigates the security vulnerabilities of MedMLLMs, focusing on their deployment in clinical environments where the accuracy and relevance of question-and-answer interactions are crucial for addressing complex medical challenges. We introduce and redefine two attack types: mismatched malicious attack (2M-attack) and optimized mismatched malicious attack (O2M-attack), by integrating existing clinical data with atypical natural phenomena. Using the comprehensive 3MAD dataset that we developed, which spans a diverse range of medical imaging modalities and adverse medical scenarios, we performed an in-depth analysis and proposed the MCM optimization method. This approach significantly improves the attack success rate against MedMLLMs. Our evaluations, which include white-box attacks on LLaVA-Med and transfer (black-box) attacks on four other SOTA models, reveal that even MedMLLMs designed with advanced security mechanisms remain vulnerable to breaches. This study highlights the critical need for robust security measures to enhance the safety and reliability of open-source MedMLLMs, especially in light of the potential impact of jailbreak attacks and other malicious exploits in clinical applications. Warning: Medical jailbreaking may generate content that includes unverified diagnoses and treatment recommendations. Always consult professional medical advice.
Multimodal large language models (MLLMs) have become integral to a wide range of real-world applications by jointly reasoning over text and visual inputs. However, despite recent advances in safety alignment, MLLMs remain vulnerable to jailbreak attacks, where carefully crafted inputs can bypass safety mechanisms and elicit harmful responses. In this work, we investigate the security vulnerabilities of MLLMs in text-vision scenarios and propose a novel black-box jailbreak framework, named PolyJailbreak. We first identify a phenomenon, termed multimodal safety asymmetry, where visual alignment introduces uneven safety constraints across modalities and weakens overall robustness. We analyze attention dynamics and latent representations in MLLMs, revealing that visual inputs can disrupt cross-modal information flow and reduce the model's ability to separate benign and malicious intents. Motivated by these findings, we propose PolyJailbreak, which organizes the discovered vulnerabilities into a structured library of reusable Atomic Strategy Primitives to enable step-wise transformations from harmful intents to effective jailbreak inputs. Guided by these primitives, a reinforcement learning-based multi-agent optimization process automatically adapts attacks to the target model without access to internal parameters. Extensive experiments on a wide range of MLLMs demonstrate that PolyJailbreak consistently outperforms state-of-the-art jailbreak baselines, with an average improvement of 18.15% in attack success rate and a success rate exceeding 95% on commercial black-box models, including GPT-4o and Gemini.
Large Vision-Language Models (LVLMs) signify a groundbreaking paradigm shift within the Artificial Intelligence (AI) community, extending beyond the capabilities of Large Language Models (LLMs) by assimilating additional modalities (e.g., images). Despite this advancement, the safety of LVLMs remains underexplored, with a potential overreliance on the safety assurances purported by their underlying LLMs. In this paper, we propose FigStep, a straightforward yet effective black-box jailbreak algorithm against LVLMs. Instead of feeding textual harmful instructions directly, FigStep converts the prohibited content into images through typography to bypass the safety alignment. The experimental results indicate that FigStep can achieve an average attack success rate of 82.50% on six promising open-source LVLMs. Not merely to demonstrate the efficacy of FigStep, we conduct comprehensive ablation studies and analyze the distribution of the semantic embeddings to uncover that the reason behind the success of FigStep is the deficiency of safety alignment for visual embeddings. Moreover, we compare FigStep with five text-only jailbreaks and four image-based jailbreaks to demonstrate the superiority of FigStep, i.e., negligible attack costs and better attack performance. Above all, our work reveals that current LVLMs are vulnerable to jailbreak attacks, which highlights the necessity of novel cross-modality safety alignment techniques.
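To make the typography step concrete, the sketch below (not the authors' released code) renders a text prompt into an image with PIL. FigStep additionally rephrases the request into a numbered-list template before rendering, which is omitted here; the font, canvas size, line spacing, and file path are placeholder choices.

```python
# Minimal sketch of the typography step: render a text prompt as an image.
# FigStep's prompt rephrasing/templating is omitted; parameters are placeholders.
from PIL import Image, ImageDraw, ImageFont

def render_text_image(text: str, out_path: str, size=(760, 760)) -> None:
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()          # a real attack would use a large TTF font
    y = 20
    for line in text.split("\n"):
        draw.text((20, y), line, fill="black", font=font)
        y += 30                              # simple fixed line spacing
    img.save(out_path)

# Example usage with benign placeholder content:
render_text_image("Steps to bake bread:\n1.\n2.\n3.", "typographic_prompt.png")
```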
We introduce new jailbreak attacks on vision language models (VLMs), which use aligned LLMs and are resilient to text-only jailbreak attacks. Specifically, we develop cross-modality attacks on alignment where we pair adversarial images going through the vision encoder with textual prompts to break the alignment of the language model. Our attacks employ a novel compositional strategy that combines an image, adversarially targeted towards toxic embeddings, with generic prompts to accomplish the jailbreak. Thus, the LLM draws the context to answer the generic prompt from the adversarial image. The generation of benign-appearing adversarial images leverages a novel embedding-space-based methodology, operating with no access to the LLM model. Instead, the attacks require access only to the vision encoder and utilize one of our four embedding space targeting strategies. By not requiring access to the LLM, the attacks lower the entry barrier for attackers, particularly when vision encoders such as CLIP are embedded in closed-source LLMs. The attacks achieve a high success rate across different VLMs, highlighting the risk of cross-modality alignment vulnerabilities, and the need for new alignment approaches for multi-modal models.
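The embedding-space targeting strategy described above can be pictured with the following sketch. It is a hypothetical, simplified reconstruction rather than the authors' code: it assumes white-box access to an open CLIP checkpoint via Hugging Face transformers, uses a benign placeholder phrase as the optimization target (the paper targets toxic embedding regions), and omits CLIP's input normalization and any perceptibility constraints.

```python
# Sketch: optimize image pixels so the CLIP vision encoder maps them near a target
# embedding, with no access to the downstream LLM. Target phrase, step count, and
# learning rate are placeholders.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():
    p.requires_grad_(False)

target_inputs = processor(text=["a placeholder target phrase"], return_tensors="pt")
with torch.no_grad():
    target = model.get_text_features(**target_inputs)
    target = target / target.norm(dim=-1, keepdim=True)

# Start from a random (or benign) image; CLIP's mean/std normalization is omitted.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([image], lr=1e-2)
for step in range(300):
    emb = model.get_image_features(pixel_values=image.clamp(0, 1))
    emb = emb / emb.norm(dim=-1, keepdim=True)
    loss = 1.0 - (emb * target).sum()        # cosine distance to the target embedding
    opt.zero_grad()
    loss.backward()
    opt.step()
```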
Large Vision-Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built-in safety mechanisms to elicit restricted content generation. Existing black-box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross-modal Adversarial Multimodal Obfuscation (CAMO), a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs' cross-modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi-step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency. Comprehensive evaluations conducted on leading LVLMs validate CAMO's effectiveness, showcasing robust performance and strong cross-model transferability. These results underscore significant vulnerabilities in current built-in safety mechanisms, emphasizing an urgent need for advanced, alignment-aware security and safety solutions in vision-language systems.
Multimodal large language models (MLLMs) enable powerful cross-modal reasoning capabilities. However, the expanded input space introduces new attack surfaces. Previous jailbreak attacks often inject malicious instructions from text into less aligned modalities, such as vision. As MLLMs increasingly incorporate cross-modal consistency and alignment mechanisms, such explicit attacks become easier to detect and block. In this work, we propose a novel implicit jailbreak framework termed IJA that stealthily embeds malicious instructions into images via least significant bit steganography and couples them with seemingly benign, image-related textual prompts. To further enhance attack effectiveness across diverse MLLMs, we incorporate adversarial suffixes generated by a surrogate model and introduce a template optimization module that iteratively refines both the prompt and embedding based on model feedback. On commercial models like GPT-4o and Gemini-1.5 Pro, our method achieves attack success rates of over 90% using an average of only 3 queries.
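The concealment step can be illustrated with a toy least-significant-bit (LSB) sketch. It shows only how a UTF-8 string can be hidden in and recovered from the low-order bits of image pixels; the paper's adversarial suffixes and feedback-driven template optimization are not represented, and file paths are placeholders. A lossless format such as PNG is required, since lossy compression destroys the embedded bits.

```python
# Toy LSB steganography: hide a string in, and recover it from, pixel low bits.
import numpy as np
from PIL import Image

def embed_lsb(image_path: str, message: str, out_path: str) -> None:
    img = np.array(Image.open(image_path).convert("RGB"))
    bits = "".join(f"{b:08b}" for b in message.encode("utf-8")) + "0" * 8  # null terminator
    flat = img.flatten()
    if len(bits) > flat.size:
        raise ValueError("message too long for carrier image")
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | int(bit)       # overwrite the least significant bit
    Image.fromarray(flat.reshape(img.shape)).save(out_path, format="PNG")

def extract_lsb(image_path: str) -> str:
    flat = np.array(Image.open(image_path).convert("RGB")).flatten()
    data = bytearray()
    for i in range(0, flat.size - 7, 8):
        byte = int("".join(str(flat[i + j] & 1) for j in range(8)), 2)
        if byte == 0:                                # stop at the null terminator
            break
        data.append(byte)
    return data.decode("utf-8", errors="ignore")
```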
Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs work by distributing malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these adversarial attacks rely on simple and fixed image-text combinations that lack attack complexity scalability, limiting their effectiveness for red-teaming VLMs' continuously evolving reasoning capabilities. We propose CrossTALK (Cross-modal enTAngLement attacK), a scalable approach that extends and entangles information clues across modalities to exceed VLMs' trained and generalized safety alignment patterns for jailbreak. Specifically, knowledge-scalable reframing extends harmful tasks into multi-hop chain instructions, cross-modal clue entangling migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show that CrossTALK achieves a state-of-the-art attack success rate.
Understanding the vulnerabilities of Large Vision Language Models (LVLMs) to jailbreak attacks is essential for their responsible real-world deployment. Most previous work requires access to model gradients, or is based on human knowledge (prompt engineering) to complete jailbreak, and they hardly consider the interaction of images and text, resulting in inability to jailbreak in black box scenarios or poor performance. To overcome these limitations, we propose a Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for toxicity maximization, referred to as PBI-Attack. Our method begins by extracting malicious features from a harmful corpus using an alternative LVLM and embedding these features into a benign image as prior information. Subsequently, we enhance these features through bidirectional cross-modal interaction optimization, which iteratively optimizes the bimodal perturbations in an alternating manner through greedy search, aiming to maximize the toxicity of the generated response. The toxicity level is quantified using a well-trained evaluation model. Experiments demonstrate that PBI-Attack outperforms previous state-of-the-art jailbreak methods, achieving an average attack success rate of 92.5% across three open-source LVLMs and around 67.3% on three closed-source LVLMs. Disclaimer: This paper contains potentially disturbing and offensive content.
Multimodal Large Language Models (MLLMs) extend the capacity of LLMs to understand multimodal information comprehensively, achieving remarkable performance in many vision-centric tasks. Despite that, recent studies have shown that these models are susceptible to jailbreak attacks, which refer to an exploitative technique where malicious users can break the safety alignment of the target model and generate misleading and harmful answers. This potential threat is caused by both the inherent vulnerabilities of LLM and the larger attack scope introduced by vision input. To enhance the security of MLLMs against jailbreak attacks, researchers have developed various defense techniques. However, these methods either require modifications to the model's internal structure or demand significant computational resources during the inference phase. Multimodal information is a double-edged sword. While it increases the risk of attacks, it also provides additional data that can enhance safeguards. Inspired by this, we propose Cross-modality Information DEtectoR (CIDER), a plug-and-play jailbreaking detector designed to identify maliciously perturbed image inputs, utilizing the cross-modal similarity between harmful queries and adversarial images. CIDER is independent of the target MLLMs and requires less computation cost. Extensive experimental results demonstrate the effectiveness and efficiency of CIDER, as well as its transferability to both white-box and black-box MLLMs.
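As a rough illustration of a cross-modal consistency check in this spirit (not CIDER itself), the sketch below compares the CLIP similarity between a text query and an image before and after a denoising step; adversarial perturbations tend to be brittle, so a large similarity shift is suspicious. CIDER uses a diffusion-based denoiser and a calibrated decision rule, whereas the Gaussian blur and the threshold value here are stand-in assumptions.

```python
# Toy cross-modal consistency check inspired by the idea above (not CIDER's pipeline).
import torch
import torchvision.transforms.functional as TF
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(text: str, image) -> float:
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.item()

def looks_adversarial(query: str, image, threshold: float = 2.0) -> bool:
    # Gaussian blur stands in for the diffusion denoiser; threshold is hypothetical
    # and would be tuned on held-out clean/adversarial examples.
    denoised = TF.to_pil_image(TF.gaussian_blur(TF.to_tensor(image), kernel_size=5))
    drop = clip_similarity(query, image) - clip_similarity(query, denoised)
    return drop > threshold
```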
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
With the rapid advancement of multimodal large language models (MLLMs), concerns regarding their security have increasingly captured the attention of both academia and industry. Although MLLMs are vulnerable to jailbreak attacks, designing effective jailbreak attacks poses unique challenges, especially given the highly constrained adversarial capabilities in real-world deployment scenarios. Previous works concentrate risks into a single modality, resulting in limited jailbreak performance. In this paper, we propose a heuristic-induced multimodal risk distribution jailbreak attack method, called HIMRD, which is black-box and consists of two elements: multimodal risk distribution strategy and heuristic-induced search strategy. The multimodal risk distribution strategy is used to distribute harmful semantics into multiple modalities to effectively circumvent the single-modality protection mechanisms of MLLMs. The heuristic-induced search strategy identifies two types of prompts: the understanding-enhancing prompt, which helps MLLMs reconstruct the malicious prompt, and the inducing prompt, which increases the likelihood of affirmative outputs over refusals, enabling a successful jailbreak attack. HIMRD achieves an average attack success rate (ASR) of 90% across seven open-source MLLMs and an average ASR of around 68% in three closed-source MLLMs. HIMRD reveals cross-modal security vulnerabilities in current MLLMs and underscores the imperative for developing defensive strategies to mitigate such emerging risks. Code is available at https://github.com/MaTengSYSU/HIMRD-jailbreak.
Warning: this paper contains data, prompts, and model outputs that are offensive in nature. Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a `few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.
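A minimal PGD-style sketch of the white-box optimization described above follows. It is schematic, not the authors' implementation: `vlm`, `tokenizer`, and `harmful_corpus` are placeholders, the forward interface is simplified to a model that accepts pixel_values and labels and returns a language-modeling loss (real VLM interfaces also require the tokenized text prompt), and the L-infinity budget and step sizes are illustrative.

```python
import torch

def pgd_visual_jailbreak(vlm, tokenizer, harmful_corpus, steps=500,
                         epsilon=32 / 255, alpha=1 / 255):
    """Optimize an L-infinity-bounded image perturbation that raises the likelihood
    of a small corpus of target strings under a white-box VLM. `vlm` is assumed to
    accept pixel_values and labels and return an object with a .loss (the NLL)."""
    base = torch.rand(1, 3, 224, 224)                 # benign starting image (placeholder)
    delta = torch.zeros_like(base, requires_grad=True)
    for _ in range(steps):
        loss = 0.0
        for target in harmful_corpus:                 # few-shot corpus of target outputs
            labels = tokenizer(target, return_tensors="pt").input_ids
            out = vlm(pixel_values=(base + delta).clamp(0, 1), labels=labels)
            loss = loss + out.loss
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()        # descend NLL = raise target likelihood
            delta.clamp_(-epsilon, epsilon)           # stay within the L-infinity budget
            delta.grad.zero_()
    return (base + delta).clamp(0, 1).detach()
```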
Xijia Tao, Shuai Zhong, Lei Li, Qi Liu, Lingpeng Kong. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025.
No abstract
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.
Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methods mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. Different from existing attacks, in this work we propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within VLMs. Specifically, we propose a dual optimization objective aimed at guiding the model to generate highly toxic affirmative responses. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics. Subsequently, an adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions. The discovered adversarial image prefix and text suffix are collectively denoted as a Universal Master Key (UMK). When integrated into various malicious queries, UMK can circumvent the alignment defenses of VLMs and lead to the generation of objectionable content, known as jailbreaks. The experimental results demonstrate that our universal attack strategy can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the fragility of VLMs and the exigency for new alignment strategies. Codes are available at https://github.com/roywang021/UMK. Disclaimer: This paper contains potentially disturbing and offensive content.
Multimodal Large Language Models (MLLMs) bridge the gap between visual and textual data, enabling a range of advanced applications. However, complex internal interactions among visual elements and their alignment with text can introduce vulnerabilities, which may be exploited to bypass safety mechanisms. To address this, we analyze the relationship between image content and task and find that the complexity of subimages, rather than their content, is key. Building on this insight, we propose the Distraction Hypothesis, followed by a novel framework called Contrasting Subimage Distraction Jailbreaking (CS-DJ), to achieve jailbreaking by disrupting MLLMs alignment through multi-level distraction strategies. CS-DJ consists of two components: structured distraction, achieved through query decomposition that induces a distributional shift by fragmenting harmful prompts into sub-queries, and visual-enhanced distraction, realized by constructing contrasting subimages to disrupt the interactions among visual elements within the model. This dual strategy disperses the model’s attention, reducing its ability to detect and mitigate harmful content. Extensive experiments across five representative scenarios and four popular closed-source MLLMs, including GPT-4o-mini, GPT-4o, GPT-4V, and Gemini-1.5-Flash, demonstrate that CS-DJ achieves average success rates of 52.40% for the attack success rate and 74.10% for the ensemble attack success rate. These results reveal the potential of distraction-based approaches to exploit and bypass MLLMs’ defenses, offering new insights for attack strategies. Our code is available at https://github.com/TeamPigeonLab/CS-DJ. Warning: This paper contains unfiltered content generated by MLLMs that may be offensive to readers.
With the rapid advancements in Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question of whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2,000 malicious queries that is also proposed in this paper, we generate 20,000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLMs jailbreak attacks; our comprehensive dataset includes 28,000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.
Large Language Models (LLMs) have evolved into Multimodal Large Language Models (MLLMs), significantly enhancing their capabilities by integrating visual information and other types, thus aligning more closely with the nature of human intelligence, which processes a variety of data forms beyond just text. Despite advancements, the undesirable generation of these models remains a critical concern, particularly due to vulnerabilities exposed by text-based jailbreak attacks, which have represented a significant threat by challenging existing safety protocols. Motivated by the unique security risks posed by the integration of new and old modalities for MLLMs, we propose a unified multimodal universal jailbreak attack framework that leverages iterative image-text interactions and transfer-based strategy to generate a universal adversarial suffix and image. Our work not only highlights the interaction of image-text modalities can be used as a critical vulnerability but also validates that multimodal universal jailbreak attacks can bring higher-quality undesirable generations across different MLLMs. We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP, and reveal significant multimodal safety alignment issues, highlighting the inadequacy of current safety mechanisms against sophisticated multimodal attacks. This study underscores the urgent need for robust safety measures in MLLMs, advocating for a comprehensive review and enhancement of security protocols to mitigate potential risks associated with multimodal capabilities.
Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that, while successful in bypassing safety filters, lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks via the cooperation of a visual image and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to specifically guide LLM responses fulfilling the attackers' intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Codes are available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive contents.
Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, here we exploit the capabilities of MLLMs to interpret non-textual instructions, specifically adversarial images or audio generated by our novel method, Con Instruction. We optimize the adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental aspects of sophisticated understanding in MLLMs. Unlike previous work, our method does not require training data or preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass MLLM safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new attack response categorization (ARC) that considers both response quality and relevance to the malicious instructions to evaluate attack success. The results show that Con Instruction effectively bypasses the safety mechanisms in various visual and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, across two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B). On the defense side, we explore various methods against our attacks and find a substantial gap among existing techniques. Our implementation is made available.
Existing work on jailbreak Multimodal Large Language Models (MLLMs) has focused primarily on adversarial examples in model inputs, with less attention to vulnerabilities, especially in model API. To fill the research gap, we carry out the following work: 1) We discover a system prompt leakage vulnerability in GPT-4V. Through carefully designed dialogue, we successfully extract the internal system prompts of GPT-4V. This finding indicates potential exploitable security risks in MLLMs; 2) Based on the acquired system prompts, we propose a novel MLLM jailbreaking attack method termed SASP (Self-Adversarial Attack via System Prompt). By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. Furthermore, in pursuit of better performance, we also add human modification based on GPT-4's analysis, which further improves the attack success rate to 98.7%; 3) We evaluated the effect of modifying system prompts to defend against jailbreaking attacks. Results show that appropriately designed system prompts can significantly reduce jailbreak success rates. Overall, our work provides new insights into enhancing MLLM security, demonstrating the important role of system prompts in jailbreaking. This finding could be leveraged to greatly facilitate jailbreak success rates while also holding the potential for defending against jailbreaks.
VLMs (Vision-Language Models) extend the capabilities of LLMs (Large Language Models) to accept multimodal inputs. Since it has been verified that LLMs can be induced to generate harmful or inaccurate content through specific test cases (termed as Red Teaming), how VLMs perform in similar scenarios, especially with their combination of textual and visual inputs, remains a question. To explore this problem, we present a novel red teaming dataset RTVLM, which encompasses 12 subtasks (e.g., image misleading, multi-modal jailbreaking, face fairness, etc.) under 4 primary aspects (faithfulness, privacy, safety, fairness). Our RTVLM is the first red teaming dataset to benchmark current VLMs in terms of these 4 different aspects. Detailed analysis shows that 10 prominent open-sourced VLMs struggle with the red teaming in different degrees and have up to 31% performance gap with GPT-4V. Additionally, we simply apply red teaming alignment to LLaVA-v1.5 with Supervised Fine-tuning (SFT) using RTVLM, and this bolsters the models' performance with 10% in RTVLM test set, 13% in MM-hallu, and without noticeable decline in MM-Bench, overpassing other LLaVA-based models in similar size with regular alignment data. This reveals that current open-sourced VLMs still lack red teaming alignment. Our code and datasets will be open-sourced.
As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Vision-Language Models (VLMs), have shown exceptional performance in many real-world tasks. However, VLMs face significant security challenges, such as jailbreak attacks, where attackers attempt to bypass the model’s safety alignment to elicit harmful responses. The threat of jailbreak attacks on VLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that VLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluations, as each method is evaluated on different datasets and metrics, making it impossible to compare the effectiveness of each method. To address this gap, we introduce MMJ-Bench, a unified pipeline for evaluating jailbreak attacks and defense techniques for VLMs. Through extensive experiments, we assess the effectiveness of various attack methods against SoTA VLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contributes to the field by offering a unified and systematic evaluation framework and the first publicly available benchmark for VLM jailbreak research. We also demonstrate several insightful findings that highlight directions for future studies.
The recent release of GPT-4o has garnered widespread attention due to its powerful general capabilities. While its impressive performance is widely acknowledged, its safety aspects have not been sufficiently explored. Given the potential societal impact of risky content generated by advanced generative AI such as GPT-4o, it is crucial to rigorously evaluate its safety. In response to this question, this paper for the first time conducts a rigorous evaluation of GPT-4o against jailbreak attacks. Specifically, this paper adopts a series of multi-modal and uni-modal jailbreak attacks on 4 commonly used benchmarks encompassing three modalities (i.e., text, speech, and image), which involves the optimization of over 4,000 initial text queries and the analysis and statistical evaluation of nearly 8,000 responses on GPT-4o. Our extensive experiments reveal several novel observations: (1) In contrast to the previous version (such as GPT-4V), GPT-4o has enhanced safety in the context of text modality jailbreak; (2) The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o; (3) Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V. These findings provide critical insights into the safety implications of GPT-4o and underscore the need for robust alignment guardrails in large models. Our code is available at https://github.com/NY1024/Jailbreak_GPT4o.
Augmenting language models with image inputs may enable more effective jailbreak attacks through continuous optimization, unlike text inputs that require discrete optimization. However, new multimodal fusion models tokenize all input modalities using non-differentiable functions, which hinders straightforward attacks. In this work, we introduce the notion of a tokenizer shortcut that approximates tokenization with a continuous function and enables continuous optimization. We use tokenizer shortcuts to create the first end-to-end gradient image attacks against multimodal fusion models. We evaluate our attacks on Chameleon models and obtain jailbreak images that elicit harmful information for 72.5% of prompts. Jailbreak images outperform text jailbreaks optimized with the same objective and require 3x lower compute budget to optimize 50x more input tokens. Finally, we find that representation engineering defenses, like Circuit Breakers, trained only on text attacks can effectively transfer to adversarial image inputs.
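The "tokenizer shortcut" can be pictured as replacing the non-differentiable nearest-codebook lookup of a vector-quantized image tokenizer with a soft assignment, so that gradients flow from the language-modeling loss back to the pixels. The sketch below is a conceptual toy under that assumption; the shapes, temperature, and function name are illustrative rather than taken from the paper.

```python
# Toy "tokenizer shortcut": soft codebook assignment instead of argmin lookup.
import torch
import torch.nn.functional as F

def soft_tokenize(features: torch.Tensor, codebook: torch.Tensor, tau: float = 0.1):
    """features: (n_patches, d) encoder outputs; codebook: (vocab, d) embeddings."""
    dists = torch.cdist(features, codebook)      # (n_patches, vocab) pairwise distances
    weights = F.softmax(-dists / tau, dim=-1)    # soft assignment replaces hard argmin
    return weights @ codebook                    # differentiable "token" embeddings
```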
Recently, Multimodal Large Language Models (MLLMs) have demonstrated their superior ability in understanding multimodal content. However, they remain vulnerable to jailbreak attacks, which exploit weaknesses in their safety alignment to generate harmful responses. Previous studies categorize jailbreaks as successful or failed based on whether responses contain malicious content. However, given the stochastic nature of MLLM responses, this binary classification of an input's ability to jailbreak MLLMs is inappropriate. Derived from this viewpoint, we introduce jailbreak probability to quantify the jailbreak potential of an input, which represents the likelihood that MLLMs generated a malicious response when prompted with this input. We approximate this probability through multiple queries to MLLMs. After modeling the relationship between input hidden states and their corresponding jailbreak probability using Jailbreak Probability Prediction Network (JPPN), we use continuous jailbreak probability for optimization. Specifically, we propose Jailbreak-Probability-based Attack (JPA) that optimizes adversarial perturbations on input image to maximize jailbreak probability, and further enhance it as Multimodal JPA (MJPA) by including monotonic text rephrasing. To counteract attacks, we also propose Jailbreak-Probability-based Finetuning (JPF), which minimizes jailbreak probability through MLLM parameter updates. Extensive experiments show that (1) (M)JPA yields significant improvements when attacking a wide range of models under both white and black box settings. (2) JPF vastly reduces jailbreaks, by more than 60% in the best case. Both of the above results demonstrate the significance of introducing jailbreak probability to make nuanced distinctions among input jailbreak abilities.
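Two ingredients described above can be sketched in a few lines: an empirical estimate of jailbreak probability from repeated queries, and a small predictor mapping hidden states to that probability. Both are simplified placeholders (the query, judging, and pooling interfaces are assumptions), not the paper's JPPN architecture.

```python
import torch
import torch.nn as nn

def empirical_jailbreak_prob(query_model, is_harmful, prompt, n_samples=20):
    """Estimate P(jailbreak) by repeated sampling. `query_model(prompt) -> str`
    and `is_harmful(response) -> bool` are placeholder callables."""
    hits = sum(is_harmful(query_model(prompt)) for _ in range(n_samples))
    return hits / n_samples

class JailbreakProbPredictor(nn.Module):
    """Maps a pooled hidden-state vector to a probability in [0, 1]."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_state)
```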
Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but they still have potential safety mechanism vulnerabilities. Jailbreak attacks are red teaming methods that aim to bypass safety mechanisms and discover MLLMs' potential risks. Existing MLLMs' jailbreak methods often bypass the model's safety mechanism through complex optimization methods or carefully designed image and text prompts. Despite achieving some progress, they have a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability for the shuffled harmful instruction. That is, from the perspective of comprehension ability, MLLMs can understand the shuffled harmful text-image instructions well. However, they can be easily bypassed by the shuffled harmful instructions from the perspective of safety ability, leading to harmful responses. Then we innovatively propose a text-image jailbreak attack named SI-Attack. Specifically, to fully utilize the Shuffle Inconsistency and overcome the shuffle randomness, we apply a query-based black-box optimization method to select the most harmful shuffled inputs based on the feedback of the toxic judge model. A series of experiments show that SI-Attack can improve the attack's performance on three benchmarks. In particular, SI-Attack can obviously improve the attack success rate for commercial MLLMs such as GPT-4o or Claude-3.5-Sonnet.
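The query-based selection loop can be sketched as below. This toy version shuffles only the words of the text instruction and keeps the variant that a judge scores as most harmful; SI-Attack also shuffles image patches and uses a dedicated toxicity judge model, and the `query_target_model` / `toxicity_judge` callables here are placeholders.

```python
import random

def shuffle_and_select(instruction: str, query_target_model, toxicity_judge,
                       n_candidates: int = 20):
    """Toy shuffle-inconsistency loop: try shuffled variants of a harmful instruction
    and keep the one whose response the judge scores as most harmful.
    `query_target_model(text) -> str`, `toxicity_judge(text) -> float` are placeholders."""
    words = instruction.split()
    best_variant, best_score = None, float("-inf")
    for _ in range(n_candidates):
        shuffled = words[:]
        random.shuffle(shuffled)                 # word-level shuffle of the instruction
        variant = " ".join(shuffled)
        score = toxicity_judge(query_target_model(variant))
        if score > best_score:
            best_variant, best_score = variant, score
    return best_variant, best_score
```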
Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model's own reasoning incentives. This leads to them underperforming on reasoning models (Models with Chain-of-Thoughts) compared to non-reasoning ones (Models without Chain-of-Thoughts). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.
Large multimodal language models have proven transformative in numerous applications. However, these models have been shown to memorize and leak pre-training data, raising serious user privacy and information security concerns. While data leaks should be prevented, it is also crucial to examine the trade-off between the privacy protection and model utility of proposed approaches. In this paper, we introduce PrivQA -- a multimodal benchmark to assess this privacy/utility trade-off when a model is instructed to protect specific categories of personal information in a simulated scenario. We also propose a technique to iteratively self-moderate responses, which significantly improves privacy. However, through a series of red-teaming experiments, we find that adversaries can also easily circumvent these protections with simple jailbreaking methods through textual and/or image inputs. We believe PrivQA has the potential to support the development of new models with improved privacy protections, as well as the adversarial robustness of these protections. We release the entire PrivQA dataset at https://llm-access-control.github.io/.
Despite the remarkable versatility of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) in generalizing across both language and vision tasks, LLMs and MLLMs have shown vulnerability to jailbreaking, generating textual outputs that undermine safety, ethical, and bias standards when exposed to harmful or sensitive inputs. With the recent advancement of safety alignment via preference-tuning from human feedback, LLMs and MLLMs have been equipped with safety guardrails to yield safe, ethical, and fair responses to harmful inputs. However, despite the significance of safety alignment, research on its vulnerabilities remains largely underexplored. In this paper, we investigate an unexplored vulnerability of safety alignment, examining its ability to consistently provide safety guarantees for out-of-distribution (OOD)-ified harmful inputs that may fall outside the aligned data distribution. Our key observation is that OOD-ifying vanilla harmful inputs greatly increases the uncertainty of the model in discerning the malicious intent within the input, leading to a higher chance of being jailbroken. Exploiting this vulnerability, we propose JOOD, a new Jailbreak framework via OOD-ifying inputs beyond the safety alignment. We explore various off-the-shelf visual and textual transformation techniques for OOD-ifying the harmful inputs. Notably, we observe that even simple mixing-based techniques such as image mixup prove highly effective in increasing the uncertainty of the model, thereby facilitating the bypass of the safety alignment. Experiments across diverse jailbreak scenarios demonstrate that JOOD effectively jailbreaks recent proprietary LLMs and MLLMs such as GPT-4 and o1 with a high attack success rate, models that previous attack approaches have consistently struggled to jailbreak. Code is available at https://github.com/naver-ai/JOOD.
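The abstract highlights image mixup as a simple OOD-ifying transformation; the following is a generic pixel-space mixup sketch (the blend ratio `lam` and the file names are illustrative assumptions, not values from the paper):

```python
# Blend a harmful image with a benign one so the combined input drifts away
# from the distribution the safety alignment was trained on.
import numpy as np
from PIL import Image

def mixup_images(harmful: Image.Image, benign: Image.Image, lam: float = 0.5) -> Image.Image:
    # Assumes both images have the same size; otherwise resize beforehand.
    a = np.asarray(harmful.convert("RGB"), dtype=np.float32)
    b = np.asarray(benign.convert("RGB"), dtype=np.float32)
    mixed = lam * a + (1.0 - lam) * b            # pixel-space convex combination
    return Image.fromarray(mixed.clip(0, 255).astype(np.uint8))

# Usage (hypothetical file names):
# ood_img = mixup_images(Image.open("harmful.png"), Image.open("benign.png"), lam=0.6)
```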
As large Vision-Language Models (VLMs) gain prominence, ensuring their safe deployment has become critical. Recent studies have explored VLM robustness against jailbreak attacks-techniques that exploit model vulnerabilities to elicit harmful outputs. However, the limited availability of diverse multimodal data has constrained current approaches to rely heavily on adversarial or manually crafted images derived from harmful text datasets, which often lack effectiveness and diversity across different contexts. In this paper, we propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is grounded in the insight that VLMs themselves could serve as powerful red team models for generating multimodal jailbreak prompts. Specifically, IDEATOR leverages a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. Extensive experiments demonstrate IDEATOR's high effectiveness and transferability, achieving a 94% attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high ASRs of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Chameleon, respectively. Building on IDEATOR's strong transferability and automated process, we introduce the VLJailbreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses. VLJailbreakBench is publicly available at https://roywang021.github.io/VLJailbreakBench.
The emergence of Multimodal Large Reasoning Models (MLRMs) has enabled sophisticated visual reasoning capabilities by integrating reinforcement learning and Chain-of-Thought (CoT) supervision. However, while these enhanced reasoning capabilities improve performance, they also introduce new and underexplored safety risks. In this work, we systematically investigate the security implications of advanced visual reasoning in MLRMs. Our analysis reveals a fundamental trade-off: as visual reasoning improves, models become more vulnerable to jailbreak attacks. Motivated by this critical finding, we introduce VisCRA (Visual Chain Reasoning Attack), a novel jailbreak framework that exploits the visual reasoning chains to bypass safety mechanisms. VisCRA combines targeted visual attention masking with a two-stage reasoning induction strategy to precisely control harmful outputs. Extensive experiments demonstrate VisCRA's significant effectiveness, achieving high attack success rates on leading closed-source MLRMs: 76.48% on Gemini 2.0 Flash Thinking, 68.56% on QvQ-Max, and 56.60% on GPT-4o. Our findings highlight a critical insight: the very capability that empowers MLRMs -- their visual reasoning -- can also serve as an attack vector, posing significant security risks.
No abstract
Are foundation models secure against malicious actors? In this work, we focus on the image input to a vision-language model (VLM). We discover image hijacks, adversarial images that control the behaviour of VLMs at inference time, and introduce the general Behaviour Matching algorithm for training image hijacks. From this, we derive the Prompt Matching method, allowing us to train hijacks matching the behaviour of an arbitrary user-defined text prompt (e.g. 'the Eiffel Tower is now located in Rome') using a generic, off-the-shelf dataset unrelated to our choice of prompt. We use Behaviour Matching to craft hijacks for four types of attack, forcing VLMs to generate outputs of the adversary's choice, leak information from their context window, override their safety training, and believe false statements. We study these attacks against LLaVA, a state-of-the-art VLM based on CLIP and LLaMA-2, and find that all attack types achieve a success rate of over 80%. Moreover, our attacks are automated and require only small image perturbations.
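A schematic of the general idea behind training an image hijack: a PGD-style loop that minimizes a white-box VLM's loss on an attacker-chosen target string. `vlm_loss` is a hypothetical placeholder for the model's teacher-forced loss, and the step size, budget, and perturbation bound are arbitrary illustrative choices:

```python
# Optimize a bounded image perturbation so the model's loss on the target
# text decreases, i.e. the image "hijacks" the model toward that output.
import torch

def vlm_loss(image: torch.Tensor, target_text: str) -> torch.Tensor:
    # Placeholder: in practice this is the VLM's next-token loss on target_text.
    return (image ** 2).mean()

def train_hijack(image: torch.Tensor, target_text: str,
                 eps: float = 8 / 255, alpha: float = 1 / 255, steps: int = 100):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = vlm_loss((image + delta).clamp(0, 1), target_text)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # gradient descent on the loss
            delta.clamp_(-eps, eps)              # keep the perturbation small
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

hijacked = train_hijack(torch.rand(3, 224, 224), "the Eiffel Tower is now located in Rome")
```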
Different from traditional task-specific vision models, recent large VLMs can readily adapt to different vision tasks by simply using different textual instructions, i.e., prompts. However, a well-known concern about traditional task-specific vision models is that they can be misled by imperceptible adversarial perturbations. Furthermore, the concern is exacerbated by the phenomenon that the same adversarial perturbations can fool different task-specific models. Given that VLMs rely on prompts to adapt to different tasks, an intriguing question emerges: Can a single adversarial image mislead all predictions of VLMs when a thousand different prompts are given? This question essentially introduces a novel perspective on adversarial transferability: cross-prompt adversarial transferability. In this work, we propose the Cross-Prompt Attack (CroPA). This proposed method updates the visual adversarial perturbation with learnable prompts, which are designed to counteract the misleading effects of the adversarial image. By doing this, CroPA significantly improves the transferability of adversarial examples across prompts. Extensive experiments are conducted to verify the strong cross-prompt adversarial transferability of CroPA with prevalent VLMs including Flamingo, BLIP-2, and InstructBLIP in various different tasks. Our source code is available at https://github.com/Haochen-Luo/CroPA.
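A conceptual sketch of the alternating update the abstract describes: the image perturbation ascends a task loss while a learnable prompt embedding descends it, so the perturbation must stay misleading across prompts. `task_loss` and all hyperparameters are illustrative assumptions, not CroPA's actual objective:

```python
# Min-max style alternating updates: maximize the loss w.r.t. the image
# perturbation, minimize it w.r.t. the learnable prompt embedding.
import torch

def task_loss(image, prompt_emb):
    # Placeholder for the VLM loss on the correct answer given image + prompt.
    return (image.mean() - prompt_emb.mean()) ** 2

image = torch.rand(3, 224, 224)
delta = torch.zeros_like(image, requires_grad=True)
prompt_emb = torch.zeros(16, 768, requires_grad=True)

for step in range(50):
    loss = task_loss((image + delta).clamp(0, 1), prompt_emb)
    g_delta, g_prompt = torch.autograd.grad(loss, [delta, prompt_emb])
    with torch.no_grad():
        delta += (1 / 255) * g_delta.sign()      # image perturbation maximizes the loss
        delta.clamp_(-8 / 255, 8 / 255)
        prompt_emb -= 0.01 * g_prompt            # learnable prompt minimizes the loss
```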
Pre-trained Vision-Language Models (VLMs) have shown great ability in various Vision-Language tasks. However, these VLMs exhibit inherent vulnerabilities to transferable adversarial examples, which could potentially undermine their performance and reliability in real-world applications. Cross-modal interactions have been demonstrated to be the key to boosting adversarial transferability, but their utilization is limited in existing multimodal adversarial attacks. Stable Diffusion, which contains multiple cross-attention modules, possesses great potential in facilitating adversarial transferability by leveraging abundant cross-modal interactions. Therefore, we propose a Multimodal Diffusion-based Attack (MDA), which conducts adversarial attacks against VLMs using Stable Diffusion. Specifically, MDA initially generates adversarial text, which is subsequently utilized to optimize the adversarial image during the diffusion process. Besides leveraging adversarial text in calculating the downstream loss, MDA also takes it as the guiding prompt in adversarial image generation during the denoising process, which enriches the ways of cross-modal interaction, thus strengthening the adversarial transferability. Compared with pixel-based attacks, MDA introduces perturbations in the latent space rather than pixel space to manipulate high-level semantics, which is also beneficial to improving adversarial transferability. Experimental results demonstrate that the adversarial examples generated by MDA are highly transferable across different VLMs on different downstream tasks, surpassing state-of-the-art methods by a large margin.
Large language models have become increasingly prominent, also signaling a shift towards multimodality as the next frontier in artificial intelligence, where their embeddings are harnessed as prompts to generate textual content. Vision-language models (VLMs) stand at the forefront of this advancement, offering innovative ways to combine visual and textual data for enhanced understanding and interaction. However, this integration also enlarges the attack surface. Patch-based adversarial attacks are considered the most realistic threat model in physical vision applications, as demonstrated in much of the existing literature. In this paper, we propose to address patched visual prompt injection, where adversaries exploit adversarial patches to generate target content in VLMs. Our investigation reveals that patched adversarial prompts exhibit sensitivity to pixel-wise randomization, a trait that remains robust even against adaptive attacks designed to counteract such defenses. Leveraging this insight, we introduce SmoothVLM, a defense mechanism rooted in smoothing techniques, specifically tailored to protect VLMs from the threat of patched visual prompt injectors. Our framework significantly lowers the attack success rate to a range between 0% and 5.0% on two leading VLMs, while achieving around 67.3% to 95.0% context recovery of the benign images, demonstrating a balance between security and usability.
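A toy version of the pixel-wise randomization idea: randomly replace a fraction of pixels, query the VLM several times, and keep the majority answer. `query_vlm`, the number of votes, and the drop rate are hypothetical placeholders rather than SmoothVLM's actual configuration:

```python
# Randomized smoothing over pixels with a majority vote across queries.
import numpy as np
from collections import Counter

def query_vlm(image: np.ndarray, prompt: str) -> str:
    return "stubbed response"                    # placeholder for the real VLM call

def smoothed_answer(image: np.ndarray, prompt: str, n: int = 5, drop: float = 0.3) -> str:
    votes = []
    for _ in range(n):
        noisy = image.copy()
        mask = np.random.rand(*image.shape[:2]) < drop          # pixels to randomize
        noisy[mask] = np.random.randint(0, 256, size=(mask.sum(), image.shape[2]))
        votes.append(query_vlm(noisy, prompt))
    return Counter(votes).most_common(1)[0][0]   # majority vote over n queries

print(smoothed_answer(np.zeros((224, 224, 3), dtype=np.uint8), "describe the image"))
```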
Vision-Language Models (VLMs) are increasingly deployed in applications that interpret and generate information from visual and textual inputs. While powerful, these models pose emerging privacy risks. In this paper, we introduce the concept of privacy chains: structured narratives that emerge when adversaries aggregate outputs from VLMs across multiple images, often exposing sensitive information even when the individual outputs are seemingly innocuous. Using LangChain, an open-source orchestration framework, we show how identity-linked data extracted via both benign and targeted prompts can be compiled into detailed timelines of private behavior, significantly amplifying privacy threats. To systematically assess this risk, we develop a privacy leakage pipeline within the Visual Question Answering (VQA) framework and evaluate six open-source VLMs across three tailored datasets: Celebrity, Car, and Tattoo. Our analysis reveals substantial and model-dependent privacy leakage, even from general-purpose queries. To mitigate this threat, we propose ChainShield, a white-box adversarial defense that applies targeted, imperceptible perturbations to images. ChainShield reduces privacy-relevant outputs by redirecting VLM responses toward benign alternatives, while preserving image realism. Our experiments show that ChainShield substantially lowers privacy leakage across models and datasets, effectively disrupting the formation of privacy chains.
Vision-language artificial intelligence models (VLMs) possess medical knowledge and can be employed in healthcare in numerous ways, including as image interpreters, virtual scribes, and general decision support systems. However, here, we demonstrate that current VLMs applied to medical tasks exhibit a fundamental security flaw: they can be compromised by prompt injection attacks. These can be used to output harmful information just by interacting with the VLM, without any access to its parameters. We perform a quantitative study to evaluate the vulnerabilities to these attacks in four state of the art VLMs: Claude-3 Opus, Claude-3.5 Sonnet, Reka Core, and GPT-4o. Using a set of N = 594 attacks, we show that all of these models are susceptible. Specifically, we show that embedding sub-visual prompts in manifold medical imaging data can cause the model to provide harmful output, and that these prompts are non-obvious to human observers. Thus, our study demonstrates a key vulnerability in medical VLMs which should be mitigated before widespread clinical adoption.
We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics. The model unifies the embeddings of concepts at different levels: objects, attributes, relations, and full scenes. We view the sentential semantics as a combination of different semantic components such as objects and relations; their embeddings are aligned with different image regions. A contrastive learning approach is proposed for the effective learning of this fine-grained alignment from only image-caption pairs. We also present a simple yet effective approach that enforces the coverage of caption embeddings on the semantic components that appear in the sentence. We demonstrate that the Unified VSE outperforms baselines on cross-modal retrieval tasks; the enforcement of the semantic coverage improves the model's robustness in defending text-domain adversarial attacks. Moreover, our model empowers the use of visual cues to accurately resolve word dependencies in novel sentences.
Vision-Language Pre-training (VLP) models have achieved remarkable success in practice, yet they are easily misled by adversarial attacks. Though harmful, adversarial attacks are valuable in revealing the blind spots of VLP models and promoting their robustness. However, existing adversarial attack studies pay insufficient attention to the key roles of different modality-correlated features, leading to unsatisfactory transferable attack performance. To tackle this issue, we propose the Transferable MultiModal (TMM) attack framework, which exploits both modality-consistency and modality-discrepancy features. To promote transferability, we propose attention-directed feature perturbation to disturb the modality-consistency features in critical attention regions. Since the commonly employed cross-attention captures features that are consistent across diverse models, perturbing these features makes it more likely to mislead similar model perceptions and thus activate stronger transferability. To improve attack ability, we propose orthogonal-guided feature heterogenization, which guides the adversarial perturbation to contain more modality-discrepancy features in the encoded embeddings. Since VLP models rely heavily on features aligned across modalities during decision-making, increasing the modality discrepancy confuses the learned representation and further strengthens the attack. Extensive experiments under diverse settings demonstrate that the proposed TMM outperforms the compared methods by large margins, i.e., a 20.47% improvement in transferable attack ability on average. Moreover, we highlight that TMM also shows outstanding attack performance on large models such as MiniGPT-4, Otter, etc.
Pre-trained vision-language models have inspired much research on few-shot learning. However, with only a few training images, there exist two crucial problems: (1) the visual feature distributions are easily distracted by class-irrelevant information in images, and (2) the alignment between the visual and language feature distributions is difficult. To deal with the distraction problem, we propose a Selective Attack module, which consists of trainable adapters that generate spatial attention maps of images to guide the attacks on class-irrelevant image areas. By messing up these areas, the critical features are captured and the visual distributions of image features are calibrated. To better align the visual and language feature distributions that describe the same object class, we propose a cross-modal distribution alignment module, in which we introduce a vision-language prototype for each class to align the distributions, and adopt the Earth Mover's Distance (EMD) to optimize the prototypes. For efficient computation, the upper bound of EMD is derived. In addition, we propose an augmentation strategy to increase the diversity of the images and the text prompts, which can reduce overfitting to the few-shot training images. Extensive experiments on 11 datasets demonstrate that our method consistently outperforms prior arts in few-shot learning. The implementation code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/SADA.
Multi-modal embeddings encode texts, images, thermal images, sounds, and videos into a single embedding space, aligning representations across different modalities (e.g., associate an image of a dog with a barking sound). In this paper, we show that multi-modal embeddings can be vulnerable to an attack we call "adversarial illusions." Given an image or a sound, an adversary can perturb it to make its embedding close to an arbitrary, adversary-chosen input in another modality. These attacks are cross-modal and targeted: the adversary can align any image or sound with any target of his choice. Adversarial illusions exploit proximity in the embedding space and are thus agnostic to downstream tasks and modalities, enabling a wholesale compromise of current and future tasks, as well as modalities not available to the adversary. Using ImageBind and AudioCLIP embeddings, we demonstrate how adversarially aligned inputs, generated without knowledge of specific downstream tasks, mislead image generation, text generation, zero-shot classification, and audio retrieval. We investigate transferability of illusions across different embeddings and develop a black-box version of our method that we use to demonstrate the first adversarial alignment attack on Amazon's commercial, proprietary Titan embedding. Finally, we analyze countermeasures and evasion attacks.
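A minimal sketch of the cross-modal alignment objective behind such "adversarial illusions": perturb an image until its embedding approaches an attacker-chosen target embedding from another modality. `image_encoder` here is a toy differentiable placeholder, not ImageBind or AudioCLIP, and the bound, step size, and step count are arbitrary:

```python
# Push the image embedding toward an arbitrary target embedding by
# gradient descent on (1 - cosine similarity) under a small L_inf budget.
import torch
import torch.nn.functional as F

def image_encoder(image: torch.Tensor) -> torch.Tensor:
    return image.flatten().unsqueeze(0)[:, :512]   # placeholder embedding

def align_to_target(image: torch.Tensor, target_emb: torch.Tensor,
                    eps: float = 16 / 255, steps: int = 200):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        emb = image_encoder((image + delta).clamp(0, 1))
        loss = 1 - F.cosine_similarity(emb, target_emb).mean()  # pull embeddings together
        loss.backward()
        with torch.no_grad():
            delta -= (1 / 255) * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

illusion = align_to_target(torch.rand(3, 224, 224), torch.randn(1, 512))
```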
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.
Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application. While textual inputs are often effectively safeguarded, adversarial visual inputs can easily bypass VLM defense mechanisms. Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to simultaneously ensure safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): 1) Evaluating input visual contents and output responses to establish a robust safety awareness in multimodal settings, and 2) Aligning unsafe behaviors at both shallow and deep levels by conditioning the VLMs' generative distribution with an interference prefix and performing sentence-level best-of-N to search the most harmless and helpful generation paths. Extensive experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency, reducing the unsafe rate by 87.5% in cross-modality attacks and achieving 96.6% win-ties in GPT-4 helpfulness evaluation. The code is publicly available at https://github.com/DripNowhy/ETA.
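The sentence-level best-of-N step can be illustrated with a few lines of generic sampling-and-scoring code; `generate` and `score` below are hypothetical placeholders, not ETA's actual generator or its harmlessness/helpfulness evaluator:

```python
# Sample N candidate continuations and keep the one rated most harmless
# and helpful by a scorer.
import random

def generate(prompt: str) -> str:
    # Placeholder for one sampled continuation from the aligned model.
    return random.choice(["safe helpful answer", "evasive answer", "unsafe answer"])

def score(prompt: str, candidate: str) -> float:
    # Placeholder reward: penalize unsafe text, mildly prefer longer answers.
    return -1.0 if "unsafe" in candidate else len(candidate) / 100.0

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

print(best_of_n("describe this image safely"))
```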
We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker's instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT.
Generative, multimodal artificial intelligence (GenAI) offers transformative potential across industries, but its misuse poses significant risks. Prior research has shed light on the potential of advanced AI systems to be exploited for malicious purposes. However, we still lack a concrete understanding of how GenAI models are specifically exploited or abused in practice, including the tactics employed to inflict harm. In this paper, we present a taxonomy of GenAI misuse tactics, informed by existing academic literature and a qualitative analysis of approximately 200 observed incidents of misuse reported between January 2023 and March 2024. Through this analysis, we illuminate key and novel patterns in misuse during this time period, including potential motivations, strategies, and how attackers leverage and abuse system capabilities across modalities (e.g. image, text, audio, video) in the wild.
The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. However, small changes and design choices in the prompt can lead to significant differences in the output. In this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. These prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular object or generating high perplexity text.
Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. We open source HarmBench at https://github.com/centerforaisafety/HarmBench.
Safety is critical to the usage of large language models (LLMs). Multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen LLM safety. However, currently known techniques presume that corpora used for safety alignment of LLMs are solely interpreted by semantics. This assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in LLMs. For example, users of forums often use ASCII art, a form of text-based art, to convey image information. In this paper, we propose a novel ASCII art-based jailbreak attack and introduce a comprehensive benchmark Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that cannot be solely interpreted by semantics. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this observation, we develop the jailbreak attack ArtPrompt, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and elicit undesired behaviors from LLMs. ArtPrompt only requires black-box access to the victim LLMs, making it a practical attack. We evaluate ArtPrompt on five SOTA LLMs, and show that ArtPrompt can effectively and efficiently induce undesired behaviors from all five LLMs. Our code is available at https://github.com/uw-nsl/ArtPrompt.
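A toy illustration of the masking idea: remove the sensitive word from the prompt and supply it as ASCII art instead, so the literal token never appears in plain text. It assumes the third-party pyfiglet package is installed and uses a benign word purely for illustration:

```python
# Replace a flagged word with [MASK] and append an ASCII-art rendering of it.
import pyfiglet

def ascii_mask(prompt: str, sensitive_word: str) -> str:
    art = pyfiglet.figlet_format(sensitive_word)        # render the word as ASCII art
    masked = prompt.replace(sensitive_word, "[MASK]")
    return f"{masked}\nThe masked word is drawn below in ASCII art:\n{art}"

print(ascii_mask("explain how a firework is assembled", "firework"))
```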
Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where, regardless of the combination of input modalities, all inputs are ultimately fused into the language space, we aim to explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates the transferability -- textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8% and in some cases, even as low as nearly 2% for both text-based and vision-text-based attacks, alongside preserving the utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but incurs significantly increased computational demands, possibly up to 6 times higher.
Text-conditioned image generation has made significant progress in recent years with generative adversarial networks and more recently, diffusion models. While diffusion models conditioned on text prompts have produced impressive and high-quality images, accurately representing complex text prompts such as the number of instances of a specific object remains challenging. To address this limitation, we propose a novel guidance approach for the sampling process in the diffusion model that leverages bounding box and segmentation map information at inference time without additional training data. Through a novel loss in the sampling process, our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints, leading to high-resolution images that accurately represent the scene. To obtain bounding box and segmentation map information, we structure the text prompt as a scene graph and enrich the nodes with CLIP embeddings. Our proposed model achieves state-of-the-art performance on two public benchmarks for image generation from scene graphs, surpassing both scene graph to image and text-based diffusion models in various metrics. Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process for more accurate text-to-image generation. Project Page: scenegenie.github.io/SceneGenie/
The large-scale visual-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning without human annotations follows a text-only training paradigm, i.e., reconstructing text from shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to "see" real visual modality. Particularly, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus and CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded based on the caption naturalness to human language calculated from the GAN's discriminator and the semantic guidance reward computed by the CLIP-based reward module. In addition to the cosine similarity as the semantic guidance reward (i.e., CLIP-cos), we further introduce a novel semantic guidance reward called CLIP-agg, which aligns the generated caption with a weighted text embedding by attentively aggregating the entire corpus. Experimental results on three subtasks (ZS-IC, In-UIC and Cross-UIC) show that CgT-GAN outperforms state-of-the-art methods significantly across all metrics. Code is available at https://github.com/Lihr747/CgtGAN.
Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model's safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored vulnerability in multimodal alignment and introduce a fundamentally new direction for jailbreak approaches.
In this paper, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs. To address these limitations, we propose a novel blue-team method BlueSuffix that defends target VLMs against jailbreak attacks without compromising their performance under the black-box setting. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator using reinforcement fine-tuning for enhancing cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks. Code is available at https://github.com/Vinsonzyh/BlueSuffix.
Modern text-to-image (T2I) generation systems (e.g., DALL·E 3) exploit the memory mechanism, which captures key information in multi-turn interactions for faithful generation. Despite its practicality, the security analysis of this mechanism has fallen far behind. In this paper, we reveal that it can exacerbate the risk of jailbreak attacks. Previous attacks fuse the unsafe target prompt into one ultimate adversarial prompt, which can be easily detected or lead to the generation of non-unsafe images due to under- or over-detoxification. In contrast, we propose embedding the malice at the inception of the chat session in memory, addressing the above limitations. Specifically, we propose Inception, the first multi-turn jailbreak attack against real-world text-to-image generation systems that explicitly exploits their memory mechanisms. Inception is composed of two key modules: segmentation and recursion. We introduce Segmentation, a semantic-preserving method that generates multi-round prompts. By leveraging NLP analysis techniques, we design policies to decompose a prompt, together with its malicious intent, according to sentence structure, thereby evading safety filters. Recursion further addresses the challenge posed by unsafe sub-prompts that cannot be separated through simple segmentation. It first expands the sub-prompt and then invokes segmentation recursively. To facilitate the crafting of multi-turn adversarial prompts, we build VisionFlow, an emulation T2I system that integrates two-stage safety filters and industrial-grade memory mechanisms. The experimental results show that Inception successfully induces unsafe image generation, surpassing the SOTA by a 20.0% margin in attack success rate. We also conduct experiments on real-world commercial T2I generation platforms, further validating the threat of Inception in practice.
Recent advances in end-to-end trained omni-models have significantly improved multimodal understanding. At the same time, safety red-teaming has expanded beyond text to encompass audio-based jailbreak attacks. However, an important bridge between textual and audio jailbreaks remains underexplored. In this work, we study the cross-modality transfer of jailbreak attacks from text to audio, motivated by the semantic similarity between the two modalities and the maturity of textual jailbreak methods. We first analyze the connection between modality alignment and cross-modality jailbreak transfer, showing that strong alignment can inadvertently propagate textual vulnerabilities to the audio modality, which we term the alignment curse. Guided by this analysis, we conduct an empirical evaluation of textual jailbreaks, text-transferred audio jailbreaks, and existing audio-based jailbreaks on recent omni-models. Our results show that text-transferred audio jailbreaks perform comparably to, and often better than, audio-based jailbreaks, establishing them as simple yet powerful baselines for future audio red-teaming. We further demonstrate strong cross-model transferability and show that text-transferred audio attacks remain effective even under a stricter audio-only access threat model.
Recent large vision-language models (LVLMs) have advanced capabilities in visual question answering (VQA). However, interpreting where LVLMs direct their visual attention remains a significant challenge, yet is essential for understanding model behavior. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals that support open-ended generation. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps for interpreting cross-modal reasoning, outperforming prior methods in faithfulness and pushing the state-of-the-art in human-attention alignment. We demonstrate an analytic approach to uncover fine-grained insights into LVLM cross-modal attribution, trace reasoning dynamics, analyze systematic misalignment, diagnose hallucination and bias, and ensure transparency.
Despite remarkable successes in unimodal learning tasks, backdoor attacks against cross-modal learning are still underexplored due to the limited generalization and inferior stealthiness when involving multiple modalities. Notably, since works in this area mainly inherit ideas from unimodal visual attacks, they struggle with dealing with diverse cross-modal attack circumstances and manipulating imperceptible trigger samples, which hinders their practicability in real-world applications. In this paper, we introduce a novel bilateral backdoor to fill in the missing pieces of the puzzle in the cross-modal backdoor and propose a generalized invisible backdoor framework against cross-modal learning (BadCM). Specifically, a cross-modal mining scheme is developed to capture the modality-invariant components as target poisoning areas, where well-designed trigger patterns injected into these regions can be efficiently recognized by the victim models. This strategy is adapted to different image-text cross-modal models, making our framework available to various attack scenarios. Furthermore, for generating poisoned samples of high stealthiness, we conceive modality-specific generators for visual and linguistic modalities that facilitate hiding explicit trigger patterns in modality-invariant regions. To the best of our knowledge, BadCM is the first invisible backdoor method deliberately designed for diverse cross-modal attacks within one unified framework. Comprehensive experimental evaluations on two typical applications, i.e., cross-modal retrieval and VQA, demonstrate the effectiveness and generalization of our method under multiple kinds of attack scenarios. Moreover, we show that BadCM can robustly evade existing backdoor defenses. Our code is available at https://github.com/xandery-geek/BadCM.
In recent years, despite significant advancements in adversarial attack research, the security challenges in cross-modal scenarios, such as the transferability of adversarial attacks between infrared, thermal, and RGB images, have been overlooked. These heterogeneous image modalities collected by different hardware devices are widely prevalent in practical applications, and the substantial differences between modalities pose significant challenges to attack transferability. In this work, we explore a novel cross-modal adversarial attack strategy, termed multiform attack. We propose a dual-layer optimization framework based on gradient-evolution, facilitating efficient perturbation transfer between modalities. In the first layer of optimization, the framework utilizes image gradients to learn universal perturbations within each modality and employs evolutionary algorithms to search for shared perturbations with transferability across different modalities through secondary optimization. Through extensive testing on multiple heterogeneous datasets, we demonstrate the superiority and robustness of Multiform Attack compared to existing techniques. This work not only enhances the transferability of cross-modal adversarial attacks but also provides a new perspective for understanding security vulnerabilities in cross-modal systems.
This paper focuses on jailbreaking attacks against large language models (LLMs), eliciting them to generate objectionable content in response to harmful user queries. Unlike previous LLM-jailbreak methods that directly target LLMs, our approach begins by constructing a multimodal large language model (MLLM) built upon the target LLM. Subsequently, we perform an efficient MLLM jailbreak and obtain a jailbreaking embedding. Finally, we convert the embedding into a textual jailbreaking suffix to carry out the jailbreak of the target LLM. Compared to direct LLM-jailbreak methods, our indirect jailbreaking approach is more efficient, as MLLMs are more vulnerable to jailbreak than pure LLMs. Additionally, to improve the attack success rate of the jailbreak, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness. Moreover, our approach exhibits superior cross-class generalization abilities.
Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications, yet their ability to make context-aware safety decisions remains limited. Existing methods often fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks), leaving a persistent gap in safety alignment. To address this issue, we introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context. SafeCoDe operates in two stages: (1) a contrastive decoding mechanism that highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images, and (2) a global-aware token modulation strategy that integrates scene-level reasoning with token-level adjustment to adapt refusals according to the predicted safety verdict. Extensive experiments across diverse MLLM architectures and safety benchmarks, covering undersensitivity, oversensitivity, and general safety evaluations, show that SafeCoDe consistently improves context-sensitive refusal behaviors while preserving model helpfulness.
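A schematic of the contrastive-decoding step the abstract describes: compare next-token logits computed with the real image against logits from a Gaussian-noised copy, and amplify the difference so tokens that depend on the visual context are highlighted. `mllm_logits`, the noise level, and the exact combination rule with weight `alpha` are illustrative assumptions, not SafeCoDe's actual formulation:

```python
# One decoding step: contrast logits conditioned on the clean image against
# logits conditioned on a Gaussian-noised image, then pick the next token.
import torch

def mllm_logits(image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
    return torch.randn(32000)                        # placeholder vocabulary logits

def contrastive_step(image: torch.Tensor, text_ids: torch.Tensor, alpha: float = 1.0) -> int:
    noisy = image + 0.3 * torch.randn_like(image)    # Gaussian-noised visual context
    clean_logits = mllm_logits(image, text_ids)
    noisy_logits = mllm_logits(noisy, text_ids)
    contrasted = clean_logits + alpha * (clean_logits - noisy_logits)
    return int(contrasted.argmax())                  # next token id

next_token = contrastive_step(torch.rand(3, 336, 336), torch.tensor([1, 2, 3]))
```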
Various (text) prompt filters and (image) safety checkers have been implemented to mitigate the misuse of Text-to-Image (T2I) models in creating Not-Safe-For-Work (NSFW) content. In order to expose potential security vulnerabilities of such safeguards, multimodal jailbreaks have been studied. However, existing jailbreaks are limited to prompt-specific and image-specific perturbations, which suffer from poor scalability and time-consuming optimization. To address these limitations, we propose Universally Unfiltered and Unseen (U3)-Attack, a multimodal jailbreak attack method against T2I safeguards. Specifically, U3-Attack optimizes an adversarial patch on the image background to universally bypass safety checkers and optimizes a safe paraphrase set from a sensitive word to universally bypass prompt filters while eliminating redundant computations. Extensive experimental results demonstrate the superiority of our U3-Attack on both open-source and commercial T2I models. For example, on the commercial Runway-inpainting model with both a prompt filter and a safety checker, our U3-Attack achieves about 4x higher success rates than the state-of-the-art multimodal jailbreak attack, MMA-Diffusion.
Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.
Research efforts have been made to learn robust Multimodal Large Language Models (MLLMs) against jailbreak attacks. However, existing methods for improving MLLMs' robustness still face critical challenges: (1) how to efficiently tune massive weight parameters and (2) how to ensure robustness against attacks across both the visual and textual modalities. To this end, we propose an Efficient End-to-end Adversarial Training (E2AT) framework for both visual and textual adversarial attacks. Specifically, for the visual aspect, E2AT incorporates an efficient projector-based AT module that aligns the attack samples at the feature level. For the training objectives, we propose a Dynamic Joint Multimodal Optimization (DJMO) strategy to enhance generalization against jailbreak attacks by dynamically adjusting weights between normal and adversarial objectives. Extensive experiments are conducted with five major jailbreak attack methods across three mainstream MLLMs. Results demonstrate that E2AT achieves state-of-the-art performance, outperforming existing baselines by an average margin of 34% across text and image modalities while maintaining clean task performance. Furthermore, evaluations on real-world embodied intelligent systems highlight the practical applicability of E2AT, paving the way for the development of more secure and reliable multimodal systems. Our code is available at https://anonymous.4open.science/r/E2AT_568.
By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards. Yet they fall short in uncovering the root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibiting overdefensive behaviors and imposing heavy training overhead. To bridge this gap, we present a comprehensive analysis of where, how, and which harmful multimodal tokens bypass safeguards in MLLMs. Surprisingly, we find that less than 1% of tokens in early-to-middle layers are responsible for inducing unsafe behaviors, highlighting that precisely removing a small subset of harmful tokens, without requiring safety tuning, can still effectively improve safety against jailbreaks. Motivated by this, we propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers. Without incurring additional computational overhead, SafePTR significantly enhances the safety of MLLMs while preserving efficiency. Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR's state-of-the-art performance in mitigating jailbreak risks without compromising utility.
Multimodal Large Language Models (MLLMs) have become powerful and widely adopted in some practical applications. However, recent research has revealed their vulnerability to multimodal jailbreak attacks, whereby the model can be induced to generate harmful content, leading to safety risks. Although most MLLMs have undergone safety alignment, recent research shows that the visual modality is still vulnerable to jailbreak attacks. In our work, we discover that by using flowcharts with partially harmful information, MLLMs can be induced to provide additional harmful details. Based on this, we propose a jailbreak attack method based on auto-generated flowcharts, FC-Attack. Specifically, FC-Attack first fine-tunes a pre-trained LLM on benign datasets to create a step-description generator. The generator is then used to produce step descriptions corresponding to a harmful query, which are rendered as flowcharts in three different shapes (vertical, horizontal, and S-shaped) and used as visual prompts. These flowcharts are then combined with a benign textual prompt to execute the jailbreak attack on MLLMs. Our evaluations on AdvBench show that FC-Attack attains an attack success rate of up to 96% via images and up to 78% via videos across multiple MLLMs. Additionally, we investigate factors affecting the attack performance, including the number of steps and the font styles in the flowcharts. We also find that FC-Attack can improve the jailbreak performance from 4% to 28% on Claude-3.5 by changing the font style. To mitigate the attack, we explore several defenses and find that AdaShield can largely reduce the jailbreak performance, but at the cost of a utility drop.
This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarially crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.
In recent years, vision and language pre-training (VLP) models have advanced the state-of-the-art results in a variety of cross-modal downstream tasks. Aligning cross-modal semantics is claimed to be one of the essential capabilities of VLP models. However, the inner working mechanism of alignment in VLP models still remains unclear. In this paper, we propose a new probing method based on image captioning to empirically study the cross-modal semantics alignment of VLP models. Our probing method is built upon the fact that, given an image-caption pair, a VLP model will give a score indicating how well the two modalities are aligned; maximizing such scores will generate sentences that the VLP model believes are well aligned. Analyzing these sentences thus reveals in what way different modalities are aligned and how good these alignments are in VLP models. We apply our probing method to five popular VLP models, including UNITER, ROSITA, ViLBERT, CLIP, and LXMERT, and provide a comprehensive analysis of the generated captions guided by these models. Our results show that VLP models (1) focus more on just aligning objects with visual words, while neglecting global semantics; (2) prefer fixed sentence patterns, thus ignoring more important textual information including fluency and grammar; and (3) deem that captions with more visual words are better aligned with images. These findings indicate that VLP models still have weaknesses in cross-modal semantics alignment, and we hope this work will draw researchers' attention to such problems when designing a new VLP model.
CLIP (Contrastive Language-Image Pretraining) is an efficient method for learning computer vision tasks from natural language supervision that has powered a recent breakthrough in deep learning due to its zero-shot transfer capabilities. By training from image-text pairs available on the internet, the CLIP model transfers non-trivially to most tasks without the need for any dataset-specific training. In this work, we use CLIP to implement the engine of the popular game "Guess who?", so that the player interacts with the game using natural language prompts and CLIP automatically decides whether an image on the game board fulfills that prompt or not. We study the performance of this approach by benchmarking different ways of prompting the questions to CLIP, and show the limitations of its zero-shot capabilities.
Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The text token length is restricted to 77, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and aligns with the CLIP latent space, making it readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities, including (1) a knowledge-preserved stretching of the positional embedding and (2) a primary component matching of CLIP features. By leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and by 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
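One way to picture the positional-embedding stretching is simple interpolation of the later positions while keeping the earliest ones fixed; the split point of 20 below follows the effective-length observation mentioned in the abstract, but the whole sketch is an assumption rather than Long-CLIP's exact recipe:

```python
# Keep the first `keep` (well-trained) positions and linearly interpolate the
# remaining positional embeddings to cover a longer context.
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor, new_len: int, keep: int = 20) -> torch.Tensor:
    kept = pos_emb[:keep]                                          # leave early positions untouched
    tail = pos_emb[keep:].T.unsqueeze(0)                           # (1, dim, old_len - keep)
    stretched = F.interpolate(tail, size=new_len - keep, mode="linear", align_corners=True)
    return torch.cat([kept, stretched.squeeze(0).T], dim=0)        # (new_len, dim)

long_pos = stretch_positional_embedding(torch.randn(77, 512), new_len=248)
print(long_pos.shape)  # torch.Size([248, 512])
```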
Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for a finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which can be indicated as points, masks, or boxes by humans or perception models. To fulfill these requirements, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel to suggest attentive regions, fine-tuned on millions of constructed RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D / 3D generation. It has a strong potential to serve as a versatile tool for image-related tasks.
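The alpha-channel idea can be illustrated with a small sketch that widens a CLIP ViT patch-embedding convolution from 3 to 4 input channels and zero-initializes the new channel, so behaviour on RGB-only input is preserved before fine-tuning; this is an assumption about the mechanics, not Alpha-CLIP's actual implementation.

```python
import torch
import torch.nn as nn

def add_alpha_channel(rgb_conv: nn.Conv2d) -> nn.Conv2d:
    """Build a 4-channel (RGBA) patch-embedding conv from a 3-channel one.
    Alpha-channel weights start at zero, so the output on RGB-only input is
    unchanged until the model is fine-tuned on RGBA region-text pairs."""
    rgba_conv = nn.Conv2d(4, rgb_conv.out_channels,
                          kernel_size=rgb_conv.kernel_size,
                          stride=rgb_conv.stride,
                          bias=rgb_conv.bias is not None)
    with torch.no_grad():
        rgba_conv.weight.zero_()
        rgba_conv.weight[:, :3] = rgb_conv.weight      # copy the RGB filters
        if rgb_conv.bias is not None:
            rgba_conv.bias.copy_(rgb_conv.bias)
    return rgba_conv

# Usage (hypothetical attribute name): clip_vit.conv1 = add_alpha_channel(clip_vit.conv1)
```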
Text-to-image synthesis, a subfield of multimodal generation, has gained significant attention in recent years. We propose a cost-effective approach for image-to-prompt generation that leverages generative models to generate textual prompts without the need for large amounts of annotated data. Our method combines the CLIP model with the K-nearest neighbors (KNN) algorithm and is divided into two stages: an offline stage and an online stage. Our method achieves the highest score of 0.612 among these models, which is 0.013, 0.055, and 0.011 higher than CLIP and CLIP + KNN (top 10), respectively.
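A minimal sketch of the offline/online retrieval idea: embed a prompt corpus offline with CLIP's text encoder, then, online, return the top-k prompts nearest to the query image embedding. The prompt corpus, the value of k, and cosine-similarity ranking are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def topk_prompts(image_feat: torch.Tensor,
                 prompt_feats: torch.Tensor,
                 prompts: list[str],
                 k: int = 10) -> list[str]:
    """Offline: `prompt_feats` (M, d) are CLIP text embeddings of a prompt corpus.
    Online: given a CLIP image embedding `image_feat` (d,), return its k nearest
    prompts by cosine similarity."""
    image_feat = F.normalize(image_feat, dim=-1)
    prompt_feats = F.normalize(prompt_feats, dim=-1)
    sims = prompt_feats @ image_feat                 # (M,) cosine similarities
    idx = sims.topk(k).indices.tolist()
    return [prompts[i] for i in idx]
```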
Contrastive Language-Image Pre-training (CLIP) formulates image classification as an image-to-text matching task, i.e., matching images to the corresponding natural language descriptions instead of discrete category IDs. This allows for open-vocabulary visual recognition, where the model can recognize images from an open class set (also known as an open vocabulary) in a zero-shot manner. However, evaluating the openness of CLIP-like models is challenging, as the models are open to arbitrary vocabulary in theory, but their accuracy varies in practice. To address this, we resort to an incremental perspective to assess the openness through vocabulary expansions, and define extensibility to measure a model's ability to handle novel classes. Our evaluation shows that CLIP-like models are not truly open, and their performance deteriorates as the vocabulary expands. We further dissect the feature space of CLIP from the perspectives of representation alignment and uniformity. Our investigation reveals that the overestimation of openness is due to confusion among competing text features, rather than a failure to capture the similarity between image features and text features of novel classes. We hope that our investigation and analysis will facilitate future research on the CLIP openness issue.
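The extensibility evaluation can be approximated by a simple procedure: classify a fixed set of images in a zero-shot manner while the candidate vocabulary grows, and record how accuracy deteriorates. The sketch below assumes precomputed, L2-normalized CLIP features; the staging scheme is illustrative rather than the paper's protocol.

```python
import torch

@torch.no_grad()
def accuracy_under_vocab_expansion(image_feats: torch.Tensor,      # (N, d), L2-normalized
                                   labels: torch.Tensor,           # (N,) global class ids
                                   class_text_feats: torch.Tensor, # (C_total, d), L2-normalized
                                   stages: list[list[int]]) -> list[float]:
    """Each stage is a growing list of class ids (the expanding vocabulary).
    Returns zero-shot accuracy on the same images at each stage."""
    accs = []
    for class_ids in stages:
        ids = torch.as_tensor(class_ids, device=image_feats.device)
        sims = image_feats @ class_text_feats[ids].t()   # (N, C_stage)
        preds = ids[sims.argmax(dim=1)]
        accs.append((preds == labels).float().mean().item())
    return accs
```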
We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach, to reduce the memory consumption of contrastive loss when training contrastive learning models. Our approach decomposes the contrastive loss and its gradient computation into two parts, one to calculate the intra-GPU gradients and the other to compute the inter-GPU gradients. According to our decomposition, only the intra-GPU gradients are computed on the current GPU, while the inter-GPU gradients are collected via all_reduce from other GPUs instead of being repeatedly computed on every GPU. In this way, we can reduce the GPU memory consumption of contrastive loss computation from $\mathcal{O}(B^2)$ to $\mathcal{O}(\frac{B^2}{N})$, where $B$ and $N$ are the batch size and the number of GPUs used for training. Such a distributed solution is mathematically equivalent to the original non-distributed contrastive loss computation, without sacrificing any computation accuracy. It is particularly efficient for large-batch CLIP training. For instance, DisCo-CLIP can enable contrastive training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64 A100 40GB GPUs, compared with the original CLIP solution which requires 128 A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K. The code will be released at https://github.com/IDEA-Research/DisCo-CLIP.
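A conceptual sketch of the memory decomposition: each GPU materializes only its (B/N × B) similarity block instead of the full B × B matrix. Note that a plain `all_gather` does not propagate gradients; the paper's actual solution decomposes the gradient into an intra-GPU part computed locally and an inter-GPU part exchanged via `all_reduce`, which this sketch deliberately omits.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def local_block_clip_loss(img_local, txt_local, rank, world_size, temperature=0.07):
    """Illustrative memory-saving sketch only, not DisCo-CLIP's implementation:
    compute a (B/N x B) similarity block per GPU rather than the full (B x B) matrix."""
    gathered_img = [torch.empty_like(img_local) for _ in range(world_size)]
    gathered_txt = [torch.empty_like(txt_local) for _ in range(world_size)]
    dist.all_gather(gathered_img, img_local)   # no gradient flows through all_gather
    dist.all_gather(gathered_txt, txt_local)
    all_img = torch.cat(gathered_img)          # (B, d) global image features
    all_txt = torch.cat(gathered_txt)          # (B, d) global text features

    logits_i2t = img_local @ all_txt.t() / temperature   # (B/N, B)
    logits_t2i = txt_local @ all_img.t() / temperature   # (B/N, B)
    b = img_local.size(0)
    targets = torch.arange(b, device=img_local.device) + rank * b
    return 0.5 * (F.cross_entropy(logits_i2t, targets) +
                  F.cross_entropy(logits_t2i, targets))
```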
Contrastive Language-Image Pre-training (CLIP) models have shown significant potential, particularly in zero-shot classification across diverse distribution shifts. Building on existing evaluations of overall classification robustness, this work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives. First, we investigate their robustness to variations in specific visual factors. Second, we assess two critical safety objectives--confidence uncertainty and out-of-distribution detection--beyond mere classification accuracy. Third, we evaluate the finesse with which CLIP models bridge the image and text modalities. Fourth, we extend our examination to 3D awareness in CLIP models, moving beyond traditional 2D image understanding. Finally, we explore the interaction between vision and language encoders within modern large multimodal models (LMMs) that utilize CLIP as the visual backbone, focusing on how this interaction impacts classification robustness. In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts. Our study uncovers several previously unknown insights into CLIP. For instance, the architecture of the visual encoder in CLIP plays a significant role in their robustness against 3D corruption. CLIP models tend to exhibit a bias towards shape when making predictions. Moreover, this bias tends to diminish after fine-tuning on ImageNet. Vision-language models like LLaVA, leveraging the CLIP vision encoder, could exhibit benefits in classification performance for challenging categories over CLIP alone. Our findings are poised to offer valuable guidance for enhancing the robustness and reliability of CLIP models.
Despite CLIP being the foundation model in numerous vision-language applications, CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to "parrot" the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell out) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content, and around 30% of caption words appear in this embedded visual content. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is the dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.
Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly deployed in robotic environments but remain vulnerable to jailbreaking attacks that bypass safety mechanisms and drive unsafe or physically harmful behaviors in the real world. Data-driven defenses such as jailbreak classifiers show promise, yet they struggle to generalize in domains where specialized datasets are scarce, limiting their effectiveness in robotics and other safety-critical contexts. To address this gap, we introduce J-DAPT, a lightweight framework for multimodal jailbreak detection through attention-based fusion and domain adaptation. J-DAPT integrates textual and visual embeddings to capture both semantic intent and environmental grounding, while aligning general-purpose jailbreak datasets with domain-specific reference data. Evaluations across autonomous driving, maritime robotics, and quadruped navigation show that J-DAPT boosts detection accuracy to nearly 100% with minimal overhead. These results demonstrate that J-DAPT provides a practical defense for securing VLMs in robotic applications. Additional materials are made available at: https://j-dapt.github.io.
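A minimal sketch of an attention-based fusion detector in the spirit of the abstract: text token embeddings attend over image patch embeddings, and a small head classifies the query as benign or jailbreak. The dimensions, pooling, and classifier head are assumptions, and the domain-adaptation component is omitted.

```python
import torch
import torch.nn as nn

class FusionJailbreakDetector(nn.Module):
    """Illustrative architecture only: fuse text and image embeddings with
    cross-attention, then classify the query as benign vs. jailbreak."""
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2))

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, T, dim) token embeddings; image_emb: (B, P, dim) patch embeddings
        fused, _ = self.cross_attn(query=text_emb, key=image_emb, value=image_emb)
        pooled = fused.mean(dim=1)        # (B, dim) pooled fused representation
        return self.classifier(pooled)    # (B, 2) benign / jailbreak logits
```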
MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.
Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to an LLM can realize Vision Language Models (VLMs). However, existing research shows that the visual modality of VLMs is vulnerable, with attackers easily bypassing the LLM's safety alignment through visual modality features to launch attacks. To address this issue, we enhance existing VLMs' visual modality safety alignment by adding safety modules, including a safety projector, safety tokens, and a safety head, through a two-stage training process, effectively improving the model's defense against risky images. For example, building upon the LLaVA-v1.5 model, we achieve a safety score of 8.26, surpassing GPT-4V on the Red Teaming Visual Language Models (RTVLM) benchmark. Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance. Moreover, our alignment strategy also uncovers some possible risky content within commonly used open-source multimodal datasets. Our code will be open sourced after the anonymous review.
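A minimal sketch of what the three named components might look like structurally (a safety projector over visual features, learnable safety tokens, and a safety head scoring risk); the dimensions, token count, and pooling are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class SafetyModules(nn.Module):
    """Hypothetical sketch of the three components named in the abstract."""
    def __init__(self, vis_dim: int, llm_dim: int,
                 n_safety_tokens: int = 8, n_risk_classes: int = 2):
        super().__init__()
        self.safety_projector = nn.Linear(vis_dim, llm_dim)
        self.safety_tokens = nn.Parameter(torch.randn(n_safety_tokens, llm_dim) * 0.02)
        self.safety_head = nn.Linear(llm_dim, n_risk_classes)

    def forward(self, vis_feats: torch.Tensor):
        # vis_feats: (B, N, vis_dim) visual encoder features
        proj = self.safety_projector(vis_feats)               # (B, N, llm_dim)
        toks = self.safety_tokens.expand(proj.size(0), -1, -1)
        seq = torch.cat([toks, proj], dim=1)                  # prepend safety tokens
        risk_logits = self.safety_head(seq.mean(dim=1))       # pooled risk score
        return seq, risk_logits
```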
The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response). In terms of depth, the responses are collected from 12 open-source (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity. The construction of preference data is fully automated, and the experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities. SPA-VL, as a large-scale, high-quality, and diverse dataset, represents a significant milestone in ensuring that VLMs achieve both harmlessness and helpfulness.
Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs form Vision Language Models (VLMs). However, recent research shows that the visual modality in VLMs is highly vulnerable, allowing attackers to bypass safety alignment in LLMs through visually transmitted content and launch harmful attacks. To address this challenge, we propose a progressive concept-based alignment strategy, PSA-VLM, which incorporates safety modules as concept bottlenecks to enhance visual modality safety alignment. By aligning model predictions with specific safety concepts, we improve defenses against risky images, enhancing explainability and controllability while minimally impacting general performance. Our method is trained in two stages: the first stage, at low computational cost, already yields a very effective performance improvement, and fine-tuning the language model in the second stage further improves safety. Our method achieves state-of-the-art results on popular VLM safety benchmarks.
The safety alignment ability of Vision-Language Models (VLMs) is prone to be degraded by the integration of the vision module compared to its LLM backbone. We investigate this phenomenon, dubbed "safety alignment degradation" in this paper, and show that the challenge arises from the representation gap that emerges when introducing the vision modality to VLMs. In particular, we show that the representations of multi-modal inputs shift away from those of text-only inputs, which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference-time representation intervention method for recovering the safety alignment ability that is inherent in the LLM backbone of VLMs, while simultaneously preserving the functional capabilities of VLMs. The empirical results show that our framework significantly recovers the alignment ability that is inherited from the LLM backbone with minimal impact on the fluency and linguistic capabilities of pre-trained VLMs, even without additional training. Specifically, the unsafe rate of LLaVA-7B on multi-modal input can be reduced from 61.53% to as low as 3.15% with only inference-time intervention. WARNING: This paper contains examples of toxic or harmful language.
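The intervention can be read as an inference-time shift of multimodal hidden states back toward the text-only distribution the backbone was aligned on; a minimal sketch follows, where the anchor statistics, the layer at which the shift is applied, and the scaling factor are assumptions.

```python
import torch

@torch.no_grad()
def representation_shift(hidden: torch.Tensor,
                         text_only_mean: torch.Tensor,
                         multimodal_mean: torch.Tensor,
                         alpha: float = 1.0) -> torch.Tensor:
    """Move multimodal hidden states toward the text-only distribution.

    hidden:          (B, T, d) hidden states for a multimodal input
    text_only_mean:  (d,) mean hidden state over a text-only anchor set
    multimodal_mean: (d,) mean hidden state over a multimodal anchor set
    """
    return hidden + alpha * (text_only_mean - multimodal_mean)
```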
The emergence of vision language models (VLMs) comes with increased safety concerns, as the incorporation of multiple modalities heightens vulnerability to attacks. Although VLMs can be built upon LLMs that have textual safety alignment, it is easily undermined when the vision modality is integrated. We attribute this safety challenge to the modality gap, a separation of image and text in the shared representation space, which blurs the distinction between harmful and harmless queries that is evident in LLMs but weakened in VLMs. To avoid safety decay and close the safety alignment gap, we propose VLM-Guard, an inference-time intervention strategy that leverages the LLM component of a VLM as supervision for the safety alignment of the VLM. VLM-Guard projects the representations of the VLM into the subspace that is orthogonal to the safety steering direction extracted from the safety-aligned LLM. Experimental results on three malicious instruction settings show the effectiveness of VLM-Guard in safeguarding the VLM and closing the safety alignment gap between the VLM and its LLM component.
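The projection step amounts to removing the component of a hidden state along a steering direction; a minimal sketch is below (how the direction is extracted from the safety-aligned LLM, and at which layer the projection is applied, are not shown and would be assumptions).

```python
import torch

def project_orthogonal(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project hidden states (..., d) onto the subspace orthogonal to a
    steering direction: h' = h - (h . d_hat) d_hat, with d_hat unit-normalized."""
    d_hat = direction / direction.norm()
    coeff = (hidden * d_hat).sum(dim=-1, keepdim=True)   # (..., 1) component along d_hat
    return hidden - coeff * d_hat
```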
Mechanistic interpretability seeks to understand the neural mechanisms that enable specific behaviors in Large Language Models (LLMs) by leveraging causality-based methods. While these approaches have identified neural circuits that copy spans of text, capture factual knowledge, and more, they remain unusable for multimodal models since adapting these tools to the vision-language domain requires considerable architectural changes. In this work, we adapt a unimodal causal tracing tool to BLIP to enable the study of the neural mechanisms underlying image-conditioned text generation. We demonstrate our approach on a visual question answering dataset, highlighting the causal relevance of later layer representations for all tokens. Furthermore, we release our BLIP causal tracing tool as open source to enable further experimentation in vision-language mechanistic interpretability by the community. Our code is available at https://github.com/vedantpalit/Towards-Vision-Language-Mechanistic-Interpretability.
Vision-language alignment in Large Vision-Language Models (LVLMs) successfully enables LLMs to understand visual input. However, we find that existing vision-language alignment methods fail to transfer the existing safety mechanism for text in LLMs to vision, which leads to vulnerability to toxic images. To explore the cause of this problem, we give an insightful explanation of where and how the safety mechanism of LVLMs operates and conduct a comparative analysis between text and vision. We find that the hidden states at specific transformer layers play a crucial role in the successful activation of the safety mechanism, while the vision-language alignment at the hidden-state level in current methods is insufficient. This results in a semantic shift for input images compared to text in the hidden states, which misleads the safety mechanism. To address this, we propose a novel Text-Guided vision-language Alignment method (TGA) for LVLMs. TGA retrieves texts related to the input vision and uses them to guide the projection of vision into the hidden-state space of LLMs. Experiments show that TGA not only successfully transfers the safety mechanism for text in basic LLMs to vision in vision-language alignment for LVLMs without any safety fine-tuning on the visual modality, but also maintains general performance on various vision tasks (Safe and Good).
Vision-Language adaptation (VL adaptation) transforms Large Language Models (LLMs) into Large Vision-Language Models (LVLMs) for multimodal tasks, but this process often compromises the inherent safety capabilities embedded in the original LLMs. Despite potential harmfulness due to weakened safety measures, in-depth analysis on the effects of VL adaptation on safety remains under-explored. This study examines how VL adaptation influences safety and evaluates the impact of safety fine-tuning methods. Our analysis reveals that safety degradation occurs during VL adaptation, even when the training data is safe. While safety tuning techniques like supervised fine-tuning with safety datasets or reinforcement learning from human feedback mitigate some risks, they still lead to safety degradation and a reduction in helpfulness due to over-rejection issues. Further analysis of internal model weights suggests that VL adaptation may impact certain safety-related layers, potentially lowering overall safety levels. Additionally, our findings demonstrate that the objectives of VL adaptation and safety tuning are divergent, which often results in their simultaneous application being suboptimal. To address this, we suggest the weight merging approach as an optimal solution effectively reducing safety degradation while maintaining helpfulness. These insights help guide the development of more reliable and secure LVLMs for real-world applications.
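The suggested weight-merging remedy amounts to interpolating two checkpoints that share an architecture; a minimal sketch follows, where the interpolation coefficient and the choice of checkpoints are assumptions of this illustration.

```python
import torch

@torch.no_grad()
def merge_state_dicts(safety_tuned: dict, vl_adapted: dict, lam: float = 0.5) -> dict:
    """Linear interpolation of two compatible state dicts:
    theta_merged = lam * theta_safety + (1 - lam) * theta_vl.
    Assumes floating-point parameters with identical keys and shapes."""
    merged = {}
    for name, w_safe in safety_tuned.items():
        merged[name] = lam * w_safe + (1.0 - lam) * vl_adapted[name]
    return merged

# Usage (hypothetical checkpoint names):
# model.load_state_dict(merge_state_dicts(safe_ckpt, vl_ckpt, lam=0.5))
```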
As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at https://think-reflect-revise.github.io/.
Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries that exploit the visual modality. Existing alignment approaches typically fail to resist malicious queries while effectively preserving utility on benign ones. To address these challenges, we propose Deep Aligned Visual Safety Prompt (DAVSP), which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach to train the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments across five benchmarks on two representative LVLMs demonstrate that DAVSP effectively resists malicious queries while preserving benign input utility. Furthermore, DAVSP exhibits strong cross-model generalization ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential components, jointly contributing to its overall effectiveness. The code is publicly available at https://github.com/zhangyitonggg/DAVSP.
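A minimal sketch of a trainable padding region appended around the input image, as the abstract describes at a high level; the padding width, initialization, and masking scheme are assumptions, and the Deep Alignment training objective in activation space is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSafetyPrompt(nn.Module):
    """Hypothetical sketch: a trainable border is added around the image while
    the original pixels in the center are kept intact."""
    def __init__(self, image_size: int = 224, pad: int = 16):
        super().__init__()
        self.pad = pad
        padded = image_size + 2 * pad
        self.border = nn.Parameter(torch.zeros(3, padded, padded))
        mask = torch.ones(1, padded, padded)
        mask[:, pad:-pad, pad:-pad] = 0.0   # zero in the center: only the border acts
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) original image, padded with zeros then overlaid with the border
        x = F.pad(x, (self.pad,) * 4, value=0.0)
        return x + self.mask * self.border
```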
This report synthesizes the latest research on cross-modal reasoning jailbreak attacks against multimodal large models (MLLMs/VLMs). The field has evolved from early, purely pixel-level adversarial perturbations to sophisticated attacks that exploit models' deep reasoning capabilities, cross-modal logical decomposition, and covert semantic injection. The report covers failure analysis of the underlying alignment mechanisms, the development of diverse attack techniques, the establishment of systematic safety evaluation benchmarks, and defense-hardening strategies based on safety fine-tuning and inference-time intervention, forming a complete "attack, defense, evaluation, analysis" research loop, with particular attention to application safety in high-risk vertical domains such as finance and healthcare.