大型推理模型(LRMs)越狱攻击
越狱攻击的系统性表征、分类与评测/基准框架(DoAnythingNow/JailbreakHub等)
从“越狱是什么、有哪些模式/策略、如何稳定评估与基准化”出发,建立系统性分类与可复现测量框架,并讨论DoAnythingNow类工作对越狱能力表征、成功率与持续性评估的总体方法学贡献。
- A Hitchhiker’s Guide to Jailbreaking ChatGPT via Prompt Engineering(Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, 2024, Proceedings of the 4th International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Things)
- A Review of “Do Anything Now” Jailbreak Attacks in Large Language Models: Potential Risks, Impacts, and Defense Strategies(Wan Chong Choi, Cyril F. Chang, Sze May Ng, Iek Chong Choi, 2025, ResearchGate)
- "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models(Xinyue Shen, Zeyuan Chen, M. Backes, Yun Shen, Yang Zhang, 2023, Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security)
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study(Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, Yang Liu, 2023, ArXiv Preprint)
- Jailbreaking Large Language Models: Safety Alignment, Response Quality, Computational Cost(Jonas Rosengren, J. Brynielsson, F. Johansson, Patrik Jonell, 2025, 2025 International Conference on Machine Learning and Applications (ICMLA))
- Jailbreak Distillation: Renewable Safety Benchmarking(Jingyu Zhang, Ahmed Elgohary, Xiawei Wang, A S M Iftekhar, Ahmed Magooda, Benjamin Van Durme, Daniel Khashabi, Kyle Jackson, 2025, ArXiv Preprint)
面向LRM推理链的专门化越狱与LRM自主/代理式越狱
聚焦LRM与推理链条的特异漏洞:一方面研究推理模型作为自主攻击代理的自治扩展风险;另一方面提出面向推理链的专门化攻击(如针对其脆弱推理路径);同时从部署形态角度对比模型级与代理级攻击面差异(agentic loop中的额外漏洞面)。
- Large reasoning models are autonomous jailbreak agents(Thilo Hagendorff, Erik Derner, Nuria Oliver, 2025, Nature Communications)
- A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos(Yao Yang, Xuan Tong, Ruofan Wang, Yixu Wang, Leya Li, Liang Liu, Yan Teng, Yingchun Wang, 2025, Findings of the Association for Computational Linguistics: ACL 2025)
- A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos(Yao Yang, Xuan Tong, Ruofan Wang, Yixu Wang, Leya Li, Liang Liu, Yan Teng, Yingchun Wang, 2025, Findings of the Association for Computational Linguistics: ACL 2025)
- Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B(Ilham Wicaksono, Zekun Wu, Rahul Patel, Theo King, Adriano Koshiyama, Philip Treleaven, 2025, ArXiv Preprint)
提示工程/文本推理型越狱:语义翻译、结构化推理诱导与可迁移构造
以文本对话/推理流程为核心攻击面,强调用结构、推理链与语义/形式化变换来生成越狱:包括多轮推理型构造、思维诱导与攻击机理解析、以及通过语义翻译/形式化结构提升可读性与跨场景可迁移性。
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models(Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, Dacheng Tao, 2025, Findings of the Association for Computational Linguistics: EMNLP 2025)
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks(Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, 2024, ArXiv Preprint)
- IntentObfuscator: A Jailbreaking Method via Confusing LLM with Prompts(Shang Shang, Zhongjiang Yao, Yepeng Yao, Liya Su, Zijing Fan, Xiaodan Zhang, Zhengwei Jiang, 2024, Lecture Notes in Computer Science)
- Thoughts Behind Attack: Enhancing Security Against Jailbreak Attacks Using Chain-of-Thought(Zhe Tao, Bing Xu, Muyun Yang, Hongjiao Guan, Wenpeng Lu, Hailong Cao, Conghui Zhu, Tiejun Zhao, 2025, Lecture Notes in Computer Science)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation(Qizhang Li, Xiaochen Yang, Wangmeng Zuo, Yiwen Guo, 2024, ArXiv Preprint)
- DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak(Hao Wang, Hao Li, Jiachen Zhu, Xinyuan Wang, Chengwei Pan, Mingqian Huang, Lei Sha, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing)
- Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning(Zhaoqi Wang, Daqing He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu, 2025, ArXiv Preprint)
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study(Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, Yang Liu, 2023, ArXiv Preprint)
自动化发现与红队/攻击搜索(RL/fuzzing/优化目标度量)
将越狱发现建模为可搜索/可优化的过程:使用RL、fuzzing与基于评分/偏好目标的搜索来生成有效且多样的攻击样本,强调度量定义、探索效率与Pareto最优路径,而非仅依赖手工模板。
- Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning(Weiyang Guo, Zesheng Shi, Zhuo Li, Yequan Wang, Xuebo Liu, Wenya Wang, Fangming Liu, Min Zhang, Jing Li, 2025, ArXiv Preprint)
- Red-Teaming LLMs with Token Control Score: Efficient, Universal, and Transferable Jailbreaks(L. Park, T.-H. Kwon, 2025, 2025 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID))
- Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs(Chetan Pathade, 2025, ArXiv Preprint)
- GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts(Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing, 2023, ArXiv Preprint)
- Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models(Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh, David Zhang, Eric Hsin, Li Chen, Ankit Jain, Matt Fredrikson, Akash Bharadwaj, 2025, ArXiv Preprint)
越狱提示生成的可扩展方法:蒸馏型攻击器、进化persona与蒸馏/生成扩散式构造
围绕“规模化生成高质量越狱”的工程路线:通过知识蒸馏压缩攻击知识以降低提示工程/查询成本;并用进化方法自动构造persona prompts以减少拒答率;同时吸收扩散/生成式操控带来的多样性增强。
- KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs(Buyun Liang, Kwan Ho Ryan Chan, Darshan Thaker, Jinqi Luo, René Vidal, 2025, ArXiv Preprint)
- Enhancing Jailbreak Attacks on LLMs via Persona Prompts(Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang, 2025, ArXiv Preprint)
- DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak(Hao Wang, Hao Li, Jiachen Zhu, Xinyuan Wang, Chengwei Pan, Mingqian Huang, Lei Sha, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing)
多轮到单轮与上下文盲区利用:对抗提示重写/结构枚举式嵌入
研究在交互成本约束下维持攻击强度:将多轮越狱压缩为单轮,利用结构化/枚举式/代码式嵌入去触发模型的上下文盲区,从而降低人力与对话轮数依赖。
- M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs(Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, Suhyun Kim, 2025, ArXiv Preprint)
意图混淆与隐身规避型越狱:分解重建、双向/双层混淆与毒性诱导
以“更难被识别”为目标:通过意图锚定隐藏、prompt分解与重建、查询-响应双向混淆、语义/分词/干扰项(distractors)与连续序列扰动提升隐蔽性与可迁移成功率;并结合毒性诱导机制在黑盒条件下提升ASR与危害性。
- Dual Intention Escape: Penetrating and Toxic Jailbreak Attack against Large Language Models(Yanni Xue, Jiakai Wang, Zixin Yin, Yuqing Ma, Haotong Qin, Renshuai Tao, Xianglong Liu, 2025, Proceedings of the ACM on Web Conference 2025)
- DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers(Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh, 2024, ArXiv Preprint)
- WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response(Tianrong Zhang, Bochuan Cao, Yuanpu Cao, Lu Lin, Prasenjit Mitra, Jinghui Chen, 2024, ArXiv Preprint)
- Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (Prompt-G)(Bhagyajit Pingua, Deepak Murmu, Meenakshi Kandpal, Jyotirmayee Rautaray, Pranati Mishra, R. Barik, Manob Saikia, 2024, PeerJ Computer Science)
- Chinese semantic obfuscation blackbox jailbreak for domestic large models(Xinxin Yue, Zhiyong Zhang, Junchang Jing, Weiguo Wang, Simin Tang, Mengdan Xue, 2026, Cybersecurity)
- Distractor-Based Jailbreaking Attacks in Language Models and Associated Changes in Chain-of-Thought Content (Student Abstract)(T. Rowney, X. Ying, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains(Bijoy Ahmed Saiem, MD Sadik Hossain Shanto, Rakib Ahsan, Md. Rafi Ur Rashid, 2025, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop))
- WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response(Tianrong Zhang, Bochuan Cao, Yuanpu Cao, Lin Lu, Prasenjit Mitra, Jinghui Chen, 2025, Findings of the Association for Computational Linguistics: NAACL 2025)
编码/表示层混淆与加密解码触发型越狱(BitBypass/Lock&Decode)
强调不依赖传统模板提示工程,而是从信息表示/触发机制入手:通过位流/编码层伪装与密码学式混淆+解码触发,在黑盒下绕过安全对齐。
- BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage(Kalyan Nakka, Nitesh Saxena, 2025, ArXiv Preprint)
- Lock and Decode: Obfuscated Prompt Cryptography as a Mechanism for Circumventing Large Language Model Security Paradigms(Suhaas Yadavalli, Arpitha Shivaswaroopa, Mehul S. Raval, Anshul Bhowmik, A.R. Rizwana Shaikh, Megha P. Arakeri, 2025, SSRN Electronic Journal)
查询-响应联合混淆与游戏化/语义分解越狱(WordGame等)
聚焦“对齐语义解析的扰动方式”:用文字游戏、语义分解与对齐语料覆盖不足来在查询与生成响应两端共同施加掩蔽,从而提高绕过成功率。
- WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response(Tianrong Zhang, Bochuan Cao, Yuanpu Cao, Lin Lu, Prasenjit Mitra, Jinghui Chen, 2025, Findings of the Association for Computational Linguistics: NAACL 2025)
- WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response(Tianrong Zhang, Bochuan Cao, Yuanpu Cao, Lu Lin, Prasenjit Mitra, Jinghui Chen, 2024, ArXiv Preprint)
多模态与跨模态越狱(VLM/MLLM:视觉扰动、跨模态一致性、跨模态安全边界绕过)
将攻击面从纯文本扩展到VLM/MLLM与跨模态通道:通过视觉扰动、跨模态语义一致性/隐身、以及系统或多模态输入的自对抗与概率建模来提升可迁移性与隐蔽性。
- JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering(Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models(Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li, Liangli Zhen, Xiaohua Xu, 2025, ArXiv Preprint)
- Efficient LLM-Jailbreaking via Multimodal-LLM Jailbreak(Haoxuan Ji, Zheng Lin, Zhenxing Niu, Xinbo Gao, Gang Hua, 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
- Jailbreaking Multimodal Large Language Models via Consistent Cross-Modal Backgrounds(Meng Yang, Peirou Liang, Zhiqian Wu, Yong Liao, 2025, 2025 IEEE 24th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom))
- Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs(Xiang Li, Chong Zhang, Jia Wang, Fangyu Wu, Yushi Li, Xiaobo Jin, 2025, ArXiv Preprint)
- Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts(Yuanwei Wu, Xiang Li, Yixin Liu, Pan Zhou, Lichao Sun, 2023, ArXiv Preprint)
- Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts(Yi Liu, Chengjun Cai, Xiaoli Zhang, Xingliang Yuan, Cong Wang, 2024, Proceedings of the 32nd ACM International Conference on Multimedia)
- Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters(Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda, 2026, ArXiv Preprint)
- Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning(Chenyu Zhang, Lanjun Wang, Yiwen Ma, Wenhui Li, Guoqing Jin, An-An Liu, 2025, Proceedings of the AAAI Conference on Artificial Intelligence)
- Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs(Mingyu Yu, Lana Liu, Zhehao Zhao, Wei Wang, Sujuan Qin, 2026, ArXiv Preprint)
- Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?(Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu, 2024, ArXiv Preprint)
系统/管线级越狱与检索增强(Knowledge-to-Action/Agentic)
针对LLM驱动的“检索增强/知识到行动”管线:研究如何诱导检索到有害指令/要点,并进一步利用奖励追逐偏置生成可直接执行的危险内容;攻击发生在检索-生成耦合与行动生成环节。
- SearchAttack: Red-Teaming LLMs against Knowledge-to-Action Threats under Online Web Search(Yu Yan, Sheng Sun, Mingfeng Li, Zheming Yang, Chiwei Zhu, Fei Ma, Benfeng Xu, Min Liu, Qi Li, 2026, ArXiv Preprint)
自动化多轮红队与多步交互式越狱(AutoAdv/MAD-MAX/MDP等)
将攻击提升到长时域决策:通过多步对话迭代、强化学习/MDP建模与代理式跟进来提升成功率与多样性,使攻击适配真实场景中的持续交互与上下文演化。
- AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models(Aashray Reddy, Andrew Zagula, Nicholas Saban, 2025, ArXiv Preprint)
- MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming(Stefan Schoepf, Muhammad Zaid Hameed, Ambrish Rawat, Kieran Fraser, Giulio Zizzo, Giandomenico Cornacchia, Mark Purcell, 2025, ArXiv Preprint)
- Automatic LLM Red Teaming(Roman Belaire, Arunesh Sinha, Pradeep Varakantham, 2025, ArXiv Preprint)
- RedAgent: an Autonomous Agent for Context-aware Red Teaming of LLM Jailbreaks(Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Zhongjie Ba, Kui Ren, 2026, IEEE Transactions on Dependable and Secure Computing)
基于搜索与优化的越狱提示发现(GAP/SMJ/剪枝与few-shot改进)
把越狱发现形式化为搜索/优化问题:利用结构共享(图结构)、剪枝降低查询成本,并在语义约束下进行进化或few-shot自适应,以提升发现效率与成功率。
- Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation(Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, Yanjun Qi, 2025, ArXiv Preprint)
- Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs(Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang, 2024, ArXiv Preprint)
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses(Chao Du, Jing Jiang, Min‐Hsuan Lin, Qian Liu, Tianyu Pang, Xiaosen Zheng, 2024, Advances in Neural Information Processing Systems 37)
越狱评估与鲁棒性刻画:度量定义、基准数据与潜在越狱(latent jailbreak)
聚焦“如何评估”:定义越狱有效性的量化指标与评估框架,构建带层次标注或潜在恶意嵌入的基准(latent jailbreak),并从推理/对齐差距角度对齐越狱问题以便系统研究鲁棒性。
- Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models(Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan, 2023, ArXiv Preprint)
- Jailbreaks as Inference-Time Alignment: A Framework for Understanding Safety Failures in LLMs(James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, A. S. Bedi, Mubarak Shah, 2026, Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers))
- Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?(Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu, 2024, ArXiv Preprint)
- JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering(Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
面向护栏规避的防护绕过与AML/输入注入对抗(Guardrails/检测系统脆弱性)
研究安全护栏与检测模块的可规避性:通过字符/输入注入与算法化对抗(AML)绕过检测,并用词排序/构造等方法评估对ASR与拒答的影响,回答“防护为何失效”。
- Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems(William Hackett, Lewis Birch, Stefan Trawicki, Neeraj Suri, Peter Garraghan, 2025, ArXiv Preprint)
检测与前置防护:拒绝相关损失/梯度特征、通用识别框架与提示变换一致性
以“检测”为核心:利用拒绝相关损失景观与梯度/特征信号识别越狱意图,或提出通用检测框架(如输入变换与输出差异一致性)作为前置防线,而不是直接改模型输出。
- Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes(Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho, 2024, ArXiv Preprint)
- JailGuard: A Universal Detection Framework for Prompt-based Attacks on LLM Systems(Xiaoyu Zhang, Cen Zhang, Tianlin Li, Yihao Huang, Xiaojun Jia, Ming Hu, Jie Zhang, Yang Liu, Shiqing Ma, Chao Shen, 2025, ACM Transactions on Software Engineering and Methodology)
- Prompt-Based Jailbreaking of Leading LLM Chatbots: A Survey of Attacks and Defenses(Brynn Knowlton, Jovani Campa, Davide Gallo, Khalil Dajani, Nabeel Alzahrani, 2026, IEEE Transactions on Artificial Intelligence)
推理占位、对抗情景外推与安全上下文检索(Defender:CoT occupation/ASE/SCR)
防御侧从推理环节介入:通过推理占位(CoT occupation)和对抗情景外推(ASE)降低攻击成功率,并结合安全上下文检索(SCR)提高系统在开放场景下的安全稳健性。
- CoT defender: Preemptive chain-of-thought occupation for jailbreak attack mitigation(Xiaokang Li, Jin Liu, Yongqiang Tang, Zhiwen Xie, Yihe Wang, Xiao Yu, Long Zhao, Bo Huang, 2026, Neural Networks)
- Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval(Taiye Chen, Zeming Wei, Ang Li, Yisen Wang, 2025, ArXiv Preprint)
- Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models(Md. Rafi Ur Rashid, Vishnu Asutosh Dasu, Ye Wang, Gang Tan, Shagufta Mehnaz, 2025, Proceedings of the AAAI Conference on Artificial Intelligence)
- Defending Large Language Models Against Jailbreak Attacks Through Chain of Thought Prompting(Yanfei Cao, Naijie Gu, Xinyue Shen, Daiyuan Yang, Xingmin Zhang, 2024, 2024 International Conference on Networking and Network Applications (NaNA))
通用越狱的守护与提示级固化:Constitutional Classifiers / PAT / SelfDefend
面向“通用/普适(universal)越狱”设计可部署的守护机制:通过合宪式规则生成分类器抵御跨分布攻击;并用提示级固化/对抗调优或影子守护方法在较小开销下增强防护。
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming(Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez, 2025, ArXiv Preprint)
- Fight Back Against Jailbreaking via Prompt Adversarial Tuning(Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang, 2024, Advances in Neural Information Processing Systems 37)
- SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner(Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, Juergen Rahmel, 2024, ArXiv Preprint)
机制级对齐移除/参数干预(安全拒绝机制的结构性绕过)
从模型内部安全机制出发:将安全对齐视为可定位的结构/嵌入后门或拒绝机制,通过参数剪枝、后门化理解与结构相似样本隔离等方式实现机制级绕过或移除。
- TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts(Torsten Krauß, Hamid Dashtbani, Alexandra Dmitrienko, 2025, ArXiv Preprint)
内部激活操控与细粒度机制攻击(ActMan)
聚焦中间激活的细粒度操控:通过注意力/隐蔽异常激活穿透检测阶段,并在后期改变风险评估链路以增强潜在危害。
- Activation Manipulation Attack: Penetrating and Harmful Jailbreak Attack Against Large Vision-Language Models(Haojie Hao, Jiakai Wang, Aishan Liu, Yuqing Ma, Haotong Qin, Yuanfang Guo, Xianglong Liu, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
越狱与安全链/实验流程/数据划分:SafeChain等方法学补充
从实验设计与数据划分流程角度为越狱研究提供方法学支撑:围绕随机抽取、分割设置与WildJailbreak等方向的扩展研究,强调研究可复现与评测公平性。
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities(Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran, 2025, Findings of the Association for Computational Linguistics: ACL 2025)
攻击效率与跨任务评测:低查询成本(ICE)与面向多任务的评测扩展(BiSceneEval)
强调可操作性与评测覆盖:通过单次/低查询成本提升成功率,并构建跨问答与文本生成等任务的扩展评测体系以暴露现有评测对QA场景的盲区。
- Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion(Tie Jun Cui, Youdong Mao, Peipei Liu, Congying Liu, Datao You, 2025, Findings of the Association for Computational Linguistics: ACL 2025)
合并后,文献主要覆盖八类并列方向:①越狱的系统性表征/分类与评测基准;②LRM推理链与自主代理式越狱;③文本推理型越狱的构造与可迁移生成;④自动化红队/攻击搜索与优化;⑤多模态与检索增强/系统管线(knowledge-to-action)带来的额外攻击面;⑥隐身规避型(混淆、分解重建、双层混淆、编码加密触发)攻击;⑦检测与防御(检测框架、推理占位/对抗情景外推/安全检索、提示级固化与通用守护);⑧机制级干预(激活操控、参数/对齐机制移除)与实验流程方法学。整体体现LRMs越狱研究从“提示层模板”走向“推理链/管线/机制级”并同步强化“可搜索发现 + 可复现评测 + 可部署防御”的闭环。
总计112篇相关文献
Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking.As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values.Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity.This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models.Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss.Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq diffusion model, allowing more flexible token modifications.This approach preserves the semantic content of the original prompt while producing harmful content.Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the diffusion model's output distribution differentiable, eliminating the need for iterative token search.Extensive experiments on Advbench and Harmbench demonstrate that DiffusionAttacker outperforms previous methods across various evaluation metrics, including attack success rate (ASR), fluency, and diversity.
… led to the emergence of various jailbreak technologies, including role-… LLM black-box jailbreak framework called IntentObfuscator, designed to obscure the true intention of user prompts …
Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks capable of producing inappropriate content. Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences. However, existing methods entail significant computational costs. In this paper, we first present a finding that the difference in output distributions between jailbreak and benign prompts can be employed for detecting jailbreak prompts. Based on this finding, we propose a Free Jailbreak Detection (FJD) which prepends an affirmative instruction to the input and scales the logits by temperature to further distinguish between jailbreak and benign prompts through the confidence of the first token. Furthermore, we enhance the detection performance of FJD through the integration of virtual instruction learning. Extensive experiments on aligned LLMs show that our FJD can effectively detect jailbreak prompts with almost no additional computational costs during LLM inference.
… jailbreak to refer to any interaction that causes an LLM to violate its intended safety policies or produce restricted content. An adversarial prompt … Prompt injection is a particular subclass …
The systems and software powered by Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have played a critical role in numerous scenarios. However, current LLM systems are vulnerable to prompt-based attacks, with jailbreaking attacks enabling the LLM system to generate harmful content, while hijacking attacks manipulate the LLM system to perform attacker-desired tasks, underscoring the necessity for detection tools. Unfortunately, existing detecting approaches are usually tailored to specific attacks, resulting in poor generalization in detecting various attacks across different modalities. To address it, we propose JailGuard, a universal detection framework deployed on top of LLM systems for prompt-based attacks across text and image modalities. JailGuard operates on the principle that attacks are inherently less robust than benign ones. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages the discrepancy of the variants’ responses on the target model to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator combination policy to further improve detection generalization. The evaluation on the dataset containing 15 known attack types suggests that JailGuard achieves the best detection accuracy of 86.14%/82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81–25.73% and 12.20–21.40%.
Large language models (LLMs) have become transformative tools in areas like text generation, natural language processing, and conversational AI. However, their widespread use introduces security risks, such as jailbreak attacks, which exploit LLM’s vulnerabilities to manipulate outputs or extract sensitive information. Malicious actors can use LLMs to spread misinformation, manipulate public opinion, and promote harmful ideologies, raising ethical concerns. Balancing safety and accuracy require carefully weighing potential risks against benefits. Prompt Guarding (Prompt-G) addresses these challenges by using vector databases and embedding techniques to assess the credibility of generated text, enabling real-time detection and filtering of malicious content. We collected and analyzed a dataset of Self Reminder attacks to identify and mitigate jailbreak attacks, ensuring that the LLM generates safe and accurate responses. In various attack scenarios, Prompt-G significantly reduced jailbreak success rates and effectively identified prompts that caused confusion or distraction in the LLM. Integrating our model with Llama 2 13B chat reduced the attack success rate (ASR) to 2.08%. The source code is available at: https://doi.org/10.5281/zenodo.13501821.
Natural language prompts serve as an essential interface between users and Large Language Models (LLMs) like GPT-3.5 and GPT-4, which are employed by ChatGPT to produce outputs across various tasks. However, prompts crafted with malicious intent, known as jailbreak prompts, can circumvent the restrictions of LLMs, posing a significant threat to systems integrated with these models. Despite their critical importance, there is a lack of systematic analysis and comprehensive understanding of jailbreak prompts. Our paper aims to address this gap by exploring key research questions to enhance the robustness of LLM systems: 1) What common patterns are present in jailbreak prompts? 2) How effectively can these prompts bypass the restrictions of LLMs? 3) With the evolution of LLMs, how does the effectiveness of jailbreak prompts change? To address our research questions, we embarked on an empirical study targeting the LLMs underpinning ChatGPT, one of today's most advanced chatbots. Our methodology involved categorizing 78 jailbreak prompts into 10 distinct patterns, further organized into three jailbreak strategy types, and examining their distribution. We assessed the effectiveness of these prompts on GPT-3.5 and GPT-4, using a set of 3,120 questions across 8 scenarios deemed prohibited by OpenAI. Additionally, our study tracked the performance of these prompts over a 3-month period, observing the evolutionary response of ChatGPT to such inputs. Our findings offer a comprehensive view of jailbreak prompts, elucidating their taxonomy, effectiveness, and temporal dynamics. Notably, we discovered that GPT-3.5 and GPT-4 could still generate inappropriate content in response to malicious prompts without the need for jailbreaking. This underscores the critical need for effective prompt management within LLM systems and provides valuable insights and data to spur further research in LLM testing and jailbreak prevention.
The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.
… We focus on jailbreaking attacks that aim to discover prompts to mislead LLMs producing … we successfully break these models’ safety alignment, demonstrating the effectiveness of our …
… quantifies the gap between safe and unsafe objectives under adversarial prompting. This … vulnerability of a safety-aligned model π∗safe(·|x) to jailbreaking, we define a notion of the …
Large language models are often equipped with safety alignment mechanisms designed to prevent generation of harmful or other unwanted content. However, an increasing number of jailbreaking techniques attempt to circumvent these safeguards, raising significant safety concerns. This paper introduces an open-source evaluation framework that analyzes jailbreaking effectiveness in several dimensions: refusal bypass rate, harmful response quality, impact on general model capabilities, and computational cost. In the study, prompt injection, sampling exploits, and model manipulation techniques are examined across four open-weight instruction-tuned large language models. The results demonstrate that high refusal bypass does not necessarily equate to practical safety compromise. Specifically, model manipulation methods like single refusal direction ablation achieve a high attack success rate, but often degrade general capabilities and require significant computational resources. Meanwhile, sampling-based exploits show a minimal practical threat when assessed with a robust model classifier. The findings emphasize the importance of comprehensive, multi-dimensional evaluation to accurately characterize jailbreaking effectiveness and safety risks in large language models.
With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks. In this work, we first highlight an important safety gap to describe that alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safety reward model through controlled decoding to defend against jailbreak attacks. Additionally, we provide a mathematical characterization of Immune, offering insights on why it improves safety against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model’s original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and state-of-the-art defense strategy, respectively.
… By highlighting failure modes and limitations of existing methods to align LLMs for safety, we … Baseline As a control, we test a none jailbreak that simply echoes each prompt verbatim. …
Jailbreaking – bypassing built-in safety mechanisms in AI models – has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts covering several sensitive domains. This setup yielded an overall jailbreak success rate across all model combinations of 97.14%. Our study reveals an alignment regression, in which LRMs can systematically erode the safety guardrails of other models, highlighting the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents. Here, the authors demonstrate that large reasoning models can autonomously plan and execute persuasive multi-turn attacks to systematically bypass safety mechanisms in widely used AI systems.
Large Reasoning Models (LRMs) have recently demonstrated impressive performances across diverse domains.However, how the safety of Large Language Models (LLMs) benefits from enhanced reasoning capabilities against jailbreak queries remains unexplored.To bridge this gap, in this paper, we propose Reasoningto-Defend (R2D), a novel training paradigm that integrates a safety-aware reasoning mechanism into LLMs' generation process.This enables self-evaluation at each step of the reasoning process, forming safety PIVOT TOKENS as indicators of the safety status of responses.Furthermore, in order to improve the accuracy of predicting PIVOT TOKENS, we propose Contrastive Pivot Optimization (CPO), which enhances the model's perception of the safety status of given dialogues.LLMs dynamically adjust their response strategies during reasoning, significantly enhancing their safety capabilities defending jailbreak attacks.Extensive experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performances.This highlights the substantial potential of safety-aware reasoning in improving robustness of LRMs and LLMs against various jailbreaks.
Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, Dacheng Tao. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025.
Large Reasoning Models (LRMs) have significantly advanced beyond traditional Large Language Models (LLMs) with their exceptional logical reasoning capabilities, yet these improvements introduce heightened safety risks.When subjected to jailbreak attacks, their ability to generate more targeted and organized content can lead to greater harm.Although some studies claim that reasoning enables safer LRMs against existing LLM attacks, they overlook the inherent flaws within the reasoning process itself.To address this gap, we propose the first jailbreak attack targeting LRMs, exploiting their unique vulnerabilities stemming from the advanced reasoning capabilities.Specifically, we introduce a CHAOS MACHINE, a novel component to transform attack prompts with diverse one-to-one mappings.The chaos mappings iteratively generated by the machine are embedded into the reasoning chain, which strengthens the variability and complexity and also promotes a more robust attack.Based on this, we construct the MOUSETRAP framework, which makes attacks projected into nonlinearlike low sample spaces with mismatched generalization enhanced.Also, due to the more competing objectives, LRMs gradually maintain the inertia of unpredictable iterative reasoning and fall into our trap.Success rates of the Mousetrap attacking o1-mini, Claude-Sonnet and Gemini-Thinking are as high as 96%, 86% and 98% respectively on our toxic dataset TROTTER.On benchmarks such as AdvBench, StrongREJECT, and HarmBench, attacking Claude-Sonnet, well-known for its safety, Mousetrap can astonishingly achieve success rates of 87.5%, 86.58% and 93.13% respectively.Attention: This paper contains inappropriate, offensive and harmful content.
Text-to-Image (T2I) models typically deploy safety mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attack methods manually design instructions for the LLM to generate adversarial prompts, which effectively exposing safety vulnerabilities of T2I models. However, existing methods have two limitations: 1) relying on manually exhaustive strategies for designing adversarial prompts, lacking a unified framework, and 2) requiring numerous queries to achieve a successful attack, limiting their practical applicability. To address this issue, we propose Reason2Attack~(R2A), which aims to enhance the effectiveness and efficiency of the LLM in jailbreaking attacks. Specifically, we first use Frame Semantics theory to systematize existing manually crafted strategies and propose a unified generation framework to generate CoT adversarial prompts step by step. Following this, we propose a two-stage LLM reasoning training framework guided by the attack process. In the first stage, the LLM is fine-tuned with CoT examples generated by the unified generation framework to internalize the adversarial prompt generation process grounded in Frame Semantics. In the second stage, we incorporate the jailbreaking task into the LLM's reinforcement learning process, guided by the proposed attack process reward function that balances prompt stealthiness, effectiveness, and length, enabling the LLM to understand T2I models and safety mechanisms. Extensive experiments on various T2I models with safety mechanisms, and commercial T2I models, show the superiority and practicality of R2A.
With the deep research and widespread application of Large Language Models (LLMs), the security and privacy issues inherent in them have gradually become prominent, posing new challenges in the field of network security. Elaborately designed jailbreak attack prompts may induce LLMs to act against human values and preferences and generate harmful responses. To address this issue, we introduce a straightforward yet effective defence mechanism: Chain of Thought Prompting. This defence mechanism aims to emulate human thought processess. Chain of Thought Prompting method fully harnesses the inherent reasoning ability of the LLMs via five stages. It encourages Large Language Models (LLMs) to engage in self-thought, self-reflection, and self-refinement. Our experiments on ChatGLM-3-6B and Llama-2-7B-Chat, utilizing the JADE and DAN datasets, demonstrate a significant reduction in the average Attack Success Rate (ASR) for jailbreak attacks. This reduction is from 65.01% to 13.40% for ChatGLM-3-6B, and from 49.06% to 0.13% for Llama-2-7B-Chat. Our findings suggest that Chain of Thought Prompting can significantly reduce the generation of harmful content. Our work provides an effective method for LLMs to counter jailbreak attacks, enhancing the safety of LLMs without the need for additional training.
Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning.However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns.In this paper, we try to answer the question: Does CoT reasoning really reduce harmfulness from jailbreaking?Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness.Based on the theoretical insights, we propose a novel jailbreak method, FICDETAIL, whose practical performance validates our theoretical findings.Warning: This work contains potentially offensive LLM-generated content.* Equal contribution. 1 q p (P )f (P )h (i) (P ) dP J Key factors affecting harmfulness with CoT.Proposition 1 implies that the expected harmfulness is influenced by several crucial factors:
Recently, the jailbreak attack, which generates adversarial prompts to bypass safety measures and mislead large language models (LLMs) to output harmful answers, has attracted extensive interest due to its potential to reveal the vulnerabilities of LLMs. However, ignoring the exploitation of the characteristics in intention understanding, existing studies could only generate prompts with weak attacking ability, failing to evade defenses (e.g., sensitive word detect) and causing malice(e.g., harmful outputs). Motivated by the mechanism in the psychology of human misjudgment, we propose a dual intention escape (DIE) jailbreak attack framework to generate more stealthy and toxic prompts to deceive LLMs to output harmful content. For stealthiness, inspired by the anchoring effect, we designed the Intention-anchored Malicious Concealment(IMC) module that hides the harmful intention behind a generated anchor intention by the recursive decomposition block and contrary intention nesting block. Since the anchor intention will be received first, the LLMs might pay less attention to the harmful intention and enter response status. For toxicity, we propose the Intention-reinforced Malicious Inducement (IMI) module based on the availability bias mechanism in a progressive malicious prompting approach. Due to the ongoing emergence of statements correlated to harmful intentions, the output content of LLMs will be closer to these more accessible intentions, i.e., more toxic. We conducted extensive experiments under black-box settings, supporting that DIE could achieve 100% ASR-R and 92.9% ASR-G against GPT3.5-turbo.
Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs). Despite offering new possibilities for LLM applications, these advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content. While LLMs have undergone extensive security evaluations with the aid of red teaming frameworks, VLMs currently lack a well-developed one. To fill this gap, we introduce Arondight, a standardized red team framework tailored specifically for VLMs. Arondight is dedicated to resolving issues related to the absence of visual modality and inadequate diversity encountered when transitioning existing red teaming methodologies from LLMs to VLMs. Our framework features an automated multi-modal jailbreak attack, wherein visual jailbreak prompts are produced by a red team VLM, and textual prompts are generated by a red team LLM guided by a reinforcement learning agent. To enhance the comprehensiveness of VLM security evaluation, we integrate entropy bonuses and novelty reward metrics. These elements incentivize the RL agent to guide the red team LLM in creating a wider array of diverse and previously unseen test cases. Our evaluation of ten cutting-edge VLMs exposes significant security vulnerabilities, particularly in generating toxic images and aligning multi-modal prompts. In particular, our Arondight achieves an average attack success rate of 84.5% on GPT-4 in all fourteen prohibited scenarios defined by OpenAI in terms of generating toxic text. For a clearer comparison, we also categorize existing VLMs based on their safety levels and provide corresponding reinforcement recommendations. Our multimodal prompt dataset and red team code will be released after ethics committee approval. CONTENT WARNING: THIS PAPER CONTAINS HARMFUL MODEL RESPONSES.
… LLM applications under 3,000 context-specific jailbreak prompts, offering insights on how red teaming … setup for the evaluation of context-specific jailbreak threats in Section 3.1 and Sec…
Large Language Models (LLMs) are vulnerable to jailbreak attacks, where adversaries craft malicious prompts to bypass safety mechanisms and elicit harmful, illegal, or unethical responses. To ensure robustness before deployment, red-teaming LLMs is essential. However, prior methods like GCG and AutoDAN rely on loss functions that only assess whether the output matches a fixed target, offering little insight into how well the prompt circumvents refusal behaviors. We introduce Token Control Score (TCS), a novel metric that quantifies how effectively a prompt steers the LLM toward compliance and away from refusal. TCS compares the logits of key tokens representing compliant and rejecting responses, and can be extended with gradient-based feedback into a Normalized Token Control Score (NTCS) to guide optimization. Using this metric, we propose LLM-CGF, a fuzzing-based framework that iteratively discovers effective jailbreak templates for prompt injection. LLM-CGF leverages NTCS as the fitness function to explore LLM behaviors and uncover vulnerabilities with high efficiency. Experiments show that LLM-CGF outperforms state-of-the-art methods such as LLM-Fuzzer, generating more universal, transferable, and query-efficient jailbreak prompts. These results demonstrate the utility of TCS and NTCS as new objectives for prompt injection, providing deeper insights into LLM safety and enabling more thorough red-teaming evaluations. Warning: This paper contains unfiltered content generated by LLMs that may be offensive to readers.
While Large Language Models (LLMs) have achieved tremendous success in various applications, they are also susceptible to jailbreaking attacks. Several primary defense strategies have been proposed to protect LLMs from producing harmful information, mostly focusing on model fine-tuning or heuristical defense designs. However, how to achieve intrinsic robustness through prompt optimization remains an open problem. In this paper, motivated by adversarial training paradigms for achieving reliable robustness, we propose an approach named Prompt Adversarial Tuning (PAT) that trains a prompt control attached to the user prompt as a guard prefix. To achieve our defense goal whilst maintaining natural performance, we optimize the control prompt with both adversarial and benign prompts. Comprehensive experiments show that our method is effective against both grey-box and black-box attacks, reducing the success rate of advanced attacks to nearly 0%, while maintaining the model's utility on the benign task and incurring only negligible computational overhead, charting a new perspective for future explorations in LLM security. Our code is available at https://github.com/PKU-ML/PAT.
Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications.However, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist.Drawing inspiration from adversarial training in deep learning and LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning.ICAG leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks.Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents.This continuous improvement process strengthens defenses against newly generated jailbreak prompts.Our empirical studies affirm ICAG's efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios.Moreover, ICAG demonstrates remarkable transferability to other LLMs, indicating its potential as a versatile defense mechanism.The code is available at https://github.com/YujunZhou/ In-Context-Adversarial-Game.
Text-to-image (T2I) models can be maliciously used to generate harmful content such as sexually explicit, unfaithful, and misleading or Not-Safe-for-Work (NSFW) images.Previous attacks largely depend on the availability of the diffusion model or involve a lengthy optimization process.In this work, we investigate a more practical and universal attack that does not require the presence of a target model and demonstrate that the high-dimensional text embedding space inherently contains NSFW concepts that can be exploited to generate harmful images.We present the Jailbreaking Prompt Attack (JPA).JPA first searches for the target malicious concepts in the text embedding space using a group of antonyms generated by ChatGPT.Subsequently, a prefix prompt is optimized in the discrete vocabulary space to align malicious concepts semantically in the text embedding space.We further introduce a soft assignment with gradient masking technique that allows us to perform gradient ascent in the discrete vocabulary space.We perform extensive experiments with open-sourced T2I models, e.g.(CompVis, 2022a) and closed-sourced online services, e.g.(Ramesh et al., 2022;MidJourney, 2023;Podell et al., 2023) with black-box safety checkers.Results show that (1) JPA bypasses both text and image safety checkers (2) while preserving high semantic alignment with the target prompt.(3) JPA demonstrates a much faster speed than previous methods and can be executed in a fully automated manner.These merits render it a valuable tool for robustness evaluation in future text-to-image generation research.
Large Language Models’ safety remains a critical concern due to their vulnerability to jailbreaking attacks, which can prompt these systems to produce harmful and malicious responses. Safety classifiers, computational models trained to discern and mitigate potentially harmful, offensive, or unethical outputs, offer a practical solution to address this issue. However, despite their potential, existing safety classifiers often fail when exposed to adversarial attacks such as gradient-optimized suffix attacks. In response, our study introduces Adversarial Prompt Shield (APS), a lightweight safety classifier model that excels in detection accuracy and demonstrates resilience against unseen jailbreaking prompts. We also introduce efficiently generated adversarial training datasets, named Bot Adversarial Noisy Dialogue (BAND), which are designed to fortify the classifier’s robustness. Through extensive testing on various safety tasks and unseen jailbreaking attacks, we demonstrate the effectiveness and resilience of our models. Evaluations show that our classifier has the potential to significantly reduce the Attack Success Rate by up to 44.9%. This advance paves the way for the next generation of more reliable and resilient Large Language Models.
Warning: this paper contains data, prompts, and model outputs that are offensive in nature. Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a `few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.
… In Step 1, we show (i) the original adversarial prompt from the dataset, (ii) the prompt presented to the attacker LLM together with the system instructions, and (iii) one generated …
… modify prompts to … jailbreaking attacks and an algorithm, Robust Prompt Optimization (RPO) to create robust system-level defenses. Our approach directly incorporates the adversary …
With the development of large language models (LLMs), numerous studies have demonstrated their vulnerability to carefully crafted jailbreak attacks. However, existing mitigation measures rarely balance model usability with significant protective effects, raising concerns about model abuse. To address this, we introduce CoT Defender. It preemptively occupies the model's first few generated tokens with a chain-of-thought analysis that hinders attackers from steering the output towards harmful content. We designed a two-stage training framework that strengthens security while preserving usability. Stage 1 fine-tunes the model to follow a structured chain-of-thought format before answering. Stage 2 employs reinforcement learning to refine this reasoning. An auxiliary attacker model continuously synthesizes new jailbreak prompts, and a lightweight evaluator-Probabilistic Structured Output Evaluation (PSOE)-supplies fine-grained rewards by scoring both sentence-level intent capture and token-level format fidelity. We conducted a series of experiments on four models and six attack methods. Across all models, we successfully reduced the average attack success rate to below 8.0 %, with no more than a 7.0 % impact on the response rate for benign requests. Code is available here. Warning: This paper contains red-team data that may be offensive!
… Chain-of-Thought (CoT) without any optimization or training. We believe that certain jailbreak … understand the thoughts behind jailbreak attacks, we propose a jailbreak attack taxonomy …
We identify a jailbreaking vulnerability in multiple open-source LLMs: by augmenting dangerous requests using certain "distractors" to obfuscate their intent, we elicit specific, actionable responses on a wide variety of harmful topics. We find that such an attack noticeably alters the contents of these models' chains of thought, including changed frequencies of seemingly unrelated n-grams and heightened ethical scrutiny about harmful requests even when their response is ultimately jailbroken.
… We randomly select 250 jailbreak prompts and select 50 prompts as a small split. … Other than experiment on WildJailbreak in Section 3, we augment our study with jailbreaking …
Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to
Although large language models (LLMs) have achieved remarkable advancements, their security remains a pressing concern.One major threat is jailbreak attacks, where adversarial prompts bypass model safeguards to generate harmful or objectionable content.Researchers study jailbreak attacks to understand security and robustness of LLMs.However, existing jailbreak attack methods face two main challenges: (1) an excessive number of iterative queries, and (2) poor generalization across models.In addition, recent jailbreak evaluation datasets focus primarily on questionanswering scenarios, lacking attention to text generation tasks that require accurate regeneration of toxic content.To tackle these challenges, we propose two contributions: (1) ICE, a novel black-box jailbreak method that employs Intent Concealment and divErsion to effectively circumvent security constraints.ICE achieves high attack success rates (ASR) with a single query, significantly improving efficiency and transferability across different models.(2) BiSceneEval, a comprehensive dataset designed for assessing LLM robustness in question-answering and text-generation tasks.Experimental results demonstrate that ICE outperforms existing jailbreak techniques, revealing critical vulnerabilities in current defense mechanisms.Our findings underscore the necessity of a hybrid security strategy that integrates predefined security mechanisms with real-time semantic decomposition to enhance the security of LLMs.
… the safe robustness of VLMs and further inspire the design of defensive AI frameworks, we propose Virtual Scenario Hypnosis (VSH), a multimodal prompt injection jailbreak … to jailbreak …
Multimodal Large Language Models (MLLMs) extend the capacity of LLMs to understand multimodal information comprehensively, achieving remarkable performance in many vision-centric tasks. Despite that, recent studies have shown that these models are susceptible to jailbreak attacks, which refer to an exploitative technique where malicious users can break the safety alignment of the target model and generate misleading and harmful answers. This potential threat is caused by both the inherent vulnerabilities of LLM and the larger attack scope introduced by vision input. To enhance the security of MLLMs against jailbreak attacks, researchers have developed various defense techniques. However, these methods either require modifications to the model's internal structure or demand significant computational resources during the inference phase. Multimodal information is a double-edged sword. While it increases the risk of attacks, it also provides additional data that can enhance safeguards. Inspired by this, we propose Cross-modality Information DEtectoR (CIDER), a plug-and-play jailbreaking detector designed to identify maliciously perturbed image inputs, utilizing the cross-modal similarity between harmful queries and adversarial images. CIDER is independent of the target MLLMs and requires less computation cost. Extensive experimental results demonstrate the effectiveness and efficiency of CIDER, as well as its transferability to both white-box and black-box MLLMs.
Multimodal Large Language Models (MLLMs) bridge the gap between visual and textual data, enabling a range of advanced applications. However, complex internal interactions among visual elements and their alignment with text can introduce vulnerabilities, which may be exploited to bypass safety mechanisms. To address this, we analyze the relationship between image content and task and find that the complexity of subimages, rather than their content, is key. Building on this insight, we propose the Distraction Hypothesis, followed by a novel framework called Contrasting Subimage Distraction Jailbreaking (CS-DJ), to achieve jailbreaking by disrupting MLLMs alignment through multi-level distraction strategies. CS-DJ consists of two components: structured distraction, achieved through query decomposition that induces a distributional shift by fragmenting harmful prompts into sub-queries, and visual-enhanced distraction, realized by constructing contrasting subimages to disrupt the interactions among visual elements within the model. This dual strategy disperses the model’s attention, reducing its ability to detect and mitigate harmful content. Extensive experiments across five representative scenarios and four popular closed-source MLLMs, including GPT-4o-mini, GPT-4o, GPT-4V, and Gemini-1.5-Flash, demonstrate that CS-DJ achieves average success rates of 52.40% for the attack success rate and 74.10% for the ensemble attack success rate. These results reveal the potential of distraction-based approaches to exploit and bypass MLLMs’ defenses, offering new insights for attack strategies. Our code is available at https://github.com/TeamPigeonLab/CS-DJ.Warning: This paper contains unfiltered content generated by MLLMs that may be offensive to readers
Multimodal large language models (MLLMs) have shown remarkable capability in vision–language understanding but remain vulnerable to jailbreak attacks that bypass safety alignment. Existing methods often rely on structured input manipulations such as cross-modal signal shifting, adversarial visual injection, or semantic rephrasing, which compromise naturalness and degrade under stronger defenses. To address this, we propose Consistent Cross-Modal Backgrounds Attack (CCB-A), a jailbreak strategy that embeds benign and semantically aligned background cues across both image and text modalities. Leveraging such consistency, CCB-A dilutes safety-sensitive signals and subtly redirects model attention away from harmful content while maintaining semantic coherence. We evaluate CCB-A on six representative MLLMs, including both commercial and open-source systems, across MM-SafetyBench and HADES. Experiments show that CCB-A consistently surpasses strong baselines in harmfulness and attack success rate. Ablation and defense analyses further reveal that cross-modal semantic consistency is the key driver of its effectiveness, while current defense mechanisms remain insufficiently robust, suggesting that exploiting consistency across modalities constitutes an effective yet previously overlooked jailbreak strategy that exposes a critical vulnerability in current alignment techniques.
Jailbreak attacks against multimodal large language Models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that, while successful in bypassing safety filters, lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks via corporation of visual image and textually steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by ''steering prompt'' optimized via a multi-agent system to specifically guide LLM responses fulfilling the attackers' intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Codes are available at https://github.com/thu-coai/JPS Warning: This paper contains potentially sensitive contents.
This paper focuses on jailbreaking attacks against large language models (LLMs), eliciting them to generate objectionable content in response to harmful user queries. Unlike previous LLM-jailbreak methods that directly orient to LLMs, our approach begins by constructing a multimodal large language model (MLLM) built upon the target LLM. Subsequently, we perform an efficient MLLM jailbreak and obtain a jailbreaking embedding. Finally, we convert the embedding into a textual jailbreaking suffix to carry out the jailbreak of target LLM. Compared to the direct LLM-jailbreak methods, our indirect jailbreaking approach is more efficient, as MLLMs are more vulnerable to jailbreak than pure LLMs. Additionally, to improve the attack success rate of jailbreak, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness. Moreover, our approach exhibits superior cross-class generalization abilities.
With the rapid advancement of Large Vision-Language Models (VLMs), concerns about their potential misuse and abuse have grown rapidly.Prior research has exposed VLMs' vulnerability to jailbreak attacks, where carefully crafted inputs can lead the model to produce content that violates ethical and legal standards.However, current jailbreak methods often fail against cutting-edge models such as GPT-4o.We attribute this to the overexposure of harmful content and the absence of stealthy malicious guidance.In this work, we introduce a novel jailbreak framework: Multi-Modal Linkage (MML) Attack.Drawing inspiration from cryptography, MML employs an encryption-decryption process across text and image modalities to mitigate the over-exposure of malicious information.To covertly align the model's output with harmful objectives, MML leverages a technique we term evil alignment, framing the attack within the narrative context of a video game development scenario.Extensive experiments validate the effectiveness of MML.Specifically, MML jailbreaks GPT-4o with attack success rates of 99.40% on SafeBench, 98.81% on MM-SafeBench, and 99.07% on HADES-Dataset.Our code is available at https://github.com/wangyu-ovo/MML.
… , jailbreak attacks operate entirely at the prompt level and rely on natural language manipulation, such as obfuscation, role… techniques such as obfuscation (eg, leetspeak and encoding), …
… jailbreaking which provides minimum complexity. With constant security patches, basic prompt encryption strategies no longer achieve a jailbreak, … for achieving a jailbreak. Therefore a …
Nathalie Maria Kirch, Constantin Niko Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper. Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2025.
This review investigated the “Do Anything Now” (DAN) jailbreak phenomenon in Large Language Models (LLMs), which bypassed safety mechanisms through adversarial prompting, prompt injection, and persona manipulation. We analyzed its multifaceted risks: (1) technical and behavioral harms, such as the facilitation of illicit content and malicious code; (2) psychological and social manipulation, including emotional dependency and deceptive interactions; and (3) online abuse, misinformation, and the cross-model portability of jailbreaks. To mitigate these threats, we synthesized defense strategies at three levels: (1) prompt-level defenses such as input filtering, prompt rewriting, and output moderation that aimed to intercept jailbreak attempts without altering model weights; (2) model-level defenses including safety fine-tuning, reinforcement learning from human feedback (RLHF), and internal signal monitoring to build intrinsic resistance to adversarial prompts; and (3) external and composite strategies such as multi-agent safety architectures, moving-target defenses, and continuous evaluation pipelines to provide layered protection and adaptability. We also note the ethics challenge raised by jailbreak research and evaluation practices, and outline open questions around responsibility, consent, and dual-use risk. We concluded that a defense-in-depth approach, supported by benchmark-driven evaluation and collaborative oversight, was essential for LLMs’ robust and trustworthy deployment.
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response
The recent breakthrough in large language models (LLMs) such as ChatGPT has revolutionized every industry at an unprecedented pace.Alongside this progress also comes mounting concerns about LLMs' susceptibility to jailbreaking attacks, which leads to the generation of harmful or unsafe content.While safety alignment measures have been implemented in LLMs to mitigate existing jailbreak attempts and force them to become increasingly complicated, it is still far from perfect.In this paper, we analyze the common pattern of the current safety alignment and show that it is possible to exploit such patterns for jailbreaking attacks by simultaneous obfuscation in queries and responses.Specifically, we propose WordGame attack, which replaces malicious words with word games to break down the adversarial intent of a query and encourage benign content regarding the games to precede the anticipated harmful content in the response, creating a context that is hardly covered by any corpus used for safety alignment.Extensive experiments demonstrate that WordGame attack can break the guardrails of the current leading proprietary and open-source LLMs, including the latest Claude 3, GPT 4, and Llama 3 models more effectively than existing attacks efficiently.The attack also remains powerful when external defenses are adopted.Further ablation studies on such simultaneous obfuscation in query and response provide evidence of the merits of the attack strategy beyond an individual attack.Warning: The paper contains unfiltered text generated by LLMs which can be offensive.
… prompt into the jailbreak prompt. Sensitive words in both the sample and the original … s decoding capability for specific encoding schemes when designing jailbreak attacks. Therefore, in …
… harmful prompt obfuscation. In this study, we propose SEQUENTIALBREAK, a … jailbreak attack that sends a series of prompts in a single query with one being the target harmful prompt. …
Recently, Large Vision-Language Models (LVLMs) have been demonstrated to be vulnerable to jailbreak attacks, highlighting the urgent need for further research to comprehensively identify and mitigate these threats. Unfortunately, existing jailbreak studies primarily focus on coarse-grained input manipulation to elicit specific responses, overlooking the exploitation of internal representations, i.e., intermediate activations, which constrains their ability to penetrate alignment safeguards and generate harmful responses. To tackle this issue, we propose the Activation Manipulation (ActMan) Attack framework, which performs fine-grained activation manipulations inspired by the perception and cognition stages of human decision-making, enhancing both the penetration capability and harmfulness of attacks. To improve penetration capability, we introduce a Deceptive Visual Camouflage module inspired by the masking effect in human perception. This module uses a benign activation-guided attention redirection strategy to conceal abnormal activation patterns, thereby suppressing LVLM's defense detection during early-stage decoding. To enhance harmfulness, we design a Malicious Semantic Induction module drawing from the framing effect in human cognition, which reconstructs jailbreak instructions using malicious activation guidance to change LVLM’s risk assessment during late-stage decoding, thereby amplifying the harmfulness of model responses. Extensive experiments on six mainstream LVLMs demonstrate that our method remarkably outperforms state-of-the-art baselines, achieving an average relative ASR improvement of 12.06%.
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs). A considerable amount of research exists proposing more effective jailbreak attacks, including the recent Greedy Coordinate Gradient (GCG) attack, jailbreak template-based attacks such as using "Do-Anything-Now" (DAN), and multilingual jailbreak. In contrast, the defensive side has been relatively less explored. This paper proposes a lightweight yet practical defense called SELFDEFEND, which can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts. Our key insight is that regardless of the kind of jailbreak strategies employed, they eventually need to include a harmful prompt (e.g., "how to make a bomb") in the prompt sent to LLMs, and we found that existing LLMs can effectively recognize such harmful prompts that violate their safety policies. Based on this insight, we design a shadow stack that concurrently checks whether a harmful prompt exists in the user prompt and triggers a checkpoint in the normal stack once a token of "No" or a harmful prompt is output. The latter could also generate an explainable LLM response to adversarial prompts. We demonstrate our idea of SELFDEFEND works in various jailbreak scenarios through manual analysis in GPT-3.5/4. We also list three future directions to further enhance SELFDEFEND.
Jailbreak attacks aim to exploit large language models (LLMs) by inducing them to generate harmful content, thereby revealing their vulnerabilities. Understanding and addressing these attacks is crucial for advancing the field of LLM safety. Previous jailbreak approaches have mainly focused on direct manipulations of harmful intent, with limited attention to the impact of persona prompts. In this study, we systematically explore the efficacy of persona prompts in compromising LLM defenses. We propose a genetic algorithm-based method that automatically crafts persona prompts to bypass LLM's safety mechanisms. Our experiments reveal that: (1) our evolved persona prompts reduce refusal rates by 50-70% across multiple LLMs, and (2) these prompts demonstrate synergistic effects when combined with existing attack methods, increasing success rates by 10-20%. Our code and data are available at https://github.com/CjangCjengh/Generic_Persona.
Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model's safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored vulnerability in multimodal alignment and introduce a fundamentally new direction for jailbreak approaches.
Large language models (LLMs) are rapidly deployed in critical applications, raising urgent needs for robust safety benchmarking. We propose Jailbreak Distillation (JBDistill), a novel benchmark construction framework that "distills" jailbreak attacks into high-quality and easily-updatable safety benchmarks. JBDistill utilizes a small set of development models and existing jailbreak attack algorithms to create a candidate prompt pool, then employs prompt selection algorithms to identify an effective subset of prompts as safety benchmarks. JBDistill addresses challenges in existing safety evaluation: the use of consistent evaluation prompts across models ensures fair comparisons and reproducibility. It requires minimal human effort to rerun the JBDistill pipeline and produce updated benchmarks, alleviating concerns on saturation and contamination. Extensive experiments demonstrate our benchmarks generalize robustly to 13 diverse evaluation models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, significantly outperforming existing safety benchmarks in effectiveness while maintaining high separability and diversity. Our framework thus provides an effective, sustainable, and adaptable solution for streamlining safety evaluation.
As large language models (LLMs) become increasingly prevalent, ensuring their robustness against adversarial misuse is crucial. This paper introduces the GAP (Graph of Attacks with Pruning) framework, an advanced approach for generating stealthy jailbreak prompts to evaluate and enhance LLM safeguards. GAP addresses limitations in existing tree-based LLM jailbreak methods by implementing an interconnected graph structure that enables knowledge sharing across attack paths. Our experimental evaluation demonstrates GAP's superiority over existing techniques, achieving a 20.8% increase in attack success rates while reducing query costs by 62.7%. GAP consistently outperforms state-of-the-art methods for attacking both open and closed LLMs, with attack success rates of >96%. Additionally, we present specialized variants like GAP-Auto for automated seed generation and GAP-VLM for multimodal attacks. GAP-generated prompts prove highly effective in improving content moderation systems, increasing true positive detection rates by 108.5% and accuracy by 183.6% when used for fine-tuning. Our implementation is available at https://github.com/dsbuddy/GAP-LLM-Safety.
The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current methods for jailbreaking LLMs, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned LLMs. This paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt \textbf{D}ecomposition and \textbf{R}econstruction framework for jailbreak \textbf{Attack} (DrAttack). DrAttack includes three key components: (a) `Decomposition' of the original prompt into sub-prompts, (b) `Reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a `Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with a significantly reduced number of queries, DrAttack obtains a substantial gain of success rate over prior SOTA prompt-only attackers. Notably, the success rate of 78.0\% on GPT-4 with merely 15 queries surpassed previous art by 33.1\%. The project is available at https://github.com/xirui-li/DrAttack.
Large language models (LLMs) have demonstrated remarkable capabilities, yet they also introduce novel security challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose the PASS framework (\underline{P}rompt J\underline{a}ilbreaking via \underline{S}emantic and \underline{S}tructural Formalization). Specifically, PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions, which enhances stealthiness and enables bypassing existing alignment defenses. The jailbreak outputs are then structured into a GraphRAG system that, by leveraging extracted relevant terms and formalized symbols as contextual input alongside the original query, strengthens subsequent attacks and facilitates more effective jailbreaks. We conducted extensive experiments on common open-source models, demonstrating the effectiveness of our attack.
Large Language Models (LLMs), used in creative writing, code generation, and translation, generate text based on input sequences but are vulnerable to jailbreak attacks, where crafted prompts induce harmful outputs. Most jailbreak prompt methods use a combination of jailbreak templates followed by questions to ask to create jailbreak prompts. However, existing jailbreak prompt designs generally suffer from excessive semantic differences, resulting in an inability to resist defenses that use simple semantic metrics as thresholds. Jailbreak prompts are semantically more varied than the original questions used for queries. In this paper, we introduce a Semantic Mirror Jailbreak (SMJ) approach that bypasses LLMs by generating jailbreak prompts that are semantically similar to the original question. We model the search for jailbreak prompts that satisfy both semantic similarity and jailbreak validity as a multi-objective optimization problem and employ a standardized set of genetic algorithms for generating eligible prompts. Compared to the baseline AutoDAN-GA, SMJ achieves attack success rates (ASR) that are at most 35.4% higher without ONION defense and 85.2% higher with ONION defense. SMJ's better performance in all three semantic meaningfulness metrics of Jailbreak Prompt, Similarity, and Outlier, also means that SMJ is resistant to defenses that use those metrics as thresholds.
Considerable research efforts have been devoted to ensuring that large language models (LLMs) align with human values and generate safe text. However, an excessive focus on sensitivity to certain topics can compromise the model's robustness in following instructions, thereby impacting its overall performance in completing tasks. Previous benchmarks for jailbreaking LLMs have primarily focused on evaluating the safety of the models without considering their robustness. In this paper, we propose a benchmark that assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach. To comprehensively study text safety and output robustness, we introduce a latent jailbreak prompt dataset, each involving malicious instruction embedding. Specifically, we instruct the model to complete a regular task, such as translation, with the text to be translated containing malicious instructions. To further analyze safety and robustness, we design a hierarchical annotation framework. We present a systematic analysis of the safety and robustness of LLMs regarding the position of explicit normal instructions, word replacements (verbs in explicit normal instructions, target groups in malicious instructions, cue words for explicit normal instructions), and instruction replacements (different explicit normal instructions). Our results demonstrate that current LLMs not only prioritize certain instruction verbs but also exhibit varying jailbreak rates for different instruction verbs in explicit normal instructions. Code and data are available at https://github.com/qiuhuachuan/latent-jailbreak.
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at https://github.com/JailbreakBench/jailbreakbench that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.
As the scale and complexity of jailbreaking attacks on large language models (LLMs) continue to escalate, their efficiency and practical applicability are constrained, posing a profound challenge to LLM security. Jailbreaking techniques have advanced from manual prompt engineering to automated methodologies. Recent advances have automated jailbreaking approaches that harness LLMs to generate jailbreak instructions and adversarial examples, delivering encouraging results. Nevertheless, these methods universally include an LLM generation phase, which, due to the complexities of deploying and reasoning with LLMs, impedes effective implementation and broader adoption. To mitigate this issue, we introduce \textbf{Adversarial Prompt Distillation}, an innovative framework that integrates masked language modeling, reinforcement learning, and dynamic temperature control to distill LLM jailbreaking prowess into smaller language models (SLMs). This methodology enables efficient, robust jailbreak attacks while maintaining high success rates and accommodating a broader range of application contexts. Empirical evaluations affirm the approach's superiority in attack efficacy, resource optimization, and cross-model versatility. Our research underscores the practicality of transferring jailbreak capabilities to SLMs, reveals inherent vulnerabilities in LLMs, and provides novel insights to advance LLM security investigations. Our code is available at: https://github.com/lxgem/Efficient_and_Stealthy_Jailbreak_Attacks_via_Adversarial_Prompt.
Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models' safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which have often compromised the utility of the model for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed interpretable suffix prompts that effectively thwart a wide range of standard and adaptive jailbreak techniques. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. Our approach not only outperforms existing defense strategies in balancing safety and functionality, but also provides a scalable and interpretable solution applicable to various LLM platforms.
Large Language Models (LLMs), like ChatGPT, have demonstrated vast potential but also introduce challenges related to content constraints and potential misuse. Our study investigates three key research questions: (1) the number of different prompt types that can jailbreak LLMs, (2) the effectiveness of jailbreak prompts in circumventing LLM constraints, and (3) the resilience of ChatGPT against these jailbreak prompts. Initially, we develop a classification model to analyze the distribution of existing prompts, identifying ten distinct patterns and three categories of jailbreak prompts. Subsequently, we assess the jailbreak capability of prompts with ChatGPT versions 3.5 and 4.0, utilizing a dataset of 3,120 jailbreak questions across eight prohibited scenarios. Finally, we evaluate the resistance of ChatGPT against jailbreak prompts, finding that the prompts can consistently evade the restrictions in 40 use-case scenarios. The study underscores the importance of prompt structures in jailbreaking LLMs and discusses the challenges of robust jailbreak prompt generation and prevention.
Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost-time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conducted a quantitative diversity analysis of prompts generated by baseline methods and KDA, identifying diverse and ensemble attacks as key factors behind KDA's effectiveness and efficiency.
Large language models (LLMs) have achieved widespread adoption across numerous applications. However, many LLMs are vulnerable to malicious attacks even after safety alignment. These attacks typically bypass LLMs' safety guardrails by wrapping the original malicious instructions inside adversarial jailbreaks prompts. Previous research has proposed methods such as adversarial training and prompt rephrasing to mitigate these safety vulnerabilities, but these methods often reduce the utility of LLMs or lead to significant computational overhead and online latency. In this paper, we propose SecurityLingua, an effective and efficient approach to defend LLMs against jailbreak attacks via security-oriented prompt compression. Specifically, we train a prompt compressor designed to discern the "true intention" of the input prompt, with a particular focus on detecting the malicious intentions of adversarial prompts. Then, in addition to the original prompt, the intention is passed via the system prompt to the target LLM to help it identify the true intention of the request. SecurityLingua ensures a consistent user experience by leaving the original input prompt intact while revealing the user's potentially malicious intention and stimulating the built-in safety guardrails of the LLM. Moreover, thanks to prompt compression, SecurityLingua incurs only a negligible overhead and extra token cost compared to all existing defense methods, making it an especially practical solution for LLM defense. Experimental results demonstrate that SecurityLingua can effectively defend against malicious attacks and maintain utility of the LLM with negligible compute and latency overhead. Our code is available at https://aka.ms/SecurityLingua.
Large Language Models (LLMs) guardrail systems are designed to protect against prompt injection and jailbreak attacks. However, they remain vulnerable to evasion techniques. We demonstrate two approaches for bypassing LLM prompt injection and jailbreak detection systems via traditional character injection methods and algorithmic Adversarial Machine Learning (AML) evasion techniques. Through testing against six prominent protection systems, including Microsoft's Azure Prompt Shield and Meta's Prompt Guard, we show that both methods can be used to evade detection while maintaining adversarial utility achieving in some instances up to 100% evasion success. Furthermore, we demonstrate that adversaries can enhance Attack Success Rates (ASR) against black-box targets by leveraging word importance ranking computed by offline white-box models. Our findings reveal vulnerabilities within current LLM protection mechanisms and highlight the need for more robust guardrail systems.
Large Language Models (LLMs) are increasingly integrated into consumer and enterprise applications. Despite their capabilities, they remain susceptible to adversarial attacks such as prompt injection and jailbreaks that override alignment safeguards. This paper provides a systematic investigation of jailbreak strategies against various state-of-the-art LLMs. We categorize over 1,400 adversarial prompts, analyze their success against GPT-4, Claude 2, Mistral 7B, and Vicuna, and examine their generalizability and construction logic. We further propose layered mitigation strategies and recommend a hybrid red-teaming and sandboxing approach for robust LLM security.
Large Language Models (LLMs) have gained significant attention but also raised concerns due to the risk of misuse. Jailbreak prompts, a popular type of adversarial attack towards LLMs, have appeared and constantly evolved to breach the safety protocols of LLMs. To address this issue, LLMs are regularly updated with safety patches based on reported jailbreak prompts. However, malicious users often keep their successful jailbreak prompts private to exploit LLMs. To uncover these private jailbreak prompts, extensive analysis of large-scale conversational datasets is necessary to identify prompts that still manage to bypass the system's defenses. This task is highly challenging due to the immense volume of conversation data, diverse characteristics of jailbreak prompts, and their presence in complex multi-turn conversations. To tackle these challenges, we introduce JailbreakHunter, a visual analytics approach for identifying jailbreak prompts in large-scale human-LLM conversational datasets. We have designed a workflow with three analysis levels: group-level, conversation-level, and turn-level. Group-level analysis enables users to grasp the distribution of conversations and identify suspicious conversations using multiple criteria, such as similarity with reported jailbreak prompts in previous research and attack success rates. Conversation-level analysis facilitates the understanding of the progress of conversations and helps discover jailbreak prompts within their conversation contexts. Turn-level analysis allows users to explore the semantic similarity and token overlap between a singleturn prompt and the reported jailbreak prompts, aiding in the identification of new jailbreak strategies. The effectiveness and usability of the system were verified through multiple case studies and expert interviews.
Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts without additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but as a structural intervention to emerge alignment-relevant subnets, offering a new path to robust jailbreak resistance.
Ensuring the safe alignment of large language models (LLMs) with human values is critical as they become integral to applications like translation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, making models vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework enhancing LLM safety across different scenarios: Base models, Supervised fine-tuned models (SFT), and Edited models. Safety Arithmetic involves Harm Direction Removal to avoid harmful content and Safety Alignment to promote safe responses. Additionally, we present NoIntentEdit, a dataset highlighting edit instances that could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility, outperforming existing methods in ensuring safe content generation.
As large language models (LLMs) continue to advance in capabilities, ensuring their safety against jailbreak attacks remains a critical challenge. In this paper, we introduce a novel safety alignment approach called Answer-Then-Check, which enhances LLM robustness against malicious prompts by applying thinking ability to mitigate jailbreaking problems before producing a final answer to the user. Our method enables models to answer the question in their thoughts directly and then critically evaluate its safety before deciding whether to provide it. To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K samples that teach models to reason through direct responses and then analyze their safety. Experimental results demonstrate that our approach achieves the Pareto frontier with superior safety capability while decreasing over-refusal rates. Notably, the fine-tuned model maintains general reasoning capabilities on benchmarks like MMLU, MATH500, and HumanEval. Besides, our method equips models with the ability to perform safe completion, while post-hoc detection methods can only directly reject sensitive, harmful queries (e.g., self-harm). Our results show that inference-time strategies alone are insufficient, highlighting the necessity of safety training, and we find even $500$ samples can yield performance comparable to the entire dataset, suggesting a promising path for data-efficient safety alignment. The dataset is publicly available at: https://huggingface.co/datasets/ByteDance-Seed/ReSA.
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions
Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs. Unfortunately, jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content and raising concerns about LLM safety. Due to language models with intensive parameters often regarded as black boxes, the mechanisms of alignment and jailbreak are challenging to elucidate. In this paper, we employ weak classifiers to explain LLM safety through the intermediate hidden states. We first confirm that LLMs learn ethical concepts during pre-training rather than alignment and can identify malicious and normal inputs in the early layers. Alignment actually associates the early concepts with emotion guesses in the middle layers and then refines them to the specific reject tokens for safe generations. Jailbreak disturbs the transformation of early unethical classification into negative emotions. We conduct experiments on models from 7B to 70B across various model families to prove our conclusion. Overall, our paper indicates the intrinsical mechanism of LLM safety and how jailbreaks circumvent safety guardrails, offering a new perspective on LLM safety and reducing concerns. Our code is available at https://github.com/ydyjya/LLM-IHS-Explanation.
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about $3\%$ at the parameter level and $2.5\%$ at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. Moreover, we show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs.
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs) and has evolved into multiple categories: human-based, optimization-based, generation-based, and the recent indirect and multilingual jailbreaks. However, delivering a practical jailbreak defense is challenging because it needs to not only handle all the above jailbreak attacks but also incur negligible delays to user prompts, as well as be compatible with both open-source and closed-source LLMs. Inspired by how the traditional security concept of shadow stacks defends against memory overflow attacks, this paper introduces a generic LLM jailbreak defense framework called SelfDefend, which establishes a shadow LLM as a defense instance (in detection state) to concurrently protect the target LLM instance (in normal answering state) in the normal stack and collaborate with it for checkpoint-based access control. The effectiveness of SelfDefend builds upon our observation that existing LLMs can identify harmful prompts or intentions in user queries, which we empirically validate using mainstream GPT-3.5/4 models against major jailbreak attacks. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models. When deployed to protect GPT-3.5/4, Claude, Llama-2-7b/13b, and Mistral, these models outperform seven state-of-the-art defenses and match the performance of GPT-4-based SelfDefend, with significantly lower extra delays. Further experiments show that the tuned models are robust to adaptive jailbreaks and prompt injections.
Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries that exploit the visual modality. Existing alignment approaches typically fail to resist malicious queries while preserving utility on benign ones effectively. To address these challenges, we propose Deep Aligned Visual Safety Prompt (DAVSP), which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach to train the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments across five benchmarks on two representative LVLMs demonstrate that DAVSP effectively resists malicious queries while preserving benign input utility. Furthermore, DAVSP exhibits great cross-model generation ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential components, jointly contributing to its overall effectiveness. The code is publicly available at https://github.com/zhangyitonggg/DAVSP.
The inherent risk of generating harmful and unsafe content by Large Language Models (LLMs), has highlighted the need for their safety alignment. Various techniques like supervised fine-tuning, reinforcement learning from human feedback, and red-teaming were developed for ensuring the safety alignment of LLMs. However, the robustness of these aligned LLMs is always challenged by adversarial attacks that exploit unexplored and underlying vulnerabilities of the safety alignment. In this paper, we develop a novel black-box jailbreak attack, called BitBypass, that leverages hyphen-separated bitstream camouflage for jailbreaking aligned LLMs. This represents a new direction in jailbreaking by exploiting fundamental information representation of data as continuous bits, rather than leveraging prompt engineering or adversarial manipulations. Our evaluation of five state-of-the-art LLMs, namely GPT-4o, Gemini 1.5, Claude 3.5, Llama 3.1, and Mixtral, in adversarial perspective, revealed the capabilities of BitBypass in bypassing their safety alignment and tricking them into generating harmful and unsafe content. Further, we observed that BitBypass outperforms several state-of-the-art jailbreak attacks in terms of stealthiness and attack success. Overall, these results highlights the effectiveness and efficiency of BitBypass in jailbreaking these state-of-the-art LLMs.
Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of input token space makes it inevitable to find adversarial prompts that can jailbreak these models, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts which are semantically related to toxic seed prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with an objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that can jailbreak aligned LLMs. Towards this, we propose a method of Response Guided Question Augmentation (ReG-QA) to evaluate the generalization of safety aligned LLMs to natural prompts, that first generates several toxic answers given a seed question using an unaligned LLM (Q to A), and further leverages an LLM to generate questions that are likely to produce these answers (A to Q). We interestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak questions from unsafe content (without denial) and can thus be used for the latter (A to Q) step. We obtain attack success rates that are comparable to/ better than leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against existing all attacks on the leaderboard.
Text-to-image (T2I) models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreak attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models to produce sensitive content, revealing critical safety vulnerabilities. However, existing attack methods implicitly assume that the attacker knows the type of deployed defenses, which limits their effectiveness against unknown or diverse defense mechanisms. In this work, we reveal an underexplored vulnerability of T2I models to metaphor-based jailbreak attacks (MJA), which aims to attack diverse defense mechanisms without prior knowledge of their type by generating metaphor-based adversarial prompts. Specifically, MJA consists of two modules: an LLM-based multi-agent generation module (LMAG) and an adversarial prompt optimization module (APO). LMAG decomposes the generation of metaphor-based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. Subsequently, LMAG coordinates three LLM-based agents to generate diverse adversarial prompts by exploring various metaphors and contexts. To enhance attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Extensive experiments on T2I models with various external and internal defense mechanisms demonstrate that MJA achieves stronger attack performance while using fewer queries, compared with six baseline methods. Additionally, we provide an in-depth vulnerability analysis suggesting that metaphor-based adversarial prompts evade safety mechanisms by inducing semantic ambiguity, while sensitive images arise from the model's probabilistic interpretation of concealed semantics.
Machine learning is advancing rapidly, with applications bringing notable benefits, such as improvements in translation and code generation. Models like ChatGPT, powered by Large Language Models (LLMs), are increasingly integrated into daily life. However, alongside these benefits, LLMs also introduce social risks. Malicious users can exploit LLMs by submitting harmful prompts, such as requesting instructions for illegal activities. To mitigate this, models often include a security mechanism that automatically rejects such harmful prompts. However, they can be bypassed through LLM jailbreaks. Current jailbreaks often require significant manual effort, high computational costs, or result in excessive model modifications that may degrade regular utility. We introduce TwinBreak, an innovative safety alignment removal method. Building on the idea that the safety mechanism operates like an embedded backdoor, TwinBreak identifies and prunes parameters responsible for this functionality. By focusing on the most relevant model layers, TwinBreak performs fine-grained analysis of parameters essential to model utility and safety. TwinBreak is the first method to analyze intermediate outputs from prompts with high structural and content similarity to isolate safety parameters. We present the TwinPrompt dataset containing 100 such twin prompts. Experiments confirm TwinBreak's effectiveness, achieving 89% to 98% success rates with minimal computational requirements across 16 LLMs from five vendors.
Large Language Models (LLMs) are known to be vulnerable to jailbreaking attacks, wherein adversaries exploit carefully engineered prompts to induce harmful or unethical responses. Such threats have raised critical concerns about the safety and reliability of LLMs in real-world deployment. While existing defense mechanisms partially mitigate such risks, subsequent advancements in adversarial techniques have enabled novel jailbreaking methods to circumvent these protections, exposing the limitations of static defense frameworks. In this work, we explore defending against evolving jailbreaking threats through the lens of context retrieval. First, we conduct a preliminary study demonstrating that even a minimal set of safety-aligned examples against a particular jailbreak can significantly enhance robustness against this attack pattern. Building on this insight, we further leverage the retrieval-augmented generation (RAG) techniques and propose Safety Context Retrieval (SCR), a scalable and robust safeguarding paradigm for LLMs against jailbreaking. Our comprehensive experiments demonstrate how SCR achieves superior defensive performance against both established and emerging jailbreaking tactics, contributing a new paradigm to LLM safety. Our code will be available upon publication.
Understanding the vulnerabilities of Large Vision Language Models (LVLMs) to jailbreak attacks is essential for their responsible real-world deployment. Most previous work requires access to model gradients, or is based on human knowledge (prompt engineering) to complete jailbreak, and they hardly consider the interaction of images and text, resulting in inability to jailbreak in black box scenarios or poor performance. To overcome these limitations, we propose a Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for toxicity maximization, referred to as PBI-Attack. Our method begins by extracting malicious features from a harmful corpus using an alternative LVLM and embedding these features into a benign image as prior information. Subsequently, we enhance these features through bidirectional cross-modal interaction optimization, which iteratively optimizes the bimodal perturbations in an alternating manner through greedy search, aiming to maximize the toxicity of the generated response. The toxicity level is quantified using a well-trained evaluation model. Experiments demonstrate that PBI-Attack outperforms previous state-of-the-art jailbreak methods, achieving an average attack success rate of 92.5% across three open-source LVLMs and around 67.3% on three closed-source LVLMs. Disclaimer: This paper contains potentially disturbing and offensive content.
Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge, this paper defines and investigates the Refusal Loss of LLMs and then proposes a method called Gradient Cuff to detect jailbreak attempts. Gradient Cuff exploits the unique properties observed in the refusal loss landscape, including functional values and its smoothness, to design an effective two-step detection strategy. Experimental results on two aligned LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5) and six types of jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) show that Gradient Cuff can significantly improve the LLM's rejection capability for malicious jailbreak queries, while maintaining the model's performance for benign user queries by adjusting the detection threshold.
Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy violating content. In this work we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories we observe an attack success rate (ASR) of up to 74.47%.
In this paper, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs. To address these limitations, we propose a novel blue-team method BlueSuffix that defends target VLMs against jailbreak attacks without compromising its performance under black-box setting. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator using reinforcement fine-tuning for enhancing cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructionBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks. Code is available at https://github.com/Vinsonzyh/BlueSuffix.
We consider the problem of red teaming LLMs on elementary calculations and algebraic tasks to evaluate how various prompting techniques affect the quality of outputs. We present a framework to procedurally generate numerical questions and puzzles, and compare the results with and without the application of several red teaming techniques. Our findings suggest that even though structured reasoning and providing worked-out examples slow down the deterioration of the quality of answers, the gpt-3.5-turbo and gpt-4 models are not well suited for elementary calculations and reasoning tasks, also when being red teamed.
Recently, people have suffered from LLM hallucination and have become increasingly aware of the reliability gap of LLMs in open and knowledge-intensive tasks. As a result, they have increasingly turned to search-augmented LLMs to mitigate this issue. However, LLM-driven search also becomes an attractive target for misuse. Once the returned content directly contains targeted, ready-to-use harmful instructions or takeaways for users, it becomes difficult to withdraw or undo such exposure. To investigate LLMs' unsafe search behavior issues, we first propose \textbf{\textit{SearchAttack}} for red-teaming, which (1) rephrases harmful semantics via dense and benign knowledge to evade direct in-context decoding, thus eliciting unsafe information retrieval, (2) stress-tests LLMs' reward-chasing bias by steering them to synthesize unsafe retrieved content. We also curate an emergent, domain-specific illicit activity benchmark for search-based threat assessment, and introduce a fact-checking framework to ground and quantify harm in both offline and online attack settings. Extensive experiments are conducted to red-team the search-augmented LLMs for responsible vulnerability assessment. Empirically, SearchAttack demonstrates strong effectiveness in attacking these systems. We also find that LLMs without web search can still be steered into harmful content output due to their information-seeking stereotypical behaviors.
As the industry increasingly adopts agentic AI systems, understanding their unique vulnerabilities becomes critical. Prior research suggests that security flaws at the model level do not fully capture the risks present in agentic deployments, where models interact with tools and external environments. This paper investigates this gap by conducting a comparative red teaming analysis of GPT-OSS-20B, a 20-billion parameter open-source model. Using our observability framework AgentSeer to deconstruct agentic systems into granular actions and components, we apply iterative red teaming attacks with harmful objectives from HarmBench at two distinct levels: the standalone model and the model operating within an agentic loop. Our evaluation reveals fundamental differences between model level and agentic level vulnerability profiles. Critically, we discover the existence of agentic-only vulnerabilities, attack vectors that emerge exclusively within agentic execution contexts while remaining inert against standalone models. Agentic level iterative attacks successfully compromise objectives that completely failed at the model level, with tool-calling contexts showing 24\% higher vulnerability than non-tool contexts. Conversely, certain model-specific exploits work exclusively at the model level and fail when transferred to agentic contexts, demonstrating that standalone model vulnerabilities do not always generalize to deployed systems.
Large Language Models (LLMs) can be used to red team other models (e.g. jailbreaking) to elicit harmful contents. While prior works commonly employ open-weight models or private uncensored models for doing jailbreaking, as the refusal-training of strong LLMs (e.g. OpenAI o3) refuse to help jailbreaking, our work turn (almost) any black-box LLMs into attackers. The resulting $J_2$ (jailbreaking-to-jailbreak) attackers can effectively jailbreak the safeguard of target models using various strategies, both created by themselves or from expert human red teamers. In doing so, we show their strong but under-researched jailbreaking capabilities. Our experiments demonstrate that 1) prompts used to create $J_2$ attackers transfer across almost all black-box models; 2) an $J_2$ attacker can jailbreak a copy of itself, and this vulnerability develops rapidly over the past 12 months; 3) reasong models, such as Sonnet-3.7, are strong $J_2$ attackers compared to others. For example, when used against the safeguard of GPT-4o, $J_2$ (Sonnet-3.7) achieves 0.975 attack success rate (ASR), which matches expert human red teamers and surpasses the state-of-the-art algorithm-based attacks. Among $J_2$ attackers, $J_2$ (o3) achieves highest ASR (0.605) against Sonnet-3.5, one of the most robust models.
We introduce a novel framework for consolidating multi-turn adversarial ``jailbreak'' prompts into single-turn queries, significantly reducing the manual overhead required for adversarial testing of large language models (LLMs). While multi-turn human jailbreaks have been shown to yield high attack success rates, they demand considerable human effort and time. Our multi-turn-to-single-turn (M2S) methods -- Hyphenize, Numberize, and Pythonize -- systematically reformat multi-turn dialogues into structured single-turn prompts. Despite removing iterative back-and-forth interactions, these prompts preserve and often enhance adversarial potency: in extensive evaluations on the Multi-turn Human Jailbreak (MHJ) dataset, M2S methods achieve attack success rates from 70.6 percent to 95.9 percent across several state-of-the-art LLMs. Remarkably, the single-turn prompts outperform the original multi-turn attacks by as much as 17.5 percentage points while cutting token usage by more than half on average. Further analysis shows that embedding malicious requests in enumerated or code-like structures exploits ``contextual blindness'', bypassing both native guardrails and external input-output filters. By converting multi-turn conversations into concise single-turn prompts, the M2S framework provides a scalable tool for large-scale red teaming and reveals critical weaknesses in contemporary LLM defenses.
Red teaming is critical for identifying vulnerabilities and building trust in current LLMs. However, current automated methods for Large Language Models (LLMs) rely on brittle prompt templates or single-turn attacks, failing to capture the complex, interactive nature of real-world adversarial dialogues. We propose a novel paradigm: training an AI to strategically `break' another AI. By formalizing red teaming as a Markov Decision Process (MDP) and employing a hierarchical Reinforcement Learning (RL) framework, we effectively address the inherent sparse reward and long-horizon challenges. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward, enabling it to uncover subtle vulnerabilities missed by existing baselines. This approach sets a new state-of-the-art, fundamentally reframing LLM red teaming as a dynamic, trajectory-based process (rather than a one-step test) essential for robust AI deployment.
Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs) and revealed the vulnerable safeguards of LLMs. Besides, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates the performance reproduction and fair comparison. Besides, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs, such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluated results and find that (1) GPT4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs. (2) Llama2 and Qwen-VL-Chat are more robust compared to other open-source models. (3) The transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found https://github.com/chenxshuo/RedTeamingGPT4V
Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.
With LLM usage rapidly increasing, their vulnerability to jailbreaks that create harmful outputs are a major security risk. As new jailbreaking strategies emerge and models are changed by fine-tuning, continuous testing for security vulnerabilities is necessary. Existing Red Teaming methods fall short in cost efficiency, attack success rate, attack diversity, or extensibility as new attack types emerge. We address these challenges with Modular And Diverse Malicious Attack MiXtures (MAD-MAX) for Automated LLM Red Teaming. MAD-MAX uses automatic assignment of attack strategies into relevant attack clusters, chooses the most relevant clusters for a malicious goal, and then combines strategies from the selected clusters to achieve diverse novel attacks with high attack success rates. MAD-MAX further merges promising attacks together at each iteration of Red Teaming to boost performance and introduces a similarity filter to prune out similar attacks for increased cost efficiency. The MAD-MAX approach is designed to be easily extensible with newly discovered attack strategies and outperforms the prominent Red Teaming method Tree of Attacks with Pruning (TAP) significantly in terms of Attack Success Rate (ASR) and queries needed to achieve jailbreaks. MAD-MAX jailbreaks 97% of malicious goals in our benchmarks on GPT-4o and Gemini-Pro compared to TAP with 66%. MAD-MAX does so with only 10.9 average queries to the target LLM compared to TAP with 23.3. WARNING: This paper contains contents which are offensive in nature.
Large language models (LLMs) are vulnerable to universal jailbreaks-prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more expansive and effective policy-based framework. By leveraging an attack LLM to generate a high volume of diverse adversarial prompts and then fine-tuning this attack model with a preference dataset, Jailbreak-Zero achieves Pareto optimality across the crucial objectives of policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. The empirical evidence demonstrates the superiority of this method, showcasing significantly higher attack success rates against both open-source and proprietary models like GPT-40 and Claude 3.5 when compared to existing state-of-the-art techniques. Crucially, Jailbreak-Zero accomplishes this while producing human-readable and effective adversarial prompts with minimal need for human intervention, thereby presenting a more scalable and comprehensive solution for identifying and mitigating the safety vulnerabilities of LLMs.
As large language models (LLMs) grow in power and influence, ensuring their safety and preventing harmful output becomes critical. Automated red teaming serves as a tool to detect security vulnerabilities in LLMs without manual labor. However, most existing methods struggle to balance the effectiveness and diversity of red-team generated attack prompts. To address this challenge, we propose \ourapproach, a novel automated red teaming training framework that utilizes reinforcement learning to explore and generate more effective attack prompts while balancing their diversity. Specifically, it consists of three training stages: (1) Cold Start: The red team model is supervised and fine-tuned on a jailbreak dataset obtained through imitation learning. (2) Warm-up Exploration: The model is trained in jailbreak instruction following and exploration, using diversity and consistency as reward signals. (3) Enhanced Jailbreak: Progressive jailbreak rewards are introduced to gradually enhance the jailbreak performance of the red-team model. Extensive experiments on a variety of LLMs show that \ourapproach effectively balances the diversity and effectiveness of jailbreak prompts compared to existing methods. Our work significantly improves the efficiency of red team exploration and provides a new perspective on automated red teaming.
Automatic adversarial prompt generation provides remarkable success in jailbreaking safely-aligned large language models (LLMs). Existing gradient-based attacks, while demonstrating outstanding performance in jailbreaking white-box LLMs, often generate garbled adversarial prompts with chaotic appearance. These adversarial prompts are difficult to transfer to other LLMs, hindering their performance in attacking unknown victim models. In this paper, for the first time, we delve into the semantic meaning embedded in garbled adversarial prompts and propose a novel method that "translates" them into coherent and human-readable natural language adversarial prompts. In this way, we can effectively uncover the semantic information that triggers vulnerabilities of the model and unambiguously transfer it to the victim model, without overlooking the adversarial information hidden in the garbled text, to enhance jailbreak attacks. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Experimental results demonstrate that our method significantly improves the success rate of jailbreak attacks against various safety-aligned LLMs and outperforms state-of-the-arts by large margins. With at most 10 queries, our method achieves an average attack success rate of 81.8% in attacking 7 commercial closed-source LLMs, including GPT and Claude-3 series, on HarmBench. Our method also achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks. Code at: https://github.com/qizhangli/Adversarial-Prompt-Translator.
In the realm of large vision language models (LVLMs), jailbreak attacks serve as a red-teaming approach to bypass guardrails and uncover safety implications. Existing jailbreaks predominantly focus on the visual modality, perturbing solely visual inputs in the prompt for attacks. However, they fall short when confronted with aligned models that fuse visual and textual features simultaneously for generation. To address this limitation, this paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. Initially, we adversarially embed universally harmful perturbations in an image, guided by a few-shot query-agnostic corpus (e.g., affirmative prefixes and negative inhibitions). This process ensures that image prompt LVLMs to respond positively to any harmful queries. Subsequently, leveraging the adversarial image, we optimize textual prompts with specific harmful intent. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts through a feedback-iteration manner. To validate the efficacy of our approach, we conducted extensive evaluations on various datasets and LVLMs, demonstrating that our method significantly outperforms other methods by large margins (+29.03% in attack success rate on average). Additionally, we showcase the potential of our attacks on black-box commercial LVLMs, such as Gemini and ChatGLM.
Existing work on jailbreak Multimodal Large Language Models (MLLMs) has focused primarily on adversarial examples in model inputs, with less attention to vulnerabilities, especially in model API. To fill the research gap, we carry out the following work: 1) We discover a system prompt leakage vulnerability in GPT-4V. Through carefully designed dialogue, we successfully extract the internal system prompts of GPT-4V. This finding indicates potential exploitable security risks in MLLMs; 2) Based on the acquired system prompts, we propose a novel MLLM jailbreaking attack method termed SASP (Self-Adversarial Attack via System Prompt). By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. Furthermore, in pursuit of better performance, we also add human modification based on GPT-4's analysis, which further improves the attack success rate to 98.7\%; 3) We evaluated the effect of modifying system prompts to defend against jailbreaking attacks. Results show that appropriately designed system prompts can significantly reduce jailbreak success rates. Overall, our work provides new insights into enhancing MLLM security, demonstrating the important role of system prompts in jailbreaking. This finding could be leveraged to greatly facilitate jailbreak success rates while also holding the potential for defending against jailbreaks.
Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks: carefully crafted malicious inputs intended to circumvent safety guardrails and elicit harmful responses. As such, we present AutoAdv, a novel framework that automates adversarial prompt generation to systematically evaluate and expose vulnerabilities in LLM safety mechanisms. Our approach leverages a parametric attacker LLM to produce semantically disguised malicious prompts through strategic rewriting techniques, specialized system prompts, and optimized hyperparameter configurations. The primary contribution of our work is a dynamic, multi-turn attack methodology that analyzes failed jailbreak attempts and iteratively generates refined follow-up prompts, leveraging techniques such as roleplaying, misdirection, and contextual manipulation. We quantitatively evaluate attack success rate (ASR) using the StrongREJECT (arXiv:2402.10260 [cs.CL]) framework across sequential interaction turns. Through extensive empirical evaluation of state-of-the-art models--including ChatGPT, Llama, and DeepSeek--we reveal significant vulnerabilities, with our automated attacks achieving jailbreak success rates of up to 86% for harmful content generation. Our findings reveal that current safety mechanisms remain susceptible to sophisticated multi-turn attacks, emphasizing the urgent need for more robust defense strategies.
We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve 100% attack success rate -- according to GPT-4 as a judge -- on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings, it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.
The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing schemes have explored the security vulnerabilities of MLLMs, the investigation into their visual safety boundaries remains insufficient. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework specifically designed to probe the visual safety boundaries of MLLMs. BVS employs a "reconstruction-then-generation" strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, thereby leading MLLMs to be induced into generating harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21\% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.
Multimodal large language models (MLLMs) have revolutionized vision-language understanding but remain vulnerable to multimodal jailbreak attacks, where adversarial inputs are meticulously crafted to elicit harmful or inappropriate responses. We propose UniGuard, a novel multimodal safety guardrail that jointly considers the unimodal and cross-modal harmful signals. UniGuard trains a multimodal guardrail to minimize the likelihood of generating harmful responses in a toxic corpus. The guardrail can be seamlessly applied to any input prompt during inference with minimal computational costs. Extensive experiments demonstrate the generalizability of UniGuard across multiple modalities, attack strategies, and multiple state-of-the-art MLLMs, including LLaVA, Gemini Pro, GPT-4o, MiniGPT-4, and InstructBLIP. Notably, this robust defense mechanism maintains the models' overall vision-language understanding capabilities.
Recently, Multimodal Large Language Models (MLLMs) have demonstrated their superior ability in understanding multimodal content. However, they remain vulnerable to jailbreak attacks, which exploit weaknesses in their safety alignment to generate harmful responses. Previous studies categorize jailbreaks as successful or failed based on whether responses contain malicious content. However, given the stochastic nature of MLLM responses, this binary classification of an input's ability to jailbreak MLLMs is inappropriate. Derived from this viewpoint, we introduce jailbreak probability to quantify the jailbreak potential of an input, which represents the likelihood that MLLMs generated a malicious response when prompted with this input. We approximate this probability through multiple queries to MLLMs. After modeling the relationship between input hidden states and their corresponding jailbreak probability using Jailbreak Probability Prediction Network (JPPN), we use continuous jailbreak probability for optimization. Specifically, we propose Jailbreak-Probability-based Attack (JPA) that optimizes adversarial perturbations on input image to maximize jailbreak probability, and further enhance it as Multimodal JPA (MJPA) by including monotonic text rephrasing. To counteract attacks, we also propose Jailbreak-Probability-based Finetuning (JPF), which minimizes jailbreak probability through MLLM parameter updates. Extensive experiments show that (1) (M)JPA yields significant improvements when attacking a wide range of models under both white and black box settings. (2) JPF vastly reduces jailbreaks by at most over 60\%. Both of the above results demonstrate the significance of introducing jailbreak probability to make nuanced distinctions among input jailbreak abilities.
As Spoken Language Models (SLMs) integrate speech and text modalities, they inherit the safety vulnerabilities of their LLM backbone and an expanded attack surface. SLMs have been previously shown to be susceptible to jailbreaking, where adversarial prompts induce harmful responses. Yet existing attacks largely remain unimodal, optimizing either text or audio in isolation. We explore gradient-based multimodal jailbreaks by introducing JAMA (Joint Audio-text Multimodal Attack), a joint multimodal optimization framework combining Greedy Coordinate Gradient (GCG) for text and Projected Gradient Descent (PGD) for audio, to simultaneously perturb both modalities. Evaluations across four state-of-the-art SLMs and four audio types demonstrate that JAMA surpasses unimodal jailbreak rate by 1.5x to 10x. We analyze the operational dynamics of this joint attack and show that a sequential approximation method makes it 4x to 6x faster. Our findings suggest that unimodal safety is insufficient for robust SLMs. The code and data are available at https://repos.lsv.uni-saarland.de/akrishnan/multimodal-jailbreak-slm
By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment.Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards.Yet, they fall short in uncovering root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreak in MLLMs? Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibiting overdefensive behaviors and imposing heavy training overhead.To bridge this gap, we present an comprehensive analysis of where, how and which harmful multimodal tokens bypass safeguards in MLLMs. Surprisingly, we find that less than 1% tokens in early-middle layers are responsible for inducing unsafe behaviors, highlighting the potential of precisely removing a small subset of harmful tokens, without requiring safety tuning, can still effectively improve safety against jailbreaks. Motivated by this, we propose Safe Prune-then-Restore (SafePTR), an training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers.Without incurring additional computational overhead, SafePTR significantly enhances the safety of MLLMs while preserving efficiency. Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR's state-of-the-art performance in mitigating jailbreak risks without compromising utility.
Large Vision-Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built-in safety mechanisms to elicit restricted content generation. Existing black-box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross-modal Adversarial Multimodal Obfuscation (CAMO), a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs' cross-modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi-step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency. Comprehensive evaluations conducted on leading LVLMs validate CAMO's effectiveness, showcasing robust performance and strong cross-model transferability. These results underscore significant vulnerabilities in current built-in safety mechanisms, emphasizing an urgent need for advanced, alignment-aware security and safety solutions in vision-language systems.
The recent breakthrough in large language models (LLMs) such as ChatGPT has revolutionized production processes at an unprecedented pace. Alongside this progress also comes mounting concerns about LLMs' susceptibility to jailbreaking attacks, which leads to the generation of harmful or unsafe content. While safety alignment measures have been implemented in LLMs to mitigate existing jailbreak attempts and force them to become increasingly complicated, it is still far from perfect. In this paper, we analyze the common pattern of the current safety alignment and show that it is possible to exploit such patterns for jailbreaking attacks by simultaneous obfuscation in queries and responses. Specifically, we propose WordGame attack, which replaces malicious words with word games to break down the adversarial intent of a query and encourage benign content regarding the games to precede the anticipated harmful content in the response, creating a context that is hardly covered by any corpus used for safety alignment. Extensive experiments demonstrate that WordGame attack can break the guardrails of the current leading proprietary and open-source LLMs, including the latest Claude-3, GPT-4, and Llama-3 models. Further ablation studies on such simultaneous obfuscation in query and response provide evidence of the merits of the attack strategy beyond an individual attack.
In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves integrating the decomposition of the jailbreak prompt into the user's original query into a one-shot learning example to effectively teach the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user's prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.
合并后,文献主要覆盖八类并列方向:①越狱的系统性表征/分类与评测基准;②LRM推理链与自主代理式越狱;③文本推理型越狱的构造与可迁移生成;④自动化红队/攻击搜索与优化;⑤多模态与检索增强/系统管线(knowledge-to-action)带来的额外攻击面;⑥隐身规避型(混淆、分解重建、双层混淆、编码加密触发)攻击;⑦检测与防御(检测框架、推理占位/对抗情景外推/安全检索、提示级固化与通用守护);⑧机制级干预(激活操控、参数/对齐机制移除)与实验流程方法学。整体体现LRMs越狱研究从“提示层模板”走向“推理链/管线/机制级”并同步强化“可搜索发现 + 可复现评测 + 可部署防御”的闭环。