AIGC Text Detection and Adversarial Countermeasures
Novel Detection Architectures and Multi-Dimensional Statistical/Linguistic Feature Mining
This cluster focuses on developing efficient detection architectures (e.g., mixture-of-experts networks, Transformer variants, and lightweight models) and on mining the subtle statistical signatures of AI-generated text (e.g., perplexity, negative log-likelihood, probability curvature) together with its linguistic signatures (e.g., word-frequency distributions, syntactic structure, discourse coherence). A minimal sketch of a likelihood-based detection score follows the paper list below.
- MAFD: Multiple Adversarial Features Detector for Enhanced Detection of Text-Based Adversarial Examples(Kaiwen Jin, Yifeng Xiong, Shuya Lou, Zhen Yu, 2024, Neural Processing Letters)
- CLULab-UofA at SemEval-2024 Task 8: Detecting Machine-Generated Text Using Triplet-Loss-Trained Text Similarity and Text Classification(Mohammadhossein Rezaei, Yeaeun Kwon, Reza Sanayei, Abhyuday Singh, Steven Bethard, 2024, No journal)
- Diversity Boosts AI-Generated Text Detection(Advik Raj Basani, Pin-Yu Chen, 2025, ArXiv)
- Robust detection of LLM-generated text through transfer learning with pre-trained Distilled BERT model(Jayaprakash Sundararaj, Durgaraman Maruthavanan, Deepak Jayabalan, Ashok Gadi Parthi, Balakrishna Pothineni, Vidyasagar Parlapalli, 2024, European Journal of Computer Science and Information Technology)
- MLSDET: Multi-LLM Statistical Deep Ensemble for Chinese AI-Generated Text Detection(Dianhui Mao, Denghui Zhang, Ao Zhang, Zhihua Zhao, 2025, ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Enhancing AI-Generated Text Identification with BERT-CNN and DistilBERT-BiLSTM Models(Rajsekhar Das, Ricky Dey, Sorbojit Mondal, Nabanita Das, Bikash Sadhukhan, 2025, 2025 International Conference on Computing, Intelligence, and Application (CIACON))
- Machine Learning Model for Paraphrases Detection Based on Text Content Pair Binary Classification(N. Kholodna, V. Vysotska, O. Markiv, Sofiia Chyrun, 2022, No journal)
- Advancing Machine-Generated Text Detection from an Easy to Hard Supervision Perspective(Chenwang Wu, Yiu-ming Cheung, Bo Han, Defu Lian, 2025, ArXiv)
- USTC-BUPT at SemEval-2024 Task 8: Enhancing Machine-Generated Text Detection via Domain Adversarial Neural Networks and LLM Embeddings(Zikang Guo, Kaijie Jiao, Xingyu Yao, Yuning Wan, Haoran Li, Benfeng Xu, L. Zhang, Quan Wang, Yongdong Zhang, Zhendong Mao, 2024, No journal)
- Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature(Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, Yue Zhang, 2023, ArXiv)
- AI-Generated Text is Non-Stationary: Detection via Temporal Tomography(Alva West, Yixuan Weng, Minjun Zhu, Luodan Zhang, Zhen Lin, Guangsheng Bao, Yue Zhang, 2025, ArXiv)
- MAF-Detect: A Multi-Scale Adaptive Fusion Framework for Zero-Shot Detection of LLM-Generated Text(Zhenhan Bai, Yan Zheng, 2025, 2025 7th International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI))
- IPAD: Inverse Prompt for AI Detection - A Robust and Explainable LLM-Generated Text Detector(Zheng Chen, Yushi Feng, Changyang He, Yue Deng, Hongxi Pu, Bo Li, 2025, ArXiv)
- Threads of Subtlety: Detecting Machine-Generated Texts Through Discourse Motifs(Zae Myung Kim, K. H. Lee, P. Zhu, Vipul Raheja, Dongyeop Kang, 2024, No journal)
- Unveiling Large Language Models Generated Texts: A Multi-Level Fine-Grained Detection Framework(Zhen Tao, Zhiyu Li, Runyu Chen, Dinghao Xi, Wei Xu, 2024, ArXiv)
- DeTinyLLM: Efficient detection of machine-generated text via compact paraphrase transformation(Shilei Tan, Yongcheng Zhou, Haoxiang Liu, Xuesong Wang, Si Chen, Wei Gong, 2026, Inf. Fusion)
- Mixture of Detectors: A Compact View of Machine-Generated Text Detection(Sai Teja Lekkala, Annepaka Yadagiri, Arun Kumar Challa, Samatha Reddy Machireddy, Partha Pakray, Chukhu Chunka, 2025, ArXiv)
- Discourse Features Enhance Detection of Document-Level Machine-Generated Content(Yupei Li, M. Milling, Lucia Specia, Bjorn W. Schuller, 2024, 2025 International Joint Conference on Neural Networks (IJCNN))
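To make the statistical side of this cluster concrete, here is a minimal sketch of a likelihood-based detection score in the spirit of the perplexity/NLL zero-shot detectors above. GPT-2 is only a stand-in scoring model, and the threshold separating human from machine text would have to be calibrated on real data; this is an illustration, not any single paper's method.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_nll(text: str) -> float:
    """Mean negative log-likelihood per token (lower = more LM-like)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

nll = mean_nll("The quick brown fox jumps over the lazy dog.")
print(f"mean NLL: {nll:.3f}  perplexity: {math.exp(nll):.1f}")
```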
Cross-Domain Generalization and Out-of-Distribution (OOD) Robust Detection
These works target the performance degradation detectors suffer when facing unseen generator models (e.g., moving from GPT-3 to GPT-4), unfamiliar semantic domains (medical, legal), or cross-lingual settings, and improve generalization through feature disentanglement, contrastive learning, domain knowledge distillation, and related techniques. A sketch of the gradient-reversal trick behind domain-adversarial training follows the list.
- Are AI-Generated Text Detectors Robust to Adversarial Perturbations?(Guanhua Huang, Yuchen Zhang, Zhe Li, Yongjian You, Mingze Wang, Zhouwang Yang, 2024, ArXiv)
- Text Meets Topology: Rethinking Out-of-distribution Detection in Text-Rich Networks(Danny Wang, Ruihong Qiu, Guangdong Bai, Zi Huang, 2025, ArXiv)
- Enhancing Domain Generalization for Robust Machine-Generated Text Detection(Sungwon Park, Sungwon Han, Meeyoung Cha, 2025, IEEE Transactions on Knowledge and Data Engineering)
- mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection(D. Macko, 2025, ArXiv)
- Increasing the Robustness of the Fine-tuned Multilingual Machine-Generated Text Detectors(D. Macko, Róbert Móro, Ivan Srba, 2025, ArXiv)
- Learning Representations through Contrastive Strategies for a more Robust Stance Detection(Udhaya Kumar Rajendran, Amir Ben Khalifa, Amine Trabelsi, 2023, 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA))
- DivScore: Zero-Shot Detection of LLM-Generated Text in Specialized Domains(Zhihui Chen, Kai He, Yucheng Huang, Yunxiao Zhu, Mengling Feng, 2025, No journal)
- G3Detector: General GPT-Generated Text Detector(Haolan Zhan, Xuanli He, Qiongkai Xu, Yuxiang Wu, Pontus Stenetorp, 2023, ArXiv)
- EAGLE: A Domain Generalization Framework for AI-generated Text Detection(Amrita Bhattacharjee, Raha Moraffah, Joshua Garland, Huan Liu, 2024, ArXiv)
- RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns(Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S. Chao, Derek F. Wong, 2025, Trans. Assoc. Comput. Linguistics)
- Robust AI-Generated Text Detection by Restricted Embeddings(Kristian Kuznetsov, Eduard Tulchinskii, Laida Kushnareva, German Magai, Serguei Barannikov, Sergey Nikolenko, Irina Piontkovskaya, 2024, ArXiv)
- Authorship Obfuscation in Multilingual Machine-Generated Text Detection(D. Macko, Róbert Móro, Adaku Uchendu, Ivan Srba, Jason Samuel Lucas, Michiharu Yamashita, Nafis Irtiza Tripto, Dongwon Lee, Jakub Simko, Maria Bielikova, 2024, ArXiv)
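Several of the works above (e.g., the USTC-BUPT system and EAGLE) rely on domain-adversarial training to learn generator-invariant features. The sketch below shows the core gradient reversal trick; the module sizes and heads are illustrative assumptions, not a reproduction of any cited system.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity on the forward pass, negated (scaled) gradient on the
        # backward pass: the feature extractor learns to *confuse* the
        # domain head while still serving the human/machine label head.
        return -ctx.lam * grad_output, None

class DannDetector(nn.Module):
    def __init__(self, dim=768, n_domains=4):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(dim, 256), nn.ReLU())
        self.label_head = nn.Linear(256, 2)           # human vs. machine
        self.domain_head = nn.Linear(256, n_domains)  # which generator/domain

    def forward(self, x, lam=1.0):
        h = self.features(x)
        return self.label_head(h), self.domain_head(GradReverse.apply(h, lam))

label_logits, domain_logits = DannDetector()(torch.randn(8, 768))
print(label_logits.shape, domain_logits.shape)
```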
Adversarial Attacks and Text "De-Fingerprinting" Evasion Techniques
From the attacker's perspective, these studies examine how recursive paraphrasing, prompt engineering (e.g., Self-Disguise), character-level perturbation, style transfer, or exploitation of physical-format (PDF) loopholes can erase the statistical fingerprints of AI text, allowing it to evade detection while preserving semantics. A toy character-level perturbation is sketched after the list.
- Leveraging Multi-Model Linguistic Fusion for Enhanced AI Text Generation Evasion(Afsar Khan, Sailesh Rajagopalan, 2025, 2025 2nd International Generative AI and Computational Language Modelling Conference (GACLM))
- Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors(Hao Fang, Jiawei Kong, Tianqu Zhuang, Yixiang Qiu, Kuofeng Gao, Bin Chen, Shutao Xia, Yaowei Wang, Min Zhang, 2025, No journal)
- Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors(Andrea Pedrotti, Michele Papucci, Cristiano Ciaccio, Alessio Miaschi, Giovanni Puccetti, F. Dell’Orletta, Andrea Esuli, 2025, ArXiv)
- Self-Disguise Attack: Induce the LLM to disguise itself for AIGT detection evasion(Yinghan Zhou, Juan Wen, Wanli Peng, Zhengxian Wu, Ziwei Zhang, Yiming Xue, 2025, ArXiv)
- AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs(Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian, 2024, ArXiv)
- Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack(Ying Zhou, Ben He, Le Sun, 2024, No journal)
- GPT-4 Attempting to Attack AI-Text Detectors(Alshehri Nojoud, Li Yuhao, 2024, No journal)
- GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors(Wenlong Meng, Shuguo Fan, Chengkun Wei, Min Chen, Yuwei Li, Yuanchao Zhang, Zhikun Zhang, Wenzhi Chen, 2025, No journal)
- MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization(Yongtong Gu, Songze Li, X. Hu, 2026, ArXiv)
- Sentences Based Adversarial Attack on AI-Generated Text Detectors(Rongxin Tu, Xiangui Kang, C. Tan, Chi-Hung Chi, Kwok-Yan Lam, 2026, IEEE Transactions on Big Data)
- Complete Evasion, Zero Modification: PDF Attacks on AI Text Detection(Aldan Creo, 2025, ArXiv)
- Adversarial Attacks on AI-Generated Text Detection Models: A Token Probability-Based Approach Using Embeddings(Ahmed K. Kadhim, Lei Jiao, R. Shafik, Ole-Christoffer Granmo, 2025, ArXiv)
- Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text(Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, S. Feizi, 2025, ArXiv)
- RAFT: Realistic Attacks to Fool Text Detectors(James Wang, Ran Li, Junfeng Yang, Chengzhi Mao, 2024, ArXiv)
- Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors(Tianchun Wang, Yuanzhou Chen, Zichuan Liu, Zhanwen Chen, Haifeng Chen, Xiang Zhang, Wei Cheng, 2024, ArXiv)
- Enhancing the Undetectability of AI-Generated Text: A Semantics-Preserving Evasion Technique(Bei-Bei Luo, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection(Xinlin Peng, Ying Zhou, Ben He, Le Sun, Yingfei Sun, 2024, ArXiv)
- RedHerring Attack: Testing the Reliability of Attack Detection(Jonathan Rusert, 2025, No journal)
- Interpretable Adversarial Perturbation in Input Embedding Space for Text(Motoki Sato, Jun Suzuki, Hiroyuki Shindo, Yuji Matsumoto, 2018, ArXiv)
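As a concrete illustration of the character-level end of this attack spectrum, the toy perturbation below swaps Latin letters for Cyrillic look-alikes and injects zero-width characters, two simple tricks that silently break tokenization-dependent detectors. The mapping and rates are invented for illustration; real attacks are more targeted.

```python
import random

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440"}  # Cyrillic look-alikes
ZERO_WIDTH = "\u200b"  # zero-width space

def perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])    # swap in a visually identical glyph
        elif ch == " " and rng.random() < rate:
            out.append(ZERO_WIDTH + " ")  # silently break tokenization
        else:
            out.append(ch)
    return "".join(out)

print(perturb("machine generated text often has low perplexity", rate=0.3))
```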
Adversarial-Game Defenses and Robustness-Enhancement Mechanisms
These works study how adversarial training, joint games (e.g., the co-evolution of a detector and a paraphraser), semantically invariant feature extraction, and paraphrase-inversion techniques can harden detectors against malicious perturbation and sophisticated rewriting; a toy version of the detector-paraphraser loop is sketched after the list.
- Enhancing the robustness of Fast-DetectGPT against paraphrase attacks(Suning Li, 2024, 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT))
- OUTFOX: LLM-generated Essay Detection through In-context Learning with Adversarially Generated Examples(Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki, 2023, No journal)
- RADAR: Robust AI-Text Detection via Adversarial Learning(Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho, 2023, ArXiv)
- Robust Detection of Paraphrased AI-Generated Text Using Deep Recurrent Neural Networks(M. Shirpurkar, 2026, International Journal for Research in Applied Science and Engineering Technology)
- GravText: A Robust Framework for Detecting LLM-Generated Text Using Triplet Contrastive Learning with Gravitational Factor(Youling Feng, Haoyu Wang, Jun Li, Zhongwei Cao, Linghao Yan, 2025, Syst.)
- Adversarial Training for Robust Natural Language Processing: A Focus on Sentiment Analysis and Machine Translation(Dr. B. Gomathy, Dr. T. Jayachitra, Dr. R. Rajkumar, Ms. V. Lalithamani, G. P. Ghantasala, Mr. I. Anantraj, Dr. C. Shyamala, G. Rajkumar, S. Saranya, 2024, Communications on Applied Nonlinear Analysis)
- Adversarial Robustness in Natural Language Processing: An Empirical Analysis of Machine Learning Model Vulnerabilities to Adversarial Attacks(Asheshemi Nelson Oghenekevwe, Okoro Akpohrobaro Daniel, Obode Aghogho Micheal, 2025, International Journal of Research and Innovation in Applied Science)
- Improving Gradient-based Adversarial Training for Text Classification by Contrastive Learning and Auto-Encoder(Yao Qiu, Jinchao Zhang, Jie Zhou, 2021, ArXiv)
- Defending mutation-based adversarial text perturbation: a black-box approach(Demetrio Deanda, I. Alsmadi, Jesus Guerrero, Gongbo Liang, 2025, Cluster Computing)
- PRDetect: Perturbation-Robust LLM-generated Text Detection Based on Syntax Tree(Xiang Li, Zhiyi Yin, Hexiang Tan, Shaoling Jing, Du Su, Yi Cheng, Huawei Shen, Fei Sun, 2025, No journal)
- Robustness of generative AI detection: adversarial attacks on black-box neural text detectors(Vitalii Fishchuk, Daniel Braun, 2024, International Journal of Speech Technology)
- Detecting the Invisible: Adversarial Strategies for AI-Generated Text in the LLM Era(Kent Alber Fredson, Yithro Paulus Tjendra, Leander Farrell Suryadi, Puti Andam Suri, 2025, 2025 8th International Conference on Information and Communications Technology (ICOIACT))
- Enhancing Neural Text Detector Robustness with μAttacking and RR-Training(G. Liang, Jesus Guerrero, Fengbo Zheng, I. Alsmadi, 2023, Electronics)
- Contrastive Triplet Learning for Robust Detection of AI-generated Content(Shraddha Vaidya, Jatinderkumar R. Saini, 2025, 2025 5th Asian Conference on Innovation in Technology (ASIANCON))
- Iron Sharpens Iron: Defending Against Attacks in Machine-Generated Text Detection with Adversarial Training(Yuanfang Li, Zhaohan Zhang, Chengzhengxu Li, Chao Shen, Xiaoming Liu, 2025, No journal)
- Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations(Sai Teja Lekkala, Annepaka Yadagiri, Sangam Sai Anish, Siva Gopala Krishna Nuthakki, Partha Pakray, 2025, 2026 20th International Conference on Ubiquitous Information Management and Communication (IMCOM))
- Detecting Word-Level Adversarial Text Attacks via SHapley Additive exPlanations(E. Mosca, Lukas Huber, Marc Alexander Kühn, G. Groh, 2022, No journal)
- DAMAGE: Detecting Adversarially Modified AI Generated Text(Elyas Masrour, Bradley Emi, Max Spero, 2025, ArXiv)
- Mitigating Paraphrase Attacks on Machine-Text Detection via Paraphrase Inversion(Rafael A. Rivera Soto, Barry Chen, Nicholas Andrews, 2024, No journal)
- Learning from Mistakes: Self-correct Adversarial Training for Chinese Unnatural Text Correction(Xuan Feng, Tianlong Gu, Xiaoli Liu, Liang Chang, 2024, ArXiv)
- Exploration of Contrastive Learning Strategies toward more Robust Stance Detection(Udhaya Kumar Rajendran, Amine Trabelsi, 2023, No journal)
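The detector-paraphraser game that runs through this cluster (RADAR, OUTFOX, "Iron Sharpens Iron") can be summarized in a few lines. The toy loop below uses a TF-IDF classifier and a crude synonym-swap "paraphraser" purely so the alternating-update structure is runnable as-is; real systems put an LLM and a neural detector in these roles.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human = ["I honestly didn't expect the ending.",
         "We got lost twice on the way there."]
machine = ["The results demonstrate significant improvements.",
           "In conclusion, the proposed method is effective."]

def toy_paraphrase(text: str, rng: random.Random) -> str:
    # Stand-in for an LLM paraphraser: crude probabilistic synonym swaps.
    swaps = {"demonstrate": "show", "significant": "notable",
             "proposed": "presented", "effective": "useful"}
    return " ".join(swaps.get(w, w) if rng.random() < 0.8 else w
                    for w in text.split())

rng = random.Random(0)
adv_pool = list(machine)
for rnd in range(3):
    # 1) Re-train the detector on human text vs. the growing adversarial pool.
    X, y = human + adv_pool, [0] * len(human) + [1] * len(adv_pool)
    detector = make_pipeline(TfidfVectorizer(), LogisticRegression())
    detector.fit(X, y)
    # 2) Paraphrase machine text; rewrites that fool the current detector
    #    are fed back as hard positives for the next round.
    rewrites = [toy_paraphrase(t, rng) for t in machine]
    adv_pool += [t for t in rewrites if detector.predict([t])[0] == 0]
    print(f"round {rnd}: pool size {len(adv_pool)}")
```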
Proactive Defense: Text Watermarking and Its Security
This cluster explores proactive defenses that embed hidden signals (watermarks) at LLM generation time, including semantic watermarks, frequency-based watermarks, and low-entropy enhancement techniques, and studies "color-aware" attacks against watermarks as well as watermark robustness to paraphrasing; a minimal green-list watermark detector is sketched after the list.
- An Entropy-based Text Watermarking Detection Method(Yijian Lu, Aiwei Liu, Dianzhi Yu, Jingjing Li, Irwin King, 2024, No journal)
- A Linguistics-Aware LLM Watermarking via Syntactic Predictability(Shinwoo Park, Hyejin Park, Hyeseon Ahn, Yo-Sub Han, 2025, ArXiv)
- DeepTextMark: Deep Learning based Text Watermarking for Detection of Large Language Model Generated Text(Travis J. E. Munyer, Xin Zhong, 2023, ArXiv)
- Bypassing LLM Watermarks with Color-Aware Substitutions(Qilong Wu, Varun Chandrasekaran, 2024, No journal)
- Provable Robust Watermarking for AI-Generated Text(Xuandong Zhao, P. Ananth, Lei Li, Yu-Xiang Wang, 2023, ArXiv)
- Toward Evasion-Resistant LLM Attribution with Multi-Scale Watermarking and Cryptographic Verification(Pieter Janssen, E. Conti, 2026, Frontiers in Artificial Intelligence Research)
- Post-Hoc Watermarking for Robust Detection in Text Generated by Large Language Models(Jifei Hao, Jipeng Qiang, Yi Zhu, Yun Li, Yunhao Yuan, Xiaoye Ouyang, 2025, No journal)
- CurveMark: Detecting AI-Generated Text via Probabilistic Curvature and Dynamic Semantic Watermarking(Yuhan Zhang, Xingxiang Jiang, Hua Sun, Yao Zhang, Deyu Tong, 2025, Entropy)
- Character-Level Perturbations Disrupt LLM Watermarks(Zhaoxi Zhang, Xiaomei Zhang, Yanjun Zhang, He Zhang, Shirui Pan, Bo Liu, Asif Gill, L. Zhang, 2025, ArXiv)
- Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks(Yixin Cheng, Hongcheng Guo, Yangming Li, Leonid Sigal, 2025, ArXiv)
- Adaptive Robust Watermarking for Large Language Models via Dynamic Token Embedding Perturbation(Ziyang Zeng, Han Lin, Shuxin Zhang, Boyuan Wang, 2026, IEEE Access)
- k-SemStamp: A Clustering-Based Semantic Watermark for Detection of Machine-Generated Text(A. Hou, Jingyu (Jack) Zhang, Yichen Wang, Daniel Khashabi, Tianxing He, 2024, ArXiv)
- FreqMark: Frequency-Based Watermark for Sentence-Level Detection of LLM-Generated Text(Zhenyu Xu, Kun Zhang, Victor S. Sheng, 2024, ArXiv)
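Most of the watermarking schemes above descend from the "green list" idea: bias generation toward a pseudorandom subset of the vocabulary keyed on context, then test how over-represented that subset is. A minimal detector-side sketch follows; the hashing scheme and parameters are illustrative, and the "color-aware" attacks cited above work precisely by identifying and substituting these green tokens.

```python
import hashlib
import math

def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    # Key a hash on the previous token; `token` is "green" if the hash
    # falls in the bottom `gamma` fraction of the hash range.
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < gamma

def watermark_zscore(tokens: list, gamma: float = 0.5) -> float:
    """z-score of the green-token count; large values suggest a watermark."""
    green = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    n = len(tokens) - 1  # number of scored tokens
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# Unwatermarked text should hover near z = 0; watermarked generation,
# which preferentially samples green tokens, drifts to large positive z.
print(watermark_zscore("the model generates fluent text at scale".split()))
```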
Systematic Evaluation Benchmarks, Theoretical Limits, and Vertical-Domain Applications
These works provide large-scale evaluation benchmarks (e.g., PADBen, SHIELD), analyze the theoretical limits of detection (e.g., via KL divergence and membership inference attacks), and assess practical effectiveness in vertical scenarios such as academic integrity, phishing-email detection, cybersecurity reporting, and mixed human-AI text; a sketch of the TPR@FPR metric these benchmarks emphasize follows the list.
- Telescope: Discovering Multilingual LLM Generated Texts with Small Specialized Language Models(Héctor Cerezo-Costas, Pedro Alonso Doval, Maximiliano Hormazábal-Lagos, Aldan Creo, 2024, No journal)
- Can AI-Generated Text be Reliably Detected?(Vinu Sankar Sadasivan, Aounon Kumar, S. Balasubramanian, Wenxiao Wang, S. Feizi, 2023, ArXiv)
- When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection(Lang Gao, Xuhui Li, Chenxi Wang, Mingzhe Li, Wei Liu, Zirui Song, Jinghui Zhang, Rui Yan, Preslav Nakov, Xiuying Chen, 2025, ArXiv)
- Machine Text Detectors are Membership Inference Attacks(Ryuto Koike, Liam Dugan, Masahiro Kaneko, Christopher Callison-Burch, Naoaki Okazaki, 2025, ArXiv)
- PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks(Yiwei Zha, Rui Min, Shanu Sushmita, 2025, ArXiv)
- Beyond Easy Wins: A Text Hardness-Aware Benchmark for LLM-generated Text Detection(Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee, 2025, ArXiv)
- Evaluating the Reliability of Generative AI in Distinguishing Machine from Human Text(Y. Yuhefizar, Ronal Watrianthos, Dony Marzuki, 2025, Data and Metadata)
- A Practical Examination of AI-Generated Text Detectors for Large Language Models(Brian Tufts, Xuandong Zhao, Lei Li, 2024, No journal)
- Detecting AI-Generated Text in Student Submissions Using Multi-Modal Classification(H. Yamashita, Lukas Meier, 2025, International Journal of Innovative Science and Research Technology)
- LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?(Qihui Zhang, Chujie Gao, Dongping Chen, Yue Huang, Yixin Huang, Zhenyang Sun, Shilin Zhang, Weiye Li, Zhengyan Fu, Yao Wan, Lichao Sun, 2024, No journal)
- Exposing AI-Generated Threat Reports through Semantic and Adversarial Detection Models(Manvi Breja, Namita Dahiya, 2025, 2025 2nd International Conference on Intelligent Systems for Cybersecurity (ISCS))
- Can AI-Generated Text be Reliably Detected? Stress Testing AI Text Detectors Under Various Attacks(Vinu Sankar Sadasivan, Aounon Kumar, S. Balasubramanian, Wenxiao Wang, S. Feizi, 2025, Trans. Mach. Learn. Res.)
- Explainable Transformer-Based Email Phishing Classification with Adversarial Robustness(P. SajadU, 2025, ArXiv)
- How You Prompt Matters! Even Task-Oriented Constraints in Instructions Affect LLM-Generated Text Detection(Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki, 2023, No journal)
- MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector(Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, Tao Jiang, 2024, ArXiv)
- SAVANA- A Robust Framework for Deepfake Video Detection and Hybrid Double Paraphrasing with Probabilistic Analysis Approach for AI Text Detection(Dr. Viomesh Kumar Singh, Bhavesh Agone, Aryan More, Aryan Mengawade, Atharva Deshmukh, Atharva Badgujar, 2024, International Journal for Research in Applied Science and Engineering Technology)
- CNLP-NITS-PP at GenAI Detection Task 3: Cross-Domain Machine-Generated Text Detection Using DistilBERT Techniques(Sai Teja Lekkala, Annepaka Yadagiri, Mangadoddi Srikar Vardhan, Partha Pakray, 2025, No journal)
- Random at GenAI Detection Task 3: A Hybrid Approach to Cross-Domain Detection of Machine-Generated Text with Adversarial Attack Mitigation(Shifali Agrahari, Prabhat Mishra, Sujit Kumar, 2025, No journal)
- Pangram at GenAI Detection Task 3: An Active Learning Approach to Machine-Generated Text Detection(Bradley Emi, Max Spero, Elyas Masrour, 2025, No journal)
- LuxVeri at GenAI Detection Task 3: Cross-Domain Detection of AI-Generated Text Using Inverse Perplexity-Weighted Ensemble of Fine-Tuned Transformer Models(Md Kamrujjaman Mobin, Md Saiful Islam, 2025, ArXiv)
- BBN-U.Oregon's ALERT system at GenAI Content Detection Task 3: Robust Authorship Style Representations for Cross-Domain Machine-Generated Text Detection(Hemanth Kandula, Chak-Fai Li, Haoling Qiu, Damianos G. Karakos, Hieu Man, T. Nguyen, Brian Ulicny, 2025, No journal)
- Investigating generative AI models and detection techniques: impacts of tokenization and dataset size on identification of AI-generated text(Haowei Hua, C. Yao, 2024, Frontiers in Artificial Intelligence)
- Multi-Class Detection of Humanized AI Text Using Machine Learning and Transformer Models(Batyr Sharimbayev, A. Kazin, Shirali Kadyrov, 2025, 2025 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE))
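A recurring theme in these evaluations is that AUROC hides deployment-relevant failure: what matters is the true positive rate at a strict false positive rate (TPR@FPR). Below is a sketch of that metric on toy scores, assuming higher detector scores mean "more likely machine-generated".

```python
import numpy as np

def tpr_at_fpr(human_scores, machine_scores, fpr=0.01):
    # Pick the threshold so at most `fpr` of human documents are flagged,
    # then measure how much machine text is still caught at that threshold.
    threshold = np.quantile(human_scores, 1 - fpr)
    return float(np.mean(np.asarray(machine_scores) > threshold))

rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, 10_000)    # toy detector scores on human text
machine = rng.normal(2.0, 1.0, 10_000)  # toy scores on machine text
print(f"TPR@1%FPR = {tpr_at_fpr(human, machine):.3f}")
```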
Taken together, the merged groups cover the full technical life cycle of AIGC text detection and its adversarial counterpart. The research has evolved from early, simple statistical binary classification into a sophisticated system that deeply mines linguistic features and uses adversarial games to improve robustness. The field currently shows three major trends. First, attack-defense co-evolution: attacks have shifted from simple paraphrasing to sophisticated "de-fingerprinting" evasion, while defenses have introduced adversarial training and proactive watermarking. Second, generalization and robustness: research attention has moved to failures across generator models, domains, and out-of-distribution data. Third, deployment and standardization: large-scale industry benchmarks and international shared tasks are driving adoption in real-world settings such as academic integrity and cybersecurity.
A total of 114 related papers.
In recent years, text-generation tools built on Artificial Intelligence (AI) have occasionally been misused across various domains, such as generating student reports or creative writing. This issue prompts plagiarism-detection services to enhance their capabilities in identifying AI-generated content. Adversarial attacks are often used to test the robustness of AI-generated text detectors. This work proposes a novel textual adversarial attack on detection models such as Fast-DetectGPT. The method employs embedding models for data perturbation, reconstructing AI-generated texts to reduce the likelihood that their true origin is detected. Specifically, we employ different embedding techniques, including the Tsetlin Machine (TM), an interpretable machine-learning approach, for this purpose. By combining synonyms and embedding similarity vectors, we demonstrate a state-of-the-art reduction in detection scores against Fast-DetectGPT. In particular, on the XSum dataset the detection score decreased from 0.4431 to 0.2744 AUROC, and on the SQuAD dataset it dropped from 0.5068 to 0.3532 AUROC.
Detecting AI-generated text is an increasing necessity to combat misuse of LLMs in education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. While prior detectors often rely on token-level likelihoods or opaque black-box classifiers, these approaches struggle against high-quality generations and offer little interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to 18.7% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.
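A hedged sketch of the surprisal-variability idea described above: score each token's surprisal under a small LM and summarize how much it fluctuates. The feature set and the choice of GPT-2 are assumptions for illustration, not the DivEye implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def surprisal_features(text: str) -> dict:
    ids = tok(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logp = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
    s = -logp.gather(1, ids[0, 1:, None]).squeeze(1)  # per-token surprisal
    return {"mean": s.mean().item(),
            "std": s.std().item(),   # "burstiness": humans fluctuate more
            "max": s.max().item()}

print(surprisal_features("The committee will reconvene after lunch, weather permitting."))
```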
With the advancement in capabilities of Large Language Models (LLMs), one major step in the responsible and safe use of such LLMs is to be able to detect text generated by these models. While supervised AI-generated text detectors perform well on text generated by older LLMs, with the frequent release of new LLMs, building supervised detectors for identifying text from such new models would require new labeled training data, which is infeasible in practice. In this work, we tackle this problem and propose a domain generalization framework for the detection of AI-generated text from unseen target generators. Our proposed framework, EAGLE, leverages the labeled data that is available so far from older language models and learns features invariant across these generators, in order to detect text generated by an unknown target generator. EAGLE learns such domain-invariant features by combining the representational power of self-supervised contrastive learning with domain adversarial training. Through our experiments we demonstrate how EAGLE effectively achieves impressive performance in detecting text generated by unseen target generators, including recent state-of-the-art ones such as GPT-4 and Claude, reaching detection scores of within 4.7% of a fully supervised detector.
AI Generated Text (AIGT) detectors are developed with texts from humans and LLMs of common tasks. Despite the diversity of plausible prompt choices, these datasets are generally constructed with a limited number of prompts. The lack of prompt variation can introduce prompt-specific shortcut features that exist in data collected with the chosen prompt, but do not generalize to others. In this paper, we analyze the impact of such shortcuts in AIGT detection. We propose Feedback-based Adversarial Instruction List Optimization (FAILOpt), an attack that searches for instructions deceptive to AIGT detectors exploiting prompt-specific shortcuts. FAILOpt effectively drops the detection performance of the target detector, performing comparably to other attacks based on adversarial in-context examples. We also utilize our method to enhance the robustness of the detector by mitigating the shortcuts. Based on the findings, we further train the classifier with the dataset augmented by the FAILOpt prompt. The augmented classifier exhibits improvements across generation models, tasks, and attacks. Our code will be available at https://github.com/zxcvvxcz/FAILOpt.
With the development of large language models (LLMs), detecting whether text is generated by a machine becomes increasingly challenging in the face of malicious use cases like the spread of false information, protection of intellectual property, and prevention of academic plagiarism. While well-trained text detectors have demonstrated promising performance on unseen test data, recent research suggests that these detectors have vulnerabilities when dealing with adversarial attacks, such as paraphrasing. In this paper, we propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection. We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model’s robustness against such attacks. The empirical results reveal that the current detection model can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content. Furthermore, we explore the prospect of improving the model’s robustness over iterative adversarial learning. Although some improvements in model robustness are observed, practical applications still face significant challenges. These findings shed light on the future development of AI-text detectors, emphasizing the need for more accurate and robust detection methods.
The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to simple paraphrasing attack--which, ironically, increases the true positive at 1% false positive (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT--adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors--including neural network-based, watermark-based, and zero-shot approaches--our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success to find that our method can significantly reduce detection rates, with mostly a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in the light of increasingly sophisticated evasion techniques.
The growth of highly advanced Large Language Models (LLMs) constitutes a huge dual-use problem, making it necessary to create dependable AI-generated text detection systems. Modern detectors are notoriously vulnerable to adversarial attacks, with paraphrasing standing out as an effective evasion technique that foils statistical detection. This paper presents a comparative study of adversarial robustness, first by quantifying the limitations of standard adversarial training and then by introducing a novel, significantly more resilient detection framework: Perturbation-Invariant Feature Engineering (PIFE). The framework enhances detection by first transforming input text into a standardized form using a multi-stage normalization pipeline; it then quantifies the transformation's magnitude using metrics like Levenshtein distance and semantic similarity, feeding these signals directly to the classifier. We evaluate both a conventionally hardened Transformer and our PIFE-augmented model against a hierarchical taxonomy of character-, word-, and sentence-level attacks. Our findings first confirm that conventional adversarial training, while resilient to syntactic noise, fails against semantic attacks, an effect we term the "semantic evasion threshold": its True Positive Rate at a strict 1% False Positive Rate plummets to 48.8%. In stark contrast, our PIFE model, which explicitly engineers features from the discrepancy between a text and its canonical form, overcomes this limitation. It maintains a remarkable 82.6% TPR under the same conditions, effectively neutralizing the most sophisticated semantic attacks. This superior performance demonstrates that explicitly modeling perturbation artifacts, rather than merely training on them, is a more promising path toward achieving genuine robustness in the adversarial arms race.
This paper presents our approach for Task 3 of the GenAI content detection workshop at COLING-2025, focusing on Cross-Domain Machine-Generated Text (MGT) Detection. We propose an ensemble of fine-tuned transformer models, enhanced by inverse perplexity weighting, to improve classification accuracy across diverse text domains. For Subtask A (Non-Adversarial MGT Detection), we combined a fine-tuned RoBERTa-base model with an OpenAI detector-integrated RoBERTa-base model, achieving an aggregate TPR score of 0.826, ranking 10th out of 23 detectors. In Subtask B (Adversarial MGT Detection), our fine-tuned RoBERTa-base model achieved a TPR score of 0.801, securing 8th out of 22 detectors. Our results demonstrate the effectiveness of inverse perplexity-based weighting for enhancing generalization and performance in both non-adversarial and adversarial MGT detection, highlighting the potential for transformer models in cross-domain AI-generated content detection.
The widespread use of large language models (LLMs) has sparked concerns about the potential misuse of AI-generated text, as these models can produce content that closely resembles human-generated text. Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations, with even minor changes in characters or words causing a reversal in distinguishing between human-created and AI-generated text. This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN). The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations. We also propose a siamese calibration technique to train the model to make equally confident predictions under different noise, which improves the model's robustness against adversarial perturbations. Experiments on four publicly available datasets show that the SCRN outperforms all baseline methods, achieving 6.5%-18.25% absolute accuracy improvement over the best baseline method under adversarial attacks. Moreover, it exhibits superior generalizability in cross-domain, cross-genre, and mixed-source scenarios. The code is available at https://github.com/CarlanLark/Robust-AIGC-Detector.
The field of AI-generated text detection has evolved from supervised classification to zero-shot statistical analysis. However, current approaches share a fundamental limitation: they aggregate token-level measurements into scalar scores, discarding positional information about where anomalies occur. Our empirical analysis reveals that AI-generated text exhibits significant non-stationarity: statistical properties vary 73.8% more between text segments than in human writing. This discovery explains why existing detectors fail against localized adversarial perturbations that exploit this overlooked characteristic. We introduce Temporal Discrepancy Tomography (TDT), a novel detection paradigm that preserves positional information by reformulating detection as a signal processing task. TDT treats token-level discrepancies as a time-series signal and applies a Continuous Wavelet Transform to generate a two-dimensional time-scale representation, capturing both the location and linguistic scale of statistical anomalies. On the RAID benchmark, TDT achieves 0.855 AUROC (a 7.1% improvement over the best baseline). More importantly, TDT demonstrates robust performance on adversarial tasks, with a 14.1% AUROC improvement on HART Level 2 paraphrasing attacks. Despite its sophisticated analysis, TDT maintains practical efficiency with only 13% computational overhead. Our work establishes non-stationarity as a fundamental characteristic of AI-generated text and demonstrates that preserving temporal dynamics is essential for robust detection.
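To illustrate the signal-processing reformulation described above, the sketch below applies a continuous wavelet transform (via PyWavelets) to a toy per-token discrepancy series with one localized anomaly. The discrepancy measure, wavelet choice, and scales are assumptions, not the paper's exact pipeline.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
# Toy discrepancy series: mostly flat (LM-like) with one "humanized" burst.
signal = rng.normal(0.5, 0.05, 200)
signal[120:140] += rng.normal(1.5, 0.3, 20)   # localized anomaly

scales = np.arange(1, 33)
coeffs, _ = pywt.cwt(signal, scales, "morl")  # (n_scales, n_tokens) map
# A detector can pool over this 2-D map instead of one scalar score;
# the burst shows up as high energy localized in both time and scale.
si, ti = np.unravel_index(np.abs(coeffs).argmax(), coeffs.shape)
print(f"strongest response at scale {scales[si]}, token index {ti}")
```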
As LLM-generated text becomes increasingly human-like, detecting it, especially when paraphrased, becomes more challenging. This paper enhances the AI-Catcher model by introducing adversarial training using the DAIGT v4 dataset, with a key focus on adding perturbed samples during training and new linguistic and statistical features. The goal is to make the model more aware of paraphrastic variations, which often help LLM-generated content evade detection. This approach improves the model's robustness by exposing it to both human- and LLM-generated paraphrases, enabling better generalization and higher accuracy, especially in adversarial settings. Our experiments further validate the effectiveness of this enhancement, with the enhanced model outperforming the baseline. Specifically, the inclusion of new features led to a 0.6% increase in F1-score compared to the previous study, followed by an additional 0.8% gain after applying adversarial training.
Large language models (LLMs) pose significant challenges to content authentication, as their sophisticated generation capabilities make distinguishing AI-produced text from human writing increasingly difficult. Current detection methods suffer from limited information capture, poor rate–distortion trade-offs, and vulnerability to adversarial perturbations. We present CurveMark, a novel dual-channel detection framework that combines probability curvature analysis with dynamic semantic watermarking, grounded in information-theoretic principles to maximize mutual information between text sources and observable features. To address the limitation of requiring prior knowledge of source models, we incorporate a Bayesian multi-hypothesis detection framework for statistical inference without prior assumptions. Our approach embeds imperceptible watermarks during generation via entropy-aware, semantically informed token selection and extracts complementary features from probability curvature patterns and watermark-specific metrics. Evaluation across multiple datasets and LLM architectures demonstrates 95.4% detection accuracy with minimal quality degradation (perplexity increase < 1.3), achieving 85–89% channel capacity utilization and robust performance under adversarial perturbations (72–94% information retention).
While demonstrating its powerful text generation capabilities, large language models have also raised concerns about malicious behaviors such as academic fraud and the dissemination of fake news. Consequently, accurately identifying and detecting AI-generated text has become increasingly important. However, current research reveals that detectors targeting AI-generated text still have significant vulnerabilities. Although existing evasion methods have achieved some success in reducing detection rates, they lack effective control over semantic preservation. To address this issue, we propose a novel approach. First, we introduce a comprehensive replacement of continuous multi-word units to generate natural adversarial samples while ensuring semantic consistency. Second, we design an innovative method based on hybrid semantic primitives and synonyms to expand the candidate set and select the most semantically similar words for replacement using semantic similarity measures. Finally, we evaluate the effectiveness of our method on three of the most popular existing detectors. Experimental results show that our approach achieves the highest attack success rate while significantly lowering detection rates to below 50%, maintaining high semantic similarity and low grammatical error rates. Through these innovations, our method demonstrates remarkable advantages in generating natural and hard-to-detect adversarial texts, providing new insights and directions for future research.
The proliferation of large language models has raised growing concerns about their misuse, particularly in cases where AI-generated text is falsely attributed to human authors. Machine-generated content detectors claim to effectively identify such text under various conditions and from any language model. This paper critically evaluates these claims by assessing several popular detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, PHD, LogRank, Binoculars) on a range of domains, datasets, and models that these detectors have not previously encountered. We employ various prompting strategies to simulate practical adversarial attacks, demonstrating that even moderate efforts can significantly evade detection. We emphasize the importance of the true positive rate at a specific false positive rate (TPR@FPR) metric and demonstrate that these detectors perform poorly in certain settings, with TPR@.01 as low as 0%. Our findings suggest that both trained and zero-shot detectors struggle to maintain high sensitivity while achieving a reasonable true positive rate.
Large language models (LLMs) have exhibited remarkable capabilities in text generation tasks. However, the utilization of these models carries inherent risks, including but not limited to plagiarism, the dissemination of fake news, and issues in educational exercises. Although several detectors have been proposed to address these concerns, their effectiveness against adversarial perturbations, specifically in the context of student essay writing, remains largely unexplored. This paper aims to bridge this gap by constructing AIG-ASAP, an AI-generated student essay dataset, employing a range of text perturbation methods that are expected to generate high-quality essays while evading detection. Through empirical experiments, we assess the performance of current AIGC detectors on the AIG-ASAP dataset. The results reveal that the existing detectors can be easily circumvented using straightforward automatic adversarial attacks. Specifically, we explore word substitution and sentence substitution perturbation methods that effectively evade detection while maintaining the quality of the generated essays. This highlights the urgent need for more accurate and robust methods to detect AI-generated student essays in the education domain.
Large-scale Language Models (LLMs) have transformed the field of Natural Language Processing (NLP) by improving various applications like text generation, summarization, and machine translation. While they significantly improve performance in these areas, they also raise potential risks in the field of cybersecurity. In today's digital age, LLMs play a major role in creating AI-enhanced CTI reports that simulate human-authored reports, making the two difficult to distinguish. Traditional detection methods, such as lexical and rule-based approaches, are not capable enough to distinguish the reports. Thus, this paper presents an integrated semantic and adversarial detection framework to effectively distinguish human-authored and AI-synthesized CTI reports. The proposed framework utilizes semantic, contextual, and discourse features with adversarial training to protect against attacks. A hybrid dataset containing 16,000 reports was used to test the baseline BERT, TF-IDF+SVM, and our model. The proposed model achieves 94.2% accuracy and offers a viable path to enhancing CTI sharing platforms.
No abstract available
This paper introduces the system developed by USTC-BUPT for SemEval-2024 Task 8. The shared task comprises three subtasks across four tracks, aiming to develop automatic systems to distinguish between human-written and machine-generated text across various domains, languages and generators. Our system comprises four components: DATeD, LLAM, TLE, and AuDM, which empower us to effectively tackle all subtasks posed by the challenge. In the monolingual track, DATeD improves machine-generated text detection by incorporating a gradient reversal layer and integrating additional domain labels through Domain Adversarial Neural Networks, enhancing adaptation to diverse text domains. In the multilingual track, LLAM employs different strategies based on language characteristics. For English text, the LLM Embeddings approach utilizes embeddings from a proxy LLM followed by a two-stage CNN for classification, leveraging the broad linguistic knowledge captured during pre-training to enhance performance. For text in other languages, the LLM Sentinel approach transforms the classification task into a next-token prediction task, which facilitates easier adaptation to texts in various languages, especially low-resource languages. TLE utilizes the LLM Embeddings method with a minor modification in the classification strategy for subtask B. AuDM employs data augmentation and fine-tunes the DeBERTa model specifically for subtask C. Our system wins the multilingual track and ranks second in the monolingual track. Additionally, it achieves third place in both subtask B and C.
Large Language Models (LLMs) are gearing up to surpass human creativity. The veracity of that statement needs careful consideration. In recent developments, critical questions arise regarding the authenticity of human work and the preservation of human creativity and innovative abilities. This paper investigates such issues. It addresses machine-generated text detection across several scenarios: document-level binary and multiclass classification or generator attribution, sentence-level segmentation to differentiate human-AI collaborative text, and adversarial attacks aimed at reducing the detectability of machine-generated text. We introduce BMAS English, an English-language dataset supporting binary classification of human and machine text; multiclass classification, which not only identifies machine-generated text but also attempts to determine its generator; adversarial attack settings, since attacks are a common means of evading detection; and sentence-level segmentation, for predicting the boundaries between human- and machine-generated text. We believe that this paper will address previous work in Machine-Generated Text Detection (MGTD) in a more meaningful way.
The widespread use of AI-generated text has introduced significant security concerns, driving the need for reliable detection systems. However, recent studies reveal that neural network-based detectors are vulnerable to adversarial examples. To improve the robustness of such classifiers, a number of adversarial attack strategies have been developed, particularly in the context of text sentiment classification. Most existing adversarial attack methods focus on the semantics of individual words or sentences, often neglecting the broader contextual semantics of the entire text—particularly in the case of long AI-generated text. This limitation frequently results in adversarial examples that lack fluency and coherence. In this paper, we propose a novel method called Sentence-based Adversarial attack on AI-Generated Text detectors (SAGT), which generates linguistically fluent adversarial examples by inserting model-generated sentences into the original text. To ensure contextual semantic consistency, we extract important keywords from the original text—selected based on changes in the detector's confidence score—and incorporate them into the generated sentences. Extensive experimental results demonstrate that adversarial examples crafted by SAGT can effectively evade AI-generated text detectors.
The rapid advancement of large language models (LLMs) has made AI-generated text increasingly fluent and indistinguishable from human writing. However, malicious use of AI text for misinformation or plagiarism raises the need for reliable detectors. Simple detectors often fail when the AI-generated text is paraphrased by an adversary. In this work, we propose a robust detection framework based on deep recurrent neural networks (RNNs) that is resilient to paraphrasing. We compile a large dataset of AI-generated and human-written text (e.g., a 500K Kaggle corpus) and simulate paraphrase attacks using state-of-the-art paraphrasing models. Our model employs a multi-layer Long Short-Term Memory (LSTM) network to capture sequential patterns and is trained with both original and paraphrased samples. In experiments, the proposed RNN classifier achieves high accuracy on unaltered AI text and retains strong performance on paraphrased adversarial examples, far exceeding baseline detectors, whose accuracy drops sharply. These results demonstrate that deep recurrent models, when properly trained, can detect AI-generated content even under paraphrasing.
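For orientation, a minimal bidirectional LSTM detector of the kind described above might look as follows; the vocabulary size, depth, and pooling are illustrative assumptions, and the paraphrase-augmented training data is what actually carries the robustness.

```python
import torch
import torch.nn as nn

class LstmDetector(nn.Module):
    def __init__(self, vocab=30_000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb, padding_idx=0)
        self.lstm = nn.LSTM(emb, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # logit: human (0) vs. machine (1)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.head(h.mean(dim=1)).squeeze(-1)  # mean-pool over time

model = LstmDetector()
logits = model(torch.randint(1, 30_000, (4, 64)))  # batch of 4 toy sequences
print(torch.sigmoid(logits))  # P(machine-generated) per sequence
```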
No abstract available
The increasing misuse of AI-generated texts (AIGT) has motivated the rapid development of AIGT detection methods. However, the reliability of these detectors remains fragile against adversarial evasions. Existing attack strategies often rely on white-box assumptions or demand prohibitively high computational and interaction costs, rendering them ineffective under practical black-box scenarios. In this paper, we propose Multi-stage Alignment for Style Humanization (MASH), a novel framework that evades black-box detectors based on style transfer. MASH sequentially employs style-injection supervised fine-tuning, direct preference optimization, and inference-time refinement to shape the distributions of AI-generated texts to resemble those of human-written texts. Experiments across 6 datasets and 5 detectors demonstrate the superior performance of MASH over 11 baseline evaders. Specifically, MASH achieves an average Attack Success Rate (ASR) of 92%, surpassing the strongest baselines by an average of 24%, while maintaining superior linguistic quality.
The increased quality and human-likeness of AI generated texts has resulted in a rising demand for neural text detectors, i.e. software that is able to detect whether a text was written by a human or generated by an AI. Such tools are often used in contexts where the use of AI is restricted or completely prohibited, e.g. in educational contexts. It is, therefore, important for the effectiveness of such tools that they are robust towards deliberate attempts to hide the fact that a text was generated by an AI. In this article, we investigate a broad range of adversarial attacks in English texts with six different neural text detectors, including commercial and research tools. While the results show that no detector is completely invulnerable to adversarial attacks, the latest generation of commercial detectors proved to be very robust and not significantly influenced by most of the evaluated attack strategies.
No abstract available
Recent advances in large language models (LLMs) and the intensifying popularity of ChatGPT-like applications have blurred the boundary of high-quality text generation between humans and machines. However, in addition to the anticipated revolutionary changes to our technology and society, the difficulty of distinguishing LLM-generated texts (AI-text) from human-generated texts poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations of innocent writers. While existing works show that current AI-text detectors are not robust to LLM-based paraphrasing, this paper aims to bridge this gap by proposing a new framework called RADAR, which jointly trains a robust AI-text detector via adversarial learning. RADAR is based on adversarial training of a paraphraser and a detector. The paraphraser's goal is to generate realistic content to evade AI-text detection. RADAR uses the feedback from the detector to update the paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, experimental results show that RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. We also identify the strong transferability of RADAR from instruction-tuned LLMs to other LLMs, and evaluate the improved capability of RADAR via GPT-3.5-Turbo.
Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
No abstract available
We present a novel evaluation paradigm for AI text detectors that prioritizes real-world and equitable assessment. Current approaches predominantly report conventional metrics like AUROC, overlooking that even modest false positive rates constitute a critical impediment to practical deployment of detection systems. Furthermore, real-world deployment necessitates predetermined threshold configuration, making detector stability (i.e. the maintenance of consistent performance across diverse domains and adversarial scenarios), a critical factor. These aspects have been largely ignored in previous research and benchmarks. Our benchmark, SHIELD, addresses these limitations by integrating both reliability and stability factors into a unified evaluation metric designed for practical assessment. Furthermore, we develop a post-hoc, model-agnostic humanification framework that modifies AI text to more closely resemble human authorship, incorporating a controllable hardness parameter. This hardness-aware approach effectively challenges current SOTA zero-shot detection methods in maintaining both reliability and stability. (Data and code: https://github.com/navid-aub/SHIELD-Benchmark)
Large language models (LLMs) present significant risks when used to generate non-factual content and spread disinformation at scale. Detecting such LLM-generated content is crucial, yet current detectors often struggle to generalize in open-world contexts. We introduce Learning2Rewrite, a novel framework for detecting AI-generated text with exceptional generalization to unseen domains. Our method leverages the insight that LLMs inherently modify AI-generated content less than human-written text when tasked with rewriting. By training LLMs to minimize alterations on AI-generated inputs, we amplify this disparity, yielding a more distinguishable and generalizable edit distance across diverse text distributions. Extensive experiments on data from 21 independent domains and four major LLMs (GPT-3.5, GPT-4, Gemini, and Llama-3) demonstrate that our detector outperforms state-of-the-art detection methods by up to 23.04% in AUROC for in-distribution tests, 37.26% for out-of-distribution tests, and 48.66% under adversarial attacks. Our unique training objective ensures better generalizability compared to directly training for classification, when leveraging the same amount of parameters. Our findings suggest that reinforcing LLMs' inherent rewriting tendencies offers a robust and scalable solution for detecting AI-generated text.
With the rapid growth of AI-generated content produced by large language models (LLMs) such as GPT, Gemini, and Llama, fluent and convincing synthetic text has proliferated. This has created serious challenges, such as the generation of misinformation and unreliable content. Recent efforts in the literature focus on surface-level features, do not capture semantic embeddings, and thus are not robust against adversarial attacks. In this research, the authors propose a contrastive triplet learning method using a transformer-based approach to classify human- and AI-generated text by learning the semantic separation in embedding space. The models are trained on the widely used misinformation dataset CoAID, with paraphrased AI negatives. The proposed method is highly efficient and achieves significant gains in generalization and robustness to adversarial attacks. Notably, the google/bert_uncased_L-4_H-256_A-4 model showed the best performance with 94% accuracy, a 4% improvement over existing studies in the literature with contrastive learning.
The increasing parameter counts and expansive datasets of large language models (LLMs) highlight the urgent demand for a technical solution to audit the underlying privacy risks and copyright issues associated with LLMs. Existing studies have partially addressed this need through an exploration of the pre-training data detection problem, which is an instance of a membership inference attack (MIA). This problem involves determining whether a given piece of text has been used during the pre-training phase of the target LLM. Although existing methods have designed various sophisticated MIA score functions to achieve considerable detection performance on pre-trained LLMs, achieving high-confidence detection and performing MIAs on aligned LLMs remain challenging. In this paper, we propose MIA-Tuner, a novel instruction-based MIA method, which instructs LLMs themselves to serve as a more precise pre-training data detector internally, rather than designing an external MIA score function. Furthermore, we design two instruction-based safeguards to respectively mitigate the privacy risks brought by the existing methods and MIA-Tuner. To comprehensively evaluate the most recent state-of-the-art LLMs, we collect a more up-to-date MIA benchmark dataset, named WIKIMIA-24, to replace the widely adopted benchmark WIKIMIA. We conduct extensive experiments across various aligned and unaligned LLMs over the two benchmark datasets. The results demonstrate that MIA-Tuner increases the AUC of MIAs from 0.7 to a significantly higher level of 0.9.
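For contrast with MIA-Tuner's instruction-based approach, here is a hedged sketch of the score-function style of MIA it improves on, in the spirit of Min-K% Prob: average the least-likely k% of token log-probabilities, since members of the pre-training set tend to score higher. The model and any decision threshold are illustrative, not MIA-Tuner itself.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def min_k_score(text: str, k: float = 0.2) -> float:
    ids = tok(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logp = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
    token_lp = logp.gather(1, ids[0, 1:, None]).squeeze(1)
    worst = token_lp.topk(max(1, int(k * len(token_lp))), largest=False).values
    return worst.mean().item()  # higher => more likely seen in pre-training

print(min_k_score("To be or not to be, that is the question."))
```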
With advanced neural network techniques, language models can generate content that looks genuinely created by humans. Such advanced progress benefits society in numerous ways. However, it may also bring us threats that we have not seen before. A neural text detector is a classification model that separates machine-generated text from human-written ones. Unfortunately, a pretrained neural text detector may be vulnerable to adversarial attack, aiming to fool the detector into making wrong classification decisions. Through this work, we propose μAttacking, a mutation-based general framework that can be used to evaluate the robustness of neural text detectors systematically. Our experiments demonstrate that μAttacking identifies the detector’s flaws effectively. Inspired by the insightful information revealed by μAttacking, we also propose an RR-training strategy, a straightforward but effective method to improve the robustness of neural text detectors through finetuning. Compared with the normal finetuning method, our experiments demonstrated that RR-training effectively increased the model robustness by up to 11.33% without increasing much effort when finetuning a neural text detector. We believe the μAttacking and RR-training are useful tools for developing and evaluating neural language models.
The recent large-scale emergence of LLMs has left an open space for dealing with their consequences, such as plagiarism or the spread of false information on the Internet. Coupling this with the rise of AI detector bypassing tools, reliable machine-generated text detection is in increasingly high demand. We investigate the paraphrasing attack resilience of various machine-generated text detection methods, evaluating three approaches: fine-tuned RoBERTa, Binoculars, and text feature analysis, along with their ensembles using Random Forest classifiers. We discovered that Binoculars-inclusive ensembles yield the strongest results, but they also suffer the most significant losses during attacks. In this paper, we present the dichotomy of performance versus resilience in the world of AI text detection, which complicates the current perception of reliability among state-of-the-art techniques.
No abstract available
Adversarial attacks in Natural Language Processing greatly undermine the effectiveness and safety of models, raising significant challenges for real-world deployment. Researchers have suggested using detection methods to identify and reject hostile samples while maintaining the accuracy of the original model. Nevertheless, current detection methods depend on analyzing a single characteristic, resulting in restricted resilience and flexibility. To address these constraints, we propose the Multiple Adversarial Features Detector (MAFD), an innovative detection technique that utilizes a wide range of adversarial features, such as segmented perplexity, word frequency, and probability distribution, to enhance the effectiveness of detecting adversarial examples. Our comprehensive experiments show that MAFD outperforms existing advanced methods in detection accuracy and displays significant robustness and adaptability when applied to various base detectors and attack scenarios. In addition, the design of MAFD facilitates the seamless integration of further adversarial features, further enhancing its detection capabilities.
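A rough sketch of how a multi-feature detector of this kind can be assembled: compute several per-text signals (segmented perplexity, a word-frequency proxy, and a probability-distribution statistic) and feed them to a simple classifier. The GPT-2 scoring model, the specific features, and the logistic-regression head are assumptions for illustration, not MAFD's actual pipeline.

```python
# Hedged sketch of a multi-feature adversarial-text detector (MAFD-inspired).
# All feature choices here are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sklearn.linear_model import LogisticRegression

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def features(text, seg_len=20):
    ids = tok(text, return_tensors="pt").input_ids
    logp = torch.log_softmax(lm(ids).logits[0, :-1], -1)
    tok_lp = logp.gather(-1, ids[0, 1:, None]).squeeze(-1)
    # 1) Segmented perplexity: perplexity over fixed-length windows.
    seg_ppl = [float(torch.exp(-s.mean())) for s in tok_lp.split(seg_len)]
    # 2) Word-frequency proxy: type/token ratio.
    words = text.lower().split()
    ttr = len(set(words)) / max(len(words), 1)
    # 3) Distribution statistic: mean next-token entropy.
    ent = float(-(logp.exp() * logp).sum(-1).mean())
    return [max(seg_ppl), min(seg_ppl), sum(seg_ppl) / len(seg_ppl), ttr, ent]

# With labeled clean vs. adversarial texts X, y:
# clf = LogisticRegression().fit([features(t) for t in X], y)
```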
No abstract available
AI humanizers are a new class of online software tools meant to paraphrase and rewrite AI-generated text in a way that allows them to evade AI detection software. We study 19 AI humanizer and paraphrasing tools and qualitatively assess their effects and faithfulness in preserving the meaning of the original text. We show that many existing AI detectors fail to detect humanized text. Finally, we demonstrate a robust model that can detect humanized AI text while maintaining a low false positive rate using a data-centric augmentation approach. We attack our own detector, training our own fine-tuned model optimized against our detector's predictions, and show that our detector's cross-humanizer generalization is sufficient to remain robust to this attack.
No abstract available
In response to adversarial text attacks, attack detection models have been proposed and shown to successfully identify text modified by adversaries. Attack detection models can be leveraged to provide an additional check for NLP models and give signals for human input. However, the reliability of these models has not yet been thoroughly explored. Thus, we propose and test a novel attack setting and attack, RedHerring. RedHerring aims to make attack detection models unreliable by modifying a text to cause the detection model to predict an attack while keeping the classifier correct. This creates a tension between the classifier and detector: if a human sees that the detector is giving an "incorrect" prediction but the classifier a correct one, the human will see the detector as unreliable. We test this novel threat model on 4 datasets against 3 detectors defending 4 classifiers. We find that RedHerring is able to drop detection accuracy by 20 to 71 points while maintaining (or improving) classifier accuracy. As an initial defense, we propose a simple confidence check that requires no retraining of the classifier or detector and greatly increases detection accuracy. This novel threat model offers new insights into how adversaries may target detection models.
Machine-generated Text (MGT) detection is crucial for regulating and attributing online texts. While the existing MGT detectors achieve strong performance, they remain vulnerable to simple perturbations and adversarial attacks. To build an effective defense against malicious perturbations, we view MGT detection from a threat modeling perspective, that is, analyzing the model's vulnerability from an adversary's point of view and exploring effective mitigations. To this end, we introduce an adversarial framework for training a robust MGT detector, named GREedy Adversary PromoTed DefendER (GREATER). The GREATER consists of two key components: an adversary GREATER-A and a detector GREATER-D. The GREATER-D learns to defend against the adversarial attack from GREATER-A and generalizes the defense to other attacks. GREATER-A identifies and perturbs the critical tokens in embedding space, along with greedy search and pruning to generate stealthy and disruptive adversarial examples. Besides, we update the GREATER-A and GREATER-D synchronously, encouraging the GREATER-D to generalize its defense to different attacks and varying attack intensities. Our experimental results across 10 text perturbation strategies and 6 adversarial attacks show that our GREATER-D reduces the Attack Success Rate (ASR) by 0.67% compared with SOTA defense methods while our GREATER-A is demonstrated to be more effective and efficient than SOTA attack approaches. Codes and dataset are available in https://github.com/Liyuuuu111/GREATER.
AI-generated text detectors have become essential tools for maintaining content authenticity, yet their robustness against evasion attacks remains questionable. We present PDFuzz, a novel attack that exploits the discrepancy between visual text layout and extraction order in PDF documents. Our method preserves exact textual content while manipulating character positioning to scramble extraction sequences. We evaluate this approach against the ArguGPT detector using a dataset of human and AI-generated text. Our results demonstrate complete evasion: detector performance drops from (93.6 ± 1.4)% accuracy and 0.938 ± 0.014 F1 score to random-level performance ((50.4 ± 3.2)% accuracy, 0.0 F1 score) while maintaining perfect visual fidelity. Our work reveals a vulnerability in current detection systems that is inherent to PDF document structures and underscores the need for implementing sturdy safeguards against such attacks. We make our code publicly available at https://github.com/ACMCMC/PDFuzz.
In this paper, we introduce GradEscape, the first gradient-based evader designed to attack AI-generated text (AIGT) detectors. GradEscape overcomes the non-differentiability problem caused by the discrete nature of text by introducing a novel approach to construct weighted embeddings for the detector input. It then updates the evader model parameters using feedback from victim detectors, achieving high attack success with minimal text modification. To address the issue of tokenizer mismatch between the evader and the detector, we introduce a warm-started evader method, enabling GradEscape to adapt to detectors across any language model architecture. Moreover, we employ novel tokenizer inference and model extraction techniques, facilitating effective evasion even in query-only access. We evaluate GradEscape on four datasets and three widely-used language models, benchmarking it against four state-of-the-art AIGT evaders. Experimental results demonstrate that GradEscape outperforms existing evaders in various scenarios, including with an 11B paraphrase model, while utilizing only 139M parameters. We have successfully applied GradEscape to two real-world commercial AIGT detectors. Our analysis reveals that the primary vulnerability stems from disparity in text expression styles within the training data. We also propose a potential defense strategy to mitigate the threat of AIGT evaders. We open-source our GradEscape for developing more robust AIGT detectors.
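The weighted-embedding idea, i.e., differentiating through discrete tokens, can be sketched in a few lines: mix the embedding rows by a softmax over vocabulary logits so that a detector's loss can backpropagate into the evader. The shapes and the stand-in loss below are assumptions; GradEscape's actual architecture is more involved.

```python
# Minimal sketch of weighted embeddings for gradient-based evasion.
# Vocabulary size, sequence length, and the stand-in loss are assumptions.
import torch

vocab, dim, seq = 50257, 768, 12
emb = torch.nn.Embedding(vocab, dim)
evader_logits = torch.randn(1, seq, vocab, requires_grad=True)

weights = torch.softmax(evader_logits, dim=-1)   # (1, seq, vocab)
soft_inputs = weights @ emb.weight               # differentiable (1, seq, dim)

loss = soft_inputs.pow(2).mean()  # stand-in for a victim detector's loss
loss.backward()                   # gradients now reach the evader's logits
print(evader_logits.grad.shape)   # torch.Size([1, 12, 50257])
```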
While AI-generated text (AIGT) detectors achieve over 90% accuracy on direct LLM outputs, they fail catastrophically against iteratively-paraphrased content. We investigate why iteratively-paraphrased text, itself AI-generated, evades detection systems designed for AIGT identification. Through intrinsic mechanism analysis, we reveal that iterative paraphrasing creates an intermediate laundering region characterized by semantic displacement with preserved generation patterns, which brings up two attack categories: paraphrasing human-authored text (authorship obfuscation) and paraphrasing LLM-generated text (plagiarism evasion). To address these vulnerabilities, we introduce PADBen, the first benchmark systematically evaluating detector robustness against both paraphrase attack scenarios. PADBen comprises a five-type text taxonomy capturing the full trajectory from original content to deeply laundered text, and five progressive detection tasks across sentence-pair and single-sentence challenges. We evaluate 11 state-of-the-art detectors, revealing critical asymmetry: detectors successfully identify the plagiarism evasion problem but fail for the case of authorship obfuscation. Our findings demonstrate that current detection approaches cannot effectively handle the intermediate laundering region, necessitating fundamental advances in detection architectures beyond existing semantic and stylistic discrimination methods. For detailed code implementation, please see https://github.com/JonathanZha47/PadBen-Paraphrase-Attack-Benchmark.
The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.
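The decoding-time contrast described here can be sketched as a simple logit subtraction: score each candidate next token by its human-like log-probability minus a scaled machine-like log-probability. The prompts, the weight alpha, greedy decoding, and the use of GPT-2 are all illustrative assumptions, not CoPA's actual instructions or models.

```python
# Hedged sketch of contrastive decoding in the spirit of CoPA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

HUMAN = "Rewrite casually, like a person chatting with a friend:\n"
MACHINE = "Rewrite formally, in a neutral assistant style:\n"

@torch.no_grad()
def copa_step(text, alpha=0.5):
    def next_logp(prefix):
        ids = tok(prefix + text, return_tensors="pt").input_ids
        return torch.log_softmax(lm(ids).logits[0, -1], -1)
    # Subtract the machine-like distribution from the human-like one.
    scores = next_logp(HUMAN) - alpha * next_logp(MACHINE)
    return tok.decode(int(scores.argmax()))

text = "The city council"
for _ in range(20):
    text += copa_step(text)
print(text)
```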
Large language models (LLMs) have exhibited remarkable fluency across various tasks. However, their unethical applications, such as disseminating disinformation, have become a growing concern. Although recent works have proposed a number of LLM detection methods, their robustness and reliability remain unclear. In this paper, we present RAFT: a grammar-error-free black-box attack against existing LLM detectors. In contrast to previous attacks on language models, our method exploits the transferability of LLM embeddings at the word level while preserving the original text quality. We leverage an auxiliary embedding to greedily select candidate words to perturb against the target detector. Experiments reveal that our attack effectively compromises all detectors in the study across various domains by up to 99% and is transferable across source models. Manual human evaluation studies show our attacks are realistic and indistinguishable from original human-written text. We also show that examples generated by RAFT can be used to train adversarially robust detectors. Our work shows that current LLM detectors are not adversarially robust, underscoring the urgent need for more resilient detection mechanisms.
Large Language Models (LLMs) perform impressively well in various applications. However, the potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concerns about their responsible use. Consequently, the reliable detection of AI-generated text has become a critical area of research. AI text detectors have been shown to be effective under their specific settings. In this paper, we stress-test the robustness of these AI text detectors in the presence of an attacker. We introduce a recursive paraphrasing attack to stress-test a wide range of detection schemes, including those using watermarking as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments conducted on passages, each approximately 300 tokens long, reveal the varying sensitivities of these detectors to our attacks. Our findings indicate that while our recursive paraphrasing method can significantly reduce detection rates, it only slightly degrades text quality in many cases, highlighting potential vulnerabilities in current detection systems in the presence of an attacker. Additionally, we investigate the susceptibility of watermarked LLMs to spoofing attacks aimed at misclassifying human-written text as AI-generated. We demonstrate that an attacker can infer hidden AI text signatures without white-box access to the detection method, potentially leading to reputational risks for LLM developers. Finally, we provide a theoretical framework connecting the AUROC of the best possible detector to the Total Variation distance between human and AI text distributions. This analysis offers insights into the fundamental challenges of reliable detection as language models continue to advance. Our code is publicly available at https://github.com/vinusankars/Reliability-of-AI-text-detectors.
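A minimal recursive-paraphrasing loop looks like the following; the off-the-shelf Hugging Face paraphraser named here is a stand-in assumption, and the paper's actual paraphrase model, prompts, and number of rounds may differ.

```python
# Sketch of recursive paraphrasing: feed each round's output back in.
from transformers import pipeline

# Stand-in paraphraser (assumption); any text2text paraphrase model works.
para = pipeline("text2text-generation",
                model="humarin/chatgpt_paraphraser_on_T5_base")

def recursive_paraphrase(text, rounds=3):
    for _ in range(rounds):
        text = para("paraphrase: " + text,
                    max_length=256, do_sample=True)[0]["generated_text"]
    return text

print(recursive_paraphrase("Large language models can draft fluent essays."))
```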
Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform targeted attacks. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show SIRA achieves nearly 100% attack success rates on seven recent watermarking methods at a cost of only 0.88 USD per million tokens. Our approach does not require any access to the watermark algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.
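Per-token self-information, the signal SIRA reportedly exploits, is straightforward to compute with any scoring LM. GPT-2 and the example sentence below are assumptions; the paper's rewriting pipeline is not reproduced here.

```python
# Sketch: self-information I(t) = -log p(t | context) per token.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def self_information(text):
    ids = tok(text, return_tensors="pt").input_ids
    logp = torch.log_softmax(lm(ids).logits[0, :-1], -1)
    tok_lp = logp.gather(-1, ids[0, 1:, None]).squeeze(-1)
    tokens = tok.convert_ids_to_tokens(ids[0, 1:].tolist())
    return list(zip(tokens, (-tok_lp).tolist()))

# High self-information positions are where high-entropy watermark
# biases tend to sit, making them candidates for targeted rewriting.
for token, info in self_information("The quick brown fox jumps over the lazy dog."):
    print(f"{token!r}\t{info:.2f}")
```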
AI-generated text (AIGT) detection evasion aims to reduce the detection probability of AIGT, helping to identify weaknesses in detectors and enhance their effectiveness and reliability in practical applications. Although existing evasion methods perform well, they suffer from high computational costs and text quality degradation. To address these challenges, we propose the Self-Disguise Attack (SDA), a novel approach that enables Large Language Models (LLMs) to actively disguise their output, reducing the likelihood of detection by classifiers. The SDA comprises two main components: an adversarial feature extractor and a retrieval-based context-examples optimizer. The former generates disguise features that enable LLMs to understand how to produce more human-like text. The latter retrieves the most relevant examples from an external knowledge base as in-context examples, further enhancing the self-disguise ability of LLMs and mitigating the impact of the disguise process on the diversity of the generated text. The SDA directly employs prompts containing disguise features and optimized context examples to guide the LLM in generating detection-resistant text, thereby reducing resource consumption. Experimental results demonstrate that the SDA effectively reduces the average detection accuracy of various AIGT detectors across texts generated by three different LLMs, while maintaining the quality of AIGT.
Although membership inference attacks (MIAs) and machine-generated text detection target different goals, their methods often exploit similar signals based on a language model's probability distribution, and the two tasks have been studied independently. This can result in conclusions that overlook stronger methods and valuable insights from the other task. In this work, we theoretically and empirically demonstrate the transferability, i.e., how well a method originally developed for one task performs on the other, between MIAs and machine text detection. We prove that the metric achieving asymptotically optimal performance is identical for both tasks. We unify existing methods under this optimal metric and hypothesize that the accuracy with which a method approximates this metric is directly correlated with its transferability. Our large-scale empirical experiments demonstrate very strong rank correlation (ρ ≈ 0.7) in cross-task performance. Notably, we also find that a machine text detector achieves the strongest performance among evaluated methods on both tasks, demonstrating the practical impact of transferability. To facilitate cross-task development and fair evaluation, we introduce MINT, a unified evaluation suite for MIAs and machine-generated text detection, implementing 15 recent methods from both tasks.
Large language models (LLMs) are able to generate high-quality texts in multiple languages. Humans often cannot recognize such texts as generated, which creates potential for misuse of LLMs (e.g., plagiarism, spam, disinformation spreading). Automated detection can assist humans in flagging machine-generated texts; however, its robustness to out-of-distribution data is still challenging. This notebook describes our mdok approach to robust detection, based on fine-tuning smaller LLMs for text classification. Applied to both subtasks of Voight-Kampff Generative AI Detection 2025, it achieved remarkable performance (1st rank) in both the binary detection and the multiclass classification of various cases of human-AI collaboration.
With the rapid advancements in pre-trained large language models like ChatGPT, the surge of AI-generated text, particularly in Chinese, has presented significant challenges to existing detection systems due to its increasing realism and complexity. To address this, we introduce MLSDET: a groundbreaking Multi-LLM Statistical Deep Ensemble framework designed for high-precision detection of AI-generated Chinese text. MLSDET uniquely integrates a Mixture of Experts (MoE) architecture with a novel cross-entropy metric, setting a new benchmark for robustness and generalization. By employing a diverse ensemble of large language models (LLMs), including Qwen, Wenzhong-GPT2, and LLaMA, our approach extracts intricate features such as log-rank, entropy, log-likelihood, and the newly introduced LLMs-crossEntropy, accurately capturing both model consensus and the statistical distribution differences between AI-generated and human-authored text. Experimental results on the HC3-Chinese dataset show that MLSDET surpasses traditional zero-shot methods like CLTR by 15.94% in F1 score and competes effectively with existing methods, offering a scalable solution for real-world applications.
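The statistical features named in this abstract (log-likelihood, log-rank, entropy) can each be read off one forward pass of a scoring LM. The sketch below uses a single GPT-2 model for brevity; MLSDET ensembles several LLMs and adds a cross-entropy feature, which this minimal version omits.

```python
# Hedged sketch of per-text statistical features for MGT detection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def stats(text):
    ids = tok(text, return_tensors="pt").input_ids
    logp = torch.log_softmax(lm(ids).logits[0, :-1], -1)
    targets = ids[0, 1:]
    tok_lp = logp.gather(-1, targets[:, None]).squeeze(-1)
    # Rank of each observed token among all vocabulary items (1 = most likely).
    ranks = (logp > tok_lp[:, None]).sum(-1) + 1
    entropy = -(logp.exp() * logp).sum(-1)
    return {
        "log_likelihood": float(tok_lp.mean()),
        "log_rank": float(torch.log(ranks.float()).mean()),
        "entropy": float(entropy.mean()),
    }

print(stats("Machine-generated text tends to have a low average log-rank."))
```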
As LLM-generated text becomes increasingly prevalent on the internet, often containing hallucinations or biases, detecting such content has emerged as a critical area of research. Recent methods have demonstrated impressive performance in detecting text generated entirely by LLMs. However, in real-world scenarios, users often introduce perturbations to the LLM-generated text, and the robustness of existing detection methods against these perturbations has not been sufficiently explored. This paper empirically investigates this challenge and finds that even minor perturbations can severely degrade the performance of current detection methods. To address this issue, we find that the syntactic tree is minimally affected by disturbances and exhibits distinct differences between human-written and LLM-generated text. Therefore, we propose a detection method based on syntactic trees, which can capture features invariant to perturbations. It demonstrates significantly improved robustness against perturbation on the HC3 and GPT-3.5-mixed datasets and also incurs the lowest time cost among the compared methods. We provide the code and data at https://github.com/thulx18/PRDetect.
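A hedged sketch of perturbation-robust syntactic features: parse each text and summarize its dependency tree. The spaCy pipeline and the two features below are illustrative assumptions; PRDetect's tree encoding is richer.

```python
# Sketch of syntactic-tree features.
# Assumes the model is installed: python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def tree_features(text):
    doc = nlp(text)
    depths = []
    for token in doc:
        depth, node = 0, token
        while node.head is not node:      # walk up to the sentence root
            node, depth = node.head, depth + 1
        depths.append(depth)
    return {
        "mean_depth": sum(depths) / max(len(depths), 1),
        "dep_counts": Counter(token.dep_ for token in doc),
    }

# Minor character edits rarely change these tree statistics much.
print(tree_features("Minor edits rarely change the dependency structure."))
```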
Detecting LLM-generated text in specialized and high-stakes domains like medicine and law is crucial for combating misinformation and ensuring authenticity. However, current zero-shot detectors, while effective on general text, often fail when applied to specialized content due to domain shift. We provide a theoretical analysis showing this failure is fundamentally linked to the KL divergence between human, detector, and source text distributions. To address this, we propose DivScore, a zero-shot detection framework using normalized entropy-based scoring and domain knowledge distillation to robustly identify LLM-generated text in specialized domains. We also release a domain-specific benchmark for LLM-generated text detection in the medical and legal domains. Experiments on our benchmark show that DivScore consistently outperforms state-of-the-art detectors, with 14.4% higher AUROC and 64.0% higher recall (0.1% false positive rate threshold). In adversarial settings, DivScore demonstrates superior robustness compared to other baselines, achieving on average a 22.8% advantage in AUROC and 29.5% in recall. Code and data are publicly available.
This paper proposes MAF-Detect, a zero-shot detection framework based on multi-scale adaptive fusion, designed to effectively identify text generated by large language models. This framework significantly improves the ability to discriminate machine-generated text without relying on training data through a multi-granular perturbation strategy, multidimensional semantic feature fusion, and a dynamic adaptive threshold mechanism. Experiments on the cross-domain Chinese and English dataset HC3 demonstrate that MAF-Detect outperforms existing zero-shot detection methods in both recognition accuracy and robustness, particularly in short text recognition tasks, validating its effectiveness and versatility in practical applications.
The increasing use of Large Language Models (LLMs) for generating highly coherent and contextually relevant text introduces new risks, including misuse for unethical purposes such as disinformation or academic dishonesty. To address these challenges, we propose FreqMark, a novel watermarking technique that embeds detectable frequency-based watermarks in LLM-generated text during the token sampling process. The method leverages periodic signals to guide token selection, creating a watermark that can be detected with Short-Time Fourier Transform (STFT) analysis. This approach enables accurate identification of LLM-generated content, even in mixed-text scenarios with both human-authored and LLM-generated segments. Our experiments demonstrate the robustness and precision of FreqMark, showing strong detection capabilities against various attack scenarios such as paraphrasing and token substitution. Results show that FreqMark achieves an AUC improvement of up to 0.98, significantly outperforming existing detection methods.
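The detection side of a frequency-based watermark can be illustrated with a synthetic signal: treat a per-token score sequence as a time series and look for energy at a known embedding frequency via the STFT. The signal construction, the target frequency, and the energy statistic below are assumptions for demonstration, not FreqMark's actual scheme.

```python
# Illustrative sketch of STFT-based watermark detection (FreqMark-inspired).
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
n_tokens, f_mark = 512, 0.1  # f_mark in cycles per token (assumption)

# Simulated per-token scores: watermarked text carries a periodic bias.
clean = rng.standard_normal(n_tokens)
marked = clean + 0.8 * np.sin(2 * np.pi * f_mark * np.arange(n_tokens))

def band_energy(signal, freq, nperseg=128):
    f, _, Z = stft(signal, fs=1.0, nperseg=nperseg)
    band = np.argmin(np.abs(f - freq))       # nearest frequency bin
    return float(np.abs(Z[band]).mean())     # mean magnitude over time

print("clean :", band_energy(clean, f_mark))
print("marked:", band_energy(marked, f_mark))  # noticeably higher
```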
Detecting text generated by large language models (LLMs) is a growing challenge as these models produce outputs nearly indistinguishable from human writing. This study explores multiple detection approaches, including a Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM) networks, a Transformer block, and a fine-tuned distilled BERT model. Leveraging BERT's contextual understanding, we train the model on diverse datasets containing authentic and synthetic texts, focusing on features like sentence structure, token distribution, and semantic coherence. The fine-tuned BERT outperforms baseline models, achieving high accuracy and robustness across domains, with superior AUC scores and efficient computation times. By incorporating domain-specific training and adversarial techniques, the model adapts to sophisticated LLM outputs, improving detection precision. These findings underscore the efficacy of pretrained transformer models for ensuring authenticity in digital communication, with potential applications in mitigating misinformation, safeguarding academic integrity, and promoting ethical AI usage.
Large language models (LLMs) have grown more powerful in language generation, producing fluent text and even imitating personal style. Yet, this ability also heightens the risk of identity impersonation. To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection. In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations. Our experimental results demonstrate large performance gaps across detectors in personalized settings: some state-of-the-art models suffer significant drops. We attribute this limitation to the "feature-inversion trap", where features that are discriminative in general domains become inverted and misleading when applied to personalized text. Based on this finding, we propose \method, a simple and reliable way to predict detector performance changes in personalized settings. \method identifies latent directions corresponding to inverted features and constructs probe datasets that differ primarily along these features to evaluate detector dependence. Our experiments show that \method can accurately predict both the direction and the magnitude of post-transfer changes, showing 85% correlation with the actual performance gaps. We hope that this work will encourage further research on personalized text detection.
With the rapid development and widespread application of Large Language Models (LLMs), the use of Machine-Generated Text (MGT) has become increasingly common, bringing with it potential risks, especially in terms of quality and integrity in fields like news, education, and science. Current research mainly focuses on purely MGT detection without adequately addressing mixed scenarios, including AI-revised Human-Written Text (HWT) or human-revised MGT. To tackle this challenge, we define mixtext, a form of mixed text involving both AI and human-generated content. Then, we introduce MixSet, the first dataset dedicated to studying these mixtext scenarios. Leveraging MixSet, we executed comprehensive experiments to assess the efficacy of prevalent MGT detectors in handling mixtext situations, evaluating their performance in terms of effectiveness, robustness, and generalization. Our findings reveal that existing detectors struggle to identify mixtext, particularly in dealing with subtle modifications and style adaptability. This research underscores the urgent need for more fine-grained detectors tailored for mixtext, offering valuable insights for future research. Code and Models are available at https://github.com/Dongping-Chen/MixSet.
Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to the features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representations of LGT and HWT, and extract the distinct activation features that better identify LGT. We can classify a text by calculating the projection score of its representations along this feature direction and comparing it with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with an average 94.92% AUROC in both in-distribution and OOD scenarios, while also demonstrating robust resilience to varying text sizes and mainstream attacks.
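The projection-score step reduces to a dot product once the feature direction is fixed. In the sketch below the direction is the difference of class means over surrogate-model hidden states, an assumption for illustration; the paper's feature extraction may differ.

```python
# Sketch of a projection-score detector (RepreGuard-inspired).
import numpy as np

def class_direction(lgt_reps, hwt_reps):
    # lgt_reps, hwt_reps: (n, d) hidden states from a surrogate LM for
    # LLM-generated (LGT) and human-written (HWT) calibration texts.
    d = lgt_reps.mean(axis=0) - hwt_reps.mean(axis=0)
    return d / np.linalg.norm(d)

def projection_score(rep, direction):
    return float(rep @ direction)

# Toy demo with synthetic representations; calibrate a threshold midway
# between class-mean projections, then flag scores above it as generated.
rng = np.random.default_rng(0)
lgt = rng.normal(0.5, 1.0, (100, 64))
hwt = rng.normal(-0.5, 1.0, (100, 64))
d = class_direction(lgt, hwt)
print(projection_score(lgt[0], d), projection_score(hwt[0], d))
```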
Since the proliferation of LLMs, there have been concerns about their misuse for harmful content creation and spreading. Recent studies justify such fears, providing evidence of LLM vulnerabilities and the high potential for their misuse. Humans are no longer able to distinguish between high-quality machine-generated and authentic human-written texts. It is therefore crucial to develop automated means to accurately detect machine-generated content, which would enable identifying such content in the online information space and provide additional information about its credibility. This work addresses the problem by proposing a robust fine-tuning process of LLMs for the detection task, making the detectors more robust against obfuscation and more generalizable to out-of-distribution data.
The growing abilities of large language models (LLMs) have introduced new challenges in reliably distinguishing LLM-generated texts from human-written content, particularly when paraphrasing techniques are used to evade detection. This paper proposes GravText, a detection framework designed to address this robustness gap by targeting paraphrase-invariant semantic features. GravText integrates triplet contrastive learning with a dynamic anchor switching strategy to better model inter-class separability under paraphrasing. Additionally, it introduces a physics-inspired gravitational factor based on cross-attention mechanisms, which enhances the discriminative power of learned embeddings by simulating semantic attraction and repulsion. Experimental results on the HC3 Chinese dataset demonstrate GravText’s superior robustness against paraphrasing. Crucially, further cross-lingual evaluation on an English essay dataset confirms the framework’s strong generalization ability and language-agnostic properties. These findings point to a promising direction for building more reliable AI-text detectors resilient to paraphrasing-based evasion.
Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of misuse of LLMs and demands the development of detectors to identify LLM-generated texts. However, existing detectors lack robustness against attacks: they degrade detection accuracy by simply paraphrasing LLM-generated texts. Furthermore, a malicious user might attempt to deliberately evade the detectors based on detection results, but this has not been assumed in previous studies. In this paper, we propose OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output. In this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Experiments in the domain of student essays show that the proposed detector improves the detection performance on the attacker-generated texts by up to +41.3 points F1-score. Furthermore, the proposed detector shows a state-of-the-art detection performance: up to 96.9 points F1-score, beating existing detectors on non-attacked texts. Finally, the proposed attacker drastically degrades the performance of detectors by up to -57.0 points F1-score, massively outperforming the baseline paraphrasing method for evading detection.
No abstract available
No abstract available
Watermarking approaches are proposed to identify whether circulated text is human or large language model (LLM) generated. The state-of-the-art watermarking strategy of Kirchenbauer et al. (2023a) biases the LLM to generate specific ("green") tokens. However, determining the robustness of this watermarking method is an open problem. Existing attack methods fail to evade detection for longer text segments. We overcome this limitation and propose Self Color Testing-based Substitution (SCTS), the first "color-aware" attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output token frequencies. It uses this information to determine token colors and substitutes green tokens with non-green ones. In our experiments, SCTS successfully evades watermark detection using fewer edits than related work. Additionally, we show both theoretically and empirically that SCTS can remove the watermark for arbitrarily long watermarked text.
We study the problem of watermarking large language models (LLMs) generated text -- one of the most promising approaches for addressing the safety challenges of LLM usage. In this paper, we propose a rigorous theoretical framework to quantify the effectiveness and robustness of LLM watermarks. We propose a robust and high-quality watermark method, Unigram-Watermark, by extending an existing approach with a simplified fixed grouping strategy. We prove that our watermark method enjoys guaranteed generation quality, correctness in watermark detection, and is robust against text editing and paraphrasing. Experiments on three varying LLMs and two datasets verify that our Unigram-Watermark achieves superior detection accuracy and comparable generation quality in perplexity, thus promoting the responsible use of LLMs. Code is available at https://github.com/XuandongZhao/Unigram-Watermark.
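Detection for a fixed ("unigram") green list reduces to a one-proportion z-test over green-token counts. The hash construction and green fraction below are illustrative assumptions rather than the paper's exact scheme.

```python
# Sketch of fixed-green-list watermark detection via a z-score.
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary marked green

def is_green(token_id: int, key: bytes = b"wm-key") -> bool:
    # Fixed (unigram) grouping: each token id hashes to green/red once.
    h = hashlib.sha256(key + token_id.to_bytes(4, "big")).digest()
    return h[0] / 255.0 < GAMMA

def z_score(token_ids):
    green = sum(is_green(t) for t in token_ids)
    n = len(token_ids)
    # One-proportion z-test against the null hypothesis of no watermark.
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

# A large positive z (e.g., > 4) is strong evidence of the green-list bias.
print(z_score([17, 942, 3, 88, 120, 7, 55, 901, 23, 4]))
```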
Large Language Model (LLM) watermarking embeds detectable signals into generated text for copyright protection, misuse prevention, and content detection. While prior studies evaluate robustness using watermark removal attacks, these methods are often suboptimal, creating the misconception that effective removal requires large perturbations or powerful adversaries. To bridge the gap, we first formalize the system model for LLM watermark, and characterize two realistic threat models constrained on limited access to the watermark detector. We then analyze how different types of perturbation vary in their attack range, i.e., the number of tokens they can affect with a single edit. We observe that character-level perturbations (e.g., typos, swaps, deletions, homoglyphs) can influence multiple tokens simultaneously by disrupting the tokenization process. We demonstrate that character-level perturbations are significantly more effective for watermark removal under the most restrictive threat model. We further propose guided removal attacks based on the Genetic Algorithm (GA) that uses a reference detector for optimization. Under a practical threat model with limited black-box queries to the watermark detector, our method demonstrates strong removal performance. Experiments confirm the superiority of character-level perturbations and the effectiveness of the GA in removing watermarks under realistic constraints. Additionally, we argue there is an adversarial dilemma when considering potential defenses: any fixed defense can be bypassed by a suitable perturbation strategy. Motivated by this principle, we propose an adaptive compound character-level attack. Experimental results show that this approach can effectively defeat the defenses. Our findings highlight significant vulnerabilities in existing LLM watermark schemes and underline the urgency for the development of new robust mechanisms.
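The wide attack range of character-level edits is easy to observe directly: a single homoglyph substitution re-segments several tokens at once, which is what degrades token-level watermark statistics. The tokenizer and example sentence are illustrative assumptions.

```python
# Demo: one homoglyph disrupts tokenization across multiple tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

original = "The watermarking algorithm embeds detectable signals."
# Swap the Latin 'i' for the visually identical Cyrillic 'і' (U+0456).
perturbed = original.replace("watermarking", "watermarkіng")

print(tok.tokenize(original))
print(tok.tokenize(perturbed))  # the homoglyph splits the word into new tokens
```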
As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthening it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean), we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.
The advent of large language models (LLMs) has revolutionized the field of text generation, producing outputs that closely mimic human-like writing. Although academic and industrial institutions have developed detectors to prevent the malicious usage of LLM-generated texts, other research has cast doubt on the robustness of these systems. To stress-test these detectors, we introduce a proxy-attack strategy that effortlessly compromises LLMs, causing them to produce outputs that align with human-written text and mislead detection systems. Our method attacks the source model by leveraging a reinforcement learning (RL) fine-tuned humanized small language model (SLM) in the decoding phase. Through an in-depth analysis, we demonstrate that our attack strategy is capable of generating responses that are indistinguishable to detectors, preventing them from differentiating between machine-generated and human-written text. We conduct systematic evaluations on extensive datasets using proxy-attacked open-source models, including Llama2-13B, Llama3-70B, and Mixtral-8x7B in both white- and black-box settings. Our findings show that the proxy-attack strategy effectively deceives the leading detectors, resulting in an average AUROC drop of 70.4% across multiple datasets, with a maximum drop of 90.3% on a single dataset. Furthermore, in cross-discipline scenarios, our strategy also bypasses these detectors, leading to a significant relative decrease of up to 90.9%, while in the cross-language scenario the drop reaches 91.3%. Despite our proxy-attack strategy successfully bypassing the detectors with such significant relative drops, we find that the generation quality of the attacked models remains preserved, even within a modest utility budget, when compared to the text produced by the original, unattacked source model.
No abstract available
High-quality text generation capability of recent Large Language Models (LLMs) causes concerns about their misuse (e.g., in massive generation/spread of disinformation). Machine-generated text (MGT) detection is important to cope with such threats. However, it is susceptible to authorship obfuscation (AO) methods, such as paraphrasing, which can cause MGTs to evade detection. So far, this was evaluated only in monolingual settings. Thus, the susceptibility of recently proposed multilingual detectors is still unknown. We fill this gap by comprehensively benchmarking the performance of 10 well-known AO methods, attacking 37 MGT detection methods against MGTs in 11 languages (i.e., 10 × 37 × 11 = 4,070 combinations). We also evaluate the effect of data augmentation on adversarial robustness using obfuscated texts. The results indicate that all tested AO methods can cause evasion of automated detection in all tested languages, where homoglyph attacks are especially successful. However, some of the AO methods severely damaged the text, making it no longer readable or easily recognizable by humans (e.g., changed language, weird characters).
Recent watermarked generation algorithms inject detectable signatures during language generation to facilitate post-hoc detection. While token-level watermarks are vulnerable to paraphrase attacks, SemStamp (Hou et al., 2023) applies the watermark to the semantic representation of sentences and demonstrates promising robustness. SemStamp employs locality-sensitive hashing (LSH) to partition the semantic space with arbitrary hyperplanes, which results in a suboptimal tradeoff between robustness and speed. We propose k-SemStamp, a simple yet effective enhancement of SemStamp, utilizing k-means clustering as an alternative to LSH to partition the embedding space with awareness of its inherent semantic structure. Experimental results indicate that k-SemStamp saliently improves robustness and sampling efficiency while preserving generation quality, advancing a more effective tool for machine-generated text detection.
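The k-means partitioning can be sketched directly: cluster sentence embeddings, key a subset of clusters as valid watermark regions, and accept or resample sentences by cluster membership. The random embeddings, k, and the valid-region rule below are stand-in assumptions, not k-SemStamp's actual parameters.

```python
# Sketch of k-means semantic-space partitioning in the spirit of k-SemStamp.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 384))   # stand-in sentence embeddings
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(emb)

# A keyed subset of clusters acts as the "valid" watermark region.
valid = set(rng.choice(8, size=4, replace=False).tolist())

def sentence_ok(sent_emb):
    # Generation: resample a sentence until it lands in a valid cluster.
    # Detection: count the fraction of sentences falling in valid clusters.
    return int(km.predict(sent_emb[None])[0]) in valid

print(sentence_ok(emb[0]))
```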
Existing machine-generated text (MGT) detection methods implicitly assume labels as the "golden standard". However, we reveal boundary ambiguity in MGT detection, implying that traditional training paradigms are inexact. Moreover, limitations of human cognition and the superintelligence of detectors make inexact learning widespread and inevitable. To this end, we propose an easy-to-hard enhancement framework to provide reliable supervision under such inexact conditions. Distinct from knowledge distillation, our framework employs an easy supervisor targeting relatively simple longer-text detection tasks (despite weaker capabilities) to enhance the more challenging target detector. Firstly, the longer texts targeted by supervisors theoretically alleviate the impact of inexact labels, laying the foundation for reliable supervision. Secondly, by structurally incorporating the detector into the supervisor, we theoretically model the supervisor as a lower performance bound for the detector. Thus, optimizing the supervisor indirectly optimizes the detector, ultimately approximating the underlying "golden" labels. Extensive experiments across diverse practical scenarios, including cross-LLM, cross-domain, mixed text, and paraphrase attacks, demonstrate the framework's significant detection effectiveness. The code is available at: https://github.com/tmlr-group/Easy2Hard.
Detecting machine-generated text is a critical task in the era of large language models. In this paper, we present our systems for SemEval-2024 Task 8, which focuses on multi-class classification to discern between human-written and machine-generated texts by five state-of-the-art large language models. We propose three different systems: unsupervised text similarity, triplet-loss-trained text similarity, and text classification. We show that the triplet-loss-trained text similarity system outperforms the other systems, achieving 80% accuracy on the test set and surpassing the baseline model for this subtask. Additionally, our text classification system, which takes into account sentence paraphrases generated by the candidate models, also outperforms the unsupervised text similarity system, achieving 74% accuracy.
No abstract available
As Generative AI models become increasingly adept at producing human-like text, AI content detectors struggle to distinguish machine-generated text from human writing. This paper explores an innovative approach to evade these detection systems by fusing text generated from multiple large language models (LLMs) at the token level. By strategically merging the outputs of different models, the generated text incorporates a diverse range of linguistic styles, syntactic structures, and semantic patterns, effectively circumventing the detection signals typically used by content classifiers. The evaluation demonstrates that this multi-model fusion technique significantly reduces the accuracy of existing AI detection systems, highlighting vulnerabilities in their current architecture. In response, we introduce an enhanced detection framework that integrates advanced natural language processing (NLP) techniques to improve model robustness against sophisticated AI text manipulations. The results underscore the evolving cat-and-mouse game between AI-generated text and detection models, offering new insights into improving both generative AI and detection capabilities.
The burgeoning progress in the field of Large Language Models (LLMs) heralds significant benefits due to their unparalleled capacities. However, it is critical to acknowledge the potential misuse of these models, which could give rise to a spectrum of social and ethical dilemmas. Despite numerous preceding efforts centered around distinguishing synthetic text, most existing detection systems fail to identify data synthesized by the latest LLMs, such as ChatGPT and GPT-4. In response to this challenge, we introduce an unpretentious yet potent detection approach proficient in identifying synthetic text across a wide array of fields. Moreover, our detector demonstrates outstanding performance uniformly across various model architectures and decoding strategies. It also possesses the capability to identify text generated utilizing a potent detection-evasion technique. Our comprehensive research underlines our commitment to boosting the robustness and efficiency of machine-generated text detection mechanisms, particularly in the context of swiftly progressing and increasingly adaptive AI technologies.
As generative AI has advanced at great speed, the need to detect AI-generated content, including text and deepfake media, has also increased. This work proposes a hybrid detection method that combines double paraphrasing-based consistency checks with probabilistic content analysis through natural language processing and machine learning algorithms for text, and advanced deepfake detection techniques for media. Our system hybridizes the double-paraphrasing framework of SAVANA with probabilistic analysis to achieve high accuracy on AI-text detection in formats such as DOCX or PDF across diverse domains: academic text, business text, reviews, and media. For detecting the visual artifacts and spatiotemporal inconsistencies characteristic of deepfakes, we use BlazeFace and EfficientNetB4 to extract features for classification and detection. Experimental results indicate that the hybrid model achieves up to 95% accuracy for AI-generated text detection and up to 96% accuracy for deepfake detection, compared with traditional models and standalone SAVANA-based methods. This positions our framework as an adaptive and reliable tool for detecting AI-generated content in various contexts, thereby strengthening content integrity in digital environments.
The rise of advanced large language models (LLMs) has enabled the generation of human-like text, challenging the detection of AI-generated and humanized AI content. This study evaluates Logistic Regression, Bidirectional LSTM, and DeBERTa for multi-class detection of human-written, AI-generated, and humanized AI text. We introduce a novel dataset of 30,000 texts, including 10,000 humanized samples created via a LangChain-based pipeline with GPT-4o, verified to reduce AI detectability using ZeroGPT. Experimental results show DeBERTa achieves 96.93% accuracy, outperforming Logistic Regression (93.43%) and LSTM (93.77%) in distinguishing text classes. Our approach leverages stylometric features and deep contextual embeddings to address real-world challenges like stylistic overlap and adversarial paraphrasing. Key contributions include the dataset, a comparative model evaluation, and insights into detecting humanized AI text, with implications for content moderation, academic integrity, and misinformation prevention.
The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly for longer texts and in texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by extending artifacts from their original versions. To better capture the structure of longer texts at the document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It results in substantial performance gains across both datasets: 15.5% absolute improvement on paraLFQA, 4% absolute improvement on paraWP, and 1.5% absolute improvement on M4 compared to SOTA approaches. The data and code are available at this link.
High-quality paraphrases are easy to produce using instruction-tuned language models or specialized paraphrasing models. Although this capability has a variety of benign applications, paraphrasing attacks (paraphrases applied to machine-generated texts) are known to significantly degrade the performance of machine-text detectors. This motivates us to consider the novel problem of paraphrase inversion, where, given paraphrased text, the objective is to recover an approximation of the original text. The closer the approximation is to the original text, the better machine-text detectors will perform. We propose an approach which frames the problem as translation from paraphrased text back to the original text, which requires examples of texts and corresponding paraphrases to train the inversion model. Fortunately, such training data can easily be generated, given a corpus of original texts and one or more paraphrasing models. We find that language models such as GPT-4 and Llama-3 exhibit biases when paraphrasing which an inversion model can learn with a modest amount of data. Perhaps surprisingly, we also find that such models generalize well, including to paraphrase models unseen at training time. Finally, we show that when combined with a paraphrased-text detector, our inversion models provide an effective defense against paraphrasing attacks, and overall our approach yields an average improvement of +22% AUROC across seven machine-text detectors and three different domains.
Generative AI models, including ChatGPT, Gemini, and Claude, are increasingly significant in enhancing K–12 education, offering support across various disciplines. These models provide sample answers for humanities prompts, solve mathematical equations, and brainstorm novel ideas. Despite their educational value, ethical concerns have emerged regarding their potential to mislead students into copying answers directly from AI when completing assignments, assessments, or research papers. Current detectors, such as GPT-Zero, struggle to identify modified AI-generated texts and show reduced reliability for English as a Second Language learners. This study investigates detection of academic cheating by use of generative AI in high-stakes writing assessments. Classical machine learning models, including logistic regression, XGBoost, and support vector machine, are used to distinguish between AI-generated and student-written essays. Additionally, large language models including BERT, RoBERTa, and Electra are examined and compared to traditional machine learning models. The analysis focuses on prompt 1 from the ASAP Kaggle competition. To evaluate the effectiveness of various detection methods and generative AI models, we include ChatGPT, Claude, and Gemini in their base, pro, and latest versions. Furthermore, we examine the impact of paraphrasing tools such as GPT-Humanizer and QuillBot and introduce a new method of using synonym information to detect humanized AI texts. Additionally, the relationship between dataset size and model performance is explored to inform data collection in future research.
To combat the misuse of Large Language Models (LLMs), many recent studies have presented LLM-generated-text detectors with promising performance. When users instruct LLMs to generate texts, the instruction can include different constraints depending on the user's need. However, most recent studies do not cover such diverse instruction patterns when creating datasets for LLM detection. In this paper, we reveal that even task-oriented constraints -- constraints that would naturally be included in an instruction and are not related to detection-evasion -- cause existing powerful detectors to have a large variance in detection performance. We focus on student essay writing as a realistic domain and manually create task-oriented constraints based on several factors for essay quality. Our experiments show that the standard deviation (SD) of current detector performance on texts generated by an instruction with such a constraint is significantly larger (up to an SD of 14.4 F1-score) than that by generating texts multiple times or paraphrasing the instruction. We also observe an overall trend where the constraints can make LLM detection more challenging than without them. Finally, our analysis indicates that the high instruction-following ability of LLMs fosters the large impact of such constraints on detection performance.
This study proposes two advanced transformer-based architectures for enhancing the identification of AI-generated text: a fine-tuned BERT-CNN model and a hybrid DistilBERT-BiLSTM framework. The BERT-CNN architecture combines pre-trained BERT embeddings with convolutional neural networks to detect localized linguistic patterns indicative of synthetic text. The DistilBERT-BiLSTM model integrates the efficiency of DistilBERT with bidirectional LSTM layers to capture sequential dependencies and long-range contextual features. Both approaches employ standardized preprocessing using the BERT tokenizer, including tokenization, padding, and truncation, to ensure consistency in input representation. The BERT-CNN model achieved strong performance with 95.67% accuracy, 94.32% F1-score, and 93.45% precision, demonstrating its capability to discern subtle AI-generated patterns. The DistilBERT-BiLSTM framework further enhanced detection accuracy to 97%, with precision, recall, and F1-score values of 98%, 97%, and 97%, respectively, attributed to its ability to model temporal relationships in text sequences. Both models exhibited robustness against paraphrasing-based evasion techniques, with DistilBERT-BiLSTM showing superior generalization due to its balanced architecture of lightweight language understanding and sequential analysis. This research underscores the efficacy of transformer-based hybrid models in advancing AI-generated text detection, offering scalable solutions for maintaining content authenticity in academic, professional, and digital platforms. The findings contribute to the development of reliable tools for ethical AI adoption and mitigation of misinformation risks.
The rapid proliferation of generative Artificial Intelligence (AI) tools, particularly Large Language Models (LLMs) such as ChatGPT, has introduced unprecedented challenges to academic integrity in higher education. Students increasingly utilize these AI systems to generate essays, reports, and assignments, creating an urgent need for robust detection mechanisms that can identify AI-generated content in academic submissions. This study presents a comprehensive multi-modal classification approach that integrates multiple feature extraction techniques including stylometric analysis, linguistic pattern recognition, and semantic coherence measurement to detect AI-generated text with enhanced accuracy. By employing Convolutional Neural Networks (CNNs) for local feature extraction, recurrent neural architectures for sequential pattern analysis, and fusion-based ensemble learning methods that combine multiple classification pathways, our proposed framework achieves detection accuracy of 94.3 percent on a corpus of authentic student submissions and AI-generated counterparts. The multi-modal approach addresses limitations of single-modality detection systems by capturing diverse textual characteristics including vocabulary diversity, syntactic complexity, semantic consistency, and discourse structure patterns that distinguish human and AI writing. Experimental results demonstrate that AI-generated texts exhibit statistically significant differences in lexical diversity metrics, n-gram patterns, and topic coherence measures compared to authentic student writing. Furthermore, this research investigates the challenges of detection evasion strategies including paraphrasing and hybrid authorship scenarios where students modify AI-generated content. The findings underscore both the potential and limitations of current detection technologies while providing practical recommendations for educational institutions seeking to maintain academic integrity in the age of generative AI.
Machine learning algorithms have gained popularity for performing numerous tasks on text data, including prediction, recommendation, and sentiment analysis. Alongside the development of these algorithms, several adversarial attacks have emerged that inject perturbations into input data to manipulate the outputs of machine-learning-based models. These attacks degrade model performance and lead to incorrect results. This paper introduces two evasion-type adversarial machine-learning attacks for Bangla text and then proposes defensive mechanisms against them. First, a comprehensive Bangla dataset is generated and a teacher model is trained on it. The introduced attacks are then injected into the dataset to manipulate it, and a student model is trained on the manipulated dataset to build a robust model that learns to defeat future adversarial attacks. Finally, experimental analysis shows that the proposed framework achieves robustness and can defeat adversarial machine-learning attacks on Bangla text.
With the advent of large language models (LLM), the line between human-crafted and machine-generated texts has become increasingly blurred. This paper delves into the inquiry of identifying discernible and unique linguistic properties in texts that were written by humans, particularly uncovering the underlying discourse structures of texts beyond their surface structures. Introducing a novel methodology, we leverage hierarchical parse trees and recursive hypergraphs to unveil distinctive discourse patterns in texts produced by both LLMs and humans. Empirical findings demonstrate that, although both LLMs and humans generate distinct discourse patterns influenced by specific domains, human-written texts exhibit more structural variability, reflecting the nuanced nature of human writing in different domains. Notably, incorporating hierarchical discourse features enhances binary classifiers' overall performance in distinguishing between human-written and machine-generated texts, even on out-of-distribution and paraphrased samples. This underscores the significance of incorporating hierarchical discourse features in the analysis of text patterns. The code and dataset are available at https://github.com/minnesotanlp/threads-of-subtlety.
Introduction: The rapid progression of generative AI systems has facilitated the creation of human-like text with remarkable sophistication. Models such as GPT-4, Claude, and Gemini are capable of generating coherent content across a wide range of genres, thereby raising critical concerns regarding the differentiation between machine-generated and human-authored text. This capability presents significant challenges to academic integrity, content authenticity, and the development of reliable detection methodologies. Objective: To evaluate the performance and reliability of current AI-based text detection tools in identifying machine-generated content across different text genres, AI models, and writing styles, establishing a comprehensive benchmark for detection capabilities. Methodology: We systematically evaluated ten commercially available AI detection tools utilizing a curated dataset comprising 150 text samples, expanded from the original 50. This dataset included human-authored texts, both original and translated, as well as AI-generated content from six advanced models (GPT-3.5, GPT-4, Gemini, Bing, Claude, LLaMA2), along with paraphrased variants. Each tool underwent assessment through binary classification, employing metrics such as accuracy, precision, recall, F1 scores, and confusion matrices. Statistical significance was determined using McNemar's test with Bonferroni correction. Results: Content at Scale demonstrated the highest accuracy at 88% (95% CI: 84.2-91.8%), followed by Crossplag at 76% and Copyleaks at 70%. Notably, performance varied significantly across different text categories, with all tools exhibiting reduced accuracy for texts generated by more recent models, such as Claude and LLaMA2. False positive rates ranged from 4% to 32%, which raises concerns regarding their applicability in academic contexts. No tool achieved perfect accuracy, and a performance degradation of 12% was observed with models released subsequent to the initial study design. Conclusions: Current AI text detection tools exhibit moderate to high levels of accuracy; however, they remain imperfect, displaying considerable variability across different AI models and text types. The ongoing challenge of achieving reliable detection, coupled with non-trivial false positive rates, necessitates cautious implementation in high-stakes environments. These tools should serve as a complement to, rather than a replacement for, human judgment in academic and professional contexts.
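For readers unfamiliar with the reported test procedure, here is a minimal sketch of McNemar's test on paired detector decisions with a Bonferroni-corrected alpha; the contingency counts below are illustrative, not the study's data.

```python
# Illustrative McNemar test with Bonferroni correction (counts are made up).
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired outcomes for detectors A and B on the same samples:
# rows = A correct/incorrect, cols = B correct/incorrect
table = [[112, 18],
         [6, 14]]
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs

n_comparisons = 45            # e.g., all pairs among ten tools: 10 * 9 / 2
alpha = 0.05 / n_comparisons  # Bonferroni-corrected significance level
print(f"p = {result.pvalue:.4f}, significant: {result.pvalue < alpha}")
```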
With the increasing integration of large language models (LLMs) into open-domain writing, detecting machine-generated text has become a critical task for ensuring content authenticity and trust. Existing approaches rely on statistical discrepancies or model-specific heuristics to distinguish between LLM-generated and human-written text. However, these methods struggle in real-world scenarios due to limited generalization, vulnerability to paraphrasing, and lack of explainability, particularly when facing stylistic diversity or hybrid human-AI authorship. In this work, we propose StyleDecipher, a robust and explainable detection framework that revisits LLM-generated text detection using combined feature extractors to quantify stylistic differences. By jointly modeling discrete stylistic indicators and continuous stylistic representations derived from semantic embeddings, StyleDecipher captures distinctive style-level divergences between human and LLM outputs within a unified representation space. This framework enables accurate, explainable, and domain-agnostic detection without requiring access to model internals or labeled segments. Extensive experiments across five diverse domains, including news, code, essays, reviews, and academic abstracts, demonstrate that StyleDecipher consistently achieves state-of-the-art in-domain accuracy. Moreover, in cross-domain evaluations, it surpasses existing baselines by up to 36.30%, while maintaining robustness against adversarial perturbations and mixed human-AI content. Further qualitative and quantitative analysis confirms that stylistic signals provide explainable evidence for distinguishing machine-generated text. Our source code can be accessed at https://github.com/SiyuanLi00/StyleDecipher.
Large language models (LLMs) have transformed human writing by enhancing grammar correction, content expansion, and stylistic refinement. However, their widespread use raises concerns about authorship, originality, and ethics, even potentially threatening scholarly integrity. Existing detection methods, which mainly rely on single-feature analysis and binary classification, often fail to effectively identify LLM-generated text in academic contexts. To address these challenges, we propose a novel Multi-level Fine-grained Detection (MFD) framework that detects LLM-generated text by integrating low-level structural, high-level semantic, and deep-level linguistic features, while conducting sentence-level evaluations of lexicon, grammar, and syntax for comprehensive analysis. To improve detection of subtle differences in LLM-generated text and enhance robustness against paraphrasing, we apply two mainstream evasion techniques to rewrite the text. These variations, along with the original texts, are used to train a text encoder via contrastive learning, extracting high-level semantic features of sentences to boost detection generalization. Furthermore, we leverage an advanced LLM to analyze the entire text and extract deep-level linguistic features, enhancing the model's ability to capture complex patterns and nuances while effectively incorporating contextual information. Extensive experiments on public datasets show that the MFD model outperforms existing methods, achieving an MAE of 0.1346 and an accuracy of 88.56%. Our research provides institutions and publishers with an effective mechanism to detect LLM-generated text, mitigating risks of compromised authorship. Educators and editors can use the model's predictions to refine verification and plagiarism-prevention protocols, ensuring adherence to standards.
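A minimal sketch of the contrastive-training ingredient described above, assuming an in-batch InfoNCE objective over (original, paraphrase) pairs; the encoder, temperature, and pairing scheme are assumptions, not the paper's exact recipe.

```python
# Sketch of in-batch InfoNCE over (original, paraphrased) embedding pairs.
import torch
import torch.nn.functional as F

def info_nce(orig_emb, para_emb, temperature=0.07):
    """Each original text should match its own paraphrase (diagonal) and
    repel every other sample in the batch."""
    orig = F.normalize(orig_emb, dim=-1)
    para = F.normalize(para_emb, dim=-1)
    logits = orig @ para.T / temperature      # cosine-similarity matrix
    labels = torch.arange(orig.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random "encoder outputs" for a batch of 8 pairs
orig = torch.randn(8, 768, requires_grad=True)
para = torch.randn(8, 768, requires_grad=True)
info_nce(orig, para).backward()
```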
Adversarial training has emerged as a powerful technique for improving the reliability of natural language processing (NLP) models, especially for sentiment analysis and machine translation. By providing adversarial examples during the training process, models are exposed to perturbations that challenge their understanding and interpretation of textual data. This process helps develop models that are not only accurate but also resilient to manipulations and noise in real-world scenarios. In sentiment analysis, adversarial training ensures that models can maintain consistent performance despite variations in input text, such as paraphrasing or the inclusion of misleading sentiment indicators. This robustness is crucial for applications involving user-generated content, where linguistic diversity and intentional manipulation are common. In the context of machine translation, adversarial training contributes to the development of models that can handle diverse linguistic structures and idiomatic expressions, which are often sources of errors in traditional models. By simulating adversarial attacks that introduce such complexities, the training process makes models more adept at preserving the semantic integrity of translated texts across different languages. This improved robustness is particularly beneficial for applications requiring high translation accuracy and reliability, such as international communication, content localization, and multilingual information retrieval. Overall, adversarial training represents a significant advance toward more resilient and effective NLP models for sentiment analysis and machine translation.
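To ground the idea, here is a minimal sketch of one embedding-space adversarial training step in the Fast Gradient Method style; the model here consumes embeddings directly, which simplifies perturbing the embedding layer of a full transformer, and epsilon is an arbitrary placeholder.

```python
# Sketch of an FGM-style adversarial training step on input embeddings.
import torch

def fgm_training_step(model, embeddings, labels, loss_fn, epsilon=1.0):
    # embeddings: (batch, seq_len, dim); model consumes embeddings directly
    embeddings = embeddings.clone().detach().requires_grad_(True)
    clean_loss = loss_fn(model(embeddings), labels)
    # Gradient w.r.t. the inputs gives the direction of the attack
    (grad,) = torch.autograd.grad(clean_loss, embeddings)
    norm = grad.flatten(1).norm(dim=1).clamp_min(1e-8).view(-1, 1, 1)
    adv_embeddings = (embeddings + epsilon * grad / norm).detach()
    # Joint objective over clean and perturbed inputs
    total = (loss_fn(model(embeddings), labels)
             + loss_fn(model(adv_embeddings), labels))
    total.backward()
    return total.item()
```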
Large language models (LLMs) have transformed natural language generation capabilities across numerous applications, yet their proliferation raises critical concerns regarding content attribution, intellectual property protection, and potential misuse. Watermarking techniques have emerged as promising solutions for embedding verifiable signals into LLM outputs, but existing approaches remain vulnerable to sophisticated evasion attacks that exploit detection mechanisms through adversarial modifications. This paper introduces a novel watermarking framework that integrates multi-scale semantic embedding with cryptographic verification to achieve robust attribution of LLM-generated text. Our approach operates across multiple granularity levels, from token-level perturbations to discourse-level structural patterns, while incorporating error-correcting codes and cryptographic signatures to ensure detection integrity even under aggressive tampering attempts. Through comprehensive evaluation on diverse text generation tasks, we demonstrate that our framework achieves superior robustness against paraphrasing attacks, token substitution, and deletion operations while maintaining high text quality with perplexity comparable to unwatermarked outputs. The integration of cryptographic primitives enables public verifiability without exposing watermarking keys, addressing critical security requirements for real-world deployment. Our results show detection accuracy exceeding 94 percent under various attack scenarios while preserving semantic coherence and stylistic naturalness of generated text.
No abstract available
Natural Language Processing (NLP) systems have achieved remarkable success in sentiment analysis, named entity recognition, and text classification through deep learning architectures such as Transformers and recurrent neural networks. However, these models remain vulnerable to adversarial perturbations: small, carefully crafted textual modifications capable of misleading predictions. This research introduces DUAL-ARMOR, an integrated framework designed to enhance adversarial robustness, interpretability, and certification in NLP models. Using benchmark datasets (IMDB, SST-2, and AG News), the study evaluates four model architectures (BERT, RoBERTa, LSTM, and GRU) against gradient-based, rule-based, and semantic-preserving adversarial attacks. DUAL-ARMOR combines Token-Aware Adversarial Training (TAAT) for lexical invariance, Internal-Noise Regularization (INR) for decision-boundary smoothing, and an External Guardian Layer that incorporates an Ensemble Consensus Detector (ECD) and Certified Radius Estimator (CRE) for real-time attack detection and robustness certification. Experimental results show a significant reduction in robustness degradation ratios (from 36% to below 12%) and improved calibration, with the Expected Calibration Error halved across models. Linguistic coherence and attention stability also improved, with Grad-CAM visualizations confirming enhanced focus consistency under attack. The framework achieved detection AUC values above 90% and increased certified coverage by over 30%, validating its robustness under both synthetic and semantic adversarial scenarios. Statistical significance tests (p < 0.05) verified the reliability of these results, while computational overhead remained within practical limits (+24% training, +13% inference). Overall, DUAL-ARMOR establishes a certifiable, end-to-end defense paradigm that unifies adversarial training, regularization, and runtime detection, offering a scalable, interpretable, and security-first solution for deploying NLP models in safety-critical domains such as finance, healthcare, and cybersecurity.
State-of-the-art machine learning models are prone to adversarial attacks: maliciously crafted inputs that fool the model into making a wrong prediction, often with high confidence. While defense strategies have been extensively explored in the computer vision domain, research in natural language processing still lacks techniques to make models resilient to adversarial text inputs. We adapt a technique from computer vision to detect word-level attacks targeting text classifiers. This method relies on training an adversarial detector leveraging Shapley additive explanations and outperforms the current state of the art on two benchmarks. Furthermore, we show that the detector requires only a small number of training samples and, in some cases, generalizes to different datasets without needing to be retrained.
Unnatural text correction aims to automatically detect and correct spelling errors or adversarial perturbation errors in sentences. Existing methods typically rely on fine-tuning or adversarial training to correct errors, which have achieved significant success. However, these methods exhibit poor generalization performance due to the difference in data distribution between training data and real-world scenarios, known as the exposure bias problem. In this paper, we propose a self-correct adversarial training framework for learning from mistakes (LIMIT), which is a task- and model-independent framework to correct unnatural errors or mistakes. Specifically, we fully utilize errors generated by the model that are actively exposed during the inference phase, i.e., predictions that are inconsistent with the target. This training method not only simulates potential errors in real application scenarios, but also mitigates the exposure bias of the traditional training process. Meanwhile, we design a novel decoding intervention strategy to maintain semantic consistency. Extensive experimental results on Chinese unnatural text error correction datasets show that our proposed method can correct multiple forms of errors and outperforms the state-of-the-art text correction methods. In addition, extensive results on Chinese and English datasets validate that LIMIT can serve as a plug-and-play defense module and can extend to new models and datasets without further training.
Spreading false information through fake news articles poses a significant danger to society because it can shape public opinion with inaccurate facts, leading to negative effects such as reduced trust in institutions and the promotion of conflict, division, and even violence. In this article, a text augmentation technique is introduced as a means of generating new data from preexisting fake news datasets. This approach can enhance classifier performance by 3%–11%. It can also be used to launch a successful attack on trained classifiers, with up to a 90% success rate. However, the success rate of these attacks decreased to less than 28% when the model was retrained with the generated adversarial examples. These results demonstrate the effectiveness of text augmentation as a viable method for detecting fake news and increasing classifier accuracy and performance, as well as its ability to be used for adversarial machine learning (ML) and to improve the resilience of ML algorithms.
Recent work has proposed several efficient approaches for generating gradient-based adversarial perturbations on embeddings and has shown that a model's performance and robustness can be improved when it is trained with these contaminated embeddings. However, little attention has been paid to helping the model learn from these adversarial samples more efficiently. In this work, we focus on enhancing the model's ability to defend against gradient-based adversarial attacks during training and propose two novel adversarial training approaches: (1) CARL narrows the distance between an original sample and its adversarial counterpart in the representation space while enlarging their distance from differently labeled samples. (2) RAR forces the model to reconstruct the original sample from its adversarial representation. Experiments show that the two proposed approaches outperform strong baselines on various text classification datasets. Analysis experiments find that, with our approaches, the semantic representation of the input sentence is not significantly affected by adversarial perturbations, and the model's performance drops less under adversarial attack; that is, our approaches effectively improve the robustness of the model. Moreover, RAR can also be used to generate text-form adversarial samples.
No abstract available
The rapid development of Large Language Models (LLMs) has transformed various fields, especially education, where their ability to generate human-like text enhances writing efficiency. However, these advancements present challenges for developing students' critical thinking and writing skills. It is therefore important to distinguish between human-written and AI-generated text to maintain academic integrity. This study proposes a machine-learning approach that utilizes an ensemble of RoBERTa transformer models to classify AI-generated text in English essays. The proposed method combines three variants of the RoBERTa model with different training parameters to improve the classification model's performance. Evaluation results show strong performance, with a precision of 99.560%, a recall of 97.839%, and an F1-score of 98.692%. These results outperform the individual RoBERTa models and traditional machine learning models such as Naive Bayes, Support Vector Machine, and Random Forest. The findings highlight the effectiveness of using an ensemble of RoBERTa transformer models for the classification of AI-generated text. This research contributes to the development of AI-generated text classification models and offers solutions to the challenges that the growth of LLMs poses in education.
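A minimal sketch of soft voting over fine-tuned RoBERTa variants, in the spirit of the ensemble described above; the checkpoint names are hypothetical placeholders and the probability-averaging scheme is an assumption.

```python
# Sketch of a soft-voting ensemble; checkpoint names are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoints = ["roberta-variant-a", "roberta-variant-b", "roberta-variant-c"]
tok = AutoTokenizer.from_pretrained("roberta-base")

def ensemble_predict(text: str) -> int:
    batch = tok(text, return_tensors="pt", truncation=True)
    probs = []
    for ckpt in checkpoints:  # placeholder fine-tuned checkpoints
        model = AutoModelForSequenceClassification.from_pretrained(ckpt)
        with torch.no_grad():
            probs.append(model(**batch).logits.softmax(-1))
    # Average class probabilities across ensemble members (soft voting)
    return torch.stack(probs).mean(0).argmax(-1).item()
```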
Existing methods of generating adversarial texts usually change the original meaning of the text significantly and can even produce unreadable text. Such less-readable adversarial texts can successfully cause a machine classifier to misclassify, but they cannot deceive human observers very well. In this paper, we propose a novel method that generates readable adversarial texts whose perturbations can also successfully confuse human observers. Based on the continuous bag-of-words (CBOW) model, the proposed method searches for appropriate perturbations to generate adversarial texts by controlling the perturbation direction vectors. Meanwhile, we apply adversarial training to regularize the classification model and extend it to semi-supervised tasks with virtual adversarial training. Experiments show that the generated adversaries are interpretable and confusing to humans, and that virtual adversarial training effectively improves the robustness of the model.
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating high-quality text, raising significant concerns regarding copyright protection and content provenance verification. However, most existing watermarking techniques rely on uniform perturbation or rule-based token biasing schemes, which exhibit critical vulnerabilities under adversarial attacks such as paraphrasing, translation, and content truncation, often failing to maintain detection reliability in real-world deployment scenarios. To address these challenges, this paper introduces a novel context-aware robust watermarking framework that dynamically adjusts watermark embedding strength according to contextual semantic characteristics during text generation. The proposed approach incorporates a token-level semantic modulation mechanism that strategically intensifies watermark signals in copyright-sensitive segments while minimizing perturbations in semantically neutral regions, achieving an improved balance between imperceptibility and robustness. Furthermore, an adaptive threshold estimation algorithm is developed for watermark detection, which automatically calibrates detection boundaries based on noise statistics, significantly enhancing resilience against diverse attack vectors. Extensive experiments on the WaterBench benchmark demonstrate superior performance over state-of-the-art baselines, maintaining high detection accuracy with a 95.3% true positive rate (TPR) under clean conditions and strong robustness under severe perturbations, including paraphrasing attacks (82.7% TPR), translation attacks (78.4% TPR), and content truncation (88.9% TPR at 50% retention). Meanwhile, the proposed method reduces false positive rates by 43.2% compared with existing approaches while preserving text quality with negligible perplexity increase (1.8%). These results establish a new paradigm for practical and scalable LLM watermarking in real-world copyright-sensitive deployment scenarios.
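The adaptive-threshold idea above can be illustrated with a minimal sketch: calibrate the detection boundary from the score distribution of unwatermarked (null) text rather than fixing it. The k-sigma rule below is an assumption for illustration; the paper's actual estimator may differ.

```python
# Sketch of threshold calibration from null-text score statistics.
import statistics

def calibrate_threshold(null_scores: list[float], k: float = 3.0) -> float:
    """Set the detection threshold k standard deviations above the mean
    score observed on unwatermarked text."""
    mu = statistics.mean(null_scores)
    sigma = statistics.stdev(null_scores)
    return mu + k * sigma

null_scores = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2]  # scores on clean text
threshold = calibrate_threshold(null_scores)
print(f"flag as watermarked if score > {threshold:.2f}")
```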
Stance Detection is the task of identifying the position of an author of a text towards an issue or a target. Previous studies on Stance Detection indicate that the existing systems are non-robust to the variations and errors in input sentences. Our proposed methodology uses Contrastive Learning to learn sentence representations by bringing semantically similar sentences and sentences implying the same stance closer to each other in the embedding space. We compare our approach to a pretrained transformer model directly finetuned with the stance datasets. We use char-level and word-level adversarial perturbation attacks to measure the resilience of the models and we show that our approach achieves better performances and is more robust to the different adversarial perturbations introduced to the test data. The results indicate that our approach performs better on small-sized and class-imbalanced stance datasets.
Stance Detection refers to the process of determining an author’s position towards a particular issue or target in a text. Previous research suggests that existing systems for Stance Detection are not resilient enough to handle variations and errors in input sentences. In our proposed methodology, we utilize Contrastive Learning to learn sentence representations. We achieve this by bringing semantically similar sentences and those implying the same stance closer to each other in the embedding space. To compare our approach, we use a pretrained transformer model that is directly finetuned with the stance datasets. We evaluate the resilience of the models using char-level and word-level adversarial perturbation attacks and show that our approach performs better and is more robust to the different adversarial perturbations introduced to the test data. Our approach is also shown to perform better on small-sized and class-imbalanced stance datasets. We further experiment with unlabeled stance datasets to make the representation learning independent of domain-specific labels, and the models trained with our approach on unlabeled datasets are still robust and perform comparably to those trained with labeled data.
Following great success in the image processing field, the idea of adversarial training has been applied to tasks in the natural language processing (NLP) field. One promising approach directly applies adversarial training developed for image processing to the input word embedding space instead of the discrete input space of texts. However, this approach gives up the interpretability of generating actual adversarial texts in exchange for improved performance on NLP tasks. This paper restores interpretability to such methods by restricting the directions of perturbations toward existing words in the input embedding space. As a result, each perturbed input can be straightforwardly reconstructed as an actual text by treating the perturbations as word replacements in the sentence, while maintaining or even improving task performance.
Large language models (LLMs) have shown the ability to produce fluent and cogent content, presenting both productivity opportunities and societal risks. To build trustworthy AI systems, it is imperative to distinguish between machine-generated and human-authored content. The leading zero-shot detector, DetectGPT, showcases commendable performance but is marred by intensive computational costs. In this paper, we introduce the concept of conditional probability curvature to elucidate discrepancies in word choices between LLMs and humans within a given context. Using this curvature as a foundational metric, we present Fast-DetectGPT, an optimized zero-shot detector that substitutes DetectGPT's perturbation step with a more efficient sampling step. Our evaluations on various datasets, source models, and test conditions indicate that Fast-DetectGPT not only surpasses DetectGPT by around 75% relative in both the white-box and black-box settings but also accelerates the detection process by a factor of 340. See https://github.com/baoguangsheng/fast-detect-gpt for code, data, and results.
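Since the abstract describes the metric only at a high level, here is a minimal sketch of conditional probability curvature in the white-box case where the sampling and scoring model coincide, using the analytic mean and variance of token log-probability; GPT-2 is used purely for illustration and details are simplified relative to the paper.

```python
# Sketch of conditional probability curvature (single-model, white-box case).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def curvature_score(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[:, :-1]        # next-token predictions
    log_probs = logits.log_softmax(-1)
    probs = log_probs.exp()
    target = ids[:, 1:].unsqueeze(-1)
    observed = log_probs.gather(-1, target).squeeze(-1).sum()
    # Analytical mean/variance of token log-prob under the model's own
    # conditional distribution at each position
    mean = (probs * log_probs).sum(-1)
    var = (probs * log_probs.pow(2)).sum(-1) - mean.pow(2)
    return ((observed - mean.sum()) / var.sum().sqrt()).item()

# Higher scores flag text the model finds "surprisingly likely", a
# signature of machine generation under this metric.
print(curvature_score("The quick brown fox jumps over the lazy dog."))
```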
Text watermarking algorithms for large language models (LLMs) can effectively identify machine-generated texts by embedding and detecting hidden features in the text. Although current text watermarking algorithms perform well in most high-entropy scenarios, their performance in low-entropy scenarios still needs to be improved. In this work, we argue that the influence of token entropy should be fully considered in the watermark detection process, i.e., the weight of each token during watermark detection should be customized according to its entropy, rather than setting the weights of all tokens to the same value as in previous methods. Specifically, we propose Entropy-based Text Watermarking Detection (EWD), which gives higher-entropy tokens higher influence weights during watermark detection, so as to better reflect the degree of watermarking. Furthermore, the proposed detection process is training-free and fully automated. Experiments demonstrate that EWD achieves better detection performance in low-entropy scenarios, and the method is general and can be applied to texts with different entropy distributions. Code and data are available at https://github.com/luyijian3/EWD, and the algorithm can also be accessed through MarkLLM (https://github.com/THU-BPM/MarkLLM).
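A minimal sketch of the entropy-weighted test, assuming a green-list-style watermark where each token's vote is scaled by its conditional entropy before computing a z-statistic; the specific weight function here (weight = raw entropy) is an assumption, and the paper's choice may differ.

```python
# Sketch of an entropy-weighted green-list z-statistic.
import math

def ewd_z_score(green_flags, entropies, gamma=0.25):
    """green_flags[i]: token i is in the green list; entropies[i]: its
    conditional entropy under the language model (higher = more weight)."""
    weights = entropies                  # simplest choice: weight = entropy
    weighted_hits = sum(w for w, g in zip(weights, green_flags) if g)
    mean = gamma * sum(weights)
    var = gamma * (1 - gamma) * sum(w * w for w in weights)
    return (weighted_hits - mean) / math.sqrt(var)

# Low-entropy tokens (forced choices) barely count; high-entropy ones dominate.
print(ewd_z_score([True, False, True, True], [0.1, 2.3, 1.8, 2.0]))
```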
Out-of-distribution (OOD) detection remains challenging in text-rich networks, where textual features intertwine with topological structures. Existing methods primarily address label shifts or rudimentary domain-based splits, overlooking the intricate textual-structural diversity. For example, in social networks, where users represent nodes with textual features (name, bio) while edges indicate friendship status, OOD may stem from the distinct language patterns between bot and normal users. To address this gap, we introduce the TextTopoOOD framework for evaluating detection across diverse OOD scenarios: (1) attribute-level shifts via text augmentations and embedding perturbations; (2) structural shifts through edge rewiring and semantic connections; (3) thematically-guided label shifts; and (4) domain-based divisions. Furthermore, we propose TNT-OOD to model the complex interplay between Text aNd Topology using: 1) a novel cross-attention module to fuse local structure into node-level text representations, and 2) a HyperNetwork to generate node-specific transformation parameters. This aligns topological and semantic features of ID nodes, enhancing ID/OOD distinction across structural and textual shifts. Experiments on 11 datasets across four OOD scenarios demonstrate the nuanced challenge of TextTopoOOD for evaluating OOD detection in text-rich networks.
Large language models have revolutionized text generation, offering significant benefits while also posing threats to society, such as copyright infringement and misinformation. To prevent harmful use, the task of detecting machine-generated content has become an important research topic, though it remains particularly challenging across diverse content domains. This paper presents DGRM, an innovative add-on module designed to improve the domain generalization capability of existing machine-generated text detectors. Our model consists of two training components. (1) Feature disentanglement separates a text’s embedding into target-specific and common attributes, thereby enhancing semantic domain generalization across different content domains. (2) Feature regularization applies constraints to these attributes to extract additional target-relevant information and ensure detection consistency under syntactic perturbations—thus achieving syntactic domain generalization. Evaluation over multiple datasets demonstrates that incorporating our module substantially improves the detection of machine-generated text across semantically and syntactically diverse domains. We hope our work contributes to mitigating the harmful use of language models.
The growing amount and quality of AI-generated texts make detecting such content increasingly difficult. In most real-world scenarios, the domain (style and topic) of generated data and the generator model are not known in advance. In this work, we focus on the robustness of classifier-based detectors of AI-generated text, namely their ability to transfer to unseen generators or semantic domains. We investigate the geometry of the embedding space of Transformer-based text encoders and show that clearing out harmful linear subspaces helps to train a robust classifier that ignores domain-specific spurious features. We investigate several subspace decomposition and feature selection strategies and achieve significant improvements over state-of-the-art methods in cross-domain and cross-generator transfer. Our best approaches for head-wise and coordinate-based subspace removal increase the mean out-of-distribution (OOD) classification score by up to 9% and 14% in particular setups for RoBERTa and BERT embeddings, respectively. We release our code and data: https://github.com/SilverSolver/RobustATD
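To make the subspace-removal idea concrete, here is a minimal sketch: estimate domain-predictive directions and project embeddings onto their orthogonal complement before training the detector. The direction-selection strategy below (a mean-difference direction) is one simple assumption; the paper compares several strategies.

```python
# Sketch of projecting out a "harmful" linear subspace from embeddings.
import numpy as np

def remove_subspace(X: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Project embeddings X (n, d) onto the orthogonal complement of the
    span of `directions` (k, d)."""
    Q, _ = np.linalg.qr(directions.T)     # orthonormal basis, shape (d, k)
    return X - (X @ Q) @ Q.T

# Example: one spurious direction separating two semantic domains
X = np.random.randn(100, 768)
domain = np.random.rand(100) < 0.5
spurious = (X[domain].mean(0) - X[~domain].mean(0))[None, :]
X_clean = remove_subspace(X, spurious)
```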
Large Language Models (LLMs) can generate high-quality text and demonstrate excellent performance on various tasks. However, their ability to generate increasingly fluent, human-like text has also raised concerns about misuse for malicious purposes. It is therefore urgent to design reliable and robust methods for generated-text detection. The zero-shot detector Fast-DetectGPT not only demonstrates strong detection accuracy but is also commendably fast. However, it is unclear whether it can maintain such a high detection success rate in the face of paraphrase attacks. In this article, we use various paraphrase attacks to perturb text and evaluate the robustness of Fast-DetectGPT, finding a significant decrease in its performance. We then propose Sentence-DetectGPT to optimize the detection process of Fast-DetectGPT, greatly enhancing its robustness against paraphrase attacks.
Phishing and related cyber threats are becoming more varied and technologically advanced. Among these, email-based phishing remains the most dominant and persistent threat. These attacks exploit human vulnerabilities to disseminate malware or gain unauthorized access to sensitive information. Deep learning (DL) models, particularly transformer-based models, have significantly enhanced phishing mitigation through their contextual understanding of language. However, some recent threats, specifically Artificial Intelligence (AI)-generated phishing attacks, are reducing the overall system resilience of phishing detectors. In response, adversarial training has shown promise against AI-generated phishing threats. This study presents a hybrid approach that uses DistilBERT, a smaller, faster, and lighter version of the BERT transformer model for email classification. Robustness against text-based adversarial perturbations is reinforced using Fast Gradient Method (FGM) adversarial training. Furthermore, the framework integrates the LIME Explainable AI (XAI) technique to enhance the transparency of the DistilBERT architecture. The framework also uses the Flan-T5-small language model from Hugging Face to generate plain-language security narrative explanations for end-users. This combined approach ensures precise phishing classification while providing easily understandable justifications for the model's decisions.
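Since the abstract names LIME explicitly, here is a minimal sketch of the explanation step for a single email; `predict_proba` is a placeholder for the fine-tuned DistilBERT classifier's probability function, and the example email is invented.

```python
# Sketch of a LIME text explanation; predict_proba is a placeholder.
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # Placeholder: in practice, run the fine-tuned DistilBERT and return
    # an (n, 2) array of [legitimate, phishing] probabilities.
    return np.tile([0.2, 0.8], (len(texts), 1))

explainer = LimeTextExplainer(class_names=["legitimate", "phishing"])
exp = explainer.explain_instance(
    "Urgent: verify your account now at http://example.com",
    predict_proba, num_features=5)
print(exp.as_list())  # token-level contributions to the phishing score
```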
Large Language Models (LLMs) are vulnerable to jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires a time-consuming search for adversarial prompts, whereas automatic adversarial prompt generation often leads to semantically meaningless attacks that do not scale well. In this paper, we present a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds. AdvPrompter, which is trained using an alternating optimization algorithm, generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open source TargetLLMs show highly competitive results on the AdvBench and HarmBench datasets, that also transfer to closed-source black-box LLMs. We also show that training on adversarial suffixes generated by AdvPrompter is a promising strategy for improving the robustness of LLMs to jailbreaking attacks.
Taken together, the merged grouping covers the full technical lifecycle of AIGC text detection and its adversarial countermeasures. Research has evolved from early, simple statistical binary classification into a sophisticated body of work that deeply mines linguistic features and uses adversarial interplay to improve robustness. The field currently shows three major trends. First, attack-defense co-evolution: attack techniques have shifted from simple rewriting to sophisticated "de-fingerprinting" evasion, while defenses have introduced adversarial training and proactive watermarking. Second, generalization and robustness: research focus has moved to the failure modes of detectors across models, domains, and out-of-distribution data. Third, practical deployment and standardization: large-scale industry benchmarks and international shared tasks are driving adoption in real-world settings such as academic integrity and cybersecurity.