AIGC Text Detection and Adversarial Countermeasures
Surveys and Detectability Theory: The Detection-Technique Spectrum, Evaluation Paradigms, and Research Boundaries
The overall landscape and theoretical limits of AIGC/LLM-generated text detection: systematic surveys of the detection-technique spectrum (e.g., watermarking, statistical/neural discrimination, datasets, and evaluation paradigms), discussion of the factors and difficulties affecting detectability (e.g., human paraphrasing, upper/lower bounds on identifiability, out-of-distribution shift, and real-world misreadings), together with a synthesis of the adversarial-defense research lineage.
- The Science of Detecting LLM-Generated Texts(Ruixiang Tang, Yu-Neng Chuang, Xia Hu, 2023, ArXiv Preprint)
- AI-generated text detection: A comprehensive review of methods, datasets, and applications(Tanzila Kehkashan, Raja Adil Riaz, A. S. Al-Shamayleh, Adnan Akhunzada, Noman Ali, Muhammad Hamza, Faheem Akbar, 2025, Computer Science Review)
- A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions(Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Lidia S. Chao, Derek F. Wong, 2025, Computational Linguistics)
- Towards Possibilities & Impossibilities of AI-generated Text Detection: A Survey(Soumya Suvra Ghosal, Souradip Chakraborty, Jonas Geiping, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, 2023, ArXiv Preprint)
- Detecting AI-Generated Text: Factors Influencing Detectability with Current Methods(Kathleen C. Fraser, Hillary Dawkins, Svetlana Kiritchenko, 2024, Journal of Artificial Intelligence Research)
- A Survey of Adversarial Defenses and Robustness in NLP(Shreyansh Goyal, Sumanth Doddapaneni, Mitesh M. Khapra, B. Ravindran, 2023, ACM Computing Surveys)
- Understanding the effects of human-written paraphrases in LLM-generated text detection(Hiu Ting Lau, Arkaitz Zubiaga, 2025, Natural Language Processing Journal)
- On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?(Mingmeng Geng, Thierry Poibeau, 2025, ArXiv Preprint)
Detector Methods and Benchmark Evaluation: Feature-Based, Ensemble, Sentence-Level, and Fine-Grained Tasks
Engineering and experimental routes to detector/model construction and benchmark evaluation: feature engineering (e.g., perplexity, TF-IDF/linguistic statistics, academic-writing indicators, hierarchical feature fusion), ensembles and multi-model combinations across detection paradigms, sentence-level or fine-grained task definitions, and evaluation of accuracy and cross-domain/cross-model generalization on shared tasks or standard benchmarks. This group emphasizes the effectiveness of usable detectors and deployable evaluation.
- Streaming Bilingual Perplexity-Driven HeteroGNN: A Heterogeneous Graph Transformer with Incremental Training for AIGC Text Detection(Ruijin Peng, Yue Zhang, 2025, Proceedings of the 2025 9th International Conference on Computer Science and Artificial Intelligence)
- Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions(Madhav S. Baidya, S. S. Baidya, Chirag Chawla, 2026, ArXiv Preprint)
- LuxVeri at GenAI Detection Task 3: Cross-Domain Detection of AI-Generated Text Using Inverse Perplexity-Weighted Ensemble of Fine-Tuned Transformer Models(Md Kamrujjaman Mobin, Md Saiful Islam, 2025, ArXiv Preprint)
- mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection(Dominik Macko, 2025, ArXiv Preprint)
- ChatGPT Presence in Academic Writing: Detecting AI-Generated Text in Undergraduate and Graduate Students’ Research Proposal Literature Reviews(Vera A. Dugartsyrenova, 2025, RUDN Journal of Psychology and Pedagogics)
- Evaluating the Efficacy of Perplexity Scores in Distinguishing AI-Generated and Human-Written Abstracts.(Alperen Elek, Hatice Sude Yildiz, Benan Akca, Nisa Cem Oren, Batuhan Gundogdu, 2025, Academic Radiology)
- BUST: Benchmark for the evaluation of detectors of LLM-Generated Text(Joseph Cornelius, Oscar Lithgow-Serrano, Sandra Mitrović, Ljiljana Dolamic, Fabio Rinaldi, 2024, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers))
- A Framework for Enhancing Accuracy in AI Generated Text Detection Using Ensemble Modelling(Kush Aggarwal, Sahib Singh, Parul, Vipin Pal, S. Yadav, 2024, 2024 IEEE Region 10 Symposium (TENSYMP))
- MLSDET: Multi-LLM Statistical Deep Ensemble for Chinese AI-Generated Text Detection(Dianhui Mao, Denghui Zhang, Ao Zhang, Zhihua Zhao, 2025, ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Raidar: geneRative AI Detection viA Rewriting(Chengzhi Mao, Carl Vondrick, Hao Wang, Junfeng Yang, 2024, ArXiv Preprint)
- Reviewing the performance of AI detection tools in differentiating between AI-generated and human-written texts: A literature and integrative hybrid review(C Chaka, 2024, Journal of Applied Learning & Teaching)
- BiScope: AI-generated Text Detection by Checking Memorization of Preceding Tokens(Hanxi Guo, Siyuan Cheng, Xiaolong Jin, Zhuo Zhang, Kaiyuan Zhang, Guanhong Tao, Guangyu Shen, Xiangyu Zhang, 2024, Advances in Neural Information Processing Systems 37)
- SeqXGPT: Sentence-Level AI-Generated Text Detection(Pengyu Wang, Linyang Li, Ke Ren, Botian Jiang, Dong Zhang, Xipeng Qiu, 2023, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing)
- Robust detection of LLM-generated text through transfer learning with pre-trained Distilled BERT model(Jayaprakash Sundararaj, Durgaraman Maruthavanan, Deepak Jayabalan, Ashok Gadi Parthi, Balakrishna Pothineni, Vidyasagar Parlapalli, 2024, European Journal of Computer Science and Information Technology)
- Decoding the AI Pen: Techniques and Challenges in Detecting AI-Generated Text(Sara Abdali, Richard Anarfi, C. Barberan, Jia He, 2024, Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
- Multi-Hierarchical Feature Detection for Large Language Model Generated Text(Luyan Zhang, Xinyu Xie, 2025, ArXiv Preprint)
- Enhancing Text Authenticity: A Novel Hybrid Approach for AI-Generated Text Detection(Ye Zhang, Qian Leng, Mengran Zhu, Rui Ding, Yue Wu, Jintong Song, Yulu Gong, 2024, ArXiv Preprint)
- LuxVeri at GenAI Detection Task 1: Inverse Perplexity Weighted Ensemble for Robust Detection of AI-Generated Text across English and Multilingual Contexts(Md Kamrujjaman Mobin, Md Saiful Islam, 2025, ArXiv Preprint)
- Sarang at DEFACTIFY 4.0: Detecting AI-Generated Text Using Noised Data and an Ensemble of DeBERTa Models(Avinash Trivedi, Sangeetha Sivanesan, 2025, ArXiv Preprint)
- Distinguishing Human-Generated and AI-Generated Academic Writing: A Machine Learning Benchmark Study(Ali Raza, Mohib Ullah, R. Khan, Adeem Ali Anwar, Muhammad Inam Ul Haq, Shazia Riaz, 2026, VFAST Transactions on Software Engineering)
- Using AI-based detectors to control AI-assisted plagiarism in ESL writing: “The Terminator Versus the Machines”(Karim Ibrahim, 2023, Language Testing in Asia)
- Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text(Ahmed M. Elkhatat, Khaled Elsaid, S. Almeer, 2023, International Journal for Educational Integrity)
- Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement(Zihao Cheng, Li Zhou, Feng Jiang, Benyou Wang, Haizhou Li, 2025, Proceedings of the ACM on Web Conference 2025)
- DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning(Wanquan Feng, Xun Guo, Yongxin He, Haibin Huang, Chongyang Ma, Shan Zhang, Ting Zhang, 2024, Advances in Neural Information Processing Systems 37)
- Detecting AI-Generated Sentences in Human-AI Collaborative Hybrid Texts: Challenges, Strategies, and Insights(Zijie Zeng, Shiqi Liu, Lele Sha, Zhuang Li, Kaixun Yang, Sannyuya Liu, Dragan Gašević, Guanliang Chen, 2024, ArXiv Preprint)
- Perceptions and detection of AI use in manuscript preparation for academic journals(Nir Chemaya, Daniel Martin, 2023, PLOS ONE)
- Between human and AI: assessing the reliability of AI text detection tools(Valentina Bellini, Federico Semeraro, J. Montomoli, M. Cascella, E. Bignami, 2024, Current Medical Research and Opinion)
- UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models(Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao, 2025, ArXiv Preprint)
Stress-Testing Detection Reliability: Evaluation Under Realistic Threat Conditions and False-Positive/False-Negative Analysis
Evaluating detector fragility from an adversarial perspective under realistic threat conditions, with unified reliability metrics: testing the false-positive/false-negative behavior of diverse detectors under recursive paraphrasing, prompting/generation strategies, and cross-dataset/cross-model conditions, with evaluation targets such as TPR at a fixed low FPR (a minimal sketch of this metric appears after this group's references).
- Can AI-Generated Text be Reliably Detected?(Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, Soheil Feizi, 2023, ArXiv Preprint)
- A Practical Examination of AI-Generated Text Detectors for Large Language Models(Brian Tufts, Xuandong Zhao, Lei Li, 2025, Findings of the Association for Computational Linguistics: NAACL 2025)
- The Effectiveness of Software Designed to Detect AI-Generated Writing: A Comparison of 16 AI Text Detectors(William H. Walters, 2023, Open Information Science)
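As a concrete reference for the fixed-low-FPR target used throughout this group, the following is a minimal sketch (assumptions: detector scores where higher means "more likely machine-generated"; `human_scores` and `machine_scores` are hypothetical score arrays, not any specific paper's output):

```python
# Minimal sketch of TPR at a fixed low FPR: choose the threshold from the
# human-score distribution, then measure recall on machine-generated text.
import numpy as np

def tpr_at_fpr(human_scores: np.ndarray, machine_scores: np.ndarray,
               target_fpr: float = 0.01) -> float:
    # Threshold such that roughly target_fpr of human texts are falsely flagged.
    threshold = np.quantile(human_scores, 1.0 - target_fpr)
    return float((machine_scores >= threshold).mean())
```

Fixing the FPR before reading off TPR reflects the asymmetric cost of falsely accusing human authors, which is the recurring concern in this group.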
Adversarial Evasion and Robust Defense: Paraphrase/Perturbation Attacks, Adversarial Evaluation, and Defense Strategies
Focused on adversarial evasion and robust defense: semantics-preserving paraphrase/perturbation attacks whose goal is to bypass detectors; analysis of model vulnerabilities to semantic-level perturbation (e.g., part-of-speech bias and reliance on linguistic structure); and defense mechanisms such as adversarial training and attention/feature-level adversarial-example detection. It also covers adversarial attacks and defenses under black-box or restricted access (e.g., universal paraphrasing attacks, attack-agnostic detection, and attack-then-repair frameworks). Evasion and robustness are the explicit through-line of this group; a generic evaluation harness is sketched after its references.
- Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense(Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, Mohit Iyyer, 2023, ArXiv Preprint)
- Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations(Lekkala Sai Teja, Annepaka Yadagiri, Sangam Sai Anish, Siva Gopala Krishna Nuthakki, Partha Pakray, 2025, ArXiv Preprint)
- Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection(Xinlin Peng, Ying Zhou, Ben He, Le Sun, Yingfei Sun, 2024, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing)
- ADS-detector: An attention-based dual stream adversarial example detection method(Sensen Guo, Xiaoyu Li, Peican Zhu, Zhiying Mu, 2023, Knowledge-Based Systems)
- Arabic Synonym BERT-based Adversarial Examples for Text Classification(Norah Alshahrani, Saied Alshahrani, Esma Wali, Jeanna Matthews, 2024, ArXiv Preprint)
- Evaluating Text Classification Robustness to Part-of-Speech Adversarial Examples(Anahita Samadi, Allison Sullivan, 2024, ArXiv Preprint)
- RADAR: Robust AI-Text Detection via Adversarial Learning(Pin-Yu Chen, Tsung-Yi Ho, Xiaomeng Hu, 2023, Advances in Neural Information Processing Systems 36)
- Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text(Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, Soheil Feizi, 2025, ArXiv Preprint)
- Robustness of generative AI detection: adversarial attacks on black-box neural text detectors(Vitalii Fishchuk, Daniel Braun, 2024, International Journal of Speech Technology)
- Unlocking Pandora's Box: Unveiling the Elusive Realm of AI Text Detection(Toluwani Aremu, 2023, SSRN Electronic Journal)
- Complete Evasion, Zero Modification: PDF Attacks on AI Text Detection(Aldan Creo, 2025, ArXiv Preprint)
- Iron Sharpens Iron: Defending Against Attacks in Machine-Generated Text Detection with Adversarial Training(Yuanfan Li, Zhaohan Zhang, Chengzhengxu Li, Chao Shen, Xiaoming Liu, 2025, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- TextDefense: Adversarial Text Detection Based on Word Importance Score Dispersion(Lujia Shen, Yuwen Pu, Xuhong Zhang, Chunpeng Ge, Xing Yang, Hao Peng, Wei Wang, Shouling Ji, 2025, IEEE Transactions on Dependable and Secure Computing)
- J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News(Tharindu Kumarage, Amrita Bhattacharjee, Djordje Padejski, Kristy Roschke, Dan Gillmor, Scott W. Ruston, Huan Liu, Joshua Garland, 2023, Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers))
- SilverSpeak: Evading AI-Generated Text Detectors using Homoglyphs(Aldan Creo, Shushanta Pudasaini, 2024, ArXiv Preprint)
- TextGuise: Adaptive adversarial example attacks on text classification model(Guoqin Chang, Haichang Gao, Zhou Yao, Haoquan Xiong, 2023, Neurocomputing)
- On Adversarial Examples for Text Classification by Perturbing Latent Representations(Korn Sooksatra, Bikram Khanal, Pablo Rivas, 2024, ArXiv Preprint)
- Defense of Adversarial Ranking Attack in Text Retrieval: Benchmark and Baseline via Detection(Xuanang Chen, Ben He, Le Sun, Yingfei Sun, 2023, ArXiv Preprint)
- The Best Defense is Attack: Repairing Semantics in Textual Adversarial Examples(Heng Yang, Ke Li, 2024, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing)
- Explainable Artificial Intelligence with Integrated Gradients for the Detection of Adversarial Attacks on Text Classifiers(Harsha Moraliyage, Geemini Kulawardana, Daswin de Silva, Zafar Issadeen, Milos Manic, Seiichiro Katsura, 2025, Applied System Innovation)
- DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection(Xiao Yu, Yuang Qi, Kejiang Chen, Guoqiang Chen, Xi Yang, Pengyuan Zhu, Xiuwei Shang, Weiming Zhang, Nenghai Yu, 2023, ArXiv Preprint)
- Robust AI-Generated Text Detection by Restricted Embeddings(Kristian Kuznetsov, Eduard Tulchinskii, Laida Kushnareva, German Magai, Serguei Barannikov, Sergey Nikolenko, Irina Piontkovskaya, 2024, Findings of the Association for Computational Linguistics: EMNLP 2024)
- Plagiarism Detection: Identifying AI-Generated and Paraphrased Content(Ankita Kumari, Netanya Singh, Spoorthi V, Kavitha S.N, 2024, 2024 8th International Conference on Computational System and Information Technology for Sustainable Solutions (CSITSS))
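A generic harness for the paraphrase-evasion evaluations in this group reduces to a before/after comparison; this is a sketch only, with `detect_score` and `paraphrase` as placeholder callables rather than any specific paper's components:

```python
# Measure how a detector's detection rate at a fixed threshold drops when
# machine-generated text is passed through a paraphraser.
from typing import Callable, Sequence

def evasion_eval(machine_texts: Sequence[str],
                 detect_score: Callable[[str], float],
                 paraphrase: Callable[[str], str],
                 threshold: float) -> tuple[float, float]:
    def rate(texts: Sequence[str]) -> float:
        return sum(detect_score(t) >= threshold for t in texts) / len(texts)
    return rate(machine_texts), rate([paraphrase(t) for t in machine_texts])
```

The gap between the two returned rates is the evasion effect; defenses such as retrieval or adversarial training aim to shrink it.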
Watermarking and Verifiable Provenance: Generation-Mechanism Signal Detection, Robustness, and Watermark Evasion/Security
Watermarking and verifiable provenance built on signals in the generation mechanism: ensemble, entropy-aware, and deep-learning watermarks, conditional watermarking, and robustly verifiable frameworks; evasion attacks against watermarks and their systemic side effects (e.g., watermarking degrading alignment, and alignment resampling as mitigation); plus detection tools that record token probabilities for third-party provenance. What separates this group from purely statistical discrimination or adversarial evaluation is that the core evidence is a verifiable watermark or generation-mechanism statistic rather than generic classification features (a minimal green-list detection sketch follows this group's references).
- Ensemble Watermarks for Large Language Models(Georg Niess, Roman Kern, 2024, ArXiv Preprint)
- An Entropy-based Text Watermarking Detection Method(Yijian Lu, Aiwei Liu, Dianzhi Yu, Jingjing Li, Irwin King, 2024, ArXiv Preprint)
- DeepTextMark: A Deep Learning-Driven Text Watermarking Approach for Identifying Large Language Model Generated Text(Travis J. E. Munyer, A. Tanvir, A. Das, Xin Zhong, 2023, IEEE Access)
- BiMarker: Enhancing Text Watermark Detection for Large Language Models with Bipolar Watermarks(Zhuang Li, Qiuping Yi, Zongcheng Ji, Yijian Lu, Yanqi Li, Keyang Xiao, Hongliang Liang, 2025, ArXiv Preprint)
- Toward Evasion-Resistant LLM Attribution with Multi-Scale Watermarking and Cryptographic Verification(Pieter Janssen, E. Conti, 2026, Frontiers in Artificial Intelligence Research)
- Watermarking Degrades Alignment in Language Models: Analysis and Mitigation(Apurv Verma, NhatHai Phan, Shubhendu Trivedi, 2025, ArXiv Preprint)
- Adaptive Robust Watermarking for Large Language Models via Dynamic Token Embedding Perturbation(Ziyang Zeng, Han Lin, Shuxin Zhang, Boyuan Wang, 2026, IEEE Access)
- A Watermark for Large Language Models(John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein, 2023, ArXiv Preprint)
- Embarrassingly Simple Text Watermarks(Ryoma Sato, Yuki Takezawa, Han Bao, Kenta Niwa, Makoto Yamada, 2023, ArXiv Preprint)
- Mark Your LLM: Detecting the Misuse of Open-Source Large Language Models via Watermarking(Yijie Xu, Aiwei Liu, Xuming Hu, Lijie Wen, Hui Xiong, 2025, ArXiv Preprint)
- LLMDet: A Third Party Large Language Models Generated Text Detection Tool(Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng, Tat-Seng Chua, 2023, Findings of the Association for Computational Linguistics: EMNLP 2023)
- LLM Watermark Evasion via Bias Inversion(Jeongyeon Hwang, Sangdon Park, Jungseul Ok, 2025, ArXiv Preprint)
- Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy(Yu Fu, Deyi Xiong, Yue Dong, 2023, ArXiv Preprint)
- On the Reliability of Watermarks for Large Language Models(John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, Tom Goldstein, 2023, ArXiv Preprint)
- Watermarking for AI Content Detection: A Review on Text, Visual, and Audio Modalities(Lele Cao, 2025, ArXiv Preprint)
- Modification and Generated-Text Detection: Achieving Dual Detection Capabilities for the Outputs of LLM by Watermark(Yuhang Cai, Yaofei Wang, Donghui Hu, Chen Gu, 2025, ArXiv Preprint)
- CoheMark: A Novel Sentence-Level Watermark for Enhanced Text Quality(Junyan Zhang, Shuliang Liu, Aiwei Liu, Yubo Gao, Jungang Li, Xiaojie Gu, Xuming Hu, 2025, ArXiv Preprint)
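To make the "generation-mechanism signal" concrete, here is a minimal detection-side sketch in the spirit of the soft green-list scheme of Kirchenbauer et al. (2023); the vocabulary size, green fraction, and hashing scheme below are illustrative assumptions, not the paper's exact choices:

```python
# Green-list watermark detection: the generator biased sampling toward a
# pseudo-random "green" subset of the vocabulary (seeded by the previous
# token), so watermarked text shows a statistically improbable surplus of
# green tokens, verifiable without access to the model.
import hashlib
import math
import random

VOCAB_SIZE = 50_000  # assumed tokenizer vocabulary size
GAMMA = 0.25         # assumed fraction of the vocabulary marked "green"

def green_list(prev_token: int) -> set[int]:
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % 2**32
    rng = random.Random(seed)
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))

def watermark_z_score(token_ids: list[int]) -> float:
    hits = sum(cur in green_list(prev)
               for prev, cur in zip(token_ids, token_ids[1:]))
    n = len(token_ids) - 1
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

# A large positive z-score (e.g., > 4) indicates watermarked text.
```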
LLM Rewriting/Contrast and Low-Query Sampling Detection: DetectGPT and the "LLM-as-Tool" Discrimination Paradigm
Detection paradigms built on LLM rewriting/contrast and low-query sampling: using an LLM as a tool to rewrite candidate text or generate contrast samples, then discriminating via similarity/consistency or probability-distribution differences; or DetectGPT-style detection via repeated perturbation sampling, accelerated with low-query budgets or surrogate inference. Unlike watermarking or general adversarial robustness, this group emphasizes the inference mechanism of the detection paradigm and query-budget optimization (a curvature-test sketch follows this group's references).
- Beat LLMs at Their Own Game: Zero-Shot LLM-Generated Text Detection via Querying ChatGPT(Biru Zhu, Lifan Yuan, Ganqu Cui, Yangyi Chen, Chong Fu, Bingxiang He, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu, 2023, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing)
- Fighting Fire with Fire: Can ChatGPT Detect AI-generated Text?(Amrita Bhattacharjee, Huan Liu, 2023, ACM SIGKDD Explorations Newsletter)
- Efficient Detection of LLM-generated Texts with a Bayesian Surrogate Model(Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng, 2024, Findings of the Association for Computational Linguistics ACL 2024)
- Unlocking Pandora's Box: Unveiling the Elusive Realm of AI Text Detection(Toluwani Aremu, 2023, SSRN Electronic Journal)
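The DetectGPT-style paradigm this group builds on fits in a few lines; the sketch below assumes `log_prob` scores text under the source (or a surrogate) LLM and `perturb` is a semantics-preserving rewriter such as T5 mask-filling, both placeholder callables. The low-query methods above replace the flat perturbation loop with surrogate-selected samples:

```python
# DetectGPT probability-curvature test: machine text tends to sit near a
# local maximum of the model's log-probability, so perturbations lower it.
import statistics
from typing import Callable

def detectgpt_score(text: str,
                    log_prob: Callable[[str], float],
                    perturb: Callable[[str], str],
                    n_perturbations: int = 100) -> float:
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    mu = statistics.mean(perturbed)
    sigma = statistics.stdev(perturbed) or 1e-8
    return (log_prob(text) - mu) / sigma  # large positive => likely machine
```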
Misclassification and Governance in Real-World Applications: Writing Practice, Education/Journal Settings, and False-Positive/False-Negative Risks
Misclassification, reliability, and the impact of evasive use in real-world settings: the limitations of detection tools in education, journal publishing, and writing practice; the risk of wrongly flagging human authors; bias introduced by "humanizing" rewrites and evasive use; and cross-platform/cross-tool consistency and the need for explanations. This group concerns governance- and deployment-level application assessment rather than purely algorithmic adversarial experiments.
- WHO WROTE THIS ESSAY? DETECTING AI-GENERATED WRITING IN SECOND LANGUAGE EDUCATION IN HIGHER EDUCATION(Katarzyna Alexander, Christine Savvidou, Chris Alexander, 2023, Teaching English With Technology)
- Artificial Intelligence Content Detector in Paper Writing: Beyond the Detection(Shigeki Matsubara, 2024, Annals of Surgical Oncology)
- Beyond detection: GenAI in EAL writing education(Yachao Sun, 2025, Elt Journal)
- Ability of AI detection tools and humans to accurately identify different forms of AI-generated written content(Adam Cheng, Yiqun Lin, Gabriel Reedy, Christine Joseph, Samantha Wirkowski, Viviane Mallette, Vikhashni Nagesh, David Krieser, Aaron Calhoun, 2025, Advances in Simulation)
Detecting and Mitigating Information Distortion/Hallucination in Generation (an Extension to Scientific Text)
Extending "detection and countermeasures" to information distortion and hallucination in scientific text: beyond judging whether text is AIGC, detecting deviations in factuality/veracity (hallucination, information distortion) and providing ensemble signals and post-editing mitigation. Because its research goal is clearly distinct from AIGC source attribution, this group is listed separately as an extension direction.
- Hallucination Detection and Mitigation in Scientific Text Simplification using Ensemble Approaches: DS@GT at CLEF 2025 SimpleText(Krishna Chaitanya Marturi, Heba H. Elwazzan, 2025, ArXiv Preprint)
- MLSDET: Multi-LLM Statistical Deep Ensemble for Chinese AI-Generated Text Detection(Dianhui Mao, Denghui Zhang, Ao Zhang, Zhihua Zhao, 2025, ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- A Framework for Enhancing Accuracy in AI Generated Text Detection Using Ensemble Modelling(Kush Aggarwal, Sahib Singh, Parul, Vipin Pal, S. Yadav, 2024, 2024 IEEE Region 10 Symposium (TENSYMP))
- Evaluating the Efficacy of Perplexity Scores in Distinguishing AI-Generated and Human-Written Abstracts.(Alperen Elek, Hatice Sude Yildiz, Benan Akca, Nisa Cem Oren, Batuhan Gundogdu, 2025, Academic Radiology)
After merging, eight parallel groups emerge: surveys and detectability theory; detector methods and benchmark evaluation; detection-reliability stress testing; adversarial evasion and robust defense; watermarking and verifiable provenance; LLM-rewriting/low-query sampling detection; real-world application governance; and the scientific-text hallucination-detection extension. The overall line of research starts from method construction and evaluation, then probes fragility through adversarial and reliability stress testing; watermarking supplies verifiable evidence, while LLM-as-tool detection paradigms and low-query acceleration strategies develop alongside; finally, misclassification and side effects are assessed in broader application settings such as education/writing governance and scientific-text hallucination mitigation.
A total of 95 related references.
Widely applied large language models (LLMs) can generate human-like content, raising concerns about the abuse of LLMs. Therefore, it is important to build strong AI-generated text (AIGT) detectors. Current works only consider document-level AIGT detection, therefore, in this paper, we first introduce a sentence-level detection challenge by synthesizing a dataset that contains documents that are polished with LLMs, that is, the documents contain sentences written by humans and sentences modified by LLMs. Then we propose Sequence X (Check) GPT, a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection. These features are composed like waves in speech processing and cannot be studied by LLMs. Therefore, we build SeqXGPT based on convolution and self-attention networks. We test it in both sentence and document-level detection challenges. Experimental results show that previous methods struggle in solving sentence-level AIGT detection, while our method not only significantly surpasses baseline methods in both sentence and document-level detection challenges but also exhibits strong generalization capabilities.
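As a simplified analogue of the architecture this abstract describes (not the authors' exact model; channel count, kernel size, and dimensions are illustrative), per-token log-probability "waves" from several white-box LLMs can be stacked as channels and passed through convolution plus self-attention:

```python
# Toy SeqXGPT-style detector: log-prob "waves" in, per-token logits out.
import torch
import torch.nn as nn

class WaveDetector(nn.Module):
    def __init__(self, n_llms: int = 4, d_model: int = 64, n_classes: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(n_llms, d_model, kernel_size=5, padding=2)
        self.attn = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, waves: torch.Tensor) -> torch.Tensor:
        # waves: (batch, n_llms, seq_len) per-token log-probabilities
        h = self.conv(waves).transpose(1, 2)  # (batch, seq_len, d_model)
        return self.head(self.attn(h))        # per-token class logits

logits = WaveDetector()(torch.randn(2, 4, 128))  # toy random "waves"
```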
This review examines the rapidly evolving field of AI-generated text detection, which has gained critical importance following the widespread deployment of advanced large language …
Abstract Objective Large language models (LLMs) such as ChatGPT-4 have raised critical questions regarding their distinguishability from human-generated content. In this research, we evaluated the effectiveness of online detection tools in identifying ChatGPT-4 vs human-written text. Methods Two texts produced by ChatGPT-4 using differing prompts and one text created by a human author were analytically assessed using the following online detection tools: GPTZero, ZeroGPT, Writer ACD, and Originality. Results The findings revealed a notable variance in the detection capabilities of the employed detection tools. GPTZero and ZeroGPT exhibited inconsistent assessments regarding the AI-origin of the texts. Writer ACD predominantly identified texts as human-written, whereas Originality consistently recognized the AI-generated content in both samples from ChatGPT-4. This highlights Originality’s enhanced sensitivity to patterns characteristic of AI-generated text. Conclusion The study demonstrates that while automatic detection tools may discern texts generated by ChatGPT-4, significant variability exists in their accuracy. Undoubtedly, there is an urgent need for advanced detection tools to ensure the authenticity and integrity of content, especially in scientific and academic research. However, our findings underscore an urgent need for more refined detection methodologies to prevent the misdetection of human-written content as AI-generated and vice versa.
… of AI-generated content (AIGC) is rapidly improving, and the correctness and detection of … In this paper, we review the current methods of AIGC detection and introduce the definition, …
Detecting text generated by Large Language Models (LLMs) is a pressing need in order to identify and prevent misuse of these powerful models in a wide range of applications, which have highly undesirable consequences such as misinformation and academic dishonesty. Given a piece of subject text, many existing detection methods work by measuring the difficulty of LLM predicting the next token in the text from their prefix. In this paper, we make a critical observation that how well the current token’s output logits memorizes the closely preceding input tokens also provides strong evidence. Therefore, we propose a novel bi-directional calculation method that measures the cross-entropy losses between an output logits and the ground-truth token (forward) and between the output logits and the immediately preceding input token (backward). A classifier is trained to make the final prediction based on the statistics of these losses. We evaluate our system, named BiScope, on texts generated by five latest commercial LLMs across five heterogeneous datasets, including both natural language and code. BiScope demonstrates superior detection accuracy and robustness compared to nine existing baseline methods, exceeding the state-of-the-art non-commercial methods’ detection accuracy by over 0.30 F1 score, achieving over 0.95 detection F1 score on average. It also outperforms the best commercial tool GPTZero that is based on a commercial LLM trained with an enormous volume of data. Code is available at https://github.com/MarkGHX/BiScope .
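One plausible reading of the bi-directional statistic described here, sketched under the assumption that `logits[t]` is the model's output after reading token `ids[t]` (i.e., its prediction for `ids[t+1]`):

```python
import torch
import torch.nn.functional as F

def biscope_style_losses(logits: torch.Tensor, ids: torch.Tensor):
    # logits: (seq_len, vocab_size); ids: (seq_len,) long tensor.
    # Forward: cross-entropy against the ground-truth *next* token.
    fwd = F.cross_entropy(logits[:-1], ids[1:], reduction="none")
    # Backward: cross-entropy against the immediately *preceding* input
    # token, i.e., how strongly the logits "memorize" what was just read.
    bwd = F.cross_entropy(logits, ids, reduction="none")
    return fwd, bwd  # summary statistics of these feed a final classifier
```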
Kristian Kuznetsov, Eduard Tulchinskii, Laida Kushnareva, German Magai, Serguei Barannikov, Sergey Nikolenko, Irina Piontkovskaya. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024.
With the proliferation of AI-driven technologies, the generation of synthetic text has become increasingly prevalent, posing significant challenges in distinguishing between human-generated and AI-generated content (LLM). To mitigate this challenge, a novel approach is proposed in this paper for AI-generated text detection through an ensemble-modelling-based framework, leveraging the strengths of multiple state-of-the-art language models. The proposed ensemble model integrates BERT, DeBERTa, and a custom ensemble method, each contributing to the collective decision-making process with weighted predictions. A diverse dataset sourced from various online platforms is used; this dataset comprises both human-written and AI-generated text samples. A fine-tuning strategy is used that dynamically adjusts the weights of the ensemble model based on the validation accuracy of each constituent model, while applying a cosine learning rate scheduler during training to optimize performance. The effectiveness of the ensemble model is evaluated using standard performance metrics such as accuracy, recall and F1 score. The proposed model achieved an accuracy over 94% and a high recall of 98% through the ensemble framework, demonstrating an accuracy improvement of 4.1% over BERT and robustness in detecting AI-generated text across different domains and languages. The research contributes to advancing the field of AI-generated text detection and addresses critical challenges in content moderation and verification in online environments.
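The weighted decision rule this abstract describes reduces to soft voting with validation-derived weights; a generic sketch (the component models and accuracies are placeholders, not the paper's exact configuration):

```python
import numpy as np

def weighted_soft_vote(probs_per_model: list[np.ndarray],
                       val_accuracies: list[float]) -> np.ndarray:
    """probs_per_model: per-model arrays of shape (n_samples, n_classes)."""
    w = np.asarray(val_accuracies, dtype=float)
    w /= w.sum()  # normalize to a convex combination of model opinions
    return np.tensordot(w, np.stack(probs_per_model), axes=1)
```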
… In line with this key observation, we propose to reformulate AI-generated text detection as a task of distinguishing diverse writing styles within the feature space, rather than merely …
Large language models (LLMs) such as ChatGPT are increasingly being used for various use cases, including text content generation at scale. Although detection methods for such AI-generated text exist already, we investigate ChatGPT's performance as a detector on such AI-generated text, inspired by works that use ChatGPT as a data labeler or annotator. We evaluate the zero-shot performance of ChatGPT in the task of human-written vs. AI-generated text detection, and perform experiments on publicly available datasets. We empirically investigate if ChatGPT is symmetrically effective in detecting AI-generated or human-written text. Our findings provide insight on how ChatGPT and similar LLMs may be leveraged in automated detection pipelines by simply focusing on solving a specific aspect of the problem and deriving the rest from that solution. All code and data are available at https://github.com/AmritaBh/ChatGPT-as-Detector.
The proliferation of artificial intelligence (AI)-generated content, particularly from models like ChatGPT, presents potential challenges to academic integrity and raises concerns about plagiarism. This study investigates the capabilities of various AI content detection tools in discerning human and AI-authored content. Fifteen paragraphs each from ChatGPT Models 3.5 and 4 on the topic of cooling towers in the engineering process and five human-written control responses were generated for evaluation. AI content detection tools developed by OpenAI, Writer, Copyleaks, GPTZero, and CrossPlag were used to evaluate these paragraphs. Findings reveal that the AI detection tools were more accurate in identifying content generated by GPT 3.5 than GPT 4. However, when applied to human-written control responses, the tools exhibited inconsistencies, producing false positives and uncertain classifications. This study underscores the need for further development and refinement of AI content detection tools as AI-generated content becomes more sophisticated and harder to distinguish from human-written text.
Abstract This study evaluates the accuracy of 16 publicly available AI text detectors in discriminating between AI-generated and human-generated writing. The evaluated documents include 42 undergraduate essays generated by ChatGPT-3.5, 42 generated by ChatGPT-4, and 42 written by students in a first-year composition course without the use of AI. Each detector’s performance was assessed with regard to its overall accuracy, its accuracy with each type of document, its decisiveness (the relative number of uncertain responses), the number of false positives (human-generated papers designated as AI by the detector), and the number of false negatives (AI-generated papers designated as human). Three detectors – Copyleaks, TurnItIn, and Originality.ai – have high accuracy with all three sets of documents. Although most of the other 13 detectors can distinguish between GPT-3.5 papers and human-generated papers with reasonably high accuracy, they are generally ineffective at distinguishing between GPT-4 papers and those written by undergraduate students. Overall, the detectors that require registration and payment are only slightly more accurate than the others.
… showed that artificial intelligence (AI)content detectors (‘Detector’) sometimes labeled humanwritten manuscripts as AI-generated and vice versa. Therefore, will a ‘complete Detector’ be …
… -generated text with 100% precision (with no false positives) and human-written text with 100% precision (with no false negatives) across all contexts of writing, then, all reviewed AI …
… seek to escape detection. Our results indicate a very high FDR due to a low detector-facing … alterations and an arms race between AI detectors and AI generators. Our study raises …
The increasing use of artificial intelligence (AI) by scholars presents a pressing challenge to healthcare publishing. While legitimate use can potentially accelerate scholarship, unethical approaches also exist, leading to factually inaccurate and biased text that may degrade scholarship. Numerous online AI detection tools exist that provide a percentage score of AI use. These can assist authors and editors in navigating this landscape. In this study, we compared the scores from three AI detection tools (ZeroGPT, PhraslyAI, and Grammarly AI Detector) across five plausible conditions of AI use and evaluated them against human assessments. Thirty open access articles published in the journals Advances in Simulation and Simulation in Healthcare prior to 2022 were selected, and the article introductions were extracted. Five experimental conditions were examined, including: (1) 100% human written; (2) human written, light AI editing; (3) human written, heavy AI editing; (4) AI written text from human content; and (5) 100% AI written from article title. The resulting materials were assessed by three open-access AI detection tools and five blinded human raters. Results were summarized descriptively and compared using repeated measures analysis of variance (ANOVA), intraclass correlation coefficients (ICC), and Bland–Altman plots. The three AI detection tools were able to differentiate between the five test conditions (p < 0.001 for all), but varied significantly in absolute score, with ICC ranging from 0.57 to 0.95, raising concerns regarding overall reliability of these tools. Human scoring was far less consistent, with an overall accuracy of 19%, indistinguishable from chance. While existing AI detection tools can meaningfully distinguish plausible AI use conditions, reliability across these tools is variable. Human scoring accuracy is uniformly low. Use of AI detection tools by scholars and journal editors may assist in determining potentially unethical use but they should not be relied upon alone at this time.
… submitted to an AI text detector which indicated … by AI compared to two cases likely generated by AI in the control group. Thus, this study suggests that the current quality of text detection …
The release of ChatGPT marked the beginning of a new era of AI-assisted plagiarism that disrupts traditional assessment practices in ESL composition. In the face of this challenge, educators are left with little guidance in controlling AI-assisted plagiarism, especially when conventional methods fail to detect AI-generated texts. One approach to managing AI-assisted plagiarism is using fine-tuned AI classifiers, such as RoBERTa, to identify machine-generated texts; however, the reliability of this approach is yet to be established. To address the challenge of AI-assisted plagiarism in ESL contexts, the present cross-disciplinary descriptive study examined the potential of two RoBERTa-based classifiers to control AI-assisted plagiarism on a dataset of 240 human-written and ChatGPT-generated essays. Data analysis revealed that both platforms could identify AI-generated texts, but their detection accuracy was inconsistent across the dataset.
Abstract This study investigates how generative artificial intelligence (GenAI) shapes the writing practices of five Chinese English as an additional language (EAL) students and how emerging GenAI-text detection practices mediate that use. Findings from screen-recordings, written drafts, and interviews reveal four recurrent functions of GenAI use, that is, brainstorming, structuring, occasional drafting, and revising, yet students calibrated their reliance on each function against the perceived risk of being flagged by GenAI detectors. To mitigate that risk, participants shuttled between ChatGPT, Grammarly, and self-paraphrase to ‘humanise’ GenAI output, occasionally reallocating effort from idea development to surface revision. The study, therefore, problematises a detection-first approach, showing that it (1) engenders strategic but pedagogically hollow text manipulation, (2) complicates ‘plagiarised’ versus ‘original’ writing, and (3) potentially shifts assessment towards product-centric criteria. The study argues for cultivating critical AI literacy so that GenAI-assisted EAL writing instruction can prioritize creativity, voice, and argument quality over algorithmic policing.
The rapid advances in Generative AI tools have produced both excitement and worry about how AI will impact academic writing. However, little is known about what norms are emerging around AI use in manuscript preparation or how these norms might be enforced. We address both gaps in the literature by conducting a survey of 271 academics about whether it is necessary to report ChatGPT use in manuscript preparation and by running GPT-modified abstracts from 2,716 published papers through a leading AI detection software to see if these detectors can detect different AI uses in manuscript preparation. We find that most academics do not think that using ChatGPT to fix grammar needs to be reported, but detection software did not always draw this distinction, as abstracts for which GPT was used to fix grammar were often flagged as having a high chance of being written by AI. We also find disagreements among academics on whether more substantial use of ChatGPT to rewrite text needs to be reported, and these differences were related to perceptions of ethics, academic role, and English language background. Finally, we found little difference in their perceptions about reporting ChatGPT and research assistant help, but significant differences in reporting perceptions between these sources of assistance and paid proofreading and other AI assistant tools (Grammarly and Word). Our results suggest that there might be challenges in getting authors to report AI use in manuscript preparation because (i) there is not uniform agreement about what uses of AI should be reported and (ii) journals might have trouble enforcing nuanced reporting requirements using AI detection tools.
… is motivated by realistic socio-technological challenges such as fake content generation, AI plagiarism (eg using LLMs for writing tests), and false accusations of innocent writers. …
Large language models (LLMs) have advanced to a point that even humans have difficulty discerning whether a text was generated by another human, or by a computer. However, knowing whether a text was produced by human or artificial intelligence (AI) is important to determining its trustworthiness, and has applications in many domains including detecting fraud and academic dishonesty, as well as combating the spread of misinformation and political propaganda. The task of AI-generated text (AIGT) detection is therefore both very challenging, and highly critical. In this survey, we summarize state-of-the-art approaches to AIGT detection, including watermarking, statistical and stylistic analysis, and machine learning classification. We also provide information about existing datasets for this task. Synthesizing the research findings, we aim to provide insight into the salient factors that combine to determine how “detectable” AIGT text is under different scenarios, and to make practical recommendations for future work towards this significant technical and societal challenge.
With the rapid development of artificial intelligence (AI) tools, concerns emerge regarding students’ unethical uses of these tools to produce AI-generated research texts or their parts, and to present them as original writing. This issue is compounded by the lack of reliable tools for detecting machine-generated text. To address these concerns, the present study aimed to identify distinctive features of ChatGPT-generated research proposal literature reviews ( N = 45) and investigate the presence of these features in English-language literature reviews produced by undergraduate and graduate students from two Russian universities. During the first stage, an analysis of AI-generated texts and a small sample of graduate students’ ( N = 12) literature reviews was conducted. Findings revealed that many features typical of AI-generated texts were clearly present in student texts suggesting that these features may serve as indicators of machine-generated writing. One such feature was the unusually high recurrence of lexical items (predominantly with abstract meanings) in both AI-generated and student texts. Drawing on these insights, a frequency analysis was performed using AntConc to explore the occurrence of these items in AI-generated texts and compile a list of the most frequent items indicative of machine-generated writing (referred to in this study as “ChatGPT language”). At the second stage, findings on the initial indicators were validated, refined, and expanded based on an analysis of a larger sample of 47 English language literature reviews prepared by bachelor and master students. The study identified ten indicators of AI-generated writing pertaining to content, structure, and language use in literature reviews, which are detailed and illustrated in the paper. The study’s findings contribute valuable practical and research insights which may aid all those involved in teaching English language academic writing, reviewing students’ academic texts, and supervising research projects across diverse EAP contexts.
This study explores the capability of academic staff assisted by the Turnitin Artificial Intelligence (AI) detection tool to identify the use of AI-generated content in university assessments. 22 different experimental submissions were produced using Open AI’s ChatGPT tool, with prompting techniques used to reduce the likelihood of AI detectors identifying AI-generated content. These submissions were marked by 15 academic staff members alongside genuine student submissions. Although the AI detection tool identified 91% of the experimental submissions as containing AI-generated content, only 54.8% of the content was identified as AI-generated, underscoring the challenges of detecting AI content when advanced prompting techniques are used. When academic staff members marked the experimental submissions, only 54.5% were reported to the academic misconduct process, emphasising the need for greater awareness of how the results of AI detectors may be interpreted. Similar performance in grades was obtained between student submissions and AI-generated content (AI mean grade: 52.3, Student mean grade: 54.4), showing the capabilities of AI tools in producing human-like responses in real-life assessment situations. Recommendations include adjusting the overall strategies for assessing university students in light of the availability of new Generative AI tools. This may include reducing the overall reliance on assessments where AI tools may be used to mimic human writing, or by using AI-inclusive assessments. Comprehensive training must be provided for both academic staff and students so that academic integrity may be preserved.
Distinguishing Human-Generated and AI-Generated Academic Writing: A Machine Learning Benchmark Study
The rapid adoption of large language models (LLMs) such as ChatGPT has raised critical questions about authorship, originality, and integrity in academic writing. Unlike conventional plagiarism testing tools, AI-generated or AI-rephrased text can preserve the original meaning and context of the text while modifying the writing style, making it challenging to detect using standard similarity checks. This study addresses this challenge by creating a domain-specific corpus of postgraduate-level academic texts. The corpus contains 22,520 samples, equally divided between human-written text and AI-rephrased text. All samples were preprocessed and represented using two common techniques: TF-IDF and Word2Vec. The dataset was evaluated using well-known machine learning and deep learning models, including Logistic Regression, Support Vector Machines, Recurrent Neural Networks, and transformer-based models BERT and T5. The results show that linear and sequential models provide low baseline performance, with accuracy between 50% and 54%, while BERT significantly outperforms the other models, achieving 83% precision along with a high recall rate. Confusion matrix analysis further shows that traditional models tend to overpredict AI authorship, whereas BERT demonstrates strong reliability in distinguishing between human-written and AI-generated text. The results show that transformer-based models are more effective for authorship verification in academic settings. They also emphasize the trade-offs among interpretability, computational cost, and predictive performance. In general, this study offers some important recommendations for the creation of credible, transparent, and domain-sensitive AI detectors for academia.
The remarkable ability of large language models (LLMs) to comprehend, interpret, and generate complex language has rapidly integrated LLM-generated text into various aspects of daily life, where users increasingly accept it. However, the growing reliance on LLMs underscores the urgent need for effective detection mechanisms to identify LLM-generated text. Such mechanisms are critical to mitigating misuse and safeguarding domains like artistic expression and social networks from potential negative consequences. LLM-generated text detection, conceptualised as a binary classification task, seeks to determine whether an LLM produced a given text. Recent advances in this field stem from innovations in watermarking techniques, statistics-based detectors, and neural-based detectors. Human-assisted methods also play a crucial role. In this survey, we consolidate recent research breakthroughs in this field, emphasising the urgent need to strengthen detector research. Additionally, we review existing datasets, highlighting their limitations and developmental requirements. Furthermore, we examine various LLM-generated text detection paradigms, shedding light on challenges like out-of-distribution problems, potential attacks, real-world data issues and ineffective evaluation frameworks. Finally, we outline intriguing directions for future research in LLM-generated text detection to advance responsible artificial intelligence (AI). This survey aims to provide a clear and comprehensive introduction for newcomers while offering seasoned researchers valuable updates in the field.
… , a novel benchmark for LLM-generated text detection. We … of human, human revisions of text such as word substitutions, and … media, to serve as samples of human-written text. To create …
Mervat Abassy, Kareem Elozeiri, Alexander Aziz, Minh Ngoc Ta, Raj Vardhan Tomar, Bimarsha Adhikari, Saad El Dine Ahmed, Yuxia Wang, Osama Mohammed Afzal, Zhuohan Xie, Jonibek Mansurov, Ekaterina Artemova, Vladislav Mikhailov, Rui Xing, Jiahui Geng, Hasan Iqbal, Zain Muhammad Mujahid, Tarek Mahmoud, Akim Tsvigun, Alham Fikri Aji, Artem Shelmanov, Nizar Habash, Iryna Gurevych, Preslav Nakov. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2024.
Natural Language Generation has been rapidly developing with the advent of large language models (LLMs). While their usage has sparked significant attention from the general public, it is important for readers to be aware when a piece of text is LLM-generated. This has brought about the need for building models that enable automated LLM-generated text detection, with the aim of mitigating potential negative outcomes of such content. Existing LLM-generated text detectors show competitive performance in telling apart LLM-generated and human-written text, but this performance is likely to deteriorate when paraphrased texts are considered. In this study, we devise a new data collection strategy to collect Human & LLM Paraphrase Collection (HLPC), a first-of-its-kind dataset that incorporates human-written texts and paraphrases, as well as LLM-generated texts and paraphrases. With the aim of understanding the effects of human-written paraphrases on the performance of SOTA LLM-generated text detectors OpenAI RoBERTa and watermark detectors, we perform classification experiments that incorporate human-written paraphrases, watermarked and non-watermarked LLM-generated documents from GPT and OPT, and LLM-generated paraphrases from DIPPER and BART. The results show that the inclusion of human-written paraphrases has a significant impact on LLM-generated text detector performance, improving TPR@1%FPR with a possible trade-off in AUROC and accuracy.
Large language models (LLMs), e.g., ChatGPT, have revolutionized the domain of natural language processing because of their excellent performance on various tasks. Despite their great potential, LLMs also incur serious concerns as they are likely to be misused. There are already reported cases of academic cheating by using LLMs. Thus, it is a pressing problem to identify LLM-generated texts. In this work, we design a zero-shot black-box method for detecting LLM-generated texts. The key idea is to revise the text to be detected using the ChatGPT model. Our method is based on the intuition that the ChatGPT model will make fewer revisions to LLM-generated texts than it does to human-written texts, because the texts generated by LLMs are more in accord with the generation logic and statistical patterns learned by LLMs like ChatGPT. Thus, if the text to be detected and its ChatGPT-revised version have a higher degree of similarity, the text is more likely to be LLM-generated. Extensive experiments on various datasets and tasks show that our method can effectively detect LLM-generated texts. Moreover, compared with other detection methods, our method has better generalization ability and is more stable across various datasets. The codes are publicly available at https://github.com/thunlp/LLM-generated-text-detection.
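The intuition above reduces to a similarity check between a text and its LLM revision; as a sketch, with `revise` standing in for a ChatGPT "polish this text" call:

```python
from difflib import SequenceMatcher
from typing import Callable

def revision_similarity(text: str, revise: Callable[[str], str]) -> float:
    # Near 1.0 means the reviser changed little, which the method reads
    # as evidence that the text was LLM-generated to begin with.
    return SequenceMatcher(None, text, revise(text)).ratio()
```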
The rapid development of large language models (LLMs), like ChatGPT, has resulted in the widespread presence of LLM-generated content on social media platforms, raising concerns about misinformation, data biases, and privacy violations, which can undermine trust in online discourse. While detecting LLM-generated content is crucial for mitigating these risks, current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-LLM collaboration. To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content. This approach introduces two novel tasks: LLM Role Recognition (LLM-RR), a multi-class classification task that identifies specific roles of an LLM in content generation, and LLM Involvement Measurement (LLM-IM), a regression task that quantifies the extent of LLM involvement in content creation. To support these tasks, we propose LLMDetect, a benchmark designed to evaluate detectors' performance on these new tasks. LLMDetect includes the Hybrid News Detection Corpus (HNDC) for training detectors, as well as DetectEval, a comprehensive evaluation suite that considers five distinct cross-context variations and two multi-intensity variations within the same LLM role. This allows for a thorough assessment of detectors' generalization and robustness across diverse contexts. Our empirical validation of 10 baseline detection methods demonstrates that fine-tuned Pre-trained Language Model (PLM)-based models consistently outperform others on both tasks, while advanced LLMs face challenges in accurately detecting their own generated content. Our experimental results and analysis offer insights for developing more effective detection models for LLM-generated content. This research enhances the understanding of LLM-generated content and establishes a foundation for more nuanced detection methodologies.
With the rapid advancements in pre-trained large language models like ChatGPT, the surge of AI-generated text, particularly in Chinese, has presented significant challenges to existing detection systems due to its increasing realism and complexity. To address this, we introduce MLSDET: a groundbreaking Multi-LLM Statistical Deep Ensemble framework designed for high-precision detection of AI-generated Chinese text. MLSDET uniquely integrates a Mixture of Experts (MoE) architecture with a novel cross-entropy metric, setting a new benchmark for robustness and generalization. By employing a diverse ensemble of large language models (LLMs), including Qwen, Wenzhong-GPT2, and LLaMA, our approach extracts intricate features such as log-rank, entropy, log-likelihood, and the newly introduced LLMs-crossEntropy, accurately capturing both model consensus and the statistical distribution differences between AI-generated and human-authored text. Experimental results on the HC3-Chinese dataset show that MLSDET surpasses traditional zero-shot methods like CLTR by 15.94% in F1 score and competes effectively with existing methods, offering a scalable solution for real-world applications.
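The statistical features named in this abstract (log-likelihood, log-rank, entropy) are standard per-token quantities read off a scoring LLM's logits; a sketch of their computation (the paper's LLMs-crossEntropy feature and MoE ensemble are not shown):

```python
import torch
import torch.nn.functional as F

def token_statistics(logits: torch.Tensor, ids: torch.Tensor):
    # logits: (seq_len, vocab_size) with logits[t] predicting ids[t+1].
    logp = F.log_softmax(logits[:-1], dim=-1)
    ll = logp.gather(1, ids[1:].unsqueeze(1)).squeeze(1)  # log-likelihood
    rank = (logp > ll.unsqueeze(1)).sum(dim=1) + 1        # rank of true token
    entropy = -(logp.exp() * logp).sum(dim=1)             # predictive entropy
    return ll.mean(), rank.float().log().mean(), entropy.mean()
```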
Generated texts from large language models (LLMs) are remarkably close to high-quality human-authored text, raising concerns about their potential misuse in spreading false information and academic misconduct. Consequently, there is an urgent need for a highly practical detection tool capable of accurately identifying the source of a given text. However, existing detection tools typically rely on access to LLMs and can only differentiate between machine-generated and human-authored text, failing to meet the requirements of fine-grained tracing, intermediary judgment, and rapid detection. Therefore, we propose LLMDet, a model-specific, secure, efficient, and extendable detection tool, that can source text from specific LLMs, such as GPT-2, OPT, LLaMA, and others. In LLMDet, we record the next-token probabilities of salient n-grams as features to calculate proxy perplexity for each LLM. By jointly analyzing the proxy perplexities of LLMs, we can determine the source of the generated text. Experimental results show that LLMDet yields impressive detection performance while ensuring speed and security, achieving 98.54% precision and running 3.5× faster when recognizing human-authored text. Additionally, LLMDet can effortlessly extend its detection capabilities to a new open-source model. We will provide an open-source tool at https://github.com/TrustedLLM/LLMDet.
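The proxy-perplexity idea reduces to table lookups over pre-recorded n-gram statistics; a sketch under assumptions (`ngram_probs`, mapping an n-token context plus next token to a probability, stands in for LLMDet's per-LLM dictionaries; the back-off default is illustrative):

```python
import math

def proxy_perplexity(tokens: list[str],
                     ngram_probs: dict[tuple, float],
                     n: int = 2, default: float = 1e-6) -> float:
    # Average negative log-probability of each token given its n-gram
    # context, read from the recorded table instead of querying the LLM.
    nlls = [-math.log(ngram_probs.get(tuple(tokens[i - n:i + 1]), default))
            for i in range(n, len(tokens))]
    return math.exp(sum(nlls) / max(len(nlls), 1))
```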
We introduce BUST, a comprehensive benchmark designed to evaluate detectors of texts generated by instruction-tuned large language models (LLMs). Unlike previous benchmarks, our focus lies on evaluating the performance of detector systems, acknowledging the inevitable influence of the underlying tasks and different LLM generators. Our benchmark dataset consists of 25K texts from humans and 7 LLMs responding to instructions across 10 tasks from 3 diverse sources. Using the benchmark, we evaluated 5 detectors and found substantial performance variance across tasks. A meta-analysis of the dataset characteristics was conducted to guide the examination of detector performance. The dataset was analyzed using diverse metrics assessing linguistic features like fluency and coherence, readability scores, and writer attitudes, such as emotions, convincingness, and persuasiveness. Features impacting detector performance were investigated with surrogate models, revealing emotional content in texts enhanced some detectors, yet the most effective detector demonstrated consistent performance, irrespective of writer’s attitudes and text styles. Our approach focused on investigating relationships between the detectors’ performance and two key factors: text characteristics and LLM generators. We believe BUST will provide valuable insights into selecting detectors tailored to specific text styles and tasks and facilitate a more practical and in-depth investigation of detection systems for LLM-generated text.
… to identifying distinctive LLM artifacts and advancing the state of the art in LLM text detection. By … numerous studies on LLM-generated text detection, a comprehensive benchmarking …
The detection of machine-generated text, especially from large language models (LLMs), is crucial in preventing serious social problems resulting from their misuse. Some methods train dedicated detectors on specific datasets but fall short in generalizing to unseen test data, while other zero-shot ones often yield suboptimal performance. Although the recent DetectGPT has shown promising detection performance, it suffers from significant inefficiency issues, as detecting a single candidate requires querying the source LLM with hundreds of its perturbations. This paper aims to bridge this gap. Concretely, we propose to incorporate a Bayesian surrogate model, which allows us to select typical samples based on Bayesian uncertainty and interpolate scores from typical samples to other samples, to improve query efficiency. Empirical results demonstrate that our method significantly outperforms existing approaches under a low query budget. Notably, when detecting the text generated by LLaMA family models, our method with just 2 or 3 queries can outperform DetectGPT with 200 queries.
Detecting text generated by large language models (LLMs) is a growing challenge as these models produce outputs nearly indistinguishable from human writing. This study explores multiple detection approaches, including a Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM) networks, a Transformer block, and a fine-tuned distilled BERT model. Leveraging BERT's contextual understanding, we train the model on diverse datasets containing authentic and synthetic texts, focusing on features like sentence structure, token distribution, and semantic coherence. The fine-tuned BERT outperforms baseline models, achieving high accuracy and robustness across domains, with superior AUC scores and efficient computation times. By incorporating domain-specific training and adversarial techniques, the model adapts to sophisticated LLM outputs, improving detection precision. These findings underscore the efficacy of pretrained transformer models for ensuring authenticity in digital communication, with potential applications in mitigating misinformation, safeguarding academic integrity, and promoting ethical AI usage.
RATIONALE AND OBJECTIVES We aimed to evaluate the efficacy of perplexity scores in distinguishing between human-written and AI-generated radiology abstracts and to assess the relative performance of available AI detection tools in detecting AI-generated content. METHODS Academic articles were curated from PubMed using the keywords "neuroimaging" and "angiography." Filters included English-language, open-access articles with abstracts without subheadings, published before 2021, and within Chatbot processing word limits. The first 50 qualifying articles were selected, and their full texts were used to create AI-generated abstracts. Perplexity scores, which estimate sentence predictability, were calculated for both AI-generated and human-written abstracts. The performance of three AI tools in discriminating human-written from AI-generated abstracts was assessed. RESULTS The selected 50 articles consist of 22 review articles (44%), 12 case or technical reports (24%), 15 research articles (30%), and one editorial (2%). The perplexity scores for human-written abstracts (median 35.9, IQR 25.11-51.8) were higher than those for AI-generated abstracts (median 21.2, IQR 16.87-28.38) (p=0.057), with an AUC=0.7794. One AI tool performed less than chance in identifying human-written from AI-generated abstracts with an accuracy of 36% (p>0.05) while another tool yielded an accuracy of 95% with an AUC=0.8688. CONCLUSION This study underscores the potential of perplexity scores in detecting AI-generated and potentially fraudulent abstracts. However, more research is needed to further explore these findings and their implications for the use of AI in academic writing. Future studies could also investigate other metrics or methods for distinguishing between human-written and AI-generated texts.
We present SBP-HeteroGNN, a streaming bilingual perplexity-driven heterogeneous GNN for AIGC text detection. Prior detectors are largely monolingual and, to our knowledge, none explicitly targets mixed Chinese-English inputs; we address this gap by constructing a bilingual evaluation set and a detector tailored for bilingual text. We weight each Doc–Term edge by TF-IDF times normalized log-perplexity, then use a heterogeneous graph transformer (HGT) to combine semantic cues with "how likely a model would write this text". On HC3-Bilingual, SBP-HeteroGNN reaches Macro-F1 = 0.966 and AUC = 0.994, and improves on a TF-IDF+LR baseline by +0.140 Macro-F1 / +0.061 AUC under the same split. Ablations show the components work well together (mixed tokenization, relation-aware HGT, and the TF-IDF × perplexity edge weighting), keeping the model stable across English and Chinese inputs with little labeled data.
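A minimal sketch of the Doc–Term edge weighting described here, with TF-IDF scaled by normalized log-perplexity; the variable names and min-max normalization are our assumptions, not the paper's exact recipe, and `doc_perplexities` is assumed to come from an external LM scorer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def edge_weights(docs, doc_perplexities):
    tfidf = TfidfVectorizer().fit_transform(docs)   # (n_docs, n_terms), sparse
    logp = np.log(np.asarray(doc_perplexities, dtype=float))
    # Scale log-perplexity to [0, 1] so it acts as a per-document multiplier.
    norm = (logp - logp.min()) / (logp.max() - logp.min() + 1e-9)
    return tfidf.multiply(norm[:, None])            # row-wise scaling of Doc-Term edges

W = edge_weights(
    ["human written passage ...", "model generated passage ..."],
    [38.2, 19.5],
)
```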
This report presents a comprehensive analysis of a demo application designed for AI detection and paraphrase detection using state-of-the-art natural language processing (NLP) models. The demo leverages the GPT-2 model for AI detection, evaluating the input text based on perplexity and burstiness scores to determine the likelihood of being generated by an AI. Additionally, the demo integrates a paraphrase detector utilizing BERT (Bidirectional Encoder Representations from Transformers) to identify similarities between input sentences. The GPT-2-based AI detection module assesses the input text's perplexity and burstiness scores, which serve as indicators of language complexity and repetition, respectively. Texts with high perplexity and low burstiness scores are flagged as potentially AI-generated. The paraphrase detection module, powered by BERT, employs semantic similarity techniques to compare input sentences and identify paraphrases or closely related sentences. Through a user-friendly interface, the demo allows users to input text for analysis and receive real-time feedback on AI likelihood and paraphrase similarity. The report provides insights into the implementation details, including the integration of the GPT-2 and BERT models, text preprocessing techniques, and result interpretation.
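A hedged sketch of the two scores as described, reusing a `perplexity(text)` helper like the one sketched after the radiology-abstract study above; the thresholds are illustrative, not taken from the report.

```python
import statistics

def burstiness(sentences):
    # Coefficient of variation of sentence-level perplexities:
    # low values mean uniformly "smooth" text with little repetition variance.
    scores = [perplexity(s) for s in sentences]
    return statistics.pstdev(scores) / (statistics.mean(scores) + 1e-9)

def flag_ai_text(sentences, ppl_threshold=30.0, burst_threshold=0.3):
    mean_ppl = statistics.mean(perplexity(s) for s in sentences)
    # Mirrors the demo's stated rule (high perplexity, low burstiness -> AI-like);
    # note that other tools use the opposite perplexity polarity.
    return mean_ppl > ppl_threshold and burstiness(sentences) < burst_threshold
```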
… detection model, we introduce an adversarial learning approach in a dynamic scenario for the ADAT task, where the detector iteratively updates its parameters using adversarial …
The increased quality and human-likeness of AI-generated texts have resulted in a rising demand for neural text detectors, i.e. software that is able to detect whether a text was written by a human or generated by an AI. Such tools are often used in contexts where the use of AI is restricted or completely prohibited, e.g. in educational contexts. It is, therefore, important for the effectiveness of such tools that they are robust towards deliberate attempts to hide the fact that a text was generated by an AI. In this article, we investigate a broad range of adversarial attacks in English texts with six different neural text detectors, including commercial and research tools. While the results show that no detector is completely invulnerable to adversarial attacks, the latest generation of commercial detectors proved to be very robust and not significantly influenced by most of the evaluated attack strategies.
Large language models (LLMs) have exhibited remarkable capabilities in text generation tasks. However, the utilization of these models carries inherent risks, including but not limited to plagiarism, the dissemination of fake news, and issues in educational exercises. Although several detectors have been proposed to address these concerns, their effectiveness against adversarial perturbations, specifically in the context of student essay writing, remains largely unexplored. This paper aims to bridge this gap by constructing AIG-ASAP, an AI-generated student essay dataset, employing a range of text perturbation methods that are expected to generate high-quality essays while evading detection. Through empirical experiments, we assess the performance of current AIGC detectors on the AIG-ASAP dataset. The results reveal that the existing detectors can be easily circumvented using straightforward automatic adversarial attacks. Specifically, we explore word substitution and sentence substitution perturbation methods that effectively evade detection while maintaining the quality of the generated essays. This highlights the urgent need for more accurate and robust methods to detect AI-generated student essays in the education domain.
Large Language Models (LLMs) have revolutionized the field of Natural Language Generation (NLG) by demonstrating an impressive ability to generate human-like text. However, their widespread usage introduces challenges that necessitate thoughtful examination, ethical scrutiny, and responsible practices. In this study, we delve into these challenges and explore existing strategies for mitigating them, with a particular emphasis on identifying AI-generated text as the ultimate solution. Additionally, we assess the feasibility of detection from a theoretical perspective and propose novel research directions to address the current limitations in this domain.
Natural language processing (NLP) models are widely used in various scenarios, yet they are vulnerable to adversarial attacks. Existing works aim to mitigate this vulnerability, but each work targets a specific attack category or has computational overhead limitations, making them vulnerable to adaptive attacks. In this paper, we exhaustively investigate the adversarial attack algorithms in NLP and discover that existing attack algorithms mainly disrupt the importance distribution of words in a text. A well-trained model can distinguish subtle importance distribution differences between clean and adversarial texts. Based on this intuition, we propose TextDefense, a new adversarial example detection framework that utilizes the target model’s capability to defend against adversarial attacks, requiring no prior knowledge. Unlike previous approaches, TextDefense is attack-type agnostic and outperforms existing methods in experiments with different architectures, datasets, and attack methods. We also discover that the target model’s generalizability is a leading factor influencing the performance of TextDefense. Finally, we provide insights into the adversarial attacks in NLP and the principles of our defense method by analyzing the properties of the target model and the adversarial example.
… Discussion and Recommendations The evaluation conducted to assess the robustness of AI text detectors against adversarial evasion techniques highlights the limitations of these tools…
… However, the current adversarial attack methods still face … Hence, it is necessary to design an adversarial attack method … black-box text adversarial example generation scheme, …
… algorithms cannot perform well in detecting adversarial examples with slight perturbations. … stream detector (ADS-Detector) that can address the detection of adversarial examples with …
Recent studies have revealed the vulnerability of pre-trained language models to adversarial attacks. Adversarial defense techniques have been proposed to reconstruct adversarial examples within feature or text spaces. However, these methods struggle to effectively repair the semantics in adversarial examples, resulting in unsatisfactory defense performance. To repair the semantics in adversarial examples, we introduce a novel approach named Reactive Perturbation Defocusing (Rapid), which employs an adversarial detector to identify the fake labels of adversarial examples and leverages adversarial attackers to repair the semantics in adversarial examples. Our extensive experimental results, conducted on four public datasets, demonstrate the consistent effectiveness of Rapid in various adversarial attack scenarios. For easy evaluation, we provide a click-to-run demo of Rapid at https://tinyurl.com/22ercuf8.
Text classifiers are Artificial Intelligence (AI) models used to classify new documents or text vectors into predefined classes. They are typically built using supervised learning algorithms and labelled datasets. Text classifiers produce a predefined class as an output, which also makes them susceptible to adversarial attacks. Text classifiers with high accuracy that are trained using complex deep learning algorithms are equally susceptible to adversarial examples, due to subtle differences that are indiscernible to human experts. Recent work in this space is mostly focused on improving adversarial robustness and adversarial example detection, instead of detecting adversarial attacks. In this paper, we propose a novel approach, explainable AI with integrated gradients (IGs) for the detection of adversarial attacks on text classifiers. This approach uses IGs to unpack model behavior and identify terms that positively and negatively influence the target prediction. Instead of random substitution of words in the input, we select the top p% words with the greatest positive and negative influence as substitute candidates using attribution scores obtained from IGs to generate k samples of transformed inputs by replacing them with synonyms. This approach does not require changes to the model architecture or the training algorithm. The approach was empirically evaluated on three benchmark datasets, IMDB, SST-2, and AG News. Our approach outperforms baseline models on word substitution rate, detection accuracy, and F1 scores while maintaining equivalent detection performance against adversarial attacks.
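A compact sketch of integrated-gradients token attribution for a text classifier, in the spirit of the approach above; the model, the zero-embedding baseline, and the step count are illustrative choices, not the paper's configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

def token_attributions(text, target=1, steps=20):
    ids = tok(text, return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(ids).detach()    # (1, T, H)
    baseline = torch.zeros_like(emb)                    # zero-embedding baseline
    total = torch.zeros_like(emb)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolate between baseline and input, accumulate gradients.
        x = (baseline + alpha * (emb - baseline)).requires_grad_(True)
        logit = model(inputs_embeds=x).logits[0, target]
        (grad,) = torch.autograd.grad(logit, x)
        total += grad
    scores = ((emb - baseline) * total / steps).sum(-1).squeeze(0)
    return list(zip(tok.convert_ids_to_tokens(ids[0]), scores.tolist()))

# Tokens with the largest positive/negative scores become the top-p%
# substitution candidates described in the abstract.
```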
In the past few years, it has become increasingly evident that deep neural networks are not resilient enough to withstand adversarial perturbations in input data, leaving them vulnerable to attack. Various authors have proposed strong adversarial attacks for computer vision and Natural Language Processing (NLP) tasks. As a response, many defense mechanisms have also been proposed to prevent these networks from failing. The significance of defending neural networks against adversarial attacks lies in ensuring that the model’s predictions remain unchanged even if the input data is perturbed. Several methods for adversarial defense in NLP have been proposed, catering to different NLP tasks such as text classification, named entity recognition, and natural language inference. Some of these methods not only defend neural networks against adversarial attacks but also act as a regularization mechanism during training, saving the model from overfitting. This survey aims to review the various methods proposed for adversarial defenses in NLP over the past few years by introducing a novel taxonomy. The survey also highlights the fragility of advanced deep neural networks in NLP and the challenges involved in defending them.
Machine-generated Text (MGT) detection is crucial for regulating and attributing online texts. While existing MGT detectors achieve strong performance, they remain vulnerable to simple perturbations and adversarial attacks. To build an effective defense against malicious perturbations, we view MGT detection from a threat-modeling perspective, that is, analyzing the model's vulnerability from an adversary's point of view and exploring effective mitigations. To this end, we introduce an adversarial framework for training a robust MGT detector, named GREedy Adversary PromoTed DefendER (GREATER). GREATER consists of two key components: an adversary, GREATER-A, and a detector, GREATER-D. GREATER-D learns to defend against the adversarial attack from GREATER-A and generalizes the defense to other attacks. GREATER-A identifies and perturbs the critical tokens in embedding space, using greedy search and pruning to generate stealthy and disruptive adversarial examples. Besides, we update GREATER-A and GREATER-D synchronously, encouraging GREATER-D to generalize its defense to different attacks and varying attack intensities. Our experimental results across 10 text perturbation strategies and 6 adversarial attacks show that GREATER-D reduces the Attack Success Rate (ASR) by 0.67% compared with SOTA defense methods, while GREATER-A is demonstrated to be more effective and efficient than SOTA attack approaches. Code and dataset are available at https://github.com/Liyuuuu111/GREATER.
The proliferation of large language models has raised growing concerns about their misuse, particularly in cases where AI-generated text is falsely attributed to human authors. Machine-generated content detectors claim to effectively identify such text under various conditions and from any language model. This paper critically evaluates these claims by assessing several popular detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, PHD, LogRank, Binoculars) on a range of domains, datasets, and models that these detectors have not previously encountered. We employ various prompting strategies to simulate practical adversarial attacks, demonstrating that even moderate efforts can significantly evade detection. We emphasize the importance of the true positive rate at a specific false positive rate (TPR@FPR) metric and demonstrate that these detectors perform poorly in certain settings, with TPR@.01 as low as 0%. Our findings suggest that both trained and zero-shot detectors struggle to maintain high sensitivity while keeping a reasonable false positive rate.
Tharindu Kumarage, Amrita Bhattacharjee, Djordje Padejski, Kristy Roschke, Dan Gillmor, Scott Ruston, Huan Liu, Joshua Garland. Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
Large language models (LLMs) have transformed natural language generation capabilities across numerous applications, yet their proliferation raises critical concerns regarding content attribution, intellectual property protection, and potential misuse. Watermarking techniques have emerged as promising solutions for embedding verifiable signals into LLM outputs, but existing approaches remain vulnerable to sophisticated evasion attacks that exploit detection mechanisms through adversarial modifications. This paper introduces a novel watermarking framework that integrates multi-scale semantic embedding with cryptographic verification to achieve robust attribution of LLM-generated text. Our approach operates across multiple granularity levels, from token-level perturbations to discourse-level structural patterns, while incorporating error-correcting codes and cryptographic signatures to ensure detection integrity even under aggressive tampering attempts. Through comprehensive evaluation on diverse text generation tasks, we demonstrate that our framework achieves superior robustness against paraphrasing attacks, token substitution, and deletion operations while maintaining high text quality with perplexity comparable to unwatermarked outputs. The integration of cryptographic primitives enables public verifiability without exposing watermarking keys, addressing critical security requirements for real-world deployment. Our results show detection accuracy exceeding 94 percent under various attack scenarios while preserving semantic coherence and stylistic naturalness of generated text.
The rapid advancement of Large Language Models (LLMs) has significantly enhanced the capabilities of text generators. With the potential for misuse escalating, the importance of discerning whether texts are human-authored or generated by LLMs has become paramount. Several preceding studies have ventured to address this challenge by employing binary classifiers to differentiate between human-written and LLM-generated text. Nevertheless, the reliability of these classifiers has been subject to question. Given that consequential decisions may hinge on the outcome of such classification, it is imperative that text source detection is of high caliber. In light of this, the present paper introduces DeepTextMark, a deep learning-driven text watermarking methodology devised for text source identification. By leveraging Word2Vec and Sentence Encoding for watermark insertion, alongside a transformer-based classifier for watermark detection, DeepTextMark epitomizes a blend of blindness, robustness, imperceptibility, and reliability. As elaborated within the paper, these attributes are crucial for universal text source detection, with a particular emphasis in this paper on text produced by LLMs. DeepTextMark offers a viable “add-on” solution to prevailing text generation frameworks, requiring no direct access or alterations to the underlying text generation mechanism. Experimental evaluations underscore the high imperceptibility, elevated detection accuracy, augmented robustness, reliability, and swift execution of DeepTextMark.
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating high-quality text, raising significant concerns regarding copyright protection and content provenance verification. However, most existing watermarking techniques rely on uniform perturbation or rule-based token biasing schemes, which exhibit critical vulnerabilities under adversarial attacks such as paraphrasing, translation, and content truncation, often failing to maintain detection reliability in real-world deployment scenarios. To address these challenges, this paper introduces a novel context-aware robust watermarking framework that dynamically adjusts watermark embedding strength according to contextual semantic characteristics during text generation. The proposed approach incorporates a token-level semantic modulation mechanism that strategically intensifies watermark signals in copyright-sensitive segments while minimizing perturbations in semantically neutral regions, achieving an improved balance between imperceptibility and robustness. Furthermore, an adaptive threshold estimation algorithm is developed for watermark detection, which automatically calibrates detection boundaries based on noise statistics, significantly enhancing resilience against diverse attack vectors. Extensive experiments on the WaterBench benchmark demonstrate superior performance over state-of-the-art baselines, maintaining high detection accuracy with a 95.3% true positive rate (TPR) under clean conditions and strong robustness under severe perturbations, including paraphrasing attacks (82.7% TPR), translation attacks (78.4% TPR), and content truncation (88.9% TPR at 50% retention). Meanwhile, the proposed method reduces false positive rates by 43.2% compared with existing approaches while preserving text quality with negligible perplexity increase (1.8%). These results establish a new paradigm for practical and scalable LLM watermarking in real-world copyright-sensitive deployment scenarios.
This paper presents an effective approach to detecting AI-generated text, developed for the Defactify 4.0 shared task at the fourth workshop on multimodal fact checking and hate speech detection. The task consists of two subtasks: Task-A, classifying whether a text is AI-generated or human-written, and Task-B, classifying the specific large language model that generated the text. Our team (Sarang) achieved 1st place in both tasks with F1 scores of 1.0 and 0.9531, respectively. The methodology involves adding noise to the dataset to improve model robustness and generalization. We used an ensemble of DeBERTa models to effectively capture complex patterns in the text. The results indicate the effectiveness of our noise-driven and ensemble-based approach, setting a new standard in AI-generated text detection and providing guidance for future developments.
Large language models (LLMs) are able to generate high-quality texts in multiple languages. Such texts are often not recognizable by humans as generated, and therefore present a potential for misuse of LLMs (e.g., plagiarism, spam, disinformation spreading). Automated detection can assist humans in flagging machine-generated texts; however, its robustness to out-of-distribution data remains challenging. This notebook describes our mdok approach to robust detection, based on fine-tuning smaller LLMs for text classification. It is applied to both subtasks of Voight-Kampff Generative AI Detection 2025, providing remarkable performance (1st rank) in both the binary detection and the multiclass classification of various cases of human-AI collaboration.
The advent of Large Language Models (LLMs) has enabled the generation of text that increasingly exhibits human-like characteristics. As the detection of such content is of significant importance, substantial research has been conducted with the objective of developing reliable AI-generated text detectors. These detectors have demonstrated promising results on test data, but recent research has revealed that they can be circumvented by employing different techniques. In this paper, we present homoglyph-based attacks (A $\rightarrow$ Cyrillic A) as a means of circumventing existing detectors. We conduct a comprehensive evaluation to assess the effectiveness of these attacks on seven detectors, including ArguGPT, Binoculars, DetectGPT, Fast-DetectGPT, Ghostbuster, OpenAI's detector, and watermarking techniques, on five different datasets. Our findings demonstrate that homoglyph-based attacks can effectively circumvent state-of-the-art detectors, leading them to classify all texts as either AI-generated or human-written (decreasing the average Matthews Correlation Coefficient from 0.64 to -0.01). Through further examination, we extract the technical justification underlying the success of the attacks, which varies across detectors. Finally, we discuss the implications of these findings and potential defenses against such attacks.
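The core transformation is simple to illustrate. A toy sketch of Latin-to-Cyrillic substitution follows; the mapping and the choice of characters are illustrative, and the paper's exact attack configuration may differ.

```python
# Visually near-identical Cyrillic letters that replace Latin ones,
# perturbing the tokenization that detectors rely on.
HOMOGLYPHS = {"A": "\u0410", "a": "\u0430", "e": "\u0435", "o": "\u043e",
              "p": "\u0440", "c": "\u0441", "x": "\u0445"}

def homoglyph_attack(text: str, chars=("a", "e", "o")) -> str:
    return "".join(HOMOGLYPHS[ch] if ch in chars else ch for ch in text)

print(homoglyph_attack("Large language models generate fluent text."))
```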
This study explores the challenge of sentence-level AI-generated text detection within human-AI collaborative hybrid texts. Existing studies of AI-generated text detection for hybrid texts often rely on synthetic datasets. These typically involve hybrid texts with a limited number of boundaries. We contend that studies of detecting AI-generated content within hybrid texts should cover different types of hybrid texts generated in realistic settings to better inform real-world applications. Therefore, our study utilizes the CoAuthor dataset, which includes diverse, realistic hybrid texts generated through the collaboration between human writers and an intelligent writing system in multi-turn interactions. We adopt a two-step, segmentation-based pipeline: (i) detect segments within a given hybrid text where each segment contains sentences of consistent authorship, and (ii) classify the authorship of each identified segment. Our empirical findings highlight (1) detecting AI-generated sentences in hybrid texts is overall a challenging task because (1.1) human writers' selecting and even editing AI-generated sentences based on personal preferences adds difficulty in identifying the authorship of segments; (1.2) the frequent change of authorship between neighboring sentences within the hybrid text creates difficulties for segment detectors in identifying authorship-consistent segments; (1.3) the short length of text segments within hybrid texts provides limited stylistic cues for reliable authorship determination; (2) before embarking on the detection process, it is beneficial to assess the average length of segments within the hybrid text. This assessment aids in deciding whether (2.1) to employ a text segmentation-based strategy for hybrid texts with longer segments, or (2.2) to adopt a direct sentence-by-sentence classification strategy for those with shorter segments.
The rise in malicious usage of large language models, such as fake content creation and academic plagiarism, has motivated the development of approaches that identify AI-generated text, including those based on watermarking or outlier detection. However, the robustness of these detection algorithms to paraphrases of AI-generated text remains unclear. To stress test these detectors, we build an 11B-parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering. Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings while only classifying 1% of human-written sequences as AI-generated. We open-source our models, code, and data.
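A hedged sketch of the retrieval defense described here, using sentence-transformers as an illustrative encoder; the model name and threshold are placeholders, not the paper's configuration, and `generation_db` stands in for the API provider's database of past outputs.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieval_detect(candidate, generation_db, threshold=0.75):
    # Normalized embeddings make the dot product a cosine similarity.
    db_vecs = encoder.encode(generation_db, normalize_embeddings=True)
    q = encoder.encode([candidate], normalize_embeddings=True)[0]
    sims = db_vecs @ q
    best = int(np.argmax(sims))
    # Paraphrasing changes surface form but rarely drops semantic similarity
    # below a well-chosen threshold, which is what the defense exploits.
    return bool(sims[best] >= threshold), generation_db[best]
```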
With the rapid advancement of large language model technology, there is growing interest in whether multi-feature approaches can significantly improve AI text detection beyond what single neural models achieve. While intuition suggests that combining semantic, syntactic, and statistical features should provide complementary signals, this assumption has not been rigorously tested with modern LLM-generated text. This paper provides a systematic empirical investigation of multi-hierarchical feature integration for AI text detection, specifically testing whether the computational overhead of combining multiple feature types is justified by performance gains. We implement MHFD (Multi-Hierarchical Feature Detection), integrating DeBERTa-based semantic analysis, syntactic parsing, and statistical probability features through adaptive fusion. Our investigation reveals important negative results: despite theoretical expectations, multi-feature integration provides minimal benefits (0.4-0.5% improvement) while incurring substantial computational costs (4.2x overhead), suggesting that modern neural language models may already capture most relevant detection signals efficiently. Experimental results on multiple benchmark datasets demonstrate that the MHFD method achieves 89.7% accuracy in in-domain detection and maintains 84.2% stable performance in cross-domain detection, showing modest improvements of 0.4-2.6% over existing methods.
We find that large language models (LLMs) are more likely to modify human-written text than AI-generated text when tasked with rewriting. This tendency arises because LLMs often perceive AI-generated text as high-quality, leading to fewer modifications. We introduce a method to detect AI-generated content by prompting LLMs to rewrite text and calculating the editing distance of the output. We dubbed our geneRative AI Detection viA Rewriting method Raidar. Raidar significantly improves the F1 detection scores of existing AI content detection models -- both academic and commercial -- across various domains, including News, creative writing, student essays, code, Yelp reviews, and arXiv papers, with gains of up to 29 points. Operating solely on word symbols without high-dimensional features, our method is compatible with black box LLMs, and is inherently robust on new content. Our results illustrate the unique imprint of machine-generated text through the lens of the machines themselves.
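The mechanism is easy to sketch: rewrite, then measure the change. In the sketch below, `rewrite(text)` is a placeholder for any chat-LLM call, the similarity measure is a simple stand-in for the paper's edit distance, and the threshold is illustrative.

```python
import difflib

def change_ratio(original: str, rewritten: str) -> float:
    # 1 - similarity: 0 means untouched, 1 means fully rewritten.
    return 1.0 - difflib.SequenceMatcher(None, original, rewritten).ratio()

def raidar_style_detect(text, rewrite, threshold=0.25):
    # AI-generated text tends to be modified less when an LLM rewrites it,
    # so a small change ratio is evidence of machine authorship.
    return change_ratio(text, rewrite(text)) < threshold
```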
Large Language Models (LLMs) have revolutionized the domain of natural language processing (NLP) with remarkable capabilities of generating human-like text responses. However, despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of LLMs such as spreading misinformation, generating fake news, plagiarism in academia, and contaminating the web. To address these concerns, a consensus among the research community is to develop algorithmic solutions to detect AI-generated text. The basic idea is that whenever we can tell if the given text is either written by a human or an AI, we can utilize this information to address the above-mentioned concerns. To that end, a plethora of detection frameworks have been proposed, highlighting the possibilities of AI-generated text detection. But in parallel to the development of detection frameworks, researchers have also concentrated on designing strategies to elude detection, i.e., focusing on the impossibilities of AI-generated text detection. This is a crucial step in order to make sure the detection frameworks are robust enough and it is not too easy to fool a detector. Despite the huge interest and the flurry of research in this domain, the community currently lacks a comprehensive analysis of recent developments. In this survey, we aim to provide a concise categorization and overview of current work encompassing both the prospects and the limitations of AI-generated text detection. To enrich the collective knowledge, we engage in an exhaustive discussion on critical and challenging open questions related to ongoing research on AI-generated text detection.
The rapid advancement of Large Language Models (LLMs) has ushered in an era where AI-generated text is increasingly indistinguishable from human-generated content. Detecting AI-generated text has become imperative to combat misinformation, ensure content authenticity, and safeguard against malicious uses of AI. In this paper, we propose a novel hybrid approach that combines traditional TF-IDF techniques with advanced machine learning models, including Bayesian classifiers, Stochastic Gradient Descent (SGD), Categorical Gradient Boosting (CatBoost), and 12 instances of Deberta-v3-large models. Our approach aims to address the challenges associated with detecting AI-generated text by leveraging the strengths of both traditional feature extraction methods and state-of-the-art deep learning models. Through extensive experiments on a comprehensive dataset, we demonstrate the effectiveness of our proposed method in accurately distinguishing between human and AI-generated text. Our approach achieves superior performance compared to existing methods. This research contributes to the advancement of AI-generated text detection techniques and lays the foundation for developing robust solutions to mitigate the challenges posed by AI-generated content.
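As a toy instance of the classical branch of such a hybrid, here is a TF-IDF + SGD pipeline in scikit-learn; the dataset and hyperparameters are placeholders, not the paper's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# TF-IDF n-gram features feeding a linear classifier trained with SGD;
# in the full system this would be one member of a larger ensemble.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SGDClassifier(loss="log_loss"),
)
clf.fit(["a human written passage ...", "an ai generated passage ..."], [0, 1])
print(clf.predict(["another passage to score"]))
```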
Large Language Models (LLMs) perform impressively well in various applications. However, the potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concerns about their responsible use. Consequently, the reliable detection of AI-generated text has become a critical area of research. AI text detectors have been shown to be effective under their specific settings. In this paper, we stress-test the robustness of these AI text detectors in the presence of an attacker. We introduce a recursive paraphrasing attack to stress test a wide range of detection schemes, including those using watermarking as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments, conducted on passages approximately 300 tokens long, reveal the varying sensitivities of these detectors to our attacks. Our findings indicate that while our recursive paraphrasing method can significantly reduce detection rates, it only slightly degrades text quality in many cases, highlighting potential vulnerabilities in current detection systems in the presence of an attacker. Additionally, we investigate the susceptibility of watermarked LLMs to spoofing attacks aimed at misclassifying human-written text as AI-generated. We demonstrate that an attacker can infer hidden AI text signatures without white-box access to the detection method, potentially leading to reputational risks for LLM developers. Finally, we provide a theoretical framework connecting the AUROC of the best possible detector to the Total Variation distance between human and AI text distributions. This analysis offers insights into the fundamental challenges of reliable detection as language models continue to advance. Our code is publicly available at https://github.com/vinusankars/Reliability-of-AI-text-detectors.
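The theoretical framework mentioned at the end can be stated compactly. This is our rendering of the bound relating the AUROC of the best possible detector $D$ to the total variation distance between the machine distribution $\mathcal{M}$ and the human distribution $\mathcal{H}$:

```latex
% Best-case detection bound (our rendering of the paper's result):
% as TV -> 0 (the distributions become indistinguishable),
% the best achievable AUROC -> 1/2 (random guessing).
\[
\mathrm{AUROC}(D) \;\le\; \frac{1}{2}
  + \mathrm{TV}(\mathcal{M},\mathcal{H})
  - \frac{\mathrm{TV}(\mathcal{M},\mathcal{H})^{2}}{2}
\]
```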
In this paper, we describe our methodology for the CLEF 2025 SimpleText Task 2, which focuses on detecting and evaluating creative generation and information distortion in scientific text simplification. Our solution integrates multiple strategies: we construct an ensemble framework that leverages BERT-based classifier, semantic similarity measure, natural language inference model, and large language model (LLM) reasoning. These diverse signals are combined using meta-classifiers to enhance the robustness of spurious and distortion detection. Additionally, for grounded generation, we employ an LLM-based post-editing system that revises simplifications based on the original input texts.
Large language models (LLMs) have the potential to generate texts that pose risks of misuse, such as plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets. Consequently, detecting whether a text is generated by LLMs has become increasingly important. Existing high-quality detection methods usually require access to the interior of the model to extract intrinsic characteristics. However, since we do not have access to the interior of a black-box model, we must resort to surrogate models, which impacts detection quality. To achieve high-quality detection of black-box models, we would like to extract deep intrinsic characteristics of the texts they generate. We view the generation process as a coupled process of the prompt and the intrinsic characteristics of the generative model. Based on this insight, we propose DPIC, a method that decouples prompt and intrinsic characteristics for LLM-generated text detection. Specifically, given a candidate text, DPIC employs an auxiliary LLM to reconstruct the prompt corresponding to the candidate text, then uses the prompt to regenerate text with the auxiliary LLM, which aligns the candidate text and the regenerated text with their respective prompts. The similarity between the candidate text and the regenerated text is then used as a detection feature, eliminating the prompt from the detection process and allowing the detector to focus on the intrinsic characteristics of the generative model. Compared to the baselines, DPIC achieves an average improvement of 6.76% and 2.91% in detecting texts from different domains generated by GPT-4 and Claude 3, respectively.
The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human-ChatGPT pairs) and ELI5 (15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a CNN, an XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting. Results show that transformer models achieve near-perfect in-distribution performance but degrade under domain shift. The XGBoost stylometric model matches performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources.
With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.
The emergence of large language models (LLMs) has resulted in the production of LLM-generated texts that are highly sophisticated and almost indistinguishable from texts written by humans. However, this has also sparked concerns about the potential misuse of such texts, such as spreading misinformation and causing disruptions in the education system. Although many detection approaches have been proposed, a comprehensive understanding of the achievements and challenges is still lacking. This survey aims to provide an overview of existing LLM-generated text detection techniques and enhance the control and regulation of language generation models. Furthermore, we emphasize crucial considerations for future research, including the development of comprehensive evaluation metrics and the threat posed by open-source LLMs, to drive progress in the area of LLM-generated text detection.
The development of large language models (LLMs) has raised concerns about potential misuse. One practical solution is to embed a watermark in the text, allowing ownership verification through watermark extraction. Existing methods primarily focus on defending against modification attacks, often neglecting other spoofing attacks. For example, attackers can alter the watermarked text to produce harmful content without compromising the presence of the watermark, which could lead to false attribution of this malicious content to the LLM. This situation poses a serious threat to LLM service providers and highlights the significance of achieving modification detection and generated-text detection simultaneously. Therefore, we propose a technique for detecting modifications in text under an unbiased watermark that is sensitive to modification. We introduce a new metric called "discarded tokens", which measures the number of tokens not included in watermark detection. When a modification occurs, this metric changes and can serve as evidence of the modification. Additionally, we improve the watermark detection process and introduce a novel method for unbiased watermarking. Our experiments demonstrate that we can achieve effective dual detection capabilities: modification detection and generated-text detection by watermark.
This paper presents a system developed for Task 1 of the COLING 2025 Workshop on Detecting AI-Generated Content, focusing on the binary classification of machine-generated versus human-written text. Our approach utilizes an ensemble of models, with weights assigned according to each model's inverse perplexity, to enhance classification accuracy. For the English text detection task, we combined RoBERTa-base, RoBERTa-base with the OpenAI detector, and BERT-base-cased, achieving a Macro F1-score of 0.7458, which ranked us 12th out of 35 teams. For the multilingual text detection task, we ensembled RemBERT, XLM-RoBERTa-base, and BERT-base-multilingual-cased, employing the same inverse perplexity weighting technique. This resulted in a Macro F1-score of 0.7513, positioning us 4th out of 25 teams. Our results demonstrate the effectiveness of inverse perplexity weighting in improving the robustness of machine-generated text detection across both monolingual and multilingual settings, highlighting the potential of ensemble methods for this challenging task.
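The weighting scheme is straightforward to sketch. In this minimal version, the model names and numbers are purely illustrative, not the team's exact configuration.

```python
import numpy as np

def inverse_perplexity_ensemble(probas: dict, perplexities: dict) -> np.ndarray:
    """probas: model name -> (n_samples, n_classes) predicted probabilities."""
    weights = {m: 1.0 / p for m, p in perplexities.items()}  # lower ppl -> more say
    z = sum(weights.values())
    return sum((weights[m] / z) * probas[m] for m in probas)

p = inverse_perplexity_ensemble(
    {"roberta": np.array([[0.9, 0.1]]), "bert": np.array([[0.6, 0.4]])},
    {"roberta": 5.2, "bert": 8.7},
)
print(p.argmax(axis=1))  # ensemble prediction per sample
```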
This paper presents our approach for Task 3 of the GenAI content detection workshop at COLING-2025, focusing on Cross-Domain Machine-Generated Text (MGT) Detection. We propose an ensemble of fine-tuned transformer models, enhanced by inverse perplexity weighting, to improve classification accuracy across diverse text domains. For Subtask A (Non-Adversarial MGT Detection), we combined a fine-tuned RoBERTa-base model with an OpenAI detector-integrated RoBERTa-base model, achieving an aggregate TPR score of 0.826, ranking 10th out of 23 detectors. In Subtask B (Adversarial MGT Detection), our fine-tuned RoBERTa-base model achieved a TPR score of 0.801, securing 8th out of 22 detectors. Our results demonstrate the effectiveness of inverse perplexity-based weighting for enhancing generalization and performance in both non-adversarial and adversarial MGT detection, highlighting the potential for transformer models in cross-domain AI-generated content detection.
AI-generated text detectors have become essential tools for maintaining content authenticity, yet their robustness against evasion attacks remains questionable. We present PDFuzz, a novel attack that exploits the discrepancy between visual text layout and extraction order in PDF documents. Our method preserves the exact textual content while manipulating character positioning to scramble extraction sequences. We evaluate this approach against the ArguGPT detector using a dataset of human- and AI-generated text. Our results demonstrate complete evasion: detector performance drops from 93.6 ± 1.4% accuracy and a 0.938 ± 0.014 F1 score to random-level performance (50.4 ± 3.2% accuracy, 0.0 F1 score) while maintaining perfect visual fidelity. Our work reveals a vulnerability in current detection systems that is inherent to PDF document structures and underscores the need for robust safeguards against such attacks. We make our code publicly available at https://github.com/ACMCMC/PDFuzz.
The growth of highly advanced Large Language Models (LLMs) constitutes a huge dual-use problem, making it necessary to create dependable AI-generated text detection systems. Modern detectors are notoriously vulnerable to adversarial attacks, with paraphrasing standing out as an effective evasion technique that foils statistical detection. This paper presents a comparative study of adversarial robustness, first by quantifying the limitations of standard adversarial training and then by introducing a novel, significantly more resilient detection framework: Perturbation-Invariant Feature Engineering (PIFE). PIFE enhances detection by first transforming input text into a standardized form using a multi-stage normalization pipeline, then quantifying the transformation's magnitude using metrics like Levenshtein distance and semantic similarity, and feeding these signals directly to the classifier. We evaluate both a conventionally hardened Transformer and our PIFE-augmented model against a hierarchical taxonomy of character-, word-, and sentence-level attacks. Our findings first confirm that conventional adversarial training, while resilient to syntactic noise, fails against semantic attacks, an effect we term the "semantic evasion threshold", where its True Positive Rate at a strict 1% False Positive Rate plummets to 48.8%. In stark contrast, our PIFE model, which explicitly engineers features from the discrepancy between a text and its canonical form, overcomes this limitation. It maintains a remarkable 82.6% TPR under the same conditions, effectively neutralizing the most sophisticated semantic attacks. This superior performance demonstrates that explicitly modeling perturbation artifacts, rather than merely training on them, is a more promising path toward achieving genuine robustness in the adversarial arms race.
The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to a simple paraphrasing attack, which, ironically, increases the true positive rate at 1% false positive rate (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT, adversarial paraphrasing guided by OpenAI-RoBERTa-Large reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors, including neural network-based, watermark-based, and zero-shot approaches, our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success, finding that our method can significantly reduce detection rates with mostly a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in light of increasingly sophisticated evasion techniques.
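The attack loop reduces to detector-guided selection. A minimal sketch follows, where `paraphrase` and `detector_score` are placeholders for an instruction-following LLM call and any AI-text detector's scoring function; the candidate count is illustrative.

```python
def adversarial_paraphrase(text, paraphrase, detector_score, n_candidates=8):
    # Sample several paraphrases and keep the one the guiding detector
    # scores least AI-like; the paper's pipeline is richer, this is the gist.
    candidates = [paraphrase(text) for _ in range(n_candidates)]
    return min(candidates, key=detector_score)
```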
Text classification systems have been proven vulnerable to adversarial text examples: modified versions of the original text examples that often go unnoticed by human eyes, yet can force text classification models to alter their classification. Research quantifying the impact of adversarial text attacks has mostly been applied only to models trained in English. In this paper, we introduce the first word-level study of adversarial attacks in Arabic. Specifically, we use a synonym (word-level) attack based on a Masked Language Modeling (MLM) task with a BERT model in a black-box setting to assess the robustness of state-of-the-art text classification models to adversarial attacks in Arabic. To evaluate the grammatical and semantic similarity of the newly produced adversarial examples under our synonym BERT-based attack, we invite four human evaluators to assess and compare the produced adversarial examples with their original examples. We also study the transferability of these newly produced Arabic adversarial examples to various models and investigate the effectiveness of defense mechanisms against them on the BERT models. We find that fine-tuned BERT models were more susceptible to our synonym attacks than the other Deep Neural Network (DNN) models we trained, such as WordCNN and WordLSTM. We also find that fine-tuned BERT models were more susceptible to transferred attacks. Lastly, we find that fine-tuned BERT models successfully regain at least 2% in accuracy after applying adversarial training as an initial defense mechanism.
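The candidate-generation step can be sketched with a fill-mask pipeline; the multilingual model below is illustrative (the study targets Arabic with a BERT model), and in the full attack each candidate would also be checked against the victim classifier.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

def mlm_substitutes(words, idx, top_k=5):
    # Mask the word at position idx and let the MLM propose in-context
    # replacements, which serve as synonym candidates for the attack.
    masked = " ".join(w if i != idx else fill.tokenizer.mask_token
                      for i, w in enumerate(words))
    return [c["token_str"] for c in fill(masked, top_k=top_k)]

print(mlm_substitutes("this movie was absolutely wonderful".split(), 4))
```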
As machine learning systems become more widely used, especially for safety critical applications, there is a growing need to ensure that these systems behave as intended, even in the face of adversarial examples. Adversarial examples are inputs that are designed to trick the decision making process, and are intended to be imperceptible to humans. However, for text-based classification systems, changes to the input, a string of text, are always perceptible. Therefore, text-based adversarial examples instead focus on trying to preserve semantics. Unfortunately, recent work has shown this goal is often not met. To improve the quality of text-based adversarial examples, we need to know what elements of the input text are worth focusing on. To address this, in this paper, we explore what parts of speech have the highest impact of text-based classifiers. Our experiments highlight a distinct bias in CNN algorithms against certain parts of speech tokens within review datasets. This finding underscores a critical vulnerability in the linguistic processing capabilities of CNNs.
Recently, with the advancement of deep learning, several applications in text classification have advanced significantly. However, this improvement comes with a cost, because deep learning is vulnerable to adversarial examples. This weakness indicates that deep learning is not very robust. Fortunately, the input of a text classifier is discrete, which prevents the direct application of state-of-the-art white-box attacks. Nonetheless, previous works have crafted black-box attacks that successfully manipulate the discrete values of the input to find adversarial examples. Therefore, instead of changing the discrete values, we transform the input into its embedding vector containing real values in order to perform state-of-the-art white-box attacks. We then convert the perturbed embedding vector back into text and call the result an adversarial example. In summary, we create a framework that measures the robustness of a text classifier by using the gradients of the classifier.
Adversarial attacks are a serious threat to the reliable deployment of machine learning models in safety-critical applications. They can misguide current models into incorrect predictions by slightly modifying the inputs. Recently, substantial work has shown that adversarial examples tend to deviate from the underlying data manifold of normal examples, whereas pre-trained masked language models can fit the manifold of normal NLP data. To explore how to use masked language models in adversarial detection, we propose a novel textual adversarial example detection method, Masked Language Model-based Detection (MLMD), which produces clearly distinguishable signals between normal and adversarial examples by exploring the changes in manifolds induced by the masked language model. MLMD features plug-and-play usage (i.e., no need to retrain the victim model) for adversarial defense and is agnostic to classification tasks, victim model architectures, and to-be-defended attack methods. We evaluate MLMD on various benchmark textual datasets, widely studied machine learning models, and state-of-the-art (SOTA) adversarial attacks (in total 3 × 4 × 4 = 48 settings). Experimental results show that MLMD achieves strong performance, with detection accuracy up to 0.984, 0.967, and 0.901 on the AG-NEWS, IMDB, and SST-2 datasets, respectively. Additionally, MLMD is superior, or at least comparable, to the SOTA detection defenses in detection accuracy and F1 score. Among many defenses based on the off-manifold assumption of adversarial examples, this work offers a new angle for capturing the manifold change. The code for this work is openly accessible at https://github.com/mlmddetection/MLMDdetection.
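A loose sketch of the masked-LM signal behind this family of detectors: mask each token and check whether the MLM restores it, since adversarial substitutions tend to be restored less often. The paper's actual scoring and aggregation are richer; this follows it only in spirit.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def restoration_rate(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids[0]
    hits, total = 0, 0
    for i in range(1, len(ids) - 1):           # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            pred = mlm(masked.unsqueeze(0)).logits[0, i].argmax()
        hits += int(pred == ids[i])
        total += 1
    return hits / max(total, 1)

# Lower restoration rates suggest off-manifold (possibly adversarial) text.
```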
Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.
Neural ranking models (NRMs) have undergone significant development and have become integral components of information retrieval (IR) systems. Unfortunately, recent research has unveiled the vulnerability of NRMs to adversarial document manipulations, potentially exploited by malicious search engine optimization practitioners. While progress in adversarial attack strategies aids in identifying the potential weaknesses of NRMs before their deployment, defensive measures against such attacks, like the detection of adversarial documents, remain inadequately explored. To mitigate this gap, this paper establishes a benchmark dataset to facilitate the investigation of adversarial ranking defense and introduces two types of detection tasks for adversarial documents. A comprehensive investigation of the performance of several detection baselines is conducted, involving the examination of spamicity, perplexity, and linguistic acceptability, as well as the use of supervised classifiers. Experimental results demonstrate that a supervised classifier can effectively mitigate known attacks but performs poorly against unseen attacks. Furthermore, such a classifier should avoid using query text to prevent learning classification based on relevance, as this might lead to the inadvertent discarding of relevant documents.
The rapid advancement of generative artificial intelligence (GenAI) has revolutionized content creation across text, visual, and audio domains, simultaneously introducing significant risks such as misinformation, identity fraud, and content manipulation. This paper presents a practical survey of watermarking techniques designed to proactively detect GenAI content. We develop a structured taxonomy categorizing watermarking methods for text, visual, and audio modalities and critically evaluate existing approaches based on their effectiveness, robustness, and practicality. Additionally, we identify key challenges, including resistance to adversarial attacks, lack of standardization across different content types, and ethical considerations related to privacy and content ownership. Finally, we discuss potential future research directions aimed at enhancing watermarking strategies to ensure content authenticity and trustworthiness. This survey serves as a foundational resource for researchers and practitioners seeking to understand and advance watermarking techniques for AI-generated content detection.
Watermarking technology is a method used to trace the usage of content generated by large language models. Sentence-level watermarking aids in preserving the semantic integrity within individual sentences while maintaining greater robustness. However, many existing sentence-level watermarking techniques depend on arbitrary segmentation or generation processes to embed watermarks, which can limit the availability of appropriate sentences. This limitation, in turn, compromises the quality of the generated response. To address the challenge of balancing high text quality with robust watermark detection, we propose CoheMark, an advanced sentence-level watermarking technique that exploits the cohesive relationships between sentences for better logical fluency. The core methodology of CoheMark involves selecting sentences through trained fuzzy c-means clustering and applying specific next sentence selection criteria. Experimental evaluations demonstrate that CoheMark achieves strong watermark strength while exerting minimal impact on text quality.
As open-source large language models (LLMs) like Llama3 become more capable, it is crucial to develop watermarking techniques to detect their potential misuse. Existing watermarking methods either add watermarks during LLM inference, which is unsuitable for open-source LLMs, or primarily target classification LLMs rather than recent generative LLMs. Adapting these watermarks to open-source LLMs for misuse detection remains an open challenge. This work defines two misuse scenarios for open-source LLMs: intellectual property (IP) violation and LLM Usage Violation. Then, we explore the application of inference-time watermark distillation and backdoor watermarking in these contexts. We propose comprehensive evaluation methods to assess the impact of various real-world further fine-tuning scenarios on watermarks and the effect of these watermarks on LLM performance. Our experiments reveal that backdoor watermarking could effectively detect IP Violation, while inference-time watermark distillation is applicable in both scenarios but less robust to further fine-tuning and has a more significant impact on LLM performance compared to backdoor watermarking. Exploring more advanced watermarking methods for open-source LLMs to detect their misuse should be an important future direction.
Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of "green" tokens before a word is generated, and then softly promoting use of green tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
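The mechanism lends itself to a compact sketch: seed a PRNG on the preceding token, softly promote the resulting "green" list at sampling time, and detect with a one-proportion z-test. The hash function and the parameter values (gamma, delta) below are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the "green token" watermark: seed a PRNG on the previous token,
# mark a fraction gamma of the vocabulary green, add a bias delta to green
# logits at sampling time, and detect via a one-proportion z-test.
import hashlib, math
import numpy as np

VOCAB_SIZE, GAMMA, DELTA = 50_000, 0.25, 2.0   # illustrative values

def green_ids(prev_token: int) -> np.ndarray:
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    return rng.choice(VOCAB_SIZE, size=int(GAMMA * VOCAB_SIZE), replace=False)

def bias_logits(logits: np.ndarray, prev_token: int) -> np.ndarray:
    biased = logits.copy()
    biased[green_ids(prev_token)] += DELTA     # softly promote green tokens
    return biased

def detect_z(tokens: list[int]) -> float:
    """One-proportion z-test: #green hits vs. expectation gamma*T under H0."""
    hits = sum(t in set(green_ids(p)) for p, t in zip(tokens, tokens[1:]))
    T = len(tokens) - 1
    return (hits - GAMMA * T) / math.sqrt(T * GAMMA * (1 - GAMMA))
```

A text is flagged when the z-statistic exceeds a threshold chosen for the desired false-positive rate; for instance, z ≈ 4 corresponds to a one-sided p-value around 3e-5.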
As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: how reliable is watermarking in realistic settings in the wild, where watermarked text may be modified to suit a user's needs or entirely rewritten to avoid detection? We study the robustness of watermarked text after it is rewritten by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text, resulting in high-confidence detections once enough tokens are observed. For example, after strong human paraphrasing, the watermark is detectable after observing an average of 800 tokens, at a false-positive rate of 1e-5. We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.
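One span-sensitive scheme of the kind alluded to above can be sketched as a windowed maximum of the green-fraction z-statistic: compute the statistic over every sufficiently long contiguous window and take the maximum. The sketch below is quadratic-time and omits the multiple-testing correction the maximum requires; it builds on the `detect_z` statistic sketched earlier.

```python
# "WinMax"-style sketch: scan all contiguous windows of green/red indicators
# and return the maximum z-statistic, so a short watermarked span inside a
# long human-written document can still trigger detection.
import math

def window_max_z(is_green: list[int], gamma: float, min_len: int = 50) -> float:
    prefix = [0]
    for g in is_green:
        prefix.append(prefix[-1] + g)          # prefix sums of green hits
    best = float("-inf")
    n = len(is_green)
    for i in range(n - min_len + 1):
        for j in range(i + min_len, n + 1):
            T = j - i
            hits = prefix[j] - prefix[i]
            z = (hits - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
            best = max(best, z)
    return best                                 # -inf if text is too short
```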
Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests.
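A hedged sketch of the attack's core move: tokens that a proxy model finds surprising in the watermarked text serve as a stand-in for the green list, and their logits are penalized while paraphrasing. The surprisal cutoff and bias magnitude below are illustrative assumptions, not BIRA's exact settings.

```python
# Sketch of a bias-inversion rewrite step in the spirit of BIRA: high-surprisal
# tokens under a proxy model (a proxy for watermark-promoted tokens) are
# collected into a suppression set, whose logits get a negative bias during
# rewriting. Cutoff and bias size are illustrative.
import numpy as np

def suppression_set(token_logprobs: dict[int, float], cutoff: float = -6.0):
    """Tokens whose log-prob under the proxy model falls below `cutoff`."""
    return {tok for tok, lp in token_logprobs.items() if lp < cutoff}

def bias_inverted_logits(logits: np.ndarray, suppress: set[int],
                         bias: float = 3.0) -> np.ndarray:
    out = logits.copy()
    for tok in suppress:
        out[tok] -= bias          # negative bias inverts the watermark's push
    return out
```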
Text watermarking algorithms for large language models (LLMs) can effectively identify machine-generated text by embedding and detecting hidden features in the text. Although current text watermarking algorithms perform well in most high-entropy scenarios, their performance in low-entropy scenarios still needs improvement. In this work, we argue that the influence of token entropy should be fully considered in the watermark detection process, i.e., the weight of each token during watermark detection should be customized according to its entropy, rather than setting the weights of all tokens to the same value as in previous methods. Specifically, we propose Entropy-based Text Watermarking Detection (EWD), which gives higher-entropy tokens greater influence weights during watermark detection, so as to better reflect the degree of watermarking. Furthermore, the proposed detection process is training-free and fully automated. Experiments demonstrate that EWD achieves better detection performance in low-entropy scenarios, and the method is general and can be applied to texts with different entropy distributions. Code and data are available at https://github.com/luyijian3/EWD; the algorithm can also be accessed through MarkLLM (https://github.com/THU-BPM/MarkLLM).
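The core idea reduces to replacing the uniform green-fraction statistic with an entropy-weighted one, so near-deterministic tokens contribute little. A minimal sketch follows; the weight normalization is an assumption beyond what the abstract specifies.

```python
# Sketch of entropy-weighted watermark detection (EWD's core idea): each
# token's green/red vote is weighted by the entropy of the model's predictive
# distribution at that position.
import numpy as np

def token_entropy(probs: np.ndarray) -> float:
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def ewd_score(is_green: np.ndarray, entropies: np.ndarray, gamma: float) -> float:
    """Weighted analogue of the green-fraction z-statistic.

    is_green:  (T,) 0/1 green-list indicators per token
    entropies: (T,) predictive entropy at each position (the weights)
    """
    w = entropies / (entropies.sum() + 1e-9)
    stat = (w * (is_green - gamma)).sum()
    # Variance of a weighted sum of Bernoulli(gamma) votes under H0.
    std = np.sqrt(gamma * (1 - gamma) * (w ** 2).sum())
    return stat / std
```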
Watermarking has become a practical tool for tracing language model outputs, but it modifies at inference time the token probabilities that alignment training carefully tuned. This creates a tension: how do watermark-induced shifts interact with the procedures intended to make models safe and useful? Experiments on several contemporary models and two representative watermarking schemes reveal that watermarking induces a nontrivial, patterned, yet model-specific shift in alignment. We observe two failure modes: guard attenuation, where models become more helpful but less safe, and guard amplification, where refusals become overly conservative. These effects persist even after controlling for perplexity degradation, pointing to alignment-specific distortions rather than mere quality loss. We address this with Alignment Resampling (AR), a procedure that samples multiple watermarked outputs and selects the most aligned response according to an external reward model. Using standard results on the expected maximum of Gaussian random variables, we derive a theoretical lower bound showing that alignment gains grow sublogarithmically with sample size. In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection. This is the first empirical study of watermarking-alignment interactions; it shows that a simple inference-time fix can recover alignment.
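The AR procedure itself is a best-of-k selection, sketched below; `generate_fn` and `reward_fn` are assumed interfaces, not a specific library's API.

```python
# Sketch of Alignment Resampling: generate several watermarked candidates and
# return the one an external reward model scores highest.
def alignment_resample(prompt: str, generate_fn, reward_fn, k: int = 4) -> str:
    """Best-of-k selection over watermarked generations.

    generate_fn(prompt) -> str        # one watermarked sample (stochastic)
    reward_fn(prompt, text) -> float  # external alignment reward
    """
    candidates = [generate_fn(prompt) for _ in range(k)]
    return max(candidates, key=lambda text: reward_fn(prompt, text))
```

Per the abstract, k between 2 and 4 is typically enough to recover unwatermarked alignment performance.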
The rapid growth of Large Language Models (LLMs) raises concerns about distinguishing AI-generated text from human content. Existing watermarking techniques, such as the KGW green-list scheme, struggle with low watermark strength and stringent false-positive requirements. Our analysis reveals that current methods rely on coarse estimates of non-watermarked text, limiting watermark detectability. To address this, we propose the Bipolar Watermark, which splits generated text into positive and negative poles, enhancing detection without requiring additional computational resources or knowledge of the prompt. Theoretical analysis and experimental results demonstrate the Bipolar Watermark's effectiveness and compatibility with existing optimization techniques, providing a new optimization dimension for watermarking in LLM-generated content.
As large language models (LLMs) reach human-like fluency, reliably distinguishing AI-generated text from human authorship becomes increasingly difficult. While watermarks already exist for LLMs, they often lack flexibility and struggle with attacks such as paraphrasing. To address these issues, we propose a multi-feature method that combines multiple distinct watermark features into an ensemble watermark. Concretely, we combine acrostica and sensorimotor norms with the established red-green watermark to achieve a 98% detection rate. After a paraphrasing attack, performance remains high, with a 95% detection rate; by comparison, the red-green feature alone achieves a 49% detection rate after paraphrasing. Evaluating all feature combinations reveals that the ensemble of all three consistently has the highest detection rate across several LLMs and watermark-strength settings. Because features can be flexibly combined in the ensemble, various requirements and trade-offs can be addressed, and the same detection function can be used without adaptation for all ensemble configurations. This method is of particular interest for facilitating accountability and preventing societal harm.
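The paper reports that one detection function covers all ensemble configurations; as a stand-in for how per-feature evidence can be pooled, the sketch below uses Stouffer's z-combination, a standard method that is not necessarily the authors' exact function.

```python
# Hedged sketch: pooling per-feature watermark test statistics. Stouffer's
# z-combination shown here is a standard stand-in, not the paper's method.
import math

def stouffer(z_scores: list[float]) -> float:
    """Combine independent z-statistics into one ensemble statistic."""
    return sum(z_scores) / math.sqrt(len(z_scores))

# e.g., z-scores from a red-green test, an acrostic check, and a
# sensorimotor-norm shift test (values here are illustrative):
ensemble_z = stouffer([3.1, 2.4, 1.8])
```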
To mitigate potential risks associated with language models, recent AI detection research proposes incorporating watermarks into machine-generated text through random vocabulary restrictions and utilizing this information for detection. While these watermarks only induce a slight deterioration in perplexity, our empirical investigation reveals a significant detriment to the performance of conditional text generation. To address this issue, we introduce a simple yet effective semantic-aware watermarking algorithm that considers the characteristics of conditional text generation and the input context. Experimental results demonstrate that our proposed method yields substantial improvements across various text generation models, including BART and Flan-T5, in tasks such as summarization and data-to-text generation while maintaining detection ability.
We propose Easymark, a family of embarrassingly simple yet effective watermarks. Text watermarking is becoming increasingly important with the advent of large language models (LLMs), which can generate texts indistinguishable from human-written ones; this poses a serious problem for the credibility of text. Easymark can inject a watermark without changing the meaning of the text at all, while a validator can detect with high confidence whether a text was generated by a system that adopted Easymark. Easymark is extremely easy to implement, requiring only a few lines of code, and it does not require access to LLMs, so it can be deployed on the user side when LLM providers do not offer watermarked LLMs. In spite of its simplicity, it achieves higher detection accuracy and BLEU scores than state-of-the-art text watermarking methods. We also prove an impossibility theorem for perfect watermarking, which is valuable in its own right: no matter how sophisticated a watermark is, a malicious user can remove it from the text, which motivates the use of a simple watermark such as Easymark. We carry out experiments with LLM-generated texts and confirm that Easymark can be detected reliably without any degradation of BLEU or perplexity, outperforming state-of-the-art watermarks in both quality and reliability.
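In the spirit of Easymark's surface-level variants, here is a minimal sketch of a whitespace-substitution watermark; the specific codepoint (U+2004) is an assumption for illustration.

```python
# Minimal sketch of a whitespace-substitution watermark in the spirit of
# Easymark: replace ordinary spaces (U+0020) with a visually near-identical
# Unicode space so a validator can check for the marked codepoint.
MARK = "\u2004"  # THREE-PER-EM SPACE, renders like a normal space

def embed(text: str) -> str:
    return text.replace(" ", MARK)

def detect(text: str, min_hits: int = 5) -> bool:
    # A handful of marked spaces almost never occurs in organic text,
    # so a small count threshold yields high-confidence detection.
    return text.count(MARK) >= min_hits
```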
After merging, eight parallel groups emerge: survey and detectability theory; detector methods and benchmark evaluation; stress tests of detection reliability; adversarial evasion and robust defense; watermarking and verifiable provenance; LLM-based rewriting / low-query sampling detection; real-world application governance; and an extension to hallucination detection in scientific text. Taken as a whole, the research starts from method construction and evaluation, then probes fragility through adversarial and reliability stress tests; in parallel, watermarking supplies verifiable evidence while LLM-as-tool detection paradigms and low-query acceleration strategies are developed; finally, misjudgments and side effects are assessed in broader application settings such as education and writing governance and the mitigation of hallucinations in scientific text.