AIGC Text Detection and Adversarial Countermeasures
Novel Detection Architectures and Multi-Dimensional Statistical/Linguistic Feature Mining
This cluster focuses on developing efficient detection architectures (e.g., mixture-of-experts networks, Transformer variants, and lightweight models) and on mining the subtle statistical signatures of AI-generated text (e.g., perplexity, negative log-likelihood, probability curvature) together with its linguistic signatures (e.g., word-frequency distributions, syntactic structure, discourse coherence). A minimal sketch of a likelihood-based detection score follows the paper list below.
- MAFD: Multiple Adversarial Features Detector for Enhanced Detection of Text-Based Adversarial Examples(Kaiwen Jin, Yifeng Xiong, Shuya Lou, Zhen Yu, 2024, Neural Processing Letters)
- CLULab-UofA at SemEval-2024 Task 8: Detecting Machine-Generated Text Using Triplet-Loss-Trained Text Similarity and Text Classification(Mohammadhossein Rezaei, Yeaeun Kwon, Reza Sanayei, Abhyuday Singh, Steven Bethard, 2024, No journal)
- Diversity Boosts AI-Generated Text Detection(Advik Raj Basani, Pin-Yu Chen, 2025, ArXiv)
- Robust detection of LLM-generated text through transfer learning with pre-trained Distilled BERT model(Jayaprakash Sundararaj, Durgaraman Maruthavanan, Deepak Jayabalan, Ashok Gadi Parthi, Balakrishna Pothineni, Vidyasagar Parlapalli, 2024, European Journal of Computer Science and Information Technology)
- MLSDET: Multi-LLM Statistical Deep Ensemble for Chinese AI-Generated Text Detection(Dianhui Mao, Denghui Zhang, Ao Zhang, Zhihua Zhao, 2025, ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Enhancing AI-Generated Text Identification with BERT-CNN and DistilBERT-BiLSTM Models(Rajsekhar Das, Ricky Dey, Sorbojit Mondal, Nabanita Das, Bikash Sadhukhan, 2025, 2025 International Conference on Computing, Intelligence, and Application (CIACON))
- Machine Learning Model for Paraphrases Detection Based on Text Content Pair Binary Classification(N. Kholodna, V. Vysotska, O. Markiv, Sofiia Chyrun, 2022, No journal)
- Advancing Machine-Generated Text Detection from an Easy to Hard Supervision Perspective(Chenwang Wu, Yiu-ming Cheung, Bo Han, Defu Lian, 2025, ArXiv)
- USTC-BUPT at SemEval-2024 Task 8: Enhancing Machine-Generated Text Detection via Domain Adversarial Neural Networks and LLM Embeddings(Zikang Guo, Kaijie Jiao, Xingyu Yao, Yuning Wan, Haoran Li, Benfeng Xu, L. Zhang, Quan Wang, Yongdong Zhang, Zhendong Mao, 2024, No journal)
- Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature(Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, Yue Zhang, 2023, ArXiv)
- AI-Generated Text is Non-Stationary: Detection via Temporal Tomography(Alva West, Yixuan Weng, Minjun Zhu, Luodan Zhang, Zhen Lin, Guangsheng Bao, Yue Zhang, 2025, ArXiv)
- MAF-Detect: A Multi-Scale Adaptive Fusion Framework for Zero-Shot Detection of LLM-Generated Text(Zhenhan Bai, Yan Zheng, 2025, 2025 7th International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI))
- IPAD: Inverse Prompt for AI Detection - A Robust and Explainable LLM-Generated Text Detector(Zheng Chen, Yushi Feng, Changyang He, Yue Deng, Hongxi Pu, Bo Li, 2025, ArXiv)
- Threads of Subtlety: Detecting Machine-Generated Texts Through Discourse Motifs(Zae Myung Kim, K. H. Lee, P. Zhu, Vipul Raheja, Dongyeop Kang, 2024, No journal)
- Unveiling Large Language Models Generated Texts: A Multi-Level Fine-Grained Detection Framework(Zhen Tao, Zhiyu Li, Runyu Chen, Dinghao Xi, Wei Xu, 2024, ArXiv)
- DeTinyLLM: Efficient detection of machine-generated text via compact paraphrase transformation(Shilei Tan, Yongcheng Zhou, Haoxiang Liu, Xuesong Wang, Si Chen, Wei Gong, 2026, Inf. Fusion)
- Mixture of Detectors: A Compact View of Machine-Generated Text Detection(Sai Teja Lekkala, Annepaka Yadagiri, Arun Kumar Challa, Samatha Reddy Machireddy, Partha Pakray, Chukhu Chunka, 2025, ArXiv)
- Discourse Features Enhance Detection of Document-Level Machine-Generated Content(Yupei Li, M. Milling, Lucia Specia, Bjorn W. Schuller, 2024, 2025 International Joint Conference on Neural Networks (IJCNN))
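To make the statistical side of this cluster concrete, here is a minimal sketch of a likelihood-based detection score in the spirit of the perplexity/NLL zero-shot detectors above. GPT-2 is only a stand-in scoring model, and the threshold separating human from machine text would have to be calibrated on real data; this is an illustration, not any single paper's method.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_nll(text: str) -> float:
    """Mean negative log-likelihood per token (lower = more LM-like)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

nll = mean_nll("The quick brown fox jumps over the lazy dog.")
print(f"mean NLL: {nll:.3f}  perplexity: {math.exp(nll):.1f}")
```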
Cross-Domain Generalization and Out-of-Distribution (OOD) Robust Detection
These works target the performance degradation detectors suffer when facing unseen generator models (e.g., moving from GPT-3 to GPT-4), unfamiliar semantic domains (medical, legal), or cross-lingual settings, and improve generalization through feature disentanglement, contrastive learning, domain knowledge distillation, and related techniques. A sketch of the gradient-reversal trick behind domain-adversarial training follows the list.
- Are AI-Generated Text Detectors Robust to Adversarial Perturbations?(Guanhua Huang, Yuchen Zhang, Zhe Li, Yongjian You, Mingze Wang, Zhouwang Yang, 2024, ArXiv)
- Text Meets Topology: Rethinking Out-of-distribution Detection in Text-Rich Networks(Danny Wang, Ruihong Qiu, Guangdong Bai, Zi Huang, 2025, ArXiv)
- Enhancing Domain Generalization for Robust Machine-Generated Text Detection(Sungwon Park, Sungwon Han, Meeyoung Cha, 2025, IEEE Transactions on Knowledge and Data Engineering)
- mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection(D. Macko, 2025, ArXiv)
- Increasing the Robustness of the Fine-tuned Multilingual Machine-Generated Text Detectors(D. Macko, Róbert Móro, Ivan Srba, 2025, ArXiv)
- Learning Representations through Contrastive Strategies for a more Robust Stance Detection(Udhaya Kumar Rajendran, Amir Ben Khalifa, Amine Trabelsi, 2023, 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA))
- DivScore: Zero-Shot Detection of LLM-Generated Text in Specialized Domains(Zhihui Chen, Kai He, Yucheng Huang, Yunxiao Zhu, Mengling Feng, 2025, No journal)
- G3Detector: General GPT-Generated Text Detector(Haolan Zhan, Xuanli He, Qiongkai Xu, Yuxiang Wu, Pontus Stenetorp, 2023, ArXiv)
- EAGLE: A Domain Generalization Framework for AI-generated Text Detection(Amrita Bhattacharjee, Raha Moraffah, Joshua Garland, Huan Liu, 2024, ArXiv)
- RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns(Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S. Chao, Derek F. Wong, 2025, Trans. Assoc. Comput. Linguistics)
- Robust AI-Generated Text Detection by Restricted Embeddings(Kristian Kuznetsov, Eduard Tulchinskii, Laida Kushnareva, German Magai, Serguei Barannikov, Sergey Nikolenko, Irina Piontkovskaya, 2024, ArXiv)
- Authorship Obfuscation in Multilingual Machine-Generated Text Detection(D. Macko, Róbert Móro, Adaku Uchendu, Ivan Srba, Jason Samuel Lucas, Michiharu Yamashita, Nafis Irtiza Tripto, Dongwon Lee, Jakub Simko, Maria Bielikova, 2024, ArXiv)
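Several of the works above (e.g., the USTC-BUPT system and EAGLE) rely on domain-adversarial training to learn generator-invariant features. The sketch below shows the core gradient reversal trick; the module sizes and heads are illustrative assumptions, not a reproduction of any cited system.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity on the forward pass, negated (scaled) gradient on the
        # backward pass: the feature extractor learns to *confuse* the
        # domain head while still serving the human/machine label head.
        return -ctx.lam * grad_output, None

class DannDetector(nn.Module):
    def __init__(self, dim=768, n_domains=4):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(dim, 256), nn.ReLU())
        self.label_head = nn.Linear(256, 2)           # human vs. machine
        self.domain_head = nn.Linear(256, n_domains)  # which generator/domain

    def forward(self, x, lam=1.0):
        h = self.features(x)
        return self.label_head(h), self.domain_head(GradReverse.apply(h, lam))

label_logits, domain_logits = DannDetector()(torch.randn(8, 768))
print(label_logits.shape, domain_logits.shape)
```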
Adversarial Attacks and Text "De-Fingerprinting" Evasion Techniques
From the attacker's perspective, these studies examine how recursive paraphrasing, prompt engineering (e.g., Self-Disguise), character-level perturbation, style transfer, or exploitation of physical-format (PDF) loopholes can erase the statistical fingerprints of AI text, allowing it to evade detection while preserving semantics. A toy character-level perturbation is sketched after the list.
- Leveraging Multi-Model Linguistic Fusion for Enhanced AI Text Generation Evasion(Afsar Khan, Sailesh Rajagopalan, 2025, 2025 2nd International Generative AI and Computational Language Modelling Conference (GACLM))
- Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors(Hao Fang, Jiawei Kong, Tianqu Zhuang, Yixiang Qiu, Kuofeng Gao, Bin Chen, Shutao Xia, Yaowei Wang, Min Zhang, 2025, No journal)
- Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors(Andrea Pedrotti, Michele Papucci, Cristiano Ciaccio, Alessio Miaschi, Giovanni Puccetti, F. Dell’Orletta, Andrea Esuli, 2025, ArXiv)
- Self-Disguise Attack: Induce the LLM to disguise itself for AIGT detection evasion(Yinghan Zhou, Juan Wen, Wanli Peng, Zhengxian Wu, Ziwei Zhang, Yiming Xue, 2025, ArXiv)
- AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs(Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian, 2024, ArXiv)
- Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack(Ying Zhou, Ben He, Le Sun, 2024, No journal)
- GPT-4 Attempting to Attack AI-Text Detectors(Alshehri Nojoud, Li Yuhao, 2024, No journal)
- GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors(Wenlong Meng, Shuguo Fan, Chengkun Wei, Min Chen, Yuwei Li, Yuanchao Zhang, Zhikun Zhang, Wenzhi Chen, 2025, No journal)
- MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization(Yongtong Gu, Songze Li, X. Hu, 2026, ArXiv)
- Sentences Based Adversarial Attack on AI-Generated Text Detectors(Rongxin Tu, Xiangui Kang, C. Tan, Chi-Hung Chi, Kwok-Yan Lam, 2026, IEEE Transactions on Big Data)
- Complete Evasion, Zero Modification: PDF Attacks on AI Text Detection(Aldan Creo, 2025, ArXiv)
- Adversarial Attacks on AI-Generated Text Detection Models: A Token Probability-Based Approach Using Embeddings(Ahmed K. Kadhim, Lei Jiao, R. Shafik, Ole-Christoffer Granmo, 2025, ArXiv)
- Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text(Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, S. Feizi, 2025, ArXiv)
- RAFT: Realistic Attacks to Fool Text Detectors(James Wang, Ran Li, Junfeng Yang, Chengzhi Mao, 2024, ArXiv)
- Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors(Tianchun Wang, Yuanzhou Chen, Zichuan Liu, Zhanwen Chen, Haifeng Chen, Xiang Zhang, Wei Cheng, 2024, ArXiv)
- Enhancing the Undetectability of AI-Generated Text: A Semantics-Preserving Evasion Technique(Bei-Bei Luo, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection(Xinlin Peng, Ying Zhou, Ben He, Le Sun, Yingfei Sun, 2024, ArXiv)
- RedHerring Attack: Testing the Reliability of Attack Detection(Jonathan Rusert, 2025, No journal)
- Interpretable Adversarial Perturbation in Input Embedding Space for Text(Motoki Sato, Jun Suzuki, Hiroyuki Shindo, Yuji Matsumoto, 2018, ArXiv)
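As a concrete illustration of the character-level end of this attack spectrum, the toy perturbation below swaps Latin letters for Cyrillic look-alikes and injects zero-width characters, two simple tricks that silently break tokenization-dependent detectors. The mapping and rates are invented for illustration; real attacks are more targeted.

```python
import random

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440"}  # Cyrillic look-alikes
ZERO_WIDTH = "\u200b"  # zero-width space

def perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])    # swap in a visually identical glyph
        elif ch == " " and rng.random() < rate:
            out.append(ZERO_WIDTH + " ")  # silently break tokenization
        else:
            out.append(ch)
    return "".join(out)

print(perturb("machine generated text often has low perplexity", rate=0.3))
```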
Adversarial-Game Defenses and Robustness-Enhancement Mechanisms
These works study how adversarial training, joint games (e.g., the co-evolution of a detector and a paraphraser), semantically invariant feature extraction, and paraphrase-inversion techniques can harden detectors against malicious perturbation and sophisticated rewriting; a toy version of the detector-paraphraser loop is sketched after the list.
- Enhancing the robustness of Fast-DetectGPT against paraphrase attacks(Suning Li, 2024, 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT))
- OUTFOX: LLM-generated Essay Detection through In-context Learning with Adversarially Generated Examples(Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki, 2023, No journal)
- RADAR: Robust AI-Text Detection via Adversarial Learning(Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho, 2023, ArXiv)
- Robust Detection of Paraphrased AI-Generated Text Using Deep Recurrent Neural Networks(M. Shirpurkar, 2026, International Journal for Research in Applied Science and Engineering Technology)
- GravText: A Robust Framework for Detecting LLM-Generated Text Using Triplet Contrastive Learning with Gravitational Factor(Youling Feng, Haoyu Wang, Jun Li, Zhongwei Cao, Linghao Yan, 2025, Syst.)
- Adversarial Training for Robust Natural Language Processing: A Focus on Sentiment Analysis and Machine Translation(Dr. B. Gomathy, Dr. T. Jayachitra, Dr. R. Rajkumar, Ms. V. Lalithamani, G. P. Ghantasala, Mr. I. Anantraj, Dr. C. Shyamala, G. Rajkumar, S. Saranya, 2024, Communications on Applied Nonlinear Analysis)
- Adversarial Robustness in Natural Language Processing: An Empirical Analysis of Machine Learning Model Vulnerabilities to Adversarial Attacks(Asheshemi Nelson Oghenekevwe, Okoro Akpohrobaro Daniel, Obode Aghogho Micheal, 2025, International Journal of Research and Innovation in Applied Science)
- Improving Gradient-based Adversarial Training for Text Classification by Contrastive Learning and Auto-Encoder(Yao Qiu, Jinchao Zhang, Jie Zhou, 2021, ArXiv)
- Defending mutation-based adversarial text perturbation: a black-box approach(Demetrio Deanda, I. Alsmadi, Jesus Guerrero, Gongbo Liang, 2025, Cluster Computing)
- PRDetect: Perturbation-Robust LLM-generated Text Detection Based on Syntax Tree(Xiang Li, Zhiyi Yin, Hexiang Tan, Shaoling Jing, Du Su, Yi Cheng, Huawei Shen, Fei Sun, 2025, No journal)
- Robustness of generative AI detection: adversarial attacks on black-box neural text detectors(Vitalii Fishchuk, Daniel Braun, 2024, International Journal of Speech Technology)
- Detecting the Invisible: Adversarial Strategies for AI-Generated Text in the LLM Era(Kent Alber Fredson, Yithro Paulus Tjendra, Leander Farrell Suryadi, Puti Andam Suri, 2025, 2025 8th International Conference on Information and Communications Technology (ICOIACT))
- Enhancing Neural Text Detector Robustness with μAttacking and RR-Training(G. Liang, Jesus Guerrero, Fengbo Zheng, I. Alsmadi, 2023, Electronics)
- Contrastive Triplet Learning for Robust Detection of AI-generated Content(Shraddha Vaidya, Jatinderkumar R. Saini, 2025, 2025 5th Asian Conference on Innovation in Technology (ASIANCON))
- Iron Sharpens Iron: Defending Against Attacks in Machine-Generated Text Detection with Adversarial Training(Yuanfang Li, Zhaohan Zhang, Chengzhengxu Li, Chao Shen, Xiaoming Liu, 2025, No journal)
- Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations(Sai Teja Lekkala, Annepaka Yadagiri, Sangam Sai Anish, Siva Gopala Krishna Nuthakki, Partha Pakray, 2025, 2026 20th International Conference on Ubiquitous Information Management and Communication (IMCOM))
- Detecting Word-Level Adversarial Text Attacks via SHapley Additive exPlanations(E. Mosca, Lukas Huber, Marc Alexander Kühn, G. Groh, 2022, No journal)
- DAMAGE: Detecting Adversarially Modified AI Generated Text(Elyas Masrour, Bradley Emi, Max Spero, 2025, ArXiv)
- Mitigating Paraphrase Attacks on Machine-Text Detection via Paraphrase Inversion(Rafael A. Rivera Soto, Barry Chen, Nicholas Andrews, 2024, No journal)
- Learning from Mistakes: Self-correct Adversarial Training for Chinese Unnatural Text Correction(Xuan Feng, Tianlong Gu, Xiaoli Liu, Liang Chang, 2024, ArXiv)
- Exploration of Contrastive Learning Strategies toward more Robust Stance Detection(Udhaya Kumar Rajendran, Amine Trabelsi, 2023, No journal)
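The detector-paraphraser game that runs through this cluster (RADAR, OUTFOX, "Iron Sharpens Iron") can be summarized in a few lines. The toy loop below uses a TF-IDF classifier and a crude synonym-swap "paraphraser" purely so the alternating-update structure is runnable as-is; real systems put an LLM and a neural detector in these roles.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human = ["I honestly didn't expect the ending.",
         "We got lost twice on the way there."]
machine = ["The results demonstrate significant improvements.",
           "In conclusion, the proposed method is effective."]

def toy_paraphrase(text: str, rng: random.Random) -> str:
    # Stand-in for an LLM paraphraser: crude probabilistic synonym swaps.
    swaps = {"demonstrate": "show", "significant": "notable",
             "proposed": "presented", "effective": "useful"}
    return " ".join(swaps.get(w, w) if rng.random() < 0.8 else w
                    for w in text.split())

rng = random.Random(0)
adv_pool = list(machine)
for rnd in range(3):
    # 1) Re-train the detector on human text vs. the growing adversarial pool.
    X, y = human + adv_pool, [0] * len(human) + [1] * len(adv_pool)
    detector = make_pipeline(TfidfVectorizer(), LogisticRegression())
    detector.fit(X, y)
    # 2) Paraphrase machine text; rewrites that fool the current detector
    #    are fed back as hard positives for the next round.
    rewrites = [toy_paraphrase(t, rng) for t in machine]
    adv_pool += [t for t in rewrites if detector.predict([t])[0] == 0]
    print(f"round {rnd}: pool size {len(adv_pool)}")
```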
Proactive Defense: Text Watermarking and Its Security
This cluster explores proactive defenses that embed hidden signals (watermarks) at LLM generation time, including semantic watermarks, frequency-based watermarks, and low-entropy enhancement techniques, and studies "color-aware" attacks against watermarks as well as watermark robustness to paraphrasing; a minimal green-list watermark detector is sketched after the list.
- An Entropy-based Text Watermarking Detection Method(Yijian Lu, Aiwei Liu, Dianzhi Yu, Jingjing Li, Irwin King, 2024, No journal)
- A Linguistics-Aware LLM Watermarking via Syntactic Predictability(Shinwoo Park, Hyejin Park, Hyeseon Ahn, Yo-Sub Han, 2025, ArXiv)
- DeepTextMark: Deep Learning based Text Watermarking for Detection of Large Language Model Generated Text(Travis J. E. Munyer, Xin Zhong, 2023, ArXiv)
- Bypassing LLM Watermarks with Color-Aware Substitutions(Qilong Wu, Varun Chandrasekaran, 2024, No journal)
- Provable Robust Watermarking for AI-Generated Text(Xuandong Zhao, P. Ananth, Lei Li, Yu-Xiang Wang, 2023, ArXiv)
- Toward Evasion-Resistant LLM Attribution with Multi-Scale Watermarking and Cryptographic Verification(Pieter Janssen, E. Conti, 2026, Frontiers in Artificial Intelligence Research)
- Post-Hoc Watermarking for Robust Detection in Text Generated by Large Language Models(Jifei Hao, Jipeng Qiang, Yi Zhu, Yun Li, Yunhao Yuan, Xiaoye Ouyang, 2025, No journal)
- CurveMark: Detecting AI-Generated Text via Probabilistic Curvature and Dynamic Semantic Watermarking(Yuhan Zhang, Xingxiang Jiang, Hua Sun, Yao Zhang, Deyu Tong, 2025, Entropy)
- Character-Level Perturbations Disrupt LLM Watermarks(Zhaoxi Zhang, Xiaomei Zhang, Yanjun Zhang, He Zhang, Shirui Pan, Bo Liu, Asif Gill, L. Zhang, 2025, ArXiv)
- Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks(Yixin Cheng, Hongcheng Guo, Yangming Li, Leonid Sigal, 2025, ArXiv)
- Adaptive Robust Watermarking for Large Language Models via Dynamic Token Embedding Perturbation(Ziyang Zeng, Han Lin, Shuxin Zhang, Boyuan Wang, 2026, IEEE Access)
- k-SemStamp: A Clustering-Based Semantic Watermark for Detection of Machine-Generated Text(A. Hou, Jingyu (Jack) Zhang, Yichen Wang, Daniel Khashabi, Tianxing He, 2024, ArXiv)
- FreqMark: Frequency-Based Watermark for Sentence-Level Detection of LLM-Generated Text(Zhenyu Xu, Kun Zhang, Victor S. Sheng, 2024, ArXiv)
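Most of the watermarking schemes above descend from the "green list" idea: bias generation toward a pseudorandom subset of the vocabulary keyed on context, then test how over-represented that subset is. A minimal detector-side sketch follows; the hashing scheme and parameters are illustrative, and the "color-aware" attacks cited above work precisely by identifying and substituting these green tokens.

```python
import hashlib
import math

def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    # Key a hash on the previous token; `token` is "green" if the hash
    # falls in the bottom `gamma` fraction of the hash range.
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < gamma

def watermark_zscore(tokens: list, gamma: float = 0.5) -> float:
    """z-score of the green-token count; large values suggest a watermark."""
    green = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    n = len(tokens) - 1  # number of scored tokens
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# Unwatermarked text should hover near z = 0; watermarked generation,
# which preferentially samples green tokens, drifts to large positive z.
print(watermark_zscore("the model generates fluent text at scale".split()))
```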
Systematic Evaluation Benchmarks, Theoretical Limits, and Vertical-Domain Applications
These works provide large-scale evaluation benchmarks (e.g., PADBen, SHIELD), analyze the theoretical limits of detection (e.g., via KL divergence and membership inference attacks), and assess practical effectiveness in vertical scenarios such as academic integrity, phishing-email detection, cybersecurity reporting, and mixed human-AI text; a sketch of the TPR@FPR metric these benchmarks emphasize follows the list.
- Telescope: Discovering Multilingual LLM Generated Texts with Small Specialized Language Models(Héctor Cerezo-Costas, Pedro Alonso Doval, Maximiliano Hormazábal-Lagos, Aldan Creo, 2024, No journal)
- Can AI-Generated Text be Reliably Detected?(Vinu Sankar Sadasivan, Aounon Kumar, S. Balasubramanian, Wenxiao Wang, S. Feizi, 2023, ArXiv)
- When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection(Lang Gao, Xuhui Li, Chenxi Wang, Mingzhe Li, Wei Liu, Zirui Song, Jinghui Zhang, Rui Yan, Preslav Nakov, Xiuying Chen, 2025, ArXiv)
- Machine Text Detectors are Membership Inference Attacks(Ryuto Koike, Liam Dugan, Masahiro Kaneko, Christopher Callison-Burch, Naoaki Okazaki, 2025, ArXiv)
- PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks(Yiwei Zha, Rui Min, Shanu Sushmita, 2025, ArXiv)
- Beyond Easy Wins: A Text Hardness-Aware Benchmark for LLM-generated Text Detection(Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee, 2025, ArXiv)
- Evaluating the Reliability of Generative AI in Distinguishing Machine from Human Text(Y. Yuhefizar, Ronal Watrianthos, Dony Marzuki, 2025, Data and Metadata)
- A Practical Examination of AI-Generated Text Detectors for Large Language Models(Brian Tufts, Xuandong Zhao, Lei Li, 2024, No journal)
- Detecting AI-Generated Text in Student Submissions Using Multi-Modal Classification(H. Yamashita, Lukas Meier, 2025, International Journal of Innovative Science and Research Technology)
- LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?(Qihui Zhang, Chujie Gao, Dongping Chen, Yue Huang, Yixin Huang, Zhenyang Sun, Shilin Zhang, Weiye Li, Zhengyan Fu, Yao Wan, Lichao Sun, 2024, No journal)
- Exposing AI-Generated Threat Reports through Semantic and Adversarial Detection Models(Manvi Breja, Namita Dahiya, 2025, 2025 2nd International Conference on Intelligent Systems for Cybersecurity (ISCS))
- Can AI-Generated Text be Reliably Detected? Stress Testing AI Text Detectors Under Various Attacks(Vinu Sankar Sadasivan, Aounon Kumar, S. Balasubramanian, Wenxiao Wang, S. Feizi, 2025, Trans. Mach. Learn. Res.)
- Explainable Transformer-Based Email Phishing Classification with Adversarial Robustness(P. SajadU, 2025, ArXiv)
- How You Prompt Matters! Even Task-Oriented Constraints in Instructions Affect LLM-Generated Text Detection(Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki, 2023, No journal)
- MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector(Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, Tao Jiang, 2024, ArXiv)
- SAVANA- A Robust Framework for Deepfake Video Detection and Hybrid Double Paraphrasing with Probabilistic Analysis Approach for AI Text Detection(Dr. Viomesh Kumar Singh, Bhavesh Agone, Aryan More, Aryan Mengawade, Atharva Deshmukh, Atharva Badgujar, 2024, International Journal for Research in Applied Science and Engineering Technology)
- CNLP-NITS-PP at GenAI Detection Task 3: Cross-Domain Machine-Generated Text Detection Using DistilBERT Techniques(Sai Teja Lekkala, Annepaka Yadagiri, Mangadoddi Srikar Vardhan, Partha Pakray, 2025, No journal)
- Random at GenAI Detection Task 3: A Hybrid Approach to Cross-Domain Detection of Machine-Generated Text with Adversarial Attack Mitigation(Shifali Agrahari, Prabhat Mishra, Sujit Kumar, 2025, No journal)
- Pangram at GenAI Detection Task 3: An Active Learning Approach to Machine-Generated Text Detection(Bradley Emi, Max Spero, Elyas Masrour, 2025, No journal)
- LuxVeri at GenAI Detection Task 3: Cross-Domain Detection of AI-Generated Text Using Inverse Perplexity-Weighted Ensemble of Fine-Tuned Transformer Models(Md Kamrujjaman Mobin, Md Saiful Islam, 2025, ArXiv)
- BBN-U.Oregon's ALERT system at GenAI Content Detection Task 3: Robust Authorship Style Representations for Cross-Domain Machine-Generated Text Detection(Hemanth Kandula, Chak-Fai Li, Haoling Qiu, Damianos G. Karakos, Hieu Man, T. Nguyen, Brian Ulicny, 2025, No journal)
- Investigating generative AI models and detection techniques: impacts of tokenization and dataset size on identification of AI-generated text(Haowei Hua, C. Yao, 2024, Frontiers in Artificial Intelligence)
- Multi-Class Detection of Humanized AI Text Using Machine Learning and Transformer Models(Batyr Sharimbayev, A. Kazin, Shirali Kadyrov, 2025, 2025 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE))
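A recurring theme in these evaluations is that AUROC hides deployment-relevant failure: what matters is the true positive rate at a strict false positive rate (TPR@FPR). Below is a sketch of that metric on toy scores, assuming higher detector scores mean "more likely machine-generated".

```python
import numpy as np

def tpr_at_fpr(human_scores, machine_scores, fpr=0.01):
    # Pick the threshold so at most `fpr` of human documents are flagged,
    # then measure how much machine text is still caught at that threshold.
    threshold = np.quantile(human_scores, 1 - fpr)
    return float(np.mean(np.asarray(machine_scores) > threshold))

rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, 10_000)    # toy detector scores on human text
machine = rng.normal(2.0, 1.0, 10_000)  # toy scores on machine text
print(f"TPR@1%FPR = {tpr_at_fpr(human, machine):.3f}")
```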
Taken together, the merged groups cover the full technical life cycle of AIGC text detection and its adversarial counterpart. The research has evolved from early, simple statistical binary classification into a sophisticated system that deeply mines linguistic features and uses adversarial games to improve robustness. The field currently shows three major trends. First, attack-defense co-evolution: attacks have shifted from simple paraphrasing to sophisticated "de-fingerprinting" evasion, while defenses have introduced adversarial training and proactive watermarking. Second, generalization and robustness: research attention has moved to failures across generator models, domains, and out-of-distribution data. Third, deployment and standardization: large-scale industry benchmarks and international shared tasks are driving adoption in real-world settings such as academic integrity and cybersecurity.
A total of 114 related papers.
In recent years, text-generation tools built on Artificial Intelligence (AI) have occasionally been misused across various domains, such as generating student reports or creative writing. This issue prompts plagiarism-detection services to enhance their capabilities in identifying AI-generated content. Adversarial attacks are often used to test the robustness of AI-generated text detectors. This work proposes a novel textual adversarial attack on detection models such as Fast-DetectGPT. The method employs embedding models for data perturbation, reconstructing AI-generated texts to reduce the likelihood that their true origin is detected. Specifically, we employ different embedding techniques, including the Tsetlin Machine (TM), an interpretable machine-learning approach, for this purpose. By combining synonyms and embedding similarity vectors, we demonstrate a state-of-the-art reduction in detection scores against Fast-DetectGPT. In particular, on the XSum dataset the detection score decreased from 0.4431 to 0.2744 AUROC, and on the SQuAD dataset it dropped from 0.5068 to 0.3532 AUROC.
Detecting AI-generated text is an increasing necessity to combat misuse of LLMs in education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. While prior detectors often rely on token-level likelihoods or opaque black-box classifiers, these approaches struggle against high-quality generations and offer little interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to 18.7% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.
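A hedged sketch of the surprisal-variability idea described above: score each token's surprisal under a small LM and summarize how much it fluctuates. The feature set and the choice of GPT-2 are assumptions for illustration, not the DivEye implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def surprisal_features(text: str) -> dict:
    ids = tok(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logp = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
    s = -logp.gather(1, ids[0, 1:, None]).squeeze(1)  # per-token surprisal
    return {"mean": s.mean().item(),
            "std": s.std().item(),   # "burstiness": humans fluctuate more
            "max": s.max().item()}

print(surprisal_features("The committee will reconvene after lunch, weather permitting."))
```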
With the advancement in capabilities of Large Language Models (LLMs), one major step in the responsible and safe use of such LLMs is to be able to detect text generated by these models. While supervised AI-generated text detectors perform well on text generated by older LLMs, with the frequent release of new LLMs, building supervised detectors for identifying text from such new models would require new labeled training data, which is infeasible in practice. In this work, we tackle this problem and propose a domain generalization framework for the detection of AI-generated text from unseen target generators. Our proposed framework, EAGLE, leverages the labeled data that is available so far from older language models and learns features invariant across these generators, in order to detect text generated by an unknown target generator. EAGLE learns such domain-invariant features by combining the representational power of self-supervised contrastive learning with domain adversarial training. Through our experiments we demonstrate how EAGLE effectively achieves impressive performance in detecting text generated by unseen target generators, including recent state-of-the-art ones such as GPT-4 and Claude, reaching detection scores of within 4.7% of a fully supervised detector.
AI Generated Text (AIGT) detectors are developed with texts from humans and LLMs of common tasks. Despite the diversity of plausible prompt choices, these datasets are generally constructed with a limited number of prompts. The lack of prompt variation can introduce prompt-specific shortcut features that exist in data collected with the chosen prompt, but do not generalize to others. In this paper, we analyze the impact of such shortcuts in AIGT detection. We propose Feedback-based Adversarial Instruction List Optimization (FAILOpt), an attack that searches for instructions deceptive to AIGT detectors exploiting prompt-specific shortcuts. FAILOpt effectively drops the detection performance of the target detector, performing comparably to other attacks based on adversarial in-context examples. We also utilize our method to enhance the robustness of the detector by mitigating the shortcuts. Based on the findings, we further train the classifier with the dataset augmented by the FAILOpt prompt. The augmented classifier exhibits improvements across generation models, tasks, and attacks. Our code will be available at https://github.com/zxcvvxcz/FAILOpt.
With the development of large language models (LLMs), detecting whether text is generated by a machine becomes increasingly challenging in the face of malicious use cases like the spread of false information, protection of intellectual property, and prevention of academic plagiarism. While well-trained text detectors have demonstrated promising performance on unseen test data, recent research suggests that these detectors have vulnerabilities when dealing with adversarial attacks, such as paraphrasing. In this paper, we propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection. We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model’s robustness against such attacks. The empirical results reveal that the current detection model can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content. Furthermore, we explore the prospect of improving the model’s robustness over iterative adversarial learning. Although some improvements in model robustness are observed, practical applications still face significant challenges. These findings shed light on the future development of AI-text detectors, emphasizing the need for more accurate and robust detection methods.
The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to simple paraphrasing attack--which, ironically, increases the true positive at 1% false positive (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT--adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors--including neural network-based, watermark-based, and zero-shot approaches--our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success to find that our method can significantly reduce detection rates, with mostly a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in the light of increasingly sophisticated evasion techniques.
The growth of highly advanced Large Language Models (LLMs) constitutes a huge dual-use problem, making it necessary to create dependable AI-generated text detection systems. Modern detectors are notoriously vulnerable to adversarial attacks, with paraphrasing standing out as an effective evasion technique that foils statistical detection. This paper presents a comparative study of adversarial robustness, first by quantifying the limitations of standard adversarial training and then by introducing a novel, significantly more resilient detection framework: Perturbation-Invariant Feature Engineering (PIFE). The framework enhances detection by first transforming input text into a standardized form using a multi-stage normalization pipeline; it then quantifies the transformation's magnitude using metrics like Levenshtein distance and semantic similarity, feeding these signals directly to the classifier. We evaluate both a conventionally hardened Transformer and our PIFE-augmented model against a hierarchical taxonomy of character-, word-, and sentence-level attacks. Our findings first confirm that conventional adversarial training, while resilient to syntactic noise, fails against semantic attacks, an effect we term the "semantic evasion threshold": its True Positive Rate at a strict 1% False Positive Rate plummets to 48.8%. In stark contrast, our PIFE model, which explicitly engineers features from the discrepancy between a text and its canonical form, overcomes this limitation. It maintains a remarkable 82.6% TPR under the same conditions, effectively neutralizing the most sophisticated semantic attacks. This superior performance demonstrates that explicitly modeling perturbation artifacts, rather than merely training on them, is a more promising path toward achieving genuine robustness in the adversarial arms race.
This paper presents our approach for Task 3 of the GenAI content detection workshop at COLING-2025, focusing on Cross-Domain Machine-Generated Text (MGT) Detection. We propose an ensemble of fine-tuned transformer models, enhanced by inverse perplexity weighting, to improve classification accuracy across diverse text domains. For Subtask A (Non-Adversarial MGT Detection), we combined a fine-tuned RoBERTa-base model with an OpenAI detector-integrated RoBERTa-base model, achieving an aggregate TPR score of 0.826, ranking 10th out of 23 detectors. In Subtask B (Adversarial MGT Detection), our fine-tuned RoBERTa-base model achieved a TPR score of 0.801, securing 8th out of 22 detectors. Our results demonstrate the effectiveness of inverse perplexity-based weighting for enhancing generalization and performance in both non-adversarial and adversarial MGT detection, highlighting the potential for transformer models in cross-domain AI-generated content detection.
The widespread use of large language models (LLMs) has sparked concerns about the potential misuse of AI-generated text, as these models can produce content that closely resembles human-generated text. Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations, with even minor changes in characters or words causing a reversal in distinguishing between human-created and AI-generated text. This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN). The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations. We also propose a siamese calibration technique to train the model to make equally confident predictions under different noise, which improves the model's robustness against adversarial perturbations. Experiments on four publicly available datasets show that the SCRN outperforms all baseline methods, achieving 6.5%-18.25% absolute accuracy improvement over the best baseline method under adversarial attacks. Moreover, it exhibits superior generalizability in cross-domain, cross-genre, and mixed-source scenarios. The code is available at https://github.com/CarlanLark/Robust-AIGC-Detector.
The field of AI-generated text detection has evolved from supervised classification to zero-shot statistical analysis. However, current approaches share a fundamental limitation: they aggregate token-level measurements into scalar scores, discarding positional information about where anomalies occur. Our empirical analysis reveals that AI-generated text exhibits significant non-stationarity: statistical properties vary 73.8% more between text segments than in human writing. This discovery explains why existing detectors fail against localized adversarial perturbations that exploit this overlooked characteristic. We introduce Temporal Discrepancy Tomography (TDT), a novel detection paradigm that preserves positional information by reformulating detection as a signal processing task. TDT treats token-level discrepancies as a time-series signal and applies a Continuous Wavelet Transform to generate a two-dimensional time-scale representation, capturing both the location and linguistic scale of statistical anomalies. On the RAID benchmark, TDT achieves 0.855 AUROC (a 7.1% improvement over the best baseline). More importantly, TDT demonstrates robust performance on adversarial tasks, with a 14.1% AUROC improvement on HART Level 2 paraphrasing attacks. Despite its sophisticated analysis, TDT maintains practical efficiency with only 13% computational overhead. Our work establishes non-stationarity as a fundamental characteristic of AI-generated text and demonstrates that preserving temporal dynamics is essential for robust detection.
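To illustrate the signal-processing reformulation described above, the sketch below applies a continuous wavelet transform (via PyWavelets) to a toy per-token discrepancy series with one localized anomaly. The discrepancy measure, wavelet choice, and scales are assumptions, not the paper's exact pipeline.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
# Toy discrepancy series: mostly flat (LM-like) with one "humanized" burst.
signal = rng.normal(0.5, 0.05, 200)
signal[120:140] += rng.normal(1.5, 0.3, 20)   # localized anomaly

scales = np.arange(1, 33)
coeffs, _ = pywt.cwt(signal, scales, "morl")  # (n_scales, n_tokens) map
# A detector can pool over this 2-D map instead of one scalar score;
# the burst shows up as high energy localized in both time and scale.
si, ti = np.unravel_index(np.abs(coeffs).argmax(), coeffs.shape)
print(f"strongest response at scale {scales[si]}, token index {ti}")
```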
As LLM-generated text becomes increasingly human-like, detecting it, especially when paraphrased, becomes more challenging. This paper enhances the AI-Catcher model by introducing adversarial training using the DAIGT v4 dataset, with a key focus on adding perturbed samples during training and new linguistic and statistical features. The goal is to make the model more aware of paraphrastic variations, which often help LLM-generated content evade detection. This approach improves the model's robustness by exposing it to both human- and LLM-generated paraphrases, enabling better generalization and higher accuracy, especially in adversarial settings. Our experiments further validate the effectiveness of this enhancement, with the enhanced model outperforming the baseline. Specifically, the inclusion of new features led to a 0.6% increase in F1-score compared to the previous study, followed by an additional 0.8% gain after applying adversarial training.
Large language models (LLMs) pose significant challenges to content authentication, as their sophisticated generation capabilities make distinguishing AI-produced text from human writing increasingly difficult. Current detection methods suffer from limited information capture, poor rate–distortion trade-offs, and vulnerability to adversarial perturbations. We present CurveMark, a novel dual-channel detection framework that combines probability curvature analysis with dynamic semantic watermarking, grounded in information-theoretic principles to maximize mutual information between text sources and observable features. To address the limitation of requiring prior knowledge of source models, we incorporate a Bayesian multi-hypothesis detection framework for statistical inference without prior assumptions. Our approach embeds imperceptible watermarks during generation via entropy-aware, semantically informed token selection and extracts complementary features from probability curvature patterns and watermark-specific metrics. Evaluation across multiple datasets and LLM architectures demonstrates 95.4% detection accuracy with minimal quality degradation (perplexity increase < 1.3), achieving 85–89% channel capacity utilization and robust performance under adversarial perturbations (72–94% information retention).
While demonstrating its powerful text generation capabilities, large language models have also raised concerns about malicious behaviors such as academic fraud and the dissemination of fake news. Consequently, accurately identifying and detecting AI-generated text has become increasingly important. However, current research reveals that detectors targeting AI-generated text still have significant vulnerabilities. Although existing evasion methods have achieved some success in reducing detection rates, they lack effective control over semantic preservation. To address this issue, we propose a novel approach. First, we introduce a comprehensive replacement of continuous multi-word units to generate natural adversarial samples while ensuring semantic consistency. Second, we design an innovative method based on hybrid semantic primitives and synonyms to expand the candidate set and select the most semantically similar words for replacement using semantic similarity measures. Finally, we evaluate the effectiveness of our method on three of the most popular existing detectors. Experimental results show that our approach achieves the highest attack success rate while significantly lowering detection rates to below 50%, maintaining high semantic similarity and low grammatical error rates. Through these innovations, our method demonstrates remarkable advantages in generating natural and hard-to-detect adversarial texts, providing new insights and directions for future research.
The proliferation of large language models has raised growing concerns about their misuse, particularly in cases where AI-generated text is falsely attributed to human authors. Machine-generated content detectors claim to effectively identify such text under various conditions and from any language model. This paper critically evaluates these claims by assessing several popular detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, PHD, LogRank, Binoculars) on a range of domains, datasets, and models that these detectors have not previously encountered. We employ various prompting strategies to simulate practical adversarial attacks, demonstrating that even moderate efforts can significantly evade detection. We emphasize the importance of the true positive rate at a specific false positive rate (TPR@FPR) metric and demonstrate that these detectors perform poorly in certain settings, with TPR@.01 as low as 0%. Our findings suggest that both trained and zero-shot detectors struggle to maintain high sensitivity while achieving a reasonable true positive rate.
Large language models (LLMs) have exhibited remarkable capabilities in text generation tasks. However, the utilization of these models carries inherent risks, including but not limited to plagiarism, the dissemination of fake news, and issues in educational exercises. Although several detectors have been proposed to address these concerns, their effectiveness against adversarial perturbations, specifically in the context of student essay writing, remains largely unexplored. This paper aims to bridge this gap by constructing AIG-ASAP, an AI-generated student essay dataset, employing a range of text perturbation methods that are expected to generate high-quality essays while evading detection. Through empirical experiments, we assess the performance of current AIGC detectors on the AIG-ASAP dataset. The results reveal that the existing detectors can be easily circumvented using straightforward automatic adversarial attacks. Specifically, we explore word substitution and sentence substitution perturbation methods that effectively evade detection while maintaining the quality of the generated essays. This highlights the urgent need for more accurate and robust methods to detect AI-generated student essays in the education domain.
Large-scale Language Models (LLMs) have transformed the field of Natural Language Processing (NLP) by improving various applications like text generation, summarization, and machine translation. While they significantly improve performance in these areas, they also raise potential risks in the field of cybersecurity. In today's digital age, LLMs play a major role in creating AI-enhanced CTI reports that simulate human-authored reports, making the two difficult to distinguish. Traditional detection methods, such as lexical and rule-based approaches, are not capable enough to distinguish the reports. Thus, this paper presents an integrated semantic and adversarial detection framework to effectively distinguish human-authored and AI-synthesized CTI reports. The proposed framework utilizes semantic, contextual, and discourse features with adversarial training to protect against attacks. A hybrid dataset containing 16,000 reports was used to test the baseline BERT, TF-IDF+SVM, and our model. The proposed model achieves 94.2% accuracy and offers a viable path to enhancing CTI sharing platforms.
No abstract available
This paper introduces the system developed by USTC-BUPT for SemEval-2024 Task 8. The shared task comprises three subtasks across four tracks, aiming to develop automatic systems to distinguish between human-written and machine-generated text across various domains, languages and generators. Our system comprises four components: DATeD, LLAM, TLE, and AuDM, which empower us to effectively tackle all subtasks posed by the challenge. In the monolingual track, DATeD improves machine-generated text detection by incorporating a gradient reversal layer and integrating additional domain labels through Domain Adversarial Neural Networks, enhancing adaptation to diverse text domains. In the multilingual track, LLAM employs different strategies based on language characteristics. For English text, the LLM Embeddings approach utilizes embeddings from a proxy LLM followed by a two-stage CNN for classification, leveraging the broad linguistic knowledge captured during pre-training to enhance performance. For text in other languages, the LLM Sentinel approach transforms the classification task into a next-token prediction task, which facilitates easier adaptation to texts in various languages, especially low-resource languages. TLE utilizes the LLM Embeddings method with a minor modification in the classification strategy for subtask B. AuDM employs data augmentation and fine-tunes the DeBERTa model specifically for subtask C. Our system wins the multilingual track and ranks second in the monolingual track. Additionally, it achieves third place in both subtask B and C.
Large Language Models (LLMs) are gearing up to surpass human creativity. The veracity of that statement needs careful consideration. In recent developments, critical questions arise regarding the authenticity of human work and the preservation of human creativity and innovative abilities. This paper investigates such issues. It addresses machine-generated text detection across several scenarios: document-level binary and multiclass classification or generator attribution, sentence-level segmentation to differentiate human-AI collaborative text, and adversarial attacks aimed at reducing the detectability of machine-generated text. We introduce BMAS English, an English-language dataset supporting binary classification of human and machine text; multiclass classification, which not only identifies machine-generated text but also attempts to determine its generator; adversarial attack settings, since attacks are a common means of evading detection; and sentence-level segmentation, for predicting the boundaries between human- and machine-generated text. We believe that this paper will address previous work in Machine-Generated Text Detection (MGTD) in a more meaningful way.
The widespread use of AI-generated text has introduced significant security concerns, driving the need for reliable detection systems. However, recent studies reveal that neural network-based detectors are vulnerable to adversarial examples. To improve the robustness of such classifiers, a number of adversarial attack strategies have been developed, particularly in the context of text sentiment classification. Most existing adversarial attack methods focus on the semantics of individual words or sentences, often neglecting the broader contextual semantics of the entire text—particularly in the case of long AI-generated text. This limitation frequently results in adversarial examples that lack fluency and coherence. In this paper, we propose a novel method called Sentence-based Adversarial attack on AI-Generated Text detectors (SAGT), which generates linguistically fluent adversarial examples by inserting model-generated sentences into the original text. To ensure contextual semantic consistency, we extract important keywords from the original text—selected based on changes in the detector's confidence score—and incorporate them into the generated sentences. Extensive experimental results demonstrate that adversarial examples crafted by SAGT can effectively evade AI-generated text detectors.
The rapid advancement of large language models (LLMs) has made AI-generated text increasingly fluent and indistinguishable from human writing. However, malicious use of AI text for misinformation or plagiarism raises the need for reliable detectors. Simple detectors often fail when the AI-generated text is paraphrased by an adversary. In this work, we propose a robust detection framework based on deep recurrent neural networks (RNNs) that is resilient to paraphrasing. We compile a large dataset of AI-generated and human-written text (e.g., a 500K Kaggle corpus) and simulate paraphrase attacks using state-of-the-art paraphrasing models. Our model employs a multi-layer Long Short-Term Memory (LSTM) network to capture sequential patterns and is trained with both original and paraphrased samples. In experiments, the proposed RNN classifier achieves high accuracy on unaltered AI text and retains strong performance on paraphrased adversarial examples, far exceeding baseline detectors, whose accuracy drops sharply. These results demonstrate that deep recurrent models, when properly trained, can detect AI-generated content even under paraphrasing.
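For orientation, a minimal bidirectional LSTM detector of the kind described above might look as follows; the vocabulary size, depth, and pooling are illustrative assumptions, and the paraphrase-augmented training data is what actually carries the robustness.

```python
import torch
import torch.nn as nn

class LstmDetector(nn.Module):
    def __init__(self, vocab=30_000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb, padding_idx=0)
        self.lstm = nn.LSTM(emb, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # logit: human (0) vs. machine (1)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.head(h.mean(dim=1)).squeeze(-1)  # mean-pool over time

model = LstmDetector()
logits = model(torch.randint(1, 30_000, (4, 64)))  # batch of 4 toy sequences
print(torch.sigmoid(logits))  # P(machine-generated) per sequence
```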
No abstract available
The increasing misuse of AI-generated texts (AIGT) has motivated the rapid development of AIGT detection methods. However, the reliability of these detectors remains fragile against adversarial evasions. Existing attack strategies often rely on white-box assumptions or demand prohibitively high computational and interaction costs, rendering them ineffective under practical black-box scenarios. In this paper, we propose Multi-stage Alignment for Style Humanization (MASH), a novel framework that evades black-box detectors based on style transfer. MASH sequentially employs style-injection supervised fine-tuning, direct preference optimization, and inference-time refinement to shape the distributions of AI-generated texts to resemble those of human-written texts. Experiments across 6 datasets and 5 detectors demonstrate the superior performance of MASH over 11 baseline evaders. Specifically, MASH achieves an average Attack Success Rate (ASR) of 92%, surpassing the strongest baselines by an average of 24%, while maintaining superior linguistic quality.
The increased quality and human-likeness of AI generated texts has resulted in a rising demand for neural text detectors, i.e. software that is able to detect whether a text was written by a human or generated by an AI. Such tools are often used in contexts where the use of AI is restricted or completely prohibited, e.g. in educational contexts. It is, therefore, important for the effectiveness of such tools that they are robust towards deliberate attempts to hide the fact that a text was generated by an AI. In this article, we investigate a broad range of adversarial attacks in English texts with six different neural text detectors, including commercial and research tools. While the results show that no detector is completely invulnerable to adversarial attacks, the latest generation of commercial detectors proved to be very robust and not significantly influenced by most of the evaluated attack strategies.
No abstract available
Recent advances in large language models (LLMs) and the intensifying popularity of ChatGPT-like applications have blurred the boundary of high-quality text generation between humans and machines. However, in addition to the anticipated revolutionary changes to our technology and society, the difficulty of distinguishing LLM-generated texts (AI-text) from human-generated texts poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations of innocent writers. While existing works show that current AI-text detectors are not robust to LLM-based paraphrasing, this paper aims to bridge this gap by proposing a new framework called RADAR, which jointly trains a robust AI-text detector via adversarial learning. RADAR is based on adversarial training of a paraphraser and a detector. The paraphraser's goal is to generate realistic content to evade AI-text detection. RADAR uses the feedback from the detector to update the paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, experimental results show that RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. We also identify the strong transferability of RADAR from instruction-tuned LLMs to other LLMs, and evaluate the improved capability of RADAR via GPT-3.5-Turbo.
Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
No abstract available
We present a novel evaluation paradigm for AI text detectors that prioritizes real-world and equitable assessment. Current approaches predominantly report conventional metrics like AUROC, overlooking that even modest false positive rates constitute a critical impediment to practical deployment of detection systems. Furthermore, real-world deployment necessitates predetermined threshold configuration, making detector stability (i.e. the maintenance of consistent performance across diverse domains and adversarial scenarios), a critical factor. These aspects have been largely ignored in previous research and benchmarks. Our benchmark, SHIELD, addresses these limitations by integrating both reliability and stability factors into a unified evaluation metric designed for practical assessment. Furthermore, we develop a post-hoc, model-agnostic humanification framework that modifies AI text to more closely resemble human authorship, incorporating a controllable hardness parameter. This hardness-aware approach effectively challenges current SOTA zero-shot detection methods in maintaining both reliability and stability. (Data and code: https://github.com/navid-aub/SHIELD-Benchmark)
Large language models (LLMs) present significant risks when used to generate non-factual content and spread disinformation at scale. Detecting such LLM-generated content is crucial, yet current detectors often struggle to generalize in open-world contexts. We introduce Learning2Rewrite, a novel framework for detecting AI-generated text with exceptional generalization to unseen domains. Our method leverages the insight that LLMs inherently modify AI-generated content less than human-written text when tasked with rewriting. By training LLMs to minimize alterations on AI-generated inputs, we amplify this disparity, yielding a more distinguishable and generalizable edit distance across diverse text distributions. Extensive experiments on data from 21 independent domains and four major LLMs (GPT-3.5, GPT-4, Gemini, and Llama-3) demonstrate that our detector outperforms state-of-the-art detection methods by up to 23.04% in AUROC for in-distribution tests, 37.26% for out-of-distribution tests, and 48.66% under adversarial attacks. Our unique training objective ensures better generalizability compared to directly training for classification, when leveraging the same amount of parameters. Our findings suggest that reinforcing LLMs' inherent rewriting tendencies offers a robust and scalable solution for detecting AI-generated text.
With the rapid growth of AI-generated content produced by large language models (LLMs) such as GPT, Gemini, and Llama, fluent and convincing synthetic text has proliferated. This has created serious challenges, such as the generation of misinformation and unreliable content. Recent efforts in the literature focus on surface-level features, do not capture semantic embeddings, and thus are not robust against adversarial attacks. In this research, the authors propose a contrastive triplet learning method using a transformer-based approach to classify human- and AI-generated text by learning the semantic separation in embedding space. The models are trained on the widely used misinformation dataset CoAID, with paraphrased AI negatives. The proposed method is highly efficient and achieves significant gains in generalization and robustness to adversarial attacks. Notably, the google/bert_uncased_L-4_H-256_A-4 model showed the best performance with 94% accuracy, a 4% improvement over existing studies in the literature with contrastive learning.
The increasing parameter counts and expansive datasets of large language models (LLMs) highlight the urgent demand for a technical solution to audit the underlying privacy risks and copyright issues associated with LLMs. Existing studies have partially addressed this need through an exploration of the pre-training data detection problem, which is an instance of a membership inference attack (MIA). This problem involves determining whether a given piece of text has been used during the pre-training phase of the target LLM. Although existing methods have designed various sophisticated MIA score functions to achieve considerable detection performance on pre-trained LLMs, achieving high-confidence detection and performing MIAs on aligned LLMs remain challenging. In this paper, we propose MIA-Tuner, a novel instruction-based MIA method, which instructs LLMs themselves to serve as a more precise pre-training data detector internally, rather than designing an external MIA score function. Furthermore, we design two instruction-based safeguards to respectively mitigate the privacy risks brought by the existing methods and MIA-Tuner. To comprehensively evaluate the most recent state-of-the-art LLMs, we collect a more up-to-date MIA benchmark dataset, named WIKIMIA-24, to replace the widely adopted benchmark WIKIMIA. We conduct extensive experiments across various aligned and unaligned LLMs over the two benchmark datasets. The results demonstrate that MIA-Tuner increases the AUC of MIAs from 0.7 to a significantly higher level of 0.9.
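For contrast with MIA-Tuner's instruction-based approach, here is a hedged sketch of the score-function style of MIA it improves on, in the spirit of Min-K% Prob: average the least-likely k% of token log-probabilities, since members of the pre-training set tend to score higher. The model and any decision threshold are illustrative, not MIA-Tuner itself.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def min_k_score(text: str, k: float = 0.2) -> float:
    ids = tok(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logp = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
    token_lp = logp.gather(1, ids[0, 1:, None]).squeeze(1)
    worst = token_lp.topk(max(1, int(k * len(token_lp))), largest=False).values
    return worst.mean().item()  # higher => more likely seen in pre-training

print(min_k_score("To be or not to be, that is the question."))
```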
With advanced neural network techniques, language models can generate content that looks genuinely created by humans. Such advanced progress benefits society in numerous ways. However, it may also bring us threats that we have not seen before. A neural text detector is a classification model that separates machine-generated text from human-written ones. Unfortunately, a pretrained neural text detector may be vulnerable to adversarial attack, aiming to fool the detector into making wrong classification decisions. Through this work, we propose μAttacking, a mutation-based general framework that can be used to evaluate the robustness of neural text detectors systematically. Our experiments demonstrate that μAttacking identifies the detector’s flaws effectively. Inspired by the insightful information revealed by μAttacking, we also propose an RR-training strategy, a straightforward but effective method to improve the robustness of neural text detectors through finetuning. Compared with the normal finetuning method, our experiments demonstrated that RR-training effectively increased the model robustness by up to 11.33% without increasing much effort when finetuning a neural text detector. We believe the μAttacking and RR-training are useful tools for developing and evaluating neural language models.
The recent large-scale emergence of LLMs has left an open space for dealing with their consequences, such as plagiarism or the spread of false information on the Internet. Coupling this with the rise of AI detector bypassing tools, reliable machine-generated text detection is in increasingly high demand. We investigate the paraphrasing attack resilience of various machine-generated text detection methods, evaluating three approaches: fine-tuned RoBERTa, Binoculars, and text feature analysis, along with their ensembles using Random Forest classifiers. We discovered that Binoculars-inclusive ensembles yield the strongest results, but they also suffer the most significant losses during attacks. In this paper, we present the dichotomy of performance versus resilience in the world of AI text detection, which complicates the current perception of reliability among state-of-the-art techniques.
No abstract available
Adversarial attacks in Natural Language Processing greatly undermine the effectiveness and safety of models, raising significant challenges for real-world deployment. Researchers have suggested using detection methods to identify and reject hostile samples while maintaining the accuracy of the original model. Nevertheless, current detection methods depend on analyzing a single characteristic, resulting in restricted resilience and flexibility. To address these constraints, we propose the Multiple Adversarial Features Detector (MAFD), an innovative detection technique that utilizes a wide range of adversarial features, such as segmented perplexity, word frequency, and probability distribution, to enhance the effectiveness of detecting adversarial examples. Our comprehensive experiments show that MAFD outperforms existing advanced methods in detection accuracy and displays significant robustness and adaptability when applied to various base detectors and attack scenarios. In addition, the design of MAFD facilitates the seamless integration of further adversarial features, further enhancing its detection capabilities.
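A rough sketch of how a multi-feature detector of this kind can be assembled: compute several per-text signals (segmented perplexity, a word-frequency proxy, and a probability-distribution statistic) and feed them to a simple classifier. The GPT-2 scoring model, the specific features, and the logistic-regression head are assumptions for illustration, not MAFD's actual pipeline.

```python
# Hedged sketch of a multi-feature adversarial-text detector (MAFD-inspired).
# All feature choices here are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sklearn.linear_model import LogisticRegression

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def features(text, seg_len=20):
    ids = tok(text, return_tensors="pt").input_ids
    logp = torch.log_softmax(lm(ids).logits[0, :-1], -1)
    tok_lp = logp.gather(-1, ids[0, 1:, None]).squeeze(-1)
    # 1) Segmented perplexity: perplexity over fixed-length windows.
    seg_ppl = [float(torch.exp(-s.mean())) for s in tok_lp.split(seg_len)]
    # 2) Word-frequency proxy: type/token ratio.
    words = text.lower().split()
    ttr = len(set(words)) / max(len(words), 1)
    # 3) Distribution statistic: mean next-token entropy.
    ent = float(-(logp.exp() * logp).sum(-1).mean())
    return [max(seg_ppl), min(seg_ppl), sum(seg_ppl) / len(seg_ppl), ttr, ent]

# With labeled clean vs. adversarial texts X, y:
# clf = LogisticRegression().fit([features(t) for t in X], y)
```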
No abstract available
AI humanizers are a new class of online software tools meant to paraphrase and rewrite AI-generated text in a way that allows them to evade AI detection software. We study 19 AI humanizer and paraphrasing tools and qualitatively assess their effects and faithfulness in preserving the meaning of the original text. We show that many existing AI detectors fail to detect humanized text. Finally, we demonstrate a robust model that can detect humanized AI text while maintaining a low false positive rate using a data-centric augmentation approach. We attack our own detector, training our own fine-tuned model optimized against our detector's predictions, and show that our detector's cross-humanizer generalization is sufficient to remain robust to this attack.
No abstract available
In response to adversarial text attacks, attack detection models have been proposed and shown to successfully identify text modified by adversaries. Attack detection models can be leveraged to provide an additional check for NLP models and give signals for human input. However, the reliability of these models has not yet been thoroughly explored. Thus, we propose and test a novel attack setting and attack, RedHerring. RedHerring aims to make attack detection models unreliable by modifying a text to cause the detection model to predict an attack while keeping the classifier correct. This creates a tension between the classifier and detector: if a human sees that the detector is giving an "incorrect" prediction but the classifier a correct one, the human will see the detector as unreliable. We test this novel threat model on 4 datasets against 3 detectors defending 4 classifiers. We find that RedHerring is able to drop detection accuracy by 20 to 71 points while maintaining (or improving) classifier accuracy. As an initial defense, we propose a simple confidence check that requires no retraining of the classifier or detector and greatly increases detection accuracy. This novel threat model offers new insights into how adversaries may target detection models.
Machine-generated Text (MGT) detection is crucial for regulating and attributing online texts. While the existing MGT detectors achieve strong performance, they remain vulnerable to simple perturbations and adversarial attacks. To build an effective defense against malicious perturbations, we view MGT detection from a threat modeling perspective, that is, analyzing the model's vulnerability from an adversary's point of view and exploring effective mitigations. To this end, we introduce an adversarial framework for training a robust MGT detector, named GREedy Adversary PromoTed DefendER (GREATER). The GREATER consists of two key components: an adversary GREATER-A and a detector GREATER-D. The GREATER-D learns to defend against the adversarial attack from GREATER-A and generalizes the defense to other attacks. GREATER-A identifies and perturbs the critical tokens in embedding space, along with greedy search and pruning to generate stealthy and disruptive adversarial examples. Besides, we update the GREATER-A and GREATER-D synchronously, encouraging the GREATER-D to generalize its defense to different attacks and varying attack intensities. Our experimental results across 10 text perturbation strategies and 6 adversarial attacks show that our GREATER-D reduces the Attack Success Rate (ASR) by 0.67% compared with SOTA defense methods while our GREATER-A is demonstrated to be more effective and efficient than SOTA attack approaches. Codes and dataset are available in https://github.com/Liyuuuu111/GREATER.
AI-generated text detectors have become essential tools for maintaining content authenticity, yet their robustness against evasion attacks remains questionable. We present PDFuzz, a novel attack that exploits the discrepancy between visual text layout and extraction order in PDF documents. Our method preserves exact textual content while manipulating character positioning to scramble extraction sequences. We evaluate this approach against the ArguGPT detector using a dataset of human and AI-generated text. Our results demonstrate complete evasion: detector performance drops from (93.6 ± 1.4)% accuracy and 0.938 ± 0.014 F1 score to random-level performance ((50.4 ± 3.2)% accuracy, 0.0 F1 score) while maintaining perfect visual fidelity. Our work reveals a vulnerability in current detection systems that is inherent to PDF document structures and underscores the need for implementing sturdy safeguards against such attacks. We make our code publicly available at https://github.com/ACMCMC/PDFuzz.
In this paper, we introduce GradEscape, the first gradient-based evader designed to attack AI-generated text (AIGT) detectors. GradEscape overcomes the non-differentiability problem caused by the discrete nature of text by introducing a novel approach to construct weighted embeddings for the detector input. It then updates the evader model parameters using feedback from victim detectors, achieving high attack success with minimal text modification. To address the issue of tokenizer mismatch between the evader and the detector, we introduce a warm-started evader method, enabling GradEscape to adapt to detectors across any language model architecture. Moreover, we employ novel tokenizer inference and model extraction techniques, facilitating effective evasion even in query-only access. We evaluate GradEscape on four datasets and three widely-used language models, benchmarking it against four state-of-the-art AIGT evaders. Experimental results demonstrate that GradEscape outperforms existing evaders in various scenarios, including with an 11B paraphrase model, while utilizing only 139M parameters. We have successfully applied GradEscape to two real-world commercial AIGT detectors. Our analysis reveals that the primary vulnerability stems from disparity in text expression styles within the training data. We also propose a potential defense strategy to mitigate the threat of AIGT evaders. We open-source our GradEscape for developing more robust AIGT detectors.
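The weighted-embedding idea, i.e., differentiating through discrete tokens, can be sketched in a few lines: mix the embedding rows by a softmax over vocabulary logits so that a detector's loss can backpropagate into the evader. The shapes and the stand-in loss below are assumptions; GradEscape's actual architecture is more involved.

```python
# Minimal sketch of weighted embeddings for gradient-based evasion.
# Vocabulary size, sequence length, and the stand-in loss are assumptions.
import torch

vocab, dim, seq = 50257, 768, 12
emb = torch.nn.Embedding(vocab, dim)
evader_logits = torch.randn(1, seq, vocab, requires_grad=True)

weights = torch.softmax(evader_logits, dim=-1)   # (1, seq, vocab)
soft_inputs = weights @ emb.weight               # differentiable (1, seq, dim)

loss = soft_inputs.pow(2).mean()  # stand-in for a victim detector's loss
loss.backward()                   # gradients now reach the evader's logits
print(evader_logits.grad.shape)   # torch.Size([1, 12, 50257])
```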
While AI-generated text (AIGT) detectors achieve over 90% accuracy on direct LLM outputs, they fail catastrophically against iteratively-paraphrased content. We investigate why iteratively-paraphrased text, itself AI-generated, evades detection systems designed for AIGT identification. Through intrinsic mechanism analysis, we reveal that iterative paraphrasing creates an intermediate laundering region characterized by semantic displacement with preserved generation patterns, which brings up two attack categories: paraphrasing human-authored text (authorship obfuscation) and paraphrasing LLM-generated text (plagiarism evasion). To address these vulnerabilities, we introduce PADBen, the first benchmark systematically evaluating detector robustness against both paraphrase attack scenarios. PADBen comprises a five-type text taxonomy capturing the full trajectory from original content to deeply laundered text, and five progressive detection tasks across sentence-pair and single-sentence challenges. We evaluate 11 state-of-the-art detectors, revealing critical asymmetry: detectors successfully identify the plagiarism evasion problem but fail for the case of authorship obfuscation. Our findings demonstrate that current detection approaches cannot effectively handle the intermediate laundering region, necessitating fundamental advances in detection architectures beyond existing semantic and stylistic discrimination methods. For detailed code implementation, please see https://github.com/JonathanZha47/PadBen-Paraphrase-Attack-Benchmark.
The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.
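The decoding-time contrast described here can be sketched as a simple logit subtraction: score each candidate next token by its human-like log-probability minus a scaled machine-like log-probability. The prompts, the weight alpha, greedy decoding, and the use of GPT-2 are all illustrative assumptions, not CoPA's actual instructions or models.

```python
# Hedged sketch of contrastive decoding in the spirit of CoPA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

HUMAN = "Rewrite casually, like a person chatting with a friend:\n"
MACHINE = "Rewrite formally, in a neutral assistant style:\n"

@torch.no_grad()
def copa_step(text, alpha=0.5):
    def next_logp(prefix):
        ids = tok(prefix + text, return_tensors="pt").input_ids
        return torch.log_softmax(lm(ids).logits[0, -1], -1)
    # Subtract the machine-like distribution from the human-like one.
    scores = next_logp(HUMAN) - alpha * next_logp(MACHINE)
    return tok.decode(int(scores.argmax()))

text = "The city council"
for _ in range(20):
    text += copa_step(text)
print(text)
```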
Large language models (LLMs) have exhibited remarkable fluency across various tasks. However, their unethical applications, such as disseminating disinformation, have become a growing concern. Although recent works have proposed a number of LLM detection methods, their robustness and reliability remain unclear. In this paper, we present RAFT: a grammar-error-free black-box attack against existing LLM detectors. In contrast to previous attacks on language models, our method exploits the transferability of LLM embeddings at the word level while preserving the original text quality. We leverage an auxiliary embedding to greedily select candidate words to perturb against the target detector. Experiments reveal that our attack effectively compromises all detectors in the study across various domains by up to 99% and is transferable across source models. Manual human evaluation studies show our attacks are realistic and indistinguishable from original human-written text. We also show that examples generated by RAFT can be used to train adversarially robust detectors. Our work shows that current LLM detectors are not adversarially robust, underscoring the urgent need for more resilient detection mechanisms.
Large Language Models (LLMs) perform impressively well in various applications. However, the potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concerns about their responsible use. Consequently, the reliable detection of AI-generated text has become a critical area of research. AI text detectors have been shown to be effective under their specific settings. In this paper, we stress-test the robustness of these AI text detectors in the presence of an attacker. We introduce a recursive paraphrasing attack to stress-test a wide range of detection schemes, including those using watermarking as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments conducted on passages, each approximately 300 tokens long, reveal the varying sensitivities of these detectors to our attacks. Our findings indicate that while our recursive paraphrasing method can significantly reduce detection rates, it only slightly degrades text quality in many cases, highlighting potential vulnerabilities in current detection systems in the presence of an attacker. Additionally, we investigate the susceptibility of watermarked LLMs to spoofing attacks aimed at misclassifying human-written text as AI-generated. We demonstrate that an attacker can infer hidden AI text signatures without white-box access to the detection method, potentially leading to reputational risks for LLM developers. Finally, we provide a theoretical framework connecting the AUROC of the best possible detector to the Total Variation distance between human and AI text distributions. This analysis offers insights into the fundamental challenges of reliable detection as language models continue to advance. Our code is publicly available at https://github.com/vinusankars/Reliability-of-AI-text-detectors.
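A minimal recursive-paraphrasing loop looks like the following; the off-the-shelf Hugging Face paraphraser named here is a stand-in assumption, and the paper's actual paraphrase model, prompts, and number of rounds may differ.

```python
# Sketch of recursive paraphrasing: feed each round's output back in.
from transformers import pipeline

# Stand-in paraphraser (assumption); any text2text paraphrase model works.
para = pipeline("text2text-generation",
                model="humarin/chatgpt_paraphraser_on_T5_base")

def recursive_paraphrase(text, rounds=3):
    for _ in range(rounds):
        text = para("paraphrase: " + text,
                    max_length=256, do_sample=True)[0]["generated_text"]
    return text

print(recursive_paraphrase("Large language models can draft fluent essays."))
```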
Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform targeted attacks. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show SIRA achieves nearly 100% attack success rates on seven recent watermarking methods at a cost of only 0.88 USD per million tokens. Our approach does not require any access to the watermark algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.
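Per-token self-information, the signal SIRA reportedly exploits, is straightforward to compute with any scoring LM. GPT-2 and the example sentence below are assumptions; the paper's rewriting pipeline is not reproduced here.

```python
# Sketch: self-information I(t) = -log p(t | context) per token.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def self_information(text):
    ids = tok(text, return_tensors="pt").input_ids
    logp = torch.log_softmax(lm(ids).logits[0, :-1], -1)
    tok_lp = logp.gather(-1, ids[0, 1:, None]).squeeze(-1)
    tokens = tok.convert_ids_to_tokens(ids[0, 1:].tolist())
    return list(zip(tokens, (-tok_lp).tolist()))

# High self-information positions are where high-entropy watermark
# biases tend to sit, making them candidates for targeted rewriting.
for token, info in self_information("The quick brown fox jumps over the lazy dog."):
    print(f"{token!r}\t{info:.2f}")
```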
AI-generated text (AIGT) detection evasion aims to reduce the detection probability of AIGT, helping to identify weaknesses in detectors and enhance their effectiveness and reliability in practical applications. Although existing evasion methods perform well, they suffer from high computational costs and text quality degradation. To address these challenges, we propose the Self-Disguise Attack (SDA), a novel approach that enables Large Language Models (LLMs) to actively disguise their output, reducing the likelihood of detection by classifiers. The SDA comprises two main components: an adversarial feature extractor and a retrieval-based context-examples optimizer. The former generates disguise features that enable LLMs to understand how to produce more human-like text. The latter retrieves the most relevant examples from an external knowledge base as in-context examples, further enhancing the self-disguise ability of LLMs and mitigating the impact of the disguise process on the diversity of the generated text. The SDA directly employs prompts containing disguise features and optimized context examples to guide the LLM in generating detection-resistant text, thereby reducing resource consumption. Experimental results demonstrate that the SDA effectively reduces the average detection accuracy of various AIGT detectors across texts generated by three different LLMs, while maintaining the quality of AIGT.
Although membership inference attacks (MIAs) and machine-generated text detection target different goals, their methods often exploit similar signals based on a language model's probability distribution, and the two tasks have been studied independently. This can result in conclusions that overlook stronger methods and valuable insights from the other task. In this work, we theoretically and empirically demonstrate the transferability, i.e., how well a method originally developed for one task performs on the other, between MIAs and machine text detection. We prove that the metric achieving asymptotically optimal performance is identical for both tasks. We unify existing methods under this optimal metric and hypothesize that the accuracy with which a method approximates this metric is directly correlated with its transferability. Our large-scale empirical experiments demonstrate very strong rank correlation (ρ ≈ 0.7) in cross-task performance. Notably, we also find that a machine text detector achieves the strongest performance among evaluated methods on both tasks, demonstrating the practical impact of transferability. To facilitate cross-task development and fair evaluation, we introduce MINT, a unified evaluation suite for MIAs and machine-generated text detection, implementing 15 recent methods from both tasks.
Large language models (LLMs) are able to generate high-quality texts in multiple languages. Humans often cannot recognize such texts as generated, which creates potential for misuse of LLMs (e.g., plagiarism, spam, disinformation spreading). Automated detection can assist humans in flagging machine-generated texts; however, its robustness to out-of-distribution data is still challenging. This notebook describes our mdok approach to robust detection, based on fine-tuning smaller LLMs for text classification. Applied to both subtasks of Voight-Kampff Generative AI Detection 2025, it achieved remarkable performance (1st rank) in both the binary detection and the multiclass classification of various cases of human-AI collaboration.
With the rapid advancements in pre-trained large language models like ChatGPT, the surge of AI-generated text, particularly in Chinese, has presented significant challenges to existing detection systems due to its increasing realism and complexity. To address this, we introduce MLSDET: a groundbreaking Multi-LLM Statistical Deep Ensemble framework designed for high-precision detection of AI-generated Chinese text. MLSDET uniquely integrates a Mixture of Experts (MoE) architecture with a novel cross-entropy metric, setting a new benchmark for robustness and generalization. By employing a diverse ensemble of large language models (LLMs), including Qwen, Wenzhong-GPT2, and LLaMA, our approach extracts intricate features such as log-rank, entropy, log-likelihood, and the newly introduced LLMs-crossEntropy, accurately capturing both model consensus and the statistical distribution differences between AI-generated and human-authored text. Experimental results on the HC3-Chinese dataset show that MLSDET surpasses traditional zero-shot methods like CLTR by 15.94% in F1 score and competes effectively with existing methods, offering a scalable solution for real-world applications.
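The statistical features named in this abstract (log-likelihood, log-rank, entropy) can each be read off one forward pass of a scoring LM. The sketch below uses a single GPT-2 model for brevity; MLSDET ensembles several LLMs and adds a cross-entropy feature, which this minimal version omits.

```python
# Hedged sketch of per-text statistical features for MGT detection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def stats(text):
    ids = tok(text, return_tensors="pt").input_ids
    logp = torch.log_softmax(lm(ids).logits[0, :-1], -1)
    targets = ids[0, 1:]
    tok_lp = logp.gather(-1, targets[:, None]).squeeze(-1)
    # Rank of each observed token among all vocabulary items (1 = most likely).
    ranks = (logp > tok_lp[:, None]).sum(-1) + 1
    entropy = -(logp.exp() * logp).sum(-1)
    return {
        "log_likelihood": float(tok_lp.mean()),
        "log_rank": float(torch.log(ranks.float()).mean()),
        "entropy": float(entropy.mean()),
    }

print(stats("Machine-generated text tends to have a low average log-rank."))
```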
As LLM-generated text becomes increasingly prevalent on the internet, often containing hallucinations or biases, detecting such content has emerged as a critical area of research. Recent methods have demonstrated impressive performance in detecting text generated entirely by LLMs. However, in real-world scenarios, users often introduce perturbations to the LLM-generated text, and the robustness of existing detection methods against these perturbations has not been sufficiently explored. This paper empirically investigates this challenge and finds that even minor perturbations can severely degrade the performance of current detection methods. To address this issue, we find that the syntactic tree is minimally affected by disturbances and exhibits distinct differences between human-written and LLM-generated text. Therefore, we propose a detection method based on syntactic trees, which can capture features invariant to perturbations. It demonstrates significantly improved robustness against perturbation on the HC3 and GPT-3.5-mixed datasets and also incurs the lowest time cost among the compared methods. We provide the code and data at https://github.com/thulx18/PRDetect.
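A hedged sketch of perturbation-robust syntactic features: parse each text and summarize its dependency tree. The spaCy pipeline and the two features below are illustrative assumptions; PRDetect's tree encoding is richer.

```python
# Sketch of syntactic-tree features.
# Assumes the model is installed: python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def tree_features(text):
    doc = nlp(text)
    depths = []
    for token in doc:
        depth, node = 0, token
        while node.head is not node:      # walk up to the sentence root
            node, depth = node.head, depth + 1
        depths.append(depth)
    return {
        "mean_depth": sum(depths) / max(len(depths), 1),
        "dep_counts": Counter(token.dep_ for token in doc),
    }

# Minor character edits rarely change these tree statistics much.
print(tree_features("Minor edits rarely change the dependency structure."))
```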
Detecting LLM-generated text in specialized and high-stakes domains like medicine and law is crucial for combating misinformation and ensuring authenticity. However, current zero-shot detectors, while effective on general text, often fail when applied to specialized content due to domain shift. We provide a theoretical analysis showing this failure is fundamentally linked to the KL divergence between human, detector, and source text distributions. To address this, we propose DivScore, a zero-shot detection framework using normalized entropy-based scoring and domain knowledge distillation to robustly identify LLM-generated text in specialized domains. We also release a domain-specific benchmark for LLM-generated text detection in the medical and legal domains. Experiments on our benchmark show that DivScore consistently outperforms state-of-the-art detectors, with 14.4% higher AUROC and 64.0% higher recall (0.1% false positive rate threshold). In adversarial settings, DivScore demonstrates superior robustness compared to other baselines, achieving on average a 22.8% advantage in AUROC and 29.5% in recall. Code and data are publicly available.
This paper proposes MAF-Detect, a zero-shot detection framework based on multi-scale adaptive fusion, designed to effectively identify text generated by large language models. This framework significantly improves the ability to discriminate machine-generated text without relying on training data through a multi-granular perturbation strategy, multidimensional semantic feature fusion, and a dynamic adaptive threshold mechanism. Experiments on the cross-domain Chinese and English dataset HC3 demonstrate that MAF-Detect outperforms existing zero-shot detection methods in both recognition accuracy and robustness, particularly in short text recognition tasks, validating its effectiveness and versatility in practical applications.
The increasing use of Large Language Models (LLMs) for generating highly coherent and contextually relevant text introduces new risks, including misuse for unethical purposes such as disinformation or academic dishonesty. To address these challenges, we propose FreqMark, a novel watermarking technique that embeds detectable frequency-based watermarks in LLM-generated text during the token sampling process. The method leverages periodic signals to guide token selection, creating a watermark that can be detected with Short-Time Fourier Transform (STFT) analysis. This approach enables accurate identification of LLM-generated content, even in mixed-text scenarios with both human-authored and LLM-generated segments. Our experiments demonstrate the robustness and precision of FreqMark, showing strong detection capabilities against various attack scenarios such as paraphrasing and token substitution. Results show that FreqMark achieves an AUC improvement of up to 0.98, significantly outperforming existing detection methods.
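The detection side of a frequency-based watermark can be illustrated with a synthetic signal: treat a per-token score sequence as a time series and look for energy at a known embedding frequency via the STFT. The signal construction, the target frequency, and the energy statistic below are assumptions for demonstration, not FreqMark's actual scheme.

```python
# Illustrative sketch of STFT-based watermark detection (FreqMark-inspired).
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
n_tokens, f_mark = 512, 0.1  # f_mark in cycles per token (assumption)

# Simulated per-token scores: watermarked text carries a periodic bias.
clean = rng.standard_normal(n_tokens)
marked = clean + 0.8 * np.sin(2 * np.pi * f_mark * np.arange(n_tokens))

def band_energy(signal, freq, nperseg=128):
    f, _, Z = stft(signal, fs=1.0, nperseg=nperseg)
    band = np.argmin(np.abs(f - freq))       # nearest frequency bin
    return float(np.abs(Z[band]).mean())     # mean magnitude over time

print("clean :", band_energy(clean, f_mark))
print("marked:", band_energy(marked, f_mark))  # noticeably higher
```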
Detecting text generated by large language models (LLMs) is a growing challenge as these models produce outputs nearly indistinguishable from human writing. This study explores multiple detection approaches, including a Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM) networks, a Transformer block, and a fine-tuned distilled BERT model. Leveraging BERT's contextual understanding, we train the model on diverse datasets containing authentic and synthetic texts, focusing on features like sentence structure, token distribution, and semantic coherence. The fine-tuned BERT outperforms baseline models, achieving high accuracy and robustness across domains, with superior AUC scores and efficient computation times. By incorporating domain-specific training and adversarial techniques, the model adapts to sophisticated LLM outputs, improving detection precision. These findings underscore the efficacy of pretrained transformer models for ensuring authenticity in digital communication, with potential applications in mitigating misinformation, safeguarding academic integrity, and promoting ethical AI usage.
Large language models (LLMs) have grown more powerful in language generation, producing fluent text and even imitating personal style. Yet, this ability also heightens the risk of identity impersonation. To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection. In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations. Our experimental results demonstrate large performance gaps across detectors in personalized settings: some state-of-the-art models suffer significant drops. We attribute this limitation to the "feature-inversion trap", where features that are discriminative in general domains become inverted and misleading when applied to personalized text. Based on this finding, we propose \method, a simple and reliable way to predict detector performance changes in personalized settings. \method identifies latent directions corresponding to inverted features and constructs probe datasets that differ primarily along these features to evaluate detector dependence. Our experiments show that \method can accurately predict both the direction and the magnitude of post-transfer changes, showing 85% correlation with the actual performance gaps. We hope that this work will encourage further research on personalized text detection.
With the rapid development and widespread application of Large Language Models (LLMs), the use of Machine-Generated Text (MGT) has become increasingly common, bringing with it potential risks, especially in terms of quality and integrity in fields like news, education, and science. Current research mainly focuses on purely MGT detection without adequately addressing mixed scenarios, including AI-revised Human-Written Text (HWT) or human-revised MGT. To tackle this challenge, we define mixtext, a form of mixed text involving both AI and human-generated content. Then, we introduce MixSet, the first dataset dedicated to studying these mixtext scenarios. Leveraging MixSet, we executed comprehensive experiments to assess the efficacy of prevalent MGT detectors in handling mixtext situations, evaluating their performance in terms of effectiveness, robustness, and generalization. Our findings reveal that existing detectors struggle to identify mixtext, particularly in dealing with subtle modifications and style adaptability. This research underscores the urgent need for more fine-grained detectors tailored for mixtext, offering valuable insights for future research. Code and Models are available at https://github.com/Dongping-Chen/MixSet.
Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to the features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representations of LGT and HWT, and extract the distinct activation features that better identify LGT. We can classify a text by calculating the projection score of its representations along this feature direction and comparing it with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with an average 94.92% AUROC in both in-distribution and OOD scenarios, while also demonstrating robust resilience to varying text sizes and mainstream attacks.
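The projection-score step reduces to a dot product once the feature direction is fixed. In the sketch below the direction is the difference of class means over surrogate-model hidden states, an assumption for illustration; the paper's feature extraction may differ.

```python
# Sketch of a projection-score detector (RepreGuard-inspired).
import numpy as np

def class_direction(lgt_reps, hwt_reps):
    # lgt_reps, hwt_reps: (n, d) hidden states from a surrogate LM for
    # LLM-generated (LGT) and human-written (HWT) calibration texts.
    d = lgt_reps.mean(axis=0) - hwt_reps.mean(axis=0)
    return d / np.linalg.norm(d)

def projection_score(rep, direction):
    return float(rep @ direction)

# Toy demo with synthetic representations; calibrate a threshold midway
# between class-mean projections, then flag scores above it as generated.
rng = np.random.default_rng(0)
lgt = rng.normal(0.5, 1.0, (100, 64))
hwt = rng.normal(-0.5, 1.0, (100, 64))
d = class_direction(lgt, hwt)
print(projection_score(lgt[0], d), projection_score(hwt[0], d))
```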
Since the proliferation of LLMs, there have been concerns about their misuse for harmful content creation and spreading. Recent studies justify such fears, providing evidence of LLM vulnerabilities and the high potential for their misuse. Humans are no longer able to distinguish between high-quality machine-generated and authentic human-written texts. It is therefore crucial to develop automated means to accurately detect machine-generated content, which would enable identifying such content in the online information space and provide additional information about its credibility. This work addresses the problem by proposing a robust fine-tuning process of LLMs for the detection task, making the detectors more robust against obfuscation and more generalizable to out-of-distribution data.
The growing abilities of large language models (LLMs) have introduced new challenges in reliably distinguishing LLM-generated texts from human-written content, particularly when paraphrasing techniques are used to evade detection. This paper proposes GravText, a detection framework designed to address this robustness gap by targeting paraphrase-invariant semantic features. GravText integrates triplet contrastive learning with a dynamic anchor switching strategy to better model inter-class separability under paraphrasing. Additionally, it introduces a physics-inspired gravitational factor based on cross-attention mechanisms, which enhances the discriminative power of learned embeddings by simulating semantic attraction and repulsion. Experimental results on the HC3 Chinese dataset demonstrate GravText’s superior robustness against paraphrasing. Crucially, further cross-lingual evaluation on an English essay dataset confirms the framework’s strong generalization ability and language-agnostic properties. These findings point to a promising direction for building more reliable AI-text detectors resilient to paraphrasing-based evasion.
Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of misuse of LLMs and demands the development of detectors to identify LLM-generated texts. However, existing detectors lack robustness against attacks: they degrade detection accuracy by simply paraphrasing LLM-generated texts. Furthermore, a malicious user might attempt to deliberately evade the detectors based on detection results, but this has not been assumed in previous studies. In this paper, we propose OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output. In this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Experiments in the domain of student essays show that the proposed detector improves the detection performance on the attacker-generated texts by up to +41.3 points F1-score. Furthermore, the proposed detector shows a state-of-the-art detection performance: up to 96.9 points F1-score, beating existing detectors on non-attacked texts. Finally, the proposed attacker drastically degrades the performance of detectors by up to -57.0 points F1-score, massively outperforming the baseline paraphrasing method for evading detection.
No abstract available
No abstract available
Watermarking approaches are proposed to identify whether circulated text is human or large language model (LLM) generated. The state-of-the-art watermarking strategy of Kirchenbauer et al. (2023a) biases the LLM to generate specific ("green") tokens. However, determining the robustness of this watermarking method is an open problem. Existing attack methods fail to evade detection for longer text segments. We overcome this limitation and propose Self Color Testing-based Substitution (SCTS), the first "color-aware" attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output token frequencies. It uses this information to determine token colors and substitutes green tokens with non-green ones. In our experiments, SCTS successfully evades watermark detection using fewer edits than related work. Additionally, we show both theoretically and empirically that SCTS can remove the watermark for arbitrarily long watermarked text.
We study the problem of watermarking large language models (LLMs) generated text -- one of the most promising approaches for addressing the safety challenges of LLM usage. In this paper, we propose a rigorous theoretical framework to quantify the effectiveness and robustness of LLM watermarks. We propose a robust and high-quality watermark method, Unigram-Watermark, by extending an existing approach with a simplified fixed grouping strategy. We prove that our watermark method enjoys guaranteed generation quality, correctness in watermark detection, and is robust against text editing and paraphrasing. Experiments on three varying LLMs and two datasets verify that our Unigram-Watermark achieves superior detection accuracy and comparable generation quality in perplexity, thus promoting the responsible use of LLMs. Code is available at https://github.com/XuandongZhao/Unigram-Watermark.
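Detection for a fixed ("unigram") green list reduces to a one-proportion z-test over green-token counts. The hash construction and green fraction below are illustrative assumptions rather than the paper's exact scheme.

```python
# Sketch of fixed-green-list watermark detection via a z-score.
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary marked green

def is_green(token_id: int, key: bytes = b"wm-key") -> bool:
    # Fixed (unigram) grouping: each token id hashes to green/red once.
    h = hashlib.sha256(key + token_id.to_bytes(4, "big")).digest()
    return h[0] / 255.0 < GAMMA

def z_score(token_ids):
    green = sum(is_green(t) for t in token_ids)
    n = len(token_ids)
    # One-proportion z-test against the null hypothesis of no watermark.
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

# A large positive z (e.g., > 4) is strong evidence of the green-list bias.
print(z_score([17, 942, 3, 88, 120, 7, 55, 901, 23, 4]))
```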
Large Language Model (LLM) watermarking embeds detectable signals into generated text for copyright protection, misuse prevention, and content detection. While prior studies evaluate robustness using watermark removal attacks, these methods are often suboptimal, creating the misconception that effective removal requires large perturbations or powerful adversaries. To bridge the gap, we first formalize the system model for LLM watermark, and characterize two realistic threat models constrained on limited access to the watermark detector. We then analyze how different types of perturbation vary in their attack range, i.e., the number of tokens they can affect with a single edit. We observe that character-level perturbations (e.g., typos, swaps, deletions, homoglyphs) can influence multiple tokens simultaneously by disrupting the tokenization process. We demonstrate that character-level perturbations are significantly more effective for watermark removal under the most restrictive threat model. We further propose guided removal attacks based on the Genetic Algorithm (GA) that uses a reference detector for optimization. Under a practical threat model with limited black-box queries to the watermark detector, our method demonstrates strong removal performance. Experiments confirm the superiority of character-level perturbations and the effectiveness of the GA in removing watermarks under realistic constraints. Additionally, we argue there is an adversarial dilemma when considering potential defenses: any fixed defense can be bypassed by a suitable perturbation strategy. Motivated by this principle, we propose an adaptive compound character-level attack. Experimental results show that this approach can effectively defeat the defenses. Our findings highlight significant vulnerabilities in existing LLM watermark schemes and underline the urgency for the development of new robust mechanisms.
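The wide attack range of character-level edits is easy to observe directly: a single homoglyph substitution re-segments several tokens at once, which is what degrades token-level watermark statistics. The tokenizer and example sentence are illustrative assumptions.

```python
# Demo: one homoglyph disrupts tokenization across multiple tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

original = "The watermarking algorithm embeds detectable signals."
# Swap the Latin 'i' for the visually identical Cyrillic 'і' (U+0456).
perturbed = original.replace("watermarking", "watermarkіng")

print(tok.tokenize(original))
print(tok.tokenize(perturbed))  # the homoglyph splits the word into new tokens
```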
As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthening it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean), we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.
The advent of large language models (LLMs) has revolutionized the field of text generation, producing outputs that closely mimic human-like writing. Although academic and industrial institutions have developed detectors to prevent the malicious usage of LLM-generated texts, other research has cast doubt on the robustness of these systems. To stress-test these detectors, we introduce a proxy-attack strategy that effortlessly compromises LLMs, causing them to produce outputs that align with human-written text and mislead detection systems. Our method attacks the source model by leveraging a reinforcement learning (RL) fine-tuned humanized small language model (SLM) in the decoding phase. Through an in-depth analysis, we demonstrate that our attack strategy is capable of generating responses that are indistinguishable to detectors, preventing them from differentiating between machine-generated and human-written text. We conduct systematic evaluations on extensive datasets using proxy-attacked open-source models, including Llama2-13B, Llama3-70B, and Mixtral-8x7B in both white- and black-box settings. Our findings show that the proxy-attack strategy effectively deceives the leading detectors, resulting in an average AUROC drop of 70.4% across multiple datasets, with a maximum drop of 90.3% on a single dataset. Furthermore, in cross-discipline scenarios, our strategy also bypasses these detectors, leading to a significant relative decrease of up to 90.9%, while in the cross-language scenario the drop reaches 91.3%. Despite our proxy-attack strategy successfully bypassing the detectors with such significant relative drops, we find that the generation quality of the attacked models remains preserved, even within a modest utility budget, when compared to the text produced by the original, unattacked source model.
No abstract available
High-quality text generation capability of recent Large Language Models (LLMs) causes concerns about their misuse (e.g., in massive generation/spread of disinformation). Machine-generated text (MGT) detection is important to cope with such threats. However, it is susceptible to authorship obfuscation (AO) methods, such as paraphrasing, which can cause MGTs to evade detection. So far, this was evaluated only in monolingual settings. Thus, the susceptibility of recently proposed multilingual detectors is still unknown. We fill this gap by comprehensively benchmarking the performance of 10 well-known AO methods, attacking 37 MGT detection methods against MGTs in 11 languages (i.e., 10 × 37 × 11 = 4,070 combinations). We also evaluate the effect of data augmentation on adversarial robustness using obfuscated texts. The results indicate that all tested AO methods can cause evasion of automated detection in all tested languages, where homoglyph attacks are especially successful. However, some of the AO methods severely damaged the text, making it no longer readable or easily recognizable by humans (e.g., changed language, weird characters).
Recent watermarked generation algorithms inject detectable signatures during language generation to facilitate post-hoc detection. While token-level watermarks are vulnerable to paraphrase attacks, SemStamp (Hou et al., 2023) applies the watermark to the semantic representation of sentences and demonstrates promising robustness. SemStamp employs locality-sensitive hashing (LSH) to partition the semantic space with arbitrary hyperplanes, which results in a suboptimal tradeoff between robustness and speed. We propose k-SemStamp, a simple yet effective enhancement of SemStamp, utilizing k-means clustering as an alternative to LSH to partition the embedding space with awareness of its inherent semantic structure. Experimental results indicate that k-SemStamp saliently improves robustness and sampling efficiency while preserving generation quality, advancing a more effective tool for machine-generated text detection.
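The k-means partitioning can be sketched directly: cluster sentence embeddings, key a subset of clusters as valid watermark regions, and accept or resample sentences by cluster membership. The random embeddings, k, and the valid-region rule below are stand-in assumptions, not k-SemStamp's actual parameters.

```python
# Sketch of k-means semantic-space partitioning in the spirit of k-SemStamp.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 384))   # stand-in sentence embeddings
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(emb)

# A keyed subset of clusters acts as the "valid" watermark region.
valid = set(rng.choice(8, size=4, replace=False).tolist())

def sentence_ok(sent_emb):
    # Generation: resample a sentence until it lands in a valid cluster.
    # Detection: count the fraction of sentences falling in valid clusters.
    return int(km.predict(sent_emb[None])[0]) in valid

print(sentence_ok(emb[0]))
```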
Existing machine-generated text (MGT) detection methods implicitly assume labels as the "golden standard". However, we reveal boundary ambiguity in MGT detection, implying that traditional training paradigms are inexact. Moreover, limitations of human cognition and the superintelligence of detectors make inexact learning widespread and inevitable. To this end, we propose an easy-to-hard enhancement framework to provide reliable supervision under such inexact conditions. Distinct from knowledge distillation, our framework employs an easy supervisor targeting relatively simple longer-text detection tasks (despite weaker capabilities) to enhance the more challenging target detector. Firstly, the longer texts targeted by supervisors theoretically alleviate the impact of inexact labels, laying the foundation for reliable supervision. Secondly, by structurally incorporating the detector into the supervisor, we theoretically model the supervisor as a lower performance bound for the detector. Thus, optimizing the supervisor indirectly optimizes the detector, ultimately approximating the underlying "golden" labels. Extensive experiments across diverse practical scenarios, including cross-LLM, cross-domain, mixed text, and paraphrase attacks, demonstrate the framework's significant detection effectiveness. The code is available at: https://github.com/tmlr-group/Easy2Hard.
Detecting machine-generated text is a critical task in the era of large language models. In this paper, we present our systems for SemEval-2024 Task 8, which focuses on multi-class classification to discern between human-written and machine-generated texts by five state-of-the-art large language models. We propose three different systems: unsupervised text similarity, triplet-loss-trained text similarity, and text classification. We show that the triplet-loss-trained text similarity system outperforms the other systems, achieving 80% accuracy on the test set and surpassing the baseline model for this subtask. Additionally, our text classification system, which takes into account sentence paraphrases generated by the candidate models, also outperforms the unsupervised text similarity system, achieving 74% accuracy.
No abstract available
As Generative AI models become increasingly adept at producing human-like text, AI content detectors struggle to distinguish machine-generated text from human writing. This paper explores an innovative approach to evade these detection systems by fusing text generated from multiple large language models (LLMs) at the token level. By strategically merging the outputs of different models, the generated text incorporates a diverse range of linguistic styles, syntactic structures, and semantic patterns, effectively circumventing the detection signals typically used by content classifiers. The evaluation demonstrates that this multi-model fusion technique significantly reduces the accuracy of existing AI detection systems, highlighting vulnerabilities in their current architecture. In response, we introduce an enhanced detection framework that integrates advanced natural language processing (NLP) techniques to improve model robustness against sophisticated AI text manipulations. The results underscore the evolving cat-and-mouse game between AI-generated text and detection models, offering new insights into improving both generative AI and detection capabilities.
The burgeoning progress in the field of Large Language Models (LLMs) heralds significant benefits due to their unparalleled capacities. However, it is critical to acknowledge the potential misuse of these models, which could give rise to a spectrum of social and ethical dilemmas. Despite numerous preceding efforts centered around distinguishing synthetic text, most existing detection systems fail to identify data synthesized by the latest LLMs, such as ChatGPT and GPT-4. In response to this challenge, we introduce an unpretentious yet potent detection approach proficient in identifying synthetic text across a wide array of fields. Moreover, our detector demonstrates outstanding performance uniformly across various model architectures and decoding strategies. It also possesses the capability to identify text generated utilizing a potent detection-evasion technique. Our comprehensive research underlines our commitment to boosting the robustness and efficiency of machine-generated text detection mechanisms, particularly in the context of swiftly progressing and increasingly adaptive AI technologies.
As generative AI has advanced at great speed, the need to detect AI-generated content, including text and deepfake media, has also increased. This work proposes a hybrid detection method that combines double paraphrasing-based consistency checks with probabilistic content analysis through natural language processing and machine learning algorithms for text, and advanced deepfake detection techniques for media. Our system hybridizes the double-paraphrasing framework of SAVANA with probabilistic analysis to achieve high accuracy on AI-text detection in formats such as DOCX or PDF across diverse domains: academic text, business text, reviews, and media. For detecting the visual artifacts and spatiotemporal inconsistencies characteristic of deepfakes, we use BlazeFace and EfficientNetB4 to extract features for classification and detection. Experimental results indicate that the hybrid model achieves up to 95% accuracy for AI-generated text detection and up to 96% accuracy for deepfake detection, compared with traditional models and standalone SAVANA-based methods. This positions our framework as an adaptive and reliable tool for detecting AI-generated content in various contexts, thereby strengthening content integrity in digital environments.
The rise of advanced large language models (LLMs) has enabled the generation of human-like text, challenging the detection of AI-generated and humanized AI content. This study evaluates Logistic Regression, Bidirectional LSTM, and DeBERTa for multi-class detection of human-written, AI-generated, and humanized AI text. We introduce a novel dataset of 30,000 texts, including 10,000 humanized samples created via a LangChain-based pipeline with GPT-4o, verified to reduce AI detectability using ZeroGPT. Experimental results show DeBERTa achieves 96.93% accuracy, outperforming Logistic Regression (93.43%) and LSTM (93.77%) in distinguishing text classes. Our approach leverages stylometric features and deep contextual embeddings to address real-world challenges like stylistic overlap and adversarial paraphrasing. Key contributions include the dataset, a comparative model evaluation, and insights into detecting humanized AI text, with implications for content moderation, academic integrity, and misinformation prevention.
The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly for longer texts and in texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by extending artifacts from their original versions. To better capture the structure of longer texts at the document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It results in substantial performance gains across both datasets: 15.5% absolute improvement on paraLFQA, 4% absolute improvement on paraWP, and 1.5% absolute improvement on M4 compared to SOTA approaches. The data and code are available at this link.
High-quality paraphrases are easy to produce using instruction-tuned language models or specialized paraphrasing models. Although this capability has a variety of benign applications, paraphrasing attacks (paraphrases applied to machine-generated texts) are known to significantly degrade the performance of machine-text detectors. This motivates us to consider the novel problem of paraphrase inversion, where, given paraphrased text, the objective is to recover an approximation of the original text. The closer the approximation is to the original text, the better machine-text detectors will perform. We propose an approach which frames the problem as translation from paraphrased text back to the original text, which requires examples of texts and corresponding paraphrases to train the inversion model. Fortunately, such training data can easily be generated, given a corpus of original texts and one or more paraphrasing models. We find that language models such as GPT-4 and Llama-3 exhibit biases when paraphrasing which an inversion model can learn with a modest amount of data. Perhaps surprisingly, we also find that such models generalize well, including to paraphrase models unseen at training time. Finally, we show that when combined with a paraphrased-text detector, our inversion models provide an effective defense against paraphrasing attacks, and overall our approach yields an average improvement of +22% AUROC across seven machine-text detectors and three different domains.
Generative AI models, including ChatGPT, Gemini, and Claude, are increasingly significant in enhancing K–12 education, offering support across various disciplines. These models provide sample answers for humanities prompts, solve mathematical equations, and brainstorm novel ideas. Despite their educational value, ethical concerns have emerged regarding their potential to mislead students into copying answers directly from AI when completing assignments, assessments, or research papers. Current detectors, such as GPT-Zero, struggle to identify modified AI-generated texts and show reduced reliability for English as a Second Language learners. This study investigates detection of academic cheating by use of generative AI in high-stakes writing assessments. Classical machine learning models, including logistic regression, XGBoost, and support vector machine, are used to distinguish between AI-generated and student-written essays. Additionally, large language models including BERT, RoBERTa, and Electra are examined and compared to traditional machine learning models. The analysis focuses on prompt 1 from the ASAP Kaggle competition. To evaluate the effectiveness of various detection methods and generative AI models, we include ChatGPT, Claude, and Gemini in their base, pro, and latest versions. Furthermore, we examine the impact of paraphrasing tools such as GPT-Humanizer and QuillBot and introduce a new method of using synonym information to detect humanized AI texts. Additionally, the relationship between dataset size and model performance is explored to inform data collection in future research.
To combat the misuse of Large Language Models (LLMs), many recent studies have presented LLM-generated-text detectors with promising performance. When users instruct LLMs to generate texts, the instruction can include different constraints depending on the user's need. However, most recent studies do not cover such diverse instruction patterns when creating datasets for LLM detection. In this paper, we reveal that even task-oriented constraints -- constraints that would naturally be included in an instruction and are not related to detection-evasion -- cause existing powerful detectors to have a large variance in detection performance. We focus on student essay writing as a realistic domain and manually create task-oriented constraints based on several factors for essay quality. Our experiments show that the standard deviation (SD) of current detector performance on texts generated by an instruction with such a constraint is significantly larger (up to an SD of 14.4 F1-score) than that by generating texts multiple times or paraphrasing the instruction. We also observe an overall trend where the constraints can make LLM detection more challenging than without them. Finally, our analysis indicates that the high instruction-following ability of LLMs fosters the large impact of such constraints on detection performance.
This study proposes two advanced transformer-based architectures for enhancing the identification of AI-generated text: a fine-tuned BERT-CNN model and a hybrid DistilBERT-BiLSTM framework. The BERT-CNN architecture combines pre-trained BERT embeddings with convolutional neural networks to detect localized linguistic patterns indicative of synthetic text. The DistilBERT-BiLSTM model integrates the efficiency of DistilBERT with bidirectional LSTM layers to capture sequential dependencies and long-range contextual features. Both approaches employ standardized preprocessing using the BERT tokenizer, including tokenization, padding, and truncation, to ensure consistency in input representation. The BERT-CNN model achieved strong performance with 95.67% accuracy, 94.32% F1-score, and 93.45% precision, demonstrating its capability to discern subtle AI-generated patterns. The DistilBERT-BiLSTM framework further enhanced detection accuracy to 97%, with precision, recall, and F1-score values of 98%, 97%, and 97%, respectively, attributed to its ability to model temporal relationships in text sequences. Both models exhibited robustness against paraphrasing-based evasion techniques, with DistilBERT-BiLSTM showing superior generalization due to its balanced architecture of lightweight language understanding and sequential analysis. This research underscores the efficacy of transformer-based hybrid models in advancing AI-generated text detection, offering scalable solutions for maintaining content authenticity in academic, professional, and digital platforms. The findings contribute to the development of reliable tools for ethical AI adoption and mitigation of misinformation risks.
The rapid proliferation of generative Artificial Intelligence (AI) tools, particularly Large Language Models (LLMs) such as ChatGPT, has introduced unprecedented challenges to academic integrity in higher education. Students increasingly utilize these AI systems to generate essays, reports, and assignments, creating an urgent need for robust detection mechanisms that can identify AI-generated content in academic submissions. This study presents a comprehensive multi-modal classification approach that integrates multiple feature extraction techniques including stylometric analysis, linguistic pattern recognition, and semantic coherence measurement to detect AI-generated text with enhanced accuracy. By employing Convolutional Neural Networks (CNNs) for local feature extraction, recurrent neural architectures for sequential pattern analysis, and fusion-based ensemble learning methods that combine multiple classification pathways, our proposed framework achieves detection accuracy of 94.3 percent on a corpus of authentic student submissions and AI-generated counterparts. The multi-modal approach addresses limitations of single-modality detection systems by capturing diverse textual characteristics including vocabulary diversity, syntactic complexity, semantic consistency, and discourse structure patterns that distinguish human and AI writing. Experimental results demonstrate that AI-generated texts exhibit statistically significant differences in lexical diversity metrics, n-gram patterns, and topic coherence measures compared to authentic student writing. Furthermore, this research investigates the challenges of detection evasion strategies including paraphrasing and hybrid authorship scenarios where students modify AI-generated content. The findings underscore both the potential and limitations of current detection technologies while providing practical recommendations for educational institutions seeking to maintain academic integrity in the age of generative AI.
Machine learning algorithms have gained popularity for performing numerous tasks on text data, including prediction, recommendation, and sentiment analysis. Alongside the development of these algorithms, several adversarial attacks have emerged that inject perturbations into input data to manipulate the outputs of machine-learning-based models. These attacks degrade model performance and lead to incorrect results. This paper introduces two evasion-type adversarial machine-learning attacks for Bangla text and then proposes defensive mechanisms against them. First, a comprehensive Bangla dataset is generated and a teacher model is trained on it. The introduced attacks are then injected into the dataset to manipulate it, and a student model is trained on the manipulated dataset to build a robust model that learns to defeat future adversarial attacks. Finally, experimental analysis shows that the proposed framework achieves robustness and can defeat adversarial machine-learning attacks on Bangla text.
With the advent of large language models (LLM), the line between human-crafted and machine-generated texts has become increasingly blurred. This paper delves into the inquiry of identifying discernible and unique linguistic properties in texts that were written by humans, particularly uncovering the underlying discourse structures of texts beyond their surface structures. Introducing a novel methodology, we leverage hierarchical parse trees and recursive hypergraphs to unveil distinctive discourse patterns in texts produced by both LLMs and humans. Empirical findings demonstrate that, although both LLMs and humans generate distinct discourse patterns influenced by specific domains, human-written texts exhibit more structural variability, reflecting the nuanced nature of human writing in different domains. Notably, incorporating hierarchical discourse features enhances binary classifiers' overall performance in distinguishing between human-written and machine-generated texts, even on out-of-distribution and paraphrased samples. This underscores the significance of incorporating hierarchical discourse features in the analysis of text patterns. The code and dataset are available at https://github.com/minnesotanlp/threads-of-subtlety.
Introduction: The rapid progression of generative AI systems has facilitated the creation of human-like text with remarkable sophistication. Models such as GPT-4, Claude, and Gemini are capable of generating coherent content across a wide range of genres, thereby raising critical concerns regarding the differentiation between machine-generated and human-authored text. This capability presents significant challenges to academic integrity, content authenticity, and the development of reliable detection methodologies. Objective: To evaluate the performance and reliability of current AI-based text detection tools in identifying machine-generated content across different text genres, AI models, and writing styles, establishing a comprehensive benchmark for detection capabilities. Methodology: We systematically evaluated ten commercially available AI detection tools utilizing a curated dataset comprising 150 text samples, expanded from the original 50. This dataset included human-authored texts, both original and translated, as well as AI-generated content from six advanced models (GPT-3.5, GPT-4, Gemini, Bing, Claude, LLaMA2), along with paraphrased variants. Each tool underwent assessment through binary classification, employing metrics such as accuracy, precision, recall, F1 scores, and confusion matrices. Statistical significance was determined using McNemar's test with Bonferroni correction. Results: Content at Scale demonstrated the highest accuracy at 88% (95% CI: 84.2-91.8%), followed by Crossplag at 76% and Copyleaks at 70%. Notably, performance varied significantly across different text categories, with all tools exhibiting reduced accuracy for texts generated by more recent models, such as Claude and LLaMA2. False positive rates ranged from 4% to 32%, which raises concerns regarding their applicability in academic contexts. No tool achieved perfect accuracy, and a performance degradation of 12% was observed with models released subsequent to the initial study design. Conclusions: Current AI text detection tools exhibit moderate to high levels of accuracy; however, they remain imperfect, displaying considerable variability across different AI models and text types. The ongoing challenge of achieving reliable detection, coupled with non-trivial false positive rates, necessitates cautious implementation in high-stakes environments. These tools should serve as a complement to, rather than a replacement for, human judgment in academic and professional contexts.
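For readers unfamiliar with the reported test procedure, here is a minimal sketch of McNemar's test on paired detector decisions with a Bonferroni-corrected alpha; the contingency counts below are illustrative, not the study's data.

```python
# Illustrative McNemar test with Bonferroni correction (counts are made up).
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired outcomes for detectors A and B on the same samples:
# rows = A correct/incorrect, cols = B correct/incorrect
table = [[112, 18],
         [6, 14]]
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs

n_comparisons = 45            # e.g., all pairs among ten tools: 10 * 9 / 2
alpha = 0.05 / n_comparisons  # Bonferroni-corrected significance level
print(f"p = {result.pvalue:.4f}, significant: {result.pvalue < alpha}")
```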
With the increasing integration of large language models (LLMs) into open-domain writing, detecting machine-generated text has become a critical task for ensuring content authenticity and trust. Existing approaches rely on statistical discrepancies or model-specific heuristics to distinguish between LLM-generated and human-written text. However, these methods struggle in real-world scenarios due to limited generalization, vulnerability to paraphrasing, and lack of explainability, particularly when facing stylistic diversity or hybrid human-AI authorship. In this work, we propose StyleDecipher, a robust and explainable detection framework that revisits LLM-generated text detection using combined feature extractors to quantify stylistic differences. By jointly modeling discrete stylistic indicators and continuous stylistic representations derived from semantic embeddings, StyleDecipher captures distinctive style-level divergences between human and LLM outputs within a unified representation space. This framework enables accurate, explainable, and domain-agnostic detection without requiring access to model internals or labeled segments. Extensive experiments across five diverse domains, including news, code, essays, reviews, and academic abstracts, demonstrate that StyleDecipher consistently achieves state-of-the-art in-domain accuracy. Moreover, in cross-domain evaluations, it surpasses existing baselines by up to 36.30%, while maintaining robustness against adversarial perturbations and mixed human-AI content. Further qualitative and quantitative analysis confirms that stylistic signals provide explainable evidence for distinguishing machine-generated text. Our source code can be accessed at https://github.com/SiyuanLi00/StyleDecipher.
Large language models (LLMs) have transformed human writing by enhancing grammar correction, content expansion, and stylistic refinement. However, their widespread use raises concerns about authorship, originality, and ethics, even potentially threatening scholarly integrity. Existing detection methods, which mainly rely on single-feature analysis and binary classification, often fail to effectively identify LLM-generated text in academic contexts. To address these challenges, we propose a novel Multi-level Fine-grained Detection (MFD) framework that detects LLM-generated text by integrating low-level structural, high-level semantic, and deep-level linguistic features, while conducting sentence-level evaluations of lexicon, grammar, and syntax for comprehensive analysis. To improve detection of subtle differences in LLM-generated text and enhance robustness against paraphrasing, we apply two mainstream evasion techniques to rewrite the text. These variations, along with the original texts, are used to train a text encoder via contrastive learning, extracting high-level semantic features of sentences to boost detection generalization. Furthermore, we leverage an advanced LLM to analyze the entire text and extract deep-level linguistic features, enhancing the model's ability to capture complex patterns and nuances while effectively incorporating contextual information. Extensive experiments on public datasets show that the MFD model outperforms existing methods, achieving an MAE of 0.1346 and an accuracy of 88.56%. Our research provides institutions and publishers with an effective mechanism to detect LLM-generated text, mitigating risks of compromised authorship. Educators and editors can use the model's predictions to refine verification and plagiarism-prevention protocols, ensuring adherence to standards.
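A minimal sketch of the contrastive-training ingredient described above, assuming an in-batch InfoNCE objective over (original, paraphrase) pairs; the encoder, temperature, and pairing scheme are assumptions, not the paper's exact recipe.

```python
# Sketch of in-batch InfoNCE over (original, paraphrased) embedding pairs.
import torch
import torch.nn.functional as F

def info_nce(orig_emb, para_emb, temperature=0.07):
    """Each original text should match its own paraphrase (diagonal) and
    repel every other sample in the batch."""
    orig = F.normalize(orig_emb, dim=-1)
    para = F.normalize(para_emb, dim=-1)
    logits = orig @ para.T / temperature      # cosine-similarity matrix
    labels = torch.arange(orig.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random "encoder outputs" for a batch of 8 pairs
orig = torch.randn(8, 768, requires_grad=True)
para = torch.randn(8, 768, requires_grad=True)
info_nce(orig, para).backward()
```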
Adversarial training has emerged as a powerful technique for improving the reliability of natural language processing (NLP) models, especially for sentiment analysis and machine translation. By providing adversarial examples during the training process, models are exposed to perturbations that challenge their understanding and interpretation of textual data. This process helps develop models that are not only accurate but also resilient to manipulations and noise in real-world scenarios. In sentiment analysis, adversarial training ensures that models can maintain consistent performance despite variations in input text, such as paraphrasing or the inclusion of misleading sentiment indicators. This robustness is crucial for applications involving user-generated content, where linguistic diversity and intentional manipulation are common. In the context of machine translation, adversarial training contributes to the development of models that can handle diverse linguistic structures and idiomatic expressions, which are often sources of errors in traditional models. By simulating adversarial attacks that introduce such complexities, the training process makes models more adept at preserving the semantic integrity of translated texts across different languages. This improved robustness is particularly beneficial for applications requiring high translation accuracy and reliability, such as international communication, content localization, and multilingual information retrieval. Overall, adversarial training represents a significant advance toward more resilient and effective NLP models for sentiment analysis and machine translation.
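To ground the idea, here is a minimal sketch of one embedding-space adversarial training step in the Fast Gradient Method style; the model here consumes embeddings directly, which simplifies perturbing the embedding layer of a full transformer, and epsilon is an arbitrary placeholder.

```python
# Sketch of an FGM-style adversarial training step on input embeddings.
import torch

def fgm_training_step(model, embeddings, labels, loss_fn, epsilon=1.0):
    # embeddings: (batch, seq_len, dim); model consumes embeddings directly
    embeddings = embeddings.clone().detach().requires_grad_(True)
    clean_loss = loss_fn(model(embeddings), labels)
    # Gradient w.r.t. the inputs gives the direction of the attack
    (grad,) = torch.autograd.grad(clean_loss, embeddings)
    norm = grad.flatten(1).norm(dim=1).clamp_min(1e-8).view(-1, 1, 1)
    adv_embeddings = (embeddings + epsilon * grad / norm).detach()
    # Joint objective over clean and perturbed inputs
    total = (loss_fn(model(embeddings), labels)
             + loss_fn(model(adv_embeddings), labels))
    total.backward()
    return total.item()
```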
Large language models (LLMs) have transformed natural language generation capabilities across numerous applications, yet their proliferation raises critical concerns regarding content attribution, intellectual property protection, and potential misuse. Watermarking techniques have emerged as promising solutions for embedding verifiable signals into LLM outputs, but existing approaches remain vulnerable to sophisticated evasion attacks that exploit detection mechanisms through adversarial modifications. This paper introduces a novel watermarking framework that integrates multi-scale semantic embedding with cryptographic verification to achieve robust attribution of LLM-generated text. Our approach operates across multiple granularity levels, from token-level perturbations to discourse-level structural patterns, while incorporating error-correcting codes and cryptographic signatures to ensure detection integrity even under aggressive tampering attempts. Through comprehensive evaluation on diverse text generation tasks, we demonstrate that our framework achieves superior robustness against paraphrasing attacks, token substitution, and deletion operations while maintaining high text quality with perplexity comparable to unwatermarked outputs. The integration of cryptographic primitives enables public verifiability without exposing watermarking keys, addressing critical security requirements for real-world deployment. Our results show detection accuracy exceeding 94 percent under various attack scenarios while preserving semantic coherence and stylistic naturalness of generated text.
No abstract available
Natural Language Processing (NLP) systems have achieved remarkable success in sentiment analysis, named entity recognition, and text classification through deep learning architectures such as Transformers and recurrent neural networks. However, these models remain vulnerable to adversarial perturbations: small, carefully crafted textual modifications capable of misleading predictions. This research introduces DUAL-ARMOR, an integrated framework designed to enhance adversarial robustness, interpretability, and certification in NLP models. Using benchmark datasets (IMDB, SST-2, and AG News), the study evaluates four model architectures (BERT, RoBERTa, LSTM, and GRU) against gradient-based, rule-based, and semantic-preserving adversarial attacks. DUAL-ARMOR combines Token-Aware Adversarial Training (TAAT) for lexical invariance, Internal-Noise Regularization (INR) for decision-boundary smoothing, and an External Guardian Layer that incorporates an Ensemble Consensus Detector (ECD) and Certified Radius Estimator (CRE) for real-time attack detection and robustness certification. Experimental results show a significant reduction in robustness degradation ratios (from 36% to below 12%) and improved calibration, with the Expected Calibration Error halved across models. Linguistic coherence and attention stability also improved, with Grad-CAM visualizations confirming enhanced focus consistency under attack. The framework achieved detection AUC values above 90% and increased certified coverage by over 30%, validating its robustness under both synthetic and semantic adversarial scenarios. Statistical significance tests (p < 0.05) verified the reliability of these results, while computational overhead remained within practical limits (+24% training, +13% inference). Overall, DUAL-ARMOR establishes a certifiable, end-to-end defense paradigm that unifies adversarial training, regularization, and runtime detection, offering a scalable, interpretable, and security-first solution for deploying NLP models in safety-critical domains such as finance, healthcare, and cybersecurity.
State-of-the-art machine learning models are prone to adversarial attacks: maliciously crafted inputs that fool the model into making a wrong prediction, often with high confidence. While defense strategies have been extensively explored in the computer vision domain, research in natural language processing still lacks techniques to make models resilient to adversarial text inputs. We adapt a technique from computer vision to detect word-level attacks targeting text classifiers. This method relies on training an adversarial detector leveraging Shapley additive explanations and outperforms the current state of the art on two benchmarks. Furthermore, we show that the detector requires only a small number of training samples and, in some cases, generalizes to different datasets without needing to be retrained.
Unnatural text correction aims to automatically detect and correct spelling errors or adversarial perturbation errors in sentences. Existing methods typically rely on fine-tuning or adversarial training to correct errors, which have achieved significant success. However, these methods exhibit poor generalization performance due to the difference in data distribution between training data and real-world scenarios, known as the exposure bias problem. In this paper, we propose a self-correct adversarial training framework for learning from mistakes (LIMIT), which is a task- and model-independent framework to correct unnatural errors or mistakes. Specifically, we fully utilize errors generated by the model that are actively exposed during the inference phase, i.e., predictions that are inconsistent with the target. This training method not only simulates potential errors in real application scenarios, but also mitigates the exposure bias of the traditional training process. Meanwhile, we design a novel decoding intervention strategy to maintain semantic consistency. Extensive experimental results on Chinese unnatural text error correction datasets show that our proposed method can correct multiple forms of errors and outperforms the state-of-the-art text correction methods. In addition, extensive results on Chinese and English datasets validate that LIMIT can serve as a plug-and-play defense module and can extend to new models and datasets without further training.
Spreading false information through fake news articles poses a significant danger to society because it can shape public opinion with inaccurate facts, leading to negative effects such as reduced trust in institutions and the promotion of conflict, division, and even violence. In this article, a text augmentation technique is introduced as a means of generating new data from preexisting fake news datasets. This approach can enhance classifier performance by 3%–11%. It can also be used to launch a successful attack on trained classifiers, with up to a 90% success rate. However, the success rate of these attacks decreased to less than 28% when the model was retrained with the generated adversarial examples. These results demonstrate the effectiveness of text augmentation as a viable method for detecting fake news and increasing classifier accuracy and performance, as well as its ability to be used for adversarial machine learning (ML) and to improve the resilience of ML algorithms.
Recent work has proposed several efficient approaches for generating gradient-based adversarial perturbations on embeddings and has shown that a model's performance and robustness can be improved when it is trained with these contaminated embeddings. However, little attention has been paid to helping the model learn from these adversarial samples more efficiently. In this work, we focus on enhancing the model's ability to defend against gradient-based adversarial attacks during training and propose two novel adversarial training approaches: (1) CARL narrows the distance between an original sample and its adversarial counterpart in the representation space while enlarging their distance from differently labeled samples. (2) RAR forces the model to reconstruct the original sample from its adversarial representation. Experiments show that the two proposed approaches outperform strong baselines on various text classification datasets. Analysis experiments find that, with our approaches, the semantic representation of the input sentence is not significantly affected by adversarial perturbations, and the model's performance drops less under adversarial attack; that is, our approaches effectively improve the robustness of the model. Moreover, RAR can also be used to generate text-form adversarial samples.
No abstract available
The rapid development of Large Language Models (LLMs) has transformed various fields, especially education, where their ability to generate human-like text enhances writing efficiency. However, these advancements present challenges for developing students' critical thinking and writing skills. It is therefore important to distinguish between human-written and AI-generated text to maintain academic integrity. This study proposes a machine-learning approach that utilizes an ensemble of RoBERTa transformer models to classify AI-generated text in English essays. The proposed method combines three variants of the RoBERTa model with different training parameters to improve the classification model's performance. Evaluation results show strong performance, with a precision of 99.560%, a recall of 97.839%, and an F1-score of 98.692%. These results outperform the individual RoBERTa models and traditional machine learning models such as Naive Bayes, Support Vector Machine, and Random Forest. The findings highlight the effectiveness of using an ensemble of RoBERTa transformer models for the classification of AI-generated text. This research contributes to the development of AI-generated text classification models and offers solutions to the challenges that the growth of LLMs poses in education.
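A minimal sketch of soft voting over fine-tuned RoBERTa variants, in the spirit of the ensemble described above; the checkpoint names are hypothetical placeholders and the probability-averaging scheme is an assumption.

```python
# Sketch of a soft-voting ensemble; checkpoint names are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoints = ["roberta-variant-a", "roberta-variant-b", "roberta-variant-c"]
tok = AutoTokenizer.from_pretrained("roberta-base")

def ensemble_predict(text: str) -> int:
    batch = tok(text, return_tensors="pt", truncation=True)
    probs = []
    for ckpt in checkpoints:  # placeholder fine-tuned checkpoints
        model = AutoModelForSequenceClassification.from_pretrained(ckpt)
        with torch.no_grad():
            probs.append(model(**batch).logits.softmax(-1))
    # Average class probabilities across ensemble members (soft voting)
    return torch.stack(probs).mean(0).argmax(-1).item()
```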
Existing methods of generating adversarial texts usually change the original meaning of the text significantly and can even produce unreadable text. Such less-readable adversarial texts can successfully cause a machine classifier to misclassify, but they cannot deceive human observers very well. In this paper, we propose a novel method that generates readable adversarial texts whose perturbations can also successfully confuse human observers. Based on the continuous bag-of-words (CBOW) model, the proposed method searches for appropriate perturbations to generate adversarial texts by controlling the perturbation direction vectors. Meanwhile, we apply adversarial training to regularize the classification model and extend it to semi-supervised tasks with virtual adversarial training. Experiments show that the generated adversaries are interpretable and confusing to humans, and that virtual adversarial training effectively improves the robustness of the model.
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating high-quality text, raising significant concerns regarding copyright protection and content provenance verification. However, most existing watermarking techniques rely on uniform perturbation or rule-based token biasing schemes, which exhibit critical vulnerabilities under adversarial attacks such as paraphrasing, translation, and content truncation, often failing to maintain detection reliability in real-world deployment scenarios. To address these challenges, this paper introduces a novel context-aware robust watermarking framework that dynamically adjusts watermark embedding strength according to contextual semantic characteristics during text generation. The proposed approach incorporates a token-level semantic modulation mechanism that strategically intensifies watermark signals in copyright-sensitive segments while minimizing perturbations in semantically neutral regions, achieving an improved balance between imperceptibility and robustness. Furthermore, an adaptive threshold estimation algorithm is developed for watermark detection, which automatically calibrates detection boundaries based on noise statistics, significantly enhancing resilience against diverse attack vectors. Extensive experiments on the WaterBench benchmark demonstrate superior performance over state-of-the-art baselines, maintaining high detection accuracy with a 95.3% true positive rate (TPR) under clean conditions and strong robustness under severe perturbations, including paraphrasing attacks (82.7% TPR), translation attacks (78.4% TPR), and content truncation (88.9% TPR at 50% retention). Meanwhile, the proposed method reduces false positive rates by 43.2% compared with existing approaches while preserving text quality with negligible perplexity increase (1.8%). These results establish a new paradigm for practical and scalable LLM watermarking in real-world copyright-sensitive deployment scenarios.
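The adaptive-threshold idea above can be illustrated with a minimal sketch: calibrate the detection boundary from the score distribution of unwatermarked (null) text rather than fixing it. The k-sigma rule below is an assumption for illustration; the paper's actual estimator may differ.

```python
# Sketch of threshold calibration from null-text score statistics.
import statistics

def calibrate_threshold(null_scores: list[float], k: float = 3.0) -> float:
    """Set the detection threshold k standard deviations above the mean
    score observed on unwatermarked text."""
    mu = statistics.mean(null_scores)
    sigma = statistics.stdev(null_scores)
    return mu + k * sigma

null_scores = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2]  # scores on clean text
threshold = calibrate_threshold(null_scores)
print(f"flag as watermarked if score > {threshold:.2f}")
```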
Stance Detection is the task of identifying the position of an author of a text towards an issue or a target. Previous studies on Stance Detection indicate that the existing systems are non-robust to the variations and errors in input sentences. Our proposed methodology uses Contrastive Learning to learn sentence representations by bringing semantically similar sentences and sentences implying the same stance closer to each other in the embedding space. We compare our approach to a pretrained transformer model directly finetuned with the stance datasets. We use char-level and word-level adversarial perturbation attacks to measure the resilience of the models and we show that our approach achieves better performances and is more robust to the different adversarial perturbations introduced to the test data. The results indicate that our approach performs better on small-sized and class-imbalanced stance datasets.
Stance Detection refers to the process of determining an author’s position towards a particular issue or target in a text. Previous research suggests that existing systems for Stance Detection are not resilient enough to handle variations and errors in input sentences. In our proposed methodology, we utilize Contrastive Learning to learn sentence representations. We achieve this by bringing semantically similar sentences and those implying the same stance closer to each other in the embedding space. To compare our approach, we use a pretrained transformer model that is directly finetuned with the stance datasets. We evaluate the resilience of the models using char-level and word-level adversarial perturbation attacks and show that our approach performs better and is more robust to the different adversarial perturbations introduced to the test data. Our approach is also shown to perform better on small-sized and class-imbalanced stance datasets. We further experiment with unlabeled stance datasets to make the representation learning independent of domain-specific labels, and the models trained with our approach on unlabeled datasets are still robust and perform comparably to those trained with labeled data.
Following great success in the image processing field, the idea of adversarial training has been applied to tasks in the natural language processing (NLP) field. One promising approach directly applies adversarial training developed for image processing to the input word embedding space instead of the discrete input space of texts. However, this approach gives up the interpretability of generating actual adversarial texts in exchange for improved performance on NLP tasks. This paper restores interpretability to such methods by restricting the directions of perturbations toward existing words in the input embedding space. As a result, each perturbed input can be straightforwardly reconstructed as an actual text by treating the perturbations as word replacements in the sentence, while maintaining or even improving task performance.
Large language models (LLMs) have shown the ability to produce fluent and cogent content, presenting both productivity opportunities and societal risks. To build trustworthy AI systems, it is imperative to distinguish between machine-generated and human-authored content. The leading zero-shot detector, DetectGPT, showcases commendable performance but is marred by intensive computational costs. In this paper, we introduce the concept of conditional probability curvature to elucidate discrepancies in word choices between LLMs and humans within a given context. Using this curvature as a foundational metric, we present Fast-DetectGPT, an optimized zero-shot detector that substitutes DetectGPT's perturbation step with a more efficient sampling step. Our evaluations on various datasets, source models, and test conditions indicate that Fast-DetectGPT not only surpasses DetectGPT by around 75% relative in both the white-box and black-box settings but also accelerates the detection process by a factor of 340. See https://github.com/baoguangsheng/fast-detect-gpt for code, data, and results.
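Since the abstract describes the metric only at a high level, here is a minimal sketch of conditional probability curvature in the white-box case where the sampling and scoring model coincide, using the analytic mean and variance of token log-probability; GPT-2 is used purely for illustration and details are simplified relative to the paper.

```python
# Sketch of conditional probability curvature (single-model, white-box case).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def curvature_score(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[:, :-1]        # next-token predictions
    log_probs = logits.log_softmax(-1)
    probs = log_probs.exp()
    target = ids[:, 1:].unsqueeze(-1)
    observed = log_probs.gather(-1, target).squeeze(-1).sum()
    # Analytical mean/variance of token log-prob under the model's own
    # conditional distribution at each position
    mean = (probs * log_probs).sum(-1)
    var = (probs * log_probs.pow(2)).sum(-1) - mean.pow(2)
    return ((observed - mean.sum()) / var.sum().sqrt()).item()

# Higher scores flag text the model finds "surprisingly likely", a
# signature of machine generation under this metric.
print(curvature_score("The quick brown fox jumps over the lazy dog."))
```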
Text watermarking algorithms for large language models (LLMs) can effectively identify machine-generated texts by embedding and detecting hidden features in the text. Although current text watermarking algorithms perform well in most high-entropy scenarios, their performance in low-entropy scenarios still needs to be improved. In this work, we argue that the influence of token entropy should be fully considered in the watermark detection process, i.e., the weight of each token during watermark detection should be customized according to its entropy, rather than setting the weights of all tokens to the same value as in previous methods. Specifically, we propose Entropy-based Text Watermarking Detection (EWD), which gives higher-entropy tokens higher influence weights during watermark detection, so as to better reflect the degree of watermarking. Furthermore, the proposed detection process is training-free and fully automated. Experiments demonstrate that EWD achieves better detection performance in low-entropy scenarios, and the method is general and can be applied to texts with different entropy distributions. Code and data are available at https://github.com/luyijian3/EWD, and the algorithm can also be accessed through MarkLLM (https://github.com/THU-BPM/MarkLLM).
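A minimal sketch of the entropy-weighted test, assuming a green-list-style watermark where each token's vote is scaled by its conditional entropy before computing a z-statistic; the specific weight function here (weight = raw entropy) is an assumption, and the paper's choice may differ.

```python
# Sketch of an entropy-weighted green-list z-statistic.
import math

def ewd_z_score(green_flags, entropies, gamma=0.25):
    """green_flags[i]: token i is in the green list; entropies[i]: its
    conditional entropy under the language model (higher = more weight)."""
    weights = entropies                  # simplest choice: weight = entropy
    weighted_hits = sum(w for w, g in zip(weights, green_flags) if g)
    mean = gamma * sum(weights)
    var = gamma * (1 - gamma) * sum(w * w for w in weights)
    return (weighted_hits - mean) / math.sqrt(var)

# Low-entropy tokens (forced choices) barely count; high-entropy ones dominate.
print(ewd_z_score([True, False, True, True], [0.1, 2.3, 1.8, 2.0]))
```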
Out-of-distribution (OOD) detection remains challenging in text-rich networks, where textual features intertwine with topological structures. Existing methods primarily address label shifts or rudimentary domain-based splits, overlooking the intricate textual-structural diversity. For example, in social networks, where users represent nodes with textual features (name, bio) while edges indicate friendship status, OOD may stem from the distinct language patterns between bot and normal users. To address this gap, we introduce the TextTopoOOD framework for evaluating detection across diverse OOD scenarios: (1) attribute-level shifts via text augmentations and embedding perturbations; (2) structural shifts through edge rewiring and semantic connections; (3) thematically-guided label shifts; and (4) domain-based divisions. Furthermore, we propose TNT-OOD to model the complex interplay between Text aNd Topology using: 1) a novel cross-attention module to fuse local structure into node-level text representations, and 2) a HyperNetwork to generate node-specific transformation parameters. This aligns topological and semantic features of ID nodes, enhancing ID/OOD distinction across structural and textual shifts. Experiments on 11 datasets across four OOD scenarios demonstrate the nuanced challenge of TextTopoOOD for evaluating OOD detection in text-rich networks.
Large language models have revolutionized text generation, offering significant benefits while also posing threats to society, such as copyright infringement and misinformation. To prevent harmful use, the task of detecting machine-generated content has become an important research topic, though it remains particularly challenging across diverse content domains. This paper presents DGRM, an innovative add-on module designed to improve the domain generalization capability of existing machine-generated text detectors. Our model consists of two training components. (1) Feature disentanglement separates a text’s embedding into target-specific and common attributes, thereby enhancing semantic domain generalization across different content domains. (2) Feature regularization applies constraints to these attributes to extract additional target-relevant information and ensure detection consistency under syntactic perturbations—thus achieving syntactic domain generalization. Evaluation over multiple datasets demonstrates that incorporating our module substantially improves the detection of machine-generated text across semantically and syntactically diverse domains. We hope our work contributes to mitigating the harmful use of language models.
The growing amount and quality of AI-generated texts make detecting such content increasingly difficult. In most real-world scenarios, the domain (style and topic) of generated data and the generator model are not known in advance. In this work, we focus on the robustness of classifier-based detectors of AI-generated text, namely their ability to transfer to unseen generators or semantic domains. We investigate the geometry of the embedding space of Transformer-based text encoders and show that clearing out harmful linear subspaces helps to train a robust classifier that ignores domain-specific spurious features. We investigate several subspace decomposition and feature selection strategies and achieve significant improvements over state-of-the-art methods in cross-domain and cross-generator transfer. Our best approaches for head-wise and coordinate-based subspace removal increase the mean out-of-distribution (OOD) classification score by up to 9% and 14% in particular setups for RoBERTa and BERT embeddings, respectively. We release our code and data: https://github.com/SilverSolver/RobustATD
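To make the subspace-removal idea concrete, here is a minimal sketch: estimate domain-predictive directions and project embeddings onto their orthogonal complement before training the detector. The direction-selection strategy below (a mean-difference direction) is one simple assumption; the paper compares several strategies.

```python
# Sketch of projecting out a "harmful" linear subspace from embeddings.
import numpy as np

def remove_subspace(X: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Project embeddings X (n, d) onto the orthogonal complement of the
    span of `directions` (k, d)."""
    Q, _ = np.linalg.qr(directions.T)     # orthonormal basis, shape (d, k)
    return X - (X @ Q) @ Q.T

# Example: one spurious direction separating two semantic domains
X = np.random.randn(100, 768)
domain = np.random.rand(100) < 0.5
spurious = (X[domain].mean(0) - X[~domain].mean(0))[None, :]
X_clean = remove_subspace(X, spurious)
```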
Large Language Models (LLMs) can generate high-quality text and demonstrate excellent performance on various tasks. However, their ability to generate increasingly fluent, human-like text has also raised concerns about misuse for malicious purposes. It is therefore urgent to design reliable and robust methods for generated-text detection. The zero-shot detector Fast-DetectGPT not only demonstrates strong detection accuracy but is also commendably fast. However, it is unclear whether it can maintain such a high detection success rate in the face of paraphrase attacks. In this article, we use various paraphrase attacks to perturb text and evaluate the robustness of Fast-DetectGPT, finding a significant decrease in its performance. We then propose Sentence-DetectGPT to optimize the detection process of Fast-DetectGPT, greatly enhancing its robustness against paraphrase attacks.
Phishing and related cyber threats are becoming more varied and technologically advanced. Among these, email-based phishing remains the most dominant and persistent threat. These attacks exploit human vulnerabilities to disseminate malware or gain unauthorized access to sensitive information. Deep learning (DL) models, particularly transformer-based models, have significantly enhanced phishing mitigation through their contextual understanding of language. However, some recent threats, specifically Artificial Intelligence (AI)-generated phishing attacks, are reducing the overall system resilience of phishing detectors. In response, adversarial training has shown promise against AI-generated phishing threats. This study presents a hybrid approach that uses DistilBERT, a smaller, faster, and lighter version of the BERT transformer model for email classification. Robustness against text-based adversarial perturbations is reinforced using Fast Gradient Method (FGM) adversarial training. Furthermore, the framework integrates the LIME Explainable AI (XAI) technique to enhance the transparency of the DistilBERT architecture. The framework also uses the Flan-T5-small language model from Hugging Face to generate plain-language security narrative explanations for end-users. This combined approach ensures precise phishing classification while providing easily understandable justifications for the model's decisions.
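Since the abstract names LIME explicitly, here is a minimal sketch of the explanation step for a single email; `predict_proba` is a placeholder for the fine-tuned DistilBERT classifier's probability function, and the example email is invented.

```python
# Sketch of a LIME text explanation; predict_proba is a placeholder.
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # Placeholder: in practice, run the fine-tuned DistilBERT and return
    # an (n, 2) array of [legitimate, phishing] probabilities.
    return np.tile([0.2, 0.8], (len(texts), 1))

explainer = LimeTextExplainer(class_names=["legitimate", "phishing"])
exp = explainer.explain_instance(
    "Urgent: verify your account now at http://example.com",
    predict_proba, num_features=5)
print(exp.as_list())  # token-level contributions to the phishing score
```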
Large Language Models (LLMs) are vulnerable to jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires a time-consuming search for adversarial prompts, whereas automatic adversarial prompt generation often leads to semantically meaningless attacks that do not scale well. In this paper, we present a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds. AdvPrompter, which is trained using an alternating optimization algorithm, generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open source TargetLLMs show highly competitive results on the AdvBench and HarmBench datasets, that also transfer to closed-source black-box LLMs. We also show that training on adversarial suffixes generated by AdvPrompter is a promising strategy for improving the robustness of LLMs to jailbreaking attacks.
Taken together, the merged grouping covers the full technical lifecycle of AIGC text detection and its adversarial countermeasures. Research has evolved from early, simple statistical binary classification into a sophisticated body of work that deeply mines linguistic features and uses adversarial interplay to improve robustness. The field currently shows three major trends. First, attack-defense co-evolution: attack techniques have shifted from simple rewriting to sophisticated "de-fingerprinting" evasion, while defenses have introduced adversarial training and proactive watermarking. Second, generalization and robustness: research focus has moved to the failure modes of detectors across models, domains, and out-of-distribution data. Third, practical deployment and standardization: large-scale industry benchmarks and international shared tasks are driving adoption in real-world settings such as academic integrity and cybersecurity.