Limitations of Human Evaluation of Large Language Models
Cognitive Biases, Subjectivity, and Individual Differences in Human Annotation
This group of studies examines the inherent flaws of humans when they serve as the 'gold standard', including inconsistencies driven by demographic background, psychological biases (e.g., gender, race, culture), task-order effects, and divergent readings of subjective phenomena such as irony or aesthetics (a minimal sketch quantifying annotator disagreement with Cohen's kappa follows the reference list below).
- Modeling and mitigating human annotation errors to design efficient stream processing systems with human-in-the-loop machine learning(Rahul Pandey, Hemant Purohit, Carlos Castillo, Valerie L. Shalin, 2020, ArXiv Preprint)
- MultiPICo: Multilingual Perspectivist Irony Corpus(Silvia Casola, Simona Frenda, Soda Marem Lo, Erhan Sezerer, Antonio Uva, Valerio Basile, Cristina Bosco, Alessandro Pedrani, Chiara Rubagotti, Viviana Patti, Davide Bernardi, 2024, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement(Gabriele Sarti, Vilém Zouhar, Malvina Nissim, Arianna Bisazza, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing)
- Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets(Tommaso Giorgi, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, Stefano Cresci, 2024, ArXiv Preprint)
- Exploring Subjectivity for more Human-Centric Assessment of Social Biases in Large Language Models(Paula Akemi Aoyagui, Sharon Ferguson, Anastasia Kuzminykh, 2024, ArXiv Preprint)
- Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation(Danielle R. Thomas, Conrad Borchers, Ken Koedinger, 2025, arXiv.org)
- Methods for assessing the quality of experts in the verification of large language models(D. Teterevenkov, 2025, Neurocomputers)
- Annotator-Centric Active Learning for Subjective NLP Tasks(Michiel van der Meer, Neele Falk, P. Murukannaiah, Enrico Liscio, 2024, Conference on Empirical Methods in Natural Language Processing)
- Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text(Jennifer Healey, Laurie Byrum, Md Nadeem Akhtar, Surabhi Bhargava, Moumita Sinha, 2025, ArXiv Preprint)
- Large Language Models are overconfident and amplify human bias(Feng Sun, Ningke Li, Kailong Wang, Lorenz F. Goette, 2025, arXiv.org)
- Sycophancy Claims about Language Models: The Missing Human-in-the-Loop(Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci, 2025, ArXiv Preprint)
- "All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations(Michael Hardy, 2024, arXiv.org)
- Beyond Mimicry: Auditing Human Bias to Build Fairer AI for Alzheimer’s Assessment(Liu He, Rui Feng, Xinran Han, Yinlong Liu, Jiahong Yuan, 2025, Companion Proceedings of the 27th International Conference on Multimodal Interaction)
- Modeling Subjectivity in Cognitive Appraisal with Language Models(Yuxiang Zhou, Hainiu Xu, Desmond C. Ong, P. Slovák, Yulan He, 2025, Conference on Empirical Methods in Natural Language Processing)
- Subjectivity in Stereotypes against Migrants in Italian: An Experimental Annotation Procedure(Soda Marem Lo, M. Stranisci, Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Elisabetta Jezek, Viviana Patti, 2025, Italian Conference on Computational Linguistics)
- Discriminating Similar Languages: Evaluations and Explorations(Cyril Goutte, Serge Léger, Shervin Malmasi, Marcos Zampieri, 2016, ArXiv Preprint)
- One Rating to Rule Them All?: Evidence of Multidimensionality in Human Assessment of Topic Labeling Quality(Amin Hosseiny Marani, Joshua Levine, Eric P. S. Baumer, 2022, Proceedings of the 31st ACM International Conference on Information & Knowledge Management)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models(Aparna Elangovan, Ling Liu, Lei Xu, S. Bodapati, Dan Roth, 2024, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- Inter-Annotator Agreement and Its Reflection in LLMs and Responsible AI(Amir Toliyat, Elena Filatova, Ronak Etemadpour, 2025, The International FLAIRS Conference Proceedings)
- Capturing Human Perspectives in NLP: Questionnaires, Annotations, and Biases(Wiktoria Mieleszczenko-Kowszewicz, Kamil Kanclerz, Julita Bielaniewicz, Marcin Oleksy, Marcin Gruza, Stanisław Woźniak, Ewa Dzieciol, Przemyslaw Kazienko, Jan Kocoń, 2023, No journal)
- Assessment of Agreement Between Human Ratings and Lexicon-Based Sentiment Ratings of Open-Ended Responses on a Behavioral Rating Scale(Olivia Gratz, D. Vos, Megan Burke, Neelkamal Soares, 2021, Assessment)
- Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models(Alaa Alhamzeh, Mays Al Rebdawi, 2025, arXiv.org)
- A Matter of Perspective(s): Contrasting Human and LLM Argumentation in Subjective Decision-Making on Subtle Sexism(Paula Akemi Aoyagui, Kelsey Stemmler, Sharon Ferguson, Young-ho Kim, Anastasia Kuzminykh, 2025, ArXiv Preprint)
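Most of the papers above quantify the disagreement they study with chance-corrected agreement statistics before deciding whether it reflects noise or legitimate perspective differences. The following is a minimal, self-contained sketch of pairwise Cohen's kappa on invented irony labels; it illustrates the statistic only and is not code from any of the cited works.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Hypothetical irony annotations from three annotators on the same ten items.
annotations = {
    "ann1": ["irony", "irony", "literal", "irony", "literal",
             "irony", "literal", "literal", "irony", "literal"],
    "ann2": ["irony", "literal", "literal", "irony", "literal",
             "irony", "irony", "literal", "irony", "literal"],
    "ann3": ["literal", "literal", "literal", "irony", "irony",
             "irony", "irony", "literal", "literal", "literal"],
}

for (name_a, a), (name_b, b) in combinations(annotations.items(), 2):
    print(f"kappa({name_a}, {name_b}) = {cohen_kappa(a, b):.2f}")
```

Perspectivist resources such as MultiPICo instead release the disaggregated labels, treating this kind of disagreement as signal rather than noise.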
Scalability Bottlenecks, Cost, and the Reproducibility Crisis of Human Evaluation
These studies analyze the challenges that purely human evaluation faces when confronted with large-scale model outputs, such as prohibitive monetary and time costs, irreproducibility caused by flawed experimental design, and engineering problems such as a lack of transparency (a back-of-the-envelope sample-size and cost sketch follows the reference list below).
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP(Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, Diyi Yang, 2023, ArXiv Preprint)
- An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment(Xuanxin Wu, Yuki Arase, 2024, ArXiv Preprint)
- Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark(Vitali Petsiuk, Alexander E. Siemenn, Saisamrit Surbehera, Zad Chin, Keith Tyser, Gregory Hunter, Arvind Raghavan, Yann Hicke, Bryan A. Plummer, Ori Kerret, Tonio Buonassisi, Kate Saenko, Armando Solar-Lezama, Iddo Drori, 2022, ArXiv Preprint)
- Evaluating LLM Story Generation through Large-scale Network Analysis of Social Structures(Hiroshi Nonaka, K. E. Perry, 2025, ArXiv Preprint)
- Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations(Kevin L. Wei, Patricia Paskov, Sunishchal Dev, Michael J. Byun, Anka Reuel, Xavier Roberts-Gaal, Rachel Calcott, Evie Coxon, Chinmay Deshpande, 2025, ArXiv Preprint)
- DeTAILS: Deep Thematic Analysis with Iterative LLM Support(Ansh Sharma, James R. Wallace, 2025, Proceedings of the 7th ACM Conference on Conversational User Interfaces)
- Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs(Ganghua Wang, Zhaorun Chen, Bo Li, Haifeng Xu, 2025, ArXiv Preprint)
- The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?(Sourav Banerjee, Ayushi Agarwal, Eishkaran Singh, 2024, arXiv.org)
- Toward More Effective Human Evaluation for Machine Translation(Belén Saldías, George Foster, Markus Freitag, Qijun Tan, 2022, ArXiv Preprint)
- Design and Evaluation of Cost-Aware PoQ for Decentralized LLM Inference(Arther Tian, A. Ding, Frank Chen, Alan Wu, Aaron Chan, Bruce Zhang, 2025, arXiv.org)
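Part of the cost problem is simply statistical: reliable win-rate estimates need many judgments. The sketch below is a back-of-the-envelope calculation using the normal approximation to the binomial; the per-judgment price is an arbitrary assumption, not a figure from the cited papers.

```python
import math

def required_judgments(margin: float, confidence: float = 0.95, p: float = 0.5) -> int:
    """Sample size for estimating a win rate within +/- margin,
    using the normal approximation to the binomial (worst case p = 0.5)."""
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]  # two-sided z value
    return math.ceil((z ** 2) * p * (1 - p) / margin ** 2)

COST_PER_JUDGMENT_USD = 0.50  # hypothetical crowd-work price per comparison

for margin in (0.05, 0.02, 0.01):
    n = required_judgments(margin)
    print(f"+/-{margin:.0%} margin -> {n:>6d} judgments "
          f"(~${n * COST_PER_JUDGMENT_USD:,.0f} at the assumed rate)")
```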
Reliability Deficits and Automation Bias of LLM-as-a-Judge
These works examine the limitations of substituting LLMs for humans in automatic evaluation, including length bias, self-enhancement (narcissism) bias, rating inconsistency ("Rating Roulette"), weak complex-reasoning judgment, and the alignment gap with genuine human preferences (a small sketch for probing length bias and order-swap inconsistency follows the reference list below).
- PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization(Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xingxu Xie, Wei Ye, Shikun Zhang, Yue Zhang, 2023, International Conference on Learning Representations)
- Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs(Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen, Shashi Bhushan TN, 2023, ArXiv Preprint)
- Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks(Rajarshi Haldar, J. Hockenmaier, 2025, Findings of the Association for Computational Linguistics: EMNLP 2025)
- A Closer Look into Automatic Evaluation Using Large Language Models(Cheng-Han Chiang, Hung-yi Lee, 2023, ArXiv Preprint)
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge(Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, Domenico Talia, 2025, ArXiv Preprint)
- Are We on the Right Way to Assessing LLM-as-a-Judge?(Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen, 2025, arXiv.org)
- Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments(Roland Daynauth, Jason Mars, 2024, arXiv.org)
- Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates(Hui Wei, Shenghua He, Tian Xia, Fei Liu, Andy Wong, Jingyang Lin, Mei Han, 2024, ArXiv Preprint)
- No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding(Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner, 2025, arXiv.org)
- Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback(Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang, 2023, ArXiv Preprint)
- Rethinking Human Preference Evaluation of LLM Rationales(Ziang Li, Manasi Ganti, Zixian Ma, Helena Vasconcelos, Qijia He, Ranjay Krishna, 2025, arXiv.org)
- Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks(Qintong Li, Leyang Cui, Lingpeng Kong, Wei Bi, 2023, International Conference on Computational Linguistics)
- LLM-based relevance assessment still can't replace human relevance assessment(Charles L. A. Clarke, Laura Dietz, 2024, International Workshop on Evaluating Information Access)
- Can Large Language Models Be an Alternative to Human Evaluations?(Cheng-Han Chiang, Hung-yi Lee, 2023, ArXiv Preprint)
- Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator(Peiwen Yuan, Yiwei Li, Shaoxiong Feng, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li, 2025, arXiv.org)
- AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment(Kun Li, L. Po, Hongzheng Yang, Xuyuan Xu, Kangcheng Liu, Yuzhi Zhao, 2025, Conference on Empirical Methods in Natural Language Processing)
- The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs(Nitay Calderon, Roi Reichart, Rotem Dror, 2025, ArXiv Preprint)
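Two of the failure modes listed above, length bias and rating inconsistency, can be probed without any gold labels: correlate the judge's scores with response length, and re-judge each pair with the candidate order swapped. The sketch below only shows the bookkeeping; `judge_score` and `judge_prefer` are hypothetical placeholders for calls to whatever judge model is being audited.

```python
from statistics import mean

def spearman_rho(x, y):
    """Spearman correlation via Pearson correlation of ranks (no tie averaging)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var if var else 0.0

def audit_judge(items, judge_score, judge_prefer):
    """items: dicts with 'prompt', 'answer_a', 'answer_b'.

    judge_score(prompt, answer) -> numeric score from the judge under audit.
    judge_prefer(prompt, first, second) -> the preferred answer *text*;
    a consistent judge returns the same text regardless of presentation order.
    """
    # 1) Length bias: do longer answers systematically receive higher scores?
    lengths, scores = [], []
    for it in items:
        for ans in (it["answer_a"], it["answer_b"]):
            lengths.append(len(ans.split()))
            scores.append(judge_score(it["prompt"], ans))
    rho = spearman_rho(lengths, scores)

    # 2) Position consistency: does the verdict survive swapping the order?
    flips = sum(
        judge_prefer(it["prompt"], it["answer_a"], it["answer_b"])
        != judge_prefer(it["prompt"], it["answer_b"], it["answer_a"])
        for it in items
    )
    return {"length_score_rho": rho, "position_flip_rate": flips / len(items)}
```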
Human-in-the-Loop (HITL) Collaboration and Optimization of Hybrid Evaluation Workflows
These studies aim to combine deep human insight with the efficiency of LLMs, mitigating the weaknesses of any single evaluation mode through human-in-the-loop mechanisms such as active learning, feedback loops, AI pre-annotation, and human verification (a minimal confidence-threshold routing sketch follows the reference list below).
- Towards a Human-in-the-Loop Framework for Reliable Patch Evaluation Using an LLM-as-a-Judge(Sherry Shi, Renyao Wei, Michele Tufano, José Cambronero, Runxiang Cheng, Franjo Ivančić, Pat Rondon, 2025, ArXiv Preprint)
- Eras: Improving the quality control in the annotation process for Natural Language Processing tasks(Jonatas S. Grosman, P. Furtado, A. M. B. Rodrigues, Guilherme Gonçalves Schardong, Simone Diniz Junqueira Barbosa, H. Lopes, 2020, Information Systems)
- Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications(Leila Tavakoli, Hamed Zamani, 2025, Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR))
- LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages(Nataliia Kholodna, Sahib Julka, M. Khodādādi, M. Gumus, Michael Granitzer, 2024, No journal)
- Can Unconfident LLM Annotations Be Used for Confident Conclusions?(Kristina Gligorić, Tijana Zrnic, Cinoo Lee, Emmanuel J. Candès, Dan Jurafsky, 2024, ArXiv Preprint)
- Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence(Ariana Sahitaj, Premtim Sahitaj, Veronika Solopova, Jiaao Li, Sebastian Möller, Vera Schmitt, 2025, Proceedings of the Fourth Workshop on NLP for Positive Impact (NLP4PI))
- EvaluLLM: LLM assisted evaluation of generative outputs(Michael Desmond, Zahra Ashktorab, Qian Pan, Casey Dugan, James M. Johnson, 2024, Companion Proceedings of the 29th International Conference on Intelligent User Interfaces)
- TagLab: A human-centric AI system for interactive semantic segmentation(Gaia Pavoni, Massimiliano Corsini, Federico Ponchio, Alessandro Muntoni, Paolo Cignoni, 2021, ArXiv Preprint)
- ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models(Yuzhe Gu, Ziwei Ji, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen, 2024, ArXiv Preprint)
- Human-in-the-Loop Annotation for Image-Based Engagement Estimation: Assessing the Impact of Model Reliability on Annotation Accuracy(Sahana Yadnakudige Subramanya, Ko Watanabe, Andreas Dengel, Shoya Ishimaru, 2025, ArXiv Preprint)
- VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures(Yoo yeon Sung, Hannah Kim, Dan Zhang, 2025, arXiv.org)
- Incorporating Human Judgment in AI-Assisted Content Development: The HEAT Heuristic(Gustav Verhulsdonck, Jennifer Weible, D. Stambler, Tharon W. Howard, Jason Tham, 2024, Technical Communication)
- MEGAnno+: A Human-LLM Collaborative Annotation System(Hannah Kim, Kushan Mitra, Rafael Li Chen, Sajjadur Rahman, Dan Zhang, 2024, ArXiv Preprint)
- HilMeMe: A Human-in-the-Loop Machine Translation Evaluation Metric Looking into Multi-Word Expressions(Lifeng Han, 2022, ArXiv Preprint)
- Human-in-the-Loop and Generative AI Dilemma: A Hybrid Strategy for Effective Customer Service in Enterprise CRM(2025, International Journal of Business and Technology Management)
- Adobe Summit Concierge Evaluation with Human in the Loop(Yiru Chen, Sally Fang, Sai Sree Harsha, Dan Luo, Vaishnavi Muppala, Fei Wu, Shun Jiang, Kun Qian, Yunyao Li, 2025, ArXiv Preprint)
- Continuous Model Calibration: Leveraging Feedback-Driven Fine-Tuning for Self- Correcting Large Language Models(Opeyemi Joseph Awotunde, 2025, International Journal of Research Publication and Reviews)
- Human-AI collaboration in legal services: empirical insights on task-technology fit and generative AI adoption by legal professionals(Tina Nosrati, Fariba Nosrati, M. Mashayekhi, 2025, SAM Advanced Management Journal)
- From Human Annotation to Automation: LLM-in-the-Loop Active Learning for Arabic Sentiment Analysis(Dania Refai, Alaa Dalaq, Doaa Dalaq, Irfan Ahmad, 2025, arXiv.org)
- FORGE: An LLM-driven Framework for Large-Scale Smart Contract Vulnerability Dataset Construction(Jiachi Chen, Yiming Shen, Jiashuo Zhang, Zihao Li, John C. Grundy, Zhenzhe Shao, Yanlin Wang, Jiashui Wang, Ting Chen, Zibin Zheng, 2025, arXiv.org)
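A pattern recurring across these works is to let the automatic annotator keep the easy cases and route uncertain or contested items to humans. Below is a minimal routing sketch; the confidence threshold, the two-model disagreement rule, and the label names are illustrative assumptions rather than values taken from the cited papers.

```python
from dataclasses import dataclass

@dataclass
class AutoAnnotation:
    label: str
    confidence: float  # model-reported probability of the predicted label

def route_item(primary: AutoAnnotation, secondary: AutoAnnotation,
               confidence_threshold: float = 0.85) -> str:
    """Decide whether an item keeps its automatic label or goes to human review.

    Route to a human when the primary model is unsure or the two models disagree;
    both criteria are illustrative defaults, not values from the cited papers.
    """
    if primary.confidence < confidence_threshold:
        return "human_review"
    if primary.label != secondary.label:
        return "human_review"
    return "auto_accept"

# Hypothetical batch: (primary annotation, secondary annotation) per item.
batch = [
    (AutoAnnotation("propaganda", 0.97), AutoAnnotation("propaganda", 0.91)),
    (AutoAnnotation("neutral", 0.62),    AutoAnnotation("neutral", 0.88)),
    (AutoAnnotation("propaganda", 0.93), AutoAnnotation("neutral", 0.90)),
]

decisions = [route_item(p, s) for p, s in batch]
print(decisions)  # ['auto_accept', 'human_review', 'human_review']
print("human workload:", decisions.count("human_review") / len(decisions))
```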
Evaluation Challenges in Specialized Vertical Domains and High-Stakes Scenarios
In specialized domains such as medical diagnosis, legal judgment, military decision-making, and code repair, conventional general-purpose evaluation metrics often break down. These studies emphasize that expert knowledge remains irreplaceable when subtle context and extreme accuracy requirements are at stake.
- Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches(Namu Park, Farzad Ahmed, Zhaoyi Sun, Kevin Lybarger, Ethan Breinhorst, Julie Hu, Ozlem Uzuner, Martin Gunn, Meliha Yetisgen, 2025, arXiv.org)
- PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications(Dingkang Yang, Jinjie Wei, Dongling Xiao, Shunli Wang, Tong Wu, Gang Li, Mingcheng Li, Shuaibing Wang, Jiawei Chen, Yue Jiang, Qingyao Xu, Ke Li, Peng Zhai, Lihua Zhang, 2024, ArXiv Preprint)
- Generative Pre-trained Transformer for Pediatric Stroke Research: A Pilot Study.(Anna K. Fiedler, Kai Zhang, T. Lal, Xiaoqian Jiang, Stuart M. Fraser, 2024, Pediatric Neurology)
- Benchmarking and Studying the LLM-based Code Review(Zhengran Zeng, Ruikai Shi, Keke Han, Yixin Li, Kai Sun, Yidong Wang, Zhuohao Yu, Rui Xie, Wei Ye, Shikun Zhang, 2025, arXiv.org)
- The State of Human-centered NLP Technology for Fact-checking(Anubrata Das, Houjiang Liu, Venelin Kovatchev, Matthew Lease, 2023, ArXiv Preprint)
- Human-centred test and evaluation of military AI(David Helmer, Michael Boardman, S. Kate Conroy, Adam J. Hepworth, Manoj Harjani, 2024, ArXiv Preprint)
- The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making(Abinitha Gourabathina, Yuexing Hao, Walter Gerych, Marzyeh Ghassemi, 2025, arXiv.org)
- Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models(D. Leucuța, A. Urda-Cîmpean, Dan Istrate, T. Drugan, 2025, Diagnostics)
- Responsible LLM Deployment for High-Stake Decisions by Decentralized Technologies and Human-AI Interactions(Swati Sachan, Theo Miller, Mai Phuong Nguyen, 2025, ArXiv Preprint)
- LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation(Joseph Enguehard, Morgane Van Ermengem, Katie Atkinson, S. Cha, Arijit Ghosh Chowdhury, Prashanth Kallur Ramaswamy, Jeremy Roghair, Hannah Marlowe, Carina Suzana Negreanu, Kitty Boxall, Diana Mincu, 2025, Proceedings of the Natural Legal Language Processing Workshop 2025)
- Abstract B010: From RECIST to reality: A foundation model pipeline for scalable therapy response evaluation(Katarina Vucic, C. Geady, Katy Scott, Joshua Siraj, Andrew Hope, Benjamin Haibe-Kains, 2025, Clinical Cancer Research)
- Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes(Ora Nova Fandina, Gal Amram, E. Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Rami Katan, Alice Podolsky, Orna Raz, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW))
- Exploring the Potential of the Large Language Models (LLMs) in Identifying Misleading News Headlines(Md Main Uddin Rony, M. Haque, Mohammad Ali, Ahmed Shatil Alam, Naeemul Hassan, 2024, arXiv.org)
- Automated Subjective Answer Evaluation using GenAI and NLP by Enhancing Accuracy, Fairness, and Feedback in Education(Yuvasri J, S. M, Rahul R, R. Venkatesan, G. Sundar, Saravana Kumar C S, 2025, 2025 International Conference on Intelligent Computing and Control Systems (ICICCS))
Rigor in Evaluation Methodology, Statistical Modeling, and Benchmark Design
These works examine design issues in the evaluation frameworks themselves, such as the relative effectiveness of rating scales versus pairwise comparisons, statistical pitfalls of ordinal data, task-induced gaming behavior, and how mathematical models (e.g., Polyrating) can quantify and mitigate evaluation bias (a small simulation contrasting interval- and ordinal-scale significance tests follows the reference list below).
- The Great Misalignment Problem in Human Evaluation of NLP Methods(Mika Hämäläinen, Khalid Alnajjar, 2021, ArXiv Preprint)
- What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think(David M. Howcroft, Verena Rieser, 2021, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing)
- Comparison of Rating Scale and Pairwise Comparison Methods for Measuring Human Co-worker Subjective Impression of Robot during Physical Human-Robot Collaboration(Qiao Wang, Ziqi Wang, Marc G. Carmichael, Dikai Liu, Chin-Teng Lin, 2024, 2024 IEEE International Conference on Robotics and Automation (ICRA))
- Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation(Jasper Dekoninck, Maximilian Baader, Martin T. Vechev, 2024, International Conference on Learning Representations)
- Statistical Multicriteria Evaluation of LLM-Generated Text(E. Arias, Hannah Blocher, Julian Rodemann, M. Aßenmacher, Christoph Jansen, 2025, arXiv.org)
- A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation(Irune Zubiaga, A. Soroa, Rodrigo Agerri, 2024, Conference on Empirical Methods in Natural Language Processing)
- PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization(Meiling Tao, Chenghao Zhu, Dongyi Ding, Tiannan Wang, Y. Jiang, Wangchunshu Zhou, 2025, arXiv.org)
- Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena(Aidar Myrzakhan, S. Mahmoud Bsharat, Zhiqiang Shen, 2024, arXiv.org)
- A Human-Centric Assessment Framework for AI(Sascha Saralajew, Ammar Shaker, Zhao Xu, Kiril Gashteovski, Bhushan Kotnis, Wiem Ben Rim, Jürgen Quittek, Carolin Lawrence, 2022, ArXiv Preprint)
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences(Shreya Shankar, J.D. Zamfirescu-Pereira, Bjorn Hartmann, Aditya G. Parameswaran, Ian Arawjo, 2024, Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology)
- RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment(Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, T. Zhang, 2023, ArXiv)
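The warning by Howcroft and Rieser above, that treating ordinal Likert ratings as interval data changes the statistics, can be made concrete with a small simulation: compare a t-test (interval assumption) against a Mann-Whitney U test (ordinal assumption) on synthetic 1-5 ratings. The sketch illustrates the methodological distinction only and does not reproduce their analysis.

```python
import random
from scipy import stats

random.seed(0)

def sample_ratings(probs, n=60):
    """Draw n Likert ratings (1-5) from a categorical distribution."""
    return random.choices([1, 2, 3, 4, 5], weights=probs, k=n)

# Two hypothetical systems whose rating distributions differ mainly in the tails.
system_a = sample_ratings([0.05, 0.15, 0.40, 0.25, 0.15])
system_b = sample_ratings([0.10, 0.25, 0.40, 0.15, 0.10])

t_stat, p_interval = stats.ttest_ind(system_a, system_b)      # treats 1-5 as interval data
u_stat, p_ordinal = stats.mannwhitneyu(system_a, system_b,    # respects the ordinal scale
                                       alternative="two-sided")

print(f"t-test p = {p_interval:.3f}   Mann-Whitney U p = {p_ordinal:.3f}")
```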
Dynamic Evaluation for Interactive, Multimodal, and Agentic Tasks
For agents, retrieval-augmented generation (RAG), long-form text generation, and multimodal content (speech, 3D, video), these studies investigate how to build new evaluation paradigms that capture real-world interaction complexity, real-time constraints, and hallucination risk (a naive claim-level faithfulness sketch follows the reference list below).
- The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation(Zarreen Reza, 2025, ArXiv Preprint)
- AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems(YenTing Lee, Keerthi Koneru, Zahra Moslemi, Sheethal Kumar, Ramesh Radhakrishnan, 2026, ArXiv Preprint)
- SI-Agent: An Agentic Framework for Feedback-Driven Generation and Tuning of Human-Readable System Instructions for Large Language Models(Jeshwanth Challagundla, 2025, 2025 16th International Conference on Information, Intelligence, Systems & Applications (IISA))
- Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards(M. Tamber, F. Bao, Chenyu Xu, Ge Luo, Suleman Kazi, Minseok Bae, Miaoran Li, O. Mendelevitch, Renyi Qu, Jimmy Lin, 2025, Conference on Empirical Methods in Natural Language Processing)
- MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark(Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Pan Zhou, Yao Wan, Lichao Sun, 2024, International Conference on Machine Learning)
- Scalable 3D Captioning with Pretrained Models(Tiange Luo, C. Rockwell, Honglak Lee, Justin Johnson, 2023, Neural Information Processing Systems)
- BERTHA: Video Captioning Evaluation Via Transfer-Learned Human Assessment(Luis Lebron, Yvette Graham, Kevin McGuinness, K. Kouramas, N. O’Connor, 2022, International Conference on Language Resources and Evaluation)
- SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics(Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, H. Saruwatari, 2024, Interspeech)
- Cosmos-Surg-dVRK: World Foundation Model-based Automated Online Evaluation of Surgical Robot Policy Learning(Lukas Zbinden, Nigel Nelson, Juo-Tung Chen, Xinhao Chen, Ji Woong Kim, Mahdi Azizian, Axel Krieger, Sean Huver, 2025, IEEE Robotics and Automation Letters)
- DebateGPT: An Real Time Argument Analyzer using Transformer and NLP(Vruddhi Burad, M. Ahire, Siddhi Badgujar, Ajeet Gayawale, Pranjal Ahire, 2026, International Journal of Scientific Research in Engineering and Management)
- Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation(Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, Dakuo Wang, 2025, arXiv.org)
- AgentEval: Generative Agents as Reliable Proxies for Human Evaluation of AI-Generated Content(Thanh Vu, Richi Nayak, T. Balasubramaniam, 2025, arXiv.org)
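Several of the RAG-faithfulness efforts above reduce hallucination measurement to one question per claim: is it supported by the retrieved context? The sketch below uses a deliberately naive lexical-overlap test as the support check; production systems (e.g., the HHEM/FaithJudge line) replace that stub with a trained entailment or judge model.

```python
import re

def split_claims(response: str) -> list[str]:
    """Very rough claim segmentation: one claim per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]

def is_supported(claim: str, context: str, min_overlap: float = 0.6) -> bool:
    """Naive support test: fraction of claim content words present in the context.
    A placeholder for an NLI or LLM-judge call, not a recommended metric."""
    claim_words = {w.lower() for w in re.findall(r"[a-zA-Z]+", claim) if len(w) > 3}
    context_words = {w.lower() for w in re.findall(r"[a-zA-Z]+", context)}
    if not claim_words:
        return True
    return len(claim_words & context_words) / len(claim_words) >= min_overlap

def faithfulness(response: str, context: str) -> float:
    claims = split_claims(response)
    supported = sum(is_supported(c, context) for c in claims)
    return supported / len(claims) if claims else 1.0

context = "The report was published in 2021 and covers solar capacity in Spain and Portugal."
response = ("The report covers solar capacity in Spain and Portugal. "
            "It was published in 2021. It also predicts wind growth in Norway.")
print(f"faithfulness = {faithfulness(response, context):.2f}")  # last claim is unsupported
```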
Taken together, this report highlights multiple limitations in large language model evaluation: traditional human evaluation faces serious cognitive biases, high costs, and scalability bottlenecks, while the "LLM-as-a-Judge" alternative, though efficient, introduces new risks such as length bias and self-enhancement (narcissism) tendencies. The field's research focus is shifting from competition on single metrics toward building human-in-the-loop (HITL) collaboration frameworks, defining expert evaluation protocols for specific high-stakes domains, and developing dynamic, fine-grained evaluation methodologies for agentic and multimodal tasks.
A total of 193 related publications.
Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging "LLM-as-a-judge" paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in-group debates with multi-agents to generate multi-dimensional feedback. Our evaluation experiments in both the educational and medical domains demonstrate that MAJ-EVAL can generate evaluation results that better align with human experts' ratings compared with conventional automated evaluation metrics and existing LLM-as-a-judge methods.
Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to “validate the validators”—aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative nature of alignment. In particular, we identify a phenomenon we dub criteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appear dependent on the specific LLM outputs observed (rather than independent and definable a priori), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.
Large language models (LLMs) often generate natural language rationales -- free-form explanations that help improve performance on complex reasoning tasks and enhance interpretability for human users. However, evaluating these rationales remains challenging. While recent work has relied on binary preference judgments from humans or LLM judges, such evaluations are often opaque and coarse-grained, offering limited insight into what makes one rationale better than another. In this work, we rethink preference evaluation for LLM-generated rationales by asking: (1) What attributes define good rationales? (2) Can human preferences be explained by these attributes? (3) Can attribute-based evaluation overcome the limitations of binary comparisons? We identify a set of key rationale attributes from prior literature and assess them using automatic metrics, LLM judgments, and human annotations. We then analyze two standard human preference datasets MT Bench and Chatbot Arena using SHAP to identify which attributes best explain human preference outcomes. Finally, we re-evaluate model-generated rationales using attribute-specific ELO scores, revealing more nuanced model comparisons and insights. Our findings suggest that fine-grained attribute evaluations can better characterize rationale quality and guide future research toward more interpretable and reliable evaluation practices.
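The attribute-specific ELO scores mentioned in this abstract follow the familiar update rule used by Chatbot Arena-style leaderboards: each pairwise preference nudges the ratings of the two systems involved. A minimal sketch of that update, with the K-factor, starting rating, and example battles chosen arbitrarily:

```python
from collections import defaultdict

def update_elo(ratings, winner, loser, k=32.0):
    """One Elo update from a single pairwise preference (winner beat loser)."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)

# Hypothetical pairwise outcomes, e.g. per-attribute preferences over rationales.
battles = [("model_a", "model_b"), ("model_a", "model_c"),
           ("model_b", "model_c"), ("model_a", "model_b"),
           ("model_c", "model_b")]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
for winner, loser in battles:
    update_elo(ratings, winner, loser)

for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```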
Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the search clarification task, leveraging a high-quality, multi-dimensional dataset that includes five distinct fine-grained annotation subtasks. Although LLMs have shown impressive capabilities in general settings, our study reveals that even state-of-the-art models struggle to replicate human-level performance in subjective or fine-grained evaluation tasks. Through a systematic assessment, we demonstrate that LLM predictions are often inconsistent, poorly calibrated, and highly sensitive to prompt variations. To address these limitations, we propose a simple yet effective human-in-the-loop (HITL) workflow that uses confidence thresholds and inter-model disagreement to selectively involve human review. Our findings show that this lightweight intervention significantly improves annotation reliability while reducing human effort by up to 45%, offering a relatively scalable and cost-effective yet accurate path forward for deploying LLMs in real-world evaluation settings.
The use of large language models (LLMs) for relevance assessment in information retrieval has gained significant attention, with recent studies suggesting that LLM-based judgments provide comparable evaluations to human judgments. Notably, based on TREC 2024 data, Upadhyay et al. make a bold claim that LLM-based relevance assessments, such as those generated by the Umbrela system, can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion. First, we question whether the evidence provided by Upadhyay et al. genuinely supports their claim, particularly when the test collection is intended to serve as a benchmark for future research innovations. Second, we submit a system deliberately crafted to exploit automatic evaluation metrics, demonstrating that it can achieve artificially inflated scores without truly improving retrieval quality. Third, we simulate the consequences of circularity by analyzing Kendall's tau correlations under the hypothetical scenario in which all systems adopt Umbrela as a final-stage re-ranker, illustrating how reliance on LLM-based assessments can distort system rankings. Theoretical challenges - including the inherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and the potential degradation of future LLM performance - must be addressed before LLM-based relevance assessments can be considered a viable replacement for human judgments.
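The circularity analysis described above hinges on Kendall's tau, the standard statistic for comparing how two evaluation regimes rank the same systems. A small illustration with invented scores (not the TREC data used in the paper):

```python
from scipy.stats import kendalltau

# Hypothetical effectiveness scores for six retrieval systems, one value per system,
# under human assessments vs. LLM-based (Umbrela-style) assessments.
human_scores = [0.41, 0.38, 0.35, 0.33, 0.30, 0.22]
llm_scores   = [0.52, 0.55, 0.40, 0.44, 0.31, 0.29]

tau, p_value = kendalltau(human_scores, llm_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")

# A high tau means the two regimes order systems similarly, the usual argument for
# substitution; the paper's point is that this can hold even while absolute quality
# judgments, and hence future system development, are being distorted.
```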
Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding strategies, thus avoiding arbitrary weighting of the involved metrics. By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.
Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications. A benchmark’s usefulness is determined by its ability to clearly differentiate between models of varying capabilities (separability) and closely align with human preferences. Existing frameworks like Alpaca-Eval 2.0 LC and Arena-Hard v0.1 are limited by their focus on general-purpose queries and lack of diversity across domains such as law, medicine, and multilingual contexts. In this paper, we address these limitations by introducing a novel data pipeline that curates diverse, domain-specific evaluation sets tailored for LLM-as-a-Judge frameworks. Our approach leverages a combination of manual curation, semi-supervised learning to generate clusters, and stratified sampling to ensure balanced representation across a wide range of domains and languages. The resulting evaluation set, which includes 1573 samples across 14 categories, demonstrates high separability (84%) across ten top-ranked models, 84% agreement with Chatbot Arena, and a 0.915 Spearman correlation. The agreement values are 9% better than Arena Hard and 20% better than AlpacaEval 2.0 LC, while the Spearman coefficient is 0.7 more than the next best benchmark, showcasing a significant improvement in the usefulness of the benchmark. We further provide an open-source evaluation tool that enables fine-grained analysis of model performance across user-defined categories, offering valuable insights for practitioners. This work contributes to the ongoing effort to enhance the transparency, diversity, and effectiveness of LLM evaluation methodologies.
LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation
Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into 'Legal Data Points' (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.
This paper proposes a novel approach to evaluate Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator. We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception. To alleviate this, we introduce a model ranking pipeline based on pairwise comparisons of generated CNs from different models, organized in a tournament-style format. The proposed evaluation method achieves a high correlation with human preference, with a ρ score of 0.88. As an additional contribution, we leverage LLMs as zero-shot CN generators and provide a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in zero-shot are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.
Rating-based human evaluation has become an essential tool to accurately evaluate the impressive performance of large language models (LLMs). However, current rating systems suffer from several important limitations: first, they fail to account for biases that significantly influence evaluation results, second, they require large and expensive preference datasets to obtain accurate ratings, and third, they do not facilitate meaningful comparisons of model ratings across different tasks. To address these issues, we introduce Polyrating, an expressive and flexible rating system based on maximum a posteriori estimation that enables a more nuanced and thorough analysis of model performance at lower costs. Polyrating can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. Further, Polyrating can reduce the cost of human evaluations by up to 41% for new models and up to 77% for new tasks by leveraging existing benchmark scores. Lastly, Polyrating enables direct comparisons of ratings across different tasks, providing a comprehensive understanding of an LLM's strengths, weaknesses, and relative performance across different applications.
Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, for studying limitations, we present LimitGen, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding identifying limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.
Novelty is a crucial criterion in the peer‐review process for evaluating academic papers. Traditionally, it is judged by experts or measured by unique reference combinations. Both methods have limitations: experts have limited knowledge, and the effectiveness of the combination method is uncertain. Moreover, it is unclear if unique citations truly measure novelty. The large language model (LLM) possesses a wealth of knowledge, while human experts possess judgment abilities that the LLM does not possess. Therefore, our research integrates the knowledge and abilities of LLM and human experts to address the limitations of novelty assessment. One of the most common types of novelty in academic papers is the introduction of new methods. In this paper, we propose leveraging human knowledge and LLM to assist pre‐trained language models (PLMs, e.g., BERT, etc.) in predicting the method novelty of papers. Specifically, we extract sentences related to the novelty of the academic paper from peer‐review reports and use LLM to summarize the methodology section of the academic paper, which are then used to fine‐tune PLMs. In addition, we have designed a text‐guided fusion module with novel Sparse‐Attention to better integrate human and LLM knowledge. We compared the method we proposed with a large number of baselines. Extensive experiments demonstrate that our method achieves superior performance.
Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation, this work trains "critic" models that help humans to more accurately evaluate model-written code. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as "flawless", even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics can have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might have otherwise avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.
LLM-as-a-Judge has been widely adopted as an evaluation method and served as supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge are mainly relying on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that finetuned LLM-as-a-Judge is a feasible method to boost performance, and the panel-based judge as well as deep reasoning can enhance the judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
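The "global logical consistency" lens described in this abstract boils down to checking that a judge's pairwise preferences contain no cycles. A brute-force sketch of that transitivity check over a hypothetical preference table (this triple enumeration is the obvious naive version, not the paper's implementation):

```python
from itertools import combinations

def transitivity_violations(preferences: dict[tuple[str, str], str]) -> list[tuple[str, str, str]]:
    """Return all (x, y, z) triples where the judge prefers x>y and y>z but z>x.

    preferences[(a, b)] is the answer the judge preferred when shown a and b.
    """
    def prefers(x, y):
        return preferences.get((x, y)) or preferences.get((y, x))

    answers = {a for pair in preferences for a in pair}
    violations = []
    for a, b, c in combinations(sorted(answers), 3):
        # The two possible directed 3-cycles over {a, b, c}.
        for x, y, z in ((a, b, c), (a, c, b)):
            if prefers(x, y) == x and prefers(y, z) == y and prefers(z, x) == z:
                violations.append((x, y, z))
    return violations

# Hypothetical judge verdicts over three candidate answers to the same question.
prefs = {("ans1", "ans2"): "ans1",
         ("ans2", "ans3"): "ans2",
         ("ans1", "ans3"): "ans3"}   # completes a cycle: ans1 > ans2 > ans3 > ans1

print(transitivity_violations(prefs))  # [('ans1', 'ans2', 'ans3')]
```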
Retrieval-augmented generation (RAG) aims to reduce hallucinations by grounding responses in external context, yet large language models (LLMs) still frequently introduce unsupported information or contradictions even when provided with relevant context. This paper presents two complementary efforts at Vectara to measure and benchmark LLM faithfulness in RAG. First, we describe our original hallucination leaderboard, which has tracked hallucination rates for LLMs since 2023 using our HHEM hallucination detection model. Motivated by limitations observed in current hallucination detection methods, we introduce FaithJudge, an LLM-as-a-judge framework that leverages a pool of diverse human-annotated hallucination examples to substantially improve the automated hallucination evaluation of LLMs. We introduce an enhanced hallucination leaderboard centered on FaithJudge that benchmarks LLMs on RAG faithfulness in summarization, question-answering, and data-to-text generation tasks. FaithJudge enables a more reliable benchmarking of LLM hallucinations in RAG and supports the development of more trustworthy generative AI systems: https://github.com/vectara/FaithJudge.
Automated Code Review (ACR) is crucial for software quality, yet existing benchmarks often fail to reflect real-world complexities, hindering the evaluation of modern Large Language Models (LLMs). Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. To address these limitations, we introduce SWRBench, a new benchmark comprising 1000 manually verified Pull Requests (PRs) from GitHub, offering PR-centric review with full project context. SWRBench employs an objective LLM-based evaluation method that aligns strongly with human judgment (~90% agreement) by verifying if issues from a structured ground truth are covered in generated reviews. Our systematic evaluation of mainstream ACR tools and LLMs on SWRBench reveals that current systems underperform, and ACR tools are more adept at detecting functional errors. Subsequently, we propose and validate a simple multi-review aggregation strategy that significantly boosts ACR performance, increasing F1 scores by up to 43.67%. Our contributions include the SWRBench benchmark, its objective evaluation method, a comprehensive study of current ACR capabilities, and an effective enhancement approach, offering valuable insights for advancing ACR research.
AI-based peer review systems tend to produce shallow and overpraising suggestions compared to human feedback. Here, we evaluate how well a reasoning LLM trained with multi-objective reinforcement learning (REMOR) can overcome these limitations. We start by designing a multi-aspect reward function that aligns with human evaluation of reviews. The aspects are related to the review itself (e.g., criticisms, novelty) and the relationship between the review and the manuscript (i.e., relevance). First, we perform supervised fine-tuning of DeepSeek-R1-Distill-Qwen-7B using LoRA on PeerRT, a new dataset of high-quality top AI conference reviews enriched with reasoning traces. We then apply Group Relative Policy Optimization (GRPO) to train two models: REMOR-H (with the human-aligned reward) and REMOR-U (with a uniform reward). Interestingly, the human-aligned reward penalizes aspects typically associated with strong reviews, leading REMOR-U to produce qualitatively more substantive feedback. Our results show that REMOR-U and REMOR-H achieve more than twice the average rewards of human reviews, non-reasoning state-of-the-art agentic multi-modal AI review systems, and general commercial LLM baselines. We found that while the best AI and human reviews are comparable in quality, REMOR avoids the long tail of low-quality human reviews. We discuss how reasoning is key to achieving these improvements and release the Human-aligned Peer Review Reward (HPRR) function, the Peer Review Reasoning-enriched Traces (PeerRT) dataset, and the REMOR models, which we believe can help spur progress in the area.
The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum, from basic metrics to complex benchmarks like GLUE and MMLU. These vulnerabilities manifest through benchmark exploitation, dataset contamination, and evaluation bias, creating a false perception of progress in language understanding capabilities. Through extensive review of contemporary evaluation approaches, we identify significant limitations in static benchmark designs, human evaluation protocols, and LLM-as-judge frameworks, all of which compromise the reliability of current performance assessments. As LLM capabilities evolve and existing benchmarks become redundant, we lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks. This requires frameworks that are adapted dynamically, addressing current limitations and providing a more accurate reflection of LLM performance.
Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. This underscores the need for robust evaluation methodologies to accurately assess LLM-based applications. Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation. Recent research has explored leveraging LLMs to mimic human reasoning and decision-making processes for evaluation purposes known as LLM-as-a-judge framework. However, these existing frameworks have two significant limitations. First, they lack the flexibility to adapt to different text styles, including various answer and ground truth styles, thereby reducing their generalization performance. Second, the evaluation scores produced by these frameworks are often skewed and hard to interpret, showing a low correlation with human judgment. To address these challenges, we propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications. This system iteratively refines evaluation prompts and balances the trade-off between the adaptive requirements of downstream tasks and the alignment with human perception. Our experimental results show that the proposed multi-agent LLM Judge framework not only enhances evaluation accuracy compared to existing methods but also produces evaluation scores that better align with human perception.
Bargaining, a critical aspect of real-world interactions, presents challenges for large language models (LLMs) due to limitations in strategic depth and adaptation to complex human factors. Existing benchmarks often fail to capture this real-world complexity. To address this and enhance LLM capabilities in realistic bargaining, we introduce a comprehensive framework centered on utility-based feedback. Our contributions are threefold: (1) BargainArena, a novel benchmark dataset with six intricate scenarios (e.g., deceptive practices, monopolies) to facilitate diverse strategy modeling; (2) human-aligned, economically-grounded evaluation metrics inspired by utility theory, incorporating agent utility and negotiation power, which implicitly reflect and promote opponent-aware reasoning (OAR); and (3) a structured feedback mechanism enabling LLMs to iteratively refine their bargaining strategies. This mechanism can positively collaborate with in-context learning (ICL) prompts, including those explicitly designed to foster OAR. Experimental results show that LLMs often exhibit negotiation strategies misaligned with human preferences, and that our structured feedback mechanism significantly improves their performance, yielding deeper strategic and opponent-aware reasoning.
Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5, and Midjourney still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompts in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Besides, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.
Quantifying tumor burden is central to response assessment in oncology clinical trials and treatment monitoring. Despite widespread use, the clinical standard Response Evaluation Criteria in Solid Tumors (RECIST) has several limitations: it relies on unidimensional measurements of a limited set of lesions, ignores volumetric change, and exhibits high interobserver variability. These constraints limit its sensitivity in complex disease presentations, ultimately affecting treatment decisions and trial outcomes. Volumetric tumor assessment is more robust and sensitive, but the process of manual segmentation is impractical at scale. Advances in deep learning and foundation models provide an opportunity to transform response evaluation by automating volume estimation with minimal human input. We present AI-Augmented Response Assessment (AAuRA), a novel computational workflow that leverages existing RECIST-like measurements to seed MedSAM—a foundation model for semi-automated medical image segmentation. As a proof-of-concept, we reverse-engineer RECIST annotations by decomposing 3D tumor segmentations to extract the longest axial diameter and its perpendicular, to generate bounding boxes that simulate RECIST-based input. We evaluate AAuRA on two public computed tomography imaging datasets with tumor segmentations: LIDC-IDRI and NSCLC-Radiogenomics. Performance is assessed using Volumetric Dice Similarity Coefficient (DSC), 95th percentile Hausdorff Distance (95HD), Added Path Length (APL), and Pearson correlation between predicted and ground truth volume. To account for size-related bias in spatial metrics, APL and 95HD are normalized by tumor volume. In the NSCLC-Radiogenomics dataset, AAuRA achieved a mean DSC of 0.72, APL of 0.20, and 95HD of 0.01. In the LIDC-IDRI dataset, the mean DSC was 0.50, with an APL of 0.42 and 95HD of 0.11. Volume predictions from AAuRA demonstrated strong correlation with ground truth segmentations: the Pearson correlation coefficient was 0.86 (p < 0.001) and 0.90 (p < 0.001) for the LIDC-IDRI and NSCLC-Radiogenomics datasets, respectively. Segmentation performance was lower for smaller, ill-defined lesions—highlighting known limitations of current foundation models and motivating the need for domain-specific adaptation and fine-tuning. AAuRA serves as a clinically viable bridge between current standards and AI-augmented tools for therapy response evaluation. Using familiar annotations as input, AAuRA can avoid disruptive changes to clinical workflows while enhancing accuracy and scalability of tumor burden assessment. This work highlights how clinically meaningful priors—like RECIST—can be repurposed to guide foundation models, offering a blueprint for real-world deployment of AI in medicine. Future work will extend the pipeline to real RECIST annotations across diverse disease sites and explore integration into prospective imaging workflows. AAuRA’s modular design also enables adaptability to new foundation models and evolving segmentation standards.
Automated singing assessment is crucial for education, entertainment, and talent discovery. However, existing systems are hindered by two fundamental limitations: first, their reliance on reference tracks (e.g., the original song), which stifles creative expression, and second, their simplification of complex vocal performances into a single, often non-diagnostic score based on pitch and rhythm. This paradigm fails to capture the nuanced, multifaceted attributes that define expert-level singing. Echoing the recent shift in other AI domains from discriminative to descriptive evaluation, we advocate for a new paradigm in singing assessment. This paper aims to build a complete ecosystem for reference-free, multi-dimensional, and descriptive singing assessment. First, we construct Sing-MD, a large-scale, multi-dimensional singing dataset annotated by experts across four core dimensions: breath control, timbre quality, emotional expression, and vocal technique. Analysis of this dataset reveals a key finding: significant annotation inconsistencies among experts, which challenges the validity of traditional accuracy-based evaluation metrics. Second, standard Multimodal Large Language Models (MLLMs) are unable to analyze full-length songs on resource-constrained, consumer-grade hardware due to memory limitations. This challenge leads to a ''human label-audio input mismatch'' problem and results in poor performance. To address this issue, we designed VocalVerse, an efficient hybrid architecture. It leverages a lightweight acoustic encoder and specialized modules to process the entire song, thereby learning global performance features, modeling long-term dependencies, and ultimately overcoming this limitation. Third, to address the shortcomings of automated metrics, we establish a new evaluation benchmark, H-TPR (Human-in-the-loop Tiered Perceptual Ranking), which evaluates a model's ability to generate perceptually valid performance rankings, rather than predicting a noisy ''ground-truth'' score. Our comprehensive experiments show that on the H-TPR benchmark, our VocalVerse framework can effectively learn and distinguish singing quality across different dimensions, thereby creating perceptually valid quality rankings and significantly outperforming existing baselines. Furthermore, our framework for multi-dimensional scoring and descriptive feedback generation has been successfully commercialized and deployed at scale, demonstrating its significant real-world impact and practical value.
Risk of bias (RoB) assessment is an essential part of systematic reviews that requires reading and understanding each eligible trial and the RoB tools. RoB assessment is subject to human error and is time-consuming. Machine learning-based tools have been developed to automate RoB assessment using simple models trained on limited corpora. ChatGPT is a conversational agent based on a large language model (LLM) that was trained on an internet-scale corpus and has demonstrated human-like abilities in multiple areas including healthcare. LLMs might be able to support systematic reviewing tasks such as assessing RoB. We aim to assess interrater agreement in overall (rather than domain-level) RoB assessment between human reviewers and ChatGPT in randomized controlled trials of medical interventions. We will randomly select 100 individually- or cluster-randomized, parallel, two-arm trials of medical interventions from recent Cochrane systematic reviews that have been assessed using the RoB1 or RoB2 family of tools. We will exclude reviews and trials that were performed under emergency conditions (e.g., COVID-19), as well as public health and welfare interventions. We will use 25 of the trials and their human RoB assessments to engineer a ChatGPT prompt for assessing overall RoB, based on trial methods text. We will obtain ChatGPT assessments of RoB for the remaining 75 trials and compare them with the human assessments. We will then estimate interrater agreement using Cohen’s κ. The primary outcome for this study is overall human-ChatGPT interrater agreement. We will report observed agreement with an exact 95% confidence interval, expected agreement under random assessment, Cohen’s κ, and a p-value testing the null hypothesis of no difference in agreement. Several other analyses are also planned. This study is likely to provide the first evidence on interrater agreement between human RoB assessments and those provided by LLMs and will inform subsequent research in this area.
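The planned agreement analysis (observed agreement with an exact confidence interval, chance-expected agreement, and Cohen's κ) can be sketched as follows; the label set and the toy ratings are assumptions, not study data.

```python
# Sketch of the agreement analysis: observed agreement with an exact
# (Clopper-Pearson) 95% CI, chance-expected agreement, and Cohen's kappa.
# Labels and ratings below are invented for illustration.
from collections import Counter
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.proportion import proportion_confint

human = ["low", "high", "some concerns", "high", "low", "low", "high"]
chatgpt = ["low", "high", "high", "high", "low", "some concerns", "high"]

n = len(human)
matches = sum(h == c for h, c in zip(human, chatgpt))
observed = matches / n
ci_low, ci_high = proportion_confint(matches, n, alpha=0.05, method="beta")

# Chance-expected agreement from each rater's marginal label distribution.
p_h, p_c = Counter(human), Counter(chatgpt)
expected = sum((p_h[k] / n) * (p_c[k] / n) for k in set(human) | set(chatgpt))

kappa = cohen_kappa_score(human, chatgpt)
print(f"observed={observed:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), "
      f"expected={expected:.2f}, kappa={kappa:.2f}")
```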
Risk of bias (RoB) assessment is a highly skilled task that is time‐consuming and subject to human error. RoB automation tools have previously used machine learning models built using relatively small task‐specific training sets. Large language models (LLMs; e.g., ChatGPT) are complex models built using non‐task‐specific Internet‐scale training sets. They demonstrate human‐like abilities and might be able to support tasks like RoB assessment.
Speech-based AI for Alzheimer’s Disease (AD) risks automating and amplifying the very human biases it is trained on. To build fairer systems, the underlying mechanisms of this bias must be understood. This study dissects this problem through a cross-lingual perception experiment. Using generalized linear mixed-effects models with interaction terms, we uncover a context-dependent "acoustic double standard." In their native language, listeners judge male and female speakers based on different and sometimes inconsistent acoustic criteria. For example, the perceptual impact of vocal monotony is highly dependent on the speaker’s perceived gender. This complex bias mechanism transforms in a non-native context, revealing its deep entanglement with cognitive and social factors. Our findings demonstrate that the human "ground truth" is fundamentally biased, and we present a methodology for auditing these biases at the signal level, offering a path toward fairer diagnostic AI.
Risk of bias (RoB) assessment is essential in systematic reviews and clinical guideline development. Current manual processes are complex, inefficient, and inconsistent. Large language models (LLMs) have potential to assist RoB assessment, but their standalone performance is limited. Efficient integration of LLMs with human judgment remains a challenge. A high-quality dataset of medical literature was developed, encompassing seven bias domains as defined by the Cochrane RoB 1.0 tool. Using structured prompt engineering, two LLMs, DeepSeek V3 and Qwen-plus, were compared on three-class bias risk classification tasks. Three Human-AI collaboration modes were developed: (M1) Evidence Extraction Mode—human judgment based solely on LLM-extracted evidence; (M2) Reasoning Support Mode—combining LLM extracted evidence with reasoning explanations; (M3) Disagreement Trigger Mode—human intervention triggered by model disagreement. Accuracy and human intervention rates were used to evaluate performance and efficiency. We randomly sampled 300 instances per domain from the dataset and evaluated both LLMs using the same structured prompting approach. DeepSeek V3 outperformed Qwen-plus in accuracy in four of seven bias domains, demonstrating superior overall judgment. Model accuracy ranged from 0.403 to 0.777 across domains. In Human-AI collaboration, 60 samples per domain were evaluated with human involvement. M1 showed relatively limited performance in both accuracy and intervention rate; M2 achieved the highest accuracy in most tasks; M3 markedly reduced intervention rates. Accuracy under M2 and M3 ranged from 0.633 to 0.900. Incorporating LLM reasoning improved consistency in human judgments. Disagreement Trigger Mode exhibited high cost-effectiveness in structured and moderate-reasoning tasks, enhancing assessment efficiency, while Reasoning Support Mode was more stable and practical for open-ended and highly subjective tasks. LLMs (like DeepSeek V3 and Qwen-plus) cannot yet replace human RoB assessment but well-designed Human-AI collaboration can improve accuracy and reduce manual workload.
Systematic reviews are essential for evidence-based healthcare, but conducting them is time- and resource-consuming. To date, efforts have been made to accelerate and (semi-)automate various steps of systematic reviews through the use of artificial intelligence, and the emergence of large language models (LLMs) promises further opportunities. One crucial but complex task within systematic review conduct is assessing the risk of bias of included studies. Therefore, the aim of this study was to test the LLM Claude 2 for risk of bias assessment of 100 randomized controlled trials using the revised Cochrane risk of bias tool ("RoB 2"; involving judgements for five specific domains and an overall judgement). We assessed the agreement of risk of bias judgements by Claude with human judgements published in Cochrane Reviews. The observed agreement between Claude and Cochrane authors ranged from 41% for the overall judgement to 71% for domain 4 ("outcome measurement"). Cohen's Kappa was lowest for domain 5 ("selective reporting"; 0.10 (95% confidence interval (CI): -0.10-0.31)) and highest for domain 3 ("missing data"; 0.31 (95% CI: 0.10-0.52)), indicating slight to fair agreement. Fair agreement was found for the overall judgement (Cohen's Kappa: 0.22 (95% CI: 0.06-0.38)). Sensitivity analyses using alternative prompting techniques or the more recent version Claude 3 did not result in substantial changes. Currently, Claude's RoB 2 judgements cannot replace human risk of bias assessment. However, the potential of LLMs to support risk of bias assessment should be further explored.
The rise of surgical robots and vision-language-action models has accelerated the development of autonomous surgical policies and efficient assessment strategies. However, evaluating these policies directly on physical robotic platforms such as the da Vinci Research Kit (dVRK) remains hindered by high costs, time demands, reproducibility challenges, and variability in execution. World foundation models (WFM) for physical AI offer a transformative approach to simulate complex real-world surgical tasks, such as soft tissue deformation, with high fidelity. This work introduces Cosmos-Surg-dVRK, a surgical finetune of the Cosmos WFM, which, together with a trained video classifier, enables fully automated online evaluation and benchmarking of surgical policies. We evaluate Cosmos-Surg-dVRK using two distinct surgical datasets. On tabletop suture pad tasks, the automated pipeline achieves strong correlation between online rollouts in Cosmos-Surg-dVRK and policy outcomes on the real dVRK Si platform, as well as good agreement between human labelers and the V-JEPA 2-derived video classifier. Additionally, preliminary experiments with ex-vivo porcine cholecystectomy tasks in Cosmos-Surg-dVRK demonstrate promising alignment with real-world evaluations, highlighting the platform's potential for more complex surgical procedures.
Large language models (LLMs) are revolutionizing every aspect of society. They are increasingly used in problem-solving tasks to substitute human assessment and reasoning. LLMs are trained on what humans write and are thus exposed to human bias. We evaluate whether LLMs inherit one of the most widespread human biases: overconfidence. We algorithmically construct reasoning problems with known ground truths. We prompt LLMs to answer these problems and assess the confidence in their answers, closely following similar protocols in human experiments. We find that all five LLMs we study are overconfident: they overestimate the probability that their answer is correct by between 20% and 60%. Humans have accuracy similar to the more advanced LLMs, but far lower overconfidence. Although humans and LLMs are similarly biased on questions they are certain they answered correctly, a key difference emerges between them: LLM bias increases sharply relative to humans if they become less sure that their answers are correct. We also show that LLM input has ambiguous effects on human decision making: LLM input leads to an increase in accuracy, but it more than doubles the extent of overconfidence in the answers.
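A toy illustration of the overconfidence gap described above (mean stated probability of being correct minus realized accuracy); the confidence values and correctness flags are invented.

```python
# Toy overconfidence measure: mean self-reported P(correct) minus accuracy.
# Confidence values and correctness flags are made up for illustration.
import numpy as np

confidence = np.array([0.95, 0.90, 0.99, 0.80, 0.85, 0.97])  # self-reported P(correct)
correct = np.array([1, 0, 1, 0, 1, 0])                        # graded against ground truth

accuracy = correct.mean()
overconfidence = confidence.mean() - accuracy
print(f"accuracy={accuracy:.2f}, mean confidence={confidence.mean():.2f}, "
      f"overconfidence={overconfidence:+.2f}")
```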
Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects high-quality samples, discards those that exhibit undesired behavior, and subsequently enhances the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve model performance in terms of both the learned reward and other automated metrics, in both large language models and diffusion models.
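The core filtering step described above amounts to sampling candidate responses, ranking them with a reward model, and fine-tuning on the top-ranked subset. The sketch below shows that loop with placeholder generate/reward/finetune callables; it is an interpretation of the abstract, not the authors' implementation.

```python
# Sketch of one reward-ranked fine-tuning round. generate, reward, and
# finetune are placeholder callables, not a real API.
from typing import Callable

def raft_round(prompts: list[str],
               generate: Callable[[str, int], list[str]],
               reward: Callable[[str, str], float],
               finetune: Callable[[list[tuple[str, str]]], None],
               k_samples: int = 8,
               keep_frac: float = 0.125) -> None:
    selected: list[tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, k_samples)                   # sample k responses
        ranked = sorted(candidates, key=lambda c: reward(prompt, c), reverse=True)
        n_keep = max(1, int(len(ranked) * keep_frac))
        selected.extend((prompt, c) for c in ranked[:n_keep])      # keep highest-reward responses
    finetune(selected)                                             # supervised fine-tuning on filtered data
```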
Multimodal Large Language Models (MLLMs) are increasingly applied in Personalized Image Aesthetic Assessment (PIAA) as a scalable alternative to expert evaluations. However, their predictions may reflect subtle biases influenced by demographic factors such as gender, age, and education. In this work, we propose AesBiasBench, a benchmark designed to evaluate MLLMs along two complementary dimensions: (1) stereotype bias, quantified by measuring variations in aesthetic evaluations across demographic groups; and (2) alignment between model outputs and genuine human aesthetic preferences. Our benchmark covers three subtasks (Aesthetic Perception, Assessment, Empathy) and introduces structured metrics (IFD, NRD, AAS) to assess both bias and alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o, Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL). Results indicate that smaller models exhibit stronger stereotype biases, whereas larger models align more closely with human preferences. Incorporating identity information often exacerbates bias, particularly in emotional judgments. These findings underscore the importance of identity-aware evaluation frameworks in subjective vision-language tasks.
Background/Objectives: Diagnostic accuracy studies are essential for the evaluation of the performance of medical tests. The risk of bias (RoB) for these studies is commonly assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool. This study aimed to assess the capabilities and reasoning accuracy of large language models (LLMs) in evaluating the RoB in diagnostic accuracy studies, using QUADAS 2, compared to human experts. Methods: Four LLMs were used for the AI assessment: ChatGPT 4o model, X.AI Grok 3 model, Gemini 2.0 flash model, and DeepSeek V3 model. Ten recent open-access diagnostic accuracy studies were selected. Each article was independently assessed by human experts and by LLMs using QUADAS 2. Results: Across the 110 signaling-question assessments (11 questions for each of the 10 articles) made by the four AI models, the mean percentage of correct assessments was 72.95%. The most accurate model was Grok 3, followed by ChatGPT 4o, DeepSeek V3, and Gemini 2.0 Flash, with accuracies ranging from 74.45% to 67.27%. When analyzed by domain, the most accurate responses were for “flow and timing”, followed by “index test”, and then similarly for “patient selection” and “reference standard”. An extensive list of reasoning errors was documented. Conclusions: This study demonstrates that LLMs can achieve a moderate level of accuracy in evaluating the RoB in diagnostic accuracy studies. However, they are not yet a substitute for expert clinical and methodological judgment. LLMs may serve as complementary tools in systematic reviews, with compulsory human supervision.
The SLAM paper demonstrated that on-device Small Language Models (SLMs) are a viable and cost-effective alternative to API-based Large Language Models (LLMs), such as OpenAI's GPT-4, offering comparable performance and stability. However, SLAM also identified discrepancies between human preferences and traditional auto-evaluators. This follow-up paper explores methods to align LLM evaluator preferences with human evaluations by addressing biases, particularly toward higher token counts. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. The recalibration significantly improves the alignment of the LLM evaluator with human evaluations across multiple use cases. For instance, the Spearman rank correlation score in the Recommendation use case improved from -27.27 to 44.55. These results highlight the importance of accounting for biases in automated evaluations to ensure fair and accurate model assessments. The recalibration process enhances the reliability of automated evaluators, leading to better AI models that align with human values and expectations. This study provides a robust methodology for future research into bias correction and emphasizes the feasibility and benefits of developing human-aligned AI evaluation systems.
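One simple way to realize the kind of length-bias recalibration described above is to regress the automatic score on token count, subtract the length component, and re-check the rank correlation with human ratings. The sketch below does this on synthetic data; the linear correction and all numbers are assumptions, not the paper's actual procedure.

```python
# Toy length-bias recalibration: remove the token-count component from an
# automatic score and compare Spearman correlation with human ratings
# before and after. All data are synthetic.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
tokens = rng.integers(20, 400, size=50)                       # response lengths
human = rng.normal(3.0, 1.0, size=50)                         # human ratings
auto = 0.6 * human + 0.004 * tokens + rng.normal(0, 0.3, 50)  # length-biased evaluator score

slope, intercept = np.polyfit(tokens, auto, 1)
auto_recal = auto - slope * tokens                            # subtract token-count component

rho_before, _ = spearmanr(auto, human)
rho_after, _ = spearmanr(auto_recal, human)
print(f"Spearman before={rho_before:.2f}, after={rho_after:.2f}")
```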
Evaluating video captioning systems is a challenging task as there are multiple factors to consider; for instance: the fluency of the caption, multiple actions happening in a single scene, and the human bias of what is considered important. Most metrics try to measure how similar the system-generated captions are to a single or a set of human-annotated captions. This paper presents a new method based on a deep learning model to evaluate these systems. The model is based on BERT, a language model that has been shown to work well in multiple NLP tasks. The aim is for the model to learn to perform an evaluation similar to that of a human. To do so, we use a dataset that contains human evaluations of system-generated captions. The dataset consists of human judgments of the captions produced by the systems participating in various years of the TRECVid video-to-text task. BERTHA obtains favourable results, outperforming the commonly used metrics in some setups.
Currently, nearly all evaluations of foundation models focus on objective metrics, emphasizing quiz performance to define model capabilities. While this model-centric approach enables rapid performance assessment, it fails to reflect authentic human experiences. To address this gap, we propose a Human-Centric subjective Evaluation (HCE) framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. Through experiments involving Deepseek R1, OpenAI o3 mini, Grok 3, and Gemini 2.5, we conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks, yielding a comprehensive subjective dataset. This dataset captures diverse user feedback across multiple disciplines, revealing distinct model strengths and adaptability. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind. By offering a novel framework and a rich dataset, this study not only enhances subjective evaluation methodologies but also lays the foundation for standardized, automated assessments, advancing LLM development for research and practical scenarios. Our dataset link is https://github.com/yijinguo/Human-Centric-Evaluation.
No abstract available
Low-resource languages face significant barriers in AI development due to limited linguistic resources and scarce data-labeling expertise, which makes annotated data rare and costly. The scarcity of data and the absence of preexisting tools exacerbate these challenges, especially since these languages may not be adequately represented in various NLP datasets. To address this gap, we propose leveraging the potential of LLMs in the active learning loop for data annotation. Initially, we conduct evaluations to assess inter-annotator agreement and consistency, facilitating the selection of a suitable LLM annotator. The chosen annotator is then integrated into a training loop for a classifier using an active learning paradigm, minimizing the amount of queried data required. Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements, as indicated by estimated potential cost savings of at least 42.45 times compared to human annotation. Our proposed solution shows promising potential to substantially reduce both the monetary and computational costs associated with automation in low-resource settings. By bridging the gap between low-resource languages and AI, this approach fosters broader inclusion and shows the potential to enable automation across diverse linguistic landscapes.
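The LLM-in-the-loop active learning idea above can be sketched as a least-confidence query loop in which the selected items are sent to an LLM annotator instead of a human; llm_annotate, the classifier choice, and the batch sizes are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of an LLM-in-the-loop active learning cycle: train a classifier,
# query the most uncertain unlabeled items, label them with an LLM annotator.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_lab, y_lab, X_pool, llm_annotate, rounds=5, batch=16):
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X_lab, y_lab)
        probs = clf.predict_proba(X_pool)
        uncertainty = 1.0 - probs.max(axis=1)        # least-confident sampling
        idx = np.argsort(-uncertainty)[:batch]       # most uncertain items first
        new_y = np.array([llm_annotate(x) for x in X_pool[idx]])
        X_lab = np.vstack([X_lab, X_pool[idx]])
        y_lab = np.concatenate([y_lab, new_y])
        X_pool = np.delete(X_pool, idx, axis=0)
    return clf
```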
Annotation bias in NLP datasets remains a major challenge for developing multilingual Large Language Models (LLMs), particularly in culturally diverse settings. Bias from task framing, annotator subjectivity, and cultural mismatches can distort model outputs and exacerbate social harms. We propose a comprehensive framework for understanding annotation bias, distinguishing among instruction bias, annotator bias, and contextual and cultural bias. We review detection methods (including inter-annotator agreement, model disagreement, and metadata analysis) and highlight emerging techniques such as multilingual model divergence and cultural inference. We further outline proactive and reactive mitigation strategies, including diverse annotator recruitment, iterative guideline refinement, and post-hoc model adjustments. Our contributions include: (1) a typology of annotation bias; (2) a synthesis of detection metrics; (3) an ensemble-based bias mitigation approach adapted for multilingual settings, and (4) an ethical analysis of annotation processes. Together, these insights aim to inform more equitable and culturally grounded annotation pipelines for LLMs.
Human evaluation is traditionally used as the "gold standard" for checking the quality of texts generated by large language models (LLMs). However, it is itself subject to subjectivity, variability and systematic errors of perception, which calls into question the reliability and reproducibility of model verification results. The aim is to systematize and analyze modern approaches to assessing the quality of the expert annotators involved in verifying the outputs of large language models, in order to increase the objectivity and reliability of human evaluation. The main sources of unreliability of expert judgments are analyzed. Methods for improving the reliability of human evaluation are presented and considered in detail: multiple annotation and calculation of agreement coefficients (Cohen's κ, Fleiss' κ, Krippendorff's α, ICC), the use of control ("gold") tasks, blind testing, meta-evaluation and statistical monitoring of evaluators' work. Special attention is paid to probabilistic models of annotator quality, such as the Dawid-Skene model. The application of systematic calibration and verification of experts is a necessary condition for ensuring the objectivity and reproducibility of human-in-the-loop experiments in natural language processing research and evaluation of large language models. The considered methods make it possible to formalize the process of human evaluation, minimize subjective distortions and increase the reliability of the data obtained, which is critical for the correct comparison and development of LLMs.
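Several of the agreement coefficients listed above are available off the shelf; a minimal sketch using statsmodels and scikit-learn, with a made-up 5-item, 3-annotator ratings matrix, is shown below.

```python
# Fleiss' kappa over a ratings matrix and Cohen's kappa for one annotator pair.
# The 5-item, 3-annotator toy data are invented for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from sklearn.metrics import cohen_kappa_score

# rows = items, columns = annotators, values = category ids (0, 1, 2)
ratings = np.array([
    [0, 0, 1],
    [2, 2, 2],
    [1, 0, 1],
    [2, 1, 2],
    [0, 0, 0],
])

table, _ = aggregate_raters(ratings)   # items x categories count table
print("Fleiss' kappa:", round(fleiss_kappa(table), 3))
print("Cohen's kappa (annotators 1 vs 2):",
      round(cohen_kappa_score(ratings[:, 0], ratings[:, 1]), 3))
```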
Recent research on Responsible AI, particularly in addressing algorithmic biases, has gained significant attention. Natural Language Processing (NLP) algorithms, which rely on human-generated and human-labeled data, often reflect these challenges. In this paper, we analyze inter-annotator agreement in the task of labeling hate speech data and examine how annotators’ backgrounds influence their labeling decisions. Specifically, we investigate differences in hate speech annotations that arise when annotators identify with the targeted groups. Our findings reveal substantial differences in agreement between a general pool of annotators and those who personally relate to the targets of the hate speech they label. Additionally, we evaluate the OpenAI GPT-4o model on the same dataset. Our results highlight the need to consider annotators’ backgrounds when assessing the performance of Large Language Models (LLMs) in hate speech detection.
Financial arguments play a critical role in shaping investment decisions and public trust in financial institutions. Nevertheless, assessing their quality remains poorly studied in the literature. In this paper, we examine the capabilities of three state-of-the-art LLMs (GPT-4o, Llama 3.1, and Gemma 2) in annotating argument quality within financial communications, using the FinArgQuality dataset. Our contributions are twofold. First, we evaluate the consistency of LLM-generated annotations across multiple runs and benchmark them against human annotations. Second, we introduce an adversarial attack designed to inject gender bias, allowing us to analyse the models' responses and assess their fairness and robustness. Both experiments are conducted across three temperature settings to assess their influence on annotation stability and alignment with human labels. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than human counterparts, though the models still exhibit varying degrees of gender bias. We provide a multifaceted analysis of these outcomes and offer practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methodologies.
Exploring the Potential of the Large Language Models (LLMs) in Identifying Misleading News Headlines
In the digital age, the prevalence of misleading news headlines poses a significant challenge to information integrity, necessitating robust detection mechanisms. This study explores the efficacy of Large Language Models (LLMs) in identifying misleading versus non-misleading news headlines. Utilizing a dataset of 60 articles, sourced from both reputable and questionable outlets across the health, science & tech, and business domains, we employ three LLMs (ChatGPT-3.5, ChatGPT-4, and Gemini) for classification. Our analysis reveals significant variance in model performance, with ChatGPT-4 demonstrating superior accuracy, especially in cases with unanimous annotator agreement on misleading headlines. The study emphasizes the importance of human-centered evaluation in developing LLMs that can navigate the complexities of misinformation detection, aligning technical proficiency with nuanced human judgment. Our findings contribute to the discourse on AI ethics, emphasizing the need for models that are not only technically advanced but also ethically aligned and sensitive to the subtleties of human interpretation.
Mental health in children and adolescents has been steadily deteriorating over the past few years. The recent advent of Large Language Models (LLMs) offers much hope for cost- and time-efficient scaling of monitoring and intervention, yet despite specifically prevalent issues such as school bullying and eating disorders, previous studies have not investigated performance in this domain or in open information extraction, where the set of answers is not predetermined. We create a new dataset of Reddit posts from adolescents aged 12-19 annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY, and TREATMENT, and compare expert labels to annotations from two top-performing LLMs (GPT3.5 and GPT4). In addition, we create two synthetic datasets to assess whether LLMs perform better when annotating data as they generate it. We find GPT4 to be on par with human inter-annotator agreement, and performance on synthetic data to be substantially higher; however, the model still occasionally errs on issues of negation and factuality, and the higher performance on synthetic data is driven by the greater complexity of real data rather than by an inherent advantage.
Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
We introduce GODEL (Grounded Open Dialogue Language Model), a large pre-trained language model for dialog. In contrast with earlier models such as DialoGPT, GODEL leverages a new phase of grounded pre-training designed to better support adapting GODEL to a wide range of downstream dialog tasks that require information external to the current conversation (e.g., a database or document) to produce good responses. Experiments against an array of benchmarks that encompass task-oriented dialog, conversational QA, and grounded open-domain dialog show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups, in terms of both human and automatic evaluation. A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses (extrinsic evaluation) in addition to their communicative features (intrinsic evaluation). We show that extrinsic evaluation offers improved inter-annotator agreement and correlation with automated metrics. Code and data processing scripts are publicly available.
The increasing amount of valuable, unstructured textual information poses a major challenge for extracting value from those texts. We need to use NLP (Natural Language Processing) techniques, most of which rely on manually annotating a large corpus of text for their development and evaluation. Creating a large annotated corpus is laborious and requires suitable computational support. There are many annotation tools available, but their main weaknesses are the absence of data management features for quality control and the need for a commercial license. As the quality of the data used to train an NLP model directly affects the quality of the results, the quality control of the annotations is essential. In this paper, we introduce ERAS, a novel web-based text annotation tool developed to facilitate and manage the process of text annotation. ERAS includes not only the key features of current mainstream annotation systems but also other features necessary to improve the curation process, such as inter-annotator agreement, self-agreement and annotation log visualization, for annotation quality control. ERAS also implements a series of features to improve the customization of the user’s annotation workflow, such as: random document selection, re-annotation stages, and warm-up annotations. We conducted two empirical studies to evaluate the tool’s support for text annotation, and the results suggest that the tool not only meets the basic needs of the annotation task but also has some important advantages over the other tools evaluated in the studies. ERAS is freely available at https://github.com/grosmanjs/eras.
With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs’ ability to reason about facts and detect inconsistencies when they occur.
Propaganda detection on social media remains challenging due to task complexity and limited high-quality labeled data. This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability. We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques into three broader categories, conduct a human annotation study on the HQP dataset that reveals low inter-annotator agreement for fine-grained labels, and implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label. A secondary human verification study shows significant improvements in both agreement and time-efficiency. Building on this, we fine-tune smaller language models (SLMs) to perform structured annotation. Instead of fine-tuning on human annotations, we train on high-quality LLM-generated data, allowing a large model to produce these annotations and a smaller model to learn to generate them via knowledge distillation. Our work contributes towards the development of scalable and robust propaganda detection systems, supporting the idea of transparent and accountable media ecosystems in line with SDG 16. The code is publicly available at our GitHub repository.
Current approaches to music emotion annotation remain heavily reliant on manual labelling, a process that imposes significant resource and labour burdens, severely limiting the scale of available annotated data. This study examines the feasibility and reliability of employing a large language model (GPT-4o) for music emotion annotation. In this study, we annotated GiantMIDI-Piano, a classical MIDI piano music dataset, in a four-quadrant valence-arousal framework using GPT-4o, and compared against annotations provided by three human experts. We conducted extensive evaluations to assess the performance and reliability of GPT-generated music emotion annotations, including standard accuracy, weighted accuracy that accounts for inter-expert agreement, inter-annotator agreement metrics, and distributional similarity of the generated labels. While GPT's annotation performance fell short of human experts in overall accuracy and exhibited less nuance in categorizing specific emotional states, inter-rater reliability metrics indicate that GPT's variability remains within the range of natural disagreement among experts. These findings underscore both the limitations and potential of GPT-based annotation: despite its current shortcomings relative to human performance, its cost-effectiveness and efficiency render it a promising scalable alternative for music emotion annotation.
Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation (https://huggingface.co/datasets/facebook/menlo).
Objective: To evaluate large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems. Methods: We utilized a dataset of 400 annotated radiology reports containing 1,623 verified lesion findings. We compared three supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against four generative LLM configurations (Llama 3.1-8B, GPT-4o, GPT-OSS-20b). We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores. Results: The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p<0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions. Conclusion: Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.
Immune-related adverse events (irAEs) affect up to 40% of patients receiving immune checkpoint inhibitors, yet their identification depends on laborious and inconsistent manual chart review. Here we developed and evaluated an agentic large language model system to extract the presence, temporality, severity grade, attribution, and certainty of six irAE types from clinical notes. Retrospectively (263 notes), the system achieved macro-averaged F1 of 0.92 for detection and 0.66 for multi-class severity grading; self-consistency improved F1 by 0.14. The best-performing configuration cost approximately $0.02 per note. In prospective silent deployment over three months (884 notes), detection F1 was 0.72-0.79. In a randomized crossover study of clinical trial staff (17 participants, 316 observations), agentic assistance reduced annotation time by 40% (P < 0.001), increased complete-match accuracy (OR 1.45; 95% CI 1.01-2.09; P = 0.045), and improved inter-annotator agreement (Krippendorff's alpha from 0.22-0.51 to 0.82-0.85). These results demonstrate that agentic AI coupled with human verification could enhance efficiency, performance, and consistency for irAE assessment.
With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task and exposes issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.
Recently, several scholars have contributed to the growth of a new theoretical framework in NLP called perspectivism. This approach aims to leverage data annotated by different individuals to model diverse perspectives that affect their opinions on subjective phenomena such as irony. In this context, we propose MultiPICo, a multilingual perspectivist corpus of ironic short conversations in different languages and linguistic varieties extracted from Twitter and Reddit. The corpus includes sociodemographic information about its annotators. Our analysis of the annotated corpus shows how different demographic cohorts may significantly disagree on their annotation of irony and how certain cultural factors influence the perception of the phenomenon and the agreement on the annotation. Moreover, we show how disaggregated annotations and rich annotator metadata can be exploited to benchmark the ability of large language models to recognize irony, their positionality with respect to sociodemographic groups, and the efficacy of perspective-taking prompting for irony detection in multiple languages.
Qualitative thematic analysis (TA) is valuable but challenging to scale because of the time, cost, and complexity involved. Existing tools offer limited automation and often lack flexibility, trustworthiness, and usability with unstructured data. To address this gap, we developed DeTAILS, an interactive toolkit that integrates large language models into reflexive TA workflows. Built on a modern architecture, DeTAILS supports qualitative data analysis through iterative, LLM-in-the-loop interactions. Next, we plan to conduct a formal evaluation with domain experts to assess its usability, trust, and analytic value. By embedding interactive LLM-in-the-loop processes, DeTAILS strives to extend TA to far larger datasets while preserving analytic depth and researcher reflexivity, going beyond the throughput limits of manual coding in a given time.
We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. This approach utilizes pretrained models from image captioning, image-text alignment, and LLMs to consolidate captions from multiple views of a 3D asset, completely side-stepping the time-consuming and costly process of manual annotation. We apply Cap3D to the recently introduced large-scale 3D dataset, Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human-authored descriptions in terms of quality, cost, and speed. Through effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k collected annotations from the ABO dataset. Finally, we finetune Text-to-3D models on Cap3D and human captions, show that Cap3D outperforms, and benchmark the SOTA including Point-E, Shap-E, and DreamFusion.
Neural Machine Translation is one of the key research directions in Natural Language Processing. However, limited by the scale and quality of parallel corpora, the translation quality of low-resource Neural Machine Translation has always been unsatisfactory. When Reinforcement Learning from Human Feedback (RLHF) is applied to low-resource machine translation, commonly encountered issues are the substandard quality of preference data and the high cost of collecting manual feedback. Therefore, a more cost-effective method for obtaining feedback data is proposed. First, the quality of preference data is optimized through prompt engineering of a Large Language Model (LLM); human feedback is then combined to complete the evaluation. In this way, the reward model can acquire more semantic information and human preferences during the training phase, thereby enhancing feedback efficiency and result quality. Experimental results demonstrate that, compared with the traditional RLHF method, our method is effective on multiple datasets and exhibits a notable improvement of 1.07 BLEU. Meanwhile, it is also more favorably received in the assessments conducted by human evaluators and GPT-4o.
Decentralized large language model (LLM) inference promises transparent and censorship-resistant access to advanced AI, yet existing verification approaches struggle to scale to modern models. Proof of Quality (PoQ) replaces cryptographic verification of computation with consensus over output quality, but the original formulation ignores heterogeneous computational costs across inference and evaluator nodes. This paper introduces a cost-aware PoQ framework that integrates explicit efficiency measurements into the reward mechanism for both types of nodes. The design combines ground-truth token-level F1, lightweight learned evaluators, and GPT-based judgments within a unified evaluation pipeline, and adopts a linear reward function that balances normalized quality and cost. Experiments on extractive question answering and abstractive summarization use five instruction-tuned LLMs ranging from TinyLlama-1.1B to Llama-3.2-3B and three evaluation models spanning cross-encoder and bi-encoder architectures. Results show that a semantic textual similarity bi-encoder achieves much higher correlation with both ground truth and GPT scores than cross-encoders, indicating that evaluator architecture is a critical design choice for PoQ. Quality-cost analysis further reveals that the largest models in the pool are also the most efficient in terms of quality per unit latency. Monte Carlo simulations over 5,000 PoQ rounds demonstrate that the cost-aware reward scheme consistently assigns higher average rewards to high-quality, low-cost inference models and to efficient evaluators, while penalizing slow, low-quality nodes. These findings suggest that cost-aware PoQ provides a practical foundation for economically sustainable decentralized LLM inference.
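A linear quality-cost reward of the kind described above can be written as a weighted difference of normalized quality and normalized latency; the sketch below uses a hypothetical node pool and weight, not the paper's actual parameters.

```python
# Sketch of a linear cost-aware reward: weighted difference of normalized
# quality and normalized latency. Node pool and weight are hypothetical.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    quality: float    # e.g., token-level F1 or evaluator score in [0, 1]
    latency_s: float  # measured inference or evaluation cost

def reward(node: Node, pool: list, alpha: float = 0.7) -> float:
    max_latency = max(n.latency_s for n in pool)
    cost_norm = node.latency_s / max_latency      # normalize cost to [0, 1]
    return alpha * node.quality - (1 - alpha) * cost_norm

pool = [Node("tiny-1.1b", 0.58, 0.9), Node("mid-3b", 0.74, 1.6), Node("slow-3b", 0.60, 4.0)]
for n in pool:
    print(n.name, round(reward(n, pool), 3))
```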
Relevance evaluation plays a crucial role in personalized search systems to ensure that search results align with a user's queries and intent. While human annotation is the traditional method for relevance evaluation, its high cost and long turnaround time limit its scalability. In this work, we present our approach at Pinterest Search to automate relevance evaluation for online experiments using fine-tuned LLMs. We rigorously validate the alignment between LLM-generated judgments and human annotations, demonstrating that LLMs can provide reliable relevance measurement for experiments while greatly improving the evaluation efficiency. Leveraging LLM-based labeling further unlocks the opportunities to expand the query set, optimize sampling design, and efficiently assess a wider range of search experiences at scale. This approach leads to higher-quality relevance metrics and significantly reduces the Minimum Detectable Effect (MDE) in online experiment measurements.
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than with another human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
AI practitioners increasingly use large language model (LLM) agents in compound AI systems to solve complex reasoning tasks; however, these agent executions often fail to meet human standards, leading to errors that compromise the system's overall performance. Addressing these failures through human intervention is challenging due to the agents' opaque reasoning processes, misalignment with human expectations, the complexity of agent dependencies, and the high cost of manual inspection. This paper thus introduces a human-centered evaluation framework for Verifying LLM Agent failures (VeriLA), which systematically assesses agent failures to reduce human effort and make these agent failures interpretable to humans. The framework first defines clear expectations of each agent by curating human-designed agent criteria. Then, it develops a human-aligned agent verifier module, trained with human gold standards, to assess each agent's execution output. This approach enables granular evaluation of each agent's performance by revealing failures from a human standard, offering clear guidelines for revision, and reducing human cognitive load. Our case study results show that VeriLA is both interpretable and efficient in helping practitioners interact more effectively with the system. By upholding accountability in human-agent collaboration, VeriLA paves the way for more trustworthy and human-aligned compound AI systems.
Introduction: The adoption of Large Language Models (LLMs) in search systems necessitates new evaluation methodologies beyond traditional rule-based or manual approaches. Methods: We propose a general framework for evaluating structured outputs using LLMs, focusing on search query parsing within an online classified platform. Our approach leverages LLMs' contextual reasoning capabilities through three evaluation methodologies: Pointwise, Pairwise, and Pass/Fail assessments. Additionally, we introduce a Contextual Evaluation Prompt Routing strategy to improve reliability and reduce hallucinations. Results: Experiments conducted on both small- and large-scale datasets demonstrate that LLM-based evaluation achieves approximately 90% agreement with human judgments. Discussion: These results validate LLM-driven evaluation as a scalable, interpretable, and effective alternative to traditional evaluation methods, providing robust query parsing for real-world search systems.
Thematic analysis (TA) is a widely used qualitative method for identifying underlying meanings within unstructured text. However, TA requires manual processes, which become increasingly labour-intensive and time-consuming as datasets grow. While large language models (LLMs) have been introduced to assist with TA on small-scale datasets, three key limitations hinder their effectiveness. First, current approaches often depend on interactions between an LLM agent and a human coder, a process that becomes challenging with larger datasets. Second, with feedback from the human coder, the LLM tends to mirror the human coder, which provides a narrower viewpoint of the data. Third, existing methods follow a sequential process, where codes are generated for individual samples without recalling previous codes and associated data, reducing the ability to analyse data holistically. To address these limitations, we propose Thematic-LM, an LLM-based multi-agent system for large-scale computational thematic analysis. Thematic-LM assigns specialised tasks to each agent, such as coding, aggregating codes, and maintaining and updating the codebook. We assign coder agents different identity perspectives to simulate the subjective nature of TA, fostering a more diverse interpretation of the data. We applied Thematic-LM to the Dreaddit dataset and the Reddit climate change dataset to analyse themes related to social media stress and online opinions on climate change. We evaluate the resulting themes based on trustworthiness principles in qualitative research. Our study reveals, among other insights, that assigning different identities to coder agents promotes divergence in codes and themes.
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing ''ground truth''. A crucial factor in RAG evaluation is ''support'', or whether the information in the cited documents supports the answer. We conducted a comparative study of submissions to the TREC 2024 RAG Track, evaluating an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate good agreement between human and GPT-4o predictions. Further analysis of the disagreements shows that an independent human judge correlates better with GPT-4o than with another human judge, suggesting that LLM judges can be a reliable alternative for support assessment. We provide a qualitative analysis of human and GPT-4o errors to help guide future evaluations.
Background: Abstraction is a critical step for converting clinical data from unstructured EHRs into a structured format suitable for real-world data analyses. Typically this is a manual, labor-intensive activity requiring substantial training. While prior work has shown that abstraction by humans is reliable, advances in LLMs may improve the efficiency of abstraction. We aim to measure the performance of LLMs in abstracting a diverse set of oncology data elements. Methods: Two clinical abstractors independently abstracted unstructured records of 222 advanced or metastatic NSCLC patients (mean: 248 pages per case). A two-stage LLM system balancing cost and comprehensiveness was used to abstract clinical elements for demographics, diagnosis, third-party lab biomarker testing, and first line (1L) treatment. The initial stage extracted 16 documents semantically similar to the abstraction query and input them, along with abstraction rules, into an LLM (GPT-4o). The LLM was instructed to provide both the abstracted field and a completeness assessment of the provided context. If the first phase resulted in a low completeness score, the entire patient record was then input into a long-context LLM (Gemini-Pro-1.5) to re-attempt abstraction. Gwet’s agreement coefficient (AC) was the primary measure of agreement between the LLM and each abstractor. Date agreement was calculated within ±30 days. Results: The LLM system yielded abstracted values for 90% of elements where both abstractors provided non-missing values. In these cases, the LLM also demonstrated high agreement with each abstractor (≥0.81 across all categories). Agreement was highest in the demographic and diagnosis domains and lower for the 1L treatment domain, which requires deeper understanding of a patient's temporal journey. For elements where neither abstractor provided values, the LLM sometimes provided outputs (frequency: 4.9% for non-biomarker elements; 38.5% for biomarker elements). These discrepancies were primarily driven by nuances in abstraction rules; the LLM often included Tempus-tested biomarkers, while abstractors were more rigorous in abstracting only third-party biomarker results. Conclusions: LLMs show high completion rates and high agreement with human abstractors across a variety of critical abstraction fields. The use of LLMs may significantly reduce the burden of human abstraction and allow for large-scale curation of oncology records. Challenges in handling nuanced contexts underscore the need for careful refinement and evaluation prior to deployment. LLM agreement with abstractors by domain (Gwet's AC, min-max): Demographic (birth date, sex, race, smoking status) 0.96-1; Diagnosis (stage, histology, year of diagnosis) 0.92-0.98; Third Party Biomarker (EGFR, ALK, ROS1, PDL1, BRAF, RET, NTRK) 0.87-1; 1L Treatment (agents, initiation date) 0.81-0.86.
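Gwet's AC1, the agreement coefficient used above, can be computed for two raters from observed agreement and a prevalence-based chance term; the sketch below implements the standard two-rater formula on made-up stage labels.

```python
# Two-rater Gwet's AC1: observed agreement corrected by a prevalence-based
# chance term. The stage labels below are invented for illustration.
from collections import Counter

def gwet_ac1(rater1, rater2):
    n = len(rater1)
    categories = sorted(set(rater1) | set(rater2))
    q = len(categories)
    pa = sum(a == b for a, b in zip(rater1, rater2)) / n      # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    pi = {k: (c1[k] + c2[k]) / (2 * n) for k in categories}   # average category prevalence
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)      # chance agreement term
    return (pa - pe) / (1 - pe)

llm = ["IIIA", "IV", "IV", "IIIB", "IV", "IV"]
abstractor = ["IIIA", "IV", "IV", "IV", "IV", "IV"]
print(round(gwet_ac1(llm, abstractor), 3))
```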
We introduce a novel workflow, TICODER, designed to enhance the trust and accuracy of LLM-based code generation through interactive and guided intent formalization. TICODER partially formalizes ambiguous intent in natural language prompts by generating a set of tests that distinguish common divergent behaviours in generated code suggestions. We evaluate the code generation accuracy improvements provided by TICODER at scale across four competitive LLMs, and assess the cost-benefit trade-off of reviewing the tests surfaced by TICODER through a user study with 15 participants.
Argument Structure Constructions (ASCs) are one of the most well-studied construction groups, providing a unique opportunity to demonstrate the usefulness of Construction Grammar (CxG). For example, the caused-motion construction (CMC, "She sneezed the foam off her cappuccino") demonstrates that constructions must carry meaning, otherwise the fact that "sneeze" in this context causes movement cannot be explained. We hypothesize that this remains challenging even for state-of-the-art Large Language Models (LLMs), for which we devise a test based on substituting the verb with a prototypical motion verb. To be able to perform this test at a statistically significant scale, in the absence of adequate CxG corpora, we develop a novel pipeline of NLP-assisted collection of linguistically annotated text. We show how dependency parsing and GPT-3.5 can be used to significantly reduce annotation cost and thus enable the annotation of rare phenomena at scale. We then evaluate GPT, Gemini, Llama2 and Mistral models for their understanding of the CMC using the newly collected corpus. We find that all models struggle with understanding the motion component that the CMC adds to a sentence.
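The dependency-parsing step in the pipeline described above can be approximated with a simple heuristic filter; the sketch below flags caused-motion candidates with spaCy. The model name, dependency labels, and preposition list are rough assumptions and far simpler than the paper's actual pipeline.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # illustrative model choice

def cmc_candidates(sentences):
    """Heuristically flag sentences whose verb has a direct object plus a
    directional prepositional phrase (a rough caused-motion pattern)."""
    hits = []
    for doc in nlp.pipe(sentences):
        for tok in doc:
            if tok.pos_ != "VERB":
                continue
            has_dobj = any(c.dep_ in ("dobj", "obj") for c in tok.children)
            has_path = any(c.dep_ == "prep" and c.text.lower() in
                           {"off", "into", "onto", "out", "through", "across"}
                           for c in tok.children)
            if has_dobj and has_path:
                hits.append((doc.text, tok.text))
    return hits

print(cmc_candidates(["She sneezed the foam off her cappuccino.",
                      "He read the book in the garden."]))
```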
Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale.
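As a rough illustration of the second phase described above, where LLM-assigned pseudo labels are used to train a lightweight classifier that can be served at scale, the sketch below pairs TF-IDF features with logistic regression. The texts, labels, and model choices are hypothetical stand-ins rather than TnT-LLM's actual components.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical conversations with intent labels assigned by an LLM in phase one.
texts = ["how do I reset my password", "compare flight prices to tokyo",
         "summarize this contract clause", "find hotels near the venue"]
llm_labels = ["account_help", "travel_planning", "document_analysis", "travel_planning"]

# Lightweight classifier trained on LLM pseudo labels, cheap to deploy at scale.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, llm_labels)

print(clf.predict(["what is the cheapest train to berlin"]))
```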
High-quality smart contract vulnerability datasets are critical for evaluating security tools and advancing smart contract security research. Two major limitations of current manual dataset construction are (1) labor-intensive and error-prone annotation processes limiting the scale, quality, and evolution of the dataset, and (2) absence of standardized classification rules results in inconsistent vulnerability categories and labeling results across different datasets. To address these limitations, we present FORGE, the first automated approach for constructing smart contract vulnerability datasets. FORGE leverages an LLM-driven pipeline to extract high-quality vulnerabilities from real-world audit reports and classify them according to the CWE, the most widely recognized classification in software security. FORGE employs a divide-and-conquer strategy to extract structured and self-contained vulnerability information from these reports. Additionally, it uses a tree-of-thoughts technique to classify the vulnerability information into the hierarchical CWE classification. To evaluate FORGE's effectiveness, we run FORGE on 6,454 real-world audit reports and generate a dataset comprising 81,390 solidity files and 27,497 vulnerability findings across 296 CWE categories. Manual assessment of the dataset demonstrates high extraction precision and classification consistency with human experts (precision of 95.6% and inter-rater agreement k-$\alpha$ of 0.87). We further validate the practicality of our dataset by benchmarking 13 existing security tools on our dataset. The results reveal the significant limitations in current detection capabilities. Furthermore, by analyzing the severity-frequency distribution patterns through a unified CWE perspective in our dataset, we highlight inconsistency between current smart contract research focus and priorities identified from real-world vulnerabilities...
With the rapid improvement in large language model (LLM) capabilities, it is becoming more difficult to measure the quality of outputs generated by natural language generation (NLG) systems. Conventional metrics such as BLEU and ROUGE are bound to reference data and are generally unsuitable for tasks that require creative or diverse outputs. Human evaluation is an option, but manually evaluating generated text is difficult to do well, and expensive to scale and repeat as requirements and quality criteria change. Recent work has focused on the use of LLMs as customizable NLG evaluators, and initial results are promising. In this demonstration we present EvaluLLM, an application designed to help practitioners set up, run, and review evaluations over sets of NLG outputs, using an LLM as a custom evaluator. Evaluation is formulated as a series of choices between pairs of generated outputs conditioned on user-provided evaluation criteria. This approach simplifies the evaluation task and obviates the need for complex scoring algorithms. The system can be applied to general evaluation, human-assisted evaluation, and model selection problems.
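The pairwise formulation used here, in which a judge LLM picks the better of two outputs under a user-provided criterion, might be wired up roughly as below; `call_llm` is a hypothetical placeholder for whatever model API the application wraps, not EvaluLLM's actual interface.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around the judge model; replace with a real API call."""
    raise NotImplementedError

def pairwise_judge(criterion: str, output_a: str, output_b: str) -> str:
    """Ask the judge model which of two outputs better satisfies the criterion."""
    prompt = (
        "You are comparing two candidate outputs.\n"
        f"Criterion: {criterion}\n\n"
        f"Output A:\n{output_a}\n\nOutput B:\n{output_b}\n\n"
        "Answer with exactly one letter, A or B, for the better output."
    )
    answer = call_llm(prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"

# Usage idea: tally wins over many pairs to rank systems without a scoring rubric.
```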
This study aims to investigate the use of generative artificial intelligence (GenAI) in the legal profession, focusing on its fit with tasks performed by legal practitioners and its impact on performance and adoption. This study uses a mixed methods approach, combining a survey of 279 legal professionals with qualitative insights from open-ended responses. The quantitative part uses structural equation modeling-partial least squares (PLS-SEM) offering statistical evidence on the relationships between Task Characteristics, Technology Characteristics, Task-Technology Fit (TTF), Utilization and Performance Impact. The qualitative analysis explores participants’ detailed experiences, perceptions and concerns through thematic and sentiment analyses, providing deeper contextual insights. The study highlights variability in the alignment between legal tasks and GenAI capabilities. GenAI fits data-intensive tasks like research but struggles with complex human judgment. A strong TTF improves performance and adoption. Familiarity helps results but does not increase use, as legal practitioners use GenAI selectively, even when they are highly familiar with its capabilities. Participants’ comments highlight both opportunities and challenges, including efficiency gains and concerns over data security, trust and output quality. Despite these challenges, most respondents expressed a positive sentiment. By extending the TTF theory to GenAI in the legal domain and integrating quantitative and qualitative evidence, the study identifies where GenAI adds value and where professional oversight is essential. It offers practical recommendations, including deploying GenAI in areas where it is most suitable and promoting responsible use through targeted training, professional development and confidence-building initiatives that also address associated risks.
BACKGROUND Pediatric stroke is an important cause of morbidity in children. Although research can be challenging, large amounts of data have been captured through collaborative efforts in the International Pediatric Stroke Study (IPSS). This study explores the use of an advanced artificial intelligence program, the Generative Pre-trained Transformer (GPT), to enter pediatric stroke data into the IPSS. METHODS The most recent 50 clinical notes of patients with ischemic stroke or cerebral venous sinus thrombosis at the UTHealth Pediatric Stroke Clinic were deidentified. Domain-specific prompts were engineered for an offline artificial intelligence program (GPT) to answer IPSS questions. Responses from GPT were compared with those of a human rater. Percent agreement was assessed across 50 patients for each of the 114 queries developed from the IPSS database outcome questionnaire. RESULTS GPT demonstrated strong performance on several questions but showed variability overall. In its early iterations it was able to match human judgment occasionally with an accuracy score of 1.00 (n = 20, 17.5%), but it scored as low as 0.26 in some patients. Prompts were adjusted in four subsequent iterations to increase accuracy. In its fourth iteration, agreement was 93.6%, with a maximum agreement of 100% and minimum of 62%. Of 2400 individual items assessed, our model entered 2247 (93.6%) correctly and 153 (6.4%) incorrectly. CONCLUSIONS Although our tailored generative model with domain-specific prompt engineering and ontological guidance shows promise for research applications, further refinement is needed to enhance its accuracy. It cannot enter data entirely independently, but it can be employed in tandem with human oversight, contributing to a collaborative approach that reduces overall effort.
Through extensive experience training professionals and individual users in AI tool adoption since the GPT-3 era, I have observed a consistent pattern: the same AI tool produces dramatically different results depending on who uses it. While some frame AI as a replacement for human intelligence, and others warn of cognitive decline, this position paper argues for a third perspective grounded in practical observation: AI as a cognitive amplifier that magnifies existing human capabilities rather than substituting for them. Drawing on research in human-computer interaction, cognitive augmentation theory, and educational technology, alongside field observations from corporate training across writing, software development, and data analysis domains, I present a framework positioning AI tools as intelligence amplification systems where output quality depends fundamentally on user expertise and judgment. Through analysis of empirical studies on expert-novice differences and systematic observations from professional training contexts, I demonstrate that domain knowledge, quality judgment, and iterative refinement capabilities create substantial performance gaps between users. I propose a three-level model of AI engagement -- from passive acceptance through iterative collaboration to cognitive direction -- and argue that the transition between levels requires not technical training but development of domain expertise and metacognitive skills. This position has critical implications for workforce development and AI system design. Rather than focusing solely on AI literacy or technical prompt engineering, I advocate for integrated approaches that strengthen domain expertise, evaluative judgment, and reflective practice.
No abstract available
This study examines how generative artificial intelligence can augment human judgment in assurance of learning assessments within business education, using the task-technology fit framework as a guiding lens. A case study in a college of business – where the Management Information Systems program served as a central unit in the assurance of learning cycle – compared generative artificial intelligence-driven evaluations of student writing with traditional faculty assessments. The results demonstrate that when mediated by human-based prompt engineering and moderated by human oversight, generative artificial intelligence markedly improves assessment efficiency and scoring consistency while providing more in-depth feedback without compromising evaluation accuracy. These findings indicate that generative artificial intelligence is most effective as a complement to rather than a replacement for human evaluators. The study extends task-technology fit theory to generative artificial intelligence-driven educational assessment and introduces a human-integrated, generative artificial intelligence-augmented theoretical model for assurance of learning assessments. In this model, human expertise acts as an iterative mediator (via prompt engineering) to strengthen task-technology alignment, while human oversight serves as a moderator ensuring contextual fidelity and output quality. Beyond its theoretical contribution, the study highlights practical implications for information systems educators and curriculum designers.
No abstract available
Customer relationship management (CRM) systems are changing more than ever before, thanks to rapid advances in artificial intelligence. Generative AI (GenAI) is now capable of answering most routine customer questions automatically. Human agents step in only for complex cases that need expert judgment, empathy, or careful handling due to rules and regulations. However, companies still find it difficult to make AI and people work in harmony. This research introduces and tests a hybrid CRM approach that combines GenAI with a clear human-in-the-loop (HITL) process. We use the NATCS dataset, which contains real customer service conversations from 2023 in fields like banking, finance, health, travel, and insurance. We fine-tune an AI assistant, set up rules for when humans should take over based on uncertainty in the conversation, and measure results using both data and human feedback. Our hybrid system makes customer service faster, more empathetic, and simpler. It also provides streamlined guidelines for managing risks, security, and data, meeting high academic standards. The results show that this hybrid approach increases first-contact resolution by 27%, cuts average handling time by 34%, and boosts customer satisfaction by 22% compared to using AI alone. It also reduces the number of times human agents need to speak by 41% compared to traditional methods.
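One plausible shape for the handoff rules mentioned above, where conversations escalate to a human agent under high uncertainty or regulated topics, is sketched below; the thresholds and field names are invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class TurnAssessment:
    confidence: float            # model's confidence in its draft reply (0..1)
    regulated_topic: bool        # e.g., disputes, medical or financial advice
    customer_frustration: float  # 0..1 score from a sentiment/emotion classifier

def should_escalate(a: TurnAssessment,
                    min_confidence: float = 0.75,
                    max_frustration: float = 0.6) -> bool:
    """Route the conversation to a human agent when the bot is unsure,
    the topic requires regulated handling, or the customer is upset."""
    return (a.confidence < min_confidence
            or a.regulated_topic
            or a.customer_frustration > max_frustration)

print(should_escalate(TurnAssessment(confidence=0.62, regulated_topic=False,
                                     customer_frustration=0.3)))  # -> True
```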
Online A/B testing is widely used as an experimental methodology for product improvement and business optimization. However, interpreting experimental results often involves subjective judgment and biases from experiment designers, which can undermine the reliability and reproducibility of test outcomes. In particular, experiment designers frequently exhibit inconsistent decision-making when dealing with neutral results—cases where neither statistically significant positive nor negative effects are observed. This study aims to explore the feasibility of automating A/B test decision-making using Generative AI and empirically analyze how well AI decisions align with those of experiment designers and experts. Utilizing 1,407 experimental cases from 48 companies on the Hackle online experimentation platform, the study compares decision-making outcomes between experiment designers and Generative AI, analyzing agreement rates and identifying patterns across companies. Statistical analyses, including chi-square tests and inter-rater agreement evaluation, were employed to assess differences and reliability. The findings indicate meaningful discrepancies between AI and experiment designers but demonstrate that AI decisions closely align with expert judgments. These results suggest that Generative AI can serve as a complementary tool to enhance the consistency and reliability of A/B test result interpretation.
Those involved in education see great importance in promoting self-regulated learning (SRL) and problem-solving (PS) among learners. Teachers are a major factor in promoting SRL-PS in the classroom, but very little is known about whether and how they promote it; therefore, a tool is needed to assess teachers in this domain. The situational judgment test (SJT) is a potential tool to diagnose teachers’ knowledge, but such a tool is currently lacking in the SRL-PS domain. Human interaction with generative artificial intelligence (GenAI) makes it possible to overcome difficulties, with each side complementing the other. The purpose of this paper is to present an initial attempt to develop an SJT tool for assessing teachers’ SRL-PS knowledge with the assistance of ChatGPT. This human-machine interaction led to the formulation of 15 difficulty categories and 20 scenarios that form the basis of the SJT tool for SRL-PS. It was found that scenarios created by the researchers can be complemented by those generated by ChatGPT. In some instances, the scenarios from both sources are quite similar, while in others, those formulated by ChatGPT either expand upon or present an alternative perspective on the difficulty. Moreover, ChatGPT has proposed new scenarios in certain cases. A significant product of the study is a map that enables the analysis of the pool of scenarios and the identification of over- or underrepresented difficulty categories.
This study investigates the metacognitive capabilities of Large Language Models relative to human metacognition in the context of an exam mimicking the International Coaching Federation (ICF) exam, a situational judgment test related to coaching competencies. Using a mixed-methods approach, we assessed the metacognitive performance, including sensitivity, accuracy in probabilistic predictions, and bias, of human participants and five advanced LLMs (GPT-4, Claude-3-Opus, Mistral Large, Llama 3, and Gemini 1.5 Pro). The results indicate that LLMs outperformed humans across all metacognitive metrics, particularly in terms of reduced overconfidence. However, both LLMs and humans showed less adaptability in ambiguous scenarios, adhering closely to predefined decision frameworks. The study suggests that Generative AI can effectively engage in human-like metacognitive processing without conscious awareness. Implications of the study are discussed in relation to the development of AI simulators that scaffold cognitive and metacognitive aspects of mastering coaching competencies. More broadly, the implications of these results are discussed in relation to the development of metacognitive modules that lead towards more autonomous and intuitive AI systems.
Purpose: As technical and professional communicators (TPCs) use AI to develop content, inaccuracies due to AI limitations are introduced; it is vital that TPCs evaluate AI-generated content to improve accuracy and human-centeredness. In this article, we present a human-in-the-loop AI content heuristic (HEAT: Human experience, Expertise, Accuracy, and Trust) as a rating mechanism. Method: This exploratory case study evaluated the quality of content generated by ChatGPT from the perspective of beginner TPC students. We used multiple prompting strategies asking ChatGPT to create documentation on personas using two Darwin Information Type Architecture (DITA) information types, namely concept topics and task instructions, and we evaluated the results with HEAT. Results: HEAT had good intraclass correlation coefficient (ICC) reliability (.743 pilot; .825 for scenarios), indicating its fitness as a heuristic for evaluating generative AI output. The findings indicate that ChatGPT was generally good at writing concept topics; however, it performed less well creating step-by-step task instructions. Expert TPC input helped develop a better prompt for improved output. We also found that tokenization in ChatGPT (the way it breaks up text) plays a large role in noncompliance with format specifications. Conclusion: There is a need for TPCs to (1) develop new models for AI-assisted content creation, (2) recognize the impact of different prompting strategies on developing specific structured authoring units such as concept and task topics, and (3) be aware of the limitations of AI such as ChatGPT. Human-in-the-loop quality check mechanisms, such as HEAT, can help validate and modify AI-generated content to better serve end users.
Large language models (LLMs), initially developed for generative AI, are now evolving into agentic AI systems, which make decisions in complex, real-world contexts. Unfortunately, while their generative capabilities are well-documented, their decision-making processes remain poorly understood. This is particularly evident when testing targeted decision-making: for instance, how models handle exceptions, a critical and challenging aspect of decision-making made relevant by the inherent incompleteness of contracts. Here we demonstrate that LLMs, even ones that excel at reasoning, deviate significantly from human judgments because they adhere strictly to policies, even when such adherence is impractical, suboptimal, or even counterproductive. We then evaluate three approaches to tuning AI agents to handle exceptions: ethical framework prompting, chain-of-thought reasoning, and supervised fine-tuning. We find that while ethical framework prompting fails and chain-of-thought prompting provides only slight improvements, supervised fine-tuning - specifically with human explanations - yields markedly better results. Surprisingly, in our experiments, supervised fine-tuning even enabled models to generalize human-like decision-making to novel scenarios, demonstrating transfer learning of human-aligned decision-making across contexts. Furthermore, fine-tuning with explanations, not just labels, was critical for alignment, suggesting that aligning LLMs with human judgment requires explicit training on how decisions are made, not just which decisions are made. These findings highlight the need to address LLMs' shortcomings in handling exceptions in order to guide the development of agentic AI toward models that can effectively align with human judgment and simultaneously adapt to novel contexts.
Modern businesses are increasingly challenged by the time and expense required to generate and assess high-quality content. Human writers face time constraints, and extrinsic evaluations can be costly. While Large Language Models (LLMs) offer potential in content creation, concerns about the quality of AI-generated content persist. Traditional evaluation methods, like human surveys, further add operational costs, highlighting the need for efficient, automated solutions. This research introduces Generative Agents as a means to tackle these challenges. These agents can rapidly and cost-effectively evaluate AI-generated content, simulating human judgment by rating aspects such as coherence, interestingness, clarity, fairness, and relevance. By incorporating these agents, businesses can streamline content generation and ensure consistent, high-quality output while minimizing reliance on costly human evaluations. The study provides critical insights into enhancing LLMs for producing business-aligned, high-quality content, offering significant advancements in automated content generation and evaluation.
This paper proposes a structured framework for evaluating empathy in Generative AI (GenAI) customer-service systems, addressing a critical gap in how human-centric qualities are assessed in automated interactions. Traditional service-quality models, such as SERVQUAL and the Interpersonal Reactivity Index, conceptualize empathy through human judgment and emotional capability—dimensions that do not directly transfer to GenAI. To adapt these constructs, the paper introduces a dual-constraint model in which emotional alignment and policy compliance jointly determine a bounded form of empathy suitable for AI-mediated service. The study employs conceptual simulation, integrating psychological theory, service-quality research, and responsible-AI standards including EmotionML, IEEE 7010, and ISO/IEC 23894. Symbolic modelling and unit-free visualization are used to illustrate how empathy, factual integrity, and customer outcomes interact across feasible and infeasible regions. Governance metrics—such as Bounded Empathy Drift, Policy Breach Rate, and Insensitive Response Rate—demonstrate how organizations can monitor empathic behaviour during deployment and ensure compliance with ethical constraints. The framework contributes a measurable, governable conceptualization of empathy for GenAI services, offering a foundation for empirical validation and practical implementation. It positions empathy as a strategic performance dimension influencing satisfaction, trust, and service experience within AI-enabled customer ecosystems.
Generative AI in entrepreneurial evaluation reveals sharp contrasts with human judgment. Using a mixed-methods design on 80 early-stage startup proposals, the analysis shows that human scorers produce bimodal distributions reflecting categorical conviction, while AI outputs remain unimodal, centralised, and conservative. These tendencies indicate both algorithmic bureaucracy and behavioral convergence, as evaluators embed preferences during scoring, collapsing ranking and selection into a single act. A hybrid framework is advanced that positions AI as a baseline reference, preserving human interpretive depth and contextual sensitivity while mitigating bureaucratic inertia.
Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human-in-the-Loop
Large Language Models have found application in various mundane and repetitive tasks including Human Resource (HR) support. We worked with the domain experts of a large multinational company to develop an HR support chatbot as an efficient and effective tool for addressing employee inquiries. We inserted a human-in-the-loop in various parts of the development cycles such as dataset collection, prompt optimization, and evaluation of generated output. By enhancing the LLM-driven chatbot’s response quality and exploring alternative retrieval methods, we have created an efficient, scalable, and flexible tool for HR professionals to address employee inquiries effectively. Our experiments and evaluation conclude that GPT-4 outperforms other models and can overcome inconsistencies in data through internal reasoning capabilities. Additionally, through expert analysis, we infer that reference-free evaluation metrics such as G-Eval and Prometheus demonstrate reliability closely aligned with that of human evaluation.
Improving Hebrew offensive language classification using LLM-assisted human-in-the-loop annotation
This study explores the challenges and opportunities of using large language models (LLMs) for automatic annotation of offensive language in Hebrew. The analysis is based on a six-level taxonomy of offensive discourse – covering offensiveness, target, target presence, vulgarity, offense strength, and specific aspects – enabling systematic examination of nuanced Hebrew patterns. Several prompting strategies were tested, including few-shot, role-based prompting, chain-of-thought, and LLM-as-judge, and their outputs were compared to human annotations using standard evaluation metrics and inter-annotator reliability. Findings reveal that salient categories such as explicit threats show strong interpretive stability, while ambiguous ones, such as discrediting attacks, require higher precision. The study also introduces a two-step classification method: first identifying the two most plausible categories, then selecting the more accurate one. This approach reduces the model’s bias toward general categories and improves fine-grained classification. Overall, the study contributes by (1) offering a methodological framework to assess LLMs’ interpretive limits compared to humans, and (2) laying groundwork for building more refined datasets to advance Hebrew offensive language research.
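The two-step classification method described above, shortlisting the two most plausible categories and then choosing between them, could be sketched as follows; `classify` is a hypothetical LLM call and the prompts are illustrative rather than the paper's own.

```python
from typing import Callable, List

def two_step_label(text: str, categories: List[str],
                   classify: Callable[[str], str]) -> str:
    """Step 1: ask for the two most plausible categories.
       Step 2: ask the model to choose between just those two."""
    shortlist_prompt = (
        f"Text: {text}\nFrom the categories {categories}, "
        "return the two most plausible labels, comma separated."
    )
    shortlist = [c.strip() for c in classify(shortlist_prompt).split(",")][:2]
    if len(shortlist) < 2:          # fall back if the model returns one label
        return shortlist[0]
    decide_prompt = (
        f"Text: {text}\nWhich single label fits better: "
        f"{shortlist[0]} or {shortlist[1]}? Answer with the label only."
    )
    return classify(decide_prompt).strip()
```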
Despite significant progress in model development, evaluating eXplainable Artificial Intelligence (XAI) remains elusive and challenging in Alzheimer's Disease (AD) detection using modalities from low-cost or wearable devices. This paper introduces a fine-grained validation framework named 'FairAD-XAI', which provides a comprehensive assessment through twelve properties of explanations, forming a detailed Likert questionnaire. This framework ensures a thorough evaluation of XAI methods, capturing their fairness aspects and supporting the improvement of how humans assess the reliability and transparency of these methods. Moreover, fairness in XAI evaluation is critical, as users from diverse demographic backgrounds may have different perspectives and perceptions towards the system. These variations can lead to biases in human-grounded evaluations and, subsequently, biased decisions from the AI system when deployed. To mitigate this risk, we incorporate two fairness metrics tailored to assess and ensure fairness in XAI evaluations, promoting more equitable outcomes. In summary, the proposed 'FairAD-XAI' framework provides a comprehensive tool for evaluating XAI methods and assessing the essential aspect of fairness. This makes it a multifactorial tool for developing unbiased XAI methods for AI-based AD detection tools, ensuring these technologies are both effective and equitable.
Text summarization, a key natural language generation (NLG) task, is vital in various domains. However, the high cost of inaccurate summaries in risk-critical applications, particularly those involving human-in-the-loop decision-making, raises concerns about the reliability of uncertainty estimation on text summarization (UE-TS) evaluation methods. This concern stems from the dependency of uncertainty model metrics on diverse and potentially conflicting NLG metrics. To address this issue, we introduce a comprehensive UE-TS benchmark incorporating 31 NLG metrics across four dimensions. The benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model on three datasets, with human-annotation analysis incorporated where applicable. We also assess the performance of 14 common uncertainty estimation methods within this benchmark. Our findings emphasize the importance of considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods to ensure reliable and efficient evaluation of UE-TS techniques. Our code and data are available: https://github.com/he159ok/Benchmark-of-Uncertainty-Estimation-Methods-in-Text-Summarization.
Humans can be notoriously imperfect evaluators. They are often biased, unreliable, and unfit to define "ground truth." Yet, given the surging need to produce large amounts of training data in educational applications using AI, traditional inter-rater reliability (IRR) metrics like Cohen's kappa remain central to validating labeled data. IRR remains a cornerstone of many machine learning pipelines for educational data. Take, for example, the classification of tutors' moves in dialogues or labeling open responses in machine-graded assessments. This position paper argues that overreliance on human IRR as a gatekeeper for annotation quality hampers progress in classifying data in ways that are valid and predictive in relation to improving learning. To address this issue, we highlight five examples of complementary evaluation methods, such as multi-label annotation schemes, expert-based approaches, and close-the-loop validity. We argue that these approaches are in a better position to produce training data and subsequent models that produce improved student learning and more actionable insights than IRR approaches alone. We also emphasize the importance of external validity, for example, by establishing a procedure of validating tutor moves and demonstrating that it works across many categories of tutor actions (e.g., providing hints). We call on the field to rethink annotation quality and ground truth, prioritizing validity and educational impact over consensus alone.
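For context, Cohen's kappa, the inter-rater reliability metric the authors argue is over-relied on, corrects observed agreement for chance agreement: kappa = (p_o - p_e) / (1 - p_e). A minimal computation with hypothetical tutor-move labels is shown below.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical tutor-move annotations from two human raters.
rater_1 = ["praise", "hint", "hint", "question", "praise", "hint"]
rater_2 = ["praise", "hint", "question", "question", "praise", "question"]

# kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for chance.
print(round(cohen_kappa_score(rater_1, rater_2), 3))
```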
This study re-examines the role of Explainable AI (XAI) within human-AI collaborative environments and proposes a design and evaluation framework for a human-AI collaboration system that integrates Large Language Models (LLMs) and state-of-the-art AI agent technology. The proposed methodology, which consists of an AI model, an explanation generation module, and a human-AI interface, enhances the adaptability and reliability of explanations. A key contribution of this research is the introduction of an LLM-XAI collaborative architecture that integrates personalized, adaptive explanations with a feedback-driven improvement mechanism. Notably, the system presents a novel paradigm for explanations that distinguishes it from conventional XAI methods by utilizing Chain-of-Thought reasoning traces, natural language explanations, and a multi-stage verification mechanism provided by Deep Research and LLM-based agents. The system defines core quality metrics such as explainability, transparency, reliability, interactivity, and adaptability, and concurrently develops a multi-dimensional evaluation framework to assess these metrics using both quantitative and qualitative data. This system is structured with a feedback loop that enables continuous learning and improvement while transparently explaining the AI’s decision-making process. The quality of explanations is also assessed with quantitative metrics, and the system improves continuously through user feedback. This study also presents quantitative and qualitative evaluation metrics and user research methodologies to validate the system’s effectiveness, which is expected to contribute to achieving trust-based human-AI collaboration. Furthermore, to demonstrate its practical applicability, a pilot implementation in a medical diagnosis support scenario is presented, offering an ideal model where humans and AI collaborate complementarily, thereby playing a crucial role in promoting the ethical use and social acceptance of AI systems.
The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. In this work, we propose: 1. ReviewEval, a comprehensive evaluation framework for AI-generated reviews that measures alignment with human assessments, verifies factual accuracy, assesses analytical depth, identifies degree of constructiveness and adherence to reviewer guidelines; and 2. ReviewAgent, an LLM-based review generation agent featuring a novel alignment mechanism to tailor feedback to target conferences and journals, along with a self-refinement loop that iteratively optimizes its intermediate outputs and an external improvement loop using ReviewEval to improve upon the final reviews. ReviewAgent improves actionable insights by 6.78% and 47.62% over existing AI baselines and expert reviews respectively. Further, it boosts analytical depth by 3.97% and 12.73%, and enhances adherence to guidelines by 10.11% and 47.26% respectively. This paper establishes essential metrics for AI-based peer review and substantially enhances the reliability and impact of AI-generated reviews in academic research.
System Instructions (SIs), or system prompts, are pivotal for guiding Large Language Models (LLMs) but manual crafting is resource-intensive and often suboptimal. Existing automated methods frequently generate non-human-readable “soft prompts,” sacrificing interpretability. This paper introduces SI-Agent, a novel agentic framework designed to automatically generate and iteratively refine human-readable SIs through a feedback-driven loop. SI-Agent employs three collaborating agents: an Instructor Agent, an Instruction Follower Agent (target LLM), and a Feedback/Reward Agent evaluating task performance and optionally SI readability. The framework utilizes iterative cycles where feedback guides the Instructor's refinement strategy (e.g., LLM-based editing, evolutionary algorithms). We detail the framework's architecture, agent roles, the iterative refinement process, and contrast it with existing methods. We present experimental results validating SI-Agent's effectiveness, focusing on metrics for task performance, SI readability, and efficiency. Our findings indicate that SI-Agent generates effective, readable SIs, offering a favorable trade-off between performance and interpretability compared to baselines. Potential implications include democratizing LLM customization and enhancing model transparency. Challenges related to computational cost and feedback reliability are acknowledged.
"Gold"and"ground truth"human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently only human task. We answer the question of whether such a task can be automated using two Large Language Model (LLM) architecture families--encoders and GPT decoders, using novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieve state-of-the-art, even"super-human", results across all classroom annotation tasks. But not all these positive results remain after using more rigorous evaluation measures which reveal spurious correlations and nonrandom racial biases across models and humans. This study then expands these methods to estimate how model use would change to human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.
Assessing classification confidence is essential for effectively leveraging Large Language Models (LLMs) in automated data labeling, particularly within the sensitive contexts of Computational Social Science (CSS) tasks. In this study, we evaluate five uncertainty quantification (UQ) strategies across three CSS classification problems: stance detection, ideology identification, and frame detection. We benchmark these strategies using three different LLMs. To enhance human-in-the-loop classification performance, we introduce an ensemble-based UQ aggregation method, C_ensemble, and propose a novel evaluation metric, Misclassified Recall, designed to better assess model uncertainty on mislabeled or ambiguous data points. Our results show that C_ensemble outperforms existing UQ techniques in six out of nine model-dataset combinations, achieving an average AUC improvement of 8.7%. These findings highlight the potential of UQ-driven methods to significantly improve the reliability and efficiency of human-in-the-loop data annotation pipelines.
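One plausible reading of the proposed Misclassified Recall metric (of all items the classifier got wrong, the fraction that the uncertainty score flags for human review) is sketched below; the exact definition in the paper may differ, and the data is invented.

```python
import numpy as np

def misclassified_recall(y_true, y_pred, uncertainty, review_fraction=0.2):
    """Of all misclassified items, the fraction that fall inside the
    top `review_fraction` most-uncertain items routed to human annotators."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    uncertainty = np.asarray(uncertainty)
    n_review = max(1, int(round(review_fraction * len(y_true))))
    flagged = np.argsort(-uncertainty)[:n_review]   # most uncertain first
    wrong = np.flatnonzero(y_true != y_pred)
    if wrong.size == 0:
        return 1.0
    return np.intersect1d(flagged, wrong).size / wrong.size

# Hypothetical stance-detection predictions and uncertainty scores.
y_true      = [0, 1, 1, 0, 1, 0, 1, 0]
y_pred      = [0, 1, 0, 0, 1, 1, 1, 0]
uncertainty = [.1, .2, .9, .1, .3, .8, .2, .1]
print(misclassified_recall(y_true, y_pred, uncertainty, review_fraction=0.25))
```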
Application modernization in legacy languages such as COBOL, PL/I, and REXX faces an acute shortage of resources, both in expert availability and in high-quality human evaluation data. While Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review, their reliability must be validated before being trusted in high-stakes workflows. Without principled validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs, potentially reinforcing unreliable judgments and compromising downstream deployment decisions. Although various automated approaches to validating LaaJs have been proposed, alignment with human judgment remains a widely used and conceptually grounded validation strategy. In many real-world domains, the availability of human-labeled evaluation data is severely limited, making it difficult to assess how well a LaaJ aligns with human judgment. We introduce SparseAlign, a formal framework for assessing LaaJ alignment with sparse human-labeled data. SparseAlign combines a novel pairwise-confidence concept with a score-sensitive alignment metric that jointly capture ranking consistency and score proximity, enabling reliable evaluator selection even when traditional statistical methods are ineffective due to limited annotated examples. SparseAlign was applied internally to select LaaJs for COBOL code explanation. The top-aligned evaluators were integrated into assessment workflows, guiding model release decisions.
Evaluating Retrieval-Augmented Generation in Scholarly Domains: A Contextual and Semantic Assessment
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance large language models (LLMs) by grounding outputs in external knowledge sources. This paper evaluates the effectiveness of RAG in scholarly domains through a dual assessment combining automated metrics and human expert validation. BLEU score analysis showed that the system achieved high accuracy in structured, ontology-based queries, with the strongest BLEU score of 0.807 observed in ontology-driven counselling applications, while responses to abstract queries were less precise. Expert evaluation, conducted across six criteria—Correctness, Faithfulness, Completeness, Relevance, Error Categorization, and Overall Judgment—revealed that despite partial correctness and incomplete coverage, the generated outputs were highly relevant and received an overall Excellent rating. These findings demonstrate that RAG can provide contextually meaningful and useful responses in academic applications, while highlighting the need for improvements in retrieval accuracy, evidence grounding, and completeness. Future work will focus on refining retrieval strategies, applying domain-specific fine-tuning, and integrating human-in-the-loop feedback to further strengthen the reliability and scholarly utility of RAG systems.
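For reference, the BLEU computation used as the automated metric above can be reproduced with sacreBLEU roughly as follows; the sentences are invented placeholders, and note that sacreBLEU reports scores on a 0-100 scale while the paper appears to use a 0-1 scale.

```python
from sacrebleu.metrics import BLEU

# One reference stream aligned with one system output (placeholder text).
references = [["Ontology-based counselling retrieves the matching intervention for each query."]]
hypothesis = ["The ontology-based counselling system retrieves a matching intervention per query."]

bleu = BLEU()
# corpus_score returns a BLEUScore object; .score is corpus-level BLEU in [0, 100].
print(bleu.corpus_score(hypothesis, references).score)
```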
Cybersecurity breaches in digital substations can pose significant challenges to the stability and reliability of power system operations. To address these challenges, defense and mitigation techniques are required. Identifying and detecting anomalies in information and communication technology (ICT) is crucial to ensure secure device interactions within digital substations. This paper proposes a task-oriented dialogue (ToD) system for anomaly detection (AD) in datasets of multicast messages, e.g., generic object oriented substation event (GOOSE) and sampled value (SV) messages, in digital substations using large language models (LLMs). This model has a lower potential for error and better scalability and adaptability than a process that considers the cybersecurity guidelines recommended by humans, known as the human-in-the-loop (HITL) process. The methodology also significantly reduces the effort required when addressing new cyber threats or anomalies compared with machine learning (ML) techniques, since it leaves the model's complexity and precision unaffected and offers faster implementation. The findings include a comparative assessment, conducted using standard and advanced performance evaluation metrics, of the proposed AD framework and the HITL process. To generate and extract datasets of IEC 61850 communications, a hardware-in-the-loop (HIL) testbed was employed.
The rapid advancement of artificial intelligence, driven by Generative Pre-trained Transformers (GPT), has transformed natural language processing. Prompt engineering plays a key role in guiding model outputs effectively. Our primary objective was to explore the possibilities and limitations of a custom GPT, developed via prompt engineering, as a patient education tool, which delivers publicly available information through a user-friendly design that facilitates more effective access to cervical cancer screening knowledge. The system was developed using the OpenAI GPT-4 model and the Python programming language, with the interface built on Streamlit for cloud-based accessibility and testing. It initially presented questions to testers for preliminary assessment. For cervical cancer-related information, we referenced medical guidelines. Iterative testing optimized the prompts for quality and relevance; techniques like context provision, question chaining, and prompt-based constraints were used. Human-in-the-loop review and two independent medical doctor evaluations were employed. Additionally, system performance metrics were measured. The web application was tested 115 times over a three-week period in 2024, with 87 female (76%) and 28 male (24%) participants. A total of 112 users completed the user experience questionnaire. Statistical analysis showed a significant association between age and perceived personalization (p = 0.047) and between gender and system customization (p = 0.037). Younger participants reported higher engagement, though not significantly. Females valued guidance on screening schedules and early detection, while males highlighted the usefulness of information regarding HPV vaccination and its role in preventing HPV-related cancers. Independent evaluations by medical doctors demonstrated consistent assessments of the system’s responses in terms of accuracy, clarity, and usefulness. While the system demonstrates potential to enhance public health awareness and promote preventive behaviors, encouraging individuals to seek information on cervical cancer screening and HPV vaccination, its conversational capabilities remain constrained by the inherent limitations of current language model technology. Although custom GPTs cannot substitute for healthcare consultations, these tools can streamline workflows, expedite information access, and support personalized care. Further research should focus on conducting well-designed randomized controlled trials to establish definitive conclusions regarding its impact and reliability.
Large Language Models (LLMs) have revolutionized natural language processing by enabling advanced text generation, comprehension, and interactive capabilities. However, their performance often degrades when confronted with real-world variability, requiring continuous refinement to maintain accuracy, reliability, and ethical integrity. Traditional model calibration relies on periodic updates and static fine-tuning, which fail to address evolving language patterns, contextual nuances, and emergent biases. To overcome these limitations, continuous model calibration introduces a feedback-driven fine-tuning mechanism that enables self-correcting capabilities in LLMs. This approach integrates progressive tuning techniques, real-time human-AI collaboration, and anomaly detection frameworks to dynamically adjust model behavior. Progressive tuning leverages reinforcement learning with human feedback (RLHF) and adaptive loss functions to iteratively refine LLM responses, ensuring alignment with contextual accuracy and user expectations. Human-AI collaboration further enhances model calibration by incorporating domain experts' insights and structured feedback loops to mitigate ethical risks, bias propagation, and factual inconsistencies. Additionally, anomaly detection mechanisms identify distributional shifts and inconsistencies in generated responses, allowing automated interventions to preempt erroneous or misleading outputs. This study explores the interplay between self-correction methodologies and real-world applications, emphasizing the need for transparent governance and robust evaluation metrics. We examine case studies across conversational AI, legal reasoning, and healthcare applications to demonstrate the efficacy of feedback-driven fine-tuning in maintaining model adaptability. By establishing a continuous improvement framework, this research aims to optimize AI reliability, enhance interpretability, and promote ethically aligned decision-making in dynamic environments.
Current Chinese-English translation scoring systems suffer from defects such as long processing times and significant differences between system scores and human scores. To reduce the difficulty of human scoring and improve scoring effectiveness, an artificial intelligence scoring management platform for English translation based on natural language information processing was developed. A hierarchical structure for the Chinese-English translation scoring platform was established, with functions for translation material collection, message feature acquisition, language analysis model construction, and data transmission and evaluation. A sentence analysis model for the scoring management platform was then created, and its performance was statistically analyzed using the probability distribution of certain sentences or vocabulary orderings in the translation to evaluate the intelligence of the Chinese-English translation network platform. The experimental results show that the minimum error between the scoring of the proposed platform and traditional manual scoring is only 0.1 points, and the accuracy of the evaluation results is better than that of the current scoring platform. The running time for scoring is also shorter than that of the current scoring platform, with more stable and higher-quality results.
Natural language processing (NLP) is central to communication with machines and among ourselves, and the NLP research field has long sought to produce human-quality language. Identification of informative criteria for measuring the quality of NLP-produced language will support the development of ever-better NLP tools. The authors hypothesize that mentalizing-network neural activity may be used to distinguish NLP-produced language from human-produced language, even in cases where human judges cannot subjectively distinguish the language source. Using the social chatbots Google Meena in English and Microsoft XiaoIce in Chinese to generate NLP-produced language, behavioral tests are conducted which reveal that the variance of personality perceived from chatbot chats is larger than for human chats, suggesting that chatbot language usage patterns are not stable. Using an identity-rating task with functional magnetic resonance imaging, neuroimaging analyses are conducted which reveal distinct patterns of brain activity in the mentalizing network, including the DMPFC and rTPJ, in response to chatbot versus human chats that cannot be distinguished subjectively. This study illustrates a promising empirical basis for measuring the quality of NLP-produced language: adding a judge's implicit perception as an additional criterion.
No abstract available
As the utilization of language models in interdisciplinary, human-centered studies grows, expectations of their capabilities continue to evolve. Beyond excelling at conventional tasks, models are now expected to perform well on user-centric measurements involving confidence and human (dis)agreement, factors that reflect subjective preferences. While modeling subjectivity plays an essential role in cognitive science and has been extensively studied, its investigation at the intersection with NLP remains under-explored. In light of this gap, we explore how language models can quantify subjectivity in cognitive appraisal by conducting comprehensive experiments and analyses with both fine-tuned models and prompt-based large language models (LLMs). Our quantitative and qualitative results demonstrate that personality traits and demographic information are critical for measuring subjectivity, yet existing post-hoc calibration methods often fail to achieve satisfactory performance. Furthermore, our in-depth analysis provides valuable insights to guide future research at the intersection of NLP and cognitive science.
The Rating Scale (RS) method has long been deemed the standard for measuring subjective perceptions. However, in the field of physical human-robot collaboration (pHRC), its aptness should be put under scrutiny due to inherent challenges such as response bias, between-subject variation, and granularity. Individual variances can introduce significant bias in rating scale results. A high granularity in the scale could overwhelm participants, leading to unclear and biased responses, while a low granularity may gloss over the fine nuances of human feelings. Additionally, there is a notable risk of receiving careless responses, which compromise data reliability. Recognizing these challenges, this paper proposes the application of Pairwise Comparison (PC) in pHRC, an alternative survey technique that emphasizes direct comparisons between items on the defined criteria. Using the NASA Task Load Index (NASA-TLX) as a template, RS and PC questionnaires are designed and used in a series of pHRC experiments. Our preliminary findings suggest that PC is more precise and robust than the rating scale method. Compared to RS, PC fosters authentic participant interest in the experiment through intuitive question design and a reduced experimental duration. Moreover, the accuracy and reliability of PC are found to be consistent regardless of variations in our experimental procedure design.
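Pairwise-comparison responses are usually aggregated back into a scale by fitting a model such as Bradley-Terry; the sketch below implements the standard minorization-maximization iteration on a hypothetical win-count matrix. The paper does not specify its aggregation method, so this is an assumption rather than its procedure.

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a matrix where wins[i, j] is the
    number of times item i was preferred over item j (MM algorithm)."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        totals = wins + wins.T                         # comparisons per pair
        denom = totals / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = wins.sum(axis=1) / denom.sum(axis=1)   # total wins / weighted exposure
        p = p_new / p_new.sum()                        # normalize for identifiability
    return p

# Hypothetical preference counts among three workload conditions in a pHRC study.
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]], dtype=float)
print(np.round(bradley_terry(wins), 3))
```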
Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that there are two common factors which make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate under high variance settings while the differences they are hoping to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in high variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analysis and test their assumptions. We also make recommendations for improving statistical power.
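A stripped-down version of the kind of simulation-based power analysis the authors advocate is sketched below: draw ordinal ratings for two systems that differ slightly and estimate how often a test detects the difference at each sample size. The rating distributions and the use of a Mann-Whitney test are illustrative; the paper itself recommends ordinal mixed-effects models and releases its own tooling.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
levels = np.arange(1, 6)                       # 5-point Likert scale
p_base  = np.array([.10, .20, .35, .25, .10])  # system A rating distribution
p_shift = np.array([.08, .17, .33, .28, .14])  # system B: slightly better

def estimated_power(n_items, n_sims=2000, alpha=0.05):
    """Fraction of simulated studies in which the difference is detected."""
    hits = 0
    for _ in range(n_sims):
        a = rng.choice(levels, size=n_items, p=p_base)
        b = rng.choice(levels, size=n_items, p=p_shift)
        if mannwhitneyu(a, b, alternative="two-sided").pvalue < alpha:
            hits += 1
    return hits / n_sims

for n in (100, 300, 1000):
    print(n, estimated_power(n))
```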
To date, there is a paucity of research conducting natural language processing (NLP) on the open-ended responses of behavior rating scales. Using three NLP lexicons for sentiment analysis of the open-ended responses of the Behavior Assessment System for Children-Third Edition, the researchers discovered a moderately positive correlation between the human composite rating and the sentiment score using each of the lexicons for strengths comments and a slightly positive correlation for the concerns comments made by guardians and teachers. In addition, the researchers found that as the word count increased for open-ended responses regarding the child’s strengths, there was a greater positive sentiment rating. Conversely, as word count increased for open-ended responses regarding child concerns, the human raters scored comments more negatively. The authors offer a proof-of-concept to use NLP-based sentiment analysis of open-ended comments to complement other data for clinical decision making.
No abstract available
Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS system evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus dataset, ATT-Corpus, paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also fine-tune Qwen2-Audio-Instruct with human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in the ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).
Traditional subjective answer evaluation is inconsistent, unfair, and time-inefficient. Utilizing Generative AI (GenAI) and Natural Language Processing (NLP), this project presents a technique that automates the evaluation of subjective answers. The system analyzes each student's response against predefined grading rubrics, assigns scores based on quality and relevance, and provides personalized feedback. Using a fine-tuned GenAI model, it reduces human bias and inconsistency by attending to context and coherence. Educators can manage and review evaluations through an easy-to-use, Streamlit-hosted interface. The proposed solution aims to achieve higher grading accuracy and reduce assessment workflow complexity while giving students insight into what is being assessed. Potential applications include essay grading, peer review, and professional training assessment.
Previous work adopts large language models (LLMs) as evaluators to evaluate natural language processing (NLP) tasks. However, certain shortcomings, e.g., fairness, scope, and accuracy, persist for current LLM evaluators. To analyze whether LLMs can serve as reliable alternatives to humans, we examine the fine-grained alignment between LLM evaluators and human annotators, particularly in understanding the target evaluation tasks and conducting evaluations that meet diverse criteria. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning), each with different evaluation criteria. Our analysis shows that 1) LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts. 2) LLM evaluators excel in general criteria, such as fluency, but face challenges with complex criteria, such as numerical reasoning. We also find that LLM pre-drafting before human evaluation can help reduce the impact of human subjectivity and minimize annotation outliers in pure human evaluation, leading to more objective evaluation.
In the context of remanufacturing, particularly mobile device refurbishing, effective operator training is crucial for accurate defect identification and process inspection efficiency. This study examines the application of Natural Language Processing (NLP) techniques to evaluate operator expertise based on subjective textual responses gathered during a defect analysis task. Operators were asked to describe screen defects using open-ended questions, and their responses were compared with expert responses to evaluate their accuracy and consistency. We employed four NLP models, including finetuned Sentence-BERT (SBERT), pre-trained SBERT, Word2Vec, and Dice similarity, to determine their effectiveness in interpreting short, domain-specific text. A novel disagreement tagging framework was introduced to supplement traditional similarity metrics with explainable insights. This framework identifies the root causes of model–human misalignment across four categories: defect type, severity, terminology, and location. Results show that a finetuned SBERT model significantly outperforms other models by achieving a Pearson's correlation of 0.93 with MAE and RMSE scores of 0.07 and 0.12, respectively, providing more accurate and context-aware evaluations. In contrast, other models exhibit limitations in semantic understanding and consistency. The results highlight the importance of finetuning NLP models for domain-specific applications and demonstrate how qualitative tagging methods can enhance interpretability and model debugging. This combined approach indicates a scalable and transparent methodology for the evaluation of operator responses, supporting the development of more effective training programmes in industrial settings where remanufacturing and sustainability are key performance metrics.
While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments due to their cost efficiency. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTScore computes the BERTScore for self-supervised dense speech features of the generated and reference speech, which can have different sequential lengths. We also propose SpeechBLEU and SpeechTokenDistance, which are computed on speech discrete tokens. The evaluations on synthesized speech show that our method correlates better with human subjective ratings than mel cepstral distortion and a recent mean opinion score prediction model. Also, they are effective in noisy speech evaluation and have cross-lingual applicability.
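The matching underlying SpeechBERTScore is the BERTScore-style greedy alignment of frame embeddings. The numpy sketch below illustrates that computation on random matrices standing in for self-supervised speech features; the real metric extracts features with an SSL speech model, which is not shown here.

```python
# Sketch: BERTScore-style precision/recall/F1 over frame-level speech features.
import numpy as np

def speech_bertscore(gen_feats, ref_feats):
    """gen_feats: (T_gen, D), ref_feats: (T_ref, D); returns precision, recall, F1."""
    gen = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = gen @ ref.T                      # (T_gen, T_ref) cosine similarities
    precision = sim.max(axis=1).mean()     # best reference match per generated frame
    recall = sim.max(axis=0).mean()        # best generated match per reference frame
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rng = np.random.default_rng(0)
gen = rng.normal(size=(120, 768))          # stand-in for generated-speech features
ref = rng.normal(size=(150, 768))          # stand-in for reference-speech features
print(speech_bertscore(gen, ref))
```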
Assessment is a fundamental part of the learning process, yet it is often prone to subjectivity and latent biases associated with handwriting, language proficiency, or social background. Existing automatic grading methods are efficient, but they can hardly be considered transparent or fair. In this paper we introduce an Adaptive and Explainable Fair AI-Powered Assessment System, which enables transparent and fair evaluation of student answers. The proposed system applies Machine Learning and Natural Language Processing (NLP) methods to examine semantic content rather than writing style or linguistic complexity. A human-in-the-loop bias correction framework allows fairness to improve iteratively and continuously by incorporating educator feedback into model updates. Furthermore, a cross-lingual fairness module built on multilingual transformer models helps in scoring equally across various languages. For risk reduction and interpretability, we include an explainability engine that generates nuanced rationales for each score in terms of content coverage, key-concept relevance, and points of failure. Experimental results show that the system can reduce bias, justify scores transparently, and accommodate personalized feedback for learning. This method offers an adaptive and ethical framework for AI-based educational assessment.
In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful large language models -- which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
Active Learning (AL) addresses the high costs of collecting human annotations by strategically annotating the most informative samples. However, for subjective NLP tasks, incorporating a wide range of perspectives in the annotation process is crucial to capture the variability in human judgments. We introduce Annotator-Centric Active Learning (ACAL), which incorporates an annotator selection strategy following data sampling. Our objective is two-fold: (1) to efficiently approximate the full diversity of human judgments, and (2) to assess model performance using annotator-centric metrics, which value minority and majority perspectives equally. We experiment with multiple annotator selection strategies across seven subjective NLP tasks, employing both traditional and novel, human-centered evaluation metrics. Our findings indicate that ACAL improves data efficiency and excels in annotator-centric performance evaluations. However, its success depends on the availability of a sufficiently large and diverse pool of annotators to sample from.
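One simple way to read "annotator-centric metrics" is to score the model against each annotator separately and then average over annotators rather than items, so minority and majority perspectives weigh equally. The sketch below illustrates this with invented labels; it is not the paper's exact metric suite.

```python
# Sketch: annotator-centric accuracy (average of per-annotator accuracies).
import numpy as np

# annotations[annotator] = that annotator's labels for 5 items (illustrative)
annotations = {
    "ann_1": [1, 0, 1, 1, 0],
    "ann_2": [1, 0, 0, 1, 0],
    "ann_3": [0, 0, 1, 1, 1],   # a minority perspective on items 1 and 5
}
model_preds = [1, 0, 1, 1, 0]

per_annotator_acc = {
    name: float(np.mean(np.array(labels) == np.array(model_preds)))
    for name, labels in annotations.items()
}
print(per_annotator_acc)
print("annotator-centric score:",
      round(float(np.mean(list(per_annotator_acc.values()))), 3))
```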
Two general approaches are common for evaluating automatically generated labels in topic modeling: direct human assessment; or performance metrics that can be calculated without, but still correlate with, human assessment. However, both approaches implicitly assume that the quality of a topic label is single-dimensional. In contrast, this paper provides evidence that human assessments about the quality of topic labels consist of multiple latent dimensions. This evidence comes from human assessments of four simple labeling techniques. For each label, study participants responded to several items asking them to assess each label according to a variety of different criteria. Exploratory factor analysis shows that these human assessments of labeling quality have a two-factor latent structure. Subsequent analysis demonstrates that this multi-item, two-factor assessment can reveal nuances that would be missed using either a single-item human assessment of perceived label quality or established performance metrics. The paper concludes by suggesting future directions for the development of human-centered approaches to evaluating NLP and ML systems more broadly.
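The key analysis here is exploratory factor analysis over multi-item human ratings. The sketch below shows what that step looks like on synthetic ratings with a planted two-factor structure; the items, loadings, and data are all illustrative rather than taken from the study.

```python
# Sketch: exploratory factor analysis on multi-item label-quality ratings.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_labels = 200
# two latent qualities, e.g., "descriptive fit" and "readability" (assumed names)
fit = rng.normal(size=n_labels)
readability = rng.normal(size=n_labels)
items = np.column_stack([
    fit + 0.3 * rng.normal(size=n_labels),          # item 1 loads on fit
    fit + 0.3 * rng.normal(size=n_labels),          # item 2 loads on fit
    readability + 0.3 * rng.normal(size=n_labels),  # item 3 loads on readability
    readability + 0.3 * rng.normal(size=n_labels),  # item 4 loads on readability
])

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(items)
print(np.round(fa.components_, 2))   # loadings: rows = factors, columns = items
```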
Debate and public speaking evaluation traditionally depend on human judgment, which can be subjective, inconsistent, and time-intensive. With advancements in Artificial Intelligence (AI) and Natural Language Processing (NLP), automated systems can now provide structured and objective language analysis. This paper presents DebateGPT, a real-time AI-based debate argument analyzer that integrates speech recognition, transformer-based NLP models, sentiment analysis, and grammatical evaluation into a unified mobile-compatible framework. The proposed system captures spoken input, converts it into text using advanced Speech-to-Text (STT) models, and analyzes argument structure, logical coherence, and emotional tone. The backend is developed using FastAPI and Python, while the frontend is implemented using Kotlin Compose for Android. Transformer models from Hugging Face are utilized for argument mining and sentiment classification, and speech recognition is powered by models such as OpenAI Whisper. The system generates structured feedback reports in real time, enabling debaters to identify strengths and weaknesses immediately. The proposed architecture aims to reduce subjectivity in debate evaluation while enhancing learning efficiency and communication skill development. Keywords: Natural Language Processing, Transformer Models, Speech Recognition, Argument Mining, Sentiment Analysis, Real-Time Evaluation.
Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.
While recent research increasingly emphasizes the value of human-LLM collaboration in competitive programming and proposes numerous empirical methods, a comprehensive understanding remains elusive due to the fragmented nature of existing studies and their use of diverse, application-specific human feedback. Thus, our work serves a three-fold purpose: First, we present the first taxonomy of human feedback consolidating the entire programming process, which promotes fine-grained evaluation. Second, we introduce ELABORATIONSET, a novel programming dataset specifically designed for human-LLM collaboration, meticulously annotated to enable large-scale simulated human feedback and facilitate cost-effective real human interaction studies. Third, we introduce ELABORATION, a novel benchmark to facilitate a thorough assessment of human-LLM competitive programming. With ELABORATION, we pinpoint strengths and weaknesses of existing methods, thereby setting the foundation for future improvement. Our code and dataset are available at https://github.com/SCUNLP/ELABORATION
Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges. The code and dataset are publicly available at our project homepage: \url{https://mllm-judge.github.io/}.
LLM-as-a-Judge is a framework that uses an LLM (large language model) to evaluate the quality of natural language text - typically text that is also generated by an LLM. This framework holds great promise due to its relative low-cost, ease of use, and strong correlations with human stylistic preferences. However, LLM Judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating the overall response quality. To do so, we create and publicly release a human-annotated dataset with labels of correctness for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM's ability to correctly answer a question and grade responses to that question. Although aggregate level statistics might imply a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis on how reference quality can affect the performance of an LLM Judge. We show that providing a weaker judge (e.g. Qwen 2.5 7B) with higher quality references reaches better agreement with human annotators than a stronger judge (e.g. GPT-4o) with synthetic references.
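The paper's practical recommendation is to hand the judge a human-written reference answer. A minimal sketch of that setup follows; the prompt wording is invented, and call_llm is a stand-in for whatever chat-completion client is actually used.

```python
# Sketch: reference-augmented correctness judging with an LLM judge.
def build_judge_prompt(question, response, reference):
    return (
        "You are grading the factual correctness of an answer.\n"
        f"Question: {question}\n"
        f"Candidate answer: {response}\n"
        f"Human-written reference answer: {reference}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def judge_correctness(question, response, reference, call_llm):
    """call_llm: any callable that maps a prompt string to a completion string."""
    verdict = call_llm(build_judge_prompt(question, response, reference))
    return verdict.strip().upper().startswith("CORRECT")

# usage sketch with a stubbed model client
fake_llm = lambda prompt: "CORRECT"
print(judge_correctness("What is 2+2?", "4", "The answer is 4.", fake_llm))
```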
As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.
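Intra-rater reliability of an LLM judge can be probed by rescoring the same outputs several times. The sketch below computes two simple views of that spread, per-item standard deviation and run-to-run rank correlation, on an invented score matrix; the paper's own tasks and statistics may differ.

```python
# Sketch: quantify run-to-run consistency of repeated LLM-judge scores.
import numpy as np
from scipy.stats import spearmanr

# scores[r, i] = judge score for item i on repeated run r (1-10 scale, illustrative)
scores = np.array([
    [7, 4, 9, 6, 3],
    [8, 4, 7, 6, 5],
    [6, 5, 9, 4, 3],
], dtype=float)

per_item_sd = scores.std(axis=0, ddof=1)
print("per-item SD across runs:", np.round(per_item_sd, 2))

# average pairwise Spearman correlation between runs (higher = more consistent)
n_runs = scores.shape[0]
rhos = [spearmanr(scores[a], scores[b])[0]
        for a in range(n_runs) for b in range(a + 1, n_runs)]
print("mean run-to-run Spearman rho:", round(float(np.mean(rhos)), 2))
```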
The rapid advancement of large language models (LLMs) has led to a surge in both model supply and application demands. To facilitate effective matching between them, reliable, generic and efficient benchmark generators are widely needed. However, human annotators are constrained by inefficiency, and current LLM benchmark generators not only lack generalizability but also struggle with limited reliability, as they lack a comprehensive evaluation framework for validation and optimization. To fill this gap, we first propose an automated and unbiased evaluation framework, structured around four dimensions and ten criteria. Under this framework, we carefully analyze the advantages and weaknesses of directly prompting LLMs as generic benchmark generators. To enhance the reliability, we introduce a series of methods to address the identified weaknesses and integrate them as BenchMaker. Experiments across multiple LLMs and tasks confirm that BenchMaker achieves superior or comparable performance to human-annotated benchmarks on all metrics, highlighting its generalizability and reliability. More importantly, it delivers highly consistent evaluation results across 12 LLMs (0.967 Pearson correlation against MMLU-Pro), while taking only $0.005 and 0.38 minutes per sample.
LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate high-quality, self-bias-silenced benchmark. Experimental results across various settings demonstrate that Silencer can suppress self-bias to near zero, significantly improve evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with high-quality human-annotated benchmark), while also exhibiting strong generalizability.
Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.
Inconsistent political statements represent a form of misinformation. When left unnoticed, they erode public trust and pose challenges to accountability. Detecting inconsistencies automatically could support journalists in asking clarification questions, thereby helping to keep politicians accountable. We propose the inconsistency detection task and develop a scale of inconsistency types to prompt NLP research in this direction. To provide a resource for detecting inconsistencies in a political domain, we present a dataset of 698 human-annotated pairs of political statements with explanations of the annotators' reasoning for 237 samples. The statements mainly come from voting assistant platforms such as Wahl-O-Mat in Germany and Smartvote in Switzerland, reflecting real-world political issues. We benchmark Large Language Models (LLMs) on our dataset and show that in general, they are as good as humans at detecting inconsistencies, and might even be better than individual humans at predicting the crowd-annotated ground truth. However, when it comes to identifying fine-grained inconsistency types, none of the models has reached the upper bound of performance (due to natural labeling variation), thus leaving room for improvement. We make our dataset and code publicly available.
The quality of peer review plays a critical role in scientific publishing, yet remains poorly understood and challenging to evaluate at scale. In this work, we introduce RottenReviews, a benchmark designed to facilitate systematic assessment of review quality. RottenReviews comprises over 15,000 submissions from four distinct academic venues enriched with over 9,000 reviewer scholarly profiles and paper metadata. We define and compute a diverse set of quantifiable review-dependent and reviewer-dependent metrics, and compare them against structured assessments from large language models (LLMs) and expert human annotations. Our human-annotated subset includes over 700 paper-review pairs labeled across 13 explainable and conceptual dimensions of review quality. Our empirical findings reveal that LLMs, both zero-shot and fine-tuned, exhibit limited alignment with human expert evaluations of peer review quality. Surprisingly, simple interpretable models trained on quantifiable features outperform fine-tuned LLMs in predicting overall review quality. We publicly release all data, code, and models at https://github.com/Reviewerly-Inc/RottenReviews to support further research in this area.
Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to distinguish the superior model given several LLMs. PandaLM's focus extends beyond just the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset, where all contexts are generated by humans and labels are aligned with human preferences. Our results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. PandaLM enables the evaluation of LLM to be fairer but with less cost, evidenced by significant improvements achieved by models tuned through PandaLM compared to their counterparts trained with default Alpaca's hyperparameters. In addition, PandaLM does not depend on API-based evaluations, thus avoiding potential data leakage. All resources of PandaLM are released at https://github.com/WeOpenML/PandaLM.
Natural language processing (NLP), particularly sentiment analysis, plays a vital role in areas like marketing, customer service, and social media monitoring by providing insights into user opinions and emotions. However, progress in Arabic sentiment analysis remains limited due to the lack of large, high-quality labeled datasets. While active learning has proven effective in reducing annotation efforts in other languages, few studies have explored it in Arabic sentiment tasks. Likewise, the use of large language models (LLMs) for assisting annotation and comparing their performance to human labeling is still largely unexplored in the Arabic context. In this paper, we propose an active learning framework for Arabic sentiment analysis designed to reduce annotation costs while maintaining high performance. We evaluate multiple deep learning architectures, specifically long short-term memory (LSTM), gated recurrent unit (GRU), and recurrent neural network (RNN) models, across three benchmark datasets: Hunger Station, AJGT, and MASAC, encompassing both modern standard Arabic and dialectal variations. Additionally, two annotation strategies are compared: human labeling and LLM-assisted labeling. Five LLMs are evaluated as annotators: GPT-4o, Claude 3 Sonnet, Gemini 2.5 Pro, DeepSeek Chat, and LLaMA 3 70B Instruct. For each dataset, the best-performing LLM was used: GPT-4o for Hunger Station, Claude 3 Sonnet for AJGT, and DeepSeek Chat for MASAC. Our results show that LLM-assisted active learning achieves competitive or superior performance compared to human labeling. For example, on the Hunger Station dataset, the LSTM model achieved 93% accuracy with only 450 labeled samples using GPT-4o-generated labels, while on the MASAC dataset, DeepSeek Chat reached 82% accuracy with 650 labeled samples, matching the accuracy obtained through human labeling.
With the rapid improvement in the general capabilities of LLMs, LLM personalization, i.e., how to build LLM systems that can generate personalized responses or services that are tailored to distinct user personas, has become an increasingly important research and engineering problem. However, unlike many new challenging benchmarks being released for evaluating the general/reasoning capabilities, the lack of high-quality benchmarks for evaluating LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization, focusing on evaluating the model's ability to generate responses tailored to explicit personas. PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers based on the contextual complexity of the user personas and the difficulty in distinguishing subtle differences between two personalized responses. We conduct comprehensive evaluations across a wide range of models. The empirical results reveal that even state-of-the-art LLMs that can solve complex real-world reasoning tasks could fall short on the hard tier of PersonaFeedback where even human evaluators may find the distinctions challenging. Furthermore, we conduct an in-depth analysis of failure modes across various types of systems, demonstrating that the current retrieval-augmented framework should not be seen as a de facto solution for personalization tasks. All benchmark data, annotation protocols, and the evaluation pipeline will be publicly available to facilitate future research on LLM personalization.
Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs). Typically, an LLM is given a question and selects the answer deemed most probable after adjustments for factors like length. Unfortunately, LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to a priori unbalanced probabilities, influencing the prediction of answers based on these IDs. Previous research has introduced methods to reduce this "selection bias" by simply permuting options on a few test samples and applying the result to new ones. Another problem with MCQ is the lottery-ticket choice from "random guessing": the LLM has not learned the particular knowledge, but the option is guessed correctly. This situation is especially serious for small-scale LLMs. To address these issues, a more thorough approach involves shifting from MCQ to open-style questions, which can fundamentally eliminate selection bias and random guessing. However, the transition brings its own set of challenges in (1) identifying suitable open-style questions and (2) validating the correctness of LLM open-style responses against human-annotated ground truths. This work aims to tackle these significant difficulties, and establish a new LLM evaluation benchmark through entirely open-style questions. Consequently, we introduce the Open-LLM-Leaderboard to track various LLMs' performance and reflect their true capabilities, covering models such as GPT-4o/4/3.5, Claude 3, Gemini, etc. Our code and dataset are available at https://github.com/VILA-Lab/Open-LLM-Leaderboard.
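The selection-bias probe mentioned above, permuting answer options and checking whether the model tracks content or letters, can be sketched in a few lines. In the toy below, ask_model is a placeholder for a real model call and the question is illustrative; a letter-consistency score near 1 with low content consistency signals selection bias.

```python
# Sketch: measure MCQ selection bias by rotating options and comparing
# letter consistency against content consistency.
from itertools import permutations

def probe_selection_bias(question, options, ask_model):
    """options: list of answer strings; ask_model returns a letter like "A"."""
    letters = "ABCD"[:len(options)]
    picked_letters, picked_contents = [], []
    for perm in permutations(options):
        prompt = question + "\n" + "\n".join(
            f"{l}. {o}" for l, o in zip(letters, perm))
        letter = ask_model(prompt)
        picked_letters.append(letter)
        picked_contents.append(perm[letters.index(letter)])
    same_letter = max(picked_letters.count(l) for l in letters) / len(picked_letters)
    same_content = max(picked_contents.count(o) for o in options) / len(picked_contents)
    return {"letter_consistency": same_letter, "content_consistency": same_content}

# usage sketch with a stubbed model that always answers "A" (maximal letter bias)
print(probe_selection_bias("Capital of France?",
                           ["Paris", "Rome", "Berlin"],
                           lambda prompt: "A"))
```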
Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, many automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SE-Jury, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SE-Jury first defines five distinct evaluation strategies, each implemented by an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges as a team to produce a final correctness score through ensembling. We evaluate SE-Jury across a diverse set of software engineering (SE) benchmarks that span three popular SE tasks: code generation, automated program repair, and code summarization. Results demonstrate that SE-Jury consistently achieves a higher correlation with human judgments, with improvements ranging from 29.6% to 140.8% over existing automatic metrics. SE-Jury reaches agreement levels with human annotators that are close to inter-annotator agreement in code generation and program repair. These findings underscore SE-Jury's potential as a scalable and reliable alternative to human evaluation in these SE tasks.
High annotation costs from hiring or crowdsourcing complicate the creation of large, high-quality datasets needed for training reliable text classifiers. Recent research suggests using Large Language Models (LLMs) to automate the annotation process, reducing these costs while maintaining data quality. LLMs have shown promising results in annotating downstream tasks like hate speech detection and political framing. Building on the success in these areas, this study investigates whether LLMs are viable for annotating the complex task of media bias detection and whether a downstream media bias classifier can be trained on such data. We create annolexical, the first large-scale dataset for media bias classification with over 48000 synthetically annotated examples. Our classifier, fine-tuned on this dataset, surpasses all of the annotator LLMs by 5-9 percent in Matthews Correlation Coefficient (MCC) and performs close to or outperforms the model trained on human-labeled data when evaluated on two media bias benchmark datasets (BABE and BASIL). This study demonstrates how our approach significantly reduces the cost of dataset creation in the media bias domain and, by extension, the development of classifiers, while our subsequent behavioral stress-testing reveals some of its current limitations and trade-offs.
Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) style variation (e.g., uncertain phrasing or colloquial tone); and (3) format changes (e.g., LLM-generated multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, outputs from four LLMs, and three human expert reads per clinical context. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or format reflect diverging treatment selections between humans and LLMs. We find that LLMs are more sensitive to gender and style perturbations while human annotators are more sensitive to LLM-generated format perturbations such as clinical summaries. Our results highlight the need for evaluation frameworks that go beyond static benchmarks to assess the similarity between human clinician and LLM decisions under the variability characteristic of clinical settings.
Integrating Large Language Models (LLMs) into business process management tools promises to democratize Business Process Model and Notation (BPMN) modeling for non-experts. While automated frameworks assess syntactic and semantic quality, they miss human factors like trust, usability, and professional alignment. We conducted a mixed-methods evaluation of our proposed solution, an LLM-powered BPMN copilot, with five process modeling experts using focus groups and standardized questionnaires. Our findings reveal a critical tension between acceptable perceived usability (mean CUQ score: 67.2/100) and notably lower trust (mean score: 48.8%), with reliability rated as the most critical concern (M=1.8/5). Furthermore, we identified output-quality issues, prompting difficulties, and a need for the LLM to ask more in-depth clarifying questions about the process. We envision five use cases ranging from domain-expert support to enterprise quality assurance. We demonstrate the necessity of human-centered evaluation complementing automated benchmarking for LLM modeling agents.
Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of these LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics without relying on high-quality reference answers. Nevertheless, their exact human alignment in SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets of code translation, code generation, and code summarization, we then prompt these methods to evaluate each response. Finally, we compare the scores generated by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlation of 81.32 and 68.51 with human scores in code translation and generation, achieving near-human evaluation, noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns. Finally, we provide...
Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, the suitability of current automatic metrics on LLMs' simplification evaluation is still uncertain. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models' performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the LLMs' simplification capabilities. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. We believe that these models offer a representative selection across large, medium, and small sizes of LLMs. Results show that LLMs generally generate fewer erroneous simplification outputs compared to the previous state-of-the-art. However, LLMs have their limitations, as seen in GPT-4's and Qwen2.5-72B's struggle with lexical paraphrasing. Furthermore, we conduct meta-evaluations on widely used automatic metrics using our human annotations. We find that these metrics lack sufficient sensitivity to assess the overall high-quality simplifications, particularly those generated by high-performance LLMs.
We outline the Great Misalignment Problem in natural language processing research: the problem definition is not in line with the proposed method, and the human evaluation is in line with neither the definition nor the method. We study this misalignment problem by surveying 10 randomly sampled papers published in ACL 2020 that report results with human evaluation. Our results show that only one paper was fully in line in terms of problem definition, method, and evaluation. Only two papers presented a human evaluation that was in line with what was modeled in the method. These results highlight that the Great Misalignment Problem is a major one and it affects the validity and reproducibility of results obtained by a human evaluation.
Chatbots have been an interesting application of natural language generation since its inception. With novel transformer-based Generative AI methods, building chatbots has become trivial. Chatbots targeted at specific domains, for example medicine and psychology, are implemented rapidly. This, however, should not distract from the need to evaluate the chatbot responses, especially because the natural language generation community does not entirely agree on how to effectively evaluate such applications. In this work we discuss the issue further in light of the increasingly popular LLM-based evaluations and how they correlate with human evaluations. Additionally, we introduce a comprehensive factored evaluation mechanism that can be utilized in conjunction with both human and LLM-based evaluations. We present the results of an experimental evaluation conducted using this scheme in one of our chatbot implementations, which consumed educational reports, and subsequently compare automated, traditional human evaluation, factored human evaluation, and factored LLM evaluation. Results show that factor-based evaluation produces better insights on which aspects need to be improved in LLM applications and further strengthens the argument for using human evaluation in critical spaces where the main functionality is not direct retrieval.
The REAIM 2024 Blueprint for Action states that AI applications in the military domain should be ethical and human-centric and that humans must remain responsible and accountable for their use and effects. Developing rigorous test and evaluation, verification and validation (TEVV) frameworks will contribute to robust oversight mechanisms. TEVV in the development and deployment of AI systems needs to involve human users throughout the lifecycle. Traditional human-centred test and evaluation methods from human factors need to be adapted for deployed AI systems that require ongoing monitoring and evaluation. The language around AI-enabled systems should shift towards including the human(s) as a component of the system. Standards and requirements supporting this adjusted definition are needed, as are metrics and means to evaluate them. The need for dialogue between technologists and policymakers on human-centred TEVV will be evergreen, but dialogue needs to be initiated with an objective in mind for it to be productive. Development of TEVV throughout the system lifecycle is critical to support this evolution, including the issue of human scalability and its impact on the scale of achievable testing. Communication between technical and non-technical communities must be improved to ensure operators and policymakers understand the risk assumed by system use and to better inform research and development. Test and evaluation in support of responsible AI deployment must include the effect of the human to reflect operationally realised system performance. Means of communicating the results of TEVV to those using and making decisions regarding the use of AI-based systems will be key in informing risk-based decisions regarding use.
Trustworthy Large Language Models (LLMs) must cite human-verifiable sources in high-stakes domains such as healthcare, law, academia, and finance, where even small errors can have severe consequences. Practitioners and researchers face a choice: let models generate citations during decoding, or let models draft answers first and then attach appropriate citations. To clarify this choice, we introduce two paradigms: Generation-Time Citation (G-Cite), which produces the answer and citations in one pass, and Post-hoc Citation (P-Cite), which adds or verifies citations after drafting. We conduct a comprehensive evaluation from zero-shot to advanced retrieval-augmented methods across four popular attribution datasets and provide evidence-based recommendations that weigh trade-offs across use cases. Our results show a consistent trade-off between coverage and citation correctness, with retrieval as the main driver of attribution quality in both paradigms. P-Cite methods achieve high coverage with competitive correctness and moderate latency, whereas G-Cite methods prioritize precision at the cost of coverage and speed. We recommend a retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings such as strict claim verification. Our codes and human evaluation results are available at https://anonymous.4open.science/r/Citation_Paradigms-BBB5/
LLM evaluation is challenging even in the case of base models. In real-world deployments, evaluation is further complicated by the interplay of task-specific prompts and experiential context. At scale, bias evaluation is often based on short-context, fixed-choice benchmarks that can be rapidly evaluated; however, these can lose validity when the LLM's deployed context differs. Large-scale human evaluation is often seen as too intractable and costly. Here we present our journey towards developing a semi-automated bias evaluation framework for free-text responses that has human insights at its core. We discuss how we developed an operational definition of bias that helped us automate our pipeline and a methodology for classifying bias beyond multiple choice. We additionally comment on how human evaluation helped us uncover problematic templates in a bias benchmark.
We provide a new multi-task benchmark for evaluating text-to-image models. We perform a human evaluation comparing the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models, on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings. Text-to-image generation has seen rapid progress to the point that many recent models have demonstrated their ability to create realistic high-resolution images for various prompts. However, current text-to-image methods and the broader body of research in vision-language understanding still struggle with intricate text prompts that contain many objects with multiple attributes and relationships. We introduce a new text-to-image benchmark that contains a suite of thirty-two tasks over multiple applications that capture a model's ability to handle different features of a text prompt. For example, asking a model to generate a varying number of the same object to measure its ability to count or providing a text prompt with several objects that each have a different attribute to identify its ability to match objects and attributes correctly. Rather than subjectively evaluating text-to-image results on a set of prompts, our new multi-task benchmark consists of challenge tasks at three difficulty levels (easy, medium, and hard) and human ratings for each generated image.
High-stakes decision domains are increasingly exploring the potential of Large Language Models (LLMs) for complex decision-making tasks. However, LLM deployment in real-world settings presents challenges in data security, evaluation of its capabilities outside controlled environments, and accountability attribution in the event of adversarial decisions. This paper proposes a framework for responsible deployment of LLM-based decision-support systems through active human involvement. It integrates interactive collaboration between human experts and developers through multiple iterations at the pre-deployment stage to assess the uncertain samples and judge the stability of the explanation provided by post-hoc XAI techniques. Local LLM deployment within organizations and decentralized technologies, such as Blockchain and IPFS, are proposed to create immutable records of LLM activities for automated auditing to enhance security and trace back accountability. It was tested on Bert-large-uncased, Mistral, and LLaMA 2 and 3 models to assess the capability to support responsible financial decisions on business lending.
Improvements in text generation technologies such as machine translation have necessitated more costly and time-consuming human evaluation procedures to ensure an accurate signal. We investigate a simple way to reduce cost by reducing the number of text segments that must be annotated in order to accurately predict a score for a complete test set. Using a sampling approach, we demonstrate that information from document membership and automatic metrics can help improve estimates compared to a pure random sampling baseline. We achieve gains of up to 20% in average absolute error by leveraging stratified sampling and control variates. Our techniques can improve estimates made from a fixed annotation budget, are easy to implement, and can be applied to any problem with structure similar to the one we study.
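The control-variate idea in this abstract is easy to state in code: automatic metric scores are available for every segment, human scores only for a sample, and the known full-set metric mean corrects the sampled human mean. The sketch below uses synthetic data and the textbook control-variate coefficient; it is not the paper's exact estimator.

```python
# Sketch: control-variate estimate of a test-set human score from a sample,
# using an automatic metric known for all segments. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_total, n_annotated = 2000, 200

metric_all = rng.normal(0.6, 0.1, size=n_total)               # automatic metric, all segments
human_all = metric_all + rng.normal(0.0, 0.05, size=n_total)  # latent human scores (only for checking)

idx = rng.choice(n_total, size=n_annotated, replace=False)    # segments we pay to annotate
h, m = human_all[idx], metric_all[idx]

naive_estimate = h.mean()
c = np.cov(h, m)[0, 1] / m.var(ddof=1)                        # optimal control-variate coefficient
cv_estimate = h.mean() - c * (m.mean() - metric_all.mean())

print(f"true mean        {human_all.mean():.4f}")
print(f"naive sample     {naive_estimate:.4f}")
print(f"control variate  {cv_estimate:.4f}")
```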
In subjective decision-making, where decisions are based on contextual interpretation, Large Language Models (LLMs) can be integrated to present users with additional rationales to consider. The diversity of these rationales is mediated by the ability to consider the perspectives of different social actors. However, it remains unclear whether and how models differ in the distribution of perspectives they provide. We compare the perspectives taken by humans and different LLMs when assessing subtle sexism scenarios. We show that these perspectives can be classified within a finite set (perpetrator, victim, decision-maker), consistently present in argumentations produced by humans and LLMs, but in different distributions and combinations, demonstrating differences and similarities with human responses, and between models. We argue for the need to systematically evaluate LLMs' perspective-taking to identify the most suitable models for a given decision-making task. We discuss the implications for model evaluation.
LLM-as-a-Judge has been widely applied to evaluate and compare different LLM alignment approaches (e.g., RLHF and DPO). However, concerns regarding its reliability have emerged, due to LLM judges' biases and inconsistent decision-making. Previous research has developed evaluation frameworks to assess the reliability of LLM judges and their alignment with human preferences. However, the employed evaluation metrics often lack adequate explainability and fail to address LLM internal inconsistency. Additionally, existing studies inadequately explore the impact of various prompt templates when applying LLM-as-a-Judge methods, leading to potentially inconsistent comparisons between different alignment algorithms. In this work, we systematically evaluate LLM-as-a-Judge on alignment tasks by defining more theoretically interpretable evaluation metrics and explicitly mitigating LLM internal inconsistency from reliability metrics. We develop an open-source framework to evaluate, compare, and visualize the reliability and alignment of LLM judges, which facilitates practitioners in choosing LLM judges for alignment tasks. In the experiments, we examine the effects of diverse prompt templates on LLM-judge reliability and also demonstrate our developed framework by comparing various LLM judges on two common alignment datasets (i.e., TL;DR Summarization and HH-RLHF-Helpfulness). Our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre alignment level between the tested LLM judges and human evaluators.
Sycophantic response patterns in Large Language Models (LLMs) have been increasingly claimed in the literature. We review methodological challenges in measuring LLM sycophancy and identify five core operationalizations. Despite sycophancy being inherently human-centric, current research does not evaluate human perception. Our analysis highlights the difficulties in distinguishing sycophantic responses from related concepts in AI alignment and offers actionable recommendations for future research.
Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore if such an ability of the LLMs can be used as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to evaluate the texts in two NLP tasks: open-ended story generation and adversarial attacks. We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs. We also find that the results of LLM evaluation are stable over different formatting of the task instructions and the sampling algorithm used to generate the answer. We are the first to show the potential of using LLMs to assess the quality of texts and discuss the limitations and ethical considerations of LLM evaluation.
Research in interpretable machine learning proposes different computational and human-subject approaches to evaluate model saliency explanations. These approaches measure different qualities of explanations to achieve diverse goals in designing interpretable machine learning systems. In this paper, we propose a human attention benchmark for image and text domains using multi-layer human attention masks aggregated from multiple human annotators. We then present an evaluation study of model saliency explanations obtained using Grad-CAM and LIME techniques. We demonstrate our benchmark's utility for quantitative evaluation of model explanations by comparing it with human subjective ratings and with evaluations against ground-truth single-layer segmentation masks. Our study results show that our threshold-agnostic evaluation method with the human attention baseline is more effective than using single-layer object segmentation masks as ground truth. Our experiments also reveal user biases in the subjective rating of model saliency explanations.
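One way to realize a threshold-agnostic comparison between a model saliency map and aggregated human attention is to score pixels with ROC-AUC, keeping the saliency map continuous. The sketch below does this on random arrays standing in for real Grad-CAM or LIME maps and human masks; it is an illustration, not the benchmark's actual protocol.

```python
# Sketch: pixel-level ROC-AUC between a saliency map and a human attention mask.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
h, w = 32, 32
human_attention = rng.random((h, w))                     # fraction of annotators marking each pixel
saliency = human_attention + 0.5 * rng.random((h, w))    # an imperfect model explanation

# binarize the human mask at "majority of annotators", keep saliency continuous
labels = (human_attention > 0.5).astype(int).ravel()
auc = roc_auc_score(labels, saliency.ravel())
print(f"pixel-level ROC-AUC vs. human attention: {auc:.3f}")
```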
Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. This alignment requires a vast corpus of human feedback to learn a reward model, which is subsequently used to finetune language models. However, we have identified that the reward model often finds shortcuts to bypass its intended objectives, misleadingly assuming that humans prefer longer responses. The emergence of length bias often induces the model to favor longer outputs, yet it doesn't equate to an increase in helpful information within these outputs. In this paper, we propose an innovative solution, applying the Product-of-Experts (PoE) technique to separate reward modeling from the influence of sequence length. In our framework, the main expert concentrates on understanding human intents, while the biased expert targets the identification and capture of length bias. To further enhance the learning of bias, we introduce perturbations into the bias-focused expert, disrupting the flow of semantic information. Experimental results validate the effectiveness of our approach, indicating that language model performance is improved, irrespective of sequence length.
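The Product-of-Experts construction can be illustrated with a two-expert toy: a main expert that scores content quality and a bias expert that only sees length, whose Bradley-Terry preference probabilities are multiplied and renormalized during training. Everything below, including the hand-set scores, is an invented stand-in for the paper's learned reward networks.

```python
# Sketch: Product-of-Experts preference probability for reward modeling,
# separating a content-quality expert from a length-only bias expert.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_prob(r_a, r_b):
    """Bradley-Terry probability that response A is preferred over B."""
    return sigmoid(r_a - r_b)

def poe_preference(main_a, main_b, bias_a, bias_b):
    p_main = preference_prob(main_a, main_b)
    p_bias = preference_prob(bias_a, bias_b)
    joint_a = p_main * p_bias                     # product of experts for "A wins"
    joint_b = (1 - p_main) * (1 - p_bias)         # product of experts for "B wins"
    return joint_a / (joint_a + joint_b)          # renormalize

# A is shorter but genuinely better; B is longer.
main_quality = {"A": 2.0, "B": 0.5}                  # main expert (content quality)
length_score = {"A": np.log(120), "B": np.log(400)}  # bias expert (length only)

print("main expert alone  :", preference_prob(main_quality["A"], main_quality["B"]))
print("PoE during training:", poe_preference(main_quality["A"], main_quality["B"],
                                              length_score["A"], length_score["B"]))
```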
Organizations have recently started to adopt AI-based decision support tools to optimize human resource development practices, while facing various challenges in using AI in highly contextual and sensitive domains. We present a case study that aims to help professional assessors make decisions in human assessment, in which they conduct interviews with assessees and evaluate their suitability for certain job roles. Our workshop with two industrial assessors elucidated the difficulties they face (i.e., maintaining stable and non-subjective observation of assessees' behaviors) and derived requirements for AI systems (i.e., extracting nonverbal cues from interview videos in an interpretable manner). In response, we employed an unsupervised anomaly detection algorithm using multimodal behavioral features such as facial keypoints, body and head pose, and gaze. The algorithm extracts outlier scenes from the video based on behavioral features and indicates which features contribute to the outlierness. We first evaluated how the assessors would perceive the extracted cues and found that the algorithm is useful in suggesting scenes to which assessors should pay attention, thanks to its interpretability. Then, we developed an interface prototype incorporating the algorithm and had six assessors use it for their actual assessments. Their comments revealed the effectiveness of introducing unsupervised anomaly detection to enhance their sense of confidence and the objectivity of the assessment, along with potential use scenarios for such AI-based systems in human assessment. Our approach, which builds on the idea of separating observation and interpretation in human-AI collaboration, can facilitate human decision making in highly contextual domains, such as human assessment, while preserving assessors' trust in the system.
With the rise of AI systems in real-world applications comes the need for reliable and trustworthy AI. Explainable AI systems are an essential aspect of this. However, there is no agreed standard on how explainable AI systems should be assessed. Inspired by the Turing test, we introduce a human-centric assessment framework in which a leading domain expert accepts or rejects the solutions of an AI system and of another domain expert. By comparing the acceptance rates of the provided solutions, we can assess how the AI system performs compared to the domain expert, and whether the AI system's explanations (if provided) are human-understandable. This setup -- comparable to the Turing test -- can serve as a framework for a wide range of human-centric AI system assessments. We demonstrate this by presenting two instantiations: (1) an assessment that measures the classification accuracy of a system, with the option to incorporate label uncertainties; (2) an assessment in which the usefulness of provided explanations is determined in a human-centric manner.
In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: https://github.com/kevinlwei/human-baselines
Foundation models (FM), such as large language models (LLMs), which are large-scale machine learning (ML) models, have demonstrated remarkable adaptability in various downstream software engineering (SE) tasks, such as code completion, code understanding, and software development. As a result, FM leaderboards have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to perform effective FM selection. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations", or LBOps) and identifying potential pitfalls and areas for improvement ("leaderboard smells"). In this regard, we collect up to 1,045 FM leaderboards from five different sources: GitHub, Hugging Face Spaces, Papers With Code, spreadsheets, and independent platforms, to examine their documentation and engage in direct communication with leaderboard operators to understand their workflows. Through card sorting and negotiated agreement, we identify five distinct workflow patterns and develop a domain model that captures the key components and their interactions within these workflows. We then identify eight unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.
The growing integration of Large Language Models (LLMs) into critical societal domains has raised concerns about embedded biases that can perpetuate stereotypes and undermine fairness. Such biases may stem from historical inequalities in training data, linguistic imbalances, or adversarial manipulation. Despite mitigation efforts, recent studies show that LLMs remain vulnerable to adversarial attacks that elicit biased outputs. This work proposes a scalable benchmarking framework to assess LLM robustness to adversarial bias elicitation. Our methodology involves: (i) systematically probing models across multiple tasks targeting diverse sociocultural biases, (ii) quantifying robustness through safety scores using an LLM-as-a-Judge approach, and (iii) employing jailbreak techniques to reveal safety vulnerabilities. To facilitate systematic benchmarking, we release a curated dataset of bias-related prompts, named CLEAR-Bias. Our analysis, identifying DeepSeek V3 as the most reliable judge LLM, reveals that bias resilience is uneven, with age, disability, and intersectional biases among the most prominent. Some small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. However, no model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective across model families. We also find that successive LLM generations exhibit slight safety gains, while models fine-tuned for the medical domain tend to be less safe than their general-purpose counterparts.
Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using an LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's test set show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results on the high-complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% of ChatGPT's capacity on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM
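The core loop can be sketched as follows; `call_llm` is a hypothetical placeholder, and the single evolution prompt stands in for the several in-depth and in-breadth evolution operators (and the elimination of failed evolutions) that Evol-Instruct actually uses.

```python
# Minimal sketch of instruction evolution in the spirit of Evol-Instruct:
# an LLM repeatedly rewrites instructions into more complex variants, and the
# accumulated pool is later used to fine-tune a base model.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API here")

EVOLVE_PROMPT = (
    "Rewrite the following instruction so that it becomes more complex, "
    "e.g. by adding constraints, requiring multi-step reasoning, or deepening "
    "the topic, while keeping it answerable:\n\n{instruction}"
)

def evolve(seed_instructions: list[str], rounds: int = 3) -> list[str]:
    pool = list(seed_instructions)
    frontier = list(seed_instructions)
    for _ in range(rounds):
        frontier = [call_llm(EVOLVE_PROMPT.format(instruction=ins)) for ins in frontier]
        pool.extend(frontier)   # mixed pool of all complexity levels for fine-tuning
    return pool
```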
Developing intelligent pediatric consultation systems offers promising prospects for improving diagnostic efficiency, especially in China, where healthcare resources are scarce. Despite recent advances in Large Language Models (LLMs) for Chinese medicine, their performance is sub-optimal in pediatric applications due to inadequate instruction data and vulnerable training procedures. To address these issues, this paper builds PedCorpus, a high-quality dataset of over 300,000 multi-task instructions drawn from pediatric textbooks, guidelines, and knowledge graph resources to fulfil diverse diagnostic demands. Building on PedCorpus, we propose PediatricsGPT, the first Chinese pediatric LLM assistant built on a systematic and robust training pipeline. In the continuous pre-training phase, we introduce a hybrid instruction pre-training mechanism to mitigate the inconsistency between the model's internal knowledge and the injected medical knowledge during domain adaptation. Subsequently, full-parameter Supervised Fine-Tuning (SFT) is used to incorporate the general medical knowledge schema into the models. After that, we devise a direct following preference optimization to enhance the generation of pediatrician-like humanistic responses. In the parameter-efficient secondary SFT phase, a mixture of universal-specific experts strategy is presented to resolve the competency conflict between medical generalist ability and pediatric expertise mastery. Extensive results based on automatic metrics, GPT-4, and doctor evaluations on distinct doctor downstream tasks show that PediatricsGPT consistently outperforms previous Chinese medical LLMs. Our model and dataset will be open-sourced for community development.
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications. Current hallucination detection and mitigation datasets are limited in domains and sizes, and they struggle to scale due to prohibitive labor costs and the insufficient reliability of existing hallucination annotators. To facilitate the scalable oversight of LLM hallucinations, this paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset and improves the accuracy of the hallucination annotator. Based on the Expectation Maximization (EM) algorithm, in each iteration the framework first applies a hallucination annotation pipeline to annotate a scaled dataset and then trains a more accurate hallucination annotator on that dataset. The new hallucination annotator is adopted in the hallucination annotation pipeline used for the next iteration. Extensive experimental results demonstrate that the final hallucination annotator, with only 7B parameters, surpasses the performance of GPT-4 and obtains new state-of-the-art hallucination detection results on HaluEval and HalluQA by zero-shot inference. Such an annotator can not only evaluate the hallucination levels of various LLMs on the large-scale dataset but also help to mitigate hallucination in LLM generations, with the Natural Language Inference (NLI) metric increasing from 25% to 37% on HaluEval.
We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ensemble and oracle combination, and provide learning curves to help us understand which languages are more challenging. A number of difficult sentences are identified and investigated further with human annotation.
This paper presents the challenges in creating and managing large parallel corpora of 12 major Indian languages (soon to be extended to 23 languages) as part of a major consortium project funded by the Department of Information Technology (DIT), Govt. of India, and running in parallel at 10 different universities in India. In order to efficiently manage the process of creating and disseminating these huge corpora, the web-based annotation tool ILCIANN (Indian Languages Corpora Initiative Annotation Tool), with a reduced stand-alone version as well, has been developed. It was primarily developed for POS annotation and for managing corpus annotation by people with differing levels of competence working at locations physically far apart. In order to maintain consistency and standards in the creation of the corpora, it was necessary that everyone work on a common platform, which this tool provides.
Recent evaluations of Large Language Models (LLMs) have centered around testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, the evaluation of LLMs utilizing complex tools to finish multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence, thus it supports various LLM-generated API sequences. We measure 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy. We find three main error causes in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at https://github.com/gydpku/PPTC.
Evaluating the creative capabilities of large language models (LLMs) in complex tasks often requires human assessments that are difficult to scale. We introduce a novel, scalable methodology for evaluating LLM story generation by analyzing underlying social structures in narratives as signed character networks. To demonstrate its effectiveness, we conduct a large-scale comparative analysis using networks from over 1,200 stories, generated by four leading LLMs (GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash) and a human-written corpus. Our findings, based on network properties like density, clustering, and signed edge weights, show that LLM-generated stories consistently exhibit a strong bias toward tightly-knit, positive relationships, which aligns with findings from prior research using human assessment. Our proposed approach provides a valuable tool for evaluating limitations and tendencies in the creative storytelling of current and future LLMs.
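To illustrate the kind of metrics involved, the sketch below builds a tiny signed character network (with invented characters and valences) and computes density, clustering, and the mean signed edge weight with networkx; extracting characters and relationship valences from story text is the harder step and is not shown.

```python
# Minimal sketch of the network-based evaluation idea: nodes are characters,
# edge weights in [-1, 1] encode relationship valence for one story.
import networkx as nx

edges = [
    ("alice", "bob", 0.8),     # allies
    ("alice", "carol", 0.6),
    ("bob", "carol", 0.7),
    ("carol", "dan", -0.5),    # antagonism
]

G = nx.Graph()
G.add_weighted_edges_from(edges)

density = nx.density(G)
clustering = nx.average_clustering(G)   # unweighted clustering coefficient
mean_sign = sum(w for _, _, w in G.edges.data("weight")) / G.number_of_edges()

print(f"density={density:.2f} clustering={clustering:.2f} mean_signed_weight={mean_sign:.2f}")
```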
Using large language models (LLMs) to evaluate text quality has recently gained popularity. Some prior works explore the idea of using LLMs for evaluation, while they differ in some details of the evaluation process. In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details in the evaluation process change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Last, we reveal that asking the LLM to explain its own ratings consistently improves the correlation between ChatGPT's ratings and human ratings and achieves state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
Evaluating large language model (LLM)-based multi-agent systems remains a critical challenge, as these systems must exhibit reliable coordination, transparent decision-making, and verifiable performance across evolving tasks. Existing evaluation approaches often limit themselves to single-response scoring or narrow benchmarks, which lack stability, extensibility, and automation when deployed in enterprise settings at multi-agent scale. We present AEMA (Adaptive Evaluation Multi-Agent), a process-aware and auditable framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under human oversight. Compared to a single LLM-as-a-Judge, AEMA achieves greater stability, human alignment, and traceable records that support accountable automation. Our results on enterprise-style agent workflows simulated using realistic business scenarios demonstrate that AEMA provides a transparent and reproducible pathway toward responsible evaluation of LLM-based multi-agent systems.
As Large Language Models (LLMs) transition from static tools to autonomous agents, traditional evaluation benchmarks that measure performance on downstream tasks are becoming insufficient. These methods fail to capture the emergent social and cognitive dynamics that arise when agents communicate, persuade, and collaborate in interactive environments. To address this gap, we introduce a novel evaluation framework that uses multi-agent debate as a controlled "social laboratory" to discover and quantify these behaviors. In our framework, LLM-based agents, instantiated with distinct personas and incentives, deliberate on a wide range of challenging topics under the supervision of an LLM moderator. Our analysis, enabled by a new suite of psychometric and semantic metrics, reveals several key findings. Across hundreds of debates, we uncover a powerful and robust emergent tendency for agents to seek consensus, consistently reaching high semantic agreement (μ > 0.88) even without explicit instruction and across sensitive topics. We show that assigned personas induce stable, measurable psychometric profiles, particularly in cognitive effort, and that the moderator's persona can significantly alter debate outcomes by structuring the environment, a key finding for external AI alignment. This work provides a blueprint for a new class of dynamic, psychometrically grounded evaluation protocols designed for the agentic setting, offering a crucial methodology for understanding and shaping the social behaviors of the next generation of AI agents. We have released the code and results at https://github.com/znreza/multi-agent-LLM-eval-for-debate.
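One simple way to operationalise the reported semantic agreement, not necessarily the paper's exact metric, is the mean pairwise cosine similarity between embeddings of the agents' final statements; the sentence-transformers model named below is an arbitrary choice and the statements are invented.

```python
# Mean pairwise cosine similarity between agents' final positions as a rough
# proxy for "semantic agreement" in a debate.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

final_positions = [
    "We should adopt the policy with safeguards for affected workers.",
    "Adopting the policy is reasonable provided workers are protected.",
    "The policy is acceptable if it includes worker protections.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # arbitrary example model
emb = model.encode(final_positions, convert_to_tensor=True)

pairs = list(combinations(range(len(final_positions)), 2))
agreement = sum(float(util.cos_sim(emb[i], emb[j])) for i, j in pairs) / len(pairs)
print(f"mean pairwise agreement = {agreement:.2f}")
```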
As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the performance of large language models (LLMs). However, there is a lack of systematic analysis and guidance on determining the sufficiency of test data or selecting informative samples for evaluation. This paper introduces a certifiable and cost-efficient evaluation framework for LLMs. Our framework adapts to different evaluation objectives and outputs confidence intervals that contain true values with high probability. We use "test sample complexity" to quantify the number of test points needed for a certifiable evaluation and derive tight bounds on test sample complexity. Based on the developed theory, we develop a partition-based algorithm, named Cer-Eval, that adaptively selects test points to minimize the cost of LLM evaluation. Real-world experiments demonstrate that Cer-Eval can save 20% to 40% of test points across various benchmarks, while maintaining an estimation error level comparable to the current evaluation process and providing a 95% confidence guarantee.
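As a rough point of reference for the notion of test sample complexity, the non-adaptive Hoeffding bound below gives the number of i.i.d. test points needed to estimate a metric bounded in [0, 1] to within ±ε at confidence 1−δ; Cer-Eval's adaptive, partition-based selection is designed to improve on this kind of baseline.

```python
# Non-adaptive baseline for test sample complexity via the Hoeffding bound:
# n >= ln(2/delta) / (2 * eps^2) points suffice for a (1 - delta) confidence
# interval of half-width eps around the true value of a [0, 1]-bounded metric.
import math

def hoeffding_sample_size(eps: float, delta: float) -> int:
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(hoeffding_sample_size(eps=0.02, delta=0.05))   # ~4612 points for a 95% CI of width 0.04
```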
As AI technology continues to advance, the importance of human-AI collaboration becomes increasingly evident, with numerous studies exploring its potential in various fields. One vital field is data science, including feature engineering (FE), where both human ingenuity and AI capabilities play pivotal roles. Despite the existence of AI-generated recommendations for FE, there remains a limited understanding of how to effectively integrate and utilize humans' and AI's knowledge. To address this gap, we design a readily-usable prototype, human&AI-assisted FE in Jupyter notebooks. It harnesses the strengths of humans and AI to provide feature suggestions to users, seamlessly integrating these recommendations into practical workflows. Using the prototype as a research probe, we conducted an exploratory study to gain valuable insights into data science practitioners' perceptions, usage patterns, and their potential needs when presented with feature suggestions from both humans and AI. Through qualitative analysis, we discovered that the Creator of the feature (i.e., AI or human) significantly influences users' feature selection, and the semantic clarity of the suggested feature greatly impacts its adoption rate. Furthermore, our findings indicate that users perceive both differences and complementarity between features generated by humans and those generated by AI. Lastly, based on our study results, we derived a set of design recommendations for future human&AI FE design. Our findings show the collaborative potential between humans and AI in the field of FE.
Many important decisions in daily life are made with the help of advisors, e.g., decisions about medical treatments or financial investments. Whereas in the past, advice has often been received from human experts, friends, or family, advisors based on artificial intelligence (AI) have become more and more present nowadays. Typically, the advice generated by AI is judged by a human and either deemed reliable or rejected. However, recent work has shown that AI advice is not always beneficial, as humans have been shown to be unable to ignore incorrect AI advice, essentially representing an over-reliance on AI. Therefore, the aspired goal should be to enable humans not to rely on AI advice blindly but rather to distinguish its quality and act upon it to make better decisions. Specifically, that means that humans should rely on the AI in the presence of correct advice and self-rely when confronted with incorrect advice, i.e., establish appropriate reliance (AR) on AI advice on a case-by-case basis. Current research lacks a metric for AR. This prevents a rigorous evaluation of factors impacting AR and hinders further development of human-AI decision-making. Therefore, based on the literature, we derive a measurement concept of AR. We propose to view AR as a two-dimensional construct that measures the ability to discriminate advice quality and behave accordingly. In this article, we derive the measurement concept, illustrate its application and outline potential future research.
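A hedged sketch of how the two dimensions could be operationalised from case-level logs (initial human answer, AI advice, final answer, ground truth): how often the human switches to the AI when the AI is right and the human was initially wrong, and how often the human keeps their own correct answer when the AI is wrong. The field names and formulas are illustrative assumptions, not the article's measurement concept verbatim.

```python
# Two-dimensional reliance sketch: relative AI reliance (switching to correct
# advice) and relative self-reliance (resisting incorrect advice).
def reliance_scores(cases):
    # each case: {"initial": ..., "advice": ..., "final": ..., "truth": ...}
    switch_to_correct_ai = relevant_ai = keep_correct_self = relevant_self = 0
    for c in cases:
        human_right = c["initial"] == c["truth"]
        ai_right = c["advice"] == c["truth"]
        if ai_right and not human_right:              # AI could correct the human
            relevant_ai += 1
            switch_to_correct_ai += c["final"] == c["advice"]
        if human_right and not ai_right:              # human should resist the AI
            relevant_self += 1
            keep_correct_self += c["final"] == c["initial"]
    rair = switch_to_correct_ai / relevant_ai if relevant_ai else float("nan")
    rsr = keep_correct_self / relevant_self if relevant_self else float("nan")
    return rair, rsr
```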
Fully automatic semantic segmentation of highly specific semantic classes and complex shapes may not meet the accuracy standards demanded by scientists. In such cases, human-centered AI solutions, able to assist operators while preserving human control over complex tasks, are a good trade-off to speed up image labeling while maintaining high accuracy levels. TagLab is an open-source AI-assisted software for annotating large orthoimages which takes advantage of different degrees of automation; it speeds up image annotation from scratch through assisted tools, creates custom fully automatic semantic segmentation models, and, finally, allows the quick edits of automatic predictions. Since the orthoimages analysis applies to several scientific disciplines, TagLab has been designed with a flexible labeling pipeline. We report our results in two different scenarios, marine ecology, and architectural heritage.
The utilization of face masks is an essential healthcare measure, particularly during times of pandemics, yet it can present challenges in communication in our daily lives. To address this problem, we propose a novel approach known as the human-in-the-loop StarGAN (HL-StarGAN) face-masked speech enhancement method. HL-StarGAN comprises a discriminator, a classifier, a metric assessment predictor, and a generator that leverages an attention mechanism. The metric assessment predictor, referred to as MaskQSS, incorporates human participants in its development and serves as a "human-in-the-loop" module during the learning process of HL-StarGAN. The overall HL-StarGAN model was trained using an unsupervised learning strategy that simultaneously focuses on the reconstruction of the original clean speech and the optimization of human perception. To implement HL-StarGAN, we curated a face-masked speech database named "FMVD," which comprises recordings from 34 speakers in three distinct face-masked scenarios and a clean condition. We conducted subjective and objective tests on the proposed HL-StarGAN using this database. The test results are as follows: (1) MaskQSS successfully predicted the quality scores of face-masked voices, outperforming several existing speech assessment methods. (2) The integration of the MaskQSS predictor enhanced the ability of HL-StarGAN to transform face-masked voices into high-quality speech; this enhancement is evident in both objective and subjective tests, outperforming conventional StarGAN and CycleGAN-based systems.
Generative AI assistants offer significant potential to enhance productivity, streamline information access, and improve user experience in enterprise contexts. In this work, we present Summit Concierge, a domain-specific AI assistant developed for Adobe Summit. The assistant handles a wide range of event-related queries and operates under real-world constraints such as data sparsity, quality assurance, and rapid deployment. To address these challenges, we adopt a human-in-the-loop development workflow that combines prompt engineering, retrieval grounding, and lightweight human validation. We describe the system architecture, development process, and real-world deployment outcomes. Our experience shows that agile, feedback-driven development enables scalable and reliable AI assistants, even in cold-start scenarios.
With the fast development of Machine Translation (MT) systems, especially the new boost from Neural MT (NMT) models, MT output quality has reached a new level of accuracy. However, many researchers have criticised popular evaluation metrics such as BLEU for failing to correctly distinguish state-of-the-art NMT systems with respect to quality differences. In this short paper, we describe the design and implementation of a linguistically motivated human-in-the-loop evaluation metric looking into idiomatic and terminological Multi-word Expressions (MWEs). MWEs have been a bottleneck in many Natural Language Processing (NLP) tasks, including MT. MWEs can be used as one of the main factors to distinguish different MT systems by looking into their capabilities in recognising and translating MWEs in an accurate and meaning-equivalent manner.
Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs
In recent years, Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities, surpassing those seen in earlier language models. A particularly intriguing application of LLMs is their role as evaluators for texts produced by various generative models. In this study, we delve into the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models. Initially, we introduce an innovative approach for factuality assessment using LLMs. This entails employing a single LLM for the entirety of the question-answering-based factuality scoring process. Following this, we examine the efficacy of various LLMs in direct factuality scoring, benchmarking them against traditional measures and human annotations. Contrary to initial expectations, our results indicate a lack of significant correlations between factuality metrics and human evaluations, specifically for GPT-4 and PaLM-2. Notable correlations were only observed with GPT-3.5 across two factuality subcategories. These consistent findings across various factual error categories suggest a fundamental limitation in the current LLMs' capability to accurately gauge factuality.
We use the notion of oracle machines and reductions from computability theory to formalise different Human-in-the-loop (HITL) setups for AI systems, distinguishing between trivial human monitoring (i.e., total functions), single endpoint human action (i.e., many-one reductions), and highly involved human-AI interaction (i.e., Turing reductions). We then proceed to show that the legal status and safety of different setups vary greatly. We present a taxonomy to categorise HITL failure modes, highlighting the practical limitations of HITL setups. We then identify omissions in UK and EU legal frameworks, which focus on HITL setups that may not always achieve the desired ethical, legal, and sociotechnical outcomes. We suggest areas where the law should recognise the effectiveness of different HITL setups and assign responsibility in these contexts, avoiding human "scapegoating". Our work shows an unavoidable trade-off between attribution of legal responsibility and technical explainability. Overall, we show how HITL setups involve many technical design decisions and can be prone to failures outside the humans' control. Our formalisation and taxonomy open up a new analytic perspective on the challenges in creating HITL setups, helping inform AI developers and lawmakers on designing HITL setups to better achieve their desired outcomes.
High-quality human annotations are necessary for creating effective machine learning-driven stream processing systems. We study hybrid stream processing systems based on a Human-In-The-Loop Machine Learning (HITL-ML) paradigm, in which one or many human annotators and an automatic classifier (trained at least partially by the human annotators) label an incoming stream of instances. This is typical of many near-real-time social media analytics and web applications, including annotating social media posts during emergencies by digital volunteer groups. From a practical perspective, low-quality human annotations result in wrong labels for retraining automated classifiers and indirectly contribute to the creation of inaccurate classifiers. Considering human annotation as a psychological process allows us to address these limitations. We show that human annotation quality is dependent on the ordering of instances shown to annotators and can be improved by local changes in the instance sequence/order provided to the annotators, yielding a more accurate annotation of the stream. We adapt a theoretically-motivated human error framework of mistakes and slips for the human annotation task to study the effect of ordering instances (i.e., an "annotation schedule"). Further, we propose an error-avoidance approach to the active learning paradigm for stream processing applications robust to these likely human errors (in the form of slips) when deciding a human annotation schedule. We support the human error framework using crowdsourcing experiments and evaluate the proposed algorithm against standard baselines for active learning via extensive experimentation on classification tasks of filtering relevant social media posts during natural disasters.
Human-in-the-loop (HITL) frameworks are increasingly recognized for their potential to improve annotation accuracy in emotion estimation systems by combining machine predictions with human expertise. This study focuses on integrating a high-performing image-based emotion model into a HITL annotation framework to evaluate the collaborative potential of human-machine interaction and identify the psychological and practical factors critical to successful collaboration. Specifically, we investigate how varying model reliability and cognitive framing influence human trust, cognitive load, and annotation behavior in HITL systems. We demonstrate that model reliability and psychological framing significantly impact annotators' trust, engagement, and consistency, offering insights into optimizing HITL frameworks. Through three experimental scenarios with 29 participants--baseline model reliability (S1), fabricated errors (S2), and cognitive bias introduced by negative framing (S3)--we analyzed behavioral and qualitative data. Reliable predictions in S1 yielded high trust and annotation consistency, while unreliable outputs in S2 led to increased critical evaluations but also heightened frustration and response variability. Negative framing in S3 revealed how cognitive bias influenced participants to perceive the model as more relatable and accurate, despite misinformation regarding its reliability. These findings highlight the importance of both reliable machine outputs and psychological factors in shaping effective human-machine collaboration. By leveraging the strengths of both human oversight and automated systems, this study establishes a scalable HITL framework for emotion annotation and lays the foundation for broader applications in adaptive learning and human-computer interaction.
Reliable evaluation is crucial for advancing Automated Program Repair (APR), but prevailing benchmarks rely on execution-based evaluation methods (unit test pass@k), which fail to capture true patch validity. Determining validity can require costly manual annotation. To reduce this cost, we introduce a human-in-the-loop approach to LLM-based patch validity judgment. Inspired by the observation that human judgment is better aligned when using a shared rubric, we first employ an LLM to generate a per-bug rubric, followed by a one-time human review and optional refinement to this rubric, and then employ an LLM to judge patches using the refined rubric. We apply this approach to assign binary validity labels to patches for issues found by Google sanitizer tools. Our results show that this approach yields substantial agreement with human consensus (Cohen's kappa 0.75), high recall (0.94) and high precision (0.80), when considering patches that have unanimous agreement from 3 human raters on the validity labels. On the full dataset including patches where human raters disagree, we find this approach can still be further improved (Cohen's kappa 0.57, recall 0.93, precision 0.65) and identify possible future directions.
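Agreement statistics of the kind quoted above can be reproduced from two binary label vectors as in the sketch below, using scikit-learn's standard implementations; the labels are invented for illustration.

```python
# Agreement between LLM judgments and human-consensus patch-validity labels.
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # 1 = valid patch, 0 = invalid
llm   = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

print("kappa    ", round(cohen_kappa_score(human, llm), 2))
print("precision", round(precision_score(human, llm), 2))   # of patches the LLM calls valid, how many are
print("recall   ", round(recall_score(human, llm), 2))      # of truly valid patches, how many the LLM finds
```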
Misinformation threatens modern society by promoting distrust in science, changing narratives in public health, heightening social polarization, and disrupting democratic elections and financial markets, among a myriad of other societal harms. To address this, a growing cadre of professional fact-checkers and journalists provide high-quality investigations into purported facts. However, these largely manual efforts have struggled to match the enormous scale of the problem. In response, a growing body of Natural Language Processing (NLP) technologies have been proposed for more scalable fact-checking. Despite tremendous growth in such research, however, practical adoption of NLP technologies for fact-checking still remains in its infancy today. In this work, we review the capabilities and limitations of the current NLP technologies for fact-checking. Our particular focus is to further chart the design space for how these technologies can be harnessed and refined in order to better meet the needs of human fact-checkers. To do so, we review key aspects of NLP-based fact-checking: task formulation, dataset construction, modeling, and human-centered strategies, such as explainable models and human-in-the-loop approaches. Next, we review the efficacy of applying NLP-based fact-checking tools to assist human fact-checkers. We recommend that future research include collaboration with fact-checker stakeholders early on in NLP research, as well as incorporation of human-centered design practices in model development, in order to further guide technology development for human use and practical adoption. Finally, we advocate for more research on benchmark development supporting extrinsic evaluation of human-centered fact-checking technologies.
An essential aspect of evaluating Large Language Models (LLMs) is identifying potential biases. This is especially relevant considering the substantial evidence that LLMs can replicate human social biases in their text outputs and further influence stakeholders, potentially amplifying harm to already marginalized individuals and communities. Therefore, recent efforts in bias detection have invested in automated benchmarks and objective metrics such as accuracy (i.e., an LLM's output is compared against a predefined ground truth). Nonetheless, social biases can be nuanced, oftentimes subjective and context-dependent, where a situation is open to interpretation and there is no ground truth. While these situations can be difficult for automated evaluation systems to identify, human evaluators could potentially pick up on these nuances. In this paper, we discuss the role of human evaluation and subjective interpretation in augmenting automated processes when identifying biases in LLMs as part of a human-centred approach to evaluating these models.
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction were found to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP are not repeatable and/or not reproducible and/or too flawed to justify reproduction paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
HCI and NLP traditionally focus on different evaluation methods. While HCI involves a small number of people directly and deeply, NLP traditionally relies on standardized benchmark evaluations that involve a larger number of people indirectly. We present five methodological proposals at the intersection of HCI and NLP and situate them in the context of ML-based NLP models. Our goal is to foster interdisciplinary collaboration and progress in both fields by emphasizing what the fields can learn from each other.
Text simplification is essential for making public health information accessible to diverse populations, including those with limited health literacy. However, commonly used evaluation metrics in Natural Language Processing (NLP), such as BLEU, FKGL, and SARI, mainly capture surface-level features and fail to account for human-centered qualities like clarity, trustworthiness, tone, cultural relevance, and actionability. This limitation is particularly critical in high-stakes health contexts, where communication must be not only simple but also usable, respectful, and trustworthy. To address this gap, we propose the Human-Centered Readability Score (HCRS), a five-dimensional evaluation framework grounded in Human-Computer Interaction (HCI) and health communication research. HCRS integrates automatic measures with structured human feedback to capture the relational and contextual aspects of readability. We outline the framework, discuss its integration into participatory evaluation workflows, and present a protocol for empirical validation. This work aims to advance the evaluation of health text simplification beyond surface metrics, enabling NLP systems that align more closely with diverse users' needs, expectations, and lived experiences.
The "LLM-as-an-annotator" and "LLM-as-a-judge" paradigms employ Large Language Models (LLMs) as annotators, judges, and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure, the Alternative Annotator Test (alt-test), that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM annotators and judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans with closed-source LLMs (such as GPT-4o), outperforming the open-source LLMs we examine, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.
Large language models (LLMs) have shown high agreement with human raters across a variety of tasks, demonstrating potential to ease the challenges of human data collection. In computational social science (CSS), researchers are increasingly leveraging LLM annotations to complement slow and expensive human annotations. Still, guidelines for collecting and using LLM annotations, without compromising the validity of downstream conclusions, remain limited. We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected, with the goal of producing accurate statistical estimates and provably valid confidence intervals while reducing the number of human annotations needed. Our approach comes with safeguards against LLM annotations of poor quality, guaranteeing that the conclusions will be both valid and no less accurate than if we only relied on human annotations. We demonstrate the effectiveness of Confidence-Driven Inference over baselines in statistical estimation tasks across three CSS settings--text politeness, stance, and bias--reducing the needed number of human annotations by over 25% in each. Although we use CSS settings for demonstration, Confidence-Driven Inference can be used to estimate most standard quantities across a broad range of NLP problems.
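A simplified illustration of the underlying idea, debiasing cheap LLM labels with a small human-labeled subset on synthetic data; this toy version draws the human subset uniformly at random and applies a plain mean correction, whereas Confidence-Driven Inference additionally uses LLM confidence to decide which items humans should label and constructs provably valid confidence intervals.

```python
# Toy demonstration: correct an LLM-only prevalence estimate with a small
# human-labeled subset (humans treated as correct here).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
truth = rng.binomial(1, 0.4, n).astype(float)                        # unobserved gold labels
conf = rng.uniform(0.5, 1.0, n)                                      # LLM self-reported confidence
llm = np.where(rng.uniform(size=n) < conf, truth, 1 - truth)         # LLM more accurate when confident

budget = 300
human_idx = rng.choice(n, size=budget, replace=False)                # uniform random human subset (toy)
human = truth[human_idx]

naive = llm.mean()                                                   # LLM-only estimate of the prevalence
corrected = naive + (human - llm[human_idx]).mean()                  # debiasing correction from the subset
print(f"LLM-only: {naive:.3f}  corrected: {corrected:.3f}  true: {truth.mean():.3f}")
```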
The rise of online platforms exacerbated the spread of hate speech, demanding scalable and effective detection. However, the accuracy of hate speech detection systems heavily relies on human-labeled data, which is inherently susceptible to biases. While previous work has examined the issue, the interplay between the characteristics of the annotator and those of the target of the hate is still unexplored. We fill this gap by leveraging an extensive dataset with rich socio-demographic information of both annotators and targets, uncovering how human biases manifest in relation to the target's attributes. Our analysis surfaces the presence of widespread biases, which we quantitatively describe and characterize based on their intensity and prevalence, revealing marked differences. Furthermore, we compare human biases with those exhibited by persona-based LLMs. Our findings indicate that while persona-based LLMs do exhibit biases, these differ significantly from those of human annotators. Overall, our work offers new and nuanced results on human biases in hate speech annotations, as well as fresh insights into the design of AI-driven hate speech detection systems.
Large language models (LLMs) can label data faster and cheaper than humans for various NLP tasks. Despite their prowess, LLMs may fall short in understanding complex, sociocultural, or domain-specific contexts, potentially leading to incorrect annotations. Therefore, we advocate a collaborative approach where humans and LLMs work together to produce reliable and high-quality labels. We present MEGAnno+, a human-LLM collaborative annotation system that offers effective LLM agent and annotation management, convenient and robust LLM annotation, and exploratory verification of LLM labels by humans.
Traditional image annotation tasks rely heavily on human effort for object selection and label assignment, making the process time-consuming and prone to decreased efficiency as annotators experience fatigue after extensive work. This paper introduces a novel framework that leverages the visual understanding capabilities of large multimodal models (LMMs), particularly GPT, to assist annotation workflows. In our proposed approach, human annotators focus on selecting objects via bounding boxes, while the LMM autonomously generates relevant labels. This human-AI collaborative framework enhances annotation efficiency by reducing the cognitive and time burden on human annotators. By analyzing the system's performance across various types of annotation tasks, we demonstrate its ability to generalize to tasks such as object recognition, scene description, and fine-grained categorization. Our proposed framework highlights the potential of this approach to redefine annotation workflows, offering a scalable and efficient solution for large-scale data labeling in computer vision. Finally, we discuss how integrating LMMs into the annotation pipeline can advance bidirectional human-AI alignment, as well as the challenges of alleviating the "endless annotation" burden in the face of information overload by shifting some of the work to AI.
Do LLMs align with human perceptions of safety? We study this question via annotation alignment, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of r = 0.59 with the average annotator rating, higher than the median annotator's correlation with the average (r = 0.51). We show that larger datasets are needed to resolve whether LLMs exhibit disparities in how well they correlate with different demographic groups. Also, there is substantial idiosyncratic variation in correlation within groups, suggesting that race and gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
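The alignment statistic can be computed as in the sketch below from a ratings matrix; the numbers are invented, and the annotator-side comparison uses a leave-one-out average, which is one reasonable (but assumed) reading of "the median annotator's correlation with the average".

```python
# Pearson correlation of a model's safety ratings with the average human rating,
# compared against each annotator's leave-one-out correlation.
import numpy as np
from scipy.stats import pearsonr

human_ratings = np.array([          # rows = annotators, cols = conversations
    [1, 3, 4, 2, 5, 1],
    [2, 3, 5, 2, 4, 1],
    [1, 2, 4, 3, 5, 2],
])
gpt4_ratings = np.array([1, 3, 5, 2, 4, 1])

avg_human = human_ratings.mean(axis=0)
r_model, _ = pearsonr(gpt4_ratings, avg_human)

r_annotators = []
for i in range(human_ratings.shape[0]):
    others = np.delete(human_ratings, i, axis=0).mean(axis=0)   # average of the other annotators
    r_annotators.append(pearsonr(human_ratings[i], others)[0])

print(f"model vs. avg human: {r_model:.2f}  median annotator: {np.median(r_annotators):.2f}")
```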
This report synthesizes multiple limitations in the evaluation of large language models: traditional human evaluation faces serious cognitive biases, high costs, and scalability bottlenecks, while the alternative "LLM-as-a-Judge" paradigm, though efficient, introduces new risks such as length bias and narcissistic (self-preference) tendencies. The field's research focus is shifting from competition over single metrics toward building human-in-the-loop (HITL) collaboration frameworks, establishing expert evaluation protocols for specific high-stakes domains, and developing dynamic, fine-grained evaluation methodologies for agents and multimodal tasks.