Limitations of Human Evaluation of Large Language Models
Cognitive Biases, Subjectivity, and Individual Differences in Human Annotation
This group of studies examines the inherent flaws of humans when they serve as the 'gold standard', including inconsistencies driven by demographic background, psychological biases (e.g., gender, race, culture), task-order effects, and divergent readings of subjective phenomena such as irony or aesthetics (a minimal sketch quantifying annotator disagreement with Cohen's kappa follows the reference list below).
- Modeling and mitigating human annotation errors to design efficient stream processing systems with human-in-the-loop machine learning(Rahul Pandey, Hemant Purohit, Carlos Castillo, Valerie L. Shalin, 2020, ArXiv Preprint)
- MultiPICo: Multilingual Perspectivist Irony Corpus(Silvia Casola, Simona Frenda, Soda Marem Lo, Erhan Sezerer, Antonio Uva, Valerio Basile, Cristina Bosco, Alessandro Pedrani, Chiara Rubagotti, Viviana Patti, Davide Bernardi, 2024, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement(Gabriele Sarti, Vilém Zouhar, Malvina Nissim, Arianna Bisazza, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing)
- Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets(Tommaso Giorgi, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, Stefano Cresci, 2024, ArXiv Preprint)
- Exploring Subjectivity for more Human-Centric Assessment of Social Biases in Large Language Models(Paula Akemi Aoyagui, Sharon Ferguson, Anastasia Kuzminykh, 2024, ArXiv Preprint)
- Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation(Danielle R. Thomas, Conrad Borchers, Ken Koedinger, 2025, arXiv.org)
- Methods for assessing the quality of experts in the verification of large language models(D. Teterevenkov, 2025, Neurocomputers)
- Annotator-Centric Active Learning for Subjective NLP Tasks(Michiel van der Meer, Neele Falk, P. Murukannaiah, Enrico Liscio, 2024, Conference on Empirical Methods in Natural Language Processing)
- Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text(Jennifer Healey, Laurie Byrum, Md Nadeem Akhtar, Surabhi Bhargava, Moumita Sinha, 2025, ArXiv Preprint)
- Large Language Models are overconfident and amplify human bias(Feng Sun, Ningke Li, Kailong Wang, Lorenz F. Goette, 2025, arXiv.org)
- Sycophancy Claims about Language Models: The Missing Human-in-the-Loop(Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci, 2025, ArXiv Preprint)
- "All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations(Michael Hardy, 2024, arXiv.org)
- Beyond Mimicry: Auditing Human Bias to Build Fairer AI for Alzheimer’s Assessment(Liu He, Rui Feng, Xinran Han, Yinlong Liu, Jiahong Yuan, 2025, Companion Proceedings of the 27th International Conference on Multimodal Interaction)
- Modeling Subjectivity in Cognitive Appraisal with Language Models(Yuxiang Zhou, Hainiu Xu, Desmond C. Ong, P. Slovák, Yulan He, 2025, Conference on Empirical Methods in Natural Language Processing)
- Subjectivity in Stereotypes against Migrants in Italian: An Experimental Annotation Procedure(Soda Marem Lo, M. Stranisci, Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Elisabetta Jezek, Viviana Patti, 2025, Italian Conference on Computational Linguistics)
- Discriminating Similar Languages: Evaluations and Explorations(Cyril Goutte, Serge Léger, Shervin Malmasi, Marcos Zampieri, 2016, ArXiv Preprint)
- One Rating to Rule Them All?: Evidence of Multidimensionality in Human Assessment of Topic Labeling Quality(Amin Hosseiny Marani, Joshua Levine, Eric P. S. Baumer, 2022, Proceedings of the 31st ACM International Conference on Information & Knowledge Management)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models(Aparna Elangovan, Ling Liu, Lei Xu, S. Bodapati, Dan Roth, 2024, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- Inter-Annotator Agreement and Its Reflection in LLMs and Responsible AI(Amir Toliyat, Elena Filatova, Ronak Etemadpour, 2025, The International FLAIRS Conference Proceedings)
- Capturing Human Perspectives in NLP: Questionnaires, Annotations, and Biases(Wiktoria Mieleszczenko-Kowszewicz, Kamil Kanclerz, Julita Bielaniewicz, Marcin Oleksy, Marcin Gruza, Stanisław Woźniak, Ewa Dzieciol, Przemyslaw Kazienko, Jan Kocoń, 2023, No journal)
- Assessment of Agreement Between Human Ratings and Lexicon-Based Sentiment Ratings of Open-Ended Responses on a Behavioral Rating Scale(Olivia Gratz, D. Vos, Megan Burke, Neelkamal Soares, 2021, Assessment)
- Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models(Alaa Alhamzeh, Mays Al Rebdawi, 2025, arXiv.org)
- A Matter of Perspective(s): Contrasting Human and LLM Argumentation in Subjective Decision-Making on Subtle Sexism(Paula Akemi Aoyagui, Kelsey Stemmler, Sharon Ferguson, Young-ho Kim, Anastasia Kuzminykh, 2025, ArXiv Preprint)
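Most of the papers above quantify the disagreement they study with chance-corrected agreement statistics before deciding whether it reflects noise or legitimate perspective differences. The following is a minimal, self-contained sketch of pairwise Cohen's kappa on invented irony labels; it illustrates the statistic only and is not code from any of the cited works.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Hypothetical irony annotations from three annotators on the same ten items.
annotations = {
    "ann1": ["irony", "irony", "literal", "irony", "literal",
             "irony", "literal", "literal", "irony", "literal"],
    "ann2": ["irony", "literal", "literal", "irony", "literal",
             "irony", "irony", "literal", "irony", "literal"],
    "ann3": ["literal", "literal", "literal", "irony", "irony",
             "irony", "irony", "literal", "literal", "literal"],
}

for (name_a, a), (name_b, b) in combinations(annotations.items(), 2):
    print(f"kappa({name_a}, {name_b}) = {cohen_kappa(a, b):.2f}")
```

Perspectivist resources such as MultiPICo instead release the disaggregated labels, treating this kind of disagreement as signal rather than noise.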
Scalability Bottlenecks, Cost, and the Reproducibility Crisis of Human Evaluation
These studies analyze the challenges that purely human evaluation faces when confronted with large-scale model outputs, such as prohibitive monetary and time costs, irreproducibility caused by flawed experimental design, and engineering problems such as a lack of transparency (a back-of-the-envelope sample-size and cost sketch follows the reference list below).
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP(Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, Diyi Yang, 2023, ArXiv Preprint)
- An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment(Xuanxin Wu, Yuki Arase, 2024, ArXiv Preprint)
- Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark(Vitali Petsiuk, Alexander E. Siemenn, Saisamrit Surbehera, Zad Chin, Keith Tyser, Gregory Hunter, Arvind Raghavan, Yann Hicke, Bryan A. Plummer, Ori Kerret, Tonio Buonassisi, Kate Saenko, Armando Solar-Lezama, Iddo Drori, 2022, ArXiv Preprint)
- Evaluating LLM Story Generation through Large-scale Network Analysis of Social Structures(Hiroshi Nonaka, K. E. Perry, 2025, ArXiv Preprint)
- Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations(Kevin L. Wei, Patricia Paskov, Sunishchal Dev, Michael J. Byun, Anka Reuel, Xavier Roberts-Gaal, Rachel Calcott, Evie Coxon, Chinmay Deshpande, 2025, ArXiv Preprint)
- DeTAILS: Deep Thematic Analysis with Iterative LLM Support(Ansh Sharma, James R. Wallace, 2025, Proceedings of the 7th ACM Conference on Conversational User Interfaces)
- Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs(Ganghua Wang, Zhaorun Chen, Bo Li, Haifeng Xu, 2025, ArXiv Preprint)
- The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?(Sourav Banerjee, Ayushi Agarwal, Eishkaran Singh, 2024, arXiv.org)
- Toward More Effective Human Evaluation for Machine Translation(Belén Saldías, George Foster, Markus Freitag, Qijun Tan, 2022, ArXiv Preprint)
- Design and Evaluation of Cost-Aware PoQ for Decentralized LLM Inference(Arther Tian, A. Ding, Frank Chen, Alan Wu, Aaron Chan, Bruce Zhang, 2025, arXiv.org)
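Part of the cost problem is simply statistical: reliable win-rate estimates need many judgments. The sketch below is a back-of-the-envelope calculation using the normal approximation to the binomial; the per-judgment price is an arbitrary assumption, not a figure from the cited papers.

```python
import math

def required_judgments(margin: float, confidence: float = 0.95, p: float = 0.5) -> int:
    """Sample size for estimating a win rate within +/- margin,
    using the normal approximation to the binomial (worst case p = 0.5)."""
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]  # two-sided z value
    return math.ceil((z ** 2) * p * (1 - p) / margin ** 2)

COST_PER_JUDGMENT_USD = 0.50  # hypothetical crowd-work price per comparison

for margin in (0.05, 0.02, 0.01):
    n = required_judgments(margin)
    print(f"+/-{margin:.0%} margin -> {n:>6d} judgments "
          f"(~${n * COST_PER_JUDGMENT_USD:,.0f} at the assumed rate)")
```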
Reliability Deficits and Automation Bias of LLM-as-a-Judge
These works examine the limitations of substituting LLMs for humans in automatic evaluation, including length bias, self-enhancement (narcissism) bias, rating inconsistency ("Rating Roulette"), weak complex-reasoning judgment, and the alignment gap with genuine human preferences (a small sketch for probing length bias and order-swap inconsistency follows the reference list below).
- PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization(Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xingxu Xie, Wei Ye, Shikun Zhang, Yue Zhang, 2023, International Conference on Learning Representations)
- Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs(Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen, Shashi Bhushan TN, 2023, ArXiv Preprint)
- Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks(Rajarshi Haldar, J. Hockenmaier, 2025, Findings of the Association for Computational Linguistics: EMNLP 2025)
- A Closer Look into Automatic Evaluation Using Large Language Models(Cheng-Han Chiang, Hung-yi Lee, 2023, ArXiv Preprint)
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge(Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, Domenico Talia, 2025, ArXiv Preprint)
- Are We on the Right Way to Assessing LLM-as-a-Judge?(Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen, 2025, arXiv.org)
- Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments(Roland Daynauth, Jason Mars, 2024, arXiv.org)
- Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates(Hui Wei, Shenghua He, Tian Xia, Fei Liu, Andy Wong, Jingyang Lin, Mei Han, 2024, ArXiv Preprint)
- No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding(Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner, 2025, arXiv.org)
- Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback(Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang, 2023, ArXiv Preprint)
- Rethinking Human Preference Evaluation of LLM Rationales(Ziang Li, Manasi Ganti, Zixian Ma, Helena Vasconcelos, Qijia He, Ranjay Krishna, 2025, arXiv.org)
- Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks(Qintong Li, Leyang Cui, Lingpeng Kong, Wei Bi, 2023, International Conference on Computational Linguistics)
- LLM-based relevance assessment still can't replace human relevance assessment(Charles L. A. Clarke, Laura Dietz, 2024, International Workshop on Evaluating Information Access)
- Can Large Language Models Be an Alternative to Human Evaluations?(Cheng-Han Chiang, Hung-yi Lee, 2023, ArXiv Preprint)
- Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator(Peiwen Yuan, Yiwei Li, Shaoxiong Feng, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li, 2025, arXiv.org)
- AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment(Kun Li, L. Po, Hongzheng Yang, Xuyuan Xu, Kangcheng Liu, Yuzhi Zhao, 2025, Conference on Empirical Methods in Natural Language Processing)
- The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs(Nitay Calderon, Roi Reichart, Rotem Dror, 2025, ArXiv Preprint)
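Two of the failure modes listed above, length bias and rating inconsistency, can be probed without any gold labels: correlate the judge's scores with response length, and re-judge each pair with the candidate order swapped. The sketch below only shows the bookkeeping; `judge_score` and `judge_prefer` are hypothetical placeholders for calls to whatever judge model is being audited.

```python
from statistics import mean

def spearman_rho(x, y):
    """Spearman correlation via Pearson correlation of ranks (no tie averaging)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var if var else 0.0

def audit_judge(items, judge_score, judge_prefer):
    """items: dicts with 'prompt', 'answer_a', 'answer_b'.

    judge_score(prompt, answer) -> numeric score from the judge under audit.
    judge_prefer(prompt, first, second) -> the preferred answer *text*;
    a consistent judge returns the same text regardless of presentation order.
    """
    # 1) Length bias: do longer answers systematically receive higher scores?
    lengths, scores = [], []
    for it in items:
        for ans in (it["answer_a"], it["answer_b"]):
            lengths.append(len(ans.split()))
            scores.append(judge_score(it["prompt"], ans))
    rho = spearman_rho(lengths, scores)

    # 2) Position consistency: does the verdict survive swapping the order?
    flips = sum(
        judge_prefer(it["prompt"], it["answer_a"], it["answer_b"])
        != judge_prefer(it["prompt"], it["answer_b"], it["answer_a"])
        for it in items
    )
    return {"length_score_rho": rho, "position_flip_rate": flips / len(items)}
```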
Human-in-the-Loop (HITL) Collaboration and Optimization of Hybrid Evaluation Workflows
These studies aim to combine deep human insight with the efficiency of LLMs, mitigating the weaknesses of any single evaluation mode through human-in-the-loop mechanisms such as active learning, feedback loops, AI pre-annotation, and human verification (a minimal confidence-threshold routing sketch follows the reference list below).
- Towards a Human-in-the-Loop Framework for Reliable Patch Evaluation Using an LLM-as-a-Judge(Sherry Shi, Renyao Wei, Michele Tufano, José Cambronero, Runxiang Cheng, Franjo Ivančić, Pat Rondon, 2025, ArXiv Preprint)
- Eras: Improving the quality control in the annotation process for Natural Language Processing tasks(Jonatas S. Grosman, P. Furtado, A. M. B. Rodrigues, Guilherme Gonçalves Schardong, Simone Diniz Junqueira Barbosa, H. Lopes, 2020, Information Systems)
- Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications(Leila Tavakoli, Hamed Zamani, 2025, Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR))
- LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages(Nataliia Kholodna, Sahib Julka, M. Khodādādi, M. Gumus, Michael Granitzer, 2024, No journal)
- Can Unconfident LLM Annotations Be Used for Confident Conclusions?(Kristina Gligorić, Tijana Zrnic, Cinoo Lee, Emmanuel J. Candès, Dan Jurafsky, 2024, ArXiv Preprint)
- Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence(Ariana Sahitaj, Premtim Sahitaj, Veronika Solopova, Jiaao Li, Sebastian Möller, Vera Schmitt, 2025, Proceedings of the Fourth Workshop on NLP for Positive Impact (NLP4PI))
- EvaluLLM: LLM assisted evaluation of generative outputs(Michael Desmond, Zahra Ashktorab, Qian Pan, Casey Dugan, James M. Johnson, 2024, Companion Proceedings of the 29th International Conference on Intelligent User Interfaces)
- TagLab: A human-centric AI system for interactive semantic segmentation(Gaia Pavoni, Massimiliano Corsini, Federico Ponchio, Alessandro Muntoni, Paolo Cignoni, 2021, ArXiv Preprint)
- ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models(Yuzhe Gu, Ziwei Ji, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen, 2024, ArXiv Preprint)
- Human-in-the-Loop Annotation for Image-Based Engagement Estimation: Assessing the Impact of Model Reliability on Annotation Accuracy(Sahana Yadnakudige Subramanya, Ko Watanabe, Andreas Dengel, Shoya Ishimaru, 2025, ArXiv Preprint)
- VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures(Yoo yeon Sung, Hannah Kim, Dan Zhang, 2025, arXiv.org)
- Incorporating Human Judgment in AI-Assisted Content Development: The HEAT Heuristic(Gustav Verhulsdonck, Jennifer Weible, D. Stambler, Tharon W. Howard, Jason Tham, 2024, Technical Communication)
- MEGAnno+: A Human-LLM Collaborative Annotation System(Hannah Kim, Kushan Mitra, Rafael Li Chen, Sajjadur Rahman, Dan Zhang, 2024, ArXiv Preprint)
- HilMeMe: A Human-in-the-Loop Machine Translation Evaluation Metric Looking into Multi-Word Expressions(Lifeng Han, 2022, ArXiv Preprint)
- Human-in-the-Loop and Generative AI Dilemma: A Hybrid Strategy for Effective Customer Service in Enterprise CRM(2025, International Journal of Business and Technology Management)
- Adobe Summit Concierge Evaluation with Human in the Loop(Yiru Chen, Sally Fang, Sai Sree Harsha, Dan Luo, Vaishnavi Muppala, Fei Wu, Shun Jiang, Kun Qian, Yunyao Li, 2025, ArXiv Preprint)
- Continuous Model Calibration: Leveraging Feedback-Driven Fine-Tuning for Self- Correcting Large Language Models(Opeyemi Joseph Awotunde, 2025, International Journal of Research Publication and Reviews)
- Human-AI collaboration in legal services: empirical insights on task-technology fit and generative AI adoption by legal professionals(Tina Nosrati, Fariba Nosrati, M. Mashayekhi, 2025, SAM Advanced Management Journal)
- From Human Annotation to Automation: LLM-in-the-Loop Active Learning for Arabic Sentiment Analysis(Dania Refai, Alaa Dalaq, Doaa Dalaq, Irfan Ahmad, 2025, arXiv.org)
- FORGE: An LLM-driven Framework for Large-Scale Smart Contract Vulnerability Dataset Construction(Jiachi Chen, Yiming Shen, Jiashuo Zhang, Zihao Li, John C. Grundy, Zhenzhe Shao, Yanlin Wang, Jiashui Wang, Ting Chen, Zibin Zheng, 2025, arXiv.org)
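A pattern recurring across these works is to let the automatic annotator keep the easy cases and route uncertain or contested items to humans. Below is a minimal routing sketch; the confidence threshold, the two-model disagreement rule, and the label names are illustrative assumptions rather than values taken from the cited papers.

```python
from dataclasses import dataclass

@dataclass
class AutoAnnotation:
    label: str
    confidence: float  # model-reported probability of the predicted label

def route_item(primary: AutoAnnotation, secondary: AutoAnnotation,
               confidence_threshold: float = 0.85) -> str:
    """Decide whether an item keeps its automatic label or goes to human review.

    Route to a human when the primary model is unsure or the two models disagree;
    both criteria are illustrative defaults, not values from the cited papers.
    """
    if primary.confidence < confidence_threshold:
        return "human_review"
    if primary.label != secondary.label:
        return "human_review"
    return "auto_accept"

# Hypothetical batch: (primary annotation, secondary annotation) per item.
batch = [
    (AutoAnnotation("propaganda", 0.97), AutoAnnotation("propaganda", 0.91)),
    (AutoAnnotation("neutral", 0.62),    AutoAnnotation("neutral", 0.88)),
    (AutoAnnotation("propaganda", 0.93), AutoAnnotation("neutral", 0.90)),
]

decisions = [route_item(p, s) for p, s in batch]
print(decisions)  # ['auto_accept', 'human_review', 'human_review']
print("human workload:", decisions.count("human_review") / len(decisions))
```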
Evaluation Challenges in Specialized Vertical Domains and High-Stakes Scenarios
In specialized domains such as medical diagnosis, legal judgment, military decision-making, and code repair, conventional general-purpose evaluation metrics often break down. These studies emphasize that expert knowledge remains irreplaceable when subtle context and extreme accuracy requirements are at stake.
- Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches(Namu Park, Farzad Ahmed, Zhaoyi Sun, Kevin Lybarger, Ethan Breinhorst, Julie Hu, Ozlem Uzuner, Martin Gunn, Meliha Yetisgen, 2025, arXiv.org)
- PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications(Dingkang Yang, Jinjie Wei, Dongling Xiao, Shunli Wang, Tong Wu, Gang Li, Mingcheng Li, Shuaibing Wang, Jiawei Chen, Yue Jiang, Qingyao Xu, Ke Li, Peng Zhai, Lihua Zhang, 2024, ArXiv Preprint)
- Generative Pre-trained Transformer for Pediatric Stroke Research: A Pilot Study.(Anna K. Fiedler, Kai Zhang, T. Lal, Xiaoqian Jiang, Stuart M. Fraser, 2024, Pediatric Neurology)
- Benchmarking and Studying the LLM-based Code Review(Zhengran Zeng, Ruikai Shi, Keke Han, Yixin Li, Kai Sun, Yidong Wang, Zhuohao Yu, Rui Xie, Wei Ye, Shikun Zhang, 2025, arXiv.org)
- The State of Human-centered NLP Technology for Fact-checking(Anubrata Das, Houjiang Liu, Venelin Kovatchev, Matthew Lease, 2023, ArXiv Preprint)
- Human-centred test and evaluation of military AI(David Helmer, Michael Boardman, S. Kate Conroy, Adam J. Hepworth, Manoj Harjani, 2024, ArXiv Preprint)
- The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making(Abinitha Gourabathina, Yuexing Hao, Walter Gerych, Marzyeh Ghassemi, 2025, arXiv.org)
- Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models(D. Leucuța, A. Urda-Cîmpean, Dan Istrate, T. Drugan, 2025, Diagnostics)
- Responsible LLM Deployment for High-Stake Decisions by Decentralized Technologies and Human-AI Interactions(Swati Sachan, Theo Miller, Mai Phuong Nguyen, 2025, ArXiv Preprint)
- LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation(Joseph Enguehard, Morgane Van Ermengem, Katie Atkinson, S. Cha, Arijit Ghosh Chowdhury, Prashanth Kallur Ramaswamy, Jeremy Roghair, Hannah Marlowe, Carina Suzana Negreanu, Kitty Boxall, Diana Mincu, 2025, Proceedings of the Natural Legal Language Processing Workshop 2025)
- Abstract B010: From RECIST to reality: A foundation model pipeline for scalable therapy response evaluation(Katarina Vucic, C. Geady, Katy Scott, Joshua Siraj, Andrew Hope, Benjamin Haibe-Kains, 2025, Clinical Cancer Research)
- Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes(Ora Nova Fandina, Gal Amram, E. Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Rami Katan, Alice Podolsky, Orna Raz, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW))
- Exploring the Potential of the Large Language Models (LLMs) in Identifying Misleading News Headlines(Md Main Uddin Rony, M. Haque, Mohammad Ali, Ahmed Shatil Alam, Naeemul Hassan, 2024, arXiv.org)
- Automated Subjective Answer Evaluation using GenAI and NLP by Enhancing Accuracy, Fairness, and Feedback in Education(Yuvasri J, S. M, Rahul R, R. Venkatesan, G. Sundar, Saravana Kumar C S, 2025, 2025 International Conference on Intelligent Computing and Control Systems (ICICCS))
Rigor in Evaluation Methodology, Statistical Modeling, and Benchmark Design
These works examine design issues in the evaluation frameworks themselves, such as the relative effectiveness of rating scales versus pairwise comparisons, statistical pitfalls of ordinal data, task-induced gaming behavior, and how mathematical models (e.g., Polyrating) can quantify and mitigate evaluation bias (a small simulation contrasting interval- and ordinal-scale significance tests follows the reference list below).
- The Great Misalignment Problem in Human Evaluation of NLP Methods(Mika Hämäläinen, Khalid Alnajjar, 2021, ArXiv Preprint)
- What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think(David M. Howcroft, Verena Rieser, 2021, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing)
- Comparison of Rating Scale and Pairwise Comparison Methods for Measuring Human Co-worker Subjective Impression of Robot during Physical Human-Robot Collaboration(Qiao Wang, Ziqi Wang, Marc G. Carmichael, Dikai Liu, Chin-Teng Lin, 2024, 2024 IEEE International Conference on Robotics and Automation (ICRA))
- Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation(Jasper Dekoninck, Maximilian Baader, Martin T. Vechev, 2024, International Conference on Learning Representations)
- Statistical Multicriteria Evaluation of LLM-Generated Text(E. Arias, Hannah Blocher, Julian Rodemann, M. Aßenmacher, Christoph Jansen, 2025, arXiv.org)
- A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation(Irune Zubiaga, A. Soroa, Rodrigo Agerri, 2024, Conference on Empirical Methods in Natural Language Processing)
- PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization(Meiling Tao, Chenghao Zhu, Dongyi Ding, Tiannan Wang, Y. Jiang, Wangchunshu Zhou, 2025, arXiv.org)
- Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena(Aidar Myrzakhan, S. Mahmoud Bsharat, Zhiqiang Shen, 2024, arXiv.org)
- A Human-Centric Assessment Framework for AI(Sascha Saralajew, Ammar Shaker, Zhao Xu, Kiril Gashteovski, Bhushan Kotnis, Wiem Ben Rim, Jürgen Quittek, Carolin Lawrence, 2022, ArXiv Preprint)
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences(Shreya Shankar, J.D. Zamfirescu-Pereira, Bjorn Hartmann, Aditya G. Parameswaran, Ian Arawjo, 2024, Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology)
- RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment(Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, T. Zhang, 2023, ArXiv)
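The warning by Howcroft and Rieser above, that treating ordinal Likert ratings as interval data changes the statistics, can be made concrete with a small simulation: compare a t-test (interval assumption) against a Mann-Whitney U test (ordinal assumption) on synthetic 1-5 ratings. The sketch illustrates the methodological distinction only and does not reproduce their analysis.

```python
import random
from scipy import stats

random.seed(0)

def sample_ratings(probs, n=60):
    """Draw n Likert ratings (1-5) from a categorical distribution."""
    return random.choices([1, 2, 3, 4, 5], weights=probs, k=n)

# Two hypothetical systems whose rating distributions differ mainly in the tails.
system_a = sample_ratings([0.05, 0.15, 0.40, 0.25, 0.15])
system_b = sample_ratings([0.10, 0.25, 0.40, 0.15, 0.10])

t_stat, p_interval = stats.ttest_ind(system_a, system_b)      # treats 1-5 as interval data
u_stat, p_ordinal = stats.mannwhitneyu(system_a, system_b,    # respects the ordinal scale
                                       alternative="two-sided")

print(f"t-test p = {p_interval:.3f}   Mann-Whitney U p = {p_ordinal:.3f}")
```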
Dynamic Evaluation for Interactive, Multimodal, and Agentic Tasks
For agents, retrieval-augmented generation (RAG), long-form text generation, and multimodal content (speech, 3D, video), these studies investigate how to build new evaluation paradigms that capture real-world interaction complexity, real-time constraints, and hallucination risk (a naive claim-level faithfulness sketch follows the reference list below).
- The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation(Zarreen Reza, 2025, ArXiv Preprint)
- AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems(YenTing Lee, Keerthi Koneru, Zahra Moslemi, Sheethal Kumar, Ramesh Radhakrishnan, 2026, ArXiv Preprint)
- SI-Agent: An Agentic Framework for Feedback-Driven Generation and Tuning of Human-Readable System Instructions for Large Language Models(Jeshwanth Challagundla, 2025, 2025 16th International Conference on Information, Intelligence, Systems & Applications (IISA))
- Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards(M. Tamber, F. Bao, Chenyu Xu, Ge Luo, Suleman Kazi, Minseok Bae, Miaoran Li, O. Mendelevitch, Renyi Qu, Jimmy Lin, 2025, Conference on Empirical Methods in Natural Language Processing)
- MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark(Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Pan Zhou, Yao Wan, Lichao Sun, 2024, International Conference on Machine Learning)
- Scalable 3D Captioning with Pretrained Models(Tiange Luo, C. Rockwell, Honglak Lee, Justin Johnson, 2023, Neural Information Processing Systems)
- BERTHA: Video Captioning Evaluation Via Transfer-Learned Human Assessment(Luis Lebron, Yvette Graham, Kevin McGuinness, K. Kouramas, N. O’Connor, 2022, International Conference on Language Resources and Evaluation)
- SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics(Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, H. Saruwatari, 2024, Interspeech)
- Cosmos-Surg-dVRK: World Foundation Model-based Automated Online Evaluation of Surgical Robot Policy Learning(Lukas Zbinden, Nigel Nelson, Juo-Tung Chen, Xinhao Chen, Ji Woong Kim, Mahdi Azizian, Axel Krieger, Sean Huver, 2025, IEEE Robotics and Automation Letters)
- DebateGPT: An Real Time Argument Analyzer using Transformer and NLP(Vruddhi Burad, M. Ahire, Siddhi Badgujar, Ajeet Gayawale, Pranjal Ahire, 2026, International Journal of Scientific Research in Engineering and Management)
- Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation(Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, Dakuo Wang, 2025, arXiv.org)
- AgentEval: Generative Agents as Reliable Proxies for Human Evaluation of AI-Generated Content(Thanh Vu, Richi Nayak, T. Balasubramaniam, 2025, arXiv.org)
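Several of the RAG-faithfulness efforts above reduce hallucination measurement to one question per claim: is it supported by the retrieved context? The sketch below uses a deliberately naive lexical-overlap test as the support check; production systems (e.g., the HHEM/FaithJudge line) replace that stub with a trained entailment or judge model.

```python
import re

def split_claims(response: str) -> list[str]:
    """Very rough claim segmentation: one claim per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]

def is_supported(claim: str, context: str, min_overlap: float = 0.6) -> bool:
    """Naive support test: fraction of claim content words present in the context.
    A placeholder for an NLI or LLM-judge call, not a recommended metric."""
    claim_words = {w.lower() for w in re.findall(r"[a-zA-Z]+", claim) if len(w) > 3}
    context_words = {w.lower() for w in re.findall(r"[a-zA-Z]+", context)}
    if not claim_words:
        return True
    return len(claim_words & context_words) / len(claim_words) >= min_overlap

def faithfulness(response: str, context: str) -> float:
    claims = split_claims(response)
    supported = sum(is_supported(c, context) for c in claims)
    return supported / len(claims) if claims else 1.0

context = "The report was published in 2021 and covers solar capacity in Spain and Portugal."
response = ("The report covers solar capacity in Spain and Portugal. "
            "It was published in 2021. It also predicts wind growth in Norway.")
print(f"faithfulness = {faithfulness(response, context):.2f}")  # last claim is unsupported
```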
Taken together, this report highlights multiple limitations in large language model evaluation: traditional human evaluation faces serious cognitive biases, high costs, and scalability bottlenecks, while the "LLM-as-a-Judge" alternative, though efficient, introduces new risks such as length bias and self-enhancement (narcissism) tendencies. The field's research focus is shifting from competition on single metrics toward building human-in-the-loop (HITL) collaboration frameworks, defining expert evaluation protocols for specific high-stakes domains, and developing dynamic, fine-grained evaluation methodologies for agentic and multimodal tasks.
A total of 193 related publications.
Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging "LLM-as-a-judge" paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in-group debates with multi-agents to generate multi-dimensional feedback. Our evaluation experiments in both the educational and medical domains demonstrate that MAJ-EVAL can generate evaluation results that better align with human experts' ratings compared with conventional automated evaluation metrics and existing LLM-as-a-judge methods.
Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to “validate the validators”—aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative nature of alignment. In particular, we identify a phenomenon we dub criteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appear dependent on the specific LLM outputs observed (rather than independent and definable a priori), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.
Large language models (LLMs) often generate natural language rationales -- free-form explanations that help improve performance on complex reasoning tasks and enhance interpretability for human users. However, evaluating these rationales remains challenging. While recent work has relied on binary preference judgments from humans or LLM judges, such evaluations are often opaque and coarse-grained, offering limited insight into what makes one rationale better than another. In this work, we rethink preference evaluation for LLM-generated rationales by asking: (1) What attributes define good rationales? (2) Can human preferences be explained by these attributes? (3) Can attribute-based evaluation overcome the limitations of binary comparisons? We identify a set of key rationale attributes from prior literature and assess them using automatic metrics, LLM judgments, and human annotations. We then analyze two standard human preference datasets MT Bench and Chatbot Arena using SHAP to identify which attributes best explain human preference outcomes. Finally, we re-evaluate model-generated rationales using attribute-specific ELO scores, revealing more nuanced model comparisons and insights. Our findings suggest that fine-grained attribute evaluations can better characterize rationale quality and guide future research toward more interpretable and reliable evaluation practices.
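The attribute-specific ELO scores mentioned in this abstract follow the familiar update rule used by Chatbot Arena-style leaderboards: each pairwise preference nudges the ratings of the two systems involved. A minimal sketch of that update, with the K-factor, starting rating, and example battles chosen arbitrarily:

```python
from collections import defaultdict

def update_elo(ratings, winner, loser, k=32.0):
    """One Elo update from a single pairwise preference (winner beat loser)."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)

# Hypothetical pairwise outcomes, e.g. per-attribute preferences over rationales.
battles = [("model_a", "model_b"), ("model_a", "model_c"),
           ("model_b", "model_c"), ("model_a", "model_b"),
           ("model_c", "model_b")]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
for winner, loser in battles:
    update_elo(ratings, winner, loser)

for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```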
Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the search clarification task, leveraging a high-quality, multi-dimensional dataset that includes five distinct fine-grained annotation subtasks. Although LLMs have shown impressive capabilities in general settings, our study reveals that even state-of-the-art models struggle to replicate human-level performance in subjective or fine-grained evaluation tasks. Through a systematic assessment, we demonstrate that LLM predictions are often inconsistent, poorly calibrated, and highly sensitive to prompt variations. To address these limitations, we propose a simple yet effective human-in-the-loop (HITL) workflow that uses confidence thresholds and inter-model disagreement to selectively involve human review. Our findings show that this lightweight intervention significantly improves annotation reliability while reducing human effort by up to 45%, offering a relatively scalable and cost-effective yet accurate path forward for deploying LLMs in real-world evaluation settings.
The use of large language models (LLMs) for relevance assessment in information retrieval has gained significant attention, with recent studies suggesting that LLM-based judgments provide comparable evaluations to human judgments. Notably, based on TREC 2024 data, Upadhyay et al. make a bold claim that LLM-based relevance assessments, such as those generated by the Umbrela system, can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion. First, we question whether the evidence provided by Upadhyay et al. genuinely supports their claim, particularly when the test collection is intended to serve as a benchmark for future research innovations. Second, we submit a system deliberately crafted to exploit automatic evaluation metrics, demonstrating that it can achieve artificially inflated scores without truly improving retrieval quality. Third, we simulate the consequences of circularity by analyzing Kendall's tau correlations under the hypothetical scenario in which all systems adopt Umbrela as a final-stage re-ranker, illustrating how reliance on LLM-based assessments can distort system rankings. Theoretical challenges - including the inherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and the potential degradation of future LLM performance - must be addressed before LLM-based relevance assessments can be considered a viable replacement for human judgments.
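The circularity analysis described above hinges on Kendall's tau, the standard statistic for comparing how two evaluation regimes rank the same systems. A small illustration with invented scores (not the TREC data used in the paper):

```python
from scipy.stats import kendalltau

# Hypothetical effectiveness scores for six retrieval systems, one value per system,
# under human assessments vs. LLM-based (Umbrela-style) assessments.
human_scores = [0.41, 0.38, 0.35, 0.33, 0.30, 0.22]
llm_scores   = [0.52, 0.55, 0.40, 0.44, 0.31, 0.29]

tau, p_value = kendalltau(human_scores, llm_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")

# A high tau means the two regimes order systems similarly, the usual argument for
# substitution; the paper's point is that this can hold even while absolute quality
# judgments, and hence future system development, are being distorted.
```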
Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding strategies, thus avoiding arbitrary weighting of the involved metrics. By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.
Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications. A benchmark’s usefulness is determined by its ability to clearly differentiate between models of varying capabilities (separability) and closely align with human preferences. Existing frameworks like Alpaca-Eval 2.0 LC and Arena-Hard v0.1 are limited by their focus on general-purpose queries and lack of diversity across domains such as law, medicine, and multilingual contexts. In this paper, we address these limitations by introducing a novel data pipeline that curates diverse, domain-specific evaluation sets tailored for LLM-as-a-Judge frameworks. Our approach leverages a combination of manual curation, semi-supervised learning to generate clusters, and stratified sampling to ensure balanced representation across a wide range of domains and languages. The resulting evaluation set, which includes 1573 samples across 14 categories, demonstrates high separability (84%) across ten top-ranked models, 84% agreement with Chatbot Arena, and a 0.915 Spearman correlation. The agreement values are 9% better than Arena Hard and 20% better than AlpacaEval 2.0 LC, while the Spearman coefficient is 0.7 more than the next best benchmark, showcasing a significant improvement in the usefulness of the benchmark. We further provide an open-source evaluation tool that enables fine-grained analysis of model performance across user-defined categories, offering valuable insights for practitioners. This work contributes to the ongoing effort to enhance the transparency, diversity, and effectiveness of LLM evaluation methodologies.
LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation
Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into 'Legal Data Points' (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.
This paper proposes a novel approach to evaluate Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator. We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception. To alleviate this, we introduce a model ranking pipeline based on pairwise comparisons of generated CNs from different models, organized in a tournament-style format. The proposed evaluation method achieves a high correlation with human preference, with a ρ score of 0.88. As an additional contribution, we leverage LLMs as zero-shot CN generators and provide a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in zero-shot are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.
Rating-based human evaluation has become an essential tool to accurately evaluate the impressive performance of large language models (LLMs). However, current rating systems suffer from several important limitations: first, they fail to account for biases that significantly influence evaluation results, second, they require large and expensive preference datasets to obtain accurate ratings, and third, they do not facilitate meaningful comparisons of model ratings across different tasks. To address these issues, we introduce Polyrating, an expressive and flexible rating system based on maximum a posteriori estimation that enables a more nuanced and thorough analysis of model performance at lower costs. Polyrating can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. Further, Polyrating can reduce the cost of human evaluations by up to 41% for new models and up to 77% for new tasks by leveraging existing benchmark scores. Lastly, Polyrating enables direct comparisons of ratings across different tasks, providing a comprehensive understanding of an LLM's strengths, weaknesses, and relative performance across different applications.
Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, for studying limitations, we present LimitGen, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding identifying limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.
Novelty is a crucial criterion in the peer‐review process for evaluating academic papers. Traditionally, it is judged by experts or measured by unique reference combinations. Both methods have limitations: experts have limited knowledge, and the effectiveness of the combination method is uncertain. Moreover, it is unclear if unique citations truly measure novelty. The large language model (LLM) possesses a wealth of knowledge, while human experts possess judgment abilities that the LLM does not possess. Therefore, our research integrates the knowledge and abilities of LLM and human experts to address the limitations of novelty assessment. One of the most common types of novelty in academic papers is the introduction of new methods. In this paper, we propose leveraging human knowledge and LLM to assist pre‐trained language models (PLMs, e.g., BERT, etc.) in predicting the method novelty of papers. Specifically, we extract sentences related to the novelty of the academic paper from peer‐review reports and use LLM to summarize the methodology section of the academic paper, which are then used to fine‐tune PLMs. In addition, we have designed a text‐guided fusion module with novel Sparse‐Attention to better integrate human and LLM knowledge. We compared the method we proposed with a large number of baselines. Extensive experiments demonstrate that our method achieves superior performance.
Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation, this work trains "critic" models that help humans to more accurately evaluate model-written code. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as "flawless", even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics can have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might have otherwise avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.
LLM-as-a-Judge has been widely adopted as an evaluation method and served as supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge are mainly relying on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that finetuned LLM-as-a-Judge is a feasible method to boost performance, and the panel-based judge as well as deep reasoning can enhance the judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
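The "global logical consistency" lens described in this abstract boils down to checking that a judge's pairwise preferences contain no cycles. A brute-force sketch of that transitivity check over a hypothetical preference table (this triple enumeration is the obvious naive version, not the paper's implementation):

```python
from itertools import combinations

def transitivity_violations(preferences: dict[tuple[str, str], str]) -> list[tuple[str, str, str]]:
    """Return all (x, y, z) triples where the judge prefers x>y and y>z but z>x.

    preferences[(a, b)] is the answer the judge preferred when shown a and b.
    """
    def prefers(x, y):
        return preferences.get((x, y)) or preferences.get((y, x))

    answers = {a for pair in preferences for a in pair}
    violations = []
    for a, b, c in combinations(sorted(answers), 3):
        # The two possible directed 3-cycles over {a, b, c}.
        for x, y, z in ((a, b, c), (a, c, b)):
            if prefers(x, y) == x and prefers(y, z) == y and prefers(z, x) == z:
                violations.append((x, y, z))
    return violations

# Hypothetical judge verdicts over three candidate answers to the same question.
prefs = {("ans1", "ans2"): "ans1",
         ("ans2", "ans3"): "ans2",
         ("ans1", "ans3"): "ans3"}   # completes a cycle: ans1 > ans2 > ans3 > ans1

print(transitivity_violations(prefs))  # [('ans1', 'ans2', 'ans3')]
```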
Retrieval-augmented generation (RAG) aims to reduce hallucinations by grounding responses in external context, yet large language models (LLMs) still frequently introduce unsupported information or contradictions even when provided with relevant context. This paper presents two complementary efforts at Vectara to measure and benchmark LLM faithfulness in RAG. First, we describe our original hallucination leaderboard, which has tracked hallucination rates for LLMs since 2023 using our HHEM hallucination detection model. Motivated by limitations observed in current hallucination detection methods, we introduce FaithJudge, an LLM-as-a-judge framework that leverages a pool of diverse human-annotated hallucination examples to substantially improve the automated hallucination evaluation of LLMs. We introduce an enhanced hallucination leaderboard centered on FaithJudge that benchmarks LLMs on RAG faithfulness in summarization, question-answering, and data-to-text generation tasks. FaithJudge enables a more reliable benchmarking of LLM hallucinations in RAG and supports the development of more trustworthy generative AI systems: https://github.com/vectara/FaithJudge.
Automated Code Review (ACR) is crucial for software quality, yet existing benchmarks often fail to reflect real-world complexities, hindering the evaluation of modern Large Language Models (LLMs). Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. To address these limitations, we introduce SWRBench, a new benchmark comprising 1000 manually verified Pull Requests (PRs) from GitHub, offering PR-centric review with full project context. SWRBench employs an objective LLM-based evaluation method that aligns strongly with human judgment (~90% agreement) by verifying if issues from a structured ground truth are covered in generated reviews. Our systematic evaluation of mainstream ACR tools and LLMs on SWRBench reveals that current systems underperform, and ACR tools are more adept at detecting functional errors. Subsequently, we propose and validate a simple multi-review aggregation strategy that significantly boosts ACR performance, increasing F1 scores by up to 43.67%. Our contributions include the SWRBench benchmark, its objective evaluation method, a comprehensive study of current ACR capabilities, and an effective enhancement approach, offering valuable insights for advancing ACR research.
AI-based peer review systems tend to produce shallow and overpraising suggestions compared to human feedback. Here, we evaluate how well a reasoning LLM trained with multi-objective reinforcement learning (REMOR) can overcome these limitations. We start by designing a multi-aspect reward function that aligns with human evaluation of reviews. The aspects are related to the review itself (e.g., criticisms, novelty) and the relationship between the review and the manuscript (i.e., relevance). First, we perform supervised fine-tuning of DeepSeek-R1-Distill-Qwen-7B using LoRA on PeerRT, a new dataset of high-quality top AI conference reviews enriched with reasoning traces. We then apply Group Relative Policy Optimization (GRPO) to train two models: REMOR-H (with the human-aligned reward) and REMOR-U (with a uniform reward). Interestingly, the human-aligned reward penalizes aspects typically associated with strong reviews, leading REMOR-U to produce qualitatively more substantive feedback. Our results show that REMOR-U and REMOR-H achieve more than twice the average rewards of human reviews, non-reasoning state-of-the-art agentic multi-modal AI review systems, and general commercial LLM baselines. We found that while the best AI and human reviews are comparable in quality, REMOR avoids the long tail of low-quality human reviews. We discuss how reasoning is key to achieving these improvements and release the Human-aligned Peer Review Reward (HPRR) function, the Peer Review Reasoning-enriched Traces (PeerRT) dataset, and the REMOR models, which we believe can help spur progress in the area.
The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum, from basic metrics to complex benchmarks like GLUE and MMLU. These vulnerabilities manifest through benchmark exploitation, dataset contamination, and evaluation bias, creating a false perception of progress in language understanding capabilities. Through extensive review of contemporary evaluation approaches, we identify significant limitations in static benchmark designs, human evaluation protocols, and LLM-as-judge frameworks, all of which compromise the reliability of current performance assessments. As LLM capabilities evolve and existing benchmarks become redundant, we lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks. This requires frameworks that are adapted dynamically, addressing current limitations and providing a more accurate reflection of LLM performance.
Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. This underscores the need for robust evaluation methodologies to accurately assess LLM-based applications. Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation. Recent research has explored leveraging LLMs to mimic human reasoning and decision-making processes for evaluation purposes known as LLM-as-a-judge framework. However, these existing frameworks have two significant limitations. First, they lack the flexibility to adapt to different text styles, including various answer and ground truth styles, thereby reducing their generalization performance. Second, the evaluation scores produced by these frameworks are often skewed and hard to interpret, showing a low correlation with human judgment. To address these challenges, we propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications. This system iteratively refines evaluation prompts and balances the trade-off between the adaptive requirements of downstream tasks and the alignment with human perception. Our experimental results show that the proposed multi-agent LLM Judge framework not only enhances evaluation accuracy compared to existing methods but also produces evaluation scores that better align with human perception.
Bargaining, a critical aspect of real-world interactions, presents challenges for large language models (LLMs) due to limitations in strategic depth and adaptation to complex human factors. Existing benchmarks often fail to capture this real-world complexity. To address this and enhance LLM capabilities in realistic bargaining, we introduce a comprehensive framework centered on utility-based feedback. Our contributions are threefold: (1) BargainArena, a novel benchmark dataset with six intricate scenarios (e.g., deceptive practices, monopolies) to facilitate diverse strategy modeling; (2) human-aligned, economically-grounded evaluation metrics inspired by utility theory, incorporating agent utility and negotiation power, which implicitly reflect and promote opponent-aware reasoning (OAR); and (3) a structured feedback mechanism enabling LLMs to iteratively refine their bargaining strategies. This mechanism can positively collaborate with in-context learning (ICL) prompts, including those explicitly designed to foster OAR. Experimental results show that LLMs often exhibit negotiation strategies misaligned with human preferences, and that our structured feedback mechanism significantly improves their performance, yielding deeper strategic and opponent-aware reasoning.
Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5, and Midjourney still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompts in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Besides, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.
Quantifying tumor burden is central to response assessment in oncology clinical trials and treatment monitoring. Despite widespread use, the clinical standard Response Evaluation Criteria in Solid Tumors (RECIST) has several limitations: it relies on unidimensional measurements of a limited set of lesions, ignores volumetric change, and exhibits high interobserver variability. These constraints limit its sensitivity in complex disease presentations, ultimately affecting treatment decisions and trial outcomes. Volumetric tumor assessment is more robust and sensitive, but the process of manual segmentation is impractical at scale. Advances in deep learning and foundation models provide an opportunity to transform response evaluation by automating volume estimation with minimal human input. We present AI-Augmented Response Assessment (AAuRA), a novel computational workflow that leverages existing RECIST-like measurements to seed MedSAM—a foundation model for semi-automated medical image segmentation. As a proof-of-concept, we reverse-engineer RECIST annotations by decomposing 3D tumor segmentations to extract the longest axial diameter and its perpendicular, to generate bounding boxes that simulate RECIST-based input. We evaluate AAuRA on two public computed tomography imaging datasets with tumor segmentations: LIDC-IDRI and NSCLC-Radiogenomics. Performance is assessed using Volumetric Dice Similarity Coefficient (DSC), 95th percentile Hausdorff Distance (95HD), Added Path Length (APL), and Pearson correlation between predicted and ground truth volume. To account for size-related bias in spatial metrics, APL and 95HD are normalized by tumor volume. In the NSCLC-Radiogenomics dataset, AAuRA achieved a mean DSC of 0.72, APL of 0.20, and 95HD of 0.01. In the LIDC-IDRI dataset, the mean DSC was 0.50, with an APL of 0.42 and 95HD of 0.11. Volume predictions from AAuRA demonstrated strong correlation with ground truth segmentations: the Pearson correlation coefficient was 0.86 (p < 0.001) and 0.90 (p < 0.001) for the LIDC-IDRI and NSCLC-Radiogenomics datasets, respectively. Segmentation performance was lower for smaller, ill-defined lesions—highlighting known limitations of current foundation models and motivating the need for domain-specific adaptation and fine-tuning. AAuRA serves as a clinically viable bridge between current standards and AI-augmented tools for therapy response evaluation. Using familiar annotations as input, AAuRA can avoid disruptive changes to clinical workflows while enhancing accuracy and scalability of tumor burden assessment. This work highlights how clinically meaningful priors—like RECIST—can be repurposed to guide foundation models, offering a blueprint for real-world deployment of AI in medicine. Future work will extend the pipeline to real RECIST annotations across diverse disease sites and explore integration into prospective imaging workflows. AAuRA’s modular design also enables adaptability to new foundation models and evolving segmentation standards.
Automated singing assessment is crucial for education, entertainment, and talent discovery. However, existing systems are hindered by two fundamental limitations: first, their reliance on reference tracks (e.g., the original song), which stifles creative expression, and second, their simplification of complex vocal performances into a single, often non-diagnostic score based on pitch and rhythm. This paradigm fails to capture the nuanced, multifaceted attributes that define expert-level singing. Echoing the recent shift in other AI domains from discriminative to descriptive evaluation, we advocate for a new paradigm in singing assessment. This paper aims to build a complete ecosystem for reference-free, multi-dimensional, and descriptive singing assessment. First, we construct Sing-MD, a large-scale, multi-dimensional singing dataset annotated by experts across four core dimensions: breath control, timbre quality, emotional expression, and vocal technique. Analysis of this dataset reveals a key finding: significant annotation inconsistencies among experts, which challenges the validity of traditional accuracy-based evaluation metrics. Second, standard Multimodal Large Language Models (MLLMs) are unable to analyze full-length songs on resource-constrained, consumer-grade hardware due to memory limitations. This challenge leads to a ''human label-audio input mismatch'' problem and results in poor performance. To address this issue, we designed VocalVerse, an efficient hybrid architecture. It leverages a lightweight acoustic encoder and specialized modules to process the entire song, thereby learning global performance features, modeling long-term dependencies, and ultimately overcoming this limitation. Third, to address the shortcomings of automated metrics, we establish a new evaluation benchmark, H-TPR (Human-in-the-loop Tiered Perceptual Ranking), which evaluates a model's ability to generate perceptually valid performance rankings, rather than predicting a noisy ''ground-truth'' score. Our comprehensive experiments show that on the H-TPR benchmark, our VocalVerse framework can effectively learn and distinguish singing quality across different dimensions, thereby creating perceptually valid quality rankings and significantly outperforming existing baselines. Furthermore, our framework for multi-dimensional scoring and descriptive feedback generation has been successfully commercialized and deployed at scale, demonstrating its significant real-world impact and practical value.
Risk of bias (RoB) assessment is an essential part of systematic reviews that requires reading and understanding each eligible trial and the RoB tools. RoB assessment is subject to human error and is time-consuming. Machine learning-based tools have been developed to automate RoB assessment using simple models trained on limited corpora. ChatGPT is a conversational agent based on a large language model (LLM) that was trained on an internet-scale corpus and has demonstrated human-like abilities in multiple areas including healthcare. LLMs might be able to support systematic reviewing tasks such as assessing RoB. We aim to assess interrater agreement in overall (rather than domain-level) RoB assessment between human reviewers and ChatGPT in randomized controlled trials of medical interventions. We will randomly select 100 individually- or cluster-randomized, parallel, two-arm trials of medical interventions from recent Cochrane systematic reviews that have been assessed using the RoB1 or RoB2 family of tools. We will exclude reviews and trials that were performed under emergency conditions (e.g., COVID-19), as well as public health and welfare interventions. We will use 25 of the trials and their human RoB assessments to engineer a ChatGPT prompt for assessing overall RoB, based on trial methods text. We will obtain ChatGPT assessments of RoB for the remaining 75 trials and compare them with the human assessments. We will then estimate interrater agreement using Cohen’s κ. The primary outcome for this study is overall human-ChatGPT interrater agreement. We will report observed agreement with an exact 95% confidence interval, expected agreement under random assessment, Cohen’s κ, and a p-value testing the null hypothesis of no difference in agreement. Several other analyses are also planned. This study is likely to provide the first evidence on interrater agreement between human RoB assessments and those provided by LLMs and will inform subsequent research in this area.
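The planned agreement analysis (observed agreement with an exact confidence interval, chance-expected agreement, and Cohen's κ) can be sketched as follows; the label set and the toy ratings are assumptions, not study data.

```python
# Sketch of the agreement analysis: observed agreement with an exact
# (Clopper-Pearson) 95% CI, chance-expected agreement, and Cohen's kappa.
# Labels and ratings below are invented for illustration.
from collections import Counter
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.proportion import proportion_confint

human = ["low", "high", "some concerns", "high", "low", "low", "high"]
chatgpt = ["low", "high", "high", "high", "low", "some concerns", "high"]

n = len(human)
matches = sum(h == c for h, c in zip(human, chatgpt))
observed = matches / n
ci_low, ci_high = proportion_confint(matches, n, alpha=0.05, method="beta")

# Chance-expected agreement from each rater's marginal label distribution.
p_h, p_c = Counter(human), Counter(chatgpt)
expected = sum((p_h[k] / n) * (p_c[k] / n) for k in set(human) | set(chatgpt))

kappa = cohen_kappa_score(human, chatgpt)
print(f"observed={observed:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), "
      f"expected={expected:.2f}, kappa={kappa:.2f}")
```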
Risk of bias (RoB) assessment is a highly skilled task that is time‐consuming and subject to human error. RoB automation tools have previously used machine learning models built using relatively small task‐specific training sets. Large language models (LLMs; e.g., ChatGPT) are complex models built using non‐task‐specific Internet‐scale training sets. They demonstrate human‐like abilities and might be able to support tasks like RoB assessment.
Speech-based AI for Alzheimer’s Disease (AD) risks automating and amplifying the very human biases it is trained on. To build fairer systems, the underlying mechanisms of this bias must be understood. This study dissects this problem through a cross-lingual perception experiment. Using generalized linear mixed-effects models with interaction terms, we uncover a context-dependent "acoustic double standard." In their native language, listeners judge male and female speakers based on different and sometimes inconsistent acoustic criteria. For example, the perceptual impact of vocal monotony is highly dependent on the speaker’s perceived gender. This complex bias mechanism transforms in a non-native context, revealing its deep entanglement with cognitive and social factors. Our findings demonstrate that the human "ground truth" is fundamentally biased, and we present a methodology for auditing these biases at the signal level, offering a path toward fairer diagnostic AI.
Risk of bias (RoB) assessment is essential in systematic reviews and clinical guideline development. Current manual processes are complex, inefficient, and inconsistent. Large language models (LLMs) have potential to assist RoB assessment, but their standalone performance is limited. Efficient integration of LLMs with human judgment remains a challenge. A high-quality dataset of medical literature was developed, encompassing seven bias domains as defined by the Cochrane RoB 1.0 tool. Using structured prompt engineering, two LLMs, DeepSeek V3 and Qwen-plus, were compared on three-class bias risk classification tasks. Three Human-AI collaboration modes were developed: (M1) Evidence Extraction Mode—human judgment based solely on LLM-extracted evidence; (M2) Reasoning Support Mode—combining LLM extracted evidence with reasoning explanations; (M3) Disagreement Trigger Mode—human intervention triggered by model disagreement. Accuracy and human intervention rates were used to evaluate performance and efficiency. We randomly sampled 300 instances per domain from the dataset and evaluated both LLMs using the same structured prompting approach. DeepSeek V3 outperformed Qwen-plus in accuracy in four of seven bias domains, demonstrating superior overall judgment. Model accuracy ranged from 0.403 to 0.777 across domains. In Human-AI collaboration, 60 samples per domain were evaluated with human involvement. M1 showed relatively limited performance in both accuracy and intervention rate; M2 achieved the highest accuracy in most tasks; M3 markedly reduced intervention rates. Accuracy under M2 and M3 ranged from 0.633 to 0.900. Incorporating LLM reasoning improved consistency in human judgments. Disagreement Trigger Mode exhibited high cost-effectiveness in structured and moderate-reasoning tasks, enhancing assessment efficiency, while Reasoning Support Mode was more stable and practical for open-ended and highly subjective tasks. LLMs (like DeepSeek V3 and Qwen-plus) cannot yet replace human RoB assessment but well-designed Human-AI collaboration can improve accuracy and reduce manual workload.
Systematic reviews are essential for evidence-based healthcare, but conducting them is time- and resource-consuming. To date, efforts have been made to accelerate and (semi-)automate various steps of systematic reviews through the use of artificial intelligence, and the emergence of large language models (LLMs) promises further opportunities. One crucial but complex task within systematic review conduct is assessing the risk of bias of included studies. Therefore, the aim of this study was to test the LLM Claude 2 for risk of bias assessment of 100 randomized controlled trials using the revised Cochrane risk of bias tool ("RoB 2"; involving judgements for five specific domains and an overall judgement). We assessed the agreement of risk of bias judgements by Claude with human judgements published in Cochrane Reviews. The observed agreement between Claude and Cochrane authors ranged from 41% for the overall judgement to 71% for domain 4 ("outcome measurement"). Cohen's Kappa was lowest for domain 5 ("selective reporting"; 0.10 (95% confidence interval (CI): -0.10-0.31)) and highest for domain 3 ("missing data"; 0.31 (95% CI: 0.10-0.52)), indicating slight to fair agreement. Fair agreement was found for the overall judgement (Cohen's Kappa: 0.22 (95% CI: 0.06-0.38)). Sensitivity analyses using alternative prompting techniques or the more recent version Claude 3 did not result in substantial changes. Currently, Claude's RoB 2 judgements cannot replace human risk of bias assessment. However, the potential of LLMs to support risk of bias assessment should be further explored.
The rise of surgical robots and vision-language-action models has accelerated the development of autonomous surgical policies and efficient assessment strategies. However, evaluating these policies directly on physical robotic platforms such as the da Vinci Research Kit (dVRK) remains hindered by high costs, time demands, reproducibility challenges, and variability in execution. World foundation models (WFM) for physical AI offer a transformative approach to simulate complex real-world surgical tasks, such as soft tissue deformation, with high fidelity. This work introduces Cosmos-Surg-dVRK, a surgical finetune of the Cosmos WFM, which, together with a trained video classifier, enables fully automated online evaluation and benchmarking of surgical policies. We evaluate Cosmos-Surg-dVRK using two distinct surgical datasets. On tabletop suture pad tasks, the automated pipeline achieves strong correlation between online rollouts in Cosmos-Surg-dVRK and policy outcomes on the real dVRK Si platform, as well as good agreement between human labelers and the V-JEPA 2-derived video classifier. Additionally, preliminary experiments with ex-vivo porcine cholecystectomy tasks in Cosmos-Surg-dVRK demonstrate promising alignment with real-world evaluations, highlighting the platform's potential for more complex surgical procedures.
Large language models (LLMs) are revolutionizing every aspect of society. They are increasingly used in problem-solving tasks to substitute human assessment and reasoning. LLMs are trained on what humans write and are thus exposed to human bias. We evaluate whether LLMs inherit one of the most widespread human biases: overconfidence. We algorithmically construct reasoning problems with known ground truths. We prompt LLMs to answer these problems and assess the confidence in their answers, closely following similar protocols in human experiments. We find that all five LLMs we study are overconfident: they overestimate the probability that their answer is correct by between 20% and 60%. Humans have accuracy similar to the more advanced LLMs, but far lower overconfidence. Although humans and LLMs are similarly biased on questions they are certain they answered correctly, a key difference emerges between them: LLM bias increases sharply relative to humans if they become less sure that their answers are correct. We also show that LLM input has ambiguous effects on human decision making: LLM input leads to an increase in accuracy, but it more than doubles the extent of overconfidence in the answers.
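A toy illustration of the overconfidence gap described above (mean stated probability of being correct minus realized accuracy); the confidence values and correctness flags are invented.

```python
# Toy overconfidence measure: mean self-reported P(correct) minus accuracy.
# Confidence values and correctness flags are made up for illustration.
import numpy as np

confidence = np.array([0.95, 0.90, 0.99, 0.80, 0.85, 0.97])  # self-reported P(correct)
correct = np.array([1, 0, 1, 0, 1, 0])                        # graded against ground truth

accuracy = correct.mean()
overconfidence = confidence.mean() - accuracy
print(f"accuracy={accuracy:.2f}, mean confidence={confidence.mean():.2f}, "
      f"overconfidence={overconfidence:+.2f}")
```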
Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects high-quality samples, discards those that exhibit undesired behavior, and subsequently enhances the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve model performance in terms of both the learned reward and other automated metrics, in both large language models and diffusion models.
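The core filtering step described above amounts to sampling candidate responses, ranking them with a reward model, and fine-tuning on the top-ranked subset. The sketch below shows that loop with placeholder generate/reward/finetune callables; it is an interpretation of the abstract, not the authors' implementation.

```python
# Sketch of one reward-ranked fine-tuning round. generate, reward, and
# finetune are placeholder callables, not a real API.
from typing import Callable

def raft_round(prompts: list[str],
               generate: Callable[[str, int], list[str]],
               reward: Callable[[str, str], float],
               finetune: Callable[[list[tuple[str, str]]], None],
               k_samples: int = 8,
               keep_frac: float = 0.125) -> None:
    selected: list[tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, k_samples)                   # sample k responses
        ranked = sorted(candidates, key=lambda c: reward(prompt, c), reverse=True)
        n_keep = max(1, int(len(ranked) * keep_frac))
        selected.extend((prompt, c) for c in ranked[:n_keep])      # keep highest-reward responses
    finetune(selected)                                             # supervised fine-tuning on filtered data
```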
Multimodal Large Language Models (MLLMs) are increasingly applied in Personalized Image Aesthetic Assessment (PIAA) as a scalable alternative to expert evaluations. However, their predictions may reflect subtle biases influenced by demographic factors such as gender, age, and education. In this work, we propose AesBiasBench, a benchmark designed to evaluate MLLMs along two complementary dimensions: (1) stereotype bias, quantified by measuring variations in aesthetic evaluations across demographic groups; and (2) alignment between model outputs and genuine human aesthetic preferences. Our benchmark covers three subtasks (Aesthetic Perception, Assessment, Empathy) and introduces structured metrics (IFD, NRD, AAS) to assess both bias and alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o, Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL). Results indicate that smaller models exhibit stronger stereotype biases, whereas larger models align more closely with human preferences. Incorporating identity information often exacerbates bias, particularly in emotional judgments. These findings underscore the importance of identity-aware evaluation frameworks in subjective vision-language tasks.
Background/Objectives: Diagnostic accuracy studies are essential for the evaluation of the performance of medical tests. The risk of bias (RoB) for these studies is commonly assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool. This study aimed to assess the capabilities and reasoning accuracy of large language models (LLMs) in evaluating the RoB in diagnostic accuracy studies, using QUADAS 2, compared to human experts. Methods: Four LLMs were used for the AI assessment: ChatGPT 4o model, X.AI Grok 3 model, Gemini 2.0 flash model, and DeepSeek V3 model. Ten recent open-access diagnostic accuracy studies were selected. Each article was independently assessed by human experts and by LLMs using QUADAS 2. Results: Across the 110 signaling-question assessments (11 questions for each of the 10 articles) made by the four AI models, the mean percentage of correct assessments was 72.95%. The most accurate model was Grok 3, followed by ChatGPT 4o, DeepSeek V3, and Gemini 2.0 Flash, with accuracies ranging from 74.45% to 67.27%. When analyzed by domain, the most accurate responses were for “flow and timing”, followed by “index test”, and then similarly for “patient selection” and “reference standard”. An extensive list of reasoning errors was documented. Conclusions: This study demonstrates that LLMs can achieve a moderate level of accuracy in evaluating the RoB in diagnostic accuracy studies. However, they are not yet a substitute for expert clinical and methodological judgment. LLMs may serve as complementary tools in systematic reviews, with compulsory human supervision.
The SLAM paper demonstrated that on-device Small Language Models (SLMs) are a viable and cost-effective alternative to API-based Large Language Models (LLMs), such as OpenAI's GPT-4, offering comparable performance and stability. However, SLAM also identified discrepancies between human preferences and traditional auto-evaluators. This follow-up paper explores methods to align LLM evaluator preferences with human evaluations by addressing biases, particularly toward higher token counts. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. The recalibration significantly improves the alignment of the LLM evaluator with human evaluations across multiple use cases. For instance, the Spearman rank correlation score in the Recommendation use case improved from -27.27 to 44.55. These results highlight the importance of accounting for biases in automated evaluations to ensure fair and accurate model assessments. The recalibration process enhances the reliability of automated evaluators, leading to better AI models that align with human values and expectations. This study provides a robust methodology for future research into bias correction and emphasizes the feasibility and benefits of developing human-aligned AI evaluation systems.
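One simple way to realize the kind of length-bias recalibration described above is to regress the automatic score on token count, subtract the length component, and re-check the rank correlation with human ratings. The sketch below does this on synthetic data; the linear correction and all numbers are assumptions, not the paper's actual procedure.

```python
# Toy length-bias recalibration: remove the token-count component from an
# automatic score and compare Spearman correlation with human ratings
# before and after. All data are synthetic.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
tokens = rng.integers(20, 400, size=50)                       # response lengths
human = rng.normal(3.0, 1.0, size=50)                         # human ratings
auto = 0.6 * human + 0.004 * tokens + rng.normal(0, 0.3, 50)  # length-biased evaluator score

slope, intercept = np.polyfit(tokens, auto, 1)
auto_recal = auto - slope * tokens                            # subtract token-count component

rho_before, _ = spearmanr(auto, human)
rho_after, _ = spearmanr(auto_recal, human)
print(f"Spearman before={rho_before:.2f}, after={rho_after:.2f}")
```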
Evaluating video captioning systems is a challenging task as there are multiple factors to consider; for instance: the fluency of the caption, multiple actions happening in a single scene, and the human bias of what is considered important. Most metrics try to measure how similar the system-generated captions are to a single or a set of human-annotated captions. This paper presents a new method based on a deep learning model to evaluate these systems. The model is based on BERT, a language model that has been shown to work well in multiple NLP tasks. The aim is for the model to learn to perform an evaluation similar to that of a human. To do so, we use a dataset that contains human evaluations of system-generated captions. The dataset consists of human judgments of the captions produced by the systems participating in various years of the TRECVid video-to-text task. BERTHA obtains favourable results, outperforming the commonly used metrics in some setups.
Currently, nearly all evaluations of foundation models focus on objective metrics, emphasizing quiz performance to define model capabilities. While this model-centric approach enables rapid performance assessment, it fails to reflect authentic human experiences. To address this gap, we propose a Human-Centric subjective Evaluation (HCE) framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. Through experiments involving Deepseek R1, OpenAI o3 mini, Grok 3, and Gemini 2.5, we conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks, yielding a comprehensive subjective dataset. This dataset captures diverse user feedback across multiple disciplines, revealing distinct model strengths and adaptability. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind. By offering a novel framework and a rich dataset, this study not only enhances subjective evaluation methodologies but also lays the foundation for standardized, automated assessments, advancing LLM development for research and practical scenarios. Our dataset link is https://github.com/yijinguo/Human-Centric-Evaluation.
No abstract available
Low-resource languages face significant barriers in AI development due to limited linguistic resources and scarce data-labeling expertise, which makes annotated data rare and costly. The scarcity of data and the absence of preexisting tools exacerbate these challenges, especially since these languages may not be adequately represented in various NLP datasets. To address this gap, we propose leveraging the potential of LLMs in the active learning loop for data annotation. Initially, we conduct evaluations to assess inter-annotator agreement and consistency, facilitating the selection of a suitable LLM annotator. The chosen annotator is then integrated into a training loop for a classifier using an active learning paradigm, minimizing the amount of queried data required. Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements, as indicated by estimated potential cost savings of at least 42.45 times compared to human annotation. Our proposed solution shows promising potential to substantially reduce both the monetary and computational costs associated with automation in low-resource settings. By bridging the gap between low-resource languages and AI, this approach fosters broader inclusion and shows the potential to enable automation across diverse linguistic landscapes.
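The LLM-in-the-loop active learning idea above can be sketched as a least-confidence query loop in which the selected items are sent to an LLM annotator instead of a human; llm_annotate, the classifier choice, and the batch sizes are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of an LLM-in-the-loop active learning cycle: train a classifier,
# query the most uncertain unlabeled items, label them with an LLM annotator.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_lab, y_lab, X_pool, llm_annotate, rounds=5, batch=16):
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X_lab, y_lab)
        probs = clf.predict_proba(X_pool)
        uncertainty = 1.0 - probs.max(axis=1)        # least-confident sampling
        idx = np.argsort(-uncertainty)[:batch]       # most uncertain items first
        new_y = np.array([llm_annotate(x) for x in X_pool[idx]])
        X_lab = np.vstack([X_lab, X_pool[idx]])
        y_lab = np.concatenate([y_lab, new_y])
        X_pool = np.delete(X_pool, idx, axis=0)
    return clf
```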
Annotation bias in NLP datasets remains a major challenge for developing multilingual Large Language Models (LLMs), particularly in culturally diverse settings. Bias from task framing, annotator subjectivity, and cultural mismatches can distort model outputs and exacerbate social harms. We propose a comprehensive framework for understanding annotation bias, distinguishing among instruction bias, annotator bias, and contextual and cultural bias. We review detection methods (including inter-annotator agreement, model disagreement, and metadata analysis) and highlight emerging techniques such as multilingual model divergence and cultural inference. We further outline proactive and reactive mitigation strategies, including diverse annotator recruitment, iterative guideline refinement, and post-hoc model adjustments. Our contributions include: (1) a typology of annotation bias; (2) a synthesis of detection metrics; (3) an ensemble-based bias mitigation approach adapted for multilingual settings, and (4) an ethical analysis of annotation processes. Together, these insights aim to inform more equitable and culturally grounded annotation pipelines for LLMs.
Human evaluation is traditionally used as the "gold standard" for checking the quality of texts generated by large language models (LLMs). However, it is itself subject to subjectivity, variability and systematic errors of perception, which calls into question the reliability and reproducibility of model verification results. The aim is to systematize and analyze modern approaches to assessing the quality of the expert annotators involved in verifying the outputs of large language models, in order to increase the objectivity and reliability of human evaluation. The main sources of unreliability of expert judgments are analyzed. Methods for improving the reliability of human evaluation are presented and considered in detail: multiple annotation and calculation of agreement coefficients (Cohen's κ, Fleiss' κ, Krippendorff's α, ICC), the use of control ("gold") tasks, blind testing, meta-evaluation and statistical monitoring of evaluators' work. Special attention is paid to probabilistic models of annotator quality, such as the Dawid-Skene model. The application of systematic calibration and verification of experts is a necessary condition for ensuring the objectivity and reproducibility of human-in-the-loop experiments in natural language processing research and evaluation of large language models. The considered methods make it possible to formalize the process of human evaluation, minimize subjective distortions and increase the reliability of the data obtained, which is critical for the correct comparison and development of LLMs.
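Several of the agreement coefficients listed above are available off the shelf; a minimal sketch using statsmodels and scikit-learn, with a made-up 5-item, 3-annotator ratings matrix, is shown below.

```python
# Fleiss' kappa over a ratings matrix and Cohen's kappa for one annotator pair.
# The 5-item, 3-annotator toy data are invented for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from sklearn.metrics import cohen_kappa_score

# rows = items, columns = annotators, values = category ids (0, 1, 2)
ratings = np.array([
    [0, 0, 1],
    [2, 2, 2],
    [1, 0, 1],
    [2, 1, 2],
    [0, 0, 0],
])

table, _ = aggregate_raters(ratings)   # items x categories count table
print("Fleiss' kappa:", round(fleiss_kappa(table), 3))
print("Cohen's kappa (annotators 1 vs 2):",
      round(cohen_kappa_score(ratings[:, 0], ratings[:, 1]), 3))
```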
Recent research on Responsible AI, particularly in addressing algorithmic biases, has gained significant attention. Natural Language Processing (NLP) algorithms, which rely on human-generated and human-labeled data, often reflect these challenges. In this paper, we analyze inter-annotator agreement in the task of labeling hate speech data and examine how annotators’ backgrounds influence their labeling decisions. Specifically, we investigate differences in hate speech annotations that arise when annotators identify with the targeted groups. Our findings reveal substantial differences in agreement between a general pool of annotators and those who personally relate to the targets of the hate speech they label. Additionally, we evaluate the OpenAI GPT-4o model on the same dataset. Our results highlight the need to consider annotators’ backgrounds when assessing the performance of Large Language Models (LLMs) in hate speech detection.
Financial arguments play a critical role in shaping investment decisions and public trust in financial institutions. Nevertheless, assessing their quality remains poorly studied in the literature. In this paper, we examine the capabilities of three state-of-the-art LLMs (GPT-4o, Llama 3.1, and Gemma 2) in annotating argument quality within financial communications, using the FinArgQuality dataset. Our contributions are twofold. First, we evaluate the consistency of LLM-generated annotations across multiple runs and benchmark them against human annotations. Second, we introduce an adversarial attack designed to inject gender bias, allowing us to analyse the models' responses and assess their fairness and robustness. Both experiments are conducted across three temperature settings to assess their influence on annotation stability and alignment with human labels. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than human counterparts, though the models still exhibit varying degrees of gender bias. We provide a multifaceted analysis of these outcomes and offer practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methodologies.
Exploring the Potential of the Large Language Models (LLMs) in Identifying Misleading News Headlines
In the digital age, the prevalence of misleading news headlines poses a significant challenge to information integrity, necessitating robust detection mechanisms. This study explores the efficacy of Large Language Models (LLMs) in identifying misleading versus non-misleading news headlines. Utilizing a dataset of 60 articles, sourced from both reputable and questionable outlets across the health, science & tech, and business domains, we employ three LLMs (ChatGPT-3.5, ChatGPT-4, and Gemini) for classification. Our analysis reveals significant variance in model performance, with ChatGPT-4 demonstrating superior accuracy, especially in cases with unanimous annotator agreement on misleading headlines. The study emphasizes the importance of human-centered evaluation in developing LLMs that can navigate the complexities of misinformation detection, aligning technical proficiency with nuanced human judgment. Our findings contribute to the discourse on AI ethics, emphasizing the need for models that are not only technically advanced but also ethically aligned and sensitive to the subtleties of human interpretation.
Mental health in children and adolescents has been steadily deteriorating over the past few years. The recent advent of Large Language Models (LLMs) offers much hope for cost- and time-efficient scaling of monitoring and intervention, yet despite specifically prevalent issues such as school bullying and eating disorders, previous studies have not investigated performance in this domain or in open information extraction, where the set of answers is not predetermined. We create a new dataset of Reddit posts from adolescents aged 12-19 annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY, and TREATMENT, and compare expert labels to annotations from two top-performing LLMs (GPT3.5 and GPT4). In addition, we create two synthetic datasets to assess whether LLMs perform better when annotating data as they generate it. We find GPT4 to be on par with human inter-annotator agreement, and performance on synthetic data to be substantially higher; however, the model still occasionally errs on issues of negation and factuality, and the higher performance on synthetic data is driven by the greater complexity of real data rather than by an inherent advantage.
Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
We introduce GODEL (Grounded Open Dialogue Language Model), a large pre-trained language model for dialog. In contrast with earlier models such as DialoGPT, GODEL leverages a new phase of grounded pre-training designed to better support adapting GODEL to a wide range of downstream dialog tasks that require information external to the current conversation (e.g., a database or document) to produce good responses. Experiments against an array of benchmarks that encompass task-oriented dialog, conversational QA, and grounded open-domain dialog show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups, in terms of both human and automatic evaluation. A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses (extrinsic evaluation) in addition to their communicative features (intrinsic evaluation). We show that extrinsic evaluation offers improved inter-annotator agreement and correlation with automated metrics. Code and data processing scripts are publicly available.
The increasing amount of valuable, unstructured textual information poses a major challenge for extracting value from those texts. We need to use NLP (Natural Language Processing) techniques, most of which rely on manually annotating a large corpus of text for their development and evaluation. Creating a large annotated corpus is laborious and requires suitable computational support. There are many annotation tools available, but their main weaknesses are the absence of data management features for quality control and the need for a commercial license. As the quality of the data used to train an NLP model directly affects the quality of the results, the quality control of the annotations is essential. In this paper, we introduce ERAS, a novel web-based text annotation tool developed to facilitate and manage the process of text annotation. ERAS includes not only the key features of current mainstream annotation systems but also other features necessary to improve the curation process, such as inter-annotator agreement, self-agreement and annotation log visualization, for annotation quality control. ERAS also implements a series of features to improve the customization of the user’s annotation workflow, such as: random document selection, re-annotation stages, and warm-up annotations. We conducted two empirical studies to evaluate the tool’s support for text annotation, and the results suggest that the tool not only meets the basic needs of the annotation task but also has some important advantages over the other tools evaluated in the studies. ERAS is freely available at https://github.com/grosmanjs/eras.
With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs’ ability to reason about facts and detect inconsistencies when they occur.
Propaganda detection on social media remains challenging due to task complexity and limited high-quality labeled data. This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability. We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques into three broader categories, conduct a human annotation study on the HQP dataset that reveals low inter-annotator agreement for fine-grained labels, and implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label. A secondary human verification study shows significant improvements in both agreement and time-efficiency. Building on this, we fine-tune smaller language models (SLMs) to perform structured annotation. Instead of fine-tuning on human annotations, we train on high-quality LLM-generated data, allowing a large model to produce these annotations and a smaller model to learn to generate them via knowledge distillation. Our work contributes towards the development of scalable and robust propaganda detection systems, supporting the idea of transparent and accountable media ecosystems in line with SDG 16. The code is publicly available at our GitHub repository.
Current approaches to music emotion annotation remain heavily reliant on manual labelling, a process that imposes significant resource and labour burdens, severely limiting the scale of available annotated data. This study examines the feasibility and reliability of employing a large language model (GPT-4o) for music emotion annotation. In this study, we annotated GiantMIDI-Piano, a classical MIDI piano music dataset, in a four-quadrant valence-arousal framework using GPT-4o, and compared against annotations provided by three human experts. We conducted extensive evaluations to assess the performance and reliability of GPT-generated music emotion annotations, including standard accuracy, weighted accuracy that accounts for inter-expert agreement, inter-annotator agreement metrics, and distributional similarity of the generated labels. While GPT's annotation performance fell short of human experts in overall accuracy and exhibited less nuance in categorizing specific emotional states, inter-rater reliability metrics indicate that GPT's variability remains within the range of natural disagreement among experts. These findings underscore both the limitations and potential of GPT-based annotation: despite its current shortcomings relative to human performance, its cost-effectiveness and efficiency render it a promising scalable alternative for music emotion annotation.
Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation (https://huggingface.co/datasets/facebook/menlo).
Objective: To evaluate large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems. Methods: We utilized a dataset of 400 annotated radiology reports containing 1,623 verified lesion findings. We compared three supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against four generative LLM configurations (Llama 3.1-8B, GPT-4o, GPT-OSS-20b). We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores. Results: The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p<0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions. Conclusion: Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.
Immune-related adverse events (irAEs) affect up to 40% of patients receiving immune checkpoint inhibitors, yet their identification depends on laborious and inconsistent manual chart review. Here we developed and evaluated an agentic large language model system to extract the presence, temporality, severity grade, attribution, and certainty of six irAE types from clinical notes. Retrospectively (263 notes), the system achieved macro-averaged F1 of 0.92 for detection and 0.66 for multi-class severity grading; self-consistency improved F1 by 0.14. The best-performing configuration cost approximately $0.02 per note. In prospective silent deployment over three months (884 notes), detection F1 was 0.72-0.79. In a randomized crossover study of clinical trial staff (17 participants, 316 observations), agentic assistance reduced annotation time by 40% (P < 0.001), increased complete-match accuracy (OR 1.45; 95% CI 1.01-2.09; P = 0.045), and improved inter-annotator agreement (Krippendorff's alpha from 0.22-0.51 to 0.82-0.85). These results demonstrate that agentic AI coupled with human verification could enhance efficiency, performance, and consistency for irAE assessment.
With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task and exposes issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.
Recently, several scholars have contributed to the growth of a new theoretical framework in NLP called perspectivism. This approach aims to leverage data annotated by different individuals to model diverse perspectives that affect their opinions on subjective phenomena such as irony. In this context, we propose MultiPICo, a multilingual perspectivist corpus of ironic short conversations in different languages and linguistic varieties extracted from Twitter and Reddit. The corpus includes sociodemographic information about its annotators. Our analysis of the annotated corpus shows how different demographic cohorts may significantly disagree on their annotation of irony and how certain cultural factors influence the perception of the phenomenon and the agreement on the annotation. Moreover, we show how disaggregated annotations and rich annotator metadata can be exploited to benchmark the ability of large language models to recognize irony, their positionality with respect to sociodemographic groups, and the efficacy of perspective-taking prompting for irony detection in multiple languages.
Qualitative thematic analysis (TA) is valuable but challenging to scale because of the time, cost, and complexity involved. Existing tools offer limited automation and often lack flexibility, trustworthiness, and usability with unstructured data. To address this gap, we developed DeTAILS, an interactive toolkit that integrates large language models into reflexive TA workflows. Built on a modern architecture, DeTAILS supports qualitative data analysis through iterative, LLM-in-the-loop interactions. Next, we plan to conduct a formal evaluation with domain experts to assess its usability, trust, and analytic value. By embedding interactive LLM-in-the-loop processes, DeTAILS strives to extend TA to far larger datasets while preserving analytic depth and researcher reflexivity, going beyond the throughput limits of manual coding in a given time.
We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. This approach utilizes pretrained models from image captioning, image-text alignment, and LLMs to consolidate captions from multiple views of a 3D asset, completely side-stepping the time-consuming and costly process of manual annotation. We apply Cap3D to the recently introduced large-scale 3D dataset, Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human-authored descriptions in terms of quality, cost, and speed. Through effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k collected annotations from the ABO dataset. Finally, we finetune Text-to-3D models on Cap3D and human captions, show that Cap3D outperforms, and benchmark the SOTA including Point-E, Shap-E, and DreamFusion.
Neural Machine Translation is one of the key research directions in Natural Language Processing. However, limited by the scale and quality of parallel corpora, the translation quality of low-resource Neural Machine Translation has always been unsatisfactory. When Reinforcement Learning from Human Feedback (RLHF) is applied to low-resource machine translation, commonly encountered issues are the substandard quality of preference data and the high cost of collecting manual feedback. Therefore, a more cost-effective method for obtaining feedback data is proposed. First, the quality of preference data is optimized through prompt engineering of a Large Language Model (LLM); human feedback is then combined to complete the evaluation. In this way, the reward model can acquire more semantic information and human preferences during the training phase, thereby enhancing feedback efficiency and result quality. Experimental results demonstrate that, compared with the traditional RLHF method, our method is effective on multiple datasets and exhibits a notable improvement of 1.07 BLEU. Meanwhile, it is also more favorably received in the assessments conducted by human evaluators and GPT-4o.
Decentralized large language model (LLM) inference promises transparent and censorship-resistant access to advanced AI, yet existing verification approaches struggle to scale to modern models. Proof of Quality (PoQ) replaces cryptographic verification of computation with consensus over output quality, but the original formulation ignores heterogeneous computational costs across inference and evaluator nodes. This paper introduces a cost-aware PoQ framework that integrates explicit efficiency measurements into the reward mechanism for both types of nodes. The design combines ground-truth token-level F1, lightweight learned evaluators, and GPT-based judgments within a unified evaluation pipeline, and adopts a linear reward function that balances normalized quality and cost. Experiments on extractive question answering and abstractive summarization use five instruction-tuned LLMs ranging from TinyLlama-1.1B to Llama-3.2-3B and three evaluation models spanning cross-encoder and bi-encoder architectures. Results show that a semantic textual similarity bi-encoder achieves much higher correlation with both ground truth and GPT scores than cross-encoders, indicating that evaluator architecture is a critical design choice for PoQ. Quality-cost analysis further reveals that the largest models in the pool are also the most efficient in terms of quality per unit latency. Monte Carlo simulations over 5,000 PoQ rounds demonstrate that the cost-aware reward scheme consistently assigns higher average rewards to high-quality, low-cost inference models and to efficient evaluators, while penalizing slow, low-quality nodes. These findings suggest that cost-aware PoQ provides a practical foundation for economically sustainable decentralized LLM inference.
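A linear quality-cost reward of the kind described above can be written as a weighted difference of normalized quality and normalized latency; the sketch below uses a hypothetical node pool and weight, not the paper's actual parameters.

```python
# Sketch of a linear cost-aware reward: weighted difference of normalized
# quality and normalized latency. Node pool and weight are hypothetical.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    quality: float    # e.g., token-level F1 or evaluator score in [0, 1]
    latency_s: float  # measured inference or evaluation cost

def reward(node: Node, pool: list, alpha: float = 0.7) -> float:
    max_latency = max(n.latency_s for n in pool)
    cost_norm = node.latency_s / max_latency      # normalize cost to [0, 1]
    return alpha * node.quality - (1 - alpha) * cost_norm

pool = [Node("tiny-1.1b", 0.58, 0.9), Node("mid-3b", 0.74, 1.6), Node("slow-3b", 0.60, 4.0)]
for n in pool:
    print(n.name, round(reward(n, pool), 3))
```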
Relevance evaluation plays a crucial role in personalized search systems to ensure that search results align with a user's queries and intent. While human annotation is the traditional method for relevance evaluation, its high cost and long turnaround time limit its scalability. In this work, we present our approach at Pinterest Search to automate relevance evaluation for online experiments using fine-tuned LLMs. We rigorously validate the alignment between LLM-generated judgments and human annotations, demonstrating that LLMs can provide reliable relevance measurement for experiments while greatly improving the evaluation efficiency. Leveraging LLM-based labeling further unlocks the opportunities to expand the query set, optimize sampling design, and efficiently assess a wider range of search experiences at scale. This approach leads to higher-quality relevance metrics and significantly reduces the Minimum Detectable Effect (MDE) in online experiment measurements.
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than with another human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
AI practitioners increasingly use large language model (LLM) agents in compound AI systems to solve complex reasoning tasks; however, these agent executions often fail to meet human standards, leading to errors that compromise the system's overall performance. Addressing these failures through human intervention is challenging due to the agents' opaque reasoning processes, misalignment with human expectations, the complexity of agent dependencies, and the high cost of manual inspection. This paper thus introduces a human-centered evaluation framework for Verifying LLM Agent failures (VeriLA), which systematically assesses agent failures to reduce human effort and make these agent failures interpretable to humans. The framework first defines clear expectations of each agent by curating human-designed agent criteria. Then, it develops a human-aligned agent verifier module, trained with human gold standards, to assess each agent's execution output. This approach enables granular evaluation of each agent's performance by revealing failures from a human standard, offering clear guidelines for revision, and reducing human cognitive load. Our case study results show that VeriLA is both interpretable and efficient in helping practitioners interact more effectively with the system. By upholding accountability in human-agent collaboration, VeriLA paves the way for more trustworthy and human-aligned compound AI systems.
Introduction: The adoption of Large Language Models (LLMs) in search systems necessitates new evaluation methodologies beyond traditional rule-based or manual approaches. Methods: We propose a general framework for evaluating structured outputs using LLMs, focusing on search query parsing within an online classified platform. Our approach leverages LLMs' contextual reasoning capabilities through three evaluation methodologies: Pointwise, Pairwise, and Pass/Fail assessments. Additionally, we introduce a Contextual Evaluation Prompt Routing strategy to improve reliability and reduce hallucinations. Results: Experiments conducted on both small- and large-scale datasets demonstrate that LLM-based evaluation achieves approximately 90% agreement with human judgments. Discussion: These results validate LLM-driven evaluation as a scalable, interpretable, and effective alternative to traditional evaluation methods, providing robust query parsing for real-world search systems.
Thematic analysis (TA) is a widely used qualitative method for identifying underlying meanings within unstructured text. However, TA requires manual processes, which become increasingly labour-intensive and time-consuming as datasets grow. While large language models (LLMs) have been introduced to assist with TA on small-scale datasets, three key limitations hinder their effectiveness. First, current approaches often depend on interactions between an LLM agent and a human coder, a process that becomes challenging with larger datasets. Second, with feedback from the human coder, the LLM tends to mirror the human coder, which provides a narrower viewpoint of the data. Third, existing methods follow a sequential process, where codes are generated for individual samples without recalling previous codes and associated data, reducing the ability to analyse data holistically. To address these limitations, we propose Thematic-LM, an LLM-based multi-agent system for large-scale computational thematic analysis. Thematic-LM assigns specialised tasks to each agent, such as coding, aggregating codes, and maintaining and updating the codebook. We assign coder agents different identity perspectives to simulate the subjective nature of TA, fostering a more diverse interpretation of the data. We applied Thematic-LM to the Dreaddit dataset and the Reddit climate change dataset to analyse themes related to social media stress and online opinions on climate change. We evaluate the resulting themes based on trustworthiness principles in qualitative research. Our study reveals, among other insights, that assigning different identities to coder agents promotes divergence in codes and themes.
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing ''ground truth''. A crucial factor in RAG evaluation is ''support'', or whether the information in the cited documents supports the answer. We conducted a comparative study of submissions to the TREC 2024 RAG Track, evaluating an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate good agreement between human and GPT-4o predictions. Further analysis of the disagreements shows that an independent human judge correlates better with GPT-4o than with another human judge, suggesting that LLM judges can be a reliable alternative for support assessment. We provide a qualitative analysis of human and GPT-4o errors to help guide future evaluations.
Background: Abstraction is a critical step for converting clinical data from unstructured EHRs into a structured format suitable for real-world data analyses. Typically this is a manual, labor-intensive activity requiring substantial training. While prior work has shown that abstraction by humans is reliable, advances in LLMs may improve the efficiency of abstraction. We aim to measure the performance of LLMs in abstracting a diverse set of oncology data elements. Methods: Two clinical abstractors independently abstracted unstructured records of 222 advanced or metastatic NSCLC patients (mean: 248 pages per case). A two-stage LLM system balancing cost and comprehensiveness was used to abstract clinical elements for demographics, diagnosis, third-party lab biomarker testing, and first line (1L) treatment. The initial stage extracted 16 documents semantically similar to the abstraction query and input them, along with abstraction rules, into an LLM (GPT-4o). The LLM was instructed to provide both the abstracted field and a completeness assessment of the provided context. If the first phase resulted in a low completeness score, the entire patient record was then input into a long-context LLM (Gemini-Pro-1.5) to re-attempt abstraction. Gwet’s agreement coefficient (AC) was the primary measure of agreement between the LLM and each abstractor. Date agreement was calculated within ±30 days. Results: The LLM system yielded abstracted values for 90% of elements where both abstractors provided non-missing values. In these cases, the LLM also demonstrated high agreement with each abstractor (≥0.81 across all categories). Agreement was highest in the demographic and diagnosis domains and lower for the 1L treatment domain, which requires deeper understanding of a patient's temporal journey. For elements where neither abstractor provided values, the LLM sometimes provided outputs (frequency: 4.9% for non-biomarker elements; 38.5% for biomarker elements). These discrepancies were primarily driven by nuances in abstraction rules; the LLM often included Tempus-tested biomarkers, while abstractors were more rigorous in abstracting only third-party biomarker results. Conclusions: LLMs show high completion rates and high agreement with human abstractors across a variety of critical abstraction fields. The use of LLMs may significantly reduce the burden of human abstraction and allow for large-scale curation of oncology records. Challenges in handling nuanced contexts underscore the need for careful refinement and evaluation prior to deployment. LLM agreement with abstractors by domain (Gwet's AC, min-max): Demographic (birth date, sex, race, smoking status) 0.96-1; Diagnosis (stage, histology, year of diagnosis) 0.92-0.98; Third Party Biomarker (EGFR, ALK, ROS1, PDL1, BRAF, RET, NTRK) 0.87-1; 1L Treatment (agents, initiation date) 0.81-0.86.
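Gwet's AC1, the agreement coefficient used above, can be computed for two raters from observed agreement and a prevalence-based chance term; the sketch below implements the standard two-rater formula on made-up stage labels.

```python
# Two-rater Gwet's AC1: observed agreement corrected by a prevalence-based
# chance term. The stage labels below are invented for illustration.
from collections import Counter

def gwet_ac1(rater1, rater2):
    n = len(rater1)
    categories = sorted(set(rater1) | set(rater2))
    q = len(categories)
    pa = sum(a == b for a, b in zip(rater1, rater2)) / n      # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    pi = {k: (c1[k] + c2[k]) / (2 * n) for k in categories}   # average category prevalence
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)      # chance agreement term
    return (pa - pe) / (1 - pe)

llm = ["IIIA", "IV", "IV", "IIIB", "IV", "IV"]
abstractor = ["IIIA", "IV", "IV", "IV", "IV", "IV"]
print(round(gwet_ac1(llm, abstractor), 3))
```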
We introduce a novel workflow, TICODER, designed to enhance the trust and accuracy of LLM-based code generation through interactive and guided intent formalization. TICODER partially formalizes ambiguous intent in natural language prompts by generating a set of tests that distinguish common divergent behaviours in generated code suggestions. We evaluate the code generation accuracy improvements provided by TICODER at scale across four competitive LLMs, and assess the cost-benefit trade-off of reviewing the tests surfaced by TICODER through a user study with 15 participants.
Argument Structure Constructions (ASCs) are one of the most well-studied construction groups, providing a unique opportunity to demonstrate the usefulness of Construction Grammar (CxG). For example, the caused-motion construction (CMC, "She sneezed the foam off her cappuccino") demonstrates that constructions must carry meaning, otherwise the fact that "sneeze" in this context causes movement cannot be explained. We hypothesize that this remains challenging even for state-of-the-art Large Language Models (LLMs), for which we devise a test based on substituting the verb with a prototypical motion verb. To be able to perform this test at a statistically significant scale, in the absence of adequate CxG corpora, we develop a novel pipeline of NLP-assisted collection of linguistically annotated text. We show how dependency parsing and GPT-3.5 can be used to significantly reduce annotation cost and thus enable the annotation of rare phenomena at scale. We then evaluate GPT, Gemini, Llama2 and Mistral models for their understanding of the CMC using the newly collected corpus. We find that all models struggle with understanding the motion component that the CMC adds to a sentence.
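The dependency-parsing step in the pipeline described above can be approximated with a simple heuristic filter; the sketch below flags caused-motion candidates with spaCy. The model name, dependency labels, and preposition list are rough assumptions and far simpler than the paper's actual pipeline.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # illustrative model choice

def cmc_candidates(sentences):
    """Heuristically flag sentences whose verb has a direct object plus a
    directional prepositional phrase (a rough caused-motion pattern)."""
    hits = []
    for doc in nlp.pipe(sentences):
        for tok in doc:
            if tok.pos_ != "VERB":
                continue
            has_dobj = any(c.dep_ in ("dobj", "obj") for c in tok.children)
            has_path = any(c.dep_ == "prep" and c.text.lower() in
                           {"off", "into", "onto", "out", "through", "across"}
                           for c in tok.children)
            if has_dobj and has_path:
                hits.append((doc.text, tok.text))
    return hits

print(cmc_candidates(["She sneezed the foam off her cappuccino.",
                      "He read the book in the garden."]))
```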
Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale.
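As a rough illustration of the second phase described above, where LLM-assigned pseudo labels are used to train a lightweight classifier that can be served at scale, the sketch below pairs TF-IDF features with logistic regression. The texts, labels, and model choices are hypothetical stand-ins rather than TnT-LLM's actual components.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical conversations with intent labels assigned by an LLM in phase one.
texts = ["how do I reset my password", "compare flight prices to tokyo",
         "summarize this contract clause", "find hotels near the venue"]
llm_labels = ["account_help", "travel_planning", "document_analysis", "travel_planning"]

# Lightweight classifier trained on LLM pseudo labels, cheap to deploy at scale.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, llm_labels)

print(clf.predict(["what is the cheapest train to berlin"]))
```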
High-quality smart contract vulnerability datasets are critical for evaluating security tools and advancing smart contract security research. Two major limitations of current manual dataset construction are (1) labor-intensive and error-prone annotation processes limiting the scale, quality, and evolution of the dataset, and (2) absence of standardized classification rules results in inconsistent vulnerability categories and labeling results across different datasets. To address these limitations, we present FORGE, the first automated approach for constructing smart contract vulnerability datasets. FORGE leverages an LLM-driven pipeline to extract high-quality vulnerabilities from real-world audit reports and classify them according to the CWE, the most widely recognized classification in software security. FORGE employs a divide-and-conquer strategy to extract structured and self-contained vulnerability information from these reports. Additionally, it uses a tree-of-thoughts technique to classify the vulnerability information into the hierarchical CWE classification. To evaluate FORGE's effectiveness, we run FORGE on 6,454 real-world audit reports and generate a dataset comprising 81,390 solidity files and 27,497 vulnerability findings across 296 CWE categories. Manual assessment of the dataset demonstrates high extraction precision and classification consistency with human experts (precision of 95.6% and inter-rater agreement k-$\alpha$ of 0.87). We further validate the practicality of our dataset by benchmarking 13 existing security tools on our dataset. The results reveal the significant limitations in current detection capabilities. Furthermore, by analyzing the severity-frequency distribution patterns through a unified CWE perspective in our dataset, we highlight inconsistency between current smart contract research focus and priorities identified from real-world vulnerabilities...
With the rapid improvement in large language model (LLM) capabilities, it is becoming more difficult to measure the quality of outputs generated by natural language generation (NLG) systems. Conventional metrics such as BLEU and ROUGE are bound to reference data and are generally unsuitable for tasks that require creative or diverse outputs. Human evaluation is an option, but manually evaluating generated text is difficult to do well, and expensive to scale and repeat as requirements and quality criteria change. Recent work has focused on the use of LLMs as customizable NLG evaluators, and initial results are promising. In this demonstration we present EvaluLLM, an application designed to help practitioners set up, run, and review evaluations over sets of NLG outputs, using an LLM as a custom evaluator. Evaluation is formulated as a series of choices between pairs of generated outputs conditioned on user-provided evaluation criteria. This approach simplifies the evaluation task and obviates the need for complex scoring algorithms. The system can be applied to general evaluation, human-assisted evaluation, and model selection problems.
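The pairwise formulation used here, in which a judge LLM picks the better of two outputs under a user-provided criterion, might be wired up roughly as below; `call_llm` is a hypothetical placeholder for whatever model API the application wraps, not EvaluLLM's actual interface.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around the judge model; replace with a real API call."""
    raise NotImplementedError

def pairwise_judge(criterion: str, output_a: str, output_b: str) -> str:
    """Ask the judge model which of two outputs better satisfies the criterion."""
    prompt = (
        "You are comparing two candidate outputs.\n"
        f"Criterion: {criterion}\n\n"
        f"Output A:\n{output_a}\n\nOutput B:\n{output_b}\n\n"
        "Answer with exactly one letter, A or B, for the better output."
    )
    answer = call_llm(prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"

# Usage idea: tally wins over many pairs to rank systems without a scoring rubric.
```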
This study aims to investigate the use of generative artificial intelligence (GenAI) in the legal profession, focusing on its fit with tasks performed by legal practitioners and its impact on performance and adoption. This study uses a mixed methods approach, combining a survey of 279 legal professionals with qualitative insights from open-ended responses. The quantitative part uses structural equation modeling-partial least squares (PLS-SEM) offering statistical evidence on the relationships between Task Characteristics, Technology Characteristics, Task-Technology Fit (TTF), Utilization and Performance Impact. The qualitative analysis explores participants’ detailed experiences, perceptions and concerns through thematic and sentiment analyses, providing deeper contextual insights. The study highlights variability in the alignment between legal tasks and GenAI capabilities. GenAI fits data-intensive tasks like research but struggles with complex human judgment. A strong TTF improves performance and adoption. Familiarity helps results but does not increase use, as legal practitioners use GenAI selectively, even when they are highly familiar with its capabilities. Participants’ comments highlight both opportunities and challenges, including efficiency gains and concerns over data security, trust and output quality. Despite these challenges, most respondents expressed a positive sentiment. By extending the TTF theory to GenAI in the legal domain and integrating quantitative and qualitative evidence, the study identifies where GenAI adds value and where professional oversight is essential. It offers practical recommendations, including deploying GenAI in areas where it is most suitable and promoting responsible use through targeted training, professional development and confidence-building initiatives that also address associated risks.
BACKGROUND Pediatric stroke is an important cause of morbidity in children. Although research can be challenging, large amounts of data have been captured through collaborative efforts in the International Pediatric Stroke Study (IPSS). This study explores the use of an advanced artificial intelligence program, the Generative Pre-trained Transformer (GPT), to enter pediatric stroke data into the IPSS. METHODS The most recent 50 clinical notes of patients with ischemic stroke or cerebral venous sinus thrombosis at the UTHealth Pediatric Stroke Clinic were deidentified. Domain-specific prompts were engineered for an offline artificial intelligence program (GPT) to answer IPSS questions. Responses from GPT were compared with those of a human rater. Percent agreement was assessed across 50 patients for each of the 114 queries developed from the IPSS database outcome questionnaire. RESULTS GPT demonstrated strong performance on several questions but showed variability overall. In its early iterations it was able to match human judgment occasionally with an accuracy score of 1.00 (n = 20, 17.5%), but it scored as low as 0.26 in some patients. Prompts were adjusted in four subsequent iterations to increase accuracy. In its fourth iteration, agreement was 93.6%, with a maximum agreement of 100% and minimum of 62%. Of 2400 individual items assessed, our model entered 2247 (93.6%) correctly and 153 (6.4%) incorrectly. CONCLUSIONS Although our tailored generative model with domain-specific prompt engineering and ontological guidance shows promise for research applications, further refinement is needed to enhance its accuracy. It cannot enter data entirely independently, but it can be employed in tandem with human oversight, contributing to a collaborative approach that reduces overall effort.
Through extensive experience training professionals and individual users in AI tool adoption since the GPT-3 era, I have observed a consistent pattern: the same AI tool produces dramatically different results depending on who uses it. While some frame AI as a replacement for human intelligence, and others warn of cognitive decline, this position paper argues for a third perspective grounded in practical observation: AI as a cognitive amplifier that magnifies existing human capabilities rather than substituting for them. Drawing on research in human-computer interaction, cognitive augmentation theory, and educational technology, alongside field observations from corporate training across writing, software development, and data analysis domains, I present a framework positioning AI tools as intelligence amplification systems where output quality depends fundamentally on user expertise and judgment. Through analysis of empirical studies on expert-novice differences and systematic observations from professional training contexts, I demonstrate that domain knowledge, quality judgment, and iterative refinement capabilities create substantial performance gaps between users. I propose a three-level model of AI engagement -- from passive acceptance through iterative collaboration to cognitive direction -- and argue that the transition between levels requires not technical training but development of domain expertise and metacognitive skills. This position has critical implications for workforce development and AI system design. Rather than focusing solely on AI literacy or technical prompt engineering, I advocate for integrated approaches that strengthen domain expertise, evaluative judgment, and reflective practice.
No abstract available
This study examines how generative artificial intelligence can augment human judgment in assurance of learning assessments within business education, using the task-technology fit framework as a guiding lens. A case study in a college of business – where the Management Information Systems program served as a central unit in the assurance of learning cycle – compared generative artificial intelligence-driven evaluations of student writing with traditional faculty assessments. The results demonstrate that when mediated by human-based prompt engineering and moderated by human oversight, generative artificial intelligence markedly improves assessment efficiency and scoring consistency while providing more in-depth feedback without compromising evaluation accuracy. These findings indicate that generative artificial intelligence is most effective as a complement to rather than a replacement for human evaluators. The study extends task-technology fit theory to generative artificial intelligence-driven educational assessment and introduces a human-integrated, generative artificial intelligence-augmented theoretical model for assurance of learning assessments. In this model, human expertise acts as an iterative mediator (via prompt engineering) to strengthen task-technology alignment, while human oversight serves as a moderator ensuring contextual fidelity and output quality. Beyond its theoretical contribution, the study highlights practical implications for information systems educators and curriculum designers.
No abstract available
Customer relationship management (CRM) systems are changing more than ever before, thanks to rapid advances in artificial intelligence. Generative AI (GenAI) is now capable of answering most routine customer questions automatically. Human agents step in only for complex cases that need expert judgment, empathy, or careful handling due to rules and regulations. However, companies still find it difficult to make AI and people work in harmony. This research introduces and tests a hybrid CRM approach that combines GenAI with a clear human-in-the-loop (HITL) process. We use the NATCS dataset, which contains real customer service conversations from 2023 in fields like banking, finance, health, travel, and insurance. We fine-tune an AI assistant, set up rules for when humans should take over based on uncertainty in the conversation, and measure results using both data and human feedback. Our hybrid system makes customer service faster, more empathetic, and simpler. It also provides streamlined guidelines for managing risks, security, and data, meeting high academic standards. The results show that this hybrid approach increases first-contact resolution by 27%, cuts average handling time by 34%, and boosts customer satisfaction by 22% compared to using AI alone. It also reduces the number of times human agents need to speak by 41% compared to traditional methods.
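One plausible shape for the handoff rules mentioned above, where conversations escalate to a human agent under high uncertainty or regulated topics, is sketched below; the thresholds and field names are invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class TurnAssessment:
    confidence: float            # model's confidence in its draft reply (0..1)
    regulated_topic: bool        # e.g., disputes, medical or financial advice
    customer_frustration: float  # 0..1 score from a sentiment/emotion classifier

def should_escalate(a: TurnAssessment,
                    min_confidence: float = 0.75,
                    max_frustration: float = 0.6) -> bool:
    """Route the conversation to a human agent when the bot is unsure,
    the topic requires regulated handling, or the customer is upset."""
    return (a.confidence < min_confidence
            or a.regulated_topic
            or a.customer_frustration > max_frustration)

print(should_escalate(TurnAssessment(confidence=0.62, regulated_topic=False,
                                     customer_frustration=0.3)))  # -> True
```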
Online A/B testing is widely used as an experimental methodology for product improvement and business optimization. However, interpreting experimental results often involves subjective judgment and biases from experiment designers, which can undermine the reliability and reproducibility of test outcomes. In particular, experiment designers frequently exhibit inconsistent decision-making when dealing with neutral results—cases where neither statistically significant positive nor negative effects are observed. This study aims to explore the feasibility of automating A/B test decision-making using Generative AI and empirically analyze how well AI decisions align with those of experiment designers and experts. Utilizing 1,407 experimental cases from 48 companies on the Hackle online experimentation platform, the study compares decision-making outcomes between experiment designers and Generative AI, analyzing agreement rates and identifying patterns across companies. Statistical analyses, including chi-square tests and inter-rater agreement evaluation, were employed to assess differences and reliability. The findings indicate meaningful discrepancies between AI and experiment designers but demonstrate that AI decisions closely align with expert judgments. These results suggest that Generative AI can serve as a complementary tool to enhance the consistency and reliability of A/B test result interpretation.
Those involved in education see great importance in promoting self-regulated learning (SRL) and problem-solving (PS) among learners. Teachers are a major factor in promoting SRL-PS in the classroom, but very little is known about whether and how they promote it; therefore, a tool is needed to assess teachers in this domain. The situational judgment test (SJT) is a potential tool to diagnose teachers’ knowledge, but such a tool is currently lacking in the SRL-PS domain. Human interaction with generative artificial intelligence (GenAI) makes it possible to overcome difficulties, with each side complementing the other. The purpose of this paper is to present an initial attempt to develop an SJT tool for assessing teachers’ SRL-PS knowledge with the assistance of ChatGPT. This human-machine interaction led to the formulation of 15 difficulty categories and 20 scenarios that form the basis of the SJT tool for SRL-PS. It was found that scenarios created by the researchers can be complemented by those generated by ChatGPT. In some instances, the scenarios from both sources are quite similar, while in others, those formulated by ChatGPT either expand upon or present an alternative perspective on the difficulty. Moreover, ChatGPT has proposed new scenarios in certain cases. A significant product of the study is a map that enables the analysis of the pool of scenarios and the identification of over- or underrepresented difficulty categories.
This study investigates the metacognitive capabilities of Large Language Models relative to human metacognition in the context of an exam mimicking the International Coaching Federation (ICF) exam, a situational judgment test related to coaching competencies. Using a mixed-methods approach, we assessed the metacognitive performance, including sensitivity, accuracy in probabilistic predictions, and bias, of human participants and five advanced LLMs (GPT-4, Claude-3-Opus, Mistral Large, Llama 3, and Gemini 1.5 Pro). The results indicate that LLMs outperformed humans across all metacognitive metrics, particularly in terms of reduced overconfidence. However, both LLMs and humans showed less adaptability in ambiguous scenarios, adhering closely to predefined decision frameworks. The study suggests that Generative AI can effectively engage in human-like metacognitive processing without conscious awareness. Implications of the study are discussed in relation to the development of AI simulators that scaffold cognitive and metacognitive aspects of mastering coaching competencies. More broadly, the implications of these results are discussed in relation to the development of metacognitive modules that lead towards more autonomous and intuitive AI systems.
Purpose: As technical and professional communicators (TPCs) use AI to develop content, inaccuracies due to AI limitations are introduced; it is vital that TPCs evaluate AI-generated content to improve accuracy and human-centeredness. In this article, we present a human-in-the-loop AI content heuristic (HEAT: Human experience, Expertise, Accuracy, and Trust) as a rating mechanism. Method: This exploratory case study evaluated the quality of content generated by ChatGPT from the perspective of beginner TPC students. We used multiple prompting strategies asking ChatGPT to create documentation on personas using two Darwin Information Type Architecture (DITA) information types, namely concept topics and task instructions, and we evaluated the results with HEAT. Results: HEAT had good intraclass correlation coefficient (ICC) reliability (.743 pilot; .825 for scenarios), indicating its fitness as a heuristic for evaluating generative AI output. The findings indicate that ChatGPT was generally good at writing concept topics; however, it performed less well creating step-by-step task instructions. Expert TPC input helped develop a better prompt for improved output. We also found that tokenization in ChatGPT (the way it breaks up text) plays a large role in noncompliance with format specifications. Conclusion: There is a need for TPCs to (1) develop new models for AI-assisted content creation, (2) recognize the impact of different prompting strategies on developing specific structured authoring units such as concept and task topics, and (3) be aware of the limitations of AI such as ChatGPT. Human-in-the-loop quality check mechanisms, such as HEAT, can help validate and modify AI-generated content to better serve end users.
Large language models (LLMs), initially developed for generative AI, are now evolving into agentic AI systems, which make decisions in complex, real-world contexts. Unfortunately, while their generative capabilities are well-documented, their decision-making processes remain poorly understood. This is particularly evident when testing targeted decision-making: for instance, how models handle exceptions, a critical and challenging aspect of decision-making made relevant by the inherent incompleteness of contracts. Here we demonstrate that LLMs, even ones that excel at reasoning, deviate significantly from human judgments because they adhere strictly to policies, even when such adherence is impractical, suboptimal, or even counterproductive. We then evaluate three approaches to tuning AI agents to handle exceptions: ethical framework prompting, chain-of-thought reasoning, and supervised fine-tuning. We find that while ethical framework prompting fails and chain-of-thought prompting provides only slight improvements, supervised fine-tuning - specifically with human explanations - yields markedly better results. Surprisingly, in our experiments, supervised fine-tuning even enabled models to generalize human-like decision-making to novel scenarios, demonstrating transfer learning of human-aligned decision-making across contexts. Furthermore, fine-tuning with explanations, not just labels, was critical for alignment, suggesting that aligning LLMs with human judgment requires explicit training on how decisions are made, not just which decisions are made. These findings highlight the need to address LLMs' shortcomings in handling exceptions in order to guide the development of agentic AI toward models that can effectively align with human judgment and simultaneously adapt to novel contexts.
Modern businesses are increasingly challenged by the time and expense required to generate and assess high-quality content. Human writers face time constraints, and extrinsic evaluations can be costly. While Large Language Models (LLMs) offer potential in content creation, concerns about the quality of AI-generated content persist. Traditional evaluation methods, like human surveys, further add operational costs, highlighting the need for efficient, automated solutions. This research introduces Generative Agents as a means to tackle these challenges. These agents can rapidly and cost-effectively evaluate AI-generated content, simulating human judgment by rating aspects such as coherence, interestingness, clarity, fairness, and relevance. By incorporating these agents, businesses can streamline content generation and ensure consistent, high-quality output while minimizing reliance on costly human evaluations. The study provides critical insights into enhancing LLMs for producing business-aligned, high-quality content, offering significant advancements in automated content generation and evaluation.
This paper proposes a structured framework for evaluating empathy in Generative AI (GenAI) customer-service systems, addressing a critical gap in how human-centric qualities are assessed in automated interactions. Traditional service-quality models, such as SERVQUAL and the Interpersonal Reactivity Index, conceptualize empathy through human judgment and emotional capability—dimensions that do not directly transfer to GenAI. To adapt these constructs, the paper introduces a dual-constraint model in which emotional alignment and policy compliance jointly determine a bounded form of empathy suitable for AI-mediated service. The study employs conceptual simulation, integrating psychological theory, service-quality research, and responsible-AI standards including EmotionML, IEEE 7010, and ISO/IEC 23894. Symbolic modelling and unit-free visualization are used to illustrate how empathy, factual integrity, and customer outcomes interact across feasible and infeasible regions. Governance metrics—such as Bounded Empathy Drift, Policy Breach Rate, and Insensitive Response Rate—demonstrate how organizations can monitor empathic behaviour during deployment and ensure compliance with ethical constraints. The framework contributes a measurable, governable conceptualization of empathy for GenAI services, offering a foundation for empirical validation and practical implementation. It positions empathy as a strategic performance dimension influencing satisfaction, trust, and service experience within AI-enabled customer ecosystems.
Generative AI in entrepreneurial evaluation reveals sharp contrasts with human judgment. Using a mixed-methods design on 80 early-stage startup proposals, the analysis shows that human scorers produce bimodal distributions reflecting categorical conviction, while AI outputs remain unimodal, centralised, and conservative. These tendencies indicate both algorithmic bureaucracy and behavioral convergence, as evaluators embed preferences during scoring, collapsing ranking and selection into a single act. A hybrid framework is advanced that positions AI as a baseline reference, preserving human interpretive depth and contextual sensitivity while mitigating bureaucratic inertia.
Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human-in-the-Loop
Large Language Models have found application in various mundane and repetitive tasks including Human Resource (HR) support. We worked with the domain experts of a large multinational company to develop an HR support chatbot as an efficient and effective tool for addressing employee inquiries. We inserted a human-in-the-loop in various parts of the development cycles such as dataset collection, prompt optimization, and evaluation of generated output. By enhancing the LLM-driven chatbot’s response quality and exploring alternative retrieval methods, we have created an efficient, scalable, and flexible tool for HR professionals to address employee inquiries effectively. Our experiments and evaluation conclude that GPT-4 outperforms other models and can overcome inconsistencies in data through internal reasoning capabilities. Additionally, through expert analysis, we infer that reference-free evaluation metrics such as G-Eval and Prometheus demonstrate reliability closely aligned with that of human evaluation.
Improving Hebrew offensive language classification using LLM-assisted human-in-the-loop annotation
This study explores the challenges and opportunities of using large language models (LLMs) for automatic annotation of offensive language in Hebrew. The analysis is based on a six-level taxonomy of offensive discourse – covering offensiveness, target, target presence, vulgarity, offense strength, and specific aspects – enabling systematic examination of nuanced Hebrew patterns. Several prompting strategies were tested, including few-shot, role-based prompting, chain-of-thought, and LLM-as-judge, and their outputs were compared to human annotations using standard evaluation metrics and inter-annotator reliability. Findings reveal that salient categories such as explicit threats show strong interpretive stability, while ambiguous ones, such as discrediting attacks, require higher precision. The study also introduces a two-step classification method: first identifying the two most plausible categories, then selecting the more accurate one. This approach reduces the model’s bias toward general categories and improves fine-grained classification. Overall, the study contributes by (1) offering a methodological framework to assess LLMs’ interpretive limits compared to humans, and (2) laying groundwork for building more refined datasets to advance Hebrew offensive language research.
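The two-step classification method described above, shortlisting the two most plausible categories and then choosing between them, could be sketched as follows; `classify` is a hypothetical LLM call and the prompts are illustrative rather than the paper's own.

```python
from typing import Callable, List

def two_step_label(text: str, categories: List[str],
                   classify: Callable[[str], str]) -> str:
    """Step 1: ask for the two most plausible categories.
       Step 2: ask the model to choose between just those two."""
    shortlist_prompt = (
        f"Text: {text}\nFrom the categories {categories}, "
        "return the two most plausible labels, comma separated."
    )
    shortlist = [c.strip() for c in classify(shortlist_prompt).split(",")][:2]
    if len(shortlist) < 2:          # fall back if the model returns one label
        return shortlist[0]
    decide_prompt = (
        f"Text: {text}\nWhich single label fits better: "
        f"{shortlist[0]} or {shortlist[1]}? Answer with the label only."
    )
    return classify(decide_prompt).strip()
```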
Despite significant progress in model development, evaluating eXplainable Artificial Intelligence (XAI) remains elusive and challenging in Alzheimer's Disease (AD) detection using modalities from low-cost or wearable devices. This paper introduces a fine-grained validation framework named 'FairAD-XAI', which provides a comprehensive assessment through twelve properties of explanations, forming a detailed Likert questionnaire. This framework ensures a thorough evaluation of XAI methods, capturing their fairness aspects and supporting the improvement of how humans assess the reliability and transparency of these methods. Moreover, fairness in XAI evaluation is critical, as users from diverse demographic backgrounds may have different perspectives and perceptions towards the system. These variations can lead to biases in human-grounded evaluations and, subsequently, biased decisions from the AI system when deployed. To mitigate this risk, we incorporate two fairness metrics tailored to assess and ensure fairness in XAI evaluations, promoting more equitable outcomes. In summary, the proposed 'FairAD-XAI' framework provides a comprehensive tool for evaluating XAI methods and assessing the essential aspect of fairness. This makes it a multifactorial tool for developing unbiased XAI methods for AI-based AD detection tools, ensuring these technologies are both effective and equitable.
Text summarization, a key natural language generation (NLG) task, is vital in various domains. However, the high cost of inaccurate summaries in risk-critical applications, particularly those involving human-in-the-loop decision-making, raises concerns about the reliability of uncertainty estimation on text summarization (UE-TS) evaluation methods. This concern stems from the dependency of uncertainty model metrics on diverse and potentially conflicting NLG metrics. To address this issue, we introduce a comprehensive UE-TS benchmark incorporating 31 NLG metrics across four dimensions. The benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model on three datasets, with human-annotation analysis incorporated where applicable. We also assess the performance of 14 common uncertainty estimation methods within this benchmark. Our findings emphasize the importance of considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods to ensure reliable and efficient evaluation of UE-TS techniques. Our code and data are available: https://github.com/he159ok/Benchmark-of-Uncertainty-Estimation-Methods-in-Text-Summarization.
Humans can be notoriously imperfect evaluators. They are often biased, unreliable, and unfit to define "ground truth." Yet, given the surging need to produce large amounts of training data in educational applications using AI, traditional inter-rater reliability (IRR) metrics like Cohen's kappa remain central to validating labeled data. IRR remains a cornerstone of many machine learning pipelines for educational data. Take, for example, the classification of tutors' moves in dialogues or labeling open responses in machine-graded assessments. This position paper argues that overreliance on human IRR as a gatekeeper for annotation quality hampers progress in classifying data in ways that are valid and predictive in relation to improving learning. To address this issue, we highlight five examples of complementary evaluation methods, such as multi-label annotation schemes, expert-based approaches, and close-the-loop validity. We argue that these approaches are in a better position to produce training data and subsequent models that produce improved student learning and more actionable insights than IRR approaches alone. We also emphasize the importance of external validity, for example, by establishing a procedure of validating tutor moves and demonstrating that it works across many categories of tutor actions (e.g., providing hints). We call on the field to rethink annotation quality and ground truth, prioritizing validity and educational impact over consensus alone.
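For context, Cohen's kappa, the inter-rater reliability metric the authors argue is over-relied on, corrects observed agreement for chance agreement: kappa = (p_o - p_e) / (1 - p_e). A minimal computation with hypothetical tutor-move labels is shown below.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical tutor-move annotations from two human raters.
rater_1 = ["praise", "hint", "hint", "question", "praise", "hint"]
rater_2 = ["praise", "hint", "question", "question", "praise", "question"]

# kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for chance.
print(round(cohen_kappa_score(rater_1, rater_2), 3))
```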
This study re-examines the role of Explainable AI (XAI) within human-AI collaborative environments and proposes a design and evaluation framework for a human-AI collaboration system that integrates Large Language Models (LLMs) and state-of-the-art AI agent technology. The proposed methodology, which consists of an AI model, an explanation generation module, and a human-AI interface, enhances the adaptability and reliability of explanations. A key contribution of this research is the introduction of an LLM-XAI collaborative architecture that integrates personalized, adaptive explanations with a feedback-driven improvement mechanism. Notably, the system presents a novel paradigm for explanations that distinguishes it from conventional XAI methods by utilizing Chain-of-Thought reasoning traces, natural language explanations, and a multi-stage verification mechanism provided by Deep Research and LLM-based agents. The system defines core quality metrics such as explainability, transparency, reliability, interactivity, and adaptability, and concurrently develops a multi-dimensional evaluation framework to assess these metrics using both quantitative and qualitative data. This system is structured with a feedback loop that enables continuous learning and improvement while transparently explaining the AI’s decision-making process. The quality of explanations is also assessed with quantitative metrics, and the system improves continuously through user feedback. This study also presents quantitative and qualitative evaluation metrics and user research methodologies to validate the system’s effectiveness, which is expected to contribute to achieving trust-based human-AI collaboration. Furthermore, to demonstrate its practical applicability, a pilot implementation in a medical diagnosis support scenario is presented, offering an ideal model where humans and AI collaborate complementarily, thereby playing a crucial role in promoting the ethical use and social acceptance of AI systems.
The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. In this work, we propose: 1. ReviewEval, a comprehensive evaluation framework for AI-generated reviews that measures alignment with human assessments, verifies factual accuracy, assesses analytical depth, identifies degree of constructiveness and adherence to reviewer guidelines; and 2. ReviewAgent, an LLM-based review generation agent featuring a novel alignment mechanism to tailor feedback to target conferences and journals, along with a self-refinement loop that iteratively optimizes its intermediate outputs and an external improvement loop using ReviewEval to improve upon the final reviews. ReviewAgent improves actionable insights by 6.78% and 47.62% over existing AI baselines and expert reviews respectively. Further, it boosts analytical depth by 3.97% and 12.73%, and enhances adherence to guidelines by 10.11% and 47.26% respectively. This paper establishes essential metrics for AI-based peer review and substantially enhances the reliability and impact of AI-generated reviews in academic research.
System Instructions (SIs), or system prompts, are pivotal for guiding Large Language Models (LLMs) but manual crafting is resource-intensive and often suboptimal. Existing automated methods frequently generate non-human-readable “soft prompts,” sacrificing interpretability. This paper introduces SI-Agent, a novel agentic framework designed to automatically generate and iteratively refine human-readable SIs through a feedback-driven loop. SI-Agent employs three collaborating agents: an Instructor Agent, an Instruction Follower Agent (target LLM), and a Feedback/Reward Agent evaluating task performance and optionally SI readability. The framework utilizes iterative cycles where feedback guides the Instructor's refinement strategy (e.g., LLM-based editing, evolutionary algorithms). We detail the framework's architecture, agent roles, the iterative refinement process, and contrast it with existing methods. We present experimental results validating SI-Agent's effectiveness, focusing on metrics for task performance, SI readability, and efficiency. Our findings indicate that SI-Agent generates effective, readable SIs, offering a favorable trade-off between performance and interpretability compared to baselines. Potential implications include democratizing LLM customization and enhancing model transparency. Challenges related to computational cost and feedback reliability are acknowledged.
"Gold"and"ground truth"human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently only human task. We answer the question of whether such a task can be automated using two Large Language Model (LLM) architecture families--encoders and GPT decoders, using novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieve state-of-the-art, even"super-human", results across all classroom annotation tasks. But not all these positive results remain after using more rigorous evaluation measures which reveal spurious correlations and nonrandom racial biases across models and humans. This study then expands these methods to estimate how model use would change to human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.
Assessing classification confidence is essential for effectively leveraging Large Language Models (LLMs) in automated data labeling, particularly within the sensitive contexts of Computational Social Science (CSS) tasks. In this study, we evaluate five uncertainty quantification (UQ) strategies across three CSS classification problems: stance detection, ideology identification, and frame detection. We benchmark these strategies using three different LLMs. To enhance human-in-the-loop classification performance, we introduce an ensemble-based UQ aggregation method, C_ensemble, and propose a novel evaluation metric, Misclassified Recall, designed to better assess model uncertainty on mislabeled or ambiguous data points. Our results show that C_ensemble outperforms existing UQ techniques in six out of nine model-dataset combinations, achieving an average AUC improvement of 8.7%. These findings highlight the potential of UQ-driven methods to significantly improve the reliability and efficiency of human-in-the-loop data annotation pipelines.
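One plausible reading of the proposed Misclassified Recall metric (of all items the classifier got wrong, the fraction that the uncertainty score flags for human review) is sketched below; the exact definition in the paper may differ, and the data is invented.

```python
import numpy as np

def misclassified_recall(y_true, y_pred, uncertainty, review_fraction=0.2):
    """Of all misclassified items, the fraction that fall inside the
    top `review_fraction` most-uncertain items routed to human annotators."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    uncertainty = np.asarray(uncertainty)
    n_review = max(1, int(round(review_fraction * len(y_true))))
    flagged = np.argsort(-uncertainty)[:n_review]   # most uncertain first
    wrong = np.flatnonzero(y_true != y_pred)
    if wrong.size == 0:
        return 1.0
    return np.intersect1d(flagged, wrong).size / wrong.size

# Hypothetical stance-detection predictions and uncertainty scores.
y_true      = [0, 1, 1, 0, 1, 0, 1, 0]
y_pred      = [0, 1, 0, 0, 1, 1, 1, 0]
uncertainty = [.1, .2, .9, .1, .3, .8, .2, .1]
print(misclassified_recall(y_true, y_pred, uncertainty, review_fraction=0.25))
```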
Application modernization in legacy languages such as COBOL, PL/I, and REXX faces an acute shortage of resources, both in expert availability and in high-quality human evaluation data. While Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review, their reliability must be validated before being trusted in high-stakes workflows. Without principled validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs, potentially reinforcing unreliable judgments and compromising downstream deployment decisions. Although various automated approaches to validating LaaJs have been proposed, alignment with human judgment remains a widely used and conceptually grounded validation strategy. In many real-world domains, the availability of human-labeled evaluation data is severely limited, making it difficult to assess how well a LaaJ aligns with human judgment. We introduce SparseAlign, a formal framework for assessing LaaJ alignment with sparse human-labeled data. SparseAlign combines a novel pairwise-confidence concept with a score-sensitive alignment metric that jointly capture ranking consistency and score proximity, enabling reliable evaluator selection even when traditional statistical methods are ineffective due to limited annotated examples. SparseAlign was applied internally to select LaaJs for COBOL code explanation. The top-aligned evaluators were integrated into assessment workflows, guiding model release decisions.
Evaluating Retrieval-Augmented Generation in Scholarly Domains: A Contextual and Semantic Assessment
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance large language models (LLMs) by grounding outputs in external knowledge sources. This paper evaluates the effectiveness of RAG in scholarly domains through a dual assessment combining automated metrics and human expert validation. BLEU score analysis showed that the system achieved high accuracy in structured, ontology-based queries, with the strongest BLEU score of 0.807 observed in ontology-driven counselling applications, while responses to abstract queries were less precise. Expert evaluation, conducted across six criteria—Correctness, Faithfulness, Completeness, Relevance, Error Categorization, and Overall Judgment—revealed that despite partial correctness and incomplete coverage, the generated outputs were highly relevant and received an overall Excellent rating. These findings demonstrate that RAG can provide contextually meaningful and useful responses in academic applications, while highlighting the need for improvements in retrieval accuracy, evidence grounding, and completeness. Future work will focus on refining retrieval strategies, applying domain-specific fine-tuning, and integrating human-in-the-loop feedback to further strengthen the reliability and scholarly utility of RAG systems.
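For reference, the BLEU computation used as the automated metric above can be reproduced with sacreBLEU roughly as follows; the sentences are invented placeholders, and note that sacreBLEU reports scores on a 0-100 scale while the paper appears to use a 0-1 scale.

```python
from sacrebleu.metrics import BLEU

# One reference stream aligned with one system output (placeholder text).
references = [["Ontology-based counselling retrieves the matching intervention for each query."]]
hypothesis = ["The ontology-based counselling system retrieves a matching intervention per query."]

bleu = BLEU()
# corpus_score returns a BLEUScore object; .score is corpus-level BLEU in [0, 100].
print(bleu.corpus_score(hypothesis, references).score)
```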
Cybersecurity breaches in digital substations can pose significant challenges to the stability and reliability of power system operations. To address these challenges, defense and mitigation techniques are required. Identifying and detecting anomalies in information and communication technology (ICT) is crucial to ensure secure device interactions within digital substations. This paper proposes a task-oriented dialogue (ToD) system for anomaly detection (AD) in datasets of multicast messages, e.g., generic object oriented substation event (GOOSE) and sampled value (SV) messages, in digital substations using large language models (LLMs). This model has a lower potential for error and better scalability and adaptability than a process that considers the cybersecurity guidelines recommended by humans, known as the human-in-the-loop (HITL) process. The methodology also significantly reduces the effort required when addressing new cyber threats or anomalies compared with machine learning (ML) techniques, since it leaves the model's complexity and precision unaffected and offers faster implementation. The findings include a comparative assessment, conducted using standard and advanced performance evaluation metrics, of the proposed AD framework and the HITL process. To generate and extract datasets of IEC 61850 communications, a hardware-in-the-loop (HIL) testbed was employed.
The rapid advancement of artificial intelligence, driven by Generative Pre-trained Transformers (GPT), has transformed natural language processing. Prompt engineering plays a key role in guiding model outputs effectively. Our primary objective was to explore the possibilities and limitations of a custom GPT, developed via prompt engineering, as a patient education tool, which delivers publicly available information through a user-friendly design that facilitates more effective access to cervical cancer screening knowledge. The system was developed using the OpenAI GPT-4 model and the Python programming language, with the interface built on Streamlit for cloud-based accessibility and testing. It initially presented questions to testers for preliminary assessment. For cervical cancer-related information, we referenced medical guidelines. Iterative testing optimized the prompts for quality and relevance; techniques like context provision, question chaining, and prompt-based constraints were used. Human-in-the-loop review and two independent medical doctor evaluations were employed. Additionally, system performance metrics were measured. The web application was tested 115 times over a three-week period in 2024, with 87 female (76%) and 28 male (24%) participants. A total of 112 users completed the user experience questionnaire. Statistical analysis showed a significant association between age and perceived personalization (p = 0.047) and between gender and system customization (p = 0.037). Younger participants reported higher engagement, though not significantly. Females valued guidance on screening schedules and early detection, while males highlighted the usefulness of information regarding HPV vaccination and its role in preventing HPV-related cancers. Independent evaluations by medical doctors demonstrated consistent assessments of the system’s responses in terms of accuracy, clarity, and usefulness. While the system demonstrates potential to enhance public health awareness and promote preventive behaviors, encouraging individuals to seek information on cervical cancer screening and HPV vaccination, its conversational capabilities remain constrained by the inherent limitations of current language model technology. Although custom GPTs cannot substitute for healthcare consultations, these tools can streamline workflows, expedite information access, and support personalized care. Further research should focus on conducting well-designed randomized controlled trials to establish definitive conclusions regarding its impact and reliability.
Large Language Models (LLMs) have revolutionized natural language processing by enabling advanced text generation, comprehension, and interactive capabilities. However, their performance often degrades when confronted with real-world variability, requiring continuous refinement to maintain accuracy, reliability, and ethical integrity. Traditional model calibration relies on periodic updates and static fine-tuning, which fail to address evolving language patterns, contextual nuances, and emergent biases. To overcome these limitations, continuous model calibration introduces a feedback-driven fine-tuning mechanism that enables self-correcting capabilities in LLMs. This approach integrates progressive tuning techniques, real-time human-AI collaboration, and anomaly detection frameworks to dynamically adjust model behavior. Progressive tuning leverages reinforcement learning with human feedback (RLHF) and adaptive loss functions to iteratively refine LLM responses, ensuring alignment with contextual accuracy and user expectations. Human-AI collaboration further enhances model calibration by incorporating domain experts' insights and structured feedback loops to mitigate ethical risks, bias propagation, and factual inconsistencies. Additionally, anomaly detection mechanisms identify distributional shifts and inconsistencies in generated responses, allowing automated interventions to preempt erroneous or misleading outputs. This study explores the interplay between self-correction methodologies and real-world applications, emphasizing the need for transparent governance and robust evaluation metrics. We examine case studies across conversational AI, legal reasoning, and healthcare applications to demonstrate the efficacy of feedback-driven fine-tuning in maintaining model adaptability. By establishing a continuous improvement framework, this research aims to optimize AI reliability, enhance interpretability, and promote ethically aligned decision-making in dynamic environments.
Current Chinese-English translation scoring systems suffer from defects such as long processing times and significant differences between system scores and human scores. To reduce the difficulty of human scoring and improve scoring effectiveness, an artificial intelligence scoring management platform for English translation based on natural language information processing was developed. A hierarchical structure for the Chinese-English translation scoring platform was established, with functions for translation material collection, message feature acquisition, language analysis model construction, and data transmission and evaluation. A sentence analysis model for the scoring management platform was then created, and its performance was statistically analyzed using the probability distribution of certain sentences or vocabulary orderings in the translation to evaluate the intelligence of the Chinese-English translation network platform. The experimental results show that the minimum error between the scoring of the proposed platform and traditional manual scoring is only 0.1 points, and the accuracy of the evaluation results is better than that of the current scoring platform. The running time for scoring is also shorter than that of the current scoring platform, with more stable and higher-quality results.
Natural language processing (NLP) is central to communication with machines and among ourselves, and the NLP research field has long sought to produce human-quality language. Identification of informative criteria for measuring the quality of NLP-produced language will support the development of ever-better NLP tools. The authors hypothesize that mentalizing-network neural activity may be used to distinguish NLP-produced language from human-produced language, even in cases where human judges cannot subjectively distinguish the language source. Using the social chatbots Google Meena in English and Microsoft XiaoIce in Chinese to generate NLP-produced language, behavioral tests are conducted which reveal that the variance of personality perceived from chatbot chats is larger than for human chats, suggesting that chatbot language usage patterns are not stable. Using an identity-rating task with functional magnetic resonance imaging, neuroimaging analyses are conducted which reveal distinct patterns of brain activity in the mentalizing network, including the DMPFC and rTPJ, in response to chatbot versus human chats that cannot be distinguished subjectively. This study illustrates a promising empirical basis for measuring the quality of NLP-produced language: adding a judge's implicit perception as an additional criterion.
No abstract available
As the utilization of language models in interdisciplinary, human-centered studies grows, expectations of their capabilities continue to evolve. Beyond excelling at conventional tasks, models are now expected to perform well on user-centric measurements involving confidence and human (dis)agreement, factors that reflect subjective preferences. While modeling subjectivity plays an essential role in cognitive science and has been extensively studied, its investigation at the intersection with NLP remains under-explored. In light of this gap, we explore how language models can quantify subjectivity in cognitive appraisal by conducting comprehensive experiments and analyses with both fine-tuned models and prompt-based large language models (LLMs). Our quantitative and qualitative results demonstrate that personality traits and demographic information are critical for measuring subjectivity, yet existing post-hoc calibration methods often fail to achieve satisfactory performance. Furthermore, our in-depth analysis provides valuable insights to guide future research at the intersection of NLP and cognitive science.
The Rating Scale (RS) method has long been deemed the standard for measuring subjective perceptions. However, in the field of physical human-robot collaboration (pHRC), its aptness should be put under scrutiny due to inherent challenges such as response bias, between-subject variation, and granularity. Individual variances can introduce significant bias in rating scale results. A high granularity in the scale could overwhelm participants, leading to unclear and biased responses, while a low granularity may gloss over the fine nuances of human feelings. Additionally, there is a notable risk of receiving careless responses, which compromise data reliability. Recognizing these challenges, this paper proposes the application of Pairwise Comparison (PC) in pHRC, an alternative survey technique that emphasizes direct comparisons between items on the defined criteria. Using the NASA Task Load Index (NASA-TLX) as a template, RS and PC questionnaires are designed and used in a series of pHRC experiments. Our preliminary findings suggest that PC is more precise and robust than the rating scale method. Compared to RS, PC fosters authentic participant interest in the experiment through intuitive question design and a reduced experimental duration. Moreover, the accuracy and reliability of PC are found to be consistent regardless of variations in our experimental procedure design.
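Pairwise-comparison responses are usually aggregated back into a scale by fitting a model such as Bradley-Terry; the sketch below implements the standard minorization-maximization iteration on a hypothetical win-count matrix. The paper does not specify its aggregation method, so this is an assumption rather than its procedure.

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a matrix where wins[i, j] is the
    number of times item i was preferred over item j (MM algorithm)."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        totals = wins + wins.T                         # comparisons per pair
        denom = totals / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = wins.sum(axis=1) / denom.sum(axis=1)   # total wins / weighted exposure
        p = p_new / p_new.sum()                        # normalize for identifiability
    return p

# Hypothetical preference counts among three workload conditions in a pHRC study.
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]], dtype=float)
print(np.round(bradley_terry(wins), 3))
```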
Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that there are two common factors which make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate under high variance settings while the differences they are hoping to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in high variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analysis and test their assumptions. We also make recommendations for improving statistical power.
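A stripped-down version of the kind of simulation-based power analysis the authors advocate is sketched below: draw ordinal ratings for two systems that differ slightly and estimate how often a test detects the difference at each sample size. The rating distributions and the use of a Mann-Whitney test are illustrative; the paper itself recommends ordinal mixed-effects models and releases its own tooling.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
levels = np.arange(1, 6)                       # 5-point Likert scale
p_base  = np.array([.10, .20, .35, .25, .10])  # system A rating distribution
p_shift = np.array([.08, .17, .33, .28, .14])  # system B: slightly better

def estimated_power(n_items, n_sims=2000, alpha=0.05):
    """Fraction of simulated studies in which the difference is detected."""
    hits = 0
    for _ in range(n_sims):
        a = rng.choice(levels, size=n_items, p=p_base)
        b = rng.choice(levels, size=n_items, p=p_shift)
        if mannwhitneyu(a, b, alternative="two-sided").pvalue < alpha:
            hits += 1
    return hits / n_sims

for n in (100, 300, 1000):
    print(n, estimated_power(n))
```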
To date, there is a paucity of research conducting natural language processing (NLP) on the open-ended responses of behavior rating scales. Using three NLP lexicons for sentiment analysis of the open-ended responses of the Behavior Assessment System for Children-Third Edition, the researchers discovered a moderately positive correlation between the human composite rating and the sentiment score using each of the lexicons for strengths comments and a slightly positive correlation for the concerns comments made by guardians and teachers. In addition, the researchers found that as the word count increased for open-ended responses regarding the child’s strengths, there was a greater positive sentiment rating. Conversely, as word count increased for open-ended responses regarding child concerns, the human raters scored comments more negatively. The authors offer a proof-of-concept to use NLP-based sentiment analysis of open-ended comments to complement other data for clinical decision making.
No abstract available
Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS system evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus dataset, ATT-Corpus, paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also fine-tune Qwen2-Audio-Instruct with human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in the ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).
Traditional subjective answer evaluation is inconsistent, unfair, and time-inefficient. Utilizing Generative AI (GenAI) and Natural Language Processing (NLP), this project presents a technique that automates the evaluation of subjective answers. The system analyzes each student's response against predefined grading rubrics, assigns scores based on quality and relevance, and provides personalized feedback. Using a fine-tuned GenAI model, it reduces human bias and inconsistency by attending to context and coherence. Educators can manage and review evaluations through an easy-to-use, Streamlit-hosted interface. The proposed solution aims to achieve higher grading accuracy and reduce assessment workflow complexity while giving students insight into what is being assessed. Potential applications include essay grading, peer review, and professional training assessment.
Previous work adopts large language models (LLMs) as evaluators to evaluate natural language processing (NLP) tasks. However, certain shortcomings, e.g., fairness, scope, and accuracy, persist for current LLM evaluators. To analyze whether LLMs can serve as reliable alternatives to humans, we examine the fine-grained alignment between LLM evaluators and human annotators, particularly in understanding the target evaluation tasks and conducting evaluations that meet diverse criteria. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning), each with different evaluation criteria. Our analysis shows that 1) LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts. 2) LLM evaluators excel in general criteria, such as fluency, but face challenges with complex criteria, such as numerical reasoning. We also find that LLM pre-drafting before human evaluation can help reduce the impact of human subjectivity and minimize annotation outliers in pure human evaluation, leading to more objective evaluation.
In the context of remanufacturing, particularly mobile device refurbishing, effective operator training is crucial for accurate defect identification and process inspection efficiency. This study examines the application of Natural Language Processing (NLP) techniques to evaluate operator expertise based on subjective textual responses gathered during a defect analysis task. Operators were asked to describe screen defects using open-ended questions, and their responses were compared with expert responses to evaluate their accuracy and consistency. We employed four NLP models, including finetuned Sentence-BERT (SBERT), pre-trained SBERT, Word2Vec, and Dice similarity, to determine their effectiveness in interpreting short, domain-specific text. A novel disagreement tagging framework was introduced to supplement traditional similarity metrics with explainable insights. This framework identifies the root causes of model–human misalignment across four categories: defect type, severity, terminology, and location. Results show that a finetuned SBERT model significantly outperforms other models by achieving a Pearson's correlation of 0.93 with MAE and RMSE scores of 0.07 and 0.12, respectively, providing more accurate and context-aware evaluations. In contrast, other models exhibit limitations in semantic understanding and consistency. The results highlight the importance of finetuning NLP models for domain-specific applications and demonstrate how qualitative tagging methods can enhance interpretability and model debugging. This combined approach indicates a scalable and transparent methodology for the evaluation of operator responses, supporting the development of more effective training programmes in industrial settings where remanufacturing and sustainability are key performance metrics.
While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments due to their cost efficiency. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTScore computes the BERTScore for self-supervised dense speech features of the generated and reference speech, which can have different sequential lengths. We also propose SpeechBLEU and SpeechTokenDistance, which are computed on speech discrete tokens. The evaluations on synthesized speech show that our method correlates better with human subjective ratings than mel cepstral distortion and a recent mean opinion score prediction model. Also, they are effective in noisy speech evaluation and have cross-lingual applicability.
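The matching underlying SpeechBERTScore is the BERTScore-style greedy alignment of frame embeddings. The numpy sketch below illustrates that computation on random matrices standing in for self-supervised speech features; the real metric extracts features with an SSL speech model, which is not shown here.

```python
# Sketch: BERTScore-style precision/recall/F1 over frame-level speech features.
import numpy as np

def speech_bertscore(gen_feats, ref_feats):
    """gen_feats: (T_gen, D), ref_feats: (T_ref, D); returns precision, recall, F1."""
    gen = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = gen @ ref.T                      # (T_gen, T_ref) cosine similarities
    precision = sim.max(axis=1).mean()     # best reference match per generated frame
    recall = sim.max(axis=0).mean()        # best generated match per reference frame
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rng = np.random.default_rng(0)
gen = rng.normal(size=(120, 768))          # stand-in for generated-speech features
ref = rng.normal(size=(150, 768))          # stand-in for reference-speech features
print(speech_bertscore(gen, ref))
```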
Assessment is a fundamental part of the learning process, yet it is often prone to subjectivity and latent biases associated with handwriting, language proficiency, or social background. Existing automatic grading methods are efficient, but they can hardly be considered transparent or fair. In this paper we introduce an Adaptive and Explainable Fair AI-Powered Assessment System, which enables transparent and fair evaluation of student answers. The proposed system applies Machine Learning and Natural Language Processing (NLP) methods to examine semantic content rather than writing style or linguistic complexity. A human-in-the-loop bias correction framework allows fairness to improve iteratively and continuously by incorporating educator feedback into model updates. Furthermore, a cross-lingual fairness module built on multilingual transformer models helps in scoring equally across various languages. For risk reduction and interpretability, we include an explainability engine that generates nuanced rationales for each score in terms of content coverage, key-concept relevance, and points of failure. Experimental results show that the system can reduce bias, justify scores transparently, and accommodate personalized feedback for learning. This method offers an adaptive and ethical framework for AI-based educational assessment.
In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful large language models -- which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
Active Learning (AL) addresses the high costs of collecting human annotations by strategically annotating the most informative samples. However, for subjective NLP tasks, incorporating a wide range of perspectives in the annotation process is crucial to capture the variability in human judgments. We introduce Annotator-Centric Active Learning (ACAL), which incorporates an annotator selection strategy following data sampling. Our objective is two-fold: (1) to efficiently approximate the full diversity of human judgments, and (2) to assess model performance using annotator-centric metrics, which value minority and majority perspectives equally. We experiment with multiple annotator selection strategies across seven subjective NLP tasks, employing both traditional and novel, human-centered evaluation metrics. Our findings indicate that ACAL improves data efficiency and excels in annotator-centric performance evaluations. However, its success depends on the availability of a sufficiently large and diverse pool of annotators to sample from.
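One simple way to read "annotator-centric metrics" is to score the model against each annotator separately and then average over annotators rather than items, so minority and majority perspectives weigh equally. The sketch below illustrates this with invented labels; it is not the paper's exact metric suite.

```python
# Sketch: annotator-centric accuracy (average of per-annotator accuracies).
import numpy as np

# annotations[annotator] = that annotator's labels for 5 items (illustrative)
annotations = {
    "ann_1": [1, 0, 1, 1, 0],
    "ann_2": [1, 0, 0, 1, 0],
    "ann_3": [0, 0, 1, 1, 1],   # a minority perspective on items 1 and 5
}
model_preds = [1, 0, 1, 1, 0]

per_annotator_acc = {
    name: float(np.mean(np.array(labels) == np.array(model_preds)))
    for name, labels in annotations.items()
}
print(per_annotator_acc)
print("annotator-centric score:",
      round(float(np.mean(list(per_annotator_acc.values()))), 3))
```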
Two general approaches are common for evaluating automatically generated labels in topic modeling: direct human assessment; or performance metrics that can be calculated without, but still correlate with, human assessment. However, both approaches implicitly assume that the quality of a topic label is single-dimensional. In contrast, this paper provides evidence that human assessments about the quality of topic labels consist of multiple latent dimensions. This evidence comes from human assessments of four simple labeling techniques. For each label, study participants responded to several items asking them to assess each label according to a variety of different criteria. Exploratory factor analysis shows that these human assessments of labeling quality have a two-factor latent structure. Subsequent analysis demonstrates that this multi-item, two-factor assessment can reveal nuances that would be missed using either a single-item human assessment of perceived label quality or established performance metrics. The paper concludes by suggesting future directions for the development of human-centered approaches to evaluating NLP and ML systems more broadly.
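The key analysis here is exploratory factor analysis over multi-item human ratings. The sketch below shows what that step looks like on synthetic ratings with a planted two-factor structure; the items, loadings, and data are all illustrative rather than taken from the study.

```python
# Sketch: exploratory factor analysis on multi-item label-quality ratings.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_labels = 200
# two latent qualities, e.g., "descriptive fit" and "readability" (assumed names)
fit = rng.normal(size=n_labels)
readability = rng.normal(size=n_labels)
items = np.column_stack([
    fit + 0.3 * rng.normal(size=n_labels),          # item 1 loads on fit
    fit + 0.3 * rng.normal(size=n_labels),          # item 2 loads on fit
    readability + 0.3 * rng.normal(size=n_labels),  # item 3 loads on readability
    readability + 0.3 * rng.normal(size=n_labels),  # item 4 loads on readability
])

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(items)
print(np.round(fa.components_, 2))   # loadings: rows = factors, columns = items
```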
Debate and public speaking evaluation traditionally depend on human judgment, which can be subjective, inconsistent, and time-intensive. With advancements in Artificial Intelligence (AI) and Natural Language Processing (NLP), automated systems can now provide structured and objective language analysis. This paper presents DebateGPT, a real-time AI-based debate argument analyzer that integrates speech recognition, transformer-based NLP models, sentiment analysis, and grammatical evaluation into a unified mobile-compatible framework. The proposed system captures spoken input, converts it into text using advanced Speech-to-Text (STT) models, and analyzes argument structure, logical coherence, and emotional tone. The backend is developed using FastAPI and Python, while the frontend is implemented using Kotlin Compose for Android. Transformer models from Hugging Face are utilized for argument mining and sentiment classification, and speech recognition is powered by models such as OpenAI Whisper. The system generates structured feedback reports in real time, enabling debaters to identify strengths and weaknesses immediately. The proposed architecture aims to reduce subjectivity in debate evaluation while enhancing learning efficiency and communication skill development. Keywords: Natural Language Processing, Transformer Models, Speech Recognition, Argument Mining, Sentiment Analysis, Real-Time Evaluation.
Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.
While recent research increasingly emphasizes the value of human-LLM collaboration in competitive programming and proposes numerous empirical methods, a comprehensive understanding remains elusive due to the fragmented nature of existing studies and their use of diverse, application-specific human feedback. Thus, our work serves a three-fold purpose: First, we present the first taxonomy of human feedback consolidating the entire programming process, which promotes fine-grained evaluation. Second, we introduce ELABORATIONSET, a novel programming dataset specifically designed for human-LLM collaboration, meticulously annotated to enable large-scale simulated human feedback and facilitate cost-effective real human interaction studies. Third, we introduce ELABORATION, a novel benchmark to facilitate a thorough assessment of human-LLM competitive programming. With ELABORATION, we pinpoint strengths and weaknesses of existing methods, thereby setting the foundation for future improvement. Our code and dataset are available at https://github.com/SCUNLP/ELABORATION
Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges. The code and dataset are publicly available at our project homepage: \url{https://mllm-judge.github.io/}.
LLM-as-a-Judge is a framework that uses an LLM (large language model) to evaluate the quality of natural language text - typically text that is also generated by an LLM. This framework holds great promise due to its relative low-cost, ease of use, and strong correlations with human stylistic preferences. However, LLM Judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating the overall response quality. To do so, we create and publicly release a human-annotated dataset with labels of correctness for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM's ability to correctly answer a question and grade responses to that question. Although aggregate level statistics might imply a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis on how reference quality can affect the performance of an LLM Judge. We show that providing a weaker judge (e.g. Qwen 2.5 7B) with higher quality references reaches better agreement with human annotators than a stronger judge (e.g. GPT-4o) with synthetic references.
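The paper's practical recommendation is to hand the judge a human-written reference answer. A minimal sketch of that setup follows; the prompt wording is invented, and call_llm is a stand-in for whatever chat-completion client is actually used.

```python
# Sketch: reference-augmented correctness judging with an LLM judge.
def build_judge_prompt(question, response, reference):
    return (
        "You are grading the factual correctness of an answer.\n"
        f"Question: {question}\n"
        f"Candidate answer: {response}\n"
        f"Human-written reference answer: {reference}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def judge_correctness(question, response, reference, call_llm):
    """call_llm: any callable that maps a prompt string to a completion string."""
    verdict = call_llm(build_judge_prompt(question, response, reference))
    return verdict.strip().upper().startswith("CORRECT")

# usage sketch with a stubbed model client
fake_llm = lambda prompt: "CORRECT"
print(judge_correctness("What is 2+2?", "4", "The answer is 4.", fake_llm))
```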
As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.
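Intra-rater reliability of an LLM judge can be probed by rescoring the same outputs several times. The sketch below computes two simple views of that spread, per-item standard deviation and run-to-run rank correlation, on an invented score matrix; the paper's own tasks and statistics may differ.

```python
# Sketch: quantify run-to-run consistency of repeated LLM-judge scores.
import numpy as np
from scipy.stats import spearmanr

# scores[r, i] = judge score for item i on repeated run r (1-10 scale, illustrative)
scores = np.array([
    [7, 4, 9, 6, 3],
    [8, 4, 7, 6, 5],
    [6, 5, 9, 4, 3],
], dtype=float)

per_item_sd = scores.std(axis=0, ddof=1)
print("per-item SD across runs:", np.round(per_item_sd, 2))

# average pairwise Spearman correlation between runs (higher = more consistent)
n_runs = scores.shape[0]
rhos = [spearmanr(scores[a], scores[b])[0]
        for a in range(n_runs) for b in range(a + 1, n_runs)]
print("mean run-to-run Spearman rho:", round(float(np.mean(rhos)), 2))
```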
The rapid advancement of large language models (LLMs) has led to a surge in both model supply and application demands. To facilitate effective matching between them, reliable, generic and efficient benchmark generators are widely needed. However, human annotators are constrained by inefficiency, and current LLM benchmark generators not only lack generalizability but also struggle with limited reliability, as they lack a comprehensive evaluation framework for validation and optimization. To fill this gap, we first propose an automated and unbiased evaluation framework, structured around four dimensions and ten criteria. Under this framework, we carefully analyze the advantages and weaknesses of directly prompting LLMs as generic benchmark generators. To enhance the reliability, we introduce a series of methods to address the identified weaknesses and integrate them as BenchMaker. Experiments across multiple LLMs and tasks confirm that BenchMaker achieves superior or comparable performance to human-annotated benchmarks on all metrics, highlighting its generalizability and reliability. More importantly, it delivers highly consistent evaluation results across 12 LLMs (0.967 Pearson correlation against MMLU-Pro), while taking only $0.005 and 0.38 minutes per sample.
LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate high-quality, self-bias-silenced benchmark. Experimental results across various settings demonstrate that Silencer can suppress self-bias to near zero, significantly improve evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with high-quality human-annotated benchmark), while also exhibiting strong generalizability.
Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.
Inconsistent political statements represent a form of misinformation. When left unnoticed, they erode public trust and pose challenges to accountability. Detecting inconsistencies automatically could support journalists in asking clarification questions, thereby helping to keep politicians accountable. We propose the inconsistency detection task and develop a scale of inconsistency types to prompt NLP research in this direction. To provide a resource for detecting inconsistencies in a political domain, we present a dataset of 698 human-annotated pairs of political statements with explanations of the annotators' reasoning for 237 samples. The statements mainly come from voting assistant platforms such as Wahl-O-Mat in Germany and Smartvote in Switzerland, reflecting real-world political issues. We benchmark Large Language Models (LLMs) on our dataset and show that in general, they are as good as humans at detecting inconsistencies, and might even be better than individual humans at predicting the crowd-annotated ground truth. However, when it comes to identifying fine-grained inconsistency types, none of the models has reached the upper bound of performance (due to natural labeling variation), thus leaving room for improvement. We make our dataset and code publicly available.
The quality of peer review plays a critical role in scientific publishing, yet remains poorly understood and challenging to evaluate at scale. In this work, we introduce RottenReviews, a benchmark designed to facilitate systematic assessment of review quality. RottenReviews comprises over 15,000 submissions from four distinct academic venues enriched with over 9,000 reviewer scholarly profiles and paper metadata. We define and compute a diverse set of quantifiable review-dependent and reviewer-dependent metrics, and compare them against structured assessments from large language models (LLMs) and expert human annotations. Our human-annotated subset includes over 700 paper-review pairs labeled across 13 explainable and conceptual dimensions of review quality. Our empirical findings reveal that LLMs, both zero-shot and fine-tuned, exhibit limited alignment with human expert evaluations of peer review quality. Surprisingly, simple interpretable models trained on quantifiable features outperform fine-tuned LLMs in predicting overall review quality. We publicly release all data, code, and models at https://github.com/Reviewerly-Inc/RottenReviews to support further research in this area.
Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to distinguish the superior model given several LLMs. PandaLM's focus extends beyond just the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset, where all contexts are generated by humans and labels are aligned with human preferences. Our results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. PandaLM enables the evaluation of LLM to be fairer but with less cost, evidenced by significant improvements achieved by models tuned through PandaLM compared to their counterparts trained with default Alpaca's hyperparameters. In addition, PandaLM does not depend on API-based evaluations, thus avoiding potential data leakage. All resources of PandaLM are released at https://github.com/WeOpenML/PandaLM.
Natural language processing (NLP), particularly sentiment analysis, plays a vital role in areas like marketing, customer service, and social media monitoring by providing insights into user opinions and emotions. However, progress in Arabic sentiment analysis remains limited due to the lack of large, high-quality labeled datasets. While active learning has proven effective in reducing annotation efforts in other languages, few studies have explored it in Arabic sentiment tasks. Likewise, the use of large language models (LLMs) for assisting annotation and comparing their performance to human labeling is still largely unexplored in the Arabic context. In this paper, we propose an active learning framework for Arabic sentiment analysis designed to reduce annotation costs while maintaining high performance. We evaluate multiple deep learning architectures, specifically long short-term memory (LSTM), gated recurrent unit (GRU), and recurrent neural network (RNN) models, across three benchmark datasets: Hunger Station, AJGT, and MASAC, encompassing both modern standard Arabic and dialectal variations. Additionally, two annotation strategies are compared: human labeling and LLM-assisted labeling. Five LLMs are evaluated as annotators: GPT-4o, Claude 3 Sonnet, Gemini 2.5 Pro, DeepSeek Chat, and LLaMA 3 70B Instruct. For each dataset, the best-performing LLM was used: GPT-4o for Hunger Station, Claude 3 Sonnet for AJGT, and DeepSeek Chat for MASAC. Our results show that LLM-assisted active learning achieves competitive or superior performance compared to human labeling. For example, on the Hunger Station dataset, the LSTM model achieved 93% accuracy with only 450 labeled samples using GPT-4o-generated labels, while on the MASAC dataset, DeepSeek Chat reached 82% accuracy with 650 labeled samples, matching the accuracy obtained through human labeling.
With the rapid improvement in the general capabilities of LLMs, LLM personalization, i.e., how to build LLM systems that can generate personalized responses or services that are tailored to distinct user personas, has become an increasingly important research and engineering problem. However, unlike many new challenging benchmarks being released for evaluating the general/reasoning capabilities, the lack of high-quality benchmarks for evaluating LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization, focusing on evaluating the model's ability to generate responses tailored to explicit personas. PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers based on the contextual complexity of the user personas and the difficulty in distinguishing subtle differences between two personalized responses. We conduct comprehensive evaluations across a wide range of models. The empirical results reveal that even state-of-the-art LLMs that can solve complex real-world reasoning tasks could fall short on the hard tier of PersonaFeedback where even human evaluators may find the distinctions challenging. Furthermore, we conduct an in-depth analysis of failure modes across various types of systems, demonstrating that the current retrieval-augmented framework should not be seen as a de facto solution for personalization tasks. All benchmark data, annotation protocols, and the evaluation pipeline will be publicly available to facilitate future research on LLM personalization.
Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs). Typically, an LLM is given a question and selects the answer deemed most probable after adjustments for factors like length. Unfortunately, LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to a priori unbalanced probabilities, influencing the prediction of answers based on these IDs. Previous research has introduced methods to reduce this "selection bias" by simply permuting options on a few test samples and applying the result to new ones. Another problem with MCQ is the lottery-ticket choice from "random guessing": the LLM has not learned the particular knowledge, but the option is guessed correctly. This situation is especially serious for small-scale LLMs. To address these issues, a more thorough approach involves shifting from MCQ to open-style questions, which can fundamentally eliminate selection bias and random guessing. However, the transition brings its own set of challenges in (1) identifying suitable open-style questions and (2) validating the correctness of LLM open-style responses against human-annotated ground truths. This work aims to tackle these significant difficulties, and establish a new LLM evaluation benchmark through entirely open-style questions. Consequently, we introduce the Open-LLM-Leaderboard to track various LLMs' performance and reflect their true capabilities, covering models such as GPT-4o/4/3.5, Claude 3, Gemini, etc. Our code and dataset are available at https://github.com/VILA-Lab/Open-LLM-Leaderboard.
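The selection-bias probe mentioned above, permuting answer options and checking whether the model tracks content or letters, can be sketched in a few lines. In the toy below, ask_model is a placeholder for a real model call and the question is illustrative; a letter-consistency score near 1 with low content consistency signals selection bias.

```python
# Sketch: measure MCQ selection bias by rotating options and comparing
# letter consistency against content consistency.
from itertools import permutations

def probe_selection_bias(question, options, ask_model):
    """options: list of answer strings; ask_model returns a letter like "A"."""
    letters = "ABCD"[:len(options)]
    picked_letters, picked_contents = [], []
    for perm in permutations(options):
        prompt = question + "\n" + "\n".join(
            f"{l}. {o}" for l, o in zip(letters, perm))
        letter = ask_model(prompt)
        picked_letters.append(letter)
        picked_contents.append(perm[letters.index(letter)])
    same_letter = max(picked_letters.count(l) for l in letters) / len(picked_letters)
    same_content = max(picked_contents.count(o) for o in options) / len(picked_contents)
    return {"letter_consistency": same_letter, "content_consistency": same_content}

# usage sketch with a stubbed model that always answers "A" (maximal letter bias)
print(probe_selection_bias("Capital of France?",
                           ["Paris", "Rome", "Berlin"],
                           lambda prompt: "A"))
```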
Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, many automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SE-Jury, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SE-Jury first defines five distinct evaluation strategies, each implemented by an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges as a team to produce a final correctness score through ensembling. We evaluate SE-Jury across a diverse set of software engineering (SE) benchmarks that span three popular SE tasks: code generation, automated program repair, and code summarization. Results demonstrate that SE-Jury consistently achieves a higher correlation with human judgments, with improvements ranging from 29.6% to 140.8% over existing automatic metrics. SE-Jury reaches agreement levels with human annotators that are close to inter-annotator agreement in code generation and program repair. These findings underscore SE-Jury's potential as a scalable and reliable alternative to human evaluation in these SE tasks.
High annotation costs from hiring or crowdsourcing complicate the creation of large, high-quality datasets needed for training reliable text classifiers. Recent research suggests using Large Language Models (LLMs) to automate the annotation process, reducing these costs while maintaining data quality. LLMs have shown promising results in annotating downstream tasks like hate speech detection and political framing. Building on the success in these areas, this study investigates whether LLMs are viable for annotating the complex task of media bias detection and whether a downstream media bias classifier can be trained on such data. We create annolexical, the first large-scale dataset for media bias classification with over 48000 synthetically annotated examples. Our classifier, fine-tuned on this dataset, surpasses all of the annotator LLMs by 5-9 percent in Matthews Correlation Coefficient (MCC) and performs close to or outperforms the model trained on human-labeled data when evaluated on two media bias benchmark datasets (BABE and BASIL). This study demonstrates how our approach significantly reduces the cost of dataset creation in the media bias domain and, by extension, the development of classifiers, while our subsequent behavioral stress-testing reveals some of its current limitations and trade-offs.
Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) style variation (e.g., uncertain phrasing or colloquial tone); and (3) format changes (e.g., LLM-generated multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, outputs from four LLMs, and three human expert reads per clinical context. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or format reflect diverging treatment selections between humans and LLMs. We find that LLMs are more sensitive to gender and style perturbations while human annotators are more sensitive to LLM-generated format perturbations such as clinical summaries. Our results highlight the need for evaluation frameworks that go beyond static benchmarks to assess the similarity between human clinician and LLM decisions under the variability characteristic of clinical settings.
Integrating Large Language Models (LLMs) into business process management tools promises to democratize Business Process Model and Notation (BPMN) modeling for non-experts. While automated frameworks assess syntactic and semantic quality, they miss human factors like trust, usability, and professional alignment. We conducted a mixed-methods evaluation of our proposed solution, an LLM-powered BPMN copilot, with five process modeling experts using focus groups and standardized questionnaires. Our findings reveal a critical tension between acceptable perceived usability (mean CUQ score: 67.2/100) and notably lower trust (mean score: 48.8%), with reliability rated as the most critical concern (M=1.8/5). Furthermore, we identified output-quality issues, prompting difficulties, and a need for the LLM to ask more in-depth clarifying questions about the process. We envision five use cases ranging from domain-expert support to enterprise quality assurance. We demonstrate the necessity of human-centered evaluation complementing automated benchmarking for LLM modeling agents.
Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of these LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics without relying on high-quality reference answers. Nevertheless, their exact human alignment in SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets of code translation, code generation, and code summarization, we then prompt these methods to evaluate each response. Finally, we compare the scores generated by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlation of 81.32 and 68.51 with human scores in code translation and generation, achieving near-human evaluation, noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns. Finally, we provide...
Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, the suitability of current automatic metrics on LLMs' simplification evaluation is still uncertain. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models' performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the LLMs' simplification capabilities. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. We believe that these models offer a representative selection across large, medium, and small sizes of LLMs. Results show that LLMs generally generate fewer erroneous simplification outputs compared to the previous state-of-the-art. However, LLMs have their limitations, as seen in GPT-4's and Qwen2.5-72B's struggle with lexical paraphrasing. Furthermore, we conduct meta-evaluations on widely used automatic metrics using our human annotations. We find that these metrics lack sufficient sensitivity to assess the overall high-quality simplifications, particularly those generated by high-performance LLMs.
We outline the Great Misalignment Problem in natural language processing research: the problem definition is not in line with the proposed method, and the human evaluation is in line with neither the definition nor the method. We study this misalignment problem by surveying 10 randomly sampled papers published in ACL 2020 that report results with human evaluation. Our results show that only one paper was fully in line in terms of problem definition, method, and evaluation. Only two papers presented a human evaluation that was in line with what was modeled in the method. These results highlight that the Great Misalignment Problem is a major one and it affects the validity and reproducibility of results obtained by a human evaluation.
Chatbots have been an interesting application of natural language generation since its inception. With novel transformer-based Generative AI methods, building chatbots has become trivial. Chatbots targeted at specific domains, for example medicine and psychology, are implemented rapidly. This, however, should not distract from the need to evaluate the chatbot responses, especially because the natural language generation community does not entirely agree on how to effectively evaluate such applications. In this work we discuss the issue further in light of the increasingly popular LLM-based evaluations and how they correlate with human evaluations. Additionally, we introduce a comprehensive factored evaluation mechanism that can be utilized in conjunction with both human and LLM-based evaluations. We present the results of an experimental evaluation conducted using this scheme in one of our chatbot implementations, which consumed educational reports, and subsequently compare automated, traditional human evaluation, factored human evaluation, and factored LLM evaluation. Results show that factor-based evaluation produces better insights on which aspects need to be improved in LLM applications and further strengthens the argument for using human evaluation in critical spaces where the main functionality is not direct retrieval.
The REAIM 2024 Blueprint for Action states that AI applications in the military domain should be ethical and human-centric and that humans must remain responsible and accountable for their use and effects. Developing rigorous test and evaluation, verification and validation (TEVV) frameworks will contribute to robust oversight mechanisms. TEVV in the development and deployment of AI systems needs to involve human users throughout the lifecycle. Traditional human-centred test and evaluation methods from human factors need to be adapted for deployed AI systems that require ongoing monitoring and evaluation. The language around AI-enabled systems should shift towards including the human(s) as a component of the system. Standards and requirements supporting this adjusted definition are needed, as are metrics and means to evaluate them. The need for dialogue between technologists and policymakers on human-centred TEVV will be evergreen, but dialogue needs to be initiated with an objective in mind for it to be productive. Development of TEVV throughout the system lifecycle is critical to support this evolution, including the issue of human scalability and its impact on the scale of achievable testing. Communication between technical and non-technical communities must be improved to ensure operators and policymakers understand the risk assumed by system use and to better inform research and development. Test and evaluation in support of responsible AI deployment must include the effect of the human to reflect operationally realised system performance. Means of communicating the results of TEVV to those using and making decisions regarding the use of AI-based systems will be key in informing risk-based decisions regarding use.
Trustworthy Large Language Models (LLMs) must cite human-verifiable sources in high-stakes domains such as healthcare, law, academia, and finance, where even small errors can have severe consequences. Practitioners and researchers face a choice: let models generate citations during decoding, or let models draft answers first and then attach appropriate citations. To clarify this choice, we introduce two paradigms: Generation-Time Citation (G-Cite), which produces the answer and citations in one pass, and Post-hoc Citation (P-Cite), which adds or verifies citations after drafting. We conduct a comprehensive evaluation from zero-shot to advanced retrieval-augmented methods across four popular attribution datasets and provide evidence-based recommendations that weigh trade-offs across use cases. Our results show a consistent trade-off between coverage and citation correctness, with retrieval as the main driver of attribution quality in both paradigms. P-Cite methods achieve high coverage with competitive correctness and moderate latency, whereas G-Cite methods prioritize precision at the cost of coverage and speed. We recommend a retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings such as strict claim verification. Our codes and human evaluation results are available at https://anonymous.4open.science/r/Citation_Paradigms-BBB5/
LLM evaluation is challenging even in the case of base models. In real-world deployments, evaluation is further complicated by the interplay of task-specific prompts and experiential context. At scale, bias evaluation is often based on short-context, fixed-choice benchmarks that can be rapidly evaluated; however, these can lose validity when the LLM's deployed context differs. Large-scale human evaluation is often seen as too intractable and costly. Here we present our journey towards developing a semi-automated bias evaluation framework for free-text responses that has human insights at its core. We discuss how we developed an operational definition of bias that helped us automate our pipeline and a methodology for classifying bias beyond multiple choice. We additionally comment on how human evaluation helped us uncover problematic templates in a bias benchmark.
We provide a new multi-task benchmark for evaluating text-to-image models. We perform a human evaluation comparing the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models, on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings. Text-to-image generation has seen rapid progress to the point that many recent models have demonstrated their ability to create realistic high-resolution images for various prompts. However, current text-to-image methods and the broader body of research in vision-language understanding still struggle with intricate text prompts that contain many objects with multiple attributes and relationships. We introduce a new text-to-image benchmark that contains a suite of thirty-two tasks over multiple applications that capture a model's ability to handle different features of a text prompt. For example, asking a model to generate a varying number of the same object to measure its ability to count or providing a text prompt with several objects that each have a different attribute to identify its ability to match objects and attributes correctly. Rather than subjectively evaluating text-to-image results on a set of prompts, our new multi-task benchmark consists of challenge tasks at three difficulty levels (easy, medium, and hard) and human ratings for each generated image.
High-stakes decision domains are increasingly exploring the potential of Large Language Models (LLMs) for complex decision-making tasks. However, LLM deployment in real-world settings presents challenges in data security, evaluation of its capabilities outside controlled environments, and accountability attribution in the event of adversarial decisions. This paper proposes a framework for responsible deployment of LLM-based decision-support systems through active human involvement. It integrates interactive collaboration between human experts and developers through multiple iterations at the pre-deployment stage to assess the uncertain samples and judge the stability of the explanation provided by post-hoc XAI techniques. Local LLM deployment within organizations and decentralized technologies, such as Blockchain and IPFS, are proposed to create immutable records of LLM activities for automated auditing to enhance security and trace back accountability. It was tested on Bert-large-uncased, Mistral, and LLaMA 2 and 3 models to assess the capability to support responsible financial decisions on business lending.
Improvements in text generation technologies such as machine translation have necessitated more costly and time-consuming human evaluation procedures to ensure an accurate signal. We investigate a simple way to reduce cost by reducing the number of text segments that must be annotated in order to accurately predict a score for a complete test set. Using a sampling approach, we demonstrate that information from document membership and automatic metrics can help improve estimates compared to a pure random sampling baseline. We achieve gains of up to 20% in average absolute error by leveraging stratified sampling and control variates. Our techniques can improve estimates made from a fixed annotation budget, are easy to implement, and can be applied to any problem with structure similar to the one we study.
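The control-variate idea in this abstract is easy to state in code: automatic metric scores are available for every segment, human scores only for a sample, and the known full-set metric mean corrects the sampled human mean. The sketch below uses synthetic data and the textbook control-variate coefficient; it is not the paper's exact estimator.

```python
# Sketch: control-variate estimate of a test-set human score from a sample,
# using an automatic metric known for all segments. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_total, n_annotated = 2000, 200

metric_all = rng.normal(0.6, 0.1, size=n_total)               # automatic metric, all segments
human_all = metric_all + rng.normal(0.0, 0.05, size=n_total)  # latent human scores (only for checking)

idx = rng.choice(n_total, size=n_annotated, replace=False)    # segments we pay to annotate
h, m = human_all[idx], metric_all[idx]

naive_estimate = h.mean()
c = np.cov(h, m)[0, 1] / m.var(ddof=1)                        # optimal control-variate coefficient
cv_estimate = h.mean() - c * (m.mean() - metric_all.mean())

print(f"true mean        {human_all.mean():.4f}")
print(f"naive sample     {naive_estimate:.4f}")
print(f"control variate  {cv_estimate:.4f}")
```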
In subjective decision-making, where decisions are based on contextual interpretation, Large Language Models (LLMs) can be integrated to present users with additional rationales to consider. The diversity of these rationales is mediated by the ability to consider the perspectives of different social actors. However, it remains unclear whether and how models differ in the distribution of perspectives they provide. We compare the perspectives taken by humans and different LLMs when assessing subtle sexism scenarios. We show that these perspectives can be classified within a finite set (perpetrator, victim, decision-maker), consistently present in argumentations produced by humans and LLMs, but in different distributions and combinations, demonstrating differences and similarities with human responses, and between models. We argue for the need to systematically evaluate LLMs' perspective-taking to identify the most suitable models for a given decision-making task. We discuss the implications for model evaluation.
LLM-as-a-Judge has been widely applied to evaluate and compare different LLM alignment approaches (e.g., RLHF and DPO). However, concerns regarding its reliability have emerged, due to LLM judges' biases and inconsistent decision-making. Previous research has developed evaluation frameworks to assess the reliability of LLM judges and their alignment with human preferences. However, the employed evaluation metrics often lack adequate explainability and fail to address LLM internal inconsistency. Additionally, existing studies inadequately explore the impact of various prompt templates when applying LLM-as-a-Judge methods, leading to potentially inconsistent comparisons between different alignment algorithms. In this work, we systematically evaluate LLM-as-a-Judge on alignment tasks by defining more theoretically interpretable evaluation metrics and explicitly mitigating LLM internal inconsistency from reliability metrics. We develop an open-source framework to evaluate, compare, and visualize the reliability and alignment of LLM judges, which facilitates practitioners in choosing LLM judges for alignment tasks. In the experiments, we examine the effects of diverse prompt templates on LLM-judge reliability and also demonstrate our developed framework by comparing various LLM judges on two common alignment datasets (i.e., TL;DR Summarization and HH-RLHF-Helpfulness). Our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre alignment level between the tested LLM judges and human evaluators.
Sycophantic response patterns in Large Language Models (LLMs) have been increasingly claimed in the literature. We review methodological challenges in measuring LLM sycophancy and identify five core operationalizations. Despite sycophancy being inherently human-centric, current research does not evaluate human perception. Our analysis highlights the difficulties in distinguishing sycophantic responses from related concepts in AI alignment and offers actionable recommendations for future research.
Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore if such an ability of the LLMs can be used as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to evaluate the texts in two NLP tasks: open-ended story generation and adversarial attacks. We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs. We also find that the results of LLM evaluation are stable over different formatting of the task instructions and the sampling algorithm used to generate the answer. We are the first to show the potential of using LLMs to assess the quality of texts and discuss the limitations and ethical considerations of LLM evaluation.
Research in interpretable machine learning proposes different computational and human-subject approaches to evaluate model saliency explanations. These approaches measure different qualities of explanations to achieve diverse goals in designing interpretable machine learning systems. In this paper, we propose a human attention benchmark for image and text domains using multi-layer human attention masks aggregated from multiple human annotators. We then present an evaluation study of model saliency explanations obtained using Grad-CAM and LIME techniques. We demonstrate our benchmark's utility for quantitative evaluation of model explanations by comparing it with human subjective ratings and with evaluations against ground-truth single-layer segmentation masks. Our study results show that our threshold-agnostic evaluation method with the human attention baseline is more effective than using single-layer object segmentation masks as ground truth. Our experiments also reveal user biases in the subjective rating of model saliency explanations.
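One way to realize a threshold-agnostic comparison between a model saliency map and aggregated human attention is to score pixels with ROC-AUC, keeping the saliency map continuous. The sketch below does this on random arrays standing in for real Grad-CAM or LIME maps and human masks; it is an illustration, not the benchmark's actual protocol.

```python
# Sketch: pixel-level ROC-AUC between a saliency map and a human attention mask.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
h, w = 32, 32
human_attention = rng.random((h, w))                     # fraction of annotators marking each pixel
saliency = human_attention + 0.5 * rng.random((h, w))    # an imperfect model explanation

# binarize the human mask at "majority of annotators", keep saliency continuous
labels = (human_attention > 0.5).astype(int).ravel()
auc = roc_auc_score(labels, saliency.ravel())
print(f"pixel-level ROC-AUC vs. human attention: {auc:.3f}")
```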
Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. This alignment requires a vast corpus of human feedback to learn a reward model, which is subsequently used to finetune language models. However, we have identified that the reward model often finds shortcuts to bypass its intended objectives, misleadingly assuming that humans prefer longer responses. The emergence of length bias often induces the model to favor longer outputs, yet it doesn't equate to an increase in helpful information within these outputs. In this paper, we propose an innovative solution, applying the Product-of-Experts (PoE) technique to separate reward modeling from the influence of sequence length. In our framework, the main expert concentrates on understanding human intents, while the biased expert targets the identification and capture of length bias. To further enhance the learning of bias, we introduce perturbations into the bias-focused expert, disrupting the flow of semantic information. Experimental results validate the effectiveness of our approach, indicating that language model performance is improved, irrespective of sequence length.
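The Product-of-Experts construction can be illustrated with a two-expert toy: a main expert that scores content quality and a bias expert that only sees length, whose Bradley-Terry preference probabilities are multiplied and renormalized during training. Everything below, including the hand-set scores, is an invented stand-in for the paper's learned reward networks.

```python
# Sketch: Product-of-Experts preference probability for reward modeling,
# separating a content-quality expert from a length-only bias expert.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_prob(r_a, r_b):
    """Bradley-Terry probability that response A is preferred over B."""
    return sigmoid(r_a - r_b)

def poe_preference(main_a, main_b, bias_a, bias_b):
    p_main = preference_prob(main_a, main_b)
    p_bias = preference_prob(bias_a, bias_b)
    joint_a = p_main * p_bias                     # product of experts for "A wins"
    joint_b = (1 - p_main) * (1 - p_bias)         # product of experts for "B wins"
    return joint_a / (joint_a + joint_b)          # renormalize

# A is shorter but genuinely better; B is longer.
main_quality = {"A": 2.0, "B": 0.5}                  # main expert (content quality)
length_score = {"A": np.log(120), "B": np.log(400)}  # bias expert (length only)

print("main expert alone  :", preference_prob(main_quality["A"], main_quality["B"]))
print("PoE during training:", poe_preference(main_quality["A"], main_quality["B"],
                                              length_score["A"], length_score["B"]))
```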
Organizations have recently started to adopt AI-based decision support tools to optimize human resource development practices, while facing various challenges in using AI in highly contextual and sensitive domains. We present a case study that aims to help professional assessors make decisions in human assessment, in which they conduct interviews with assessees and evaluate their suitability for certain job roles. Our workshop with two industrial assessors elucidated the difficulties they face (i.e., maintaining stable and non-subjective observation of assessees' behaviors) and derived requirements for AI systems (i.e., extracting nonverbal cues from interview videos in an interpretable manner). In response, we employed an unsupervised anomaly detection algorithm using multimodal behavioral features such as facial keypoints, body and head pose, and gaze. The algorithm extracts outlier scenes from the video based on behavioral features and indicates which features contribute to the outlierness. We first evaluated how the assessors would perceive the extracted cues and found that the algorithm is useful in suggesting scenes to which assessors should pay attention, thanks to its interpretability. Then, we developed an interface prototype incorporating the algorithm and had six assessors use it for their actual assessments. Their comments revealed the effectiveness of introducing unsupervised anomaly detection to enhance their sense of confidence and the objectivity of the assessment, along with potential use scenarios for such AI-based systems in human assessment. Our approach, which builds on the idea of separating observation and interpretation in human-AI collaboration, can facilitate human decision making in highly contextual domains, such as human assessment, while preserving assessors' trust in the system.
With the rise of AI systems in real-world applications comes the need for reliable and trustworthy AI. Explainable AI systems are an essential aspect of this. However, there is no agreed standard on how explainable AI systems should be assessed. Inspired by the Turing test, we introduce a human-centric assessment framework in which a leading domain expert accepts or rejects the solutions of an AI system and of another domain expert. By comparing the acceptance rates of the provided solutions, we can assess how the AI system performs compared to the domain expert, and whether the AI system's explanations (if provided) are human-understandable. This setup -- comparable to the Turing test -- can serve as a framework for a wide range of human-centric AI system assessments. We demonstrate this by presenting two instantiations: (1) an assessment that measures the classification accuracy of a system, with the option to incorporate label uncertainties; (2) an assessment in which the usefulness of provided explanations is determined in a human-centric manner.
In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: https://github.com/kevinlwei/human-baselines
Foundation models (FM), such as large language models (LLMs), which are large-scale machine learning (ML) models, have demonstrated remarkable adaptability in various downstream software engineering (SE) tasks, such as code completion, code understanding, and software development. As a result, FM leaderboards have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to perform effective FM selection. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations", or LBOps) and identifying potential pitfalls and areas for improvement ("leaderboard smells"). In this regard, we collect up to 1,045 FM leaderboards from five different sources: GitHub, Hugging Face Spaces, Papers With Code, spreadsheets, and independent platforms, to examine their documentation and engage in direct communication with leaderboard operators to understand their workflows. Through card sorting and negotiated agreement, we identify five distinct workflow patterns and develop a domain model that captures the key components and their interactions within these workflows. We then identify eight unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.
The growing integration of Large Language Models (LLMs) into critical societal domains has raised concerns about embedded biases that can perpetuate stereotypes and undermine fairness. Such biases may stem from historical inequalities in training data, linguistic imbalances, or adversarial manipulation. Despite mitigation efforts, recent studies show that LLMs remain vulnerable to adversarial attacks that elicit biased outputs. This work proposes a scalable benchmarking framework to assess LLM robustness to adversarial bias elicitation. Our methodology involves: (i) systematically probing models across multiple tasks targeting diverse sociocultural biases, (ii) quantifying robustness through safety scores using an LLM-as-a-Judge approach, and (iii) employing jailbreak techniques to reveal safety vulnerabilities. To facilitate systematic benchmarking, we release a curated dataset of bias-related prompts, named CLEAR-Bias. Our analysis, identifying DeepSeek V3 as the most reliable judge LLM, reveals that bias resilience is uneven, with age, disability, and intersectional biases among the most prominent. Some small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. However, no model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective across model families. We also find that successive LLM generations exhibit slight safety gains, while models fine-tuned for the medical domain tend to be less safe than their general-purpose counterparts.
Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using an LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's test set show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results on the high-complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% of ChatGPT's capacity on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM
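The core loop can be sketched as follows; `call_llm` is a hypothetical placeholder, and the single evolution prompt stands in for the several in-depth and in-breadth evolution operators (and the elimination of failed evolutions) that Evol-Instruct actually uses.

```python
# Minimal sketch of instruction evolution in the spirit of Evol-Instruct:
# an LLM repeatedly rewrites instructions into more complex variants, and the
# accumulated pool is later used to fine-tune a base model.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API here")

EVOLVE_PROMPT = (
    "Rewrite the following instruction so that it becomes more complex, "
    "e.g. by adding constraints, requiring multi-step reasoning, or deepening "
    "the topic, while keeping it answerable:\n\n{instruction}"
)

def evolve(seed_instructions: list[str], rounds: int = 3) -> list[str]:
    pool = list(seed_instructions)
    frontier = list(seed_instructions)
    for _ in range(rounds):
        frontier = [call_llm(EVOLVE_PROMPT.format(instruction=ins)) for ins in frontier]
        pool.extend(frontier)   # mixed pool of all complexity levels for fine-tuning
    return pool
```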
Developing intelligent pediatric consultation systems offers promising prospects for improving diagnostic efficiency, especially in China, where healthcare resources are scarce. Despite recent advances in Large Language Models (LLMs) for Chinese medicine, their performance is sub-optimal in pediatric applications due to inadequate instruction data and vulnerable training procedures. To address these issues, this paper builds PedCorpus, a high-quality dataset of over 300,000 multi-task instructions drawn from pediatric textbooks, guidelines, and knowledge graph resources to fulfil diverse diagnostic demands. Building on PedCorpus, we propose PediatricsGPT, the first Chinese pediatric LLM assistant built on a systematic and robust training pipeline. In the continuous pre-training phase, we introduce a hybrid instruction pre-training mechanism to mitigate the inconsistency between the model's internal knowledge and the injected medical knowledge during domain adaptation. Subsequently, full-parameter Supervised Fine-Tuning (SFT) is used to incorporate the general medical knowledge schema into the models. After that, we devise a direct following preference optimization to enhance the generation of pediatrician-like humanistic responses. In the parameter-efficient secondary SFT phase, a mixture of universal-specific experts strategy is presented to resolve the competency conflict between medical generalist ability and pediatric expertise mastery. Extensive results based on automatic metrics, GPT-4, and doctor evaluations on distinct doctor downstream tasks show that PediatricsGPT consistently outperforms previous Chinese medical LLMs. Our model and dataset will be open-sourced for community development.
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications. Current hallucination detection and mitigation datasets are limited in domains and sizes, and they struggle to scale due to prohibitive labor costs and the insufficient reliability of existing hallucination annotators. To facilitate the scalable oversight of LLM hallucinations, this paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset and improves the accuracy of the hallucination annotator. Based on the Expectation Maximization (EM) algorithm, in each iteration the framework first applies a hallucination annotation pipeline to annotate a scaled dataset and then trains a more accurate hallucination annotator on that dataset. The new hallucination annotator is adopted in the hallucination annotation pipeline used for the next iteration. Extensive experimental results demonstrate that the final hallucination annotator, with only 7B parameters, surpasses the performance of GPT-4 and obtains new state-of-the-art hallucination detection results on HaluEval and HalluQA by zero-shot inference. Such an annotator can not only evaluate the hallucination levels of various LLMs on the large-scale dataset but also help to mitigate hallucination in LLM generations, with the Natural Language Inference (NLI) metric increasing from 25% to 37% on HaluEval.
We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ensemble and oracle combination, and provide learning curves to help us understand which languages are more challenging. A number of difficult sentences are identified and investigated further with human annotation.
This paper presents the challenges in creating and managing large parallel corpora of 12 major Indian languages (soon to be extended to 23 languages) as part of a major consortium project funded by the Department of Information Technology (DIT), Govt. of India, and running in parallel at 10 different universities in India. In order to efficiently manage the process of creating and disseminating these huge corpora, the web-based annotation tool ILCIANN (Indian Languages Corpora Initiative Annotation Tool), with a reduced stand-alone version as well, has been developed. It was primarily developed for POS annotation and for managing corpus annotation by people with differing levels of competence working at locations physically far apart. In order to maintain consistency and standards in the creation of the corpora, it was necessary that everyone work on a common platform, which this tool provides.
Recent evaluations of Large Language Models (LLMs) have centered around testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, the evaluation of LLMs utilizing complex tools to finish multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence, thus it supports various LLM-generated API sequences. We measure 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy. We find three main error causes in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at https://github.com/gydpku/PPTC.
Evaluating the creative capabilities of large language models (LLMs) in complex tasks often requires human assessments that are difficult to scale. We introduce a novel, scalable methodology for evaluating LLM story generation by analyzing underlying social structures in narratives as signed character networks. To demonstrate its effectiveness, we conduct a large-scale comparative analysis using networks from over 1,200 stories, generated by four leading LLMs (GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash) and a human-written corpus. Our findings, based on network properties like density, clustering, and signed edge weights, show that LLM-generated stories consistently exhibit a strong bias toward tightly-knit, positive relationships, which aligns with findings from prior research using human assessment. Our proposed approach provides a valuable tool for evaluating limitations and tendencies in the creative storytelling of current and future LLMs.
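To illustrate the kind of metrics involved, the sketch below builds a tiny signed character network (with invented characters and valences) and computes density, clustering, and the mean signed edge weight with networkx; extracting characters and relationship valences from story text is the harder step and is not shown.

```python
# Minimal sketch of the network-based evaluation idea: nodes are characters,
# edge weights in [-1, 1] encode relationship valence for one story.
import networkx as nx

edges = [
    ("alice", "bob", 0.8),     # allies
    ("alice", "carol", 0.6),
    ("bob", "carol", 0.7),
    ("carol", "dan", -0.5),    # antagonism
]

G = nx.Graph()
G.add_weighted_edges_from(edges)

density = nx.density(G)
clustering = nx.average_clustering(G)   # unweighted clustering coefficient
mean_sign = sum(w for _, _, w in G.edges.data("weight")) / G.number_of_edges()

print(f"density={density:.2f} clustering={clustering:.2f} mean_signed_weight={mean_sign:.2f}")
```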
Using large language models (LLMs) to evaluate text quality has recently gained popularity. Some prior works explore the idea of using LLMs for evaluation, while they differ in some details of the evaluation process. In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details in the evaluation process change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Last, we reveal that asking the LLM to explain its own ratings consistently improves the correlation between ChatGPT's ratings and human ratings and achieves state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
Evaluating large language model (LLM)-based multi-agent systems remains a critical challenge, as these systems must exhibit reliable coordination, transparent decision-making, and verifiable performance across evolving tasks. Existing evaluation approaches often limit themselves to single-response scoring or narrow benchmarks, which lack stability, extensibility, and automation when deployed in enterprise settings at multi-agent scale. We present AEMA (Adaptive Evaluation Multi-Agent), a process-aware and auditable framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under human oversight. Compared to a single LLM-as-a-Judge, AEMA achieves greater stability, human alignment, and traceable records that support accountable automation. Our results on enterprise-style agent workflows simulated using realistic business scenarios demonstrate that AEMA provides a transparent and reproducible pathway toward responsible evaluation of LLM-based multi-agent systems.
As Large Language Models (LLMs) transition from static tools to autonomous agents, traditional evaluation benchmarks that measure performance on downstream tasks are becoming insufficient. These methods fail to capture the emergent social and cognitive dynamics that arise when agents communicate, persuade, and collaborate in interactive environments. To address this gap, we introduce a novel evaluation framework that uses multi-agent debate as a controlled "social laboratory" to discover and quantify these behaviors. In our framework, LLM-based agents, instantiated with distinct personas and incentives, deliberate on a wide range of challenging topics under the supervision of an LLM moderator. Our analysis, enabled by a new suite of psychometric and semantic metrics, reveals several key findings. Across hundreds of debates, we uncover a powerful and robust emergent tendency for agents to seek consensus, consistently reaching high semantic agreement (μ > 0.88) even without explicit instruction and across sensitive topics. We show that assigned personas induce stable, measurable psychometric profiles, particularly in cognitive effort, and that the moderator's persona can significantly alter debate outcomes by structuring the environment, a key finding for external AI alignment. This work provides a blueprint for a new class of dynamic, psychometrically grounded evaluation protocols designed for the agentic setting, offering a crucial methodology for understanding and shaping the social behaviors of the next generation of AI agents. We have released the code and results at https://github.com/znreza/multi-agent-LLM-eval-for-debate.
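One simple way to operationalise the reported semantic agreement, not necessarily the paper's exact metric, is the mean pairwise cosine similarity between embeddings of the agents' final statements; the sentence-transformers model named below is an arbitrary choice and the statements are invented.

```python
# Mean pairwise cosine similarity between agents' final positions as a rough
# proxy for "semantic agreement" in a debate.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

final_positions = [
    "We should adopt the policy with safeguards for affected workers.",
    "Adopting the policy is reasonable provided workers are protected.",
    "The policy is acceptable if it includes worker protections.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # arbitrary example model
emb = model.encode(final_positions, convert_to_tensor=True)

pairs = list(combinations(range(len(final_positions)), 2))
agreement = sum(float(util.cos_sim(emb[i], emb[j])) for i, j in pairs) / len(pairs)
print(f"mean pairwise agreement = {agreement:.2f}")
```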
As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the performance of large language models (LLMs). However, there is a lack of systematic analysis and guidance on determining the sufficiency of test data or selecting informative samples for evaluation. This paper introduces a certifiable and cost-efficient evaluation framework for LLMs. Our framework adapts to different evaluation objectives and outputs confidence intervals that contain true values with high probability. We use "test sample complexity" to quantify the number of test points needed for a certifiable evaluation and derive tight bounds on test sample complexity. Based on the developed theory, we develop a partition-based algorithm, named Cer-Eval, that adaptively selects test points to minimize the cost of LLM evaluation. Real-world experiments demonstrate that Cer-Eval can save 20% to 40% of test points across various benchmarks, while maintaining an estimation error level comparable to the current evaluation process and providing a 95% confidence guarantee.
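As a rough point of reference for the notion of test sample complexity, the non-adaptive Hoeffding bound below gives the number of i.i.d. test points needed to estimate a metric bounded in [0, 1] to within ±ε at confidence 1−δ; Cer-Eval's adaptive, partition-based selection is designed to improve on this kind of baseline.

```python
# Non-adaptive baseline for test sample complexity via the Hoeffding bound:
# n >= ln(2/delta) / (2 * eps^2) points suffice for a (1 - delta) confidence
# interval of half-width eps around the true value of a [0, 1]-bounded metric.
import math

def hoeffding_sample_size(eps: float, delta: float) -> int:
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(hoeffding_sample_size(eps=0.02, delta=0.05))   # ~4612 points for a 95% CI of width 0.04
```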
As AI technology continues to advance, the importance of human-AI collaboration becomes increasingly evident, with numerous studies exploring its potential in various fields. One vital field is data science, including feature engineering (FE), where both human ingenuity and AI capabilities play pivotal roles. Despite the existence of AI-generated recommendations for FE, there remains a limited understanding of how to effectively integrate and utilize humans' and AI's knowledge. To address this gap, we design a readily-usable prototype, human&AI-assisted FE in Jupyter notebooks. It harnesses the strengths of humans and AI to provide feature suggestions to users, seamlessly integrating these recommendations into practical workflows. Using the prototype as a research probe, we conducted an exploratory study to gain valuable insights into data science practitioners' perceptions, usage patterns, and their potential needs when presented with feature suggestions from both humans and AI. Through qualitative analysis, we discovered that the Creator of the feature (i.e., AI or human) significantly influences users' feature selection, and the semantic clarity of the suggested feature greatly impacts its adoption rate. Furthermore, our findings indicate that users perceive both differences and complementarity between features generated by humans and those generated by AI. Lastly, based on our study results, we derived a set of design recommendations for future human&AI FE design. Our findings show the collaborative potential between humans and AI in the field of FE.
Many important decisions in daily life are made with the help of advisors, e.g., decisions about medical treatments or financial investments. Whereas in the past, advice has often been received from human experts, friends, or family, advisors based on artificial intelligence (AI) have become more and more present nowadays. Typically, the advice generated by AI is judged by a human and either deemed reliable or rejected. However, recent work has shown that AI advice is not always beneficial, as humans have been shown to be unable to ignore incorrect AI advice, essentially representing an over-reliance on AI. Therefore, the aspired goal should be to enable humans not to rely on AI advice blindly but rather to distinguish its quality and act upon it to make better decisions. Specifically, that means that humans should rely on the AI in the presence of correct advice and self-rely when confronted with incorrect advice, i.e., establish appropriate reliance (AR) on AI advice on a case-by-case basis. Current research lacks a metric for AR. This prevents a rigorous evaluation of factors impacting AR and hinders further development of human-AI decision-making. Therefore, based on the literature, we derive a measurement concept of AR. We propose to view AR as a two-dimensional construct that measures the ability to discriminate advice quality and behave accordingly. In this article, we derive the measurement concept, illustrate its application and outline potential future research.
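A hedged sketch of how the two dimensions could be operationalised from case-level logs (initial human answer, AI advice, final answer, ground truth): how often the human switches to the AI when the AI is right and the human was initially wrong, and how often the human keeps their own correct answer when the AI is wrong. The field names and formulas are illustrative assumptions, not the article's measurement concept verbatim.

```python
# Two-dimensional reliance sketch: relative AI reliance (switching to correct
# advice) and relative self-reliance (resisting incorrect advice).
def reliance_scores(cases):
    # each case: {"initial": ..., "advice": ..., "final": ..., "truth": ...}
    switch_to_correct_ai = relevant_ai = keep_correct_self = relevant_self = 0
    for c in cases:
        human_right = c["initial"] == c["truth"]
        ai_right = c["advice"] == c["truth"]
        if ai_right and not human_right:              # AI could correct the human
            relevant_ai += 1
            switch_to_correct_ai += c["final"] == c["advice"]
        if human_right and not ai_right:              # human should resist the AI
            relevant_self += 1
            keep_correct_self += c["final"] == c["initial"]
    rair = switch_to_correct_ai / relevant_ai if relevant_ai else float("nan")
    rsr = keep_correct_self / relevant_self if relevant_self else float("nan")
    return rair, rsr
```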
Fully automatic semantic segmentation of highly specific semantic classes and complex shapes may not meet the accuracy standards demanded by scientists. In such cases, human-centered AI solutions, able to assist operators while preserving human control over complex tasks, are a good trade-off to speed up image labeling while maintaining high accuracy levels. TagLab is an open-source AI-assisted software for annotating large orthoimages which takes advantage of different degrees of automation; it speeds up image annotation from scratch through assisted tools, creates custom fully automatic semantic segmentation models, and, finally, allows the quick edits of automatic predictions. Since the orthoimages analysis applies to several scientific disciplines, TagLab has been designed with a flexible labeling pipeline. We report our results in two different scenarios, marine ecology, and architectural heritage.
The utilization of face masks is an essential healthcare measure, particularly during times of pandemics, yet it can present challenges in communication in our daily lives. To address this problem, we propose a novel approach known as the human-in-the-loop StarGAN (HL-StarGAN) face-masked speech enhancement method. HL-StarGAN comprises a discriminator, a classifier, a metric assessment predictor, and a generator that leverages an attention mechanism. The metric assessment predictor, referred to as MaskQSS, incorporates human participants in its development and serves as a "human-in-the-loop" module during the learning process of HL-StarGAN. The overall HL-StarGAN model was trained using an unsupervised learning strategy that simultaneously focuses on the reconstruction of the original clean speech and the optimization of human perception. To implement HL-StarGAN, we curated a face-masked speech database named "FMVD," which comprises recordings from 34 speakers in three distinct face-masked scenarios and a clean condition. We conducted subjective and objective tests on the proposed HL-StarGAN using this database. The test results are as follows: (1) MaskQSS successfully predicted the quality scores of face-masked voices, outperforming several existing speech assessment methods. (2) The integration of the MaskQSS predictor enhanced the ability of HL-StarGAN to transform face-masked voices into high-quality speech; this enhancement is evident in both objective and subjective tests, outperforming conventional StarGAN and CycleGAN-based systems.
Generative AI assistants offer significant potential to enhance productivity, streamline information access, and improve user experience in enterprise contexts. In this work, we present Summit Concierge, a domain-specific AI assistant developed for Adobe Summit. The assistant handles a wide range of event-related queries and operates under real-world constraints such as data sparsity, quality assurance, and rapid deployment. To address these challenges, we adopt a human-in-the-loop development workflow that combines prompt engineering, retrieval grounding, and lightweight human validation. We describe the system architecture, development process, and real-world deployment outcomes. Our experience shows that agile, feedback-driven development enables scalable and reliable AI assistants, even in cold-start scenarios.
With the fast development of Machine Translation (MT) systems, especially the new boost from Neural MT (NMT) models, MT output quality has reached a new level of accuracy. However, many researchers have criticised popular evaluation metrics such as BLEU for failing to correctly distinguish state-of-the-art NMT systems with respect to quality differences. In this short paper, we describe the design and implementation of a linguistically motivated human-in-the-loop evaluation metric looking into idiomatic and terminological Multi-word Expressions (MWEs). MWEs have been a bottleneck in many Natural Language Processing (NLP) tasks, including MT. MWEs can be used as one of the main factors to distinguish different MT systems by looking into their capabilities in recognising and translating MWEs in an accurate and meaning-equivalent manner.
Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs
In recent years, Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities, surpassing those seen in earlier language models. A particularly intriguing application of LLMs is their role as evaluators for texts produced by various generative models. In this study, we delve into the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models. Initially, we introduce an innovative approach for factuality assessment using LLMs. This entails employing a single LLM for the entirety of the question-answering-based factuality scoring process. Following this, we examine the efficacy of various LLMs in direct factuality scoring, benchmarking them against traditional measures and human annotations. Contrary to initial expectations, our results indicate a lack of significant correlations between factuality metrics and human evaluations, specifically for GPT-4 and PaLM-2. Notable correlations were only observed with GPT-3.5 across two factuality subcategories. These consistent findings across various factual error categories suggest a fundamental limitation in the current LLMs' capability to accurately gauge factuality.
We use the notion of oracle machines and reductions from computability theory to formalise different Human-in-the-loop (HITL) setups for AI systems, distinguishing between trivial human monitoring (i.e., total functions), single endpoint human action (i.e., many-one reductions), and highly involved human-AI interaction (i.e., Turing reductions). We then proceed to show that the legal status and safety of different setups vary greatly. We present a taxonomy to categorise HITL failure modes, highlighting the practical limitations of HITL setups. We then identify omissions in UK and EU legal frameworks, which focus on HITL setups that may not always achieve the desired ethical, legal, and sociotechnical outcomes. We suggest areas where the law should recognise the effectiveness of different HITL setups and assign responsibility in these contexts, avoiding human "scapegoating". Our work shows an unavoidable trade-off between attribution of legal responsibility and technical explainability. Overall, we show how HITL setups involve many technical design decisions and can be prone to failures outside the humans' control. Our formalisation and taxonomy open up a new analytic perspective on the challenges in creating HITL setups, helping inform AI developers and lawmakers on designing HITL setups to better achieve their desired outcomes.
High-quality human annotations are necessary for creating effective machine learning-driven stream processing systems. We study hybrid stream processing systems based on a Human-In-The-Loop Machine Learning (HITL-ML) paradigm, in which one or many human annotators and an automatic classifier (trained at least partially by the human annotators) label an incoming stream of instances. This is typical of many near-real-time social media analytics and web applications, including annotating social media posts during emergencies by digital volunteer groups. From a practical perspective, low-quality human annotations result in wrong labels for retraining automated classifiers and indirectly contribute to the creation of inaccurate classifiers. Considering human annotation as a psychological process allows us to address these limitations. We show that human annotation quality is dependent on the ordering of instances shown to annotators and can be improved by local changes in the instance sequence/order provided to the annotators, yielding a more accurate annotation of the stream. We adapt a theoretically-motivated human error framework of mistakes and slips for the human annotation task to study the effect of ordering instances (i.e., an "annotation schedule"). Further, we propose an error-avoidance approach to the active learning paradigm for stream processing applications robust to these likely human errors (in the form of slips) when deciding a human annotation schedule. We support the human error framework using crowdsourcing experiments and evaluate the proposed algorithm against standard baselines for active learning via extensive experimentation on classification tasks of filtering relevant social media posts during natural disasters.
Human-in-the-loop (HITL) frameworks are increasingly recognized for their potential to improve annotation accuracy in emotion estimation systems by combining machine predictions with human expertise. This study focuses on integrating a high-performing image-based emotion model into a HITL annotation framework to evaluate the collaborative potential of human-machine interaction and identify the psychological and practical factors critical to successful collaboration. Specifically, we investigate how varying model reliability and cognitive framing influence human trust, cognitive load, and annotation behavior in HITL systems. We demonstrate that model reliability and psychological framing significantly impact annotators' trust, engagement, and consistency, offering insights into optimizing HITL frameworks. Through three experimental scenarios with 29 participants--baseline model reliability (S1), fabricated errors (S2), and cognitive bias introduced by negative framing (S3)--we analyzed behavioral and qualitative data. Reliable predictions in S1 yielded high trust and annotation consistency, while unreliable outputs in S2 led to increased critical evaluations but also heightened frustration and response variability. Negative framing in S3 revealed how cognitive bias influenced participants to perceive the model as more relatable and accurate, despite misinformation regarding its reliability. These findings highlight the importance of both reliable machine outputs and psychological factors in shaping effective human-machine collaboration. By leveraging the strengths of both human oversight and automated systems, this study establishes a scalable HITL framework for emotion annotation and lays the foundation for broader applications in adaptive learning and human-computer interaction.
Reliable evaluation is crucial for advancing Automated Program Repair (APR), but prevailing benchmarks rely on execution-based evaluation methods (unit test pass@k), which fail to capture true patch validity. Determining validity can require costly manual annotation. To reduce this cost, we introduce a human-in-the-loop approach to LLM-based patch validity judgment. Inspired by the observation that human judgment is better aligned when using a shared rubric, we first employ an LLM to generate a per-bug rubric, followed by a one-time human review and optional refinement to this rubric, and then employ an LLM to judge patches using the refined rubric. We apply this approach to assign binary validity labels to patches for issues found by Google sanitizer tools. Our results show that this approach yields substantial agreement with human consensus (Cohen's kappa 0.75), high recall (0.94) and high precision (0.80), when considering patches that have unanimous agreement from 3 human raters on the validity labels. On the full dataset including patches where human raters disagree, we find this approach can still be further improved (Cohen's kappa 0.57, recall 0.93, precision 0.65) and identify possible future directions.
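Agreement statistics of the kind quoted above can be reproduced from two binary label vectors as in the sketch below, using scikit-learn's standard implementations; the labels are invented for illustration.

```python
# Agreement between LLM judgments and human-consensus patch-validity labels.
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # 1 = valid patch, 0 = invalid
llm   = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

print("kappa    ", round(cohen_kappa_score(human, llm), 2))
print("precision", round(precision_score(human, llm), 2))   # of patches the LLM calls valid, how many are
print("recall   ", round(recall_score(human, llm), 2))      # of truly valid patches, how many the LLM finds
```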
Misinformation threatens modern society by promoting distrust in science, changing narratives in public health, heightening social polarization, and disrupting democratic elections and financial markets, among a myriad of other societal harms. To address this, a growing cadre of professional fact-checkers and journalists provide high-quality investigations into purported facts. However, these largely manual efforts have struggled to match the enormous scale of the problem. In response, a growing body of Natural Language Processing (NLP) technologies have been proposed for more scalable fact-checking. Despite tremendous growth in such research, however, practical adoption of NLP technologies for fact-checking still remains in its infancy today. In this work, we review the capabilities and limitations of the current NLP technologies for fact-checking. Our particular focus is to further chart the design space for how these technologies can be harnessed and refined in order to better meet the needs of human fact-checkers. To do so, we review key aspects of NLP-based fact-checking: task formulation, dataset construction, modeling, and human-centered strategies, such as explainable models and human-in-the-loop approaches. Next, we review the efficacy of applying NLP-based fact-checking tools to assist human fact-checkers. We recommend that future research include collaboration with fact-checker stakeholders early on in NLP research, as well as incorporation of human-centered design practices in model development, in order to further guide technology development for human use and practical adoption. Finally, we advocate for more research on benchmark development supporting extrinsic evaluation of human-centered fact-checking technologies.
An essential aspect of evaluating Large Language Models (LLMs) is identifying potential biases. This is especially relevant considering the substantial evidence that LLMs can replicate human social biases in their text outputs and further influence stakeholders, potentially amplifying harm to already marginalized individuals and communities. Therefore, recent efforts in bias detection have invested in automated benchmarks and objective metrics such as accuracy (i.e., an LLM's output is compared against a predefined ground truth). Nonetheless, social biases can be nuanced, oftentimes subjective and context-dependent, where a situation is open to interpretation and there is no ground truth. While these situations can be difficult for automated evaluation systems to identify, human evaluators could potentially pick up on these nuances. In this paper, we discuss the role of human evaluation and subjective interpretation in augmenting automated processes when identifying biases in LLMs as part of a human-centred approach to evaluating these models.
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction were found to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP are not repeatable and/or not reproducible and/or too flawed to justify reproduction paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
HCI and NLP traditionally focus on different evaluation methods. While HCI involves a small number of people directly and deeply, NLP traditionally relies on standardized benchmark evaluations that involve a larger number of people indirectly. We present five methodological proposals at the intersection of HCI and NLP and situate them in the context of ML-based NLP models. Our goal is to foster interdisciplinary collaboration and progress in both fields by emphasizing what the fields can learn from each other.
Text simplification is essential for making public health information accessible to diverse populations, including those with limited health literacy. However, commonly used evaluation metrics in Natural Language Processing (NLP), such as BLEU, FKGL, and SARI, mainly capture surface-level features and fail to account for human-centered qualities like clarity, trustworthiness, tone, cultural relevance, and actionability. This limitation is particularly critical in high-stakes health contexts, where communication must be not only simple but also usable, respectful, and trustworthy. To address this gap, we propose the Human-Centered Readability Score (HCRS), a five-dimensional evaluation framework grounded in Human-Computer Interaction (HCI) and health communication research. HCRS integrates automatic measures with structured human feedback to capture the relational and contextual aspects of readability. We outline the framework, discuss its integration into participatory evaluation workflows, and present a protocol for empirical validation. This work aims to advance the evaluation of health text simplification beyond surface metrics, enabling NLP systems that align more closely with diverse users' needs, expectations, and lived experiences.
The "LLM-as-an-annotator" and "LLM-as-a-judge" paradigms employ Large Language Models (LLMs) as annotators, judges, and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure, the Alternative Annotator Test (alt-test), that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM annotators and judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans with closed-source LLMs (such as GPT-4o), outperforming the open-source LLMs we examine, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.
Large language models (LLMs) have shown high agreement with human raters across a variety of tasks, demonstrating potential to ease the challenges of human data collection. In computational social science (CSS), researchers are increasingly leveraging LLM annotations to complement slow and expensive human annotations. Still, guidelines for collecting and using LLM annotations, without compromising the validity of downstream conclusions, remain limited. We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected, with the goal of producing accurate statistical estimates and provably valid confidence intervals while reducing the number of human annotations needed. Our approach comes with safeguards against LLM annotations of poor quality, guaranteeing that the conclusions will be both valid and no less accurate than if we only relied on human annotations. We demonstrate the effectiveness of Confidence-Driven Inference over baselines in statistical estimation tasks across three CSS settings--text politeness, stance, and bias--reducing the needed number of human annotations by over 25% in each. Although we use CSS settings for demonstration, Confidence-Driven Inference can be used to estimate most standard quantities across a broad range of NLP problems.
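A simplified illustration of the underlying idea, debiasing cheap LLM labels with a small human-labeled subset on synthetic data; this toy version draws the human subset uniformly at random and applies a plain mean correction, whereas Confidence-Driven Inference additionally uses LLM confidence to decide which items humans should label and constructs provably valid confidence intervals.

```python
# Toy demonstration: correct an LLM-only prevalence estimate with a small
# human-labeled subset (humans treated as correct here).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
truth = rng.binomial(1, 0.4, n).astype(float)                        # unobserved gold labels
conf = rng.uniform(0.5, 1.0, n)                                      # LLM self-reported confidence
llm = np.where(rng.uniform(size=n) < conf, truth, 1 - truth)         # LLM more accurate when confident

budget = 300
human_idx = rng.choice(n, size=budget, replace=False)                # uniform random human subset (toy)
human = truth[human_idx]

naive = llm.mean()                                                   # LLM-only estimate of the prevalence
corrected = naive + (human - llm[human_idx]).mean()                  # debiasing correction from the subset
print(f"LLM-only: {naive:.3f}  corrected: {corrected:.3f}  true: {truth.mean():.3f}")
```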
The rise of online platforms exacerbated the spread of hate speech, demanding scalable and effective detection. However, the accuracy of hate speech detection systems heavily relies on human-labeled data, which is inherently susceptible to biases. While previous work has examined the issue, the interplay between the characteristics of the annotator and those of the target of the hate is still unexplored. We fill this gap by leveraging an extensive dataset with rich socio-demographic information of both annotators and targets, uncovering how human biases manifest in relation to the target's attributes. Our analysis surfaces the presence of widespread biases, which we quantitatively describe and characterize based on their intensity and prevalence, revealing marked differences. Furthermore, we compare human biases with those exhibited by persona-based LLMs. Our findings indicate that while persona-based LLMs do exhibit biases, these differ significantly from those of human annotators. Overall, our work offers new and nuanced results on human biases in hate speech annotations, as well as fresh insights into the design of AI-driven hate speech detection systems.
Large language models (LLMs) can label data faster and cheaper than humans for various NLP tasks. Despite their prowess, LLMs may fall short in understanding complex, sociocultural, or domain-specific contexts, potentially leading to incorrect annotations. Therefore, we advocate a collaborative approach where humans and LLMs work together to produce reliable and high-quality labels. We present MEGAnno+, a human-LLM collaborative annotation system that offers effective LLM agent and annotation management, convenient and robust LLM annotation, and exploratory verification of LLM labels by humans.
Traditional image annotation tasks rely heavily on human effort for object selection and label assignment, making the process time-consuming and prone to decreased efficiency as annotators experience fatigue after extensive work. This paper introduces a novel framework that leverages the visual understanding capabilities of large multimodal models (LMMs), particularly GPT, to assist annotation workflows. In our proposed approach, human annotators focus on selecting objects via bounding boxes, while the LMM autonomously generates relevant labels. This human-AI collaborative framework enhances annotation efficiency by reducing the cognitive and time burden on human annotators. By analyzing the system's performance across various types of annotation tasks, we demonstrate its ability to generalize to tasks such as object recognition, scene description, and fine-grained categorization. Our proposed framework highlights the potential of this approach to redefine annotation workflows, offering a scalable and efficient solution for large-scale data labeling in computer vision. Finally, we discuss how integrating LMMs into the annotation pipeline can advance bidirectional human-AI alignment, as well as the challenges of alleviating the "endless annotation" burden in the face of information overload by shifting some of the work to AI.
Do LLMs align with human perceptions of safety? We study this question via annotation alignment, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of r = 0.59 with the average annotator rating, higher than the median annotator's correlation with the average (r = 0.51). We show that larger datasets are needed to resolve whether LLMs exhibit disparities in how well they correlate with different demographic groups. Also, there is substantial idiosyncratic variation in correlation within groups, suggesting that race and gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
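The alignment statistic can be computed as in the sketch below from a ratings matrix; the numbers are invented, and the annotator-side comparison uses a leave-one-out average, which is one reasonable (but assumed) reading of "the median annotator's correlation with the average".

```python
# Pearson correlation of a model's safety ratings with the average human rating,
# compared against each annotator's leave-one-out correlation.
import numpy as np
from scipy.stats import pearsonr

human_ratings = np.array([          # rows = annotators, cols = conversations
    [1, 3, 4, 2, 5, 1],
    [2, 3, 5, 2, 4, 1],
    [1, 2, 4, 3, 5, 2],
])
gpt4_ratings = np.array([1, 3, 5, 2, 4, 1])

avg_human = human_ratings.mean(axis=0)
r_model, _ = pearsonr(gpt4_ratings, avg_human)

r_annotators = []
for i in range(human_ratings.shape[0]):
    others = np.delete(human_ratings, i, axis=0).mean(axis=0)   # average of the other annotators
    r_annotators.append(pearsonr(human_ratings[i], others)[0])

print(f"model vs. avg human: {r_model:.2f}  median annotator: {np.median(r_annotators):.2f}")
```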
This report synthesizes multiple limitations in the evaluation of large language models: traditional human evaluation faces serious cognitive biases, high costs, and scalability bottlenecks, while the alternative "LLM-as-a-Judge" paradigm, though efficient, introduces new risks such as length bias and narcissistic (self-preference) tendencies. The field's research focus is shifting from competition over single metrics toward building human-in-the-loop (HITL) collaboration frameworks, establishing expert evaluation protocols for specific high-stakes domains, and developing dynamic, fine-grained evaluation methodologies for agents and multimodal tasks.