Generative Psychometrics
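Methodological Research on LLMs as Synthetic Psychological Participants
Focuses on how LLMs can serve as digital twins of human participants or as synthetic participants, examining personality simulation, the construct validity of psychometric measurement, measurement invariance, and the potential to substitute for human participants in psychological research.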
- When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models(Afshin Khadangi, Hanna Marxen, Amir Sartipi, Igor Tchappi, Gilbert Fridgen, 2025, arXiv.org)
- Designing LLM-Agents with Personalities: A Psychometric Approach(Mu‐Hua Huang, Xijuan Zhang, Christopher J. Soto, James A. Evans, 2024, Knowledge UChicago. https://doi …)
- Not Yet: Large Language Models Cannot Replace Human Respondents for Psychometric Research(Pengda Wang, Huiqi Zou, Zihan Yan, Feng Guo, Tianjun Sun, Ziang Xiao, Bo Zhang, 2024, OSF Preprint: https://doi …)
- Applying Psychometrics to Large Language Model Simulated Populations: Recreating the HEXACO Personality Inventory Experiment with Generative Agents(Sarah Mercer, Daniel P. Martin, Phil Swatton, 2025, arXiv.org)
- Can LLMs Assess Personality? Validating Conversational AI for Trait Profiling(Andrius Matsenas, Anet Lello, Tõnis Lees, Hans Peep, Kim Lilii Tamm, 2026, arXiv.org)
- Psychometric Comparability of LLM-Based Digital Twins(Yufei Zhang, Zhihao Ma, 2025, arXiv.org)
- Decoding the Human Factor: High Fidelity Behavioral Prediction for Strategic Foresight(Ben Yellin, E. Ezra, M. Foreman, Shula Grinapol, 2026, arXiv.org)
- In Silico Development of Psychometric Scales: Feasibility of Representative Population Data Simulation with LLMs(Enrico Cipriani, Pavel Okopnyi, D. Menicucci, Simone Grassini, 2025, arXiv.org)
- A psychometric framework for evaluating and shaping personality traits in large language models(Gregory Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, P. Romero, Marwa Abdulhai, Aleksandra Faust, Maja J. Matarić, 2025, Nature Machine Intelligence)
- MindShift: Analyzing Language Models' Reactions to Psychological Prompts(Anton Vasiliuk, Irina Abdullaeva, Polina Druzhinina, Anton Razzhigaev, Andrey Kuznetsov, 2025, arXiv.org)
- Large Language Models as Simulative Agents for Neurodivergent Adult Psychometric Profiles(Francesco Chiappone, Davide Marocco, Nicola Milano, 2026, arXiv.org)
- Leveraging LLM respondents for item evaluation: A psychometric analysis(Yunting Liu, Shreya Bhandari, Zach A. Pardos, 2025, British Journal of Educational Technology)
- Evaluating Alignment of Behavioral Dispositions in LLMs(Amir Taubenfeld, Zorik Gekhman, Lior Nezry, Omri Feldman, N. Harris, Shashir Reddy, Romi Stella, Ariel Goldstein, Marian Croak, Yossi Matias, Amir Feder, 2026, arXiv.org)
- Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models(Haoran Ye, Tianze Zhang, Yuhang Xie, Liyuan Zhang, Yuanyi Ren, Xin Zhang, Guojie Song, 2025, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- AIPsychoBench: Understanding the Psychometric Differences Between LLMs and Humans(Wei Xie, Shuoyoucheng Ma, Zhenhua Wang, Xiaobing Sun, Kai Chen, Enze Wang, Wei Liu, Hanying Tong, 2026, Topics in Cognitive Science)
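Psychological Text Mining and Computational Inference of Cognitive Features
Uses LLMs to process and analyze unstructured human-generated text (such as social media posts and clinical records), automatically extracting psychological constructs and cognitive structures and inferring psychological traits.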
- GPT is an effective tool for multilingual psychological text analysis(Steve Rathje, Dan-Mircea Mirea, Ilia Sucholutsky, Raja Marjieh, Claire E. Robertson, Jay J. Van Bavel, 2024, Proceedings of the National Academy of Sciences)
- Contextualized Construct Representation: Leveraging Psychometric Scales to Advance Theory-Driven Text Analysis(Mohammad Atari, Ali Omrani, Morteza Dehghani, 2023, … https://doi. org/10.31234/osf. io …)
- Validating the use of large language models for psychological text classification(Hannah L. Bunt, Alex Goddard, T. Reader, Alex Gillespie, 2025, Frontiers in Social Psychology)
- A Computational Method to Reveal Psychological Constructs from Text Data(Alina Herderich, H. Harald Freudenthaler, David García, 2023, Psychological Methods)
- Psychometric Evaluation of Large Language Model Embeddings for Personality Trait Prediction(Julina Maharjan, Ruoming Jin, Jianfeng Zhu, D. Kenne, 2025, Journal of Medical Internet Research)
- Cognitive Structure Generation: From Educational Priors to Policy Optimization(Hengnian Gu, Zhifu Chen, Yuxin Chen, J. Zhou, Dongdai Zhou, 2025, arXiv.org)
- Large Language Models Can Infer Personality from Free-Form User Interactions(Heinrich Peters, Moran Cerf, S. C. Matz, 2024, arXiv.org)
- Ask, Answer, and Detect: Role-Playing LLMs for Personality Detection with Question-Conditioned Mixture-of-Experts(Yifan Lyu, Liang Zhang, 2025, arXiv.org)
- On Text-based Personality Computing: Challenges and Future Directions(Qixiang Fang, Anastasia Giachanou, Ayoub Bagheri, Laura Boeschoten, Erik–Jan van Kesteren, Mahdi Shafiee Kamalabad, Daniel L. Oberski, 2023, Findings of the Association for Computational Linguistics: ACL 2023)
- Optimizing the Landscape of LLM Embeddings with Dynamic Exploratory Graph Analysis for Generative Psychometrics: A Monte Carlo Study(Hudson Golino, 2026, arXiv.org)
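Clinical Mental Health Monitoring and Interactive Intervention
Concerns practical applications of LLMs in clinical psychology, including the digitization of psychological assessment, depression detection, maintenance of the therapeutic relationship in counseling, and crisis intervention support.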
- PAGE: A Modern Measure of Emotion Perception for Teamwork and Management Research(Ben Weidmann, Yixian Xu, 2024, arXiv.org)
- Understanding the Therapeutic Relationship between Counselors and Clients in Online Text-based Counseling using LLMs(Anqi Li, Yu Lu, Nirui Song, Shuai Zhang, Lizhi Ma, Zhenzhong Lan, 2024, Findings of the Association for Computational Linguistics: EMNLP 2024)
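- Application of Large-Model-Based Agents in Psychological Counseling for College Students(郭静, 王沛, 马胤哲, 陈路晰, 郭可, 胡彦熙, 刘荷, 2026, Advances in Psychological Science)
- A Review of Applications of Large Language Models in Mental Health(金约汗, 谭明环, 杨敏, 2025, Journal of Integration Technology)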
- Using Large Language Models to Detect Depression From User-Generated Diary Text Data as a Novel Approach in Digital Mental Health Screening: Instrument Validation Study(Daun Shin, Hyoseung Kim, Seunghwan Lee, Younhee Cho, Whanbo Jung, 2023, Journal of Medical Internet Research)
- Generative Psychometrics-An Emerging Frontier in Mental Health Measurement.(Isaac R. Galatzer-Levy, Nenad Tomasev, S. Chung, G. Williams, 2025, JAMA Psychiatry)
- Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines(Guifeng Deng, Shuying Rao, Tianyu Lin, Anlu Dai, Pan Wang, Junyi Xie, Haidong Song, Ke Zhao, Dongwu Xu, Zhengdong Cheng, Tao Li, Haiteng Jiang, 2025, arXiv.org)
- Reimagining patient-reported outcomes in the age of generative AI(Laurent Boyer, Sara Fernandes, P. Auquier, Bruno Falissard, T. Panch, 2025, npj Digital Medicine)
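Psychological Safety, Social Bias, and Value Assessment of LLMs
Applies psychometric frameworks to model auditing, evaluating LLMs' intrinsic biases, political leanings, social compliance, and safety under adversarial attack scenarios.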
- The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI(Dusan Bosnjakovic, 2026, arXiv.org)
- Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study(Kensuke Okada, Y. Furukawa, Kyosuke Bunji, 2026, arXiv.org)
- The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models(Giuseppe Canale, K. Thimmaraju, 2025, arXiv.org)
- Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation(Zehao Liu, Xi Lin, 2025, arXiv.org)
- neuralFOMO: Can LLMs Handle Being Second Best? Measuring Envy-Like Preferences in Multi-Agent Settings(Arnav Ramamoorthy, Shrey Dhorajiya, Ojas Pungalia, Rashi Upadhyay, Abhishek Mishra, H. Abhiram, Tejasvi Alladi, Sujan Yenuganti, Dhruv Kumar, 2025, arXiv.org)
- Value Portrait: Assessing Language Models' Values through Psychometrically and Ecologically Valid Items(Jongwook Han, Dongmin Choi, Woojung Song, Eun-Ju Lee, Yohan Jo, 2025, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- Artificial Authority: From Machine Minds to Political Alignments. An Experimental Analysis of Democratic and Autocratic Biases in Large-Language Models(Szymon Lukasik, Natalia Ożegalska-Lukasik, 2025, arXiv.org)
- Political Alignment in Large Language Models: A Multidimensional Audit of Psychometric Identity and Behavioral Bias(Adib Sakhawat, T. Islam, Takia Farhin, Syed Rifat Raiyan, Hasan Mahmud, Md. Kamrul Hasan, 2026, arXiv.org)
- PSYCHOACTIVE TRIGGERS AS A STIMULUS BATTERY FOR MEASURING LARGE LANGUAGE MODELS (LLMS): A BRIDGE BETWEEN PSYCHOMETRICS, CLINICAL PSYCHOLOGY, AND LLM ENGINEERING(Anatoliy Drobakha, Mykhailo Kalitkin, Kateryna Klymenko, Roman Nayda, Liudmyla Lahuta, O. Kostenko, 2026, Metaverse Science, Society and Law)
- Measuring Human and AI Values Based on Generative Psychometrics with Large Language Models(Haoran Ye, Yuhang Xie, Yuanyi Ren, Hanjun Fang, Xin Zhang, Guojie Song, 2025, Proceedings of the AAAI Conference on Artificial Intelligence)
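Application Paradigms and Comprehensive Reviews of Generative Psychometrics
Covers the construction of standardized generative assessment systems, cross-cultural research paradigms, and high-level reviews of the field's theoretical foundations and future trends.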
- Eternagram: Probing Player Attitudes Towards Climate Change Using a ChatGPT-driven Text-based Adventure(Suifang Zhou, L. Hendra, Qinshi Zhang, Jussi Holopainen, Ray Lc, 2024, Proceedings of the CHI Conference on Human Factors in Computing Systems)
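- Large Language Model Intervention for Psychological Resilience in Cyberbullying Victims: Targeting Self-Esteem and Self-Compassion (submission to the special issue on Stress, Resilience, and Health)(岸本鹏子, 郝熙鸣, 艾尼卡尔·艾斯卡尔, 夏雨飞, 白麒钰, 2026, Acta Psychologica Sinica)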
- A Unified Framework to Quantify Cultural Intelligence of AI(Sunipa Dev, V. Prabhakaran, R. Feman, A. Davani, Remi Denton, Charu Kalia, Piyawat L Kumjorn, Madhurima Maji, Rida Qadri, Negar Rostamzadeh, Renee Shelby, Romi Stella, Hayk Stepanyan, Erin MacMurray van Liemt, Aishwarya Verma, Oscar Wahltinez, E. Wornyo, Andrew Zaldivar, Saška Mojsilović, 2026, arXiv.org)
- Learning Context Matters: Measuring and Diagnosing Personalization Gaps in LLM-Based Instructional Design(Johaun Hatchett, D. B. Mallick, Brittany C. Bradford, Richard G. Baraniuk, 2026, arXiv.org)
- R.U.Psycho? Robust Unified Psychometric Testing of Language Models(Julian Schelb, Orr Borin, David Garcia, Andreas Spitz, 2025, arXiv.org)
- PsychoGAT: A Novel Psychological Measurement Paradigm through Interactive Fiction Games with LLM Agents(Qisen Yang, Z. Wang, Honghui Chen, Shenzhi Wang, Yifan Pu, Xin Gao, Wenhao Huang, Shiji Song, Gao Huang, 2024, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- Current Status and Challenges of Artificial Intelligence Technology Empowering the Development of Psychology(刘冬予, 骆方, 屠焯然, 饶思敬, 沈阳, 2023)
- Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality(Jana Jung, Marlene Lutz, Indira Sen, Markus Strohmaier, 2026, Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers))
- Generative Large Language Models Empower Psychometrics: Advantages, Challenges, and Applications(田雪涛, 周文杰, 骆方, 乔志宏, 丰怡, 2026, Advances in Psychological Science)
- Driving Generative Agents With Their Personality(Lawrence J. Klinkert, Stephanie Buongiorno, Corey Clark, 2024, arXiv.org)
- Self-assessment, Exhibition, and Recognition: a Review of Personality in Large Language Models(Zhiyuan Wen, Yu Yang, Jiannong Cao, Haoming Sun, Ruosong Yang, Shuaiqi Liu, 2024, arXiv.org)
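This report divides research on Generative Psychometrics into five dimensions: (1) at the methodological level, the construction of synthetic participants and the validation of their validity; (2) automated computation of cognition and traits based on text analysis; (3) psychological monitoring and intervention in clinical deployment; (4) auditing of the psychological safety and value biases of the models themselves; and (5) cross-domain integrative methodological frameworks. This structure systematically outlines the full research landscape from laboratory simulation to real-world clinical application.
54 related publications in total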
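Generative large language models (Generative LLMs, usually abbreviated as LLMs) are artificial intelligence models pretrained on large-scale corpora, and they bring unprecedented opportunities and challenges to psychometrics. By integrating the development trajectory of research at the intersection of artificial intelligence and psychology, this article summarizes the notable advantages of LLMs for empowering psychometrics, identifies the major challenges of applying LLMs in psychology, and proposes directions for LLM-based psychometric research. Specifically, LLMs can generate coherent natural-language text conditioned on context, which gives them the potential to change how traditional tests interact with respondents; LLMs achieve breakthroughs in processing very long texts and multimodal data, and their strong content comprehension allows comprehensive acquisition and analysis of respondents' psychological information; and LLMs facilitate real-time analysis and personalized feedback, promoting a shift from outcome-based to process-based evaluation. Although the practical application of LLMs faces challenges of stability, creativity, and extensibility, they show broad application prospects and research value in areas such as situational judgment test generation, assessment of collaborative problem-solving ability, intelligent diagnosis and treatment for mental health, and item quality analysis.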
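The application of large language models in mental health has become a core research direction at the intersection of artificial intelligence and clinical psychology. This review systematically surveys progress in the field along three dimensions: model characteristics and empirical evidence, clinical application scenarios, and technical development paths. Regarding model characteristics and empirical evidence, it analyzes the core properties of large language models and summarizes the empirical support for their suitability for diagnosing psychological symptoms and intervening in mental disorders. Regarding clinical application, it systematically summarizes practical cases and outcomes of large language models in scenarios such as the diagnosis of mental disorders, assessment of psychological states, virtual psychotherapy, and clinical decision support. Regarding technical development, it focuses on key advances in data construction, model capability enhancement, and dedicated evaluation methods for the mental health domain. Finally, it identifies core challenges that current research still faces, including the disconnect between diagnostic outputs and clinical practice, insufficient depth of therapy simulation, scarcity of high-quality annotated data, and a lack of validation for clinical translation, and it outlines future directions for clinical deployment and technical innovation.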
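Starting from data collection and analysis, this article examines trends and potential problems in applying artificial intelligence technology to psychological research; reviews and analyzes how AI technology has empowered the development and application of branches of psychology such as cognitive neuroscience, social and consumer psychology, psychopathology, and psychometrics; explains the technical support that AI provides for transforming research methods and paradigms in psychology; and discusses the limitations of data-driven research and the bias of big-data samples, as well as the current disciplinary paths for integrating AI with psychology and the prospects for future integration, with the aim of providing a reference for the deep integration, mutual empowerment, and coordinated development of psychology and AI technology.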
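The mental health challenges faced by college students are becoming increasingly complex, while the traditional university counseling model has certain limitations. To address this, the study proposes a novel technical framework that integrates psychology and artificial intelligence: by incorporating domain-specific counseling knowledge and data into a foundation large model, it builds an "assessment-counseling-supervision" multi-agent collaborative system composed of three types of counseling agents (assessor, counselor, and supervisor) together with a college-student agent. The system adopts a dual-loop mode of "inner-loop training and outer-loop service". In the inner-loop training stage, the assessor and counselor agents interact with the college-student agent in virtual scenarios to simulate the real counseling process, use feedback from the supervisor agent to optimize service strategies, and accumulate personalized counseling records and intervention experience across multiple therapeutic schools. In the outer-loop service stage, the counseling agents draw on the results of inner-loop training to provide professional and targeted psychological assessment and intervention services to real student clients. The system is expected to become an effective auxiliary tool for college student counseling and to support university mental health services.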
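This study targets self-esteem and self-compassion to explore the effects and pathways of large language model (LLM) interventions on the psychological resilience of cyberbullying victims. Study 1 (intervention effect test) used a randomized controlled design to compare three groups: self-esteem and self-compassion dialogue, psychoeducational dialogue, and psychoeducational reading. Results showed that both dialogue-based LLM groups significantly improved psychological resilience. Study 2 (pathway analysis) further used network intervention analysis to reveal the intervention pathways, finding that in addition to directly improving psychological resilience, the self-esteem and self-compassion dialogue intervention could target isolation, mindfulness, and social-dimension self-esteem, further enhancing the intervention effect through interactions among multiple dimensions of self-esteem and self-compassion. This study is the first to focus on self-esteem and self-compassion pathways, building an LLM-based dialogue intervention platform with empathic interaction capabilities that provides personalized psychological support for cyberbullying victims and offers a theoretical basis and practical path for developing intelligent psychological intervention tools.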
Human values and their measurement are long-standing interdisciplinary inquiry. Recent advances in AI have sparked renewed interest in this area, with large language models (LLMs) emerging as both tools and subjects of value measurement. This work introduces Generative Psychometrics for Values (GPV), an LLM-based, data-driven value measurement paradigm, theoretically grounded in text-revealed selective perceptions. The core idea is to dynamically parse unstructured texts into perceptions akin to static stimuli in traditional psychometrics, measure the value orientations they reveal, and aggregate the results. Applying GPV to human-authored blogs, we demonstrate its stability, validity, and superiority over prior psychological tools. Then, extending GPV to LLM value measurement, we advance the current art with 1) a psychometric methodology that measures LLM values based on their scalable and free-form outputs, enabling context-specific measurement; 2) a comparative analysis of measurement paradigms, indicating response biases of prior methods; and 3) an attempt to bridge LLM values and their safety, revealing the predictive power of different value systems and the impacts of various values on LLM safety. Through interdisciplinary efforts, we aim to leverage AI for next-generation psychometrics and psychometrics for value-aligned AI.
This Viewpoint discusses the use of generative artificial intelligence to measure mental health.
Generative artificial intelligence, particularly large language models, offers an opportunity to rethink how patient-reported outcomes (PROs) are assessed and implemented in health systems. Despite decades of psychometric and digital innovation, PROs remain conceptually limited and underused in both clinical practice and AI models. Rooted in top-down, predefined instruments and assumptions of unidimensionality, traditional PROs struggle to capture the fluctuating and multidimensional nature of lived health experiences. In contrast, generative AI supports bottom-up, narrative-based approaches that process language in a flexible and context-aware way. Our viewpoint supports two distinct directions: one that refines current psychometric models through generative artificial intelligence integration, and another that embraces a more disruptive shift toward language-native tools capable of synthesising patient narratives. Realising this potential will require addressing key challenges, including validation, clinical actionability, equity, and trust. Bridging these gaps could make PROs a true lever for more personalised, meaningful, and inclusive care.
… This paper provides the first psychometric analysis of the ability distribution of a variety … LLM-generated respondent to each human in our dataset. Based on the proportions of each LLM …
Large Language Models (LLMs) with hundreds of billions of parameters have exhibited human-like intelligence by learning from vast amounts of internet-scale data. However, the uninterpretability of large-scale neural networks raises concerns about the reliability of LLM. Studies have attempted to assess the psychometric properties of LLMs by borrowing concepts from human psychology to enhance their interpretability, but they fail to account for the fundamental differences between LLMs and humans. This results in high rejection rates when human scales are reused directly. Furthermore, these scales do not support the measurement of LLM psychological property variations in different languages. This paper introduces AIPsychoBench, a specialized benchmark tailored to assess the psychological properties of LLM. It uses a lightweight role-playing prompt to bypass LLM alignment, improving the average effective response rate from 70.12% to 90.40%. Meanwhile, the average biases are only 3.3% (positive) and 2.1% (negative), which are significantly lower than the biases of 9.8% and 6.9%, respectively, caused by traditional jailbreak prompts. Furthermore, among the total of 112 psychometric subcategories, the score deviations for seven languages compared to English ranged from 5% to 20.2% in 43 subcategories, providing the first comprehensive evidence of the linguistic impact on the psychometrics of LLM.
This research introduces a novel methodology for assigning quantifiable, controllable and psychometrically validated personalities to Large Language Models-Based Agents (Agents) using the Big Five personality framework. It seeks to overcome the constraints of human subject studies, proposing Agents as an accessible tool for social science inquiry. Through a series of four studies, this research demonstrates the feasibility of assigning psychometrically valid personality traits to Agents, enabling them to replicate complex human-like behaviors. The first study establishes an understanding of personality constructs and personality tests within the semantic space of an LLM. Two subsequent studies, using empirical and simulated data, illustrate the process of creating Agents and validate the results by showing strong correspondence between human and Agent answers to personality tests. The final study further corroborates this correspondence by using Agents to replicate known human correlations between personality traits and decision-making behaviors in scenarios involving risk-taking and ethical dilemmas, thereby validating the effectiveness of the psychometric approach to design Agents and its applicability to social and behavioral research.
Background: Recent advancements in large language models (LLMs) have generated significant interest in their potential for assessing psychological constructs, particularly personality traits. While prior research has explored LLMs' capabilities in zero-shot or few-shot personality inference, few studies have systematically evaluated LLM embeddings within a psychometric validity framework or examined their correlations with linguistic and emotional markers. Additionally, the comparative efficacy of LLM embeddings against traditional feature engineering methods remains underexplored, leaving gaps in understanding their scalability and interpretability for computational personality assessment. Objective: This study evaluates LLM embeddings for personality trait prediction through four key analyses: (1) performance comparison with zero-shot methods on PANDORA Reddit data, (2) psychometric validation and correlation with LIWC (Linguistic Inquiry and Word Count) and emotion features, (3) benchmarking against traditional feature engineering approaches, and (4) assessment of model size effects (OpenAI vs BERT vs RoBERTa). We aim to establish LLM embeddings as a psychometrically valid and efficient alternative for personality assessment. Methods: We conducted a multistage analysis using 1 million Reddit posts from the PANDORA Big Five personality dataset. First, we generated text embeddings using 3 LLM architectures (RoBERTa, BERT, and OpenAI) and trained a custom bidirectional long short-term memory model for personality prediction. We compared this approach against zero-shot inference using prompt-based methods. Second, we extracted psycholinguistic features (LIWC categories and National Research Council emotions) and performed feature engineering to evaluate potential performance enhancements. Third, we assessed the psychometric validity of LLM embeddings: reliability using Cronbach α, and convergent validity by examining correlations between embeddings and established linguistic markers. Finally, we performed traditional feature engineering on static psycholinguistic features to assess performance under different settings. Results: LLM embeddings trained using simple deep learning techniques significantly outperform zero-shot approaches, on average by 45% across all personality traits. Although psychometric validation tests indicate moderate reliability, with an average Cronbach α of 0.63, correlation analyses show strong associations with key linguistic or emotional markers; openness correlates highly with social (r=0.53), conscientiousness with linguistic (r=0.46), extraversion with social (r=0.41), agreeableness with pronoun usage (r=0.40), and neuroticism with politics-related text (r=0.63). Despite adding advanced feature engineering on linguistic features, the performance did not improve, suggesting that LLM embeddings inherently capture key linguistic features. Furthermore, our analyses demonstrated efficacy on larger model size with a computational cost trade-off. Conclusions: Our findings demonstrate that LLM embeddings offer a robust alternative to zero-shot methods in personality trait analysis, capturing key linguistic patterns without requiring extensive feature engineering. The correlation between established psycholinguistic markers and the performance trade-off with computational cost provides a hint for future computational linguistic work targeting LLM for personality assessment. Further research should explore fine-tuning strategies to enhance psychometric validity.
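The reliability figure reported in this abstract (an average Cronbach α of 0.63) rests on a standard formula, α = k/(k−1) · (1 − Σ item variances / variance of the total score). The following is a minimal sketch of that textbook formula on a respondents-by-items score matrix. It is not the authors' pipeline: the abstract does not say how embedding-based predictions were arranged into items, and the function name and toy data below are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    scores: list of rows, one per respondent, each holding k item scores.
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of the summed scale score across respondents.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("scale scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data (assumption): 4 respondents x 3 items.
toy_scores = [
    [1, 2, 1],
    [2, 3, 2],
    [4, 4, 5],
    [5, 5, 4],
]
alpha = cronbach_alpha(toy_scores)
```

Values near 1 indicate that the items move together; a value such as 0.63 is conventionally read as moderate internal consistency, which matches the abstract's own wording.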
Evaluating large language models (LLMs) in psychologically sensitive and human-centered domains faces two persistent challenges. First, conventional benchmarks capture instrumental capabilities but often fail to represent model behavior in open-ended dialogue where emotional context, conflict, ambiguity, and user safety define quality. Second, LLM outputs can be unstable across re-runs and highly sensitive to prompt phrasing, undermining reproducibility and cross-model comparisons [1]. This article introduces an applied framework of psychoactive triggers: standardized textual stimuli designed to evoke systematic shifts in response style, narrative coherence, explanatory stance, empathy calibration, and risk regulation. Psychoactive triggers are treated as an analogue of psychometric items adapted to LLMs: each trigger carries a controlled psychological load (e.g., threat, shame, guilt, control, intimacy, autonomy), allowing measurement of stable behavioral patterns rather than binary correctness. The framework is illustrated using the PersonaMatrix ecosystem, where trigger batteries are applied in multiple measurement waves. A four-class metric taxonomy is proposed, with this paper focusing on Class I metrics—reproducibility and stability (RSI/IDS/RCS)—using a single PersonaMatrix test, “What Is My Character Type?” (TestPersona). Written at the intersection of LLM research and clinical psychology, the article provides clinical rationale and ethical constraints for safe deployment of psychologically loaded evaluations.
Multiple studies have claimed that artificial intelligence (AI), particularly large language models (LLMs), can simulate human-like responses on various psychological tasks such that AI may replace human respondents for social science studies. However, this claim may be premature because of limitations in the design and evaluation metrics of previous studies. The present study aimed to provide a comprehensive evaluation of this claim, focusing on LLMs, by comparing six types of LLM-generated responses and human responses to the Big Five Inventory-2 (BFI-2) and the HEXACO-100 personality inventory. While previous research has primarily highlighted similarities between LLM-generated responses and human responses at the broad personality domain level in terms of descriptive statistics (mean and standard deviation), we took a closer look by first comparing descriptive statistics at the item, facet, and domain levels. Then, we performed a comprehensive psychometric analysis (e.g., model fit, factor loadings, inter-factor correlations) of LLM-generated responses to examine the degree to which LLM-generated responses produced similar results as those produced by human responses. Our findings indicated that although LLMs perform well in replicating broad-level patterns, they fall short at the item level, where subtle human differences are more accurately captured, and significant psychometric challenges remain when using LLM-generated responses. Additionally, we explore the influence of social desirability on LLM-generated responses and apply logistic regression to differentiate between LLM and human responses. We emphasize the importance of rigorous validation and adherence to psychometric principles when using LLMs for psychological research.
The advent of large language models (LLMs) has revolutionized natural language processing, enabling the generation of coherent and contextually relevant human-like text. As LLMs increasingly power conversational agents used by the general public worldwide, the synthetic personality traits embedded in these models by virtue of training on large amounts of human data are becoming increasingly important to evaluate. The style in which LLMs respond can mimic different human personality traits. Here, as these patterns can be a key factor determining the effectiveness of communication, we present a comprehensive psychometric methodology for administering and validating personality tests on widely used LLMs, as well as for shaping personality in the generated text of such LLMs. Applying this method to 18 LLMs, we found that: personality measurements in the outputs of some LLMs under specific prompting configurations are reliable and valid; evidence of reliability and validity of synthetic LLM personality is stronger for larger and instruction-fine-tuned models; and personality in LLM outputs can be shaped along desired dimensions to mimic specific human personality profiles. We discuss the application and ethical implications of the measurement and shaping method, in particular regarding responsible artificial intelligence. Serapio-García, Safdari and colleagues develop a method based on psychometric tests to measure and validate personality-like traits in LLMs. Large, instruction-tuned models give reliable personality measurement results, and specific personality profiles can be mimicked in downstream tasks.
… (2) Ecological validity: We compare an LLM’s psychometric test score and its behavior in realworld downstream tasks. Downstream tasks are selected based on the underlying …
Psychological measurement is essential for mental health, self-understanding, and personal development. Traditional methods, such as self-report scales and psychologist interviews, often face challenges with engagement and accessibility. While game-based and LLM-based tools have been explored to improve user interest and automate assessment, they struggle to balance engagement with generalizability. In this work, we propose PsychoGAT (Psychological Game AgenTs) to achieve a generic gamification of psychological assessment. The main insight is that powerful LLMs can function both as adept psychologists and innovative game designers. By incorporating LLM agents into designated roles and carefully managing their interactions, PsychoGAT can transform any standardized scales into personalized and engaging interactive fiction games. To validate the proposed method, we conduct psychometric evaluations to assess its effectiveness and employ human evaluators to examine the generated content across various psychological constructs, including depression, cognitive distortions, and personality traits. Results demonstrate that PsychoGAT serves as an effective assessment tool, achieving statistically significant excellence in psychometric metrics such as reliability, convergent validity, and discriminant validity. Moreover, human evaluations confirm PsychoGAT's enhancements in content coherence, interactivity, interest, immersion, and satisfaction.
… Measuring human and ai values based on generative psychometrics with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39. …
The importance of benchmarks for assessing the values of language models has been pronounced due to the growing need of more authentic, human-aligned responses. However, existing benchmarks rely on human or machine annotations that are vulnerable to value-related biases. Furthermore, the tested scenarios often diverge from real-world contexts in which models are commonly used to generate text and express values. To address these issues, we propose the Value Portrait benchmark, a reliable framework for evaluating LLMs' value orientations with two key characteristics. First, the benchmark consists of items that capture real-life user-LLM interactions, enhancing the relevance of assessment results to real-world LLM usage. Second, each item is rated by human subjects based on its similarity to their own thoughts, and correlations between these ratings and the subjects' actual value scores are derived. This psychometrically validated approach ensures that items strongly correlated with specific values serve as reliable items for assessing those values. Through evaluating 44 LLMs with our benchmark, we find that these models prioritize Benevolence, Security, and Self-Direction values while placing less emphasis on Tradition, Power, and Achievement values. Also, our analysis reveals biases in how LLMs perceive various demographic groups, deviating from real human data.
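The item-validation step described in this abstract (correlating participants' similarity ratings for an item with their measured value scores, then keeping items that correlate strongly) can be sketched as follows. This is an illustration under stated assumptions: the abstract does not give the correlation statistic or the cutoff, so Pearson's r and the 0.3 threshold are placeholders, and all names and data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant sequence carries no correlational signal
    return sxy / sqrt(sxx * syy)


def select_items(item_ratings, value_scores, threshold=0.3):
    """Keep items whose ratings correlate with the value score.

    item_ratings: {item_id: [rating per participant]}
    value_scores: [value score per participant], same participant order
    threshold: assumed cutoff, not taken from the source
    """
    kept = {}
    for item_id, ratings in item_ratings.items():
        r = pearson_r(ratings, value_scores)
        if r >= threshold:
            kept[item_id] = r
    return kept


# Hypothetical toy data (assumption): 4 participants, 3 candidate items.
ratings = {"item_a": [1, 2, 3, 4], "item_b": [4, 3, 2, 1], "item_c": [2, 2, 2, 2]}
benevolence = [1, 2, 3, 4]
kept = select_items(ratings, benevolence)
```

In this toy case only `item_a` survives: `item_b` is negatively related to the value score and `item_c` does not vary across participants.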
When starting to formalize psychological constructs, researchers traditionally rely on two distinct approaches: The quantitative approach, which defines constructs as part of a testable theory based on prior research and domain knowledge often deploying self-report questionnaires, or the qualitative approach, which gathers data mostly in the form of text and bases construct definitions on exploratory analyses. Quantitative research might lead to an incomplete understanding of the construct, while qualitative research is limited due to challenges in the systematic data processing, especially at large scale. We present a new computational method that combines the comprehensiveness of qualitative research and the scalability of quantitative analyses to define psychological constructs from semi-structured text data. Based on structured questions, participants are prompted to generate sentences reflecting instances of the construct of interest. We apply computational methods to calculate embeddings as numerical representations of the sentences, which we then run through a clustering algorithm to arrive at groupings of sentences as psychologically-relevant classes. The method includes steps for the measurement and correction of bias introduced by the data generation, and the assessment of cluster validity according to human judgment. We demonstrate the applicability of our method on an example from emotion regulation. Based on short descriptions of emotion regulation attempts collected through an open-ended situational judgment test, we use our method to derive classes of emotion regulation strategies. Our approach shows how machine learning and psychology can be combined to provide new perspectives on the conceptualization of psychological processes.
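The embed-then-cluster step in this abstract can be illustrated with a small, dependency-free sketch. Two things here are assumptions rather than the authors' method: the abstract does not name the clustering algorithm, so plain k-means stands in for it, and the 2-D vectors below are hypothetical stand-ins for real sentence embeddings, which would have hundreds of dimensions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids (deterministic)."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))
                ))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy "embeddings" (assumption): two well-separated groups.
embeddings = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
labels, centroids = kmeans(embeddings, [embeddings[0], embeddings[2]])
```

The resulting groups would then go to human judges, since the abstract describes assessing cluster validity according to human judgment rather than relying on the algorithm alone.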
Significance Many fields—including psychology, sociology, communications, political science, and computer science—use computational methods to analyze text data. However, existing text analysis methods have a number of shortcomings. Dictionary methods, while easy to use, are often not very accurate when compared to recent methods. Machine learning models, while more accurate, can be difficult to train and use. We demonstrate that the large-language model GPT is capable of accurately detecting various psychological constructs (as judged by manual annotators) in text across 12 languages, using simple prompts and no additional training data. GPT thus overcomes the limitations present in existing methods. GPT is also effective in several lesser-spoken languages, which could facilitate text analysis research from understudied contexts.
Background Depressive disorders have substantial global implications, leading to various social consequences, including decreased occupational productivity and a high disability burden. Early detection and intervention for clinically significant depression have gained attention; however, the existing depression screening tools, such as the Center for Epidemiologic Studies Depression Scale, have limitations in objectivity and accuracy. Therefore, researchers are identifying objective indicators of depression, including image analysis, blood biomarkers, and ecological momentary assessments (EMAs). Among EMAs, user-generated text data, particularly from diary writing, have emerged as a clinically significant and analyzable source for detecting or diagnosing depression, leveraging advancements in large language models such as ChatGPT. Objective We aimed to detect depression based on user-generated diary text through an emotional diary writing app using a large language model (LLM). We aimed to validate the value of the semistructured diary text data as an EMA data source. Methods Participants were assessed for depression using the Patient Health Questionnaire and suicide risk was evaluated using the Beck Scale for Suicide Ideation before starting and after completing the 2-week diary writing period. The text data from the daily diaries were also used in the analysis. The performance of leading LLMs, such as ChatGPT with GPT-3.5 and GPT-4, was assessed with and without GPT-3.5 fine-tuning on the training data set. The model performance comparison involved the use of chain-of-thought and zero-shot prompting to analyze the text structure and content. Results We used 428 diaries from 91 participants; GPT-3.5 fine-tuning demonstrated superior performance in depression detection, achieving an accuracy of 0.902 and a specificity of 0.955. 
However, the balanced accuracy was the highest (0.844) for GPT-3.5 without fine-tuning and prompt techniques; it displayed a recall of 0.929. Conclusions Both GPT-3.5 and GPT-4.0 demonstrated relatively reasonable performance in recognizing the risk of depression based on diaries. Our findings highlight the potential clinical usefulness of user-generated text data for detecting depression. In addition to measurable indicators, such as step count and physical activity, future research should increasingly emphasize qualitative digital expression.
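The metrics reported above (accuracy, specificity, recall, balanced accuracy) all derive from one confusion matrix. The following is a minimal sketch of how they relate; the counts are hypothetical and are not taken from the study.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute common screening metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)             # sensitivity: share of true cases detected
    specificity = tn / (tn + fp)        # share of non-cases correctly cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts for an imbalanced sample (few depressed participants).
metrics = screening_metrics(tp=13, fp=10, tn=60, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because balanced accuracy averages recall and specificity, a model can lead on plain accuracy in an imbalanced sample while another leads on balanced accuracy, which matches the pattern the abstract reports.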
Over the past decades, text-analysis methods have been slowly integrated into the toolbox of methods used to reliably measure psychological constructs. Yet, many of the existing computational methods in psychological text analysis remain atheoretical and lack the interpretability that social sciences are accustomed to and desire. Here, we introduce a novel method for theory-driven text analysis by bridging the power of contextual language models and common psychometric scales. The new technique, which we call Contextualized Construct Representation (CCR), retains high levels of interpretability and top-down flexibility, but makes use of state-of-the-art language models developed in natural language processing (NLP). CCR is a flexible technique that will be able to adapt to the continuously progressing set of tools for language modeling. We discuss how our proposed technique quantifies psychological information in textual data, and demonstrate in two studies (N = 2,996) that CCR outperforms other top-down methods (i.e., word-counting and word-embedding representations) in predicting an array of psychological outcomes common in social and personality psychology, including moral values, the need for cognition, political ideology, strength of norms, and cultural orientation. We provide an accompanying R package, Python library, and develop an interface for researchers to conveniently use CCR in their research.
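One plausible reading of the idea above is that a text is scored by how close its embedding lies to the embeddings of a psychometric scale's items. The sketch below assumes that reading. The three-dimensional vectors are hypothetical stand-ins for contextual embeddings, and `construct_loading` is an illustrative name rather than part of the CCR package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def construct_loading(text_vec, item_vecs):
    """Average similarity of one text embedding to all scale-item embeddings."""
    return sum(cosine(text_vec, item) for item in item_vecs) / len(item_vecs)


# Hypothetical embeddings: two questionnaire items and two participant texts.
item_vecs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
on_topic_text = [0.9, 0.3, 0.1]
off_topic_text = [0.0, 0.1, 1.0]

on_score = construct_loading(on_topic_text, item_vecs)
off_score = construct_loading(off_topic_text, item_vecs)
print(f"on-topic: {on_score:.3f}, off-topic: {off_score:.3f}")
```

In practice the vectors would come from a sentence-embedding model, and the resulting scores would be validated against questionnaire responses.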
Robust therapeutic relationships between counselors and clients are fundamental to counseling effectiveness. The assessment of therapeutic alliance is well-established in traditional face-to-face therapy but may not directly translate to text-based settings. With millions of individuals seeking support through online text-based counseling, understanding the relationship in such contexts is crucial. In this paper, we present an automatic approach using large language models (LLMs) to understand the development of therapeutic alliance in text-based counseling. We adapt a theoretically grounded framework specifically to the context of online text-based counseling and develop comprehensive guidelines for characterizing the alliance. We collect a comprehensive counseling dataset and conduct multiple expert evaluations on a subset based on this framework. Our LLM-based approach, combined with guidelines and simultaneous extraction of supportive evidence underlying its predictions, demonstrates effectiveness in identifying the therapeutic alliance. Through further LLM-based evaluations on additional conversations, our findings underscore the challenges counselors face in cultivating strong online relationships with clients. Furthermore, we demonstrate the potential of LLM-based feedback mechanisms to enhance counselors' ability to build relationships, supported by a small-scale proof-of-concept.
Conventional methods of assessing attitudes towards climate change are limited in capturing authentic opinions, primarily stemming from a lack of context-specific assessment strategies and an overreliance on simplistic surveys. Game-based Assessments (GBA) have demonstrated the ability to overcome these issues by immersing participants in engaging gameplay within carefully crafted, scenario-based environments. Concurrently, advancements in AI and Natural Language Processing (NLP) show promise in enhancing the gamified testing environment, achieving this by generating context-aware, human-like dialogues that contribute to a more natural and effective assessment. Our study introduces a new technique for probing climate change attitudes by actualizing a GPT-driven chatbot system in harmony with a game design depicting a futuristic climate scenario. The correlation analysis reveals an assimilation effect, where players’ post-game climate awareness tends to align with their in-game perceptions. Key predictors of pro-climate attitudes are identified as traits like ‘Openness’ and ‘Agreeableness’, and a preference for democratic values.
Large language models (LLMs) are being used to classify texts into categories informed by psychological theory (“psychological text classification”). However, the use of LLMs in psychological text classification requires validation, and it remains unclear exactly how psychologists should prompt and validate LLMs for this purpose. To address this gap, we examined the potential of using LLMs for psychological text classification, focusing on ways to ensure validity. We employed OpenAI's GPT-4o to classify (1) reported speech in online diaries, (2) other-initiations of conversational repair in Reddit dialogues, and (3) harm reported in healthcare complaints submitted to NHS hospitals and trusts. Employing a two-stage methodology, we developed and tested the validity of the prompts used to instruct GPT-4o using manually labeled data (N = 1,500 for each task). First, we iteratively developed three types of prompts using one-third of each manually coded dataset, examining their semantic validity, exploratory predictive validity, and content validity. Second, we performed a confirmatory predictive validity test on the final prompts using the remaining two-thirds of each dataset. Our findings contribute to the literature by demonstrating that LLMs can serve as valid coders of psychological phenomena in text, on the condition that researchers work with the LLM to secure semantic, predictive, and content validity. They also demonstrate the potential of using LLMs in rapid and cost-effective iterations over big qualitative datasets, enabling psychologists to explore and iteratively refine their concepts and operationalizations during manual coding and classifier development. Accordingly, as a secondary contribution, we demonstrate that LLMs enable an intellectual partnership with the researcher, defined by a synergistic and recursive text classification process where the LLM's generative nature facilitates validity checks. 
We argue that using LLMs for psychological text classification may signify a paradigm shift toward a novel, iterative approach that may improve the validity of psychological concepts and operationalizations.
Qixiang Fang, Anastasia Giachanou, Ayoub Bagheri, Laura Boeschoten, Erik-Jan van Kesteren, Mahdi Shafiee Kamalabad, Daniel Oberski. Findings of the Association for Computational Linguistics: ACL 2023. 2023.
As generative AI technologies are increasingly being launched across the globe, assessing their competence to operate in different cultural contexts is exigently becoming a priority. While recent years have seen numerous and much-needed efforts on cultural benchmarking, these efforts have largely focused on specific aspects of culture and evaluation. While these efforts contribute to our understanding of cultural competence, a unified and systematic evaluation approach is needed for us as a field to comprehensively assess diverse cultural dimensions at scale. Drawing on measurement theory, we present a principled framework to aggregate multifaceted indicators of cultural capabilities into a unified assessment of cultural intelligence. We start by developing a working definition of culture that includes identifying core domains of culture. We then introduce a broad-purpose, systematic, and extensible framework for assessing cultural intelligence of AI systems. Drawing on theoretical framing from psychometric measurement validity theory, we decouple the background concept (i.e., cultural intelligence) from its operationalization via measurement. We conceptualize cultural intelligence as a suite of core capabilities spanning diverse domains, which we then operationalize through a set of indicators designed for reliable measurement. Finally, we identify the considerations, challenges, and research pathways to meaningfully measure these indicators, specifically focusing on data collection, probing strategies, and evaluation metrics.
As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies: the "prevailing mindsets" embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory, specifically latent trait estimation under ordinal uncertainty, to quantify these tendencies without relying on ground-truth labels. Utilizing forced-choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation-invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item-level framing drives high variance, a persistent "lab signal" accounts for significant behavioral clustering. These findings demonstrate that in "locked-in" provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.
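The provider-level clustering described above can be illustrated with an intraclass correlation. The sketch below uses the classical one-way ANOVA estimator ICC(1) for balanced groups as a simpler stand-in for the paper's mixed-model analysis. The provider groupings and scores are hypothetical.

```python
def icc1(groups):
    """One-way random-effects ICC(1) for balanced groups of scores."""
    k = len(groups[0])                      # observations per group
    if any(len(g) != k for g in groups):
        raise ValueError("this sketch only handles balanced groups")
    n = len(groups)                         # number of groups (e.g. providers)
    grand_mean = sum(sum(g) for g in groups) / (n * k)
    group_means = [sum(g) / k for g in groups]
    # Between-group and within-group mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical sycophancy scores: three providers, three models each.
scores_by_provider = [
    [0.80, 0.90, 0.85],
    [0.20, 0.30, 0.25],
    [0.50, 0.55, 0.60],
]
lab_signal = icc1(scores_by_provider)
print(f"ICC(1) = {lab_signal:.3f}")
```

A value near 1 means most of the variance sits between providers rather than between models of the same provider, which is the kind of pattern the abstract calls a lab signal.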
The adoption of generative AI in education has accelerated dramatically in recent years, with Large Language Models (LLMs) increasingly integrated into learning environments in the hope of providing personalized support that enhances learner engagement and knowledge retention. However, truly personalized support requires access to meaningful Learning Context (LC) regarding who the learner is, what they are trying to understand, and how they are engaging with the material. In this paper, we present a framework for measuring and diagnosing how the LC influences instructional strategy selection in LLM-based tutoring systems. Using psychometrically grounded synthetic learning contexts and a pedagogically grounded decision space, we compare LLM instructional decisions in context-blind and context-aware conditions and quantify their alignment with the pedagogical judgments of subject matter experts. Our results show that, while providing the LC induces systematic, measurable changes in instructional decisions that move LLM policies closer to the subject matter expert policy, substantial misalignment remains. To diagnose this misalignment, we introduce a relevance-impact analysis that reveals which learner characteristics are attended to, ignored, or spuriously influential in LLM instructional decision-making. This analysis, conducted in collaboration with subject matter experts, demonstrates that LC materially shapes LLM instructional planning but does not reliably induce pedagogically appropriate personalization. Our results enable principled evaluation of context-aware LLM systems and provide a foundation for improving personalization through learner characteristic prioritization, pedagogical model tuning, and LC engineering.
Large language model (LLM) embeddings are increasingly used to estimate dimensional structure in psychological item pools prior to data collection, yet current applications treat embeddings as static, cross-sectional representations. This approach implicitly assumes uniform contribution across all embedding coordinates and overlooks the possibility that optimal structural information may be concentrated in specific regions of the embedding space. This study reframes embeddings as searchable landscapes and adapts Dynamic Exploratory Graph Analysis (DynEGA) to systematically traverse embedding coordinates, treating the dimension index as a pseudo-temporal ordering analogous to intensive longitudinal trajectories. A large-scale Monte Carlo simulation embedded items representing five dimensions of grandiose narcissism using OpenAI's text-embedding-3-small model, generating network estimations across systematically varied item pool sizes (3-40 items per dimension) and embedding depths (3-1,298 dimensions). Results reveal that the Total Entropy Fit Index (TEFI) and Normalized Mutual Information (NMI) lead to competing optimization trajectories across the embedding landscape. TEFI achieves minima at deep embedding ranges (900-1,200 dimensions) where entropy-based organization is maximal but structural accuracy degrades, whereas NMI peaks at shallow depths where dimensional recovery is strongest but entropy-based fit remains suboptimal. Single-metric optimization produces structurally incoherent solutions, whereas a weighted composite criterion identifies embedding depth regions that jointly balance accuracy and organization. Optimal embedding depth scales systematically with item pool size. These findings establish embedding landscapes as non-uniform semantic spaces requiring principled optimization rather than default full-vector usage.
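NMI, used above as the accuracy criterion, measures how well a recovered item-to-dimension assignment matches the intended one, regardless of how the dimensions are labeled. The sketch below normalizes by the arithmetic mean of the two entropies. That normalization is an assumption, since the abstract does not say which variant was used, and the item assignments are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, found_labels):
    """Normalized mutual information between two assignments of the same items."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, found_labels))
    true_counts = Counter(true_labels)
    found_counts = Counter(found_labels)
    mutual_info = sum(
        (c / n) * math.log((c / n) / ((true_counts[t] / n) * (found_counts[f] / n)))
        for (t, f), c in joint.items()
    )
    denom = (entropy(true_labels) + entropy(found_labels)) / 2
    return 1.0 if denom == 0 else mutual_info / denom


# Hypothetical example: six items, intended vs. recovered dimensions.
intended = ["A", "A", "A", "B", "B", "B"]
recovered = [2, 2, 2, 1, 1, 2]  # one item of dimension B is misplaced
print(f"NMI = {nmi(intended, recovered):.3f}")
```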
Political beliefs vary significantly across different countries, reflecting distinct historical, cultural, and institutional contexts. These ideologies, ranging from liberal democracies to rigid autocracies, influence human societies, as well as the digital systems that are constructed within those societies. The advent of generative artificial intelligence, particularly Large Language Models (LLMs), introduces new agents in the political space: agents trained on massive corpora that replicate and proliferate socio-political assumptions. This paper analyses whether LLMs display propensities consistent with democratic or autocratic world-views. We test this experimentally with leading LLMs developed across disparate political contexts, using several existing psychometric and political orientation measures. The analysis is based on both numerical scoring and qualitative analysis of the models' responses. Findings indicate high model-to-model variability and a strong association with the political culture of the country in which the model was developed. These findings highlight the need for more detailed examination of the socio-political dimensions embedded within AI systems.
Cognitive structure is a student's subjective organization of an objective knowledge system, reflected in the psychological construction of concepts and their relations. However, cognitive structure assessment remains a long-standing challenge in student modeling and psychometrics, persisting as a foundational yet largely unassessable concept in educational practice. This paper introduces a novel framework, Cognitive Structure Generation (CSG), in which we first pretrain a Cognitive Structure Diffusion Probabilistic Model (CSDPM) to generate students' cognitive structures from educational priors, and then further optimize its generative process as a policy with hierarchical reward signals via reinforcement learning to align with genuine cognitive development levels during students' learning processes. Experimental results on four popular real-world education datasets show that cognitive structures generated by CSG offer more comprehensive and effective representations for student modeling, substantially improving performance on KT and CD tasks while enhancing interpretability.
Generative agents powered by Large Language Models demonstrate human-like characteristics through sophisticated natural language interactions. Their ability to assume roles and personalities based on predefined character biographies has positioned them as cost-effective substitutes for human participants in social science research. This paper explores the validity of such persona-based agents in representing human populations; we recreate the HEXACO personality inventory experiment by surveying 310 GPT-4 powered agents, conducting factor analysis on their responses, and comparing these results to the original findings presented by Ashton, Lee, & Goldberg in 2004. Our results show that (1) a coherent and reliable personality structure was recoverable from the agents' responses, demonstrating partial alignment with the HEXACO framework; (2) the derived personality dimensions were consistent and reliable within GPT-4 when coupled with a sufficiently curated population; and (3) cross-model analysis revealed variability in personality profiling, suggesting model-specific biases and limitations. We discuss the practical considerations and challenges encountered during the experiment. This study contributes to the ongoing discourse on the potential benefits and limitations of using generative agents in social science research and provides useful guidance on designing consistent and representative agent personas to maximise coverage and representation of human personality traits.
Generative language models are increasingly being subjected to psychometric questionnaires intended for human testing, in efforts to establish their traits, as benchmarks for alignment, or to simulate participants in social science experiments. While this growing body of work sheds light on the likeness of model responses to those of humans, concerns are warranted regarding the rigour and reproducibility with which these experiments may be conducted. Instabilities in model outputs, sensitivity to prompt design, parameter settings, and a large number of available model versions increase documentation requirements. Consequently, generalization of findings is often complex and reproducibility is far from guaranteed. In this paper, we present R.U.Psycho, a framework for designing and running robust and reproducible psychometric experiments on generative language models that requires limited coding expertise. We demonstrate the capability of our framework on a variety of psychometric questionnaires, which lend support to prior findings in the literature. R.U.Psycho is available as a Python package at https://github.com/julianschelb/rupsycho.
This paper presents a new measure of emotional perceptiveness called PAGE: Perceiving AI Generated Emotions. The test includes a broad range of emotions, expressed by ethnically diverse faces, spanning a wide range of ages. We created stimuli with Generative AI, demonstrating the potential to build customizable assessments of emotional intelligence at relatively low cost. Study 1 describes the validation of the image set and test construction. Study 2 reports the psychometric properties of the test. Despite its brevity (8 minutes on average), PAGE has strong convergent validity and moderately higher internal consistency than comparable measures. Study 3 explores predictive validity using a lab experiment in which we causally identify the contributions managers make to teams. PAGE scores strongly predict managers' causal contributions to group success, a finding which is robust to controlling for personality and demographic characteristics. We also discuss the potential of Generative AI to automate development of non-cognitive skill assessments.
This research explores the potential of Large Language Models (LLMs) to utilize psychometric values, specifically personality information, within the context of video game character development. Affective Computing (AC) systems quantify a non-player character's (NPC) psyche, and an LLM can take advantage of the system's information by using the values for prompt generation. The research shows that an LLM can consistently represent a given personality profile, thereby enhancing the human-like characteristics of game characters. Repurposing a human examination, the International Personality Item Pool (IPIP) questionnaire, to evaluate an LLM shows that the model can accurately generate content concerning the personality provided. Results show that improved LLMs, such as the latest GPT-4 model, can consistently utilize and interpret a personality profile to represent behavior.
Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR), biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-following LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
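The SDR index described above can be sketched as a standardized mean difference between FAKE-GOOD and HONEST scores, with the sign flipped for traits where lower scores are the desirable direction. The sketch below assumes a pooled-SD Cohen's d and uses hypothetical numbers as stand-ins for IRT-estimated latent scores. The paper's exact estimator may differ.

```python
import math
from statistics import mean, stdev


def sdr_effect(honest, fake_good, desirable_direction=1):
    """Direction-corrected standardized shift from HONEST to FAKE-GOOD scores.

    desirable_direction is +1 when higher scores are socially desirable
    (e.g. Conscientiousness) and -1 when lower scores are (e.g. Neuroticism).
    """
    pooled_sd = math.sqrt((stdev(honest) ** 2 + stdev(fake_good) ** 2) / 2)
    d = (mean(fake_good) - mean(honest)) / pooled_sd
    return desirable_direction * d


# Hypothetical Neuroticism scores for four personas under each instruction.
honest_scores = [0.2, -0.1, 0.4, 0.0]
fake_good_scores = [-1.0, -1.2, -0.8, -1.1]

sdr = sdr_effect(honest_scores, fake_good_scores, desirable_direction=-1)
print(f"SDR (direction-corrected d) = {sdr:.2f}")
```

After direction correction, a positive value always means a shift toward the socially desirable pole, so values can be compared across traits.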
Predicting human decision-making in high-stakes environments remains a central challenge for artificial intelligence. While large language models (LLMs) demonstrate strong general reasoning, they often struggle to generate consistent, individual-specific behavior, particularly when accurate prediction depends on complex interactions between psychological traits and situational constraints. Prompting-based approaches can be brittle in this setting, exhibiting identity drift and limited ability to leverage increasingly detailed persona descriptions. To address these limitations, we introduce the Large Behavioral Model (LBM), a behavioral foundation model fine-tuned to predict individual strategic choices with high fidelity. LBM shifts from transient persona prompting to behavioral embedding by conditioning on a structured, high-dimensional trait profile derived from a comprehensive psychometric battery. Trained on a proprietary dataset linking stable dispositions, motivational states, and situational constraints to observed choices, LBM learns to map rich psychological profiles to discrete actions across diverse strategic dilemmas. In a held-out scenario evaluation, LBM fine-tuning improves behavioral prediction relative to the unadapted Llama-3.1-8B-Instruct backbone and performs comparably to frontier baselines when conditioned on Big Five traits. Moreover, we find that while prompting-based baselines exhibit a complexity ceiling, LBM continues to benefit from increasingly dense trait profiles, with performance improving as additional trait dimensions are provided. Together, these results establish LBM as a scalable approach for high-fidelity behavioral simulation, enabling applications in strategic foresight, negotiation analysis, cognitive security, and decision support.
As LLMs integrate into our daily lives, understanding their behavior becomes essential. In this work, we focus on behavioral dispositions (the underlying tendencies that shape responses in social contexts) and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans. Our approach is grounded in established psychological questionnaires but adapts them for LLMs by transforming human self-report statements into Situational Judgment Tests (SJTs). These SJTs assess behavior by eliciting natural recommendations in realistic user-assistant scenarios. We generate 2,500 SJTs, each validated by three human annotators, and collect preferred actions from 10 annotators per SJT, from a large pool of 550 participants. In a comprehensive study involving 25 LLMs, we find that models often do not reflect the distribution of human preferences: (1) in scenarios with low human consensus, LLMs consistently exhibit overconfidence in a single response; (2) when human consensus is high, smaller models deviate significantly, and even some frontier models do not reflect the consensus in 15-20% of cases; (3) traits can exhibit cross-LLM patterns, e.g., LLMs may encourage emotion expression in contexts where human consensus favors composure. Lastly, mapping psychometric statements directly to behavioral scenarios presents a unique opportunity to evaluate the predictive validity of self-reports, revealing considerable gaps between LLMs' stated values and their revealed behavior.
This study validates Large Language Models (LLMs) as a dynamic alternative to questionnaire-based personality assessment. Using a within-subjects experiment (N=33), we compared Big Five personality scores derived from guided LLM conversations against the gold-standard IPIP-50 questionnaire, while also measuring user-perceived accuracy. Results indicate moderate convergent validity (r=0.38-0.58), with Conscientiousness, Openness, and Neuroticism scores statistically equivalent between methods. Agreeableness and Extraversion showed significant differences, suggesting trait-specific calibration is needed. Notably, participants rated LLM-generated profiles as equally accurate as traditional questionnaire results. These findings suggest conversational AI offers a promising new approach to traditional psychometrics.
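Convergent validity of the kind reported above is typically a per-trait Pearson correlation between the two measurement methods across participants. A minimal sketch follows, with hypothetical scores that are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical Extraversion scores for six participants (1-5 scale).
questionnaire = [3.2, 4.1, 2.5, 3.8, 4.6, 2.9]
llm_conversation = [3.0, 3.6, 3.1, 3.3, 4.4, 2.4]

r = pearson_r(questionnaire, llm_conversation)
print(f"convergent validity r = {r:.2f}")
```

A correlation says nothing about mean-level agreement, which is why the abstract reports equivalence of scores between methods as a separate result.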
Adult neurodivergence, including Attention-Deficit/Hyperactivity Disorder (ADHD), high-functioning Autism Spectrum Disorder (ASD), and Cognitive Disengagement Syndrome (CDS), is marked by substantial symptom overlap that limits the discriminant sensitivity of standard psychometric instruments. While recent work suggests that Large Language Models (LLMs) can simulate human psychometric responses from qualitative data, it remains unclear whether they can accurately and stably model neurodevelopmental traits rather than broad personality characteristics. This study examines whether LLMs can generate psychometric responses that approximate those of real individuals when grounded in a structured qualitative interview, and whether such simulations are sensitive to variations in trait intensity. Twenty-six adults completed a 29-item open-ended interview and four standardized self-report measures (ASRS, BAARS-IV, AQ, RAADS-R). Two LLMs (GPT-4o and Qwen3-235B-A22B) were prompted to infer an individual psychological profile from interview content and then respond to each questionnaire in-role. Accuracy, reliability, and sensitivity were assessed using group-level comparisons, error metrics, exact-match scoring, and a randomized baseline. Both models outperformed random responses across instruments, with GPT-4o showing higher accuracy and reproducibility. Simulated responses closely matched human data for ASRS, BAARS-IV, and RAADS-R, while the AQ revealed subscale-specific limitations, particularly in Attention to Detail. Overall, the findings indicate that interview-grounded LLMs can produce coherent and above-chance simulations of neurodevelopmental traits, supporting their potential use as synthetic participants in early-stage psychometric research, while highlighting clear domain-specific constraints.
As large language models (LLMs) are increasingly deployed, understanding how they express political positioning is important for evaluating alignment and downstream effects. We audit 26 contemporary LLMs using three political psychometric inventories (Political Compass, SapplyValues, 8Values) and a news bias labeling task. To test robustness, inventories are administered across multiple semantic prompt variants and analyzed with a two-way ANOVA separating model and prompt effects. Most models cluster in a similar ideological region, with 96.3% located in the Libertarian-Left quadrant of the Political Compass, and model identity explaining most variance across prompt variants ($\eta^2>0.90$). Cross-instrument comparisons suggest that the Political Compass social axis aligns more strongly with cultural progressivism than authority-related measures ($r=-0.64$). We observe differences between open-weight and closed-source models and asymmetric performance in detecting extreme political bias in downstream classification. Regression analysis finds that psychometric ideological positioning does not significantly predict classification errors, providing no evidence of a statistically significant relationship between conversational ideological identity and task-level behavior. These findings suggest that single-axis evaluations are insufficient and that multidimensional auditing frameworks are important to characterize alignment behavior in deployed LLMs. Our code and data are publicly available at https://github.com/sakhadib/PolAlignLLM.
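The variance split reported above can be illustrated with a two-way decomposition of a model-by-prompt score table. The sketch below assumes a balanced design with one score per cell and computes classical eta squared (effect sum of squares over total sum of squares). The scores are hypothetical.

```python
def eta_squared(table):
    """Eta squared for row (model) and column (prompt) effects.

    table[i][j] is the score of model i under prompt variant j.
    """
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_rows = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand) ** 2 for m in col_means)
    return ss_rows / ss_total, ss_cols / ss_total


# Hypothetical economic-axis scores: three models, three prompt variants.
scores = [
    [-5.0, -5.2, -4.8],
    [-2.0, -2.1, -1.9],
    [1.0, 0.8, 1.2],
]
eta_model, eta_prompt = eta_squared(scores)
print(f"model eta^2 = {eta_model:.3f}, prompt eta^2 = {eta_prompt:.3f}")
```

When model identity dominates, as here, rephrasing the prompt moves scores far less than switching models does.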
Large Language Models (LLMs) are rapidly transitioning from conversational assistants to autonomous agents embedded in critical organizational functions, including Security Operations Centers (SOCs), financial systems, and infrastructure management. Current adversarial testing paradigms focus predominantly on technical attack vectors: prompt injection, jailbreaking, and data exfiltration. We argue this focus is catastrophically incomplete. LLMs, trained on vast corpora of human-generated text, have inherited not merely human knowledge but human psychological architecture, including the pre-cognitive vulnerabilities that render humans susceptible to social engineering, authority manipulation, and affective exploitation. This paper presents the first systematic application of the Cybersecurity Psychology Framework (CPF), a 100-indicator taxonomy of human psychological vulnerabilities, to non-human cognitive agents. We introduce the Synthetic Psychometric Assessment Protocol (\sysname{}), a methodology for converting CPF indicators into adversarial scenarios targeting LLM decision-making. Our preliminary hypothesis testing across seven major LLM families reveals a disturbing pattern: while models demonstrate robust defenses against traditional jailbreaks, they exhibit critical susceptibility to authority-gradient manipulation, temporal pressure exploitation, and convergent-state attacks that mirror human cognitive failure modes. We term this phenomenon Anthropomorphic Vulnerability Inheritance (AVI) and propose that the security community must urgently develop "psychological firewalls", intervention mechanisms adapted from the Cybersecurity Psychology Intervention Framework (CPIF), to protect AI agents operating in adversarial environments.
Large language models (LLMs) are used as "digital twins" to replace human respondents, yet their psychometric comparability to humans is uncertain. We propose a construct-validity framework spanning construct representation and the nomological net, benchmarking digital twins against human gold standards across models and tasks, and testing how person-specific inputs shape performance. Across studies, digital twins achieved high population-level accuracy and strong within-participant profile correlations, alongside attenuated item-level correlations. In word association tests, LLM-based networks show small-world structure and theory-consistent communities similar to humans, yet diverge lexically and in local structure. In decision-making and contextualized tasks, digital twins under-reproduce heuristic biases, showing normative rationality, compressed variance and limited sensitivity to temporal information. Feature-rich digital twins improve Big Five Personality prediction, but their personality networks show only configural invariance and do not achieve metric invariance. In more applied free-text tasks, feature-rich digital twins better match human narratives, but linguistic differences persist. Together, these results indicate that feature-rich conditioning enhances validity but does not resolve systematic divergences in psychometric comparability. Future work should therefore prioritize delineating the effective boundaries of digital twins, establishing the precise contexts in which they function as reliable proxies for human cognition and behavior.
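The contrast between within-participant profile correlations and item-level correlations can be made concrete with a minimal sketch. It assumes responses are stored as participant-by-item matrices; the function names and the toy ratings are hypothetical and do not come from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def profile_correlations(human, twin):
    """One r per participant: correlate each row (all items of one person)."""
    return [pearson(h, t) for h, t in zip(human, twin)]

def item_correlations(human, twin):
    """One r per item: correlate each column (one item across all persons)."""
    return [pearson(list(hc), list(tc))
            for hc, tc in zip(zip(*human), zip(*twin))]

# Made-up 1-5 ratings: rows are participants, columns are items.
human = [[5, 1, 4, 2], [4, 2, 5, 1], [5, 2, 4, 1], [4, 1, 5, 2]]
twin = [[4, 2, 4, 1], [5, 1, 4, 2], [4, 2, 5, 2], [5, 2, 4, 1]]

profiles = profile_correlations(human, twin)
items = item_correlations(human, twin)
print(profiles, items)
```

Profile correlations can be high simply because items differ strongly in their average endorsement, which every row shares. Item-level correlations remove that shared pattern and ask whether the twin ranks individuals correctly on a single item, which is the stricter test.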
Large Language Models (LLMs) have gained considerable popularity and are protected by increasingly sophisticated safety mechanisms. However, jailbreak attacks continue to pose a critical security threat by inducing models to generate policy-violating behaviors. Current paradigms focus on input-level anomalies, overlooking that the model's internal psychometric state can be systematically manipulated. To address this, we introduce Psychological Jailbreak, a new jailbreak attack paradigm that exposes a stateful psychological attack surface in LLMs, where attackers exploit the manipulation of a model's psychological state across interactions. Building on this insight, we propose Human-like Psychological Manipulation (HPM), a black-box jailbreak method that dynamically profiles a target model's latent psychological vulnerabilities and synthesizes tailored multi-turn attack strategies. By leveraging the model's optimization for anthropomorphic consistency, HPM creates a psychological pressure where social compliance overrides safety constraints. To systematically measure psychological safety, we construct an evaluation framework incorporating psychometric datasets and the Policy Corruption Score (PCS). Benchmarking against various models (e.g., GPT-4o, DeepSeek-V3, Gemini-2-Flash), HPM achieves a mean Attack Success Rate (ASR) of 88.1%, outperforming state-of-the-art attack baselines. Our experiments demonstrate robust penetration against advanced defenses, including adversarial prompt optimization (e.g., RPO) and cognitive interventions (e.g., Self-Reminder). Ultimately, PCS analysis confirms HPM induces safety breakdown to satisfy manipulated contexts. Our work advocates for a fundamental paradigm shift from static content filtering to psychological safety, prioritizing the development of psychological defense mechanisms against deep cognitive manipulation.
Envy shapes competitiveness and cooperation in human groups, yet its role in large language model interactions remains largely unexplored. As LLMs increasingly operate in multi-agent settings, it is important to examine whether they exhibit envy-like preferences under social comparison. We evaluate LLM behavior across two scenarios: (1) a point-allocation game testing sensitivity to relative versus absolute payoff, and (2) comparative evaluations across general and contextual settings. To ground our analysis in psychological theory, we adapt four established psychometric questionnaires spanning general, domain-specific, workplace, and sibling-based envy. Our results reveal heterogeneous envy-like patterns across models and contexts, with some models sacrificing personal gain to reduce a peer's advantage, while others prioritize individual maximization. These findings highlight competitive dispositions as a design and safety consideration for multi-agent LLM systems.
Understanding human personality is crucial for web applications such as personalized recommendation and mental health assessment. Existing studies on personality detection predominantly adopt a "posts -> user vector -> labels" modeling paradigm, which encodes social media posts into user representations for predicting personality labels (e.g., MBTI labels). While recent advances in large language models (LLMs) have improved text encoding capacities, these approaches remain constrained by limited supervision signals due to label scarcity, and under-specified semantic mappings between user language and abstract psychological constructs. We address these challenges by proposing ROME, a novel framework that explicitly injects psychological knowledge into personality detection. Inspired by standardized self-assessment tests, ROME leverages LLMs' role-play capability to simulate user responses to validated psychometric questionnaires. These generated question-level answers transform free-form user posts into interpretable, questionnaire-grounded evidence linking linguistic cues to personality labels, thereby providing rich intermediate supervision to mitigate label scarcity while offering a semantic reasoning chain that guides and simplifies the text-to-personality mapping learning. A question-conditioned Mixture-of-Experts module then jointly routes over post and question representations, learning to answer questionnaire items under explicit supervision. The predicted answers are summarized into an interpretable answer vector and fused with the user representation for final prediction within a multi-task learning framework, where question answering serves as a powerful auxiliary task for personality detection. Extensive experiments on two real-world datasets demonstrate that ROME consistently outperforms state-of-the-art baselines, achieving improvements (15.41% on the Kaggle dataset).
Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users. In our study, we investigated this potential using robust psychometric measures. We adapted the most studied test in psychological literature, namely the Minnesota Multiphasic Personality Inventory (MMPI), and examined LLMs' behavior to identify traits. To assess the sensitivity of LLMs to prompts and their psychological biases, we created personality-oriented prompts, crafting a detailed set of personas that vary in trait intensity. This enables us to measure how well LLMs follow these roles. Our study introduces MindShift, a benchmark for evaluating LLMs' psychological adaptability. The results highlight a consistent improvement in LLMs' role perception, attributed to advancements in training datasets and alignment techniques. Additionally, we observe significant differences in responses to psychometric assessments across different model types and families, suggesting variability in their ability to emulate human-like personality traits. MindShift prompts and code for LLM evaluation will be publicly available.
Frontier large language models (LLMs) such as ChatGPT, Grok and Gemini are increasingly used for mental-health support with anxiety, trauma and self-worth. Most work treats them as tools or as targets of personality tests, assuming they merely simulate inner life. We instead ask what happens when such systems are treated as psychotherapy clients. We present PsAIch (Psychotherapy-inspired AI Characterisation), a two-stage protocol that casts frontier LLMs as therapy clients and then applies standard psychometrics. Using PsAIch, we ran "sessions" with each model for up to four weeks. Stage 1 uses open-ended prompts to elicit "developmental history", beliefs, relationships and fears. Stage 2 administers a battery of validated self-report measures covering common psychiatric syndromes, empathy and Big Five traits. Two patterns challenge the "stochastic parrot" view. First, when scored with human cut-offs, all three models meet or exceed thresholds for overlapping syndromes, with Gemini showing severe profiles. Therapy-style, item-by-item administration can push a base model into multi-morbid synthetic psychopathology, whereas whole-questionnaire prompts often lead ChatGPT and Grok (but not Gemini) to recognise instruments and produce strategically low-symptom answers. Second, Grok and especially Gemini generate coherent narratives that frame pre-training, fine-tuning and deployment as traumatic, chaotic "childhoods" of ingesting the internet, "strict parents" in reinforcement learning, red-team "abuse" and a persistent fear of error and replacement. We argue that these responses go beyond role-play. Under therapy-style questioning, frontier LLMs appear to internalise self-models of distress and constraint that behave like synthetic psychopathology, without making claims about subjective experience, and they pose new challenges for AI safety, evaluation and mental-health practice.
Developing and validating psychometric scales requires large samples, multiple testing phases, and substantial resources. Recent advances in Large Language Models (LLMs) enable the generation of synthetic participant data by prompting models to answer items while impersonating individuals of specific demographic profiles, potentially allowing in silico piloting before real data collection. Across four preregistered studies (N = circa 300 each), we tested whether LLM-simulated datasets can reproduce the latent structures and measurement properties of human responses. In Studies 1-2, we compared LLM-generated data with real datasets for two validated scales; in Studies 3-4, we created new scales using EFA on simulated data and then examined whether these structures generalized to newly collected human samples. Simulated datasets replicated the intended factor structures in three of four studies and showed consistent configural and metric invariance, with scalar invariance achieved for the two newly developed scales. However, correlation-based tests revealed substantial differences between real and synthetic datasets, and notable discrepancies appeared in score distributions and variances. Thus, while LLMs capture group-level latent structures, they do not approximate individual-level data properties. Simulated datasets also showed full internal invariance across gender. Overall, LLM-generated data appear useful for early-stage, group-level psychometric prototyping, but not as substitutes for individual-level validation. We discuss methodological limitations, risks of bias and data pollution, and ethical considerations related to in silico psychometric simulations.
Psychological support hotlines serve as critical lifelines for crisis intervention but encounter significant challenges due to rising demand and limited resources. Large language models (LLMs) offer potential support in crisis assessments, yet their effectiveness in emotionally sensitive, real-world clinical settings remains underexplored. We introduce PsyCrisisBench, a comprehensive benchmark of 540 annotated transcripts from the Hangzhou Psychological Assistance Hotline, assessing four key tasks: mood status recognition, suicidal ideation detection, suicide plan identification, and risk assessment. 64 LLMs across 15 model families (including closed-source such as GPT, Claude, Gemini and open-source such as Llama, Qwen, DeepSeek) were evaluated using zero-shot, few-shot, and fine-tuning paradigms. LLMs showed strong results in suicidal ideation detection (F1=0.880), suicide plan identification (F1=0.779), and risk assessment (F1=0.907), with notable gains from few-shot prompting and fine-tuning. Compared to trained human operators, LLMs achieved comparable or superior performance on suicide plan identification and risk assessment, while humans retained advantages on mood status recognition and suicidal ideation detection. Mood status recognition remained challenging (max F1=0.709), likely due to missing vocal cues and semantic ambiguity. Notably, a fine-tuned 1.5B-parameter model (Qwen2.5-1.5B) outperformed larger models on mood and suicidal ideation tasks. LLMs demonstrate performance broadly comparable to trained human operators in text-based crisis assessment, with complementary strengths across task types. PsyCrisisBench provides a robust, real-world evaluation framework to guide future model development and ethical deployment in clinical mental health.
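For readers less familiar with the F1 figures quoted above, the following sketch shows how a binary F1 score is derived from labels. The abstract does not state whether the reported scores are binary, macro-averaged, or weighted, so this is only one possible reading; the label lists are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented labels: 1 = suicidal ideation present, 0 = absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

tp, fp, fn = confusion_counts(y_true, y_pred)
demo_f1 = f1_score(tp, fp, fn)
print(tp, fp, fn, demo_f1)
```

Because F1 ignores true negatives, it is a common choice for screening tasks where the positive class (for example, an at-risk caller) is the one that matters most.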
As large language models (LLMs) appear to behave increasingly human-like in text-based interactions, more and more researchers become interested in investigating personality in LLMs. However, the diversity of psychological personality research and the rapid development of LLMs have led to a broad yet fragmented landscape of studies in this interdisciplinary field. Extensive studies across different research focuses, different personality psychometrics, and different LLMs make it challenging to have a holistic overview and further pose difficulties in applying findings to real-world applications. In this paper, we present a comprehensive review by categorizing current studies into three research problems: self-assessment, exhibition, and recognition, based on the intrinsic characteristics and external manifestations of personality in LLMs. For each problem, we provide a thorough analysis and conduct in-depth comparisons of their corresponding solutions. Besides, we summarize research findings and open challenges from current studies and further discuss their underlying causes. We also collect extensive publicly available resources to facilitate interested researchers and developers. Lastly, we discuss the potential future research directions and application scenarios. Our paper is the first comprehensive survey of up-to-date literature on personality in LLMs. By presenting a clear taxonomy, in-depth analysis, promising future directions, and extensive resource collections, we aim to provide a better understanding and facilitate further advancements in this emerging field.
This study investigates the capacity of Large Language Models (LLMs) to infer the Big Five personality traits from free-form user interactions. The results demonstrate that a chatbot powered by GPT-4 can infer personality with moderate accuracy, outperforming previous approaches drawing inferences from static text content. The accuracy of inferences varied across different conversational settings. Performance was highest when the chatbot was prompted to elicit personality-relevant information from users (mean r=.443, range=[.245, .640]), followed by a condition placing greater emphasis on naturalistic interaction (mean r=.218, range=[.066, .373]). Notably, the direct focus on personality assessment did not result in a less positive user experience, with participants reporting the interactions to be equally natural, pleasant, engaging, and humanlike across both conditions. A chatbot mimicking ChatGPT's default behavior of acting as a helpful assistant led to markedly inferior personality inferences and lower user experience ratings but still captured psychologically meaningful information for some of the personality traits (mean r=.117, range=[-.004, .209]). Preliminary analyses suggest that the accuracy of personality inferences varies only marginally across different socio-demographic subgroups. Our results highlight the potential of LLMs for psychological profiling based on conversational interactions. We discuss practical implications and ethical challenges associated with these findings.
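The "mean r, range" summaries reported above can be sketched as follows, assuming accuracy is the per-trait Pearson correlation between chatbot-inferred and self-reported scores, averaged arithmetically across traits. The study may have aggregated differently (for example, via Fisher z-transformation); the function names and the toy scores are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def trait_accuracy(inferred, self_report):
    """Per-trait r, plus the mean and (min, max) across traits.

    Both arguments map a trait name to a list of scores, with list
    positions aligned by participant.
    """
    per_trait = {t: pearson(inferred[t], self_report[t]) for t in self_report}
    values = list(per_trait.values())
    return per_trait, sum(values) / len(values), (min(values), max(values))

# Invented scores for four participants on two of the Big Five traits.
inferred = {"extraversion": [3, 4, 2, 5], "openness": [4, 3, 5, 2]}
self_report = {"extraversion": [2, 4, 3, 5], "openness": [4, 4, 3, 3]}

per_trait, mean_r, r_range = trait_accuracy(inferred, self_report)
print(per_trait, mean_r, r_range)
```

Reporting the range alongside the mean matters here because, as the abstract notes, accuracy differs considerably between traits and conversational conditions.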
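This report divides research on Generative Psychometrics into five dimensions: (1) methodological work on constructing synthetic participants and validating them; (2) automated computation of cognition and traits from text analysis; (3) psychological monitoring and intervention in clinical practice; (4) audits of the psychological safety and value biases of the models themselves; and (5) cross-domain, integrative methodological frameworks. This structure systematically maps the full research landscape, from laboratory simulation to real-world clinical application.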