Long-Horizon Multi-Agent Multimodal Social Scenarios
Macro-Level Social Simulation and Group Dynamics Evolution
This group of papers explores using large language models (LLMs) as proxies for humans to simulate the macro-level dynamics of large-scale social systems, covering complex social phenomena such as opinion polarization, the emergence of social norms, the spiral of silence, population migration, and disaster risk perception.
- Sense and Sensitivity: Evaluating the simulation of social dynamics via Large Language Models(Da Ju, Adina Williams, Brian Karrer, Maximilian Nickel, 2024, ArXiv)
- Simulating Human Society with Large Language Model Agents: City, Social Media, and Economic System(Chen Gao, Fengli Xu, Xu Chen, Xiang Wang, Xiangnan He, Yong Li, 2024, Companion Proceedings of the ACM Web Conference 2024)
- Exploring the Potential of Conversational AI Support for Agent-Based Social Simulation Model Design(Peer-Olaf Siebers, 2024, J. Artif. Soc. Soc. Simul.)
- Quantifying the Lifelong Impact of Resilience Interventions via Agent-Based LLM Simulation(Vivienne L'Ecuyer Ming, 2025, ArXiv)
- RELATE-Sim: Leveraging Turning Point Theory and LLM Agents to Predict and Understand Long-Term Relationship Dynamics through Interactive Narrative Simulations(Matthew Yue, Zhikun Xu, Vivek Gupta, Thao Ha, Liesal Sharabi, Ben Zhou, 2025, ArXiv)
- Multimodal LLM-Based Agent for Human Behavior Simulation: Modeling Return Migration Dynamics(Xiaoluan Liu, Xinyu Lin, Fangbin Qiao, 2025, Data Intelligence)
- GenSim: A General Social Simulation Platform with Large Language Model based Agents(Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, Bolin Ding, Jingren Zhou, Jun Wang, J. Wen, 2024, No journal)
- Simulating Filter Bubble on Short-video Recommender System with Large Language Model Agents(Nicholas Sukiennik, Haoyu Wang, Zailin Zeng, Chen Gao, Yong Li, 2025, ArXiv)
- Decoding the Silent Majority: Inducing Belief Augmented Social Graph with Large Language Model for Response Forecasting(Chenkai Sun, Jinning Li, Y. Fung, Hou Pong Chan, Tarek F. Abdelzaher, ChengXiang Zhai, Heng Ji, 2023, No journal)
- Understanding Online Polarization Through Human-Agent Interaction in a Synthetic LLM-Based Social Network(Tim Donkers, Jürgen Ziegler, 2025, No journal)
- Spiral of Silence in Large Language Model Agents(Mingze Zhong, Meng Fang, Zijing Shi, Yuxuan Huang, Shunfeng Zheng, Yali Du, Ling Chen, Jun Wang, 2025, ArXiv)
- Emergent Social Dynamics of LLM Agents in the El Farol Bar Problem(Ryosuke Takata, A. Masumori, Takashi Ikegami, 2025, ArXiv)
- Synthetic Social Media Influence Experimentation Via an Agentic Reinforcement Learning Large Language Model Bot(Bailu Jin, Weisi Guo, 2024, J. Artif. Soc. Soc. Simul.)
- The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems(Prateek Gupta, Qiankun Zhong, Hiromu Yakura, Thomas F. Eisenmann, Iyad Rahwan, 2025, ArXiv)
- Emergence of Social Norms in Generative Agent Societies: Principles and Architecture(Siyue Ren, Zhiyao Cui, Ruiqi Song, Zhen Wang, Shuyue Hu, 2024, No journal)
- Towards Simulating Social Influence Dynamics with LLM-Based Multi-Agents(Hsien-Tsung Lin, Pei-Cing Huang, Chan-Tung Ku, Chan Hsu, Pei-Xuan Shieh, Yihuang Kang, 2025, 2025 IEEE International Conference on Information Reuse and Integration and Data Science (IRI))
- Simulating theory and society: How multi-agent artificial intelligence modeling contributes to renewal and critique in social theory(F. Shults, 2025, Theory and Society)
- MF-LLM: Simulating Collective Decision Dynamics via a Mean-Field Large Language Model Framework(Qirui Mi, Mengyue Yang, Xiangning Yu, Zhiyu Zhao, Cheng Deng, Bo An, Haifeng Zhang, Xu Chen, Jun Wang, 2025, ArXiv)
- Quantifying the Impact of Large Language Models on Collective Opinion Dynamics(Chao Li, Xingye Su, Haoying Han, Cong Xue, Chunmo Zheng, C. Fan, 2023, ArXiv)
- Multi-Stage Simulation of Residents' Disaster Risk Perception and Decision-Making Behavior: An Exploratory Study on Large Language Model-Driven Social-Cognitive Agent Framework(Xinjie Zhao, Hao Wang, Chengxiao Dai, Jiacheng Tang, Kaixin Deng, Zhihua Zhong, Fanying Kong, Shiyun Wang, So Morikawa, 2025, Syst.)
- SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users(Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, Guanying Li, Ling Yan, Yao Hu, Siming Chen, Yu Wang, Jingxuan Huang, Jiebo Luo, Shiping Tang, Libo Wu, Baohua Zhou, Zhongyu Wei, 2025, ArXiv)
- PopSim: Social Network Simulation for Social Media Popularity Prediction(Yijun Liu, Wu Liu, Xiaoyan Gu, Allen He, Weiping Wang, Yongdong Zhang, 2025, ArXiv)
- Harnessing Large Language Models for Group POI Recommendations(Jing Long, Liang Qu, Junliang Yu, Tong Chen, Quoc Viet Hung Nguyen, Hongzhi Yin, 2024, Proceedings of the 34th ACM International Conference on Information and Knowledge Management)
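The polarization and consensus phenomena surveyed above are classically studied with simple numeric opinion-dynamics models; the LLM-agent works in this section essentially replace the numeric update rule with model-generated behavior. As a point of reference, here is a minimal sketch of the classical Deffuant bounded-confidence model (all parameter values are illustrative, and this is a textbook baseline rather than the method of any cited paper):

```python
import random

def deffuant_step(opinions, eps=0.2, mu=0.5):
    """One interaction of the Deffuant bounded-confidence model: two random
    agents move toward each other only if their opinions differ by less
    than the confidence bound eps."""
    i, j = random.sample(range(len(opinions)), 2)
    if abs(opinions[i] - opinions[j]) < eps:
        delta = mu * (opinions[j] - opinions[i])
        opinions[i] += delta
        opinions[j] -= delta
    return opinions

random.seed(0)
ops = [random.random() for _ in range(100)]
for _ in range(20000):
    deffuant_step(ops)

# With a small confidence bound, opinions typically settle into a few
# separated clusters rather than a single consensus -- a toy analogue
# of the polarization effects studied above.
clusters = sorted({round(o, 1) for o in ops})
```

The key qualitative behavior is that agents outside each other's confidence bound never interact, so initial disagreement can harden into persistent clusters.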
Long-Horizon Memory Architectures and Interaction Consistency Maintenance
This line of work targets the memory bottleneck agents face in long-horizon interaction, covering external memory mechanisms (RAG), reflective memory, hierarchical storage (STM/LTM), and dynamic pruning techniques to preserve persona consistency across sessions.
- In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents(Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long T. Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, Tomas Pfister, 2025, ArXiv)
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory(P. Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, Deshraj Yadav, 2025, No journal)
- Exploring and Controlling Diversity in LLM-Agent Conversation(Kuanchao Chu, Yi-Pei Chen, Hideki Nakayama, 2024, No journal)
- SupportPlay: A Multi-Agent Role-Playing System for Personalized and Sustained Multimodal Emotional Support Conversation(Geng Tu, Bingbing Wang, Erik Cambria, Wenjie Li, Ruifeng Xu, 2025, Companion Proceedings of the ACM on Web Conference 2025)
- TeleMem: Building Long-Term and Multimodal Memory for Agentic AI(Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Luxi Lin, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, Xuelong Li, 2025, ArXiv)
- Memory Management Strategies for Maintaining Long-Term Dialogue Coherence and Personalization in Generation Chatbots(Vignyanand Penumatcha, 2025, 2025 5th International Conference on Emerging Research in Electronics, Computer Science and Technology (ICERECT))
- SALM: A Multi-Agent Framework for Language Model-Driven Social Network Simulation(Gaurav Koley, 2025, ArXiv)
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding(Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li, 2024, ArXiv)
- Toward Conversational Agents with Context and Time Sensitive Long-term Memory(Nick Alonso, Tomas Figliolia, A. Ndirango, Beren Millidge, 2024, ArXiv)
- MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation(Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao, 2025, No journal)
- Think-in-Memory: Recalling and Post-thinking Enable LLMs with Long-Term Memory(Lei Liu, Xiaoyan Yang, Yue Shen, Binbin Hu, Zhiqiang Zhang, Jinjie Gu, Guannan Zhang, 2023, ArXiv)
- Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents(Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong, 2026, ArXiv)
- Evaluating Very Long-Term Conversational Memory of LLM Agents(Adyasha Maharana, Dong-Ho Lee, S. Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang, 2024, ArXiv)
- MIRIX: Multi-Agent Memory System for LLM-Based Agents(Yu Wang, Xi Chen, 2025, ArXiv)
- Prompted LLMs as Chatbot Modules for Long Open-domain Conversation(Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, Kangwook Lee, 2023, No journal)
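The STM/LTM hierarchy with salience-based pruning described above can be sketched as a two-tier store: a bounded short-term buffer whose overflow is either promoted to long-term memory or pruned by a time-decayed salience score. This is a minimal illustration, not the architecture of any particular cited system; all class names, thresholds, and the keyword-overlap retrieval are hypothetical simplifications.

```python
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    text: str
    salience: float
    t: float = field(default_factory=time.time)

class HierarchicalMemory:
    """Two-tier memory: a bounded STM deque; items evicted from STM are
    promoted to LTM if their decayed salience clears a threshold,
    otherwise they are pruned (forgotten)."""
    def __init__(self, stm_size=4, promote_threshold=0.5, half_life=3600.0):
        self.stm = deque(maxlen=stm_size)
        self.ltm = []
        self.promote_threshold = promote_threshold
        self.half_life = half_life

    def _score(self, item):
        # Salience decays exponentially with age (half_life in seconds).
        age = time.time() - item.t
        return item.salience * 0.5 ** (age / self.half_life)

    def add(self, text, salience):
        if len(self.stm) == self.stm.maxlen:
            oldest = self.stm[0]          # about to be evicted by append()
            if self._score(oldest) >= self.promote_threshold:
                self.ltm.append(oldest)   # promote; otherwise pruned
        self.stm.append(MemoryItem(text, salience))

    def recall(self, query, k=3):
        # Toy retrieval: rank all items by word overlap with the query;
        # real systems would use embedding similarity (RAG).
        pool = list(self.stm) + self.ltm
        q = set(query.lower().split())
        pool.sort(key=lambda m: len(q & set(m.text.lower().split())), reverse=True)
        return [m.text for m in pool[:k]]
```

Persona-relevant facts ("user likes jazz") survive eviction via promotion, while low-salience chatter is pruned — the basic mechanism behind cross-session consistency in the systems surveyed above.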
Multi-Agent Games, Cooperation Strategies, and Decision Coordination
These papers focus on the strategy choices of multiple agents in social dilemmas, diplomatic negotiation, and non-cooperative games, covering power dynamics, deception detection, trust formation, multi-perspective debate mechanisms, and reinforcement-learning-driven co-evolution.
- Will Systems of LLM Agents Cooperate: An Investigation into a Social Dilemma(Richard Willis, Yali Du, Joel Z. Leibo, Michael Luck, 2025, ArXiv)
- How large language models judge and influence human cooperation(Alexandre S. Pires, Laurens Samson, S. Ghebreab, Fernando P. Santos, 2025, ArXiv)
- Navigating Social Dilemmas with LLM-based Agents via Consideration of Future Consequences(D. Nguyen, Hung Le, Kien Do, Sunil Gupta, S. Venkatesh, T. Tran, 2025, No journal)
- I Want to Break Free! Persuasion and Anti-Social Behavior of LLMs in Multi-Agent Settings with Social Hierarchy(G. Campedelli, Nicolò Penzo, Massimo Stefan, Roberto Dessì, Marco Guerini, Bruno Lepri, Jacopo Staiano, 2024, ArXiv)
- Emergent Social Learning via Multi-agent Reinforcement Learning(Kamal Ndousse, Douglas Eck, S. Levine, Natasha Jaques, 2020, No journal)
- Affect-Aware Agents for Emergent Social Conflict in Games(Weilun Deng, 2025, Applied and Computational Engineering)
- Socialized Learning and Emergent Behaviors in Multi-Agent Systems based on Multimodal Large Language Models(Sureyya Akin, Shruti T. Tiwari, R. Bhattacharya, Sagar A. Raman, Kiran Mohanty, Sita Krishnan, 2025, ArXiv)
- EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation(Xinyi Mou, Chen Qian, Wei Liu, Xuanjing Huang, Zhongyu Wei, 2025, ArXiv)
- Competing LLM Agents in a Non-Cooperative Game of Opinion Polarisation(Amin Qasmi, Usman Naseem, Mehwish Nasim, 2025, 2025 IEEE International Conference on Big Data (BigData))
- The Traitors: Deception and Trust in Multi-Agent Language Model Simulations(Pedro M. P. Curvo, 2025, ArXiv)
- Static network structure cannot stabilize cooperation among large language model agents(Jingxin Han, B. Battu, Ivan Romic, Talal Rahwan, Petter Holme, 2024, PLOS One)
- Richelieu: Self-Evolving LLM-Based Agents for AI Diplomacy(Zhenyu Guan, Xiangyu Kong, Fangwei Zhong, Yizhou Wang, 2024, ArXiv)
- WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate(Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng, 2025, ArXiv)
- OmniNova:A General Multimodal Agent Framework(Pengfei Du, 2025, ArXiv)
- Global Reward to Local Rewards: Multimodal-Guided Decomposition for Improving Dialogue Agents(Dong Won Lee, Hae Won Park, Yoon Kim, Cynthia Breazeal, Louis-philippe Morency, 2024, No journal)
- Improving Multi-Agent Debate with Sparse Communication Topology(Yunxuan Li, Y. Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, Eugene Ie, 2024, No journal)
- AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis(Callie C. Liao, Duoduo Liao, Sai Surya Gadiraju, 2025, ArXiv)
- MV-Debate: Multi-view Agent Debate with Dynamic Reflection Gating for Multimodal Harmful Content Detection in Social Media(Rui Lu, Jinhe Bi, Yunpu Ma, Feng Xiao, Yuntao Du, Yijun Tian, 2025, ArXiv)
- Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety(Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu, 2025, ArXiv)
- Super-additive Cooperation in Language Model Agents(Filippo Tonini, Lukas Galke, 2025, ArXiv)
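Many of the social-dilemma studies above build on the iterated prisoner's dilemma. A minimal sketch with the standard payoff matrix and two classic hand-coded strategies follows; the cited works replace these policies with LLM agents, but the game mechanics are the same:

```python
# Standard prisoner's dilemma payoffs: (my_move, their_move) -> my payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(opponent_moves):
    # Cooperate first, then mirror the opponent's previous move.
    return opponent_moves[-1] if opponent_moves else "C"

def always_defect(opponent_moves):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    """Iterate the game; each strategy sees only the opponent's history."""
    moves_a, moves_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(moves_b)
        b = strategy_b(moves_a)
        moves_a.append(a)
        moves_b.append(b)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
    return score_a, score_b
```

Mutual tit-for-tat sustains cooperation (30 points each over 10 rounds), while defection against it yields a one-round gain followed by mutual punishment — the tension that the LLM-agent studies probe at scale.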
Multimodal Social Perception, Embodied Intelligence, and Environment Interaction
This strand studies how agents integrate audio-visual signals to recognize social norms, emotional states, and cultural biases, and examines task planning and symbol emergence for embodied agents in physical or virtual environments (e.g., households, cities, and soccer fields).
- LVLM-HBA: Large Vision-Language Model with Cross-Modal Alignment for Human Behavior Analysis(Jun Yu, Xilong Lu, Lingsi Zhu, Qiang Ling, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Towards creating a conversational memory for long-term meeting support: predicting memorable moments in multi-party conversations through eye-gaze(Maria Tsfasman, Kristian Fenech, Morita Tarvirdians, András Lőrincz, C. Jonker, Catharine Oertel, 2022, Proceedings of the 2022 International Conference on Multimodal Interaction)
- Leveraging Recurrent Neural Networks for Multimodal Recognition of Social Norm Violation in Dialog(Tiancheng Zhao, Ran Zhao, Zhao Meng, Justine Cassell, 2016, ArXiv)
- No Robot is an Island: An Always-On Cognitive Architecture for Social Context Awareness in Dynamic Environments*(Dario Pasquali, Luca Garello, G. Belgiovine, O. Eldardeer, Linda Lastrico, Francesco Rea, Fulvio Mastrogiovanni, G. Sandini, A. Sciutti, 2025, 2025 IEEE International Conference on Development and Learning (ICDL))
- A modular architecture for creating multimodal embodied agents with an episodic Knowledge Graph as an explainable and controllable long-term memory(Thomas Baier, Selene Baez Santamaria, Piek Vossen, 2025, Dialogue Discourse)
- Higher-Order Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions(Minwoo Kang, Suhong Moon, Seungyong Lee, Ayush Raj, Joseph Suh, David M. Chan, 2025, ArXiv)
- Larger Encoders, Smaller Regressors: Exploring Label Dimensionality Reduction and Multimodal Large Language Models as Feature Extractors for Predicting Social Perception(Iván Martín-Fernández, Sergio Esteban-Romero, Jaime Bellver-Soler, F. Fernández-Martínez, M. Gil-Martín, 2024, Proceedings of the 5th on Multimodal Sentiment Analysis Challenge and Workshop: Social Perception and Humor)
- Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting(Ana Rita Ortigoso, Gabriel Vieira, Daniel Fuentes, Luís Frazão, Nuno Costa, António Pereira, 2025, ArXiv)
- Multimodal emotion estimation and emotional synthesize for interaction virtual agent(Minghao Yang, J. Tao, Hao Li, Kaihui Mu, 2012, 2012 IEEE 2nd International Conference on Cloud Computing and Intelligence Systems)
- PRISM: A Personality-Driven Multi-Agent Framework for Social Media Simulation(Zhixiang Lu, Xueyuan Deng, Yiran Liu, Yulong Li, Qiang Yan, Imran Razzak, Jionglong Su, 2025, ArXiv)
- Decoding cultural tapestries: A deep dive into Indian social stigma patterns in large language models(Sridhar Jonnala, Rushikesh Tade, N. Thomas, 2025, Journal of Asian Scientific Research)
- Human Behavior Atlas: Benchmarking Unified Psychological and Social Behavior Understanding(Keane Ong, Wei Dai, Carol Li, Dewei Feng, Hengzhi Li, Jingyao Wu, Jiaee Cheong, Rui Mao, G. Mengaldo, Erik Cambria, P. Liang, 2025, ArXiv)
- Multi levels semantic architecture for multimodal interaction(S. Dourlens, A. Ramdane-Cherif, É. Monacelli, 2013, Applied Intelligence)
- SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems(Dong Zhang, Zhaowei Li, Pengyu Wang, Xin Zhang, Yaqian Zhou, Xipeng Qiu, 2024, ArXiv)
- CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation(Nicolas Bougie, Narimasa Watanabe, 2025, No journal)
- Demonstrating EMMA: Embodied MultiModal Agent for Language-guided Action Execution in 3D Simulated Environments(Alessandro Suglia, Bhathiya Hemanthage, Malvina Nikandrou, G. Pantazopoulos, Amit Parekh, Arash Eshghi, Claudio Greco, Ioannis Konstas, Oliver Lemon, Verena Rieser, 2022, No journal)
- HoME: a Household Multimodal Environment(Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, L. Celotti, Florian Strub, J. Rouat, H. Larochelle, Aaron C. Courville, 2017, ArXiv)
- GraphCortex: A Visual Language Model-Guided Knowledge Graph Based Reasoning Framework for Robotic Long-Term Task Planning(Shaozhuo Huang, Nan Li, Jue Zhang, 2025, 2025 2nd International Conference on Intelligent Computing and Robotics (ICICR))
- Simulation for All: A Step-by-Step Cookbook for Developing Human-Centered Multi-Agent Transportation Simulators(S. Azimi, Arash Tavakoli, 2025, IEEE Access)
- CH-MARL: A Multimodal Benchmark for Cooperative, Heterogeneous Multi-Agent Reinforcement Learning(Vasu Sharma, P. Goyal, Kaixiang Lin, Govind Thattai, Qiaozi Gao, G. Sukhatme, 2022, ArXiv)
- Symbol Emergence as an Interpersonal Multimodal Categorization(Y. Hagiwara, Hiroyoshi Kobayashi, Akira Taniguchi, T. Taniguchi, 2019, Frontiers in Robotics and AI)
- Designing a Data Corpus of Collaborative Group Tasks with the Members from Unbalanced Cultural Backgrounds(Kaihua Ding, Hung-Hsuan Huang, Nicolas Berberich, Mineya Kaseda, K. Kuwabara, T. Nishida, 2019, Proceedings of the 7th International Conference on Human-Agent Interaction)
- CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart(Bowen Zhao, Tianhao Cheng, Yuejie Zhang, Ying Cheng, Rui Feng, Xiaobo Zhang, 2024, Proceedings of the 32nd ACM International Conference on Multimedia)
- Towards an intelligent framework for multimodal affective data analysis(Soujanya Poria, E. Cambria, A. Hussain, G. Huang, 2015, Neural networks : the official journal of the International Neural Network Society)
- Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions(Jihyoung Jang, Minwook Bae, Minji Kim, Dilek Hakkani-Tur, Hyounghun Kim, 2025, No journal)
- SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph(Yuxing Long, Binyuan Hui, Fulong Ye, Yanyang Li, Zhuoxin Han, Caixia Yuan, Yongbin Li, Xiaojie Wang, 2023, No journal)
- TongSIM: A General Platform for Simulating Intelligent Machines(Zhe Sun, Kunlun Wu, Chuanjian Fu, Ze Song, L. Shi, Ziheng Xue, Bohan Jing, Ying-Jie Yang, Xiaomeng Gao, Aijia Li, Tianyu Guo, Huiying Li, Xueyuan Yang, Rongkai Liu, Xinyi He, Yuxi Wang, Yue Li, Mingyuan Liu, Yujie Lu, Hong-Kai Xie, Shiyun Zhao, Bo Dai, Wei Wang, Tao Yuan, Song Zhu, Yujia Peng, Zhenliang Zhang, 2025, ArXiv)
- MineLand: Simulating Large-Scale Multi-Agent Interactions with Limited Multimodal Senses and Physical Needs(Xianhao Yu, Jiaqi Fu, Renjia Deng, Wenjuan Han, 2024, ArXiv)
- EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM(Shuang Ao, Flora D. Salim, Simon Khan, 2025, ArXiv)
- Multimodal Embodied Interactive Agent for Cafe Scene(Yang Liu, Xinshuai Song, Kaixuan Jiang, Weixing Chen, Jing-Hua Luo, Guanbin Li, Liang Lin, 2024, ArXiv)
- Spatial-Temporal Aligned Multi-Agent Learning for Visual Dialog Systems(Yong Zhuang, Tong Yu, Junda Wu, Shiqu Wu, Shuai Li, 2022, Proceedings of the 30th ACM International Conference on Multimedia)
- Event2Tracking: Reconstructing Multi-Agent Soccer Trajectories Using Long-Term Multimodal Context(Harry Hughes, Michael Horton, Xinyu Wei, Harshala Gammulle, C. Fookes, S. Sridharan, P. Lucey, 2025, No journal)
Social Safety Governance, Risk Prevention, and Ethical Bias
These works examine safety threats in multi-agent systems (such as infectious jailbreaks), misinformation diffusion, echo-chamber effects, hate-speech detection, and governance mechanisms for controversial multimodal content.
- Multimodal Safety Evaluation in Generative Agent Social Simulations(Alhim Vera, Karen Sanchez, Carlos Hinojosa, Haidar Bin Hamid, Donghoon Kim, Bernard Ghanem, 2025, ArXiv)
- Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast(Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin, 2024, No journal)
- Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions(Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao, 2024, No journal)
- From Misinformation to Resilient Communication: Strategic Simulation of Social Network Dynamics in the Pharmaceutical Industry(Filippo Ghisi, Marco Gotelli, Vittorio Solina, Flavio Tonelli, 2025, Applied Sciences)
- Generative Agents for Multimodal Controversy Detection(Tianjiao Xu, Jinfei Gao, Keyi Kong, Jianhua Yin, Tian Gan, Liqiang Nie, 2024, No journal)
- Large Language Model Driven Agents for Simulating Echo Chamber Formation(Chenhao Gu, Ling Luo, Zainab R. Zaidi, S. Karunasekera, 2025, ArXiv)
- Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs(Firoj Alam, Md. Rafiul Biswas, Uzair Shah, W. Zaghouani, Georgios Mikros, 2024, ArXiv)
- Multimodal Large Model-based False Marketing and Hype Propagation Detection on Social Platforms(Yitong Yang, Fang Lin, Yuheng Li, 2025, 2025 7th International Conference on Frontier Technologies of Information and Computer (ICFTIC))
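The "infectious jailbreak" threat studied in Agent Smith can be pictured with a stylized contact-spread model: one compromised agent exposes a few random peers to an adversarial payload each round, so the compromised population can grow multiplicatively. This sketch is purely illustrative, not the paper's actual attack; all parameter values are assumptions.

```python
import random

def simulate_spread(n_agents=64, contacts_per_round=2, p_transmit=0.9,
                    rounds=8, seed=1):
    """Stylized spread model: each round, every compromised agent contacts
    a few random peers, each of which is compromised with probability
    p_transmit. Returns the infection curve over rounds."""
    random.seed(seed)
    infected = {0}  # a single initially-jailbroken agent
    curve = [len(infected)]
    for _ in range(rounds):
        newly = set()
        for _agent in infected:
            for peer in random.sample(range(n_agents), contacts_per_round):
                if peer not in infected and random.random() < p_transmit:
                    newly.add(peer)
        infected |= newly
        curve.append(len(infected))
    return curve
```

The qualitative takeaway matches the section's concern: without containment, per-round growth is proportional to the number of already-compromised agents, so defenses must act early.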
Vertical-Domain Applications and Benchmarks for Interactive Systems
This section showcases concrete applications of multi-agent systems in education, healthcare, finance, government services, and digital ecosystems (e-commerce, recommender systems), and presents domain-specific evaluation benchmark platforms (e.g., OS operation, clinical diagnosis).
- Research on a virtual teacher personalized interaction model integrating affective computing and multi-agent systems(Rili Dang, Noorazman Abd Samad, 2025, Future Technology)
- Agent-to-Agent (A2A) Protocol Integrated Digital Twin System with AgentIQ for Multimodal AI Fitness Coaching and Personalized Well-Being(Kamran Gholizadeh HamlAbadi, M. Vahdati, Fedwa Laamarti, Abdulmotaleb El Saddik, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Design of an Immersive Basketball Tactical Training System Based on Digital Twins and Federated Learning(Xiongce Lv, Ye Tao, Yifang Zhang, Yang Xue, 2025, Applied Sciences)
- Research on an AI Interview Evaluation System Integrating Multi-Agent Systems and Virtual Digital Humans(Jiayi Wu, Jiaqi Zhang, Li Gao, Jialiang Feng, Bo Meng, Yifan Wu, Mingming Gong, 2025, Journal of Big Data and Computing)
- Social Governance Oriented Multimodal Situation Perception and Bilateral Collaborative Scheduling Simulation(Yanxing Chen, Jun Wu, Renjie Li, Ran Xu, Yaqin Li, Youcheng Yang, 2024, 2024 17th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI))
- MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications(Aleksandr Algazinov, Matt Laing, Paul Laban, 2025, ArXiv)
- HyLECA: A Framework for Developing Hybrid Long-term Engaging Controlled Conversational Agents(Erkan Basar, Divyaa Balaji, Linwei He, Iris Hendrickx, E. Krahmer, Gert-Jan de Bruijn, Tibor Bosse, 2023, Proceedings of the 5th International Conference on Conversational User Interfaces)
- LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation(Yijun Liu, Wu Liu, Xiaoyan Gu, Yong Rui, Xiaodong He, Yongdong Zhang, 2024, ArXiv)
- LLM-Empowered Creator Simulation for Long-Term Evaluation of Recommender Systems Under Information Asymmetry(Xiaopeng Ye, Chen Xu, ZhongXiang Sun, Jun Xu, Gang Wang, Zhenhua Dong, Jirong Wen, 2025, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)
- Exploring Recommender System Evaluation: A Multi-Modal User Agent Framework for A/B Testing(Wenlin Zhang, Xiangyang Li, Qiyuan Ge, Kuicai Dong, Pengyue Jia, Xiaopeng Li, Zijian Zhang, Maolin Wang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, 2026, ArXiv)
- Engagement-Driven Content Generation with Large Language Models(Erica Coppolillo, Federico Cinus, Marco Minici, Francesco Bonchi, Giuseppe Manco, 2024, Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2)
- Multi-Agent Multimodal Models for Multicultural Text to Image Generation(Parth Bhalerao, Mounika Yalamarty, B. Trinh, Oana Ignat, 2025, ArXiv)
- Digital Player: Evaluating Large Language Models based Human-like Agent in Games(Jiawei Wang, Kai Wang, Shaojie Lin, Runze Wu, Bihan Xu, Ling Jiang, Shiwei Zhao, Renyu Zhu, Haoyu Liu, Zhipeng Hu, Zhong Fan, Le Li, Tangjie Lyu, Changjie Fan, 2025, ArXiv)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents(Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, X. Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip H. S. Torr, Bernard Ghanem, G. Li, 2024, No journal)
- AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments(Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, Michael Moor, 2024, ArXiv)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments(Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, T. Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu, 2024, ArXiv)
- Active Agent Oriented Multimodal Interface System(O. Hasegawa, K. Itou, Takio Kurita, S. Hayamizu, Kazuyo Tanaka, Kazuhiko Yamamoto, N. Otsu, 1995, No journal)
- A Framework for Supporting Multimodal Conversational Characters in a Multi-agent System(Yasmine Arafa, Abe Mamdani, 2000, No journal)
- Impact of mindset types and social community compositions on opinion dynamics: A large language model-based multi-agent simulation study(Guozhu Ding, Zuer Liu, Shan Li, Jie Cao, Z. Ye, 2025, Comput. Hum. Behav.)
- A Multi-Agent Digital Twin Framework for AI-Driven Fitness Coaching(M. Vahdati, Kamran Gholizadeh HamlAbadi, Fedwa Laamarti, Abdulmotaleb El Saddik, 2025, Proceedings of the 2025 ACM International Conference on Interactive Media Experiences)
- Modeling Multi-Party Interaction in Couples Therapy: A Multi-Agent Simulation Approach(Canwen Wang, A. Chen, Catherine Bao, Siwei Jin, Y. Chan, Jessica R Mindel, Sijia Xie, Holly Swartz, Tongshuang Wu, Robert E. Kraut, Haiyi Zhu, 2026, ArXiv)
- Build a Multimodal Interaction and Multi-Agent Collaborative Decision-Making Mechanism Enhanced by Large Models in the Intelligent Decision-Making System for Distribution Network Production(Wei Zhang, Song Wang, Shuai Zhang, Yuanyuan Lei, L. Bao, 2025, International Journal of Computational Intelligence and Applications)
- 3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark(Ivan Sviridov, Amina Miftakhova, Artemiy Tereshchenko, Galina Zubkova, Pavel Blinov, Andrey Savchenko, 2025, No journal)
- FinArena: A Human-Agent Collaboration Framework for Financial Market Analysis and Forecasting(Congluo Xu, Zhaobin Liu, Ziyang Li, 2025, ArXiv)
- Research on the Laws of Multimodal Perception and Cognition from a Cross-cultural Perspective - Taking Overseas Chinese Gardens as an Example(Ran Chen, Xueqi Yao, Jing Zhao, Shuhan Xu, Sirui Zhang, Yijun Mao, 2023, ArXiv)
- MAXplain: A Multi-Agent System for Interactive Multimodal Hate Speech Detection(Nils Riekers, Marten Risius, Tong Chen, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Leveraging Long Short-Term User Preference in Conversational Recommendation via Multi-agent Reinforcement Learning(Yang Deng, Yaliang Li, Bolin Ding, W. Lam, 2023, IEEE Transactions on Knowledge and Data Engineering)
- Indirect Agent Interaction within an Approach for a Robust Transport Control in Dynamic and Multimodal Logistics Networks(Heiko Jung, S. Weissbach, J. Kappler, 2011, Electron. Commun. Eur. Assoc. Softw. Sci. Technol.)
- A Multi-agent Based Testbed for Agent Interface Evaluation(Chung-Min Wu, M. Hsieh, Chin-Hsing Luo, 2008, 2008 Eighth International Conference on Intelligent Systems Design and Applications)
- Emotional Intelligence in Artificial Agents: Leveraging Deep Multimodal Big Data for Contextual Social Interaction and Adaptive Behavioral Modelling(V. Annapareddy, Jeevani Singireddy, Botlagunta Preethish Nanan, Phanish Lakarasu, J. Burugulla, 2025, SSRN Electronic Journal)
- Human-like Social Compliance in Large Language Models: Unifying Sycophancy and Conformity through Signal Competition Dynamics(Long Zhang, Wei-neng Chen, 2025, ArXiv)
- Bridging the behavior-neural gap: A multimodal AI reveals the brain's geometry of emotion more accurately than human self-reports(Changde Du, Yizhuo Lu, Zhongyu Huang, Yi Sun, Zisen Zhou, Shaozheng Qin, Huiguang He, 2025, ArXiv)
- Large Model Strategic Thinking, Small Model Efficiency: Transferring Theory of Mind in Large Language Models(Nunzio Lorè, Sepehr Ilami, Babak Heydari, 2024, ArXiv)
- Human-Autonomous System Interaction Graphical Notation (HASIGN): How Do We Design for Human-Multi-AGV Interaction in Manufacturing Intralogistics?(Rana El Khoury, Denis Zatyagov, Igor Rybalskii, Karl Kruusamäe, Jonas S. I. Rieder, Walter Quadrini, Thomas Trautner, Martijn Verbeij, Cecilia Colloseus, Doris Aschenbrenner, 2026, ACM Transactions on Autonomous and Adaptive Systems)
- Nadine: A large language model‐driven intelligent social robot with affective capabilities and human‐like memory(Hangyeol Kang, Maher Ben Moussa, N. Thalmann, 2024, Computer Animation and Virtual Worlds)
- Two people walk into a bar: dynamic multi-party social interaction with a robot agent(Mary Ellen Foster, Andre Gaschler, M. Giuliani, Amy Isard, M. Pateraki, Ronald P. A. Petrick, 2012, No journal)
- DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents(Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, Edward Choi, 2024, ArXiv)
- Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties(Philipp J. Schneider, Lin Tian, Marian-Andrei Rizoiu, 2025, ArXiv)
This report consolidates the core research under the theme of "Long-Horizon Multi-Agent Multimodal Social Scenarios," building a complete picture from low-level technical foundations to high-level social applications. The research covers: 1) macro-level social dynamics simulation, revealing how group behavior evolves; 2) micro-level long-term memory management, resolving the problem of interaction consistency; 3) meso-level game-theoretic and cooperative mechanisms, improving the effectiveness of multi-agent decision-making; 4) multimodal perception and embodied intelligence, strengthening agents' ability to operate in and understand complex environments; 5) system safety and social governance, addressing the risks of AI socialization; and 6) vertical-domain deployment and benchmark construction, driving intelligent transformation in healthcare, education, and other industries. Together, these studies point toward general-purpose agent systems with strong social intelligence, long-term stability, and multimodal interaction capabilities.
A total of 139 related references.
Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
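The Reflector-weighted aggregation step can be illustrated with a simple weighted vote over Solver answers. WISE's actual modified Dawid-Skene algorithm additionally models per-agent reliability across debate rounds, so the following is only a simplified stand-in, with hypothetical answers and weights:

```python
from collections import defaultdict

def weighted_vote(solutions, weights):
    """Pick the answer with the largest total reflector-assigned weight --
    a simplified stand-in for Dawid-Skene-style aggregation."""
    tally = defaultdict(float)
    for answer, w in zip(solutions, weights):
        tally[answer] += w
    return max(tally, key=tally.get)

# Three Solvers answer a visual puzzle; the Reflector trusts the
# multimodal agent (first entry) most. Values are hypothetical.
answers = ["B", "C", "C"]
weights = [0.9, 0.3, 0.4]
winner = weighted_vote(answers, weights)  # high-weight minority wins
```

With these illustrative weights, the single trusted agent's answer "B" (weight 0.9) overrides the unweighted majority "C" (combined weight 0.7), which is the basic motivation for weighting agents by verified reliability rather than counting votes.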
While Vision-Language Models (VLMs) hold promise for tasks requiring extensive collaboration, traditional multi-agent simulators have facilitated rich explorations of an interactive artificial society that reflects collective behavior. However, these existing simulators face significant limitations. Firstly, they struggle with handling large numbers of agents due to high resource demands. Secondly, they often assume agents possess perfect information and limitless capabilities, hindering the ecological validity of simulated social interactions. To bridge this gap, we propose MineLand, a multi-agent Minecraft simulator that introduces three key features: large-scale scalability, limited multimodal senses, and physical needs. Our simulator supports 64 or more agents. Agents have limited visual, auditory, and environmental awareness, forcing them to actively communicate and collaborate to fulfill physical needs like food and resources. Additionally, we introduce an AI agent framework, Alex, inspired by multitasking theory, enabling agents to handle intricate coordination and scheduling. Our experiments demonstrate that the simulator, the corresponding benchmark, and the AI agent framework contribute to more ecological and nuanced collective behavior. The source code of MineLand and Alex is openly available at https://github.com/cocacola-lab/MineLand.
Multi-agent debate has proven effective in improving large language model quality for reasoning and factuality tasks. While various role-playing strategies in multi-agent debates have been explored, in terms of the communication among agents, existing approaches adopt a brute-force algorithm -- each agent can communicate with all other agents. In this paper, we systematically investigate the effect of communication connectivity in multi-agent systems. Our experiments on GPT and Mistral models reveal that multi-agent debates leveraging sparse communication topology can achieve comparable or superior performance while significantly reducing computational costs. Furthermore, we extend the multi-agent debate framework to multimodal reasoning and alignment labeling tasks, showcasing its broad applicability and effectiveness. Our findings underscore the importance of communication connectivity on enhancing the efficiency and effectiveness of the "society of minds" approach.
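A sparse debate topology of the kind studied here can be sketched as a ring, where each agent revises its answer using only its two neighbours' answers (n messages per round) instead of all n-1 peers' (n(n-1) messages). The update rule below is a toy majority heuristic, not the paper's method; it only illustrates how the topology restricts information flow:

```python
def ring_topology(n):
    """Sparse topology: agent i hears only its two ring neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def debate_round(answers, topology, update):
    # Each agent revises its answer given only its visible peers' answers.
    return [update(answers[i], [answers[j] for j in topology[i]])
            for i in range(len(answers))]

def majority_update(own, peer_answers):
    # Toy stand-in for an LLM revising its answer: adopt the most common
    # answer among itself and the peers it can see.
    pool = [own] + peer_answers
    return max(set(pool), key=pool.count)
```

Even with only local communication, a single dissenting answer can be corrected in one round once its neighbours agree, which is the intuition behind sparse topologies matching fully connected debate at a fraction of the cost.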
Accessibility remains a critical concern in today's society, as many technologies are not developed to support the full range of user needs. Existing multi-agent systems (MAS) often cannot provide comprehensive assistance for users in need due to the lack of customization stemming from closed-source designs. Consequently, individuals with disabilities frequently encounter significant barriers when attempting to interact with digital environments. We introduce MATE, a multimodal accessibility MAS, which performs modality conversions based on the user's needs. The system assists people with disabilities by ensuring that data is converted into an understandable format. For instance, if the user cannot see well and receives an image, the system converts this image to its audio description. MATE can be applied to a wide range of domains, industries, and areas, such as healthcare, and can become a useful assistant for various groups of users. The system supports multiple types of models, ranging from LLM API calling to using custom machine learning (ML) classifiers. This flexibility ensures that the system can be adapted to various needs and is compatible with a wide variety of hardware. Since the system is expected to run locally, it ensures the privacy and security of sensitive information. In addition, the framework can be effectively integrated with institutional technologies (e.g., digital healthcare services) for real-time user assistance. Furthermore, we introduce ModCon-Task-Identifier, a model that is capable of extracting the precise modality conversion task from the user input. Numerous experiments show that ModCon-Task-Identifier consistently outperforms other LLMs and statistical models on our custom data. Our code and data are publicly available at https://github.com/AlgazinovAleksandr/Multi-Agent-MATE.
Human communication is a complex and diverse process that not only involves multiple factors such as language, commonsense, and cultural backgrounds but also requires the participation of multimodal information, such as speech. Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society. However, current LLM-based multi-agent systems mainly rely on text as the primary medium. Can we leverage LLM-based multi-agent systems to simulate human communication? In this paper, we propose SpeechAgents, a multi-modal LLM-based multi-agent system designed for simulating human communication. SpeechAgents utilizes a multi-modal LLM as the control center for each individual agent and employs multi-modal signals as the medium for messages exchanged among agents. Additionally, we propose Multi-Agent Tuning to enhance the multi-agent capabilities of LLMs without compromising general abilities. To strengthen and evaluate the effectiveness of human communication simulation, we build the Human-Communication Simulation Benchmark. Experimental results demonstrate that SpeechAgents can simulate human communication dialogues with consistent content, authentic rhythm, and rich emotions, and demonstrates excellent scalability even with up to 25 agents, which can apply to tasks such as drama creation and audio novel generation. Code and models will be open-sourced at https://github.com/0nutation/SpeechAgents
The believable simulation of multi-user behavior is crucial for understanding complex social systems. Recently, large language model (LLM)-based AI agents have made significant progress, enabling them to achieve human-like intelligence across various tasks. However, real human societies are often dynamic and complex, involving numerous individuals engaging in multimodal interactions. In this paper, taking e-commerce scenarios as an example, we present LMAgent, a very large-scale multimodal agent society based on multimodal LLMs. In LMAgent, besides freely chatting with friends, the agents can autonomously browse, purchase, and review products, and even perform live-streaming e-commerce. To simulate this complex system, we introduce a self-consistency prompting mechanism to augment agents' multimodal capabilities, resulting in significantly improved decision-making performance over existing multi-agent systems. Moreover, we propose a fast memory mechanism combined with the small-world model to enhance system efficiency, which supports more than 10,000 agent simulations in a society. Experiments on agents' behavior show that these agents achieve comparable performance to humans on behavioral indicators. Furthermore, compared with existing LLM-based multi-agent systems, richer and more valuable phenomena emerge, such as herd behavior, which demonstrates the potential of LMAgent for credible large-scale social behavior simulations.
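The small-world idea behind the fast memory mechanism can be sketched with a Watts-Strogatz-style construction (our illustration; LMAgent's exact construction may differ, and all names here are ours): agents mostly interact with ring neighbours, plus a few random long-range shortcuts.

```python
# Sketch of a small-world contact graph for an agent society.
import random

def watts_strogatz(n: int, k: int, p: float, seed: int = 0) -> set:
    """Ring lattice of n agents with k neighbours per side; each edge is
    rewired to a random agent with probability p (and resampled on any
    self-loop or duplicate, so exactly n*k unique edges come out)."""
    rng = random.Random(seed)
    edges = set()
    for i in range(n):
        for j in range(1, k + 1):
            b = (i + j) % n                      # default ring neighbour
            if rng.random() < p:                 # occasional long-range shortcut
                b = rng.randrange(n)
            while b == i or (i, b) in edges or (b, i) in edges:
                b = rng.randrange(n)             # resample on collision
            edges.add((i, b))
    return edges

edges = watts_strogatz(n=100, k=2, p=0.1)
# Each agent keeps only a handful of contacts instead of 99, so the message
# fan-out (and memory lookups per step) stays bounded as the society grows.
```

This is why a small-world wiring helps scale to 10,000+ agents: interaction cost per step grows with the (constant) degree rather than with the population size, while the shortcuts keep the society's effective diameter small.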
This study focuses on category formation for individual agents and the dynamics of symbol emergence in a multi-agent system through semiotic communication. In this study, semiotic communication refers to exchanging signs composed of the signifier (i.e., words) and the signified (i.e., categories). We define the generation and interpretation of signs associated with the categories formed through the agent's own sensory experience or by exchanging signs with other agents as basic functions of semiotic communication. From the viewpoint of language evolution and symbol emergence, the organization of a symbol system in a multi-agent system (i.e., agent society) is considered a bottom-up, dynamic process in which individual agents share the meaning of signs and categorize sensory experience. A constructive computational model can explain the mutual dependency of the two processes and has mathematical support that guarantees a symbol system's emergence and sharing within the multi-agent system. In this paper, we describe a new computational model that represents symbol emergence in a two-agent system based on a probabilistic generative model for multimodal categorization. It models semiotic communication via a probabilistic rejection based on the receiver's own belief. We have found that the dynamics by which cognitively independent agents create a symbol system through their semiotic communication can be regarded as the inference process of a hidden variable in an interpersonal multimodal categorizer, i.e., the complete system can be regarded as a single agent performing multimodal categorization using the sensors of all agents, if we define the rejection probability based on the Metropolis-Hastings algorithm. The validity of the proposed model and algorithm for symbol emergence, i.e., forming and sharing signs and categories, is also verified in an experiment with two agents observing daily objects in a real-world environment.
In the experiment, we compared three communication algorithms: no communication, no rejection, and the proposed algorithm. The experimental results demonstrate that our model reproduces the phenomena of symbol emergence, which does not require a teacher who would know a pre-existing symbol system. Instead, the multi-agent system can form and use a symbol system without having pre-existing categories.
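The Metropolis-Hastings-style acceptance rule described above can be sketched in a few lines (our toy illustration, not the paper's model; the belief table and all names are ours): the receiver accepts a speaker's proposed sign with probability proportional to how plausible that sign is under the receiver's own belief, relative to its current sign.

```python
# Toy MH-style sign acceptance based on the receiver's own belief.
import random

def mh_accept(p_proposed: float, p_current: float, rng: random.Random) -> bool:
    """Accept the speaker's sign with probability min(1, p_proposed/p_current),
    judged entirely by the receiver's own belief."""
    if p_current == 0.0:
        return True  # anything beats a sign the receiver gives zero belief
    ratio = min(1.0, p_proposed / p_current)
    return rng.random() < ratio

rng = random.Random(0)
belief = {"wa": 0.7, "yo": 0.3}   # toy P_r(sign | category) for one category
accepted = mh_accept(belief["wa"], belief["yo"], rng)
# 0.7/0.3 > 1, so this particular proposal is always accepted.
```

The key property is that no agent ever inspects the other's internal belief: acceptance depends only on the receiver's own probabilities, which is what lets the pair jointly behave like a single sampler over shared signs.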
Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through a temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via an Assessor Agent. It includes 2,996 cases across 34 diagnoses from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open- and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.
This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar's Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.
Large Language Models (LLMs) demonstrate impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of existing data and models. Meanwhile, multi-agent models have shown strong capabilities in solving complex tasks. In this paper, we evaluate the performance of LLMs in a multi-agent interaction setting for the novel task of multicultural image generation. Our key contributions are: (1) We introduce MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas; (2) We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics, offering valuable insights for future research. Our dataset and models are available at https://github.com/OanaIgnat/MosAIG.
In this demo, we present SupportPlay, a multi-agent role-playing system for emotional support conversation (ESC) that addresses the limitations of existing methods in providing personalized, sustained, and multimodal support. SupportPlay generates potential seeker profiles from existing ESC datasets and employs GPT-powered agents to play various seeker roles, enabling the supporter to learn personalized memories for each seeker. Subsequently, the supporter can induce general memory from these memories for real user interactions while learning the user's personalized memory in a similar manner. Through continuous memory management, including retrieval, storage, reflection, and forgetting, SupportPlay delivers tailored emotional support across interactions. By integrating text, speech, and video, SupportPlay creates immersive emotional support experiences.
In the past decade, social media platforms have been used for information dissemination and consumption. While a major portion of the content is posted to promote citizen journalism and public awareness, some content is posted to mislead users. Among different content types such as text, images, and videos, memes (text overlaid on images) are particularly prevalent and can serve as powerful vehicles for propaganda, hate, and humor. In the current literature, there have been efforts to individually detect such content in memes. However, the study of their intersection is very limited. In this study, we explore the intersection between propaganda and hate in memes using a multi-agent LLM-based approach. We extend the propagandistic meme dataset with coarse- and fine-grained hate labels. Our findings suggest that there is an association between propaganda and hate in memes. We provide detailed experimental results that can serve as a baseline for future studies. We will make the experimental resources publicly available to the community (https://github.com/firojalam/propaganda-and-hateful-memes).
Modeling human behavior in urban environments is fundamental for social science, behavioral studies, and urban planning. Prior work often relies on rigid, hand-crafted rules, limiting its ability to simulate nuanced intentions, plans, and adaptive behaviors. Addressing these challenges, we envision an urban simulator (CitySim), capitalizing on breakthroughs in human-level intelligence exhibited by large language models. In CitySim, agents generate realistic daily schedules using a recursive value-driven approach that balances mandatory activities, personal habits, and situational factors. To enable long-term, lifelike simulations, we endow agents with beliefs, long-term goals, and spatial memory for navigation. CitySim exhibits closer alignment with real humans than prior work, both at micro and macro levels. Additionally, we conduct insightful experiments by modeling tens of thousands of agents and evaluating their collective behaviors under various real-world scenarios, including estimating crowd density, predicting place popularity, and assessing well-being. Our results highlight CitySim as a scalable, flexible testbed for understanding and forecasting urban phenomena.
Establishing the long-term, causal impact of psychological interventions on life outcomes is a grand challenge for the social sciences, caught between the limitations of correlational longitudinal studies and short-term randomized controlled trials (RCTs). This paper introduces Large-Scale Agent-based Longitudinal Simulation (LALS), a framework that resolves this impasse by simulating multi-decade, counterfactual life trajectories. The methodology employs a "digital clone" design where 2,500 unique LLM-based agent personas (grounded in a curated corpus of 3,917 empirical research articles) are each cloned across a 2x2 factorial experiment. Specifically, the simulation models the efficacy of extended psychological resilience training (Intervention vs. Control) administered either in childhood or as a young adult (age 6 vs. age 18). Comparing digital clones enables exceptionally precise causal inference. The simulation provides a quantitative, causal estimate of a resilience intervention's lifelong effects, revealing significant reductions in mortality, a lower incidence of dementia, and a substantial increase in accumulated wealth. Crucially, the results uncover a developmental window: the intervention administered at age 6 produced more than double the positive impact on lifetime wealth compared to the same intervention at age 18. These benefits were most pronounced for agents from low-socioeconomic backgrounds, highlighting a powerful buffering effect. The LALS framework serves as a "computational wind tunnel" for social science, offering a new paradigm for generating and testing causal hypotheses about the complex, lifelong dynamics that shape human capital and well-being.
Artificial agents with the aid of large language models (LLMs) are effective in various real-world scenarios but struggle to cooperate in social dilemmas. When making decisions under the strain of choosing between long-term consequences and short-term benefits in commonly shared resources, LLM-based agents often exploit the environment, leading to early depletion. Inspired by the concept of consideration of future consequences (CFC), which is well known in social psychology, we propose a framework that enables LLM-based agents to consider future consequences, resulting in a new kind of agent that we term the CFC-Agent. We enable the CFC-Agent to act toward different levels of consideration for future consequences. Our first set of experiments, where the LLM is directly asked to make decisions, shows that agents considering future consequences exhibit sustainable behaviour and achieve high common rewards for the population. Extensive experiments in complex environments showed that the CFC-Agent can manage a sequence of LLM calls for reasoning and engage in communication to cooperate with others to better resolve the commons dilemma. Finally, our analysis showed that considering future consequences not only affects the final decision but also improves the conversations between LLM-based agents toward a better resolution of social dilemmas.
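How a CFC weight can flip a harvesting decision is easy to see in a toy commons model. The sketch below is our illustration, not the paper's CFC-Agent; the regrowth rule, parameters, and names are all assumptions made for the example.

```python
# Toy commons dilemma: does weighting future consequences change the harvest?
def resource_after(stock: float, harvest: float, growth: float = 0.2) -> float:
    """Regrowth of a common pool after harvesting (20% per period)."""
    remaining = max(stock - harvest, 0.0)
    return remaining * (1.0 + growth)

def cfc_value(stock: float, harvest: float, cfc_weight: float,
              horizon: int = 10) -> float:
    """Immediate payoff plus a CFC-weighted sum of future attainable payoffs."""
    value = harvest
    s = resource_after(stock, harvest)
    for _ in range(horizon):
        future_harvest = min(harvest, s)   # can only take what is left
        value += cfc_weight * future_harvest
        s = resource_after(s, future_harvest)
    return value

stock = 100.0
# A myopic agent (cfc_weight=0) prefers the largest harvest; a future-oriented
# agent (cfc_weight=1) prefers the sustainable one, since over-harvesting
# collapses the pool after a single period.
myopic_best = max([20.0, 80.0], key=lambda h: cfc_value(stock, h, 0.0))
cfc_best = max([20.0, 80.0], key=lambda h: cfc_value(stock, h, 1.0))
```

In this toy, harvesting 80 yields 80 once and then nothing, while harvesting 20 yields roughly 20 per period for the whole horizon, so the future-weighted value favours restraint, which mirrors the sustainable behaviour the abstract reports.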
Social norms play a crucial role in guiding agents towards understanding and adhering to standards of behavior, thus reducing social conflicts within multi-agent systems (MASs). However, current LLM-based (or generative) MASs lack the capability to be normative. In this paper, we propose a novel architecture, named CRSEC, to empower the emergence of social norms within generative MASs. Our architecture consists of four modules: Creation & Representation, Spreading, Evaluation, and Compliance. This addresses several important aspects of the emergent processes all in one: (i) where social norms come from, (ii) how they are formally represented, (iii) how they spread through agents' communications and observations, (iv) how they are examined with a sanity check and synthesized in the long term, and (v) how they are incorporated into agents' planning and actions. Our experiments deployed in the Smallville sandbox game environment demonstrate the capability of our architecture to establish social norms and reduce social conflicts within generative MASs. The positive outcomes of our human evaluation, conducted with 30 evaluators, further affirm the effectiveness of our approach. Our project can be accessed via the following link: https://github.com/sxswz213/CRSEC.
Contemporary approaches to agent-based modeling (ABM) of social systems have traditionally emphasized rule-based behaviors, limiting their ability to capture nuanced dynamics; language models (LMs) make it possible to move beyond predefined rules by leveraging contextual understanding of human social interaction. This paper presents SALM (Social Agent LM Framework), a novel approach for integrating language models into social network simulation that achieves unprecedented temporal stability in multi-agent scenarios. Our primary contributions include: (1) a hierarchical prompting architecture enabling stable simulation beyond 4,000 timesteps while reducing token usage by 73%, (2) an attention-based memory system achieving 80% cache hit rates (95% CI [78%, 82%]) with sub-linear memory growth of 9.5%, and (3) formal bounds on personality stability. Through extensive validation against SNAP ego networks, we demonstrate the first LLM-based framework capable of modeling long-term social phenomena while maintaining empirically validated behavioral fidelity.
Diplomacy is one of the most sophisticated activities in human society, involving complex interactions among multiple parties that require skills in social reasoning, negotiation, and long-term strategic planning. Previous AI agents have demonstrated their ability to handle multi-step games and large action spaces in multi-agent tasks. However, diplomacy involves a staggering magnitude of decision spaces, especially considering the negotiation stage required. While recent agents based on large language models (LLMs) have shown potential in various applications, they still struggle with extended planning periods in complex multi-agent settings. Leveraging recent technologies for LLM-based agents, we aim to explore AI's potential to create a human-like agent capable of executing comprehensive multi-agent missions by integrating three fundamental capabilities: 1) strategic planning with memory and reflection; 2) goal-oriented negotiation with social reasoning; and 3) augmenting memory through self-play games for self-evolution without a human in the loop.
We introduce a novel non-cooperative game to analyse opinion formation and resistance, incorporating principles from social psychology such as confirmation bias, resource constraints, and influence penalties. Our simulation features Large Language Model (LLM) agents competing to influence a population, with penalties imposed for generating messages that propagate or counter misinformation. This framework integrates resource optimisation into the agents' decision-making process. Our findings demonstrate that while weaker confirmation bias strengthens opinion alignment within groups, it also exacerbates overall polarisation. Conversely, stronger confirmation bias leads to fragmented opinions and limited shifts in individual beliefs. Investing heavily in a high-resource debunking strategy can initially align the population with the debunking agent, but risks rapid resource depletion and diminished long-term influence.
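The confirmation-bias effect this abstract describes can be illustrated with a minimal bounded-confidence update (our sketch; the paper's game additionally models LLM agents, resources, and penalties): agents only assimilate messages within an acceptance window, and a narrower window corresponds to stronger confirmation bias.

```python
# Deffuant-style bounded-confidence opinion update.
def update_opinion(opinion: float, message: float,
                   epsilon: float, mu: float = 0.5) -> float:
    """Move toward the message only if it lies within the acceptance
    window `epsilon` (smaller epsilon = stronger confirmation bias)."""
    if abs(opinion - message) <= epsilon:
        return opinion + mu * (message - opinion)
    return opinion  # rejected: outside the confirmation-bias window

strong_bias = update_opinion(0.9, 0.1, epsilon=0.1)  # distant debunk ignored
weak_bias = update_opinion(0.9, 0.1, epsilon=1.0)    # opinion moves halfway
```

Under strong bias a distant debunking message has no effect, so opinions fragment into clusters that never interact; under weak bias the same message shifts the opinion substantially, aligning groups internally while still allowing polarisation between them.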
With the rapid advancement of Large Language Models (LLMs), LLM-based autonomous agents have shown the potential to function as digital employees, such as digital analysts, teachers, and programmers. In this paper, we develop an application-level testbed based on the open-source strategy game "Unciv", which has millions of active players, to enable researchers to build a "data flywheel" for studying human-like agents in the "digital players" task. This "Civilization"-like game features expansive decision-making spaces along with rich linguistic interactions such as diplomatic negotiations and acts of deception, posing significant challenges for LLM-based agents in terms of numerical reasoning and long-term planning. Another challenge for "digital players" is to generate human-like responses for social interaction, collaboration, and negotiation with human players. The open-source project can be found at https://github.com/fuxiAIlab/CivAgent.
With the rapid advancement of large language models (LLMs), recent years have witnessed many promising studies on leveraging LLM-based agents to simulate human social behavior. While prior work has demonstrated significant potential across various domains, much of it has focused on specific scenarios involving a limited number of agents and has lacked the ability to adapt when errors occur during simulation. To overcome these limitations, we propose a novel LLM-agent-based simulation platform called GenSim, which: (1) abstracts a set of general functions to simplify the simulation of customized social scenarios; (2) supports one hundred thousand agents to better simulate large-scale populations in real-world contexts; (3) incorporates error-correction mechanisms to ensure more reliable and long-term simulations. To evaluate our platform, we assess both the efficiency of large-scale agent simulations and the effectiveness of the error-correction mechanisms. To our knowledge, GenSim represents an initial step toward a general, large-scale, and correctable social simulation platform based on LLM agents, promising to further advance the field of social science.
In this work, we describe our approach to developing an intelligent and robust social robotic system for the Nadine social robot platform. We achieve this by integrating large language models (LLMs) and skillfully leveraging the powerful reasoning and instruction-following capabilities of these types of models to achieve advanced human-like affective and cognitive capabilities. This approach is novel compared to current state-of-the-art LLM-based agents, which do not implement human-like long-term memory or sophisticated emotional capabilities. We built a social robot system that generates appropriate behaviors through multimodal input processing, brings up episodic memories matched to the recognized user, and simulates the emotional states of the robot induced by the interaction with the human partner. In particular, we introduce an LLM-agent framework for social robots, social robotics reasoning and acting, which serves as a core component of the interaction module in our system. This design advances social robots and aims to increase the quality of human-robot interaction.
In the digital era, the rapid propagation of fake news and rumors via social networks brings notable societal challenges and impacts public opinion regulation. Traditional fake news modeling typically forecasts the general popularity trends of different groups or numerically represents opinion shifts. However, these methods often oversimplify real-world complexities and overlook the rich semantic information of news text. The advent of large language models (LLMs) makes it possible to model the subtle dynamics of opinion. Consequently, in this work, we introduce a Fake news Propagation Simulation framework (FPS) based on LLMs, which studies the trends and control of fake news propagation in detail. Specifically, each agent in the simulation represents an individual with a distinct personality. They are equipped with both short-term and long-term memory, as well as a reflective mechanism to mimic human-like thinking. Every day, they engage in random opinion exchanges, reflect on their thinking, and update their opinions. Our simulation results uncover patterns in fake news propagation related to topic relevance and individual traits, aligning with real-world observations. Additionally, we evaluate various intervention strategies and demonstrate that early and appropriately frequent interventions strike a balance between governance cost and effectiveness, offering valuable insights for practical applications. Our study underscores the significant utility and potential of LLMs in combating fake news.
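The intervention-timing finding can be illustrated with a mean-field contagion toy (our drastic simplification; FPS itself simulates individual LLM agents with memory and reflection, and all parameters here are assumptions): believers convert the undecided until a debunking intervention starts pulling them back.

```python
# Mean-field toy of rumor spread with a debunking intervention.
def simulate(days: int, intervention_day: int,
             beta: float = 0.4, gamma: float = 0.5) -> float:
    """Fraction of believers after `days`, with debunking from `intervention_day`."""
    believers = 0.01  # initial fraction believing the rumor
    for day in range(days):
        growth = beta * believers * (1.0 - believers)              # contagion
        decay = gamma * believers if day >= intervention_day else 0.0
        believers = min(max(believers + growth - decay, 0.0), 1.0)
    return believers

early = simulate(days=60, intervention_day=5)
late = simulate(days=60, intervention_day=55)
# Intervening early all but extinguishes the rumor; intervening late leaves
# a sizeable fraction of believers despite the same per-day debunking effort.
```

Because spread is multiplicative, the cost of delay compounds: once most of the population believes the rumor, even a debunking rate larger than the contagion rate needs many days to undo it, which is consistent with the abstract's case for early intervention.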
We describe an approach for aligning an LLM-based dialogue agent for long-term social dialogue, where only a single global score is given by the user at the end of the session. In this paper, we propose using denser, naturally occurring multimodal communicative signals as local implicit feedback to improve turn-level utterance generation. Our approach (dubbed GELI) learns a local, turn-level reward model by decomposing the human-provided Global Explicit (GE) session-level reward, using Local Implicit (LI) multimodal reward signals to crossmodally shape the reward decomposition step. This decomposed reward model is then used as part of the RLHF pipeline to improve an LLM-based dialogue agent. We run quantitative and qualitative human studies on two large-scale datasets to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval-augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues in personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on average, across up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements, but these models still substantially lag behind human performance.
Maintaining the long-term sustainability of recommender systems (RS) is crucial. Traditional RS evaluation methods primarily focus on the user's immediate feedback (e.g., clicks); however, they often overlook the long-term effects introduced by content creators. In the real world, content creators can strategically create and upload new items to the platform by analyzing users' feedback and preference trends. Although previous studies have attempted to model creator behaviors, they often overlook that such behaviors occur under conditions of information asymmetry. This asymmetry arises because creators mainly access the user feedback on the items they produce, while the platform has access to the full spectrum of feedback data. However, existing RS simulators often fail to consider such a condition, making long-term RS evaluation inaccurate. To bridge this gap, we propose a Large Language Model (LLM)-empowered creator simulation agent named CreAgent. By utilizing the belief mechanism from game theory and the fast-and-slow thinking framework, we can simulate creators' behaviors well under information asymmetry. Furthermore, to enhance CreAgent's simulation ability, we utilize Proximal Policy Optimization to fine-tune CreAgent. Our credibility validation experiments demonstrate that our simulation environment effectively aligns with the behaviors of real-world platforms and creators, thereby enhancing the reliability of long-term evaluations in RS. Furthermore, leveraging this simulator, we can examine whether RS algorithms, such as fairness- and diversity-aware methods, contribute to improving long-term performance for different stakeholders.
Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.
A growing body of multi-agent studies with LLMs explores how norms and cooperation emerge in mixed-motive scenarios, where pursuing individual gain can undermine the collective good. While prior work has explored these dynamics in both richly contextualized simulations and simplified game-theoretic environments, most LLM systems featuring common-pool resource (CPR) games provide agents with explicit reward functions directly tied to their actions. In contrast, human cooperation often emerges without explicit knowledge of the payoff structure or how individual actions translate into long-run outcomes, relying instead on heuristics, communication, and enforcement. We introduce a CPR simulation framework that removes explicit reward signals and embeds cultural-evolutionary mechanisms: social learning (adopting strategies and beliefs from successful peers) and norm-based punishment, grounded in Ostrom's principles of resource governance. Agents also individually learn from the consequences of harvesting, monitoring, and punishing via environmental feedback, enabling norms to emerge endogenously. We establish the validity of our simulation by reproducing key findings from existing studies on human behavior. Building on this, we examine norm evolution across a 2×2 grid of environmental and social initialisations (resource-rich vs. resource-scarce; altruistic vs. selfish) and benchmark how agentic societies composed of different LLMs perform under these conditions. Our results reveal systematic model differences in sustaining cooperation and norm formation, positioning the framework as a rigorous testbed for studying emergent norms in mixed-motive LLM societies. Such analysis can inform the design of AI systems deployed in social and organizational contexts, where alignment with cooperative norms is critical for stability, fairness, and effective governance of AI-mediated environments.
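As a rough illustration of the mechanisms the abstract names (social learning plus norm-based punishment in a CPR game without explicit reward functions), here is a minimal numeric sketch. All constants (`regen`, `norm_cap`, `fine`) and the imitate-the-richest rule are assumptions for illustration, not the paper's actual implementation.

```python
import random

def simulate_cpr(n_agents=8, rounds=40, regen=1.15, capacity=100.0,
                 norm_cap=3.0, fine=2.0, seed=1):
    """Toy common-pool resource game: agents are never shown a reward
    function; they imitate the most successful peer (social learning),
    and over-harvesters are fined by peers (norm-based punishment)."""
    rng = random.Random(seed)
    stock = capacity
    rates = [rng.uniform(1.0, 6.0) for _ in range(n_agents)]  # harvest rates
    wealth = [0.0] * n_agents
    for _ in range(rounds):
        for i in range(n_agents):
            take = min(rates[i], stock / n_agents)  # bounded by the pool
            stock -= take
            wealth[i] += take
            if rates[i] > norm_cap:      # peers punish norm violators
                wealth[i] -= fine
        stock = min(capacity, stock * regen)  # the resource regenerates
        # Social learning: copy the richest peer's rate with small mutation.
        best = max(range(n_agents), key=lambda j: wealth[j])
        rates = [max(0.0, rates[best] + rng.gauss(0, 0.2))
                 for _ in range(n_agents)]
    return stock, rates

stock, rates = simulate_cpr()
```

Varying the initial stock and the initial rate distribution reproduces the paper's 2×2 grid of environmental and social initialisations in miniature.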
As autonomous agents become more prevalent, understanding their collective behaviour in strategic interactions is crucial. This study investigates the emergent cooperative tendencies of systems of Large Language Model (LLM) agents in a social dilemma. Unlike previous research where LLMs output individual actions, we prompt state-of-the-art LLMs to generate complete strategies for iterated Prisoner's Dilemma. Using evolutionary game theory, we simulate populations of agents with different strategic dispositions (aggressive, cooperative, or neutral) and observe their evolutionary dynamics. Our findings reveal that different LLMs exhibit distinct biases affecting the relative success of aggressive versus cooperative strategies. This research provides insights into the potential long-term behaviour of systems of deployed LLM-based autonomous agents and highlights the importance of carefully considering the strategic environments in which they operate.
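The study's evolutionary setup can be sketched concretely: strategies play iterated Prisoner's Dilemma against each other, and population shares evolve under discrete replicator dynamics. The two fixed strategies below stand in for LLM-generated ones, which the paper elicits from models; everything else follows the standard textbook construction.

```python
def ipd_payoffs(strat_a, strat_b, rounds=50):
    """Average per-round payoffs for two iterated Prisoner's Dilemma
    strategies; each strategy maps the opponent's last move to C or D."""
    R, S, T, P = 3, 0, 5, 1  # standard PD payoff values
    a_hist, b_hist = "C", "C"
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strat_a(b_hist), strat_b(a_hist)
        if a == "C" and b == "C":
            score_a += R; score_b += R
        elif a == "C" and b == "D":
            score_a += S; score_b += T
        elif a == "D" and b == "C":
            score_a += T; score_b += S
        else:
            score_a += P; score_b += P
        a_hist, b_hist = a, b
    return score_a / rounds, score_b / rounds

def tit_for_tat(last):   # cooperative disposition
    return last

def always_defect(last):  # aggressive disposition
    return "D"

def replicator_step(shares, strats):
    # Discrete replicator dynamics: a strategy's share grows in
    # proportion to its average payoff against the current mix.
    fit = [sum(shares[j] * ipd_payoffs(strats[i], strats[j])[0]
               for j in range(len(strats))) for i in range(len(strats))]
    mean = sum(s * f for s, f in zip(shares, fit))
    return [s * f / mean for s, f in zip(shares, fit)]

shares, strats = [0.5, 0.5], [tit_for_tat, always_defect]
for _ in range(30):
    shares = replicator_step(shares, strats)
```

Here the reciprocal strategy takes over the population; the paper's point is that which disposition wins depends on the strategies a given LLM tends to generate.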
Most dating technologies optimize for getting together, not staying together. We present RELATE-Sim, a theory-grounded simulator that models how couples behave at consequential turning points (exclusivity talks, conflict-and-repair episodes, relocations) rather than via static traits. Two persona-aligned LLM agents (one per partner) interact under a centralized Scene Master that frames each turning point as a compact set of realistic options, advances the narrative, and infers interpretable state changes and an auditable commitment estimate after each scene. On a longitudinal dataset of 71 couples with two-year follow-ups, simulation-aware predictions outperform a personas-only baseline while surfacing actionable markers (e.g., repair attempts acknowledged, clarity shifts) that explain why trajectories diverge. RELATE-Sim shifts relationship research's focus from matchmaking to maintenance, providing a transparent, extensible platform for understanding and forecasting long-term relationship dynamics.
Can generative agents be trusted in multimodal environments? Despite advances in large language and vision-language models that enable agents to act autonomously and pursue goals in rich settings, their ability to reason about safety, coherence, and trust across modalities remains limited. We introduce a reproducible simulation framework for evaluating agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in text-visual scenarios; (2) detection of unsafe activities across multiple categories of social situations; and (3) social dynamics, measured as interaction counts and acceptance ratios of social exchanges. Agents are equipped with layered memory, dynamic planning, multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks. Experiments show that while agents can detect direct multimodal contradictions, they often fail to align local revisions with global safety, reaching only a 55 percent success rate in correcting unsafe plans. Across eight simulation runs with three models - Claude, GPT-4o mini, and Qwen-VL - five agents achieved average unsafe-to-safe conversion rates of 75, 55, and 58 percent, respectively. Overall performance ranged from 20 percent in multi-risk scenarios with GPT-4o mini to 98 percent in localized contexts such as fire/heat with Claude. Notably, 45 percent of unsafe actions were accepted when paired with misleading visuals, showing a strong tendency to overtrust images. These findings expose critical limitations in current architectures and provide a reproducible platform for studying multimodal safety, coherence, and social dynamics.
Social governance scenarios often involve multiple modalities of information, and effectively utilizing multimodal information to understand the situational context is an important issue. Building on situation perception, simulating how a situation develops is also beneficial for selecting appropriate response strategies. This paper takes the handling of public petitions in social governance scenarios as an example, integrating multimodal information such as event descriptions, locations, times, and images for event clustering. Based on the event clustering, we construct LLM prompts and perform situational understanding through multi-layer perceptrons. Additionally, we propose an LLM-based dual-agent method for Bilateral Collaborative Scheduling Simulation, with the first LLM representing the petitioners and the second LLM representing the civil servants. The LLM playing the role of the civil servants cooperates with various departments according to an operation set, with the goal of satisfying the petitioners. Through the interaction between these entities, we can evaluate potentially optimal strategies. This paper constructs a real-world dataset of public petitions and validates the proposed method. We verify the effectiveness of the method through both quantitative and qualitative analyses.
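The dual-agent scheduling loop can be sketched with stubs in place of the two LLM roles. The satisfaction criteria, operation names, and departments below are hypothetical; the point is only the interaction shape: the civil-servant agent dispatches operations, the petitioner agent judges the state, and strategies are compared by how quickly they reach satisfaction.

```python
def petitioner_reply(state):
    # Stub for the petitioner-role LLM: satisfied once the complaint
    # has been inspected and compensated (hypothetical criteria).
    done = state["actions_done"]
    return "satisfied" if {"inspect", "compensate"} <= done else "unsatisfied"

def civil_servant_step(state, operation_set):
    # Stub for the civil-servant-role LLM: dispatches the next pending
    # operation from the strategy's operation set to its department.
    for op, dept in operation_set:
        if op not in state["actions_done"]:
            state["actions_done"].add(op)
            state["log"].append(f"{dept}: {op}")
            return

def run_simulation(operation_set, max_steps=10):
    # Interaction loop between the two roles; fewer steps to
    # satisfaction indicates a better scheduling strategy.
    state = {"actions_done": set(), "log": []}
    for step in range(1, max_steps + 1):
        civil_servant_step(state, operation_set)
        if petitioner_reply(state) == "satisfied":
            return step, state["log"]
    return max_steps, state["log"]

# Two candidate strategies, compared by steps until satisfaction.
direct = [("inspect", "inspection dept"), ("compensate", "finance dept")]
detour = [("mediate", "community office")] + direct
steps_direct, _ = run_simulation(direct)
steps_detour, _ = run_simulation(detour)
```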
Traditional agent-based models (ABMs) of opinion dynamics often fail to capture the psychological heterogeneity driving online polarization due to simplistic homogeneity assumptions. This limitation obscures the critical interplay between individual cognitive biases and information propagation, thereby hindering a mechanistic understanding of how ideological divides are amplified. To address this challenge, we introduce the Personality-Refracted Intelligent Simulation Model (PRISM), a hybrid framework coupling stochastic differential equations (SDE) for continuous emotional evolution with a personality-conditional partially observable Markov decision process (PC-POMDP) for discrete decision-making. In contrast to continuous trait approaches, PRISM assigns distinct Myers-Briggs Type Indicator (MBTI) based cognitive policies to multimodal large language model (MLLM) agents, initialized via data-driven priors from large-scale social media datasets. PRISM achieves superior personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks. This framework effectively replicates emergent phenomena such as rational suppression and affective resonance, offering a robust tool for analyzing complex social media ecosystems.
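PRISM's continuous emotional dynamics are specified as a stochastic differential equation; one plausible minimal form is a mean-reverting (Ornstein-Uhlenbeck) process integrated with the Euler-Maruyama scheme, as sketched below. The coefficients and the OU form itself are illustrative assumptions, not the paper's actual SDE.

```python
import math
import random

def simulate_emotion(x0=0.0, mu=0.0, theta=0.8, sigma=0.3,
                     dt=0.01, steps=5000, seed=42):
    """Euler-Maruyama integration of dX = theta*(mu - X) dt + sigma dW,
    a mean-reverting SDE standing in for an agent's continuous
    emotional state between discrete (PC-POMDP) decisions."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))      # Brownian increment
        x += theta * (mu - x) * dt + sigma * dw  # drift + diffusion
        path.append(x)
    return path

# An agitated agent (x0 = 2.0) relaxes toward its emotional baseline.
path = simulate_emotion(x0=2.0)
```

In a hybrid model like PRISM, the discrete decision process would read this continuous state at each decision point; here we only show the continuous half.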
Multimodal controversy detection, which involves determining whether a given video and its associated comments are controversial, plays a pivotal role in risk management on social video platforms. Existing methods typically provide only classification results, failing to identify what aspects are controversial and why, thereby lacking detailed explanations. To address this limitation, we propose a novel Agent-based Multimodal Controversy Detection architecture, termed AgentMCD. This architecture leverages Large Language Models (LLMs) as generative agents to simulate human behavior and improve explainability. AgentMCD employs a multi-aspect reasoning process, where multiple judges conduct evaluations from diverse perspectives to derive a final decision. Furthermore, a multi-agent simulation process is incorporated, wherein agents act as audiences, offering opinions and engaging in free discussions after watching videos. This hybrid framework enables comprehensive controversy evaluation and significantly enhances explainability. Experiments conducted on the MMCD dataset demonstrate that our proposed architecture outperforms existing LLM-based baselines in both high-resource and low-resource comment scenarios, while maintaining superior explainability.
This study explores the complex relationship between perceptual and cognitive interactions in multimodal data analysis, with a specific emphasis on spatial experience design in overseas Chinese gardens. We find that evaluation content and images on social media can reflect individuals' concerns and sentiment responses, providing a rich database for cognitive research that contains both sentimental and image-based cognitive information. Leveraging deep learning techniques, we analyze textual and visual data from social media, thereby unveiling the relationship between people's perceptions and sentiment cognition within the context of overseas Chinese gardens. In addition, our study introduces a multi-agent system (MAS) alongside AI agents; each agent explores the laws of aesthetic cognition through chat-scene simulation combined with web search. This study goes beyond the traditional approach of translating perceptions into sentiment scores, extending the research methodology to directly analyze texts and dig deeper into opinion data. It provides new perspectives for understanding aesthetic experience and its impact on architecture and landscape design across diverse cultural contexts, an essential contribution to the fields of cultural communication and aesthetic understanding.
As artificial intelligence (AI) rapidly advances, especially in multimodal large language models (MLLMs), research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action feedback rather than conventionally labeled datasets. Yet, most existing simulation platforms remain narrowly designed, each tailored to specific tasks. A versatile, general-purpose training environment that can support everything from low-level embodied navigation to high-level composite activities, such as multi-agent social simulation and human-AI collaboration, remains largely unavailable. To bridge this gap, we introduce TongSIM, a high-fidelity, general-purpose platform for training and evaluating embodied agents. TongSIM offers practical advantages by providing over 100 diverse, multi-room indoor scenarios as well as an open-ended, interaction-rich outdoor town simulation, ensuring broad applicability across research needs. Its comprehensive evaluation framework and benchmarks enable precise assessment of agent capabilities, such as perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning. With features like customized scenes, task-adaptive fidelity, diverse agent types, and dynamic environmental simulation, TongSIM delivers flexibility and scalability for researchers, serving as a unified platform that accelerates training, evaluation, and advancement toward general embodied intelligence.
ChatGPT, the AI-powered chatbot with a massive user base of hundreds of millions, has become a global phenomenon. However, the use of Conversational AI Systems (CAISs) like ChatGPT for research in the field of Social Simulation is still limited. Specifically, there is no evidence of its usage in Agent-Based Social Simulation (ABSS) model design. This paper takes a crucial first step toward exploring the untapped potential of this emerging technology in the context of ABSS model design. The research presented here demonstrates how CAISs can facilitate the development of innovative conceptual ABSS models in a concise timeframe and with minimal required upfront case-based knowledge. By employing advanced prompt engineering techniques and adhering to the Engineering ABSS framework, we have constructed a comprehensive prompt script that enables the design of conceptual ABSS models with or by the CAIS. A proof-of-concept application of the prompt script, used to generate the conceptual ABSS model for a case study on the impact of adaptive architecture in a museum environment, illustrates the practicality of the approach. Despite occasional inaccuracies and conversational divergence, the CAIS proved to be a valuable companion for ABSS modellers.
Accurately modeling and simulating complex human mobility is pivotal for evidence-based socioeconomic planning, yet remains under-explored in the era of Large Language Models (LLMs). We introduce the Return Migration Simulation (RMS) task, which focuses on predicting individual decisions to move from urban back to rural regions—a process critical for understanding urban–rural dynamics and formulating balanced development policies. The key to the RMS task lies in the in-depth reasoning over multimodal features to capture human intention and predict the individual decision. To this end, we present RMS-Agent, an LLM-powered agent endowed with latent reasoning capability. RMS-Agent first encodes multimodal features through the heterogeneous data tokenizer, where we specifically design a tabular tokenizer to convert structured table features into dense vectors compatible with the LLM. To achieve comprehensive and in-depth reasoning, we propose using multiple meta-queries to probe the LLM to reason and uncover latent intention and predict migration decision. Extensive experiments on three real-world datasets demonstrate that RMS-Agent significantly outperforms competitive machine-learning and deep-learning baselines across accuracy, F1, and AUC metrics, verifying its capacity to capture nuanced migration drivers. To summarize, this work (i) formulates a novel return migration simulation task, (ii) proposes a generalizable LLM-based agent architecture for multimodal latent reasoning, and (iii) provides a comprehensive benchmark with substantial empirical exploration for this socially significant problem, laying the groundwork for richer human-mobility modeling with LLMs in the future.
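The abstract's tabular tokenizer (structured table features converted to dense vectors compatible with an LLM) can be sketched as follows. This is an illustrative stand-in, not RMS-Agent's design: the fixed random projections replace learned embedding weights, and the schema, column names, and hashing trick are all hypothetical.

```python
import numpy as np

def tabular_tokenize(row, schema, dim=8, seed=0):
    """Toy tabular tokenizer: embed each structured column as a dense
    vector so a table row becomes a short token sequence an LLM could
    attend to. Random projections stand in for learned weights."""
    rng = np.random.default_rng(seed)
    tokens = []
    for col, kind in schema.items():
        w = rng.normal(size=(dim,))  # fixed per-column projection
        if kind == "numeric":
            tokens.append(w * float(row[col]))  # scale by the value
        else:
            # Categorical: hash the value into a pseudo-embedding.
            h = abs(hash((col, row[col]))) % (2**31)
            tokens.append(np.random.default_rng(h).normal(size=(dim,)))
    return np.stack(tokens)  # shape: (n_columns, dim)

schema = {"age": "numeric", "income": "numeric", "hometown": "categorical"}
row = {"age": 34, "income": 5200, "hometown": "rural_A"}
toks = tabular_tokenize(row, schema)
```

In the paper's pipeline, vectors like these would be fed to the LLM together with the meta-queries that probe for latent migration intention.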
The escalating frequency and complexity of natural disasters highlight the urgent need for deeper insights into how individuals and communities perceive and respond to risk information. Yet, conventional research methods—such as surveys, laboratory experiments, and field observations—often struggle with limited sample sizes, external validity concerns, and difficulties in controlling for confounding variables. These constraints hinder our ability to develop comprehensive models that capture the dynamic, context-sensitive nature of disaster decision-making. To address these challenges, we present a novel multi-stage simulation framework that integrates Large Language Model (LLM)-driven social–cognitive agents with well-established theoretical perspectives from psychology, sociology, and decision science. This framework enables the simulation of three critical phases—information perception, cognitive processing, and decision-making—providing a granular analysis of how demographic attributes, situational factors, and social influences interact to shape behavior under uncertain and evolving disaster conditions. A case study focusing on pre-disaster preventive measures demonstrates its effectiveness. By aligning agent demographics with real-world survey data across 5864 simulated scenarios, we reveal nuanced behavioral patterns closely mirroring human responses, underscoring the potential to overcome longstanding methodological limitations and offer improved ecological validity and flexibility to explore diverse disaster environments and policy interventions. While acknowledging the current constraints, such as the need for enhanced emotional modeling and multimodal inputs, our framework lays a foundation for more nuanced, empirically grounded analyses of risk perception and response patterns. By seamlessly blending theory, advanced LLM capabilities, and empirical alignment strategies, this research not only advances the state of computational social simulation but also provides valuable guidance for developing more context-sensitive and targeted disaster management strategies.
Evaluating large language models (LLMs) in clinical scenarios is crucial to assessing their potential clinical utility. Existing benchmarks rely heavily on static question-answering, which does not accurately depict the complex, sequential nature of clinical decision-making. Here, we introduce AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments that include patient interactions, multimodal data collection under incomplete information, and the usage of various tools, resulting in an in-depth evaluation across nine medical specialties and seven languages. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Overall, we observe that agents sourced from Claude-3.5 outperform other LLM backbones in most settings. Nevertheless, we see stark differences in the LLMs' ability to make use of tools, such as experiential learning, adaptive retrieval, and reflection cycles. Strikingly, Llama-3 shows up to 92% relative improvements with the notebook tool that allows for writing and editing notes that persist across cases. To further scrutinize our clinical simulations, we leverage real-world electronic health records, perform a clinical reader study, perturb agents with biases, and explore novel patient-centric metrics that this interactive environment enables for the first time.
As artificial agents develop beyond mere tools and begin to perform roles traditionally associated with humans, expectations of their performance are evolving in kind. Not only must agents be able to accomplish their tasks, but they must also do so in a manner that observers would consider socially or contextually appropriate. For social interaction where the agent and human are co-performers, adherence to social cues that signal emergent aspects of a relationship such as intimacy or status is paramount to the experience of the interacting humans. For autonomous agents that function alone, adaptive behavioral modeling and user state awareness are critical to the impact of the agent's actions on humans. Such contextual social behavior is a requirement for complex applications including physically located social robots, virtual avatars emerging in gaming, online social environments, or customer service interactions, and proactive virtual assistants. Humans have sophisticated socio-emotional capacities that enable them to behaviorally coordinate their interactions with others, inferring mental states that may lie far beyond explicit observable cues. Furthermore, emotional expressions are multimodal and are the result of a complex interaction between inherent affective states and interaction context. The Human Centered Intelligent Systems conceptual framework describes a pathway whereby artificial agents may also achieve aspects of this intelligence through rich user state modeling based on deep multimodal analysis of big data that can capture social behavior and interaction context. In this chapter, we describe this "user-state" modeling approach and exemplify its applicability to a spectrum of agent applications.
Accurately predicting the popularity of user-generated content (UGC) is essential for advancing social media analytics and recommendation systems. Existing approaches typically follow an inductive paradigm, where researchers train static models on historical data for popularity prediction. However, UGC propagation is inherently a dynamic process, and static modeling based on historical features fails to capture the complex interactions and nonlinear evolution. In this paper, we propose PopSim, a novel simulation-based paradigm for social media popularity prediction (SMPP). Unlike the inductive paradigm, PopSim leverages a large language model (LLM)-based multi-agent social network sandbox to simulate UGC propagation dynamics for popularity prediction. Specifically, to effectively model the UGC propagation process in the network, we design a social-mean-field-based agent interaction mechanism, which models the dual-channel and bidirectional individual-population interactions, enhancing agents' global perception and decision-making capabilities. In addition, we propose a multi-source information aggregation module that transforms heterogeneous social metadata into a uniform formulation for LLMs. Finally, propagation dynamics with multimodal information are fused to provide comprehensive popularity prediction. Extensive experiments on real-world datasets demonstrate that PopSim consistently outperforms state-of-the-art methods, reducing prediction error by an average of 8.82% and offering a new perspective for research on the SMPP task.
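The social-mean-field idea (each individual's next state depends on both its own state and an attention-weighted population aggregate) admits a very small numeric sketch. The update rule, `alpha`, and the attention weights below are illustrative assumptions, not PopSim's actual mechanism.

```python
def mean_field_step(engagements, attention, alpha=0.6):
    """One toy social-mean-field update: each agent blends its own
    engagement with the attention-weighted population mean field,
    capturing the individual-population dual channel."""
    field = sum(a * e for a, e in zip(attention, engagements)) / sum(attention)
    return [alpha * e + (1 - alpha) * field for e in engagements]

engagements = [0.9, 0.1, 0.4, 0.2]   # per-agent engagement with a post
attention = [2.0, 1.0, 1.0, 1.0]     # influence weights in the field
for _ in range(50):
    engagements = mean_field_step(engagements, attention)
```

Because each step contracts every agent toward the (invariant) weighted mean, the population converges to a consensus engagement level; in a full simulator that aggregate would feed the popularity estimate.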
The rise of social media has fundamentally transformed how people engage in public discourse and form opinions. While these platforms offer unprecedented opportunities for democratic engagement, they have been implicated in increasing social polarization and the formation of ideological echo chambers. Previous research has primarily relied on observational studies of social media data or theoretical modeling approaches, leaving a significant gap in our understanding of how individuals respond to and are influenced by polarized online environments. Here we present a novel experimental framework for investigating polarization dynamics that allows human users to interact with LLM-based artificial agents in a controlled social network simulation. Through a user study with 122 participants, we demonstrate that this approach can successfully reproduce key characteristics of polarized online discourse while enabling precise manipulation of environmental factors. Our results provide empirical validation of theoretical predictions about online polarization, showing that polarized environments significantly increase perceived emotionality and group identity salience while reducing expressed uncertainty. These findings extend previous observational and theoretical work by providing causal evidence for how specific features of online environments influence user perceptions and behaviors. More broadly, this research introduces a powerful new methodology for studying social media dynamics, offering researchers unprecedented control over experimental conditions while maintaining ecological validity.
With the rapid development of the new power system, the complexity of distribution network operation has put forward higher requirements for the real-time performance, accuracy, and intelligence of production command. Traditional decision-making systems face bottlenecks in heterogeneous data fusion, human–computer interaction efficiency, and the scientific rigor of decision-making. To address these challenges, this paper proposes an intelligent decision-making system for distribution network production enhanced by large language models (LLMs). The core contributions of this system are as follows: (1) A three-layer heterogeneous intelligent architecture integrating perception, cognition, and execution is constructed to realize the complete process from multimodal data input to closed-loop control; (2) An LLM-driven multimodal fusion and interaction mechanism is designed to uniformly encode unstructured information such as SCADA time-series data, on-site images, and voice commands into high-dimensional semantic features, realizing comprehensive situation awareness (SA) and natural human–computer interaction; (3) A multi-agent collaborative decision-making framework based on an improved contract net protocol (CNP) is proposed. The command agent empowered by the LLM decomposes and schedules complex tasks, driving each professional agent to perform parallel optimization. The system is verified in a typical distribution network fault-handling scenario. The results show that compared with the traditional manual method, the proposed system reduces the end-to-end decision-making time from more than 30 min to within 4 min, a reduction of more than 85%; the comprehensive accuracy of its fault location and recovery strategy exceeds 98%, and the generated recovery strategy is significantly superior to the static rule-based expert system in terms of safety and economy. This research provides an innovative paradigm and technical path for the in-depth application of large-model technology in the field of power critical infrastructure.
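The contract net protocol at the heart of contribution (3) follows an announce/bid/award cycle, which can be sketched minimally as below. The agent names, skills, and cost figures are hypothetical, and this plain CNP omits whatever improvements the paper adds.

```python
def contract_net_allocate(tasks, agents):
    """Minimal contract-net round: the command agent announces each
    task, capable specialist agents bid their estimated cost, and the
    cheapest bidder is awarded the contract."""
    awards = {}
    for task, skill in tasks.items():
        bids = {name: costs[skill] for name, costs in agents.items()
                if skill in costs}          # only capable agents bid
        if bids:
            awards[task] = min(bids, key=bids.get)  # award lowest bid
    return awards

# Hypothetical specialist agents and their per-skill cost estimates.
agents = {
    "fault_locator":  {"locate": 2.0, "isolate": 5.0},
    "switch_crew":    {"isolate": 3.0, "restore": 4.0},
    "dispatch_agent": {"restore": 2.5},
}
tasks = {"find_fault": "locate", "isolate_feeder": "isolate",
         "restore_load": "restore"}
awards = contract_net_allocate(tasks, agents)
```

In the paper's framework an LLM-empowered command agent would first decompose a fault-handling goal into such tasks before this allocation step runs in parallel.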
Multimodal hate speech detection targets offensive content expressed through combinations of modalities such as text and images, content that often evades detection when each modality is analyzed separately. We introduce MAXplain, an interactive framework that addresses these challenges via a configurable LLM-based multi-agent architecture. Specialized agents handle distinct subtasks and exchange information through structured dialogues, enabling intrinsic explainability and improved accuracy. The web interface supports human-in-the-loop interaction, including real-time adjustment of agent behaviors and evaluation rules. A browser plugin enables direct inspection of online content. While demonstrated for hate speech detection, MAXplain also supports rapid prototyping for other multimodal tasks.
The rise of Multi-Agent Systems (MAS) in Artificial Intelligence (AI), especially integrated with Large Language Models (LLMs), has greatly facilitated the resolution of complex tasks. However, current systems are still facing challenges of inter-agent communication, coordination, and interaction with heterogeneous tools and resources. Most recently, the Model Context Protocol (MCP) by Anthropic and Agent-to-Agent (A2A) communication protocol by Google have been introduced, and to the best of our knowledge, very few applications exist where both protocols are employed within a single MAS framework. We present a pilot study of AgentMaster, a novel modular multi-protocol MAS framework with self-implemented A2A and MCP, enabling dynamic coordination, flexible communication, and rapid development with faster iteration. Through a unified conversational interface, the system supports natural language interaction without prior technical expertise and responds to multimodal queries for tasks including information retrieval, question answering, and image analysis. The experiments are validated through both human evaluation and quantitative metrics, including BERTScore F1 (96.3%) and LLM-as-a-Judge G-Eval (87.1%). These results demonstrate robust automated inter-agent coordination, query decomposition, task allocation, dynamic routing, and domain-specific relevant responses. Overall, our proposed framework contributes to the potential capabilities of domain-specific, cooperative, and scalable conversational AI powered by MAS.
This research develops a novel virtual teacher personalized interaction model integrating multimodal affective computing with multi-agent coordination mechanisms to address fundamental limitations in emotional intelligence and adaptive capabilities within contemporary educational technology systems. A three-layer distributed architecture was implemented, incorporating synchronized multimodal emotion recognition through confidence-weighted fusion of facial, vocal, and textual data streams, Byzantine Fault Tolerant consensus algorithms for coordinated multi-agent decision-making, and dynamic personality adaptation mechanisms based on Big Five psychological modeling. Experimental validation employed 500 participants across diverse educational contexts using established emotion recognition benchmarks supplemented with domain-specific educational interaction datasets. The multimodal emotion fusion component achieved 91.2% recognition accuracy, with overall system performance reaching 89.7% under realistic educational conditions while demonstrating substantial educational effectiveness improvements, including 43% higher learner engagement scores, 37% emotional satisfaction enhancement, 30% learning effectiveness increase, and 40% knowledge retention improvement compared to traditional virtual teaching approaches. Multi-agent coordination exhibited superior decision quality with 31% improvement over single-agent baselines, though personality adaptation effectiveness varied significantly across learner populations, with 88% success rates for extraverted individuals compared to 65% for high-neuroticism learners. The integrated approach successfully bridges the emotional intelligence gap in virtual educational systems through sophisticated technological convergence, establishing theoretical foundations for distributed educational intelligence while revealing important implementation challenges. This research enables the development of emotionally responsive virtual teachers capable of sustained personalized instruction across diverse educational contexts, though deployment requires careful consideration of privacy protection and institutional adaptation requirements for broader educational technology transformation.
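The confidence-weighted fusion step can be sketched as follows: each modality contributes its class probabilities scaled by its own confidence. The modality names, confidence values, and emotion labels are illustrative; the system's actual trained fusion network is not reproduced here.

```python
def fuse_emotions(predictions):
    """Confidence-weighted fusion of per-modality emotion scores:
    every modality votes with its class probabilities, scaled by its
    normalized confidence, and the top fused class wins."""
    fused = {}
    total_conf = sum(p["conf"] for p in predictions.values())
    for p in predictions.values():
        weight = p["conf"] / total_conf
        for emotion, score in p["scores"].items():
            fused[emotion] = fused.get(emotion, 0.0) + weight * score
    return max(fused, key=fused.get), fused

# Hypothetical per-modality outputs for one learner interaction.
predictions = {
    "face":  {"conf": 0.9, "scores": {"frustrated": 0.7, "engaged": 0.3}},
    "voice": {"conf": 0.5, "scores": {"frustrated": 0.4, "engaged": 0.6}},
    "text":  {"conf": 0.6, "scores": {"frustrated": 0.2, "engaged": 0.8}},
}
label, fused = fuse_emotions(predictions)
```

Note that the high-confidence facial channel does not automatically win: because each per-modality distribution sums to one and the weights are normalized, the fused scores remain a proper distribution over emotions.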
Current AI interview systems face challenges in achieving natural interaction, conducting multidimensional assessments, and ensuring interpretability. This paper proposes and validates an intelligent evaluation framework integrating multi-agent collaboration with 3D virtual digital humans (VDH). The system enables complex question-answering and multidimensional evaluation through the coordinated operation of four functional agents: resume analysis, interview skills, written-test training, and job recommendation. Leveraging a 3D virtual digital human interviewer, the system supports multimodal fusion interaction encompassing voice, text, and facial expressions. It automatically generates interview questions, processes multimodal data, and produces multidimensional scoring alongside interpretable feedback reports. Experiments demonstrate that the system significantly enhances the immersion and naturalness of human-machine interaction while strengthening the objectivity, professionalism, and explainability of evaluations. It provides an innovative solution and theoretical foundation for intelligent talent assessment.
Couples therapy, or relationship counseling, helps partners resolve conflicts, improve satisfaction, and foster psychological growth. Traditional approaches to training couples therapists, such as textbooks and roleplay, often fail to capture the complexity and emotional nuance of real couple dynamics. We present a novel multimodal, multi-agent simulation system that models multi-party interactions in couples therapy. Informed by our systematic research, this system creates a low-stakes environment for trainee therapists to gain valuable practical experience dealing with the critical demand-withdraw communication cycle across six couple-interaction stages. In an evaluation study involving 21 US-based licensed therapists, participants blind to conditions identified the engineered agent behaviors (i.e., the stages and the demand-withdraw cycle) and rated overall realism and agent responses higher for the experimental system than the baseline. As the first known multi-agent framework for training couples therapists, our work builds the foundation for future research that fuses HCI technologies with couples therapy.
AI-based fitness coaching systems are typically monolithic and opaque, limiting adaptability, transparency, and embodied interaction. We propose a protocol-integrated Digital Twin (DT) architecture that reimagines fitness coaching as a distributed, explainable, and emotionally adaptive ecosystem. The framework adopts a CrewAI-inspired multi-agent design, where specialized agents for posture analysis, speech, physiological sensing, and personalized recommendation collaborate through the Agent-to-Agent (A2A) protocol to enable secure and interoperable task delegation. Context is maintained through short- and long-term memory modules, while the Model Context Protocol (MCP) supports flexible tool and model invocation across heterogeneous AI resources. Transparency and efficiency are ensured with NVIDIA AgentIQ and LangSmith, which provide token-level observability, workflow profiling, and trajectory evaluation. Real-time coaching feedback is synthesized into multimodal outputs (text, speech, and embodied avatars) using Audio2Face and Omniverse, creating expressive and emotionally engaging interactions. This paper presents a blueprint for protocol-driven, multimodal DTs in health and well-being. By combining interoperability, observability, and embodied feedback, it advances beyond centralized assistants toward distributed, memory-aware, and emotionally intelligent digital coaches, laying the foundation for next-generation human-AI collaboration in multimedia health applications.
We introduce DTAIFC, a modular Digital Twin AI Fitness Coaching system that delivers personalized feedback through multimodal interaction. The system combines OpenPose-based skeletal tracking with a Crew-inspired multi-agent architecture to analyze user posture and provide biomechanically grounded coaching in natural language and voice. At its core, an Orchestrator Agent coordinates Feedback and Recommendation Agents, leveraging short-term memory (Redis) for real-time session context and long-term memory (PostgreSQL) for user-specific historical insight. Language generation is powered by GPT-4, enabling adaptive, context-aware feedback through prompt-driven reasoning. DTAIFC operates asynchronously through a lightweight web interface, supporting input via static images, voice commands, and text queries. Unlike real-time systems that depend on continuous video or wearables, DTAIFC offers a scalable, privacy-conscious solution for intelligent fitness guidance in virtual environments. This framework establishes a new paradigm for memory-augmented, agentic AI coaching, advancing the integration of digital twins in human-centered applications.
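The short-/long-term memory split that DTAIFC describes can be sketched in a few lines. In this illustrative sketch, plain in-memory structures stand in for the paper's Redis (short-term) and PostgreSQL (long-term) stores; the class and method names are our own and not taken from the system:

```python
from collections import deque

class MemoryAugmentedCoach:
    """Toy orchestrator memory: a bounded recent-context buffer plus a
    persistent per-user history, mimicking the Redis/PostgreSQL split."""

    def __init__(self, short_term_window: int = 5):
        self.short_term = deque(maxlen=short_term_window)  # recent session turns
        self.long_term = {}  # user_id -> list of consolidated session logs

    def observe(self, event: str) -> None:
        # Oldest turns fall out automatically once the window is full.
        self.short_term.append(event)

    def end_session(self, user_id: str) -> None:
        # Consolidate the session into long-term memory, then clear context.
        self.long_term.setdefault(user_id, []).append(list(self.short_term))
        self.short_term.clear()

    def build_context(self, user_id: str) -> dict:
        # An orchestrator agent would fold this context into the LLM prompt.
        return {
            "recent": list(self.short_term),
            "history_sessions": len(self.long_term.get(user_id, [])),
        }
```

The bounded `deque` gives the "real-time session context" behavior cheaply; a production system would add retrieval over the long-term store rather than just counting sessions.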
Multi-agent systems (MAS) built on multimodal large language models exhibit strong collaboration and performance. However, their growing openness and interaction complexity pose serious risks, notably jailbreak and adversarial attacks. Existing defenses typically rely on external guard modules, such as dedicated safety agents, to handle unsafe behaviors. Unfortunately, this paradigm faces two challenges: (1) standalone agents offer limited protection, and (2) their independence leads to single-point failure: if compromised, system-wide safety collapses. Naively increasing the number of guard agents further raises cost and complexity. To address these challenges, we propose Evo-MARL, a novel multi-agent reinforcement learning (MARL) framework that enables all task agents to jointly acquire defensive capabilities. Rather than relying on external safety modules, Evo-MARL trains each agent to simultaneously perform its primary function and resist adversarial threats, ensuring robustness without increasing system overhead or single-node failure. Furthermore, Evo-MARL integrates evolutionary search with parameter-sharing reinforcement learning to co-evolve attackers and defenders. This adversarial training paradigm internalizes safety mechanisms and continually enhances MAS performance under co-evolving threats. Experiments show that Evo-MARL reduces attack success rates by up to 22% while boosting accuracy by up to 5% on reasoning tasks, demonstrating that safety and utility can be jointly improved.
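The attacker-defender co-evolution idea can be illustrated with a deliberately tiny loop. Here scalar "strengths" replace real policies, and every detail (truncation selection, mutation size, fitness as margin against the strongest opponent) is an invented stand-in for Evo-MARL's actual evolutionary-search-plus-RL procedure:

```python
import random

def co_evolve(generations: int = 20, pop: int = 8, seed: int = 0):
    """Toy attacker-defender co-evolution: both populations are repeatedly
    ranked against the other side's current best, truncated, and mutated."""
    rng = random.Random(seed)
    defenders = [rng.random() for _ in range(pop)]  # defensive strength in [0, 1]
    attackers = [rng.random() for _ in range(pop)]  # attack strength in [0, 1]
    for _ in range(generations):
        # Fitness: margin against the strongest current opponent, so the
        # two populations co-adapt rather than optimizing in isolation.
        defenders.sort(key=lambda d: d - max(attackers), reverse=True)
        attackers.sort(key=lambda a: a - max(defenders), reverse=True)
        # Truncation selection: keep the top half, refill with mutated copies.
        defenders = defenders[: pop // 2] + [
            min(1.0, d + rng.uniform(0.0, 0.1)) for d in defenders[: pop // 2]]
        attackers = attackers[: pop // 2] + [
            min(1.0, a + rng.uniform(0.0, 0.1)) for a in attackers[: pop // 2]]
    return max(defenders), max(attackers)
```

The point of the sketch is the structure of the loop, not the numbers: in the real framework the "mutation" step is reinforcement learning with shared parameters, and fitness is measured on tasks and attack outcomes.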
As cities evolve toward more complex and multimodal transportation systems, the need for human-centered multi-agent simulation tools has never been more urgent. Yet most existing platforms remain limited—they often separate different types of road users, rely on scripted or pre-defined behaviors, largely overlook public transit users as active, embodied participants, and are rarely designed with accessibility in mind for non-technical users. To address this gap, the following paper presents the detailed specifications of a multi-agent simulation platform designed to support real-time, human-centered, and immersive studies of all road users, accompanied by open-sourced scripts for replication. Using high fidelity immersive virtual environments, our platform enables interaction across public transit users, pedestrians, cyclists, automated vehicles, and drivers. The architecture is modular, extensible, and designed for accessibility. The system integrates hardware-specific modules—including an omnidirectional treadmill for pedestrians, a seating arrangement for public transit users, a smart trainer for cyclists, and an actuated cockpit for drivers. Additionally, the platform simultaneously collects multimodal physiological, neurological, and behavioral data through embedded sensing devices such as functional near-infrared spectroscopy (fNIRS) for brain activity, eye tracking, and wrist-based biosensors. To show the usability of this system, we present three use cases across various areas of road user research. Simulation for All aims to pave the way for lowering the barrier to entry for high-fidelity transportation simulation, support experimentation across disciplines, and advance our understanding of multimodal mobility in increasingly complex urban environments. The codebase for the platform is available at https://osf.io/6sa8q, enabling replication and adaptation for transportation research applications.
The integration of Large Language Models (LLMs) with specialized tools presents new opportunities for intelligent automation systems. However, orchestrating multiple LLM-driven agents to tackle complex tasks remains challenging due to coordination difficulties, inefficient resource utilization, and inconsistent information flow. We present OmniNova, a modular multi-agent automation framework that combines language models with specialized tools such as web search, crawling, and code execution capabilities. OmniNova introduces three key innovations: (1) a hierarchical multi-agent architecture with distinct coordinator, planner, supervisor, and specialist agents; (2) a dynamic task routing mechanism that optimizes agent deployment based on task complexity; and (3) a multi-layered LLM integration system that allocates appropriate models to different cognitive requirements. Our evaluations across 50 complex tasks in research, data analysis, and web interaction domains demonstrate that OmniNova outperforms existing frameworks in task completion rate (87\% vs. baseline 62\%), efficiency (41\% reduced token usage), and result quality (human evaluation score of 4.2/5 vs. baseline 3.1/5). We contribute both a theoretical framework for multi-agent system design and an open-source implementation that advances the state-of-the-art in LLM-based automation systems.
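OmniNova's dynamic task routing, allocating agents and model tiers by task complexity, can be sketched as a simple dispatch function. The tier names, thresholds, and agent lists below are hypothetical illustrations, not the framework's actual configuration:

```python
def route_task(task: str, complexity: float) -> dict:
    """Route a task to an agent pipeline and model tier based on an
    estimated complexity score in [0, 1] (thresholds are invented)."""
    if complexity < 0.3:
        # Simple lookups: a single specialist with a lightweight model.
        tier, agents = "light", ["specialist"]
    elif complexity < 0.7:
        # Moderate tasks: planning plus execution.
        tier, agents = "standard", ["planner", "specialist"]
    else:
        # Complex tasks: full hierarchy with supervision.
        tier, agents = "heavy", ["coordinator", "planner", "supervisor", "specialist"]
    return {"task": task, "model_tier": tier, "agents": agents}
```

This mirrors the abstract's claim that different cognitive requirements get different models: cheap models handle routine steps, and the full coordinator/planner/supervisor/specialist hierarchy is only spun up when needed.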
In recommender systems, online A/B testing is a crucial method for evaluating the performance of different models. However, conducting online A/B testing often presents significant challenges, including substantial economic costs, user experience degradation, and considerable time requirements. With the powerful capacity of Large Language Models, LLM-based agents show great potential to replace traditional online A/B testing. Nonetheless, current agents fail to simulate the perception process and interaction patterns, due to the lack of real environments and visual perception capability. To address these challenges, we introduce a multi-modal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing, enabling multimodal and multi-page interactions that align with real user behavior on online platforms. The designed agent leverages multimodal information perception, fine-grained user preferences, and integrates profiles, action memory retrieval, and a fatigue system to simulate complex human decision-making. We validated the potential of the agent as an alternative to traditional A/B testing from three perspectives: model, data, and features. Furthermore, we found that the data generated by A/B Agent can effectively enhance the capabilities of recommendation models. Our code is publicly available at https://github.com/Applied-Machine-Learning-Lab/ABAgent.
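A fatigue system of the kind the A/B Agent abstract mentions can be modeled minimally as decaying willingness to keep browsing. The exponential form, rates, and threshold below are our own toy formulation, not the paper's mechanism:

```python
import math

def continue_probability(items_viewed: int, base_interest: float = 0.9,
                         fatigue_rate: float = 0.15) -> float:
    """Toy fatigue model: interest in continuing the session decays
    exponentially with the number of items already viewed."""
    return base_interest * math.exp(-fatigue_rate * items_viewed)

def simulate_session(max_items: int = 50, threshold: float = 0.3) -> int:
    """The simulated user leaves once continue-probability drops below
    the threshold; returns the number of items viewed before leaving."""
    for n in range(max_items):
        if continue_probability(n) < threshold:
            return n
    return max_items
```

Coupling such a fatigue term with profile- and memory-driven preferences is what lets an agent population reproduce session-length attrition patterns that a preference-only simulator would miss.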
No abstract available
Designing effective human-robot interaction (HRI) for multi-Autonomous Guided Vehicle (AGV) systems in manufacturing remains a significant challenge. While existing modeling tools offer formal representations for system behavior or task flow, they lack support for explicitly modeling multimodal, bidirectional communication between humans and distributed autonomous agents. In this paper, we introduce HASIGN: Human-Autonomous System Interaction Graphical Notation in order to make the design space of human-multi-AGV interaction explicit and tractable. Developed through a Research through Design approach, HASIGN integrates agent roles, interaction modalities, and temporal intent communication into a unified and practical representation. We apply HASIGN to five diverse industrial case studies drawn from ongoing research and development projects. These cases demonstrate the notation's flexibility and domain suitability, while uncovering unexplored areas in the interaction design space. Rather than aiming to replace existing modeling tools, HASIGN complements them by focusing on human-centered communication in autonomous systems. This paper contributes a visual design tool for practitioners and researchers, and lays the foundation for further evaluation, standardization, and adoption in the context of human-centered autonomous intelligent systems (HCAIS).
Existing interactive learning systems usually train models on simulators as surrogates for real users. Due to the limited amount of user data, trained simulators may lead to biased results as they fail to faithfully represent real users. One solution is to model users as agents, and then simultaneously train the interactive system and user agents by multi-agent reinforcement learning (MARL) frameworks. However, developing efficient MARL frameworks for modern interactive multimodal systems is still challenging. First, given the existence of multimodal data, how to develop accurate multimodal fusion within and between agents in each interaction is challenging and unclear. Second, interactions between users and systems are complex and it is challenging to track and synchronize the interactions over time. The above multimodal fusion between agents and synchronization over time becomes even more challenging when the amount of user data is limited. To jointly address these challenges and achieve more sample-efficient learning, we propose a novel spatial-temporal aligned (STA) multi-agent reinforcement learning framework to better align the multimodal data within and between agents over time. Based on our framework, we develop sample-efficient visual dialog systems. Through extensive experiments and analysis, we validate the effectiveness of our STA multi-agent reinforcement learning framework in visual dialog systems.
To improve stock trend predictions and support personalized investment decisions, this paper proposes FinArena, a novel Human-Agent collaboration framework. Inspired by the mixture of experts (MoE) approach, FinArena combines multimodal financial data analysis with user interaction. The human module features an interactive interface that captures individual risk preferences, allowing personalized investment strategies. The machine module utilizes a Large Language Model-based (LLM-based) multi-agent system to integrate diverse data sources, such as stock prices, news articles, and financial statements. To address hallucinations in LLMs, FinArena employs an adaptive Retrieval-Augmented Generation (RAG) method for processing unstructured news data. Finally, a universal expert agent makes investment decisions based on the features extracted from multimodal data and investors' individual risk preferences. Extensive experiments show that FinArena surpasses both traditional and state-of-the-art benchmarks in stock trend prediction and yields promising results in trading simulations across various risk profiles. These findings highlight FinArena's potential to enhance investment outcomes by aligning strategic insights with personalized risk considerations.
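The decision step, a universal expert aggregating per-source signals under a risk preference, can be sketched as a risk-weighted vote. The signal names, weighting scheme, and thresholds are invented for illustration and are loosely MoE-flavored rather than FinArena's actual rule:

```python
def aggregate_experts(signals: dict, risk_tolerance: float) -> str:
    """Combine expert scores in [-1, 1] into a decision; risk tolerance
    shifts weight from slow, stable signals to fast, volatile ones."""
    weights = {
        "price_trend": 1.0,
        "news_sentiment": risk_tolerance,       # risk-seekers trust fast signals
        "fundamentals": 2.0 - risk_tolerance,   # risk-averse lean on fundamentals
    }
    score = sum(weights[k] * signals[k] for k in weights)
    if score > 0.5:
        return "buy"
    if score < -0.5:
        return "sell"
    return "hold"
```

The key design point the abstract suggests is that the same expert outputs yield different actions for different users, so personalization lives in the aggregation weights rather than in the experts themselves.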
Multimodal Question Answering (MMQA) is crucial as it enables comprehensive understanding and accurate responses by integrating insights from diverse data representations such as tables, charts, and text. Most existing research in MMQA focuses on only two modalities, such as image-text QA, table-text QA, and chart-text QA, and studies that investigate the joint analysis of text, tables, and charts remain notably scarce. In this paper, we present CT2C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts, meticulously compiled from 200 selectively sourced webpages. Our dataset simulates real webpages and serves as a strong test of a model's capability to analyze and reason with multimodal data, because the answer to a question could appear in various modalities, or even potentially not exist at all. Additionally, we present AED (Allocating, Expert and Decision), a multi-agent system implemented through collaborative deployment, information interaction, and collective decision-making among different agents. Specifically, the Assignment Agent is in charge of selecting and activating expert agents, including those proficient in text, tables, and charts. The Decision Agent bears the responsibility of delivering the final verdict, drawing upon the analytical insights provided by these expert agents. We execute a comprehensive analysis, comparing AED with various state-of-the-art models in MMQA, including GPT-4. The experimental outcomes demonstrate that current methodologies, including GPT-4, are yet to meet the benchmarks set by our dataset.
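The Allocating-Expert-Decision flow can be condensed into a small function. The confidence-based activation and highest-confidence vote below are our own heuristics standing in for AED's actual agents:

```python
def answer_mmqa(question: str, modal_hits: dict) -> str:
    """modal_hits maps modality -> (answer, confidence) as returned by the
    text/table/chart expert agents. Returns the selected final answer."""
    # Allocating step: only keep experts whose modality produced a signal.
    active = {m: hit for m, hit in modal_hits.items() if hit[1] > 0.0}
    if not active:
        # CT2C-QA deliberately includes unanswerable questions.
        return "no answer"
    # Decision step: pick the expert answer with the highest confidence.
    modality = max(active, key=lambda m: active[m][1])
    return active[modality][0]
```

A real Decision Agent would reason over the experts' rationales rather than a single scalar, but the structure (allocate, consult experts, adjudicate) is the same.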
No abstract available
No abstract available
No abstract available
To address the challenges of dynamic adversarial scenario modeling distortion, insufficient cross-institutional data privacy protection, and simplistic evaluation systems in collegiate basketball tactical education, this study proposes and validates an immersive instructional system integrating digital twin and federated learning technologies. The four-tier architecture (sensing layer, digital twin layer, federated layer, and interaction layer) synthesizes multimodal data (motion trajectories and physiological signals) with Multi-Agent Reinforcement Learning (MARL) to enable virtual–physical integrated tactical simulation and real-time error correction. Experimental results demonstrate that the experimental group achieved 35.2% higher tactical execution accuracy (TEA) (p < 0.01), 1.8 s faster decision making (p < 0.05), and 47% improved team coordination efficiency compared to the controls. The hierarchical federated learning framework (trajectory ε = 0.8; physiology ε = 0.3) maintained model precision loss at 2.4% while optimizing communication efficiency by 23%, ensuring privacy preservation. A novel three-dimensional “Skill–Creativity–Load” evaluation system revealed a 22% increase in unconventional tactical applications (p = 0.013) through the Tactical Creativity Index (TCI). By implementing lightweight federated architecture with dynamic cognitive offloading mechanisms, the system enables resource-constrained institutions to achieve 87% of the pedagogical effectiveness observed in elite programs, offering an innovative solution to reconcile educational equity with technological ethics. Future research should focus on long-term skill transfer, multimodal adaptive learning, and ethical framework development to advance intelligent sports education from efficiency-oriented paradigms to competency-based transformation.
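The per-modality privacy budgets the abstract reports (trajectory ε = 0.8, physiology ε = 0.3) follow the standard differential-privacy pattern of adding more noise to more sensitive streams. The sketch below uses the textbook Laplace mechanism; the function names and the way it would slot into the federated pipeline are our assumptions:

```python
import math
import random

def privatize(value: float, sensitivity: float, epsilon: float,
              rng: random.Random) -> float:
    """Laplace mechanism: add Laplace(0, sensitivity/epsilon) noise.
    Smaller epsilon (e.g. physiology, 0.3) means a wider noise scale and
    stronger privacy than a larger one (e.g. trajectories, 0.8)."""
    scale = sensitivity / epsilon
    # Inverse-CDF sampling of Laplace(0, scale) from a uniform draw.
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    noise = -scale * sign * math.log(1.0 - 2.0 * abs(u))
    return value + noise
```

In a hierarchical federated setting each client would privatize its modality-specific updates with the matching ε before aggregation, which is consistent with the abstract's reported trade-off between precision loss and privacy.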
No abstract available
No abstract available
When working in a group, it is essential to understand each other’s viewpoints to increase group cohesion and meeting productivity. This can be challenging in teams: participants might be left misunderstood and the discussion could be going around in circles. To tackle this problem, previous research on group interactions has addressed topics such as dominance detection, group engagement, and group creativity. Conversational memory, however, remains a widely unexplored area in the field of multimodal analysis of group interaction. The ability to track what each participant or a group as a whole find memorable from each meeting would allow a system or agent to continuously optimise its strategy to help a team meet its goals. In the present paper, we therefore investigate what participants take away from each meeting and how it is reflected in group dynamics. As a first step toward such a system, we recorded a multimodal longitudinal meeting corpus (MEMO), which comprises a first-party annotation of what participants remember from a discussion and why they remember it. We investigated whether participants of group interactions encode what they remember non-verbally and whether we can use such non-verbal multimodal features to automatically predict what groups are likely to remember. We devise a coding scheme to cluster participants’ memorisation reasons into higher-level constructs. We find that low-level multimodal cues, such as gaze and speaker activity, can predict conversational memorability. We also find that non-verbal signals can indicate when a memorable moment starts and ends. We could predict four levels of conversational memorability with an average accuracy of 44%. We also showed that reasons related to participants’ personal feelings and experiences are the most frequently mentioned grounds for remembering meeting segments.
An increasingly large amount of multimodal content is posted on social media websites such as YouTube and Facebook every day. To cope with the growth of such multimodal data, there is an urgent need to develop an intelligent multi-modal analysis framework that can effectively extract information from multiple modalities. In this paper, we propose a novel multimodal information extraction agent, which infers and aggregates the semantic and affective information associated with user-generated multimodal data in contexts such as e-learning, e-health, automatic video content tagging and human-computer interaction. In particular, the developed intelligent agent adopts an ensemble feature extraction approach by exploiting the joint use of tri-modal (text, audio and video) features to enhance the multimodal information extraction process. In preliminary experiments using the eNTERFACE dataset, our proposed multi-modal system is shown to achieve an accuracy of 87.95%, outperforming the best state-of-the-art system by more than 10%, or in relative terms, a 56% reduction in error rate.
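A common baseline for combining tri-modal predictions is late fusion, averaging per-modality label probabilities. The equal-weight averaging below is that generic baseline, not necessarily the ensemble scheme this particular agent used:

```python
def late_fusion(tri_modal_scores: dict, weights=None) -> dict:
    """tri_modal_scores: modality -> {label: probability}. Returns fused
    label probabilities as a (optionally weighted) average across modalities."""
    weights = weights or {m: 1.0 for m in tri_modal_scores}
    total = sum(weights.values())
    labels = next(iter(tri_modal_scores.values())).keys()
    return {
        lab: sum(weights[m] * tri_modal_scores[m][lab]
                 for m in tri_modal_scores) / total
        for lab in labels
    }
```

Feature-level (early) fusion, concatenating text, audio, and video features before classification, is the main alternative; ensemble approaches like the one described often combine both.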
No abstract available
The ability to represent emotion plays a significant role in human cognition and social interaction, yet the high-dimensional geometry of this affective space and its neural underpinnings remain debated. A key challenge, the 'behavior-neural gap,' is the limited ability of human self-reports to predict brain activity. Here we test the hypothesis that this gap arises from the constraints of traditional rating scales and that large-scale similarity judgments can more faithfully capture the brain's affective geometry. Using AI models as 'cognitive agents,' we collected millions of triplet odd-one-out judgments from a multimodal large language model (MLLM) and a language-only model (LLM) in response to 2,180 emotionally evocative videos. We found that the emergent 30-dimensional embeddings from these models are highly interpretable and organize emotion primarily along categorical lines, yet in a blended fashion that incorporates dimensional properties. Most remarkably, the MLLM's representation predicted neural activity in human emotion-processing networks with the highest accuracy, outperforming not only the LLM but also, counterintuitively, representations derived directly from human behavioral ratings. This result supports our primary hypothesis and suggests that sensory grounding (learning from rich visual data) is critical for developing a truly neurally-aligned conceptual framework for emotion. Our findings provide compelling evidence that MLLMs can autonomously develop rich, neurally-aligned affective representations, offering a powerful paradigm to bridge the gap between subjective experience and its neural substrates. Project page: https://reedonepeck.github.io/ai-emotion.github.io/.
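The triplet odd-one-out task used to collect these judgments has a simple operational form: given three items, pick the one least similar to the other two. The dot-product similarity over toy 2-D embeddings below is our own minimal illustration of the task, not the models' actual judgment process:

```python
def odd_one_out(embeddings: dict, triplet: tuple) -> str:
    """Return the item whose removal leaves the most similar pair, i.e.
    the item least similar to the other two."""
    a, b, c = triplet

    def sim(x, y):
        # Inner-product similarity over the toy embedding vectors.
        return sum(p * q for p, q in zip(embeddings[x], embeddings[y]))

    # Each candidate's score is the similarity of the *remaining* pair.
    pair_sim = {a: sim(b, c), b: sim(a, c), c: sim(a, b)}
    return max(pair_sim, key=pair_sim.get)
```

Aggregating millions of such choices constrains a similarity structure far more richly than per-item rating scales, which is the methodological point the abstract builds on.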
Robots operating in dynamic environments must continuously perceive, learn, and adapt to function autonomously. In line with the Artificial Cognition (ACo) approach, we present an Always-On cognitive architecture that enables robots to continuously perceive and build a self-supervised, emergent representation of the environment to support future proactive behavior. The architecture combines sensor fusion, efficient multimodal in-memory representation of perception, and the self-organization of personal experiences through memory consolidation. We validated the system in both our laboratory and at the Humanoids 2024 conference, where it autonomously learned to distinguish between lively episodes, characterized by high social activity, and calm episodes of minimal interaction. This foundational distinction lays the groundwork for context-awareness and proactive behavior. This work represents a step toward truly autonomous, self-improving robotic agents, paving the way for more intelligent and continuously learning cognitive systems.
Can large language model (LLM) agents reproduce the complex social dynamics that characterize human online behavior (shaped by homophily, reciprocity, and social validation), and what memory and learning mechanisms enable such dynamics to emerge? We present a multi-agent LLM simulation framework in which agents repeatedly interact, evaluate one another, and adapt their behavior through in-context learning accelerated by a coaching signal. To model human social behavior, we design behavioral reward functions that capture core drivers of online engagement, including social interaction, information seeking, self-presentation, coordination, and emotional support. These rewards align agent objectives with empirically observed user motivations, enabling the study of how network structures and group formations emerge from individual decision-making. Our experiments show that coached LLM agents develop stable interaction patterns and form emergent social ties, yielding network structures that mirror properties of real online communities. By combining behavioral rewards with in-context adaptation, our framework establishes a principled testbed for investigating collective dynamics in LLM populations and reveals how artificial agents may approximate or diverge from human-like social behavior.
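A behavioral reward of the kind this abstract describes is, at its simplest, a weighted sum over engagement drivers observed in an interaction. The driver names follow the abstract; the weighted-sum form and any specific weights are our own illustration:

```python
def behavioral_reward(event: dict, weights: dict) -> float:
    """Score one interaction event: each engagement driver observed in the
    event (a value in [0, 1]) contributes its weighted amount."""
    drivers = ["social_interaction", "information_seeking",
               "self_presentation", "coordination", "emotional_support"]
    return sum(weights.get(d, 0.0) * event.get(d, 0.0) for d in drivers)
```

Varying the weight vector per agent gives each simulated user a distinct motivation profile, which is what lets homophily-like tie formation emerge from individually scored interactions.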
We investigate the emergent social dynamics of Large Language Model (LLM) agents in a spatially extended El Farol Bar problem, observing how they autonomously navigate this classic social dilemma. In these simulations, the LLM agents generated a spontaneous motivation to go to the bar and changed their decision-making as they became a collective. We also observed that the LLM agents did not solve the problem completely, but rather behaved more like humans. These findings reveal a complex interplay between external incentives (prompt-specified constraints such as the 60% threshold) and internal incentives (culturally-encoded social preferences derived from pre-training), demonstrating that LLM agents naturally balance formal game-theoretic rationality with social motivations that characterize human behavior. These findings suggest that LLM agents can realize a model of group decision-making that previous game-theoretic problem settings could not capture.
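For readers unfamiliar with the setup, the classic (non-LLM) El Farol baseline against which such agents are compared is a one-line prediction rule: go only if your memory of past attendance predicts the bar will be under capacity. This numeric rule is the standard textbook formulation, not the paper's LLM agents:

```python
def el_farol_round(memories: list, capacity: float = 0.6) -> list:
    """One round of the classic El Farol rule. memories holds, per agent,
    a list of past attendance rates; each agent attends (True) only if its
    mean prediction is below the capacity threshold."""
    decisions = []
    for past_rates in memories:
        prediction = sum(past_rates) / len(past_rates) if past_rates else 0.0
        decisions.append(prediction < capacity)
    return decisions
```

The interplay the abstract highlights is precisely what this rule lacks: LLM agents layer socially motivated preferences from pre-training on top of the threshold incentive, so their attendance dynamics deviate from the purely predictive baseline.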
This paper introduces the Multimodal Socialized Learning Framework (M-S2L), designed to foster emergent social intelligence in AI agents by integrating Multimodal Large Language Models (M-LLMs) with social learning mechanisms. The framework equips agents with multimodal perception (vision and text) and structured action capabilities, enabling physical manipulation and grounded multimodal communication (e.g., text with visual pointers). M-S2L combines direct reinforcement learning with two novel social learning pathways: multimodal observational learning and communication-driven learning from feedback, augmented by an episodic memory system for long-term social context. We evaluate M-S2L in a Collaborative Assembly Environment (CAE), where agent teams must construct complex devices from ambiguous blueprints under informational asymmetry. Across tasks of increasing complexity, M-S2L agents consistently outperform Text-Only and No-Social-Learning baselines in Task Completion Rate and Time to Completion, particularly in dynamic problem-solving scenarios. Ablation studies confirm the necessity of both multimodality and socialized learning. Our analysis reveals the emergence of efficient communication protocols integrating visual pointers with concise text, alongside rapid role specialization leading to stable labor division. Qualitative case studies demonstrate agents' abilities for shared awareness, dynamic re-planning, and adaptive problem-solving, suggesting a nascent form of machine social cognition. These findings indicate that integrating multimodal perception with explicit social learning is critical for developing human-like collaborative intelligence in multi-agent systems.
As LLM-based agents become increasingly autonomous and will more freely interact with each other, studying the interplay among them becomes crucial to anticipate emergent phenomena and potential risks. In this work, we provide an in-depth analysis of the interactions among agents within a simulated hierarchical social environment, drawing inspiration from the Stanford Prison Experiment. Leveraging 2,400 conversations across six LLMs (i.e., LLama3, Orca2, Command-r, Mixtral, Mistral2, and gpt4.1) and 240 experimental scenarios, we analyze persuasion and anti-social behavior between a guard and a prisoner agent with differing objectives. We first document model-specific conversational failures in this multi-agent power dynamic context, thereby narrowing our analytic sample to 1,600 conversations. Among models demonstrating successful interaction, we find that goal setting significantly influences persuasiveness but not anti-social behavior. Moreover, agent personas, especially the guard's, substantially impact both successful persuasion by the prisoner and the manifestation of anti-social actions. Notably, we observe the emergence of anti-social conduct even in absence of explicit negative personality prompts. These results have important implications for the development of interactive LLM agents and the ongoing discussion of their societal impact.
Affect-aware agents are one way to make social conflict between non-player characters in games feel more believable. Instead of relying only on fixed scripts, these agents maintain an internal model of emotion and social preferences that reacts to in-game events and to how other characters behave. When anger, gratitude, fear, or ambition change over time, the result may be alliance, betrayal, sacrifice, or a struggle for power between NPCs or between an NPC and the player. This paper presents a narrative literature review on how AI can recognize or simulate such emotions and social preferences to support emergent conflict. Prior work is organized into three strands: affective and motivation modeling for individual characters, social preferences and multi-agent learning for groups of agents, and drama or experience management systems that control the pacing and intensity of conflict events. For each strand, the review examines how state is represented, how conflict is triggered, and how player experience and safety are evaluated. On the basis of this review, a simple mechanism view is proposed that links emotional state and social values to typical conflict behaviors, and key design choices are identified that can make conflicts both understandable and controllable. The goal is to provide game designers and AI practitioners with practical ideas for using affect-aware agents when building games that rely on emergent social conflict.
Using intelligent systems to perceive psychological and social behaviors, that is, the underlying affective, cognitive, and pathological states that are manifested through observable behaviors and social interactions, remains a challenge due to their complex, multifaceted, and personalized nature. Existing work tackling these dimensions through specialized datasets and single-task systems often misses opportunities for scalability, cross-task transfer, and broader generalization. To address this gap, we curate Human Behavior Atlas, a unified benchmark of diverse behavioral tasks designed to support the development of foundation models for understanding psychological and social behaviors. Human Behavior Atlas comprises over 100,000 samples spanning text, audio, and visual modalities, covering tasks on affective states, cognitive states, pathologies, and social processes. Our unification efforts can reduce redundancy and cost, enable training to scale efficiently across tasks, and enhance generalization of behavioral features across domains. On Human Behavior Atlas, we train three models: Omnisapiens-7B SFT, Omnisapiens-7B BAM, and Omnisapiens-7B RL. We show that training on Human Behavior Atlas enables models to consistently outperform existing multimodal LLMs across diverse behavioral tasks. Pretraining on Human Behavior Atlas also improves transfer to novel behavioral datasets, with the targeted use of behavioral descriptors yielding meaningful performance gains. The benchmark, models, and codes can be found at: https://github.com/MIT-MI/human_behavior_atlas.
Designing reliable automatic models for social perception can contribute to a better understanding of human behavior, enabling more trustworthy experiences in online multimedia communication environments. However, predicting social attributes from video data remains challenging due to the complex interplay of visual, auditory, and linguistic cues. In this paper, we address this challenge by investigating the effectiveness of Multimodal Large Language Models (MM-LLMs) for feature extraction in the MuSe-Perception challenge. Firstly, our analysis of the novel LMU-ELP dataset has revealed high correlations between certain perceptual dimensions, motivating using a single regression model for all 16 social attributes to be predicted for a set of speakers appearing in recorded video clips. We demonstrate that dimensionality reduction through Principal Component Analysis (PCA) can be applied to the label space without relevant performance loss. Secondly, by employing frozen MM-LLMs as feature extractors, we explore their ability to capture perception-related information. We extract sequence embeddings from the Qwen-VL and Qwen-Audio models and train a Multi-Layer Perceptron over the attention-pooled vectors for each encoder, obtaining a mean Pearson correlation of 0.22 using the average predictions for both models. Our best result of 0.31 is achieved by training the same architecture over the baseline vit-ver and w2v-msp features, which motivates further exploration on how to effectively leverage advanced MM-LLMs as feature extractors. Lastly, a post hoc analysis of our results highlights the limitations of Pearson correlation for evaluating regression performance in this context. In particular, a similar Pearson coefficient can be obtained with two very different prediction sets displaying different levels of variability. We take this result as a call to action in exploring alternative metrics to assess the regression performance for the task.
Social learning is a key component of human and animal intelligence. By taking cues from the behavior of experts in their environment, social learners can acquire sophisticated behavior and rapidly adapt to new circumstances. This paper investigates whether independent reinforcement learning (RL) agents in a multi-agent environment can learn to use social learning to improve their performance. We find that in most circumstances, vanilla model-free RL agents do not use social learning. We analyze the reasons for this deficiency, and show that by imposing constraints on the training environment and introducing a model-based auxiliary loss we are able to obtain generalized social learning policies which enable agents to: i) discover complex skills that are not learned from single-agent training, and ii) adapt online to novel environments by taking cues from experts present in the new environment. In contrast, agents trained with model-free RL or imitation learning generalize poorly and do not succeed in the transfer tasks. By mixing multi-agent and solo training, we can obtain agents that use social learning to gain skills that they can deploy when alone, even out-performing agents trained alone from the start.
Social norms are shared rules that govern and facilitate social interaction. Violating such social norms via teasing and insults may serve to upend power imbalances or, on the contrary, reinforce solidarity and rapport in conversation; such rapport is highly situated and context-dependent. In this work, we investigate the task of automatically identifying the phenomenon of social norm violation in discourse. Towards this goal, we leverage the power of recurrent neural networks and the multimodal information present in the interaction, and propose a predictive model to recognize social norm violations. Using long-term temporal and contextual information, our model achieves an F1 score of 0.705. Implications of our work for developing a socially aware agent are discussed.
Soccer is a rich testbed for studying multi-agent adversarial systems. In this work we focus on the task of reconstructing the noisy trajectories of soccer agents (players and the ball). Previous works that model the behaviours of agents in soccer are limited in two respects: (i) they only focus on short-term context windows (less than or equal to 10 seconds) which are not suitable for reconstructing trajectories impacted by long-term noise, and (ii) they exclusively rely on trajectory context, and do not leverage soccer's auxiliary data streams that can provide additional context. Our Event2Tracking model addresses these limitations. First, our architecture models soccer's long-term structure by processing long-term trajectories (60 seconds in duration). Secondly, our architecture is multimodal. Specifically, it fuses soccer tracking data with event data (which specifies the high-level semantic events that transpire in a game), providing rich context that cannot strictly be inferred from the raw trajectories. We evaluate our method empirically using a reconstruction loss metric. Compared to state-of-the-art approaches, our method substantially improves the accuracy of the ball's and players' reconstructed trajectories.
We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent.
Existing multimodal conversation agents have shown impressive abilities to locate absolute positions or retrieve attributes in simple scenarios, but they fail to perform well when complex relative positions and information alignments are involved, which poses a bottleneck in response quality. In this paper, we propose a Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph (SPRING), with the ability to reason over multi-hop spatial relations and connect them with visual attributes in crowded situated scenarios. Specifically, we design two types of Multimodal Question Answering (MQA) tasks to pretrain the agent. All QA pairs utilized during pretraining are generated from novel Incremental Layout Graphs (ILG). QA-pair difficulty labels automatically annotated by the ILG are used to drive MQA-based curriculum learning. Experimental results verify SPRING's effectiveness, showing that it significantly outperforms state-of-the-art approaches on both the SIMMC 1.0 and SIMMC 2.0 datasets. We release our code and data at https://github.com/LYX0501/SPRING.
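Difficulty-ordered pretraining of this kind can be sketched in a few lines; the staging below is a generic curriculum-learning scheme with illustrative names and data, not SPRING's actual schedule:

```python
def curriculum_batches(qa_pairs, num_stages):
    # qa_pairs: (question, answer, difficulty) triples; the difficulty
    # labels play the role of the automatic ILG annotations. Easier
    # examples are scheduled into earlier training stages.
    ordered = sorted(qa_pairs, key=lambda qa: qa[2])
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]

pairs = [("q1", "a1", 3), ("q2", "a2", 1), ("q3", "a3", 2), ("q4", "a4", 5)]
stages = curriculum_batches(pairs, 2)
print([[q for q, _, _ in s] for s in stages])  # [['q2', 'q3'], ['q1', 'q4']]
```

A trainer would then iterate over the stages in order, so the model sees the easiest QA pairs first and the hardest ones last.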
Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.
Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves a 26% relative improvement in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves an around 2% higher overall score than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to the full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% of the token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.
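The extract-consolidate-retrieve loop described above can be sketched minimally; the store, the keys, and the word-overlap scoring here are illustrative stand-ins, not Mem0's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    # Hypothetical store: maps a fact key to its latest consolidated value.
    facts: dict = field(default_factory=dict)

    def consolidate(self, key: str, value: str) -> None:
        # New information about a known key supersedes the stale entry
        # instead of accumulating contradictory duplicates.
        self.facts[key] = value

    def retrieve(self, query: str, k: int = 3):
        # Toy relevance score: word overlap between the query and
        # "key value"; a real system would use embedding similarity.
        q = set(query.lower().split())
        scored = sorted(
            self.facts.items(),
            key=lambda kv: -len(q & set(f"{kv[0]} {kv[1]}".lower().split())),
        )
        return scored[:k]

store = MemoryStore()
store.consolidate("user city", "Paris")
store.consolidate("user city", "Berlin")  # update supersedes the old value
store.consolidate("user pet", "a cat named Miso")
print(store.retrieve("which city does the user live in", k=1))
# [('user city', 'Berlin')]
```

The consolidation step is what distinguishes a memory system from plain RAG over the transcript: the stale "Paris" entry is overwritten rather than retrieved alongside the update.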
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performances on several long-horizon video understanding benchmarks, an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.
Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal reasoning. To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.
How can flexibility and control over the interpretation of multimodal signals by embodied agents be balanced? Flexibility means that agents respond fluently in any context, whereas control means that responses are transparent and faithful to goals and principles that are explicitly defined. This paper describes a modular platform to create multimodal interactive agents using an event bus on which signals and interpretations are posted as a sequence in time, but also provides control options to drive the interaction given specific intentions and goals. Different sensors and interpretation components can be integrated by defining their input and output topics in the event bus, which results in an open multimodal sequence-driven workflow for further interpretations. In addition, our platform allows us to define higher-level intents that control sequence patterns to achieve a goal. A key component is an episodic Knowledge Graph (eKG) that acts as a long-term symbolic memory to aggregate and connect these interpretations. This eKG establishes coherence and continuity across different interactions. Intents and the eKG make it possible to define different (embodied) agents and compare their behavior without having to implement complex software components for multimodal sensor data and design the control over their dependencies. In this paper, we explain the broad range of components that we developed and integrated into various interactive agents. We also explain how the interaction is recorded as multimodal data and how it results in an aggregated memory in the eKG. By analyzing the recorded interaction, we can compare agents and agent components and study their interactive behavior with people and other agents.
Long-term task planning for robots remains a key focus in robotics research. At present, long-term planning faces challenges such as cumulative errors in sequences and the limited scope of semantic modalities in planning. Traditional planning relies on static expert rule databases, which lack flexibility. While knowledge graphs based solely on text can effectively map task relationships, their reasoning capabilities are constrained by issues such as modal uniformity and delays in updating static knowledge bases. To address these issues, this paper proposes GraphCortex, a “knowledge graph-VLM-agent” tripartite fusion paradigm for long-term task planning. Inspired by the human brain’s superior temporal sulcus, inferior parietal lobe, and intraparietal sulcus, GraphCortex integrates visual and motion trajectory data into pure text knowledge graphs, achieving multimodal task nodes and aligning visual, action, and textual semantics. In experiments involving long-term tasks, GraphCortex significantly enhances the long-term capabilities of imitation learning under the single-demonstration paradigm, offering a practical solution to challenges like semantic modality uniformity, fragmented task semantics, and the difficulty of autonomously evolving experiential knowledge in robot long-term task planning.
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate the long-term interaction between the agent and a human to compose the majority of the test process. With the pseudo-dialogue history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the X-Turn Pass-Rate metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% during 3 turns and 10 turns of dialogue respectively, their performance drops as the dialogue progresses, which underscores the difficulty of maintaining consistency in the long term.
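One plausible formalization of a pass-rate metric over dialogue durations, with invented judge verdicts (the paper's exact X-Turn Pass-Rate definition may differ):

```python
def x_turn_pass_rate(verdicts_per_dialogue, turns):
    # verdicts_per_dialogue: for each test dialogue, the turn index at
    # which judges first identified the agent as a machine (None = never).
    # A dialogue "passes" at duration `turns` if no detection happened
    # within the first `turns` turns.
    passed = sum(
        1 for detected_at in verdicts_per_dialogue
        if detected_at is None or detected_at > turns
    )
    return passed / len(verdicts_per_dialogue)

# Illustrative verdicts for 10 dialogues.
detections = [None, 2, 5, None, 8, 1, None, 12, 4, None]
print(x_turn_pass_rate(detections, 3))   # 0.8
print(x_turn_pass_rate(detections, 10))  # 0.5
```

Evaluating the same verdicts at growing turn counts reproduces the qualitative finding above: the pass rate can only decay as the dialogue gets longer, since each additional turn is another chance to be detected.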
Controlling diversity in LLM-agent simulations is essential for balancing stability in structured tasks with variability in open-ended interactions. However, we observe that dialogue diversity tends to degrade over long-term simulations. To explore the role of prompt design in this phenomenon, we modularized the utterance generation prompt and found that reducing contextual information leads to more diverse outputs. Based on this insight, we propose Adaptive Prompt Pruning (APP), a novel method that allows users to control diversity via a single parameter, lambda. APP dynamically prunes prompt segments based on attention scores and is compatible with existing diversity control methods. We demonstrate that APP effectively modulates diversity through extensive experiments and propose a method to balance the control trade-offs. Our analysis reveals that all prompt components impose constraints on diversity, with the Memory being the most influential. Additionally, high-attention contents consistently suppress output diversity.
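A sketch of attention-based prompt pruning controlled by a single parameter; here we drop the highest-attention segments first, consistent with the observation above that high-attention content suppresses diversity, but the segment names, scores, and the exact pruning criterion are illustrative, not APP's actual algorithm:

```python
def adaptive_prompt_prune(segments, lam):
    # segments: list of (text, attention_score) pairs. lam in [0, 1] is a
    # stand-in for the paper's single control parameter: lam=0 keeps the
    # full prompt; larger lam prunes more of the high-attention segments.
    keep = max(1, round(len(segments) * (1 - lam)))
    # Select the `keep` lowest-attention segments, preserving prompt order.
    by_attention = sorted(segments, key=lambda s: s[1])[:keep]
    kept = {text for text, _ in by_attention}
    return [text for text, _ in segments if text in kept]

segments = [
    ("persona description", 0.9),
    ("conversation memory", 0.7),
    ("scene details", 0.3),
    ("formatting rules", 0.1),
]
print(adaptive_prompt_prune(segments, 0.0))  # all four segments kept
print(adaptive_prompt_prune(segments, 0.5))  # two lowest-attention remain
```

Because lam is a single scalar, it can be swept or scheduled alongside existing diversity controls such as temperature, which matches the compatibility claim in the abstract.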
Conversational recommender systems (CRS) endow traditional recommender systems with the capability of dynamically obtaining users’ short-term preferences for items and attributes through interactive dialogues. There are three core challenges for CRS: the intelligent decisions for what attributes to ask, which items to recommend, and when to ask or recommend, at each conversation turn. Previous methods mainly leverage reinforcement learning (RL) to learn conversational recommendation policies for solving one or two of these three decision-making problems in CRS with separated conversation and recommendation components. These approaches restrict the scalability and generality of CRS and fall short of preserving a stable training procedure. In light of these challenges, we tackle these three decision-making problems in CRS as a unified policy learning task. In order to leverage the different features that are important to each sub-problem and facilitate better unified policy learning in CRS, we propose two novel multi-agent RL-based frameworks, namely Independent and Hierarchical Multi-Agent UNIfied COnversational RecommeNders (IMA-UNICORN and HMA-UNICORN), respectively. Specifically, two low-level agents enrich the state representations for attribute prediction and item recommendation by combining long-term user preference information from the historical interaction data with short-term user preference information from the conversation history. A high-level meta agent is responsible for coordinating the low-level agents to adaptively make the final decision. Experimental results on four benchmark CRS datasets and a real-world E-Commerce application show that the proposed frameworks significantly outperform state-of-the-art methods. Extensive analyses further demonstrate the superior scalability of the MARL frameworks on multi-round conversational recommendation.
Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable performance in long-term human-machine interactions, which basically relies on iterative recalling and reasoning over history to generate high-quality responses. However, such repeated recall-reason steps easily produce biased thoughts, i.e., inconsistent reasoning results when recalling the same history for different questions. On the contrary, humans can keep thoughts in memory and recall them without repeated reasoning. Motivated by this human capability, we propose a novel memory mechanism called TiM (Think-in-Memory) that enables LLMs to maintain an evolved memory for storing historical thoughts along the conversation stream. The TiM framework consists of two crucial stages: (1) before generating a response, an LLM agent recalls relevant thoughts from memory, and (2) after generating a response, the LLM agent post-thinks and incorporates both historical and new thoughts to update the memory. Thus, TiM can eliminate the issue of repeated reasoning by saving the post-thinking thoughts as the history. Besides, we formulate basic principles for organizing the thoughts in memory based on well-established operations (i.e., insert, forget, and merge), allowing for dynamic updates and evolution of the thoughts. Furthermore, we introduce Locality-Sensitive Hashing into TiM to achieve efficient retrieval for long-term conversations. We conduct qualitative and quantitative experiments on real-world and simulated dialogues covering a wide range of topics, demonstrating that equipping existing LLMs with TiM significantly enhances their performance in generating responses for long-term interactions.
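The Locality-Sensitive Hashing ingredient can be sketched with the standard random-hyperplane construction; the dimensions, bit count, and stored thought below are illustrative, not necessarily TiM's exact scheme:

```python
import random

random.seed(0)
DIM, BITS = 8, 4
# Random hyperplanes shared by inserts and queries; the sign of the dot
# product with each plane contributes one bit of the hash.
PLANES = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def lsh_bucket(vec):
    # Locality-sensitive hash: nearby embeddings tend to share a bucket,
    # so recall only scans one bucket instead of the whole memory.
    return tuple(
        int(sum(p * x for p, x in zip(plane, vec)) > 0) for plane in PLANES
    )

memory = {}  # bucket -> list of stored thoughts

def insert_thought(vec, thought):
    memory.setdefault(lsh_bucket(vec), []).append(thought)

def recall(vec):
    return memory.get(lsh_bucket(vec), [])

v = [0.5] * DIM
insert_thought(v, "user prefers morning meetings")
print(recall(v))  # ['user prefers morning meetings']
```

Embeddings that are close in angle usually hash to the same bucket, so a slightly perturbed query vector would typically recall the same thought; the forget and merge operations would then edit that bucket's list in place.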
Generation-based chatbots, which read and write natural language and create new sentences, have advanced significantly through large language models (LLMs). Nevertheless, reaching long-term coherence and successful personalization remains difficult because of shortcomings in memory management. This work is a detailed study of memory systems, covering short-term memory (STM), long-term memory (LTM), episodic memory (EM), and semantic memory (SM), and their contributions to continuity of context in dialogue systems. It points out the fundamental problems of contextual drift, catastrophic forgetting (CF), and scaling. The paper also examines personalization methods such as user profiling (UP), dynamic embeddings (DE), context-aware generation (CAG), and continual learning (CL), as well as privacy-preserving methods like federated learning (FL). By reviewing existing approaches and hypothesizing integrated ones, this work contributes to the further improvement of intelligent, memory-capable, and user-adaptive conversational agents. The research also includes a comparative analysis of the latest research contributions, along with recommendations for future work on adaptability and human-like interaction in chatbot systems. Such knowledge is critical for developing chatbots that can remember user context over time and adapt to user preferences. This study serves as a foundation for developing more emotionally intelligent and context-aware dialogue systems.
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents that integrates forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities (utterances, turns, and sessions) into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on the LLM's cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows a more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
There has recently been growing interest in conversational agents with long-term memory, which has led to the rapid development of language models that use retrieval-augmented generation (RAG). Until recently, most work on RAG has focused on information retrieval from large databases of texts, like Wikipedia, rather than information from long-form conversations. In this paper, we argue that effective retrieval from long-form conversational data faces two unique problems compared to static database retrieval: 1) time/event-based queries, which require the model to retrieve information about previous conversations based on time or the order of a conversational event (e.g., the third conversation on Tuesday), and 2) ambiguous queries that require surrounding conversational context to understand. To better develop RAG-based agents that can deal with these challenges, we generate a new dataset of ambiguous and time-based questions that build upon a recent dataset of long-form, simulated conversations, and demonstrate that standard RAG-based approaches handle such questions poorly. We then develop a novel retrieval model that combines chain-of-table search methods, standard vector-database retrieval, and a prompting method to disambiguate queries, and demonstrate that this approach substantially improves over current methods at solving these tasks. We believe that this new dataset and more advanced RAG agent can act as a key benchmark and stepping stone towards effective memory-augmented conversational agents that can be used in a wide variety of AI applications.
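A time/event-based query such as "the third conversation on Tuesday" is naturally answered by metadata filtering rather than vector similarity; a minimal sketch with an invented session log:

```python
from datetime import datetime

# Hypothetical session log: (timestamp, transcript summary).
sessions = [
    (datetime(2024, 5, 6, 9), "kickoff planning"),
    (datetime(2024, 5, 7, 8), "budget review"),
    (datetime(2024, 5, 7, 12), "hiring discussion"),
    (datetime(2024, 5, 7, 17), "travel booking"),
    (datetime(2024, 5, 8, 10), "status update"),
]

def nth_session_on_weekday(sessions, weekday, n):
    # Resolve a time/event-based query like "the third conversation on
    # Tuesday" by filtering on weekday (Mon=0 ... Sun=6) and taking the
    # n-th match in chronological order, instead of relying on semantic
    # similarity, which carries no notion of ordinals or dates.
    matches = sorted(
        (ts, text) for ts, text in sessions if ts.weekday() == weekday
    )
    return matches[n - 1][1] if len(matches) >= n else None

# 2024-05-07 is a Tuesday; its third session is the travel booking one.
print(nth_session_on_weekday(sessions, weekday=1, n=3))  # travel booking
```

Embedding-only retrieval would rank these sessions by topical similarity to the query words, none of which mention travel, which is exactly why such queries trip up standard RAG.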
No abstract available
As chatbots continue to evolve toward human-like, real-world interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, placing emphasis on the "eyes" of human perception while neglecting the "ears", namely auditory aspects. Moreover, these studies often center around static interactions that focus on discussing the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots with "eyes and ears" capable of more immersive interactions with humans. As part of this effort, we introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation (M3C), and propose a novel multimodal conversation model featuring multimodal memory retrieval. Our model, trained on M3C, demonstrates the ability to seamlessly engage in long-term conversations with multiple speakers in complex, real-world-like settings, effectively processing visual and auditory inputs to understand and respond appropriately. Human evaluations highlight the model's strong performance in maintaining coherent and dynamic interactions, demonstrating its potential for advanced multimodal conversational agents.
We present HyLECA, an open-source framework designed for the development of long-term engaging controlled conversational agents. HyLECA’s dialogue manager employs a hybrid architecture, combining rule-based methods for controlled dialogue flows with retrieval-based and generation-based approaches to enhance the utterance variability and flexibility. The motivation behind HyLECA lies in enhancing user engagement and enjoyment in task-oriented chatbots by leveraging the natural language generation capabilities of open-domain large language models within the confines of predetermined dialogue flows. Moreover, we discuss the technical capabilities, potential applications, relevance, and adaptability of the system. Lastly, we report preliminary findings from integrating state-of-the-art large language models in simulating a conversation centred on smoking cessation.
In this paper, we propose MPC (Modular Prompted Chatbot), a new approach for creating high-quality conversational agents without the need for fine-tuning. Our method utilizes pre-trained large language models (LLMs) as individual modules for long-term consistency and flexibility, by using techniques such as few-shot prompting, chain-of-thought (CoT), and external memory. Our human evaluation results show that MPC is on par with fine-tuned chatbot models in open-domain conversations, making it an effective solution for creating consistent and engaging chatbots.
According to current estimates, the population of Japan will decrease from the current 126.4 million to less than 100 million by 2053 [1]. In order to supplement the decreasing labor force, the Japanese government has stated that it will relax the regulations on introducing foreign workers. Thus, an increase in foreign workers in Japanese society can be expected in the near future. In such a situation, foreign workers are the minority and have to work collaboratively with Japanese colleagues, who are the majority. Our project aims at developing an environment for both foreign and Japanese people to practice collaborative work in an unbalanced situation, that is, one where Japanese people are the majority and the task itself favors Japanese participants. This paper describes our design of an experiment to acquire multimodal sensory data in collaborative tasks with an unbalanced composition of cultural backgrounds. These data are intended to be used for developing foreign/Japanese behavior generation and detection models in a training environment with virtual agents. To our knowledge, no such dataset is available, and we believe the data collected can provide valuable resources for developing tools to support unbalanced groups.
The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%. All framework code, agent code, and task datasets are publicly available at https://github.com/camel-ai/crab.
Social media has evolved into a complex multimodal environment where text, images, and other signals interact to shape nuanced meanings, often concealing harmful intent. Identifying such intent, whether sarcasm, hate speech, or misinformation, remains challenging due to cross-modal contradictions, rapid cultural shifts, and subtle pragmatic cues. To address these challenges, we propose MV-Debate, a multi-view agent debate framework with dynamic reflection gating for unified multimodal harmful content detection. MV-Debate assembles four complementary debate agents (a surface analyst, a deep reasoner, a modality contrast agent, and a social contextualist) to analyze content from diverse interpretive perspectives. Through iterative debate and reflection, the agents refine responses under a reflection-gain criterion, ensuring both accuracy and efficiency. Experiments on three benchmark datasets demonstrate that MV-Debate significantly outperforms strong single-model and existing multi-agent debate baselines. This work highlights the promise of multi-agent debate in advancing reliable social intent detection in safety-critical online contexts.
Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.
This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general scenario is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using a simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness of agents, our findings first indicate that these agents are prone to environmental distractions. Furthermore, we implement an adversarial environment injection and analyze the approach to improve faithfulness, calling for a collective focus on this important topic.
No abstract available
A multimodal large language model (MLLM) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. Nonetheless, red-teaming efforts have revealed that adversarial images/prompts can jailbreak an MLLM and cause unaligned behaviors. In this work, we report an even more severe safety issue in multi-agent environments, referred to as infectious jailbreak. It entails the adversary simply jailbreaking a single agent, and without any further intervention from the adversary, (almost) all agents will become infected exponentially fast and exhibit harmful behaviors. To validate the feasibility of infectious jailbreak, we simulate multi-agent environments containing up to one million LLaVA-1.5 agents, and employ randomized pair-wise chat as a proof-of-concept instantiation for multi-agent interaction. Our results show that feeding an (infectious) adversarial image into the memory of any randomly chosen agent is sufficient to achieve infectious jailbreak. Finally, we derive a simple principle for determining whether a defense mechanism can provably restrain the spread of infectious jailbreak, but how to design a practical defense that meets this principle remains an open question to investigate. Our project page is available at https://sail-sg.github.io/Agent-Smith/.
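The exponentially fast spread described in this abstract is easy to reproduce in a toy simulation. The sketch below is illustrative only: it assumes randomized pairwise chat per round and a transmission probability of 1 whenever one partner carries the adversarial image, which is a simplification of the paper's setup.

```python
import random

def simulate_infection(n_agents: int, rounds: int, seed: int = 0) -> list:
    """Toy model of infectious jailbreak spread: agents chat in random pairs
    each round, and an agent holding the adversarial image in memory passes
    it to its partner (transmission probability 1, an assumption)."""
    rng = random.Random(seed)
    infected = [False] * n_agents
    infected[0] = True  # the adversary jailbreaks exactly one agent
    counts = [1]
    for _ in range(rounds):
        order = list(range(n_agents))
        rng.shuffle(order)
        for i in range(0, n_agents - 1, 2):
            a, b = order[i], order[i + 1]
            if infected[a] or infected[b]:
                infected[a] = infected[b] = True
        counts.append(sum(infected))
    return counts

counts = simulate_infection(n_agents=1024, rounds=15)
print(counts)  # the infected count roughly doubles each round until saturation
```

Because every infected agent meets one partner per round, the infected population roughly doubles until it saturates, which is the qualitative behavior the paper reports at a scale of up to a million agents.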
We introduce HoME: a Household Multimodal Environment for artificial agents to learn from vision, audio, semantics, physics, and interaction with objects and other agents, all within a realistic context. HoME integrates over 45,000 diverse 3D house layouts based on the SUNCG dataset, a scale which may facilitate learning, generalization, and transfer. HoME is an open-source, OpenAI Gym-compatible platform extensible to tasks in reinforcement learning, language grounding, sound-based navigation, robotics, multi-agent learning, and more. We hope HoME better enables artificial agents to learn as humans do: in an interactive, multimodal, and richly contextualized setting.
Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.
This article argues that a relatively novel methodology called multi-agent artificial intelligence modeling can play an important role in helping scholars fulfill eight desiderata for a “good” social scientific theory (conceptual clarity, logical consistency, empirical groundedness, parsimony, generativity, testability, insightfulness, and usefulness). The unique contributions of this methodology include its use of psychologically realistic agents in sociologically realistic networks that interact with each other and their simulated environment within an “artificial society.” These simulation tools utilize artificial intelligence in a way that enables scholars to causally generate the emergence of macro-level societal phenomena of interest from the micro-level behaviors and meso-level interactions of simulated agents that represent real world populations. These social digital twins can also integrate multiple social theories within a single causal architecture, providing a unique opportunity for the revolutionizing of our best theoretical frameworks.
We propose a multimodal (vision-and-language) benchmark for cooperative and heterogeneous multi-agent learning. We introduce a benchmark multimodal dataset with tasks involving collaboration between multiple simulated heterogeneous robots in a rich multi-room home environment. We provide an integrated learning framework, multimodal implementations of state-of-the-art multi-agent reinforcement learning techniques, and a consistent evaluation protocol. Our experiments investigate the impact of different modalities on multi-agent learning performance. We also introduce a simple message passing method between agents. The results suggest that multimodality introduces unique challenges for cooperative multi-agent learning and there is significant room for advancing multi-agent reinforcement learning methods in such settings.
We demonstrate EMMA, an embodied multimodal agent which has been developed for the Alexa Prize SimBot challenge. The agent acts within a 3D simulated environment for household tasks. EMMA is a unified and multimodal generative model aimed at solving embodied tasks. In contrast to previous work, our approach treats multiple multimodal tasks as a single multimodal conditional text generation problem, where a model learns to output text given both language and visual input. Furthermore, we showcase that a single generative agent can solve tasks with visual inputs of varying length, such as answering questions about static images, or executing actions given a sequence of previous frames and dialogue utterances. The demo system will allow users to interact conversationally with EMMA in embodied dialogues in different 3D environments from the TEACh dataset.
No abstract available
No abstract available
Large language models have increasingly been proposed as a powerful replacement for classical agent-based models (ABMs) to simulate social dynamics. By using LLMs as a proxy for human behavior, the hope of this new approach is to be able to simulate significantly more complex dynamics than with classical ABMs and gain new insights in fields such as social science, political science, and economics. However, due to the black box nature of LLMs, it is unclear whether LLM agents actually execute the intended semantics that are encoded in their natural language instructions, and whether the resulting dynamics of interactions are meaningful. To study this question, we propose a new evaluation framework that grounds LLM simulations within the dynamics of established reference models of social science. By treating LLMs as a black-box function, we evaluate their input-output behavior relative to this reference model, which allows us to evaluate detailed aspects of their behavior. Our results show that, while it is possible to engineer prompts that approximate the intended dynamics, the quality of these simulations is highly sensitive to the particular choice of prompts. Importantly, simulations are even sensitive to arbitrary variations such as minor wording changes and whitespace. This puts into question the usefulness of current versions of LLMs for meaningful simulations, as without a reference model, it is impossible to determine a priori what impact seemingly meaningless changes in a prompt will have on the simulation.
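The evaluation idea, treating the LLM agent as a black-box update function and scoring its rollout against a reference model, can be sketched as follows. DeGroot averaging stands in for the reference dynamics (the abstract does not name the specific models used), and the two lambda "agents" are hypothetical stand-ins for a faithful prompt and a slightly reworded one.

```python
import statistics

def degroot_step(opinions, weights):
    """Reference model (DeGroot learning, chosen here for illustration):
    each agent's next opinion is a weighted average of all opinions."""
    return [sum(w * o for w, o in zip(row, opinions)) for row in weights]

def simulation_error(agent_update, opinions, weights, steps=10):
    """Treat the LLM agent as a black-box update function and score its
    rollout against the reference dynamics (mean absolute deviation)."""
    ref, sim, errs = list(opinions), list(opinions), []
    for _ in range(steps):
        ref = degroot_step(ref, weights)
        sim = [agent_update(o, sim) for o in sim]  # old `sim` used throughout
        errs.append(statistics.mean(abs(r - s) for r, s in zip(ref, sim)))
    return statistics.mean(errs)

uniform = [[1 / 3] * 3 for _ in range(3)]
faithful = lambda own, all_ops: statistics.mean(all_ops)         # ideal prompt
drifting = lambda own, all_ops: statistics.mean(all_ops) + 0.05  # reworded prompt

print(simulation_error(faithful, [0.0, 0.5, 1.0], uniform))
print(simulation_error(drifting, [0.0, 0.5, 1.0], uniform))
```

The small constant bias of the "drifting" agent compounds over the rollout, which is the kind of prompt sensitivity the paper measures against its reference models.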
This tutorial will delve into the fascinating realm of simulating human society using Large Language Model (LLM)-driven agents, exploring their applications in cities, social media, and economic systems. Through this tutorial, participants will gain insights into the integration of LLMs into human society simulation, providing a comprehensive understanding of how these models can accurately represent human interactions, decision-making processes, and societal dynamics from cities to social media and to economic systems. The tutorial will introduce the essential background, discuss the motivation and challenges, and elaborate on the recent advances.
Understanding the dynamics of public opinion evolution on online social platforms is crucial for understanding influence mechanisms and the provenance of information. Traditional influence analysis is typically divided into qualitative assessments of personal attributes (e.g., psychology of influence) and quantitative evaluations of influence power mechanisms (e.g., social network analysis). One challenge faced by researchers is the ethics of real-world experimentation and the lack of social influence data. In this study, we provide a novel simulated environment that combines agentic intelligence with Large Language Models (LLMs) to test topic-specific influence mechanisms ethically. Our framework contains agents that generate posts, form opinions on specific topics, and socially follow/unfollow each other based on the outcome of discussions. This simulation allows researchers to observe the evolution of how opinions form and how influence leaders emerge. Using our own framework, we design an opinion leader that utilizes Reinforcement Learning (RL) to adapt its linguistic interaction with the community to maximize its influence and followers over time. Our current findings reveal that constraining the action space and incorporating self-observation are key factors for achieving stable and consistent opinion leader generation for topic-specific influence. This demonstrates the simulation framework's capacity to create agents that can adapt to complex and unpredictable social dynamics. The work is important in an age of increasing online influence on social attitudes and emerging technologies.
The increasing integration of Large Language Models (LLMs) into decision-making frameworks has exposed significant vulnerabilities to social compliance, specifically sycophancy and conformity. However, a critical research gap exists regarding the fundamental mechanisms that enable external social cues to systematically override a model's internal parametric knowledge. This study introduces the Signal Competition Mechanism, a unified framework validated by assessing behavioral correlations across 15 LLMs and performing latent-space probing on three representative open-source models. The analysis demonstrates that sycophancy and conformity originate from a convergent geometric manifold, hereafter termed the compliance subspace, which is characterized by high directional similarity in internal representations. Furthermore, the transition to compliance is shown to be a deterministic process governed by a linear boundary, where the Social Emotional Signal effectively suppresses the Information Calibration Signal. Crucially, we identify a "Transparency-Truth Gap," revealing that while internal confidence provides an inertial barrier, it remains permeable and insufficient to guarantee immunity against intense social pressure. By formalizing the Integrated Epistemic Alignment Framework, this research provides a blueprint for transitioning from instructional adherence to robust epistemic integrity.
The rise of echo chambers on social media platforms has heightened concerns about polarization and the reinforcement of existing beliefs. Traditional approaches for simulating echo chamber formation have often relied on predefined rules and numerical simulations, which, while insightful, may lack the nuance needed to capture complex, real-world interactions. In this paper, we present a novel framework that leverages large language models (LLMs) as generative agents to simulate echo chamber dynamics within social networks. The novelty of our approach is that it incorporates both opinion updates and network rewiring behaviors driven by LLMs, allowing for a context-aware and semantically rich simulation of social interactions. Additionally, we utilize real-world Twitter (now X) data to benchmark the LLM-based simulation against actual social media behaviors, providing insights into the accuracy and realism of the generated opinion trends. Our results demonstrate the efficacy of LLMs in modeling echo chamber formation, capturing both structural and semantic dimensions of opinion clustering. This work contributes to a deeper understanding of social influence dynamics and offers a new tool for studying polarization in online communities.
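The two LLM-driven behaviors this framework combines, opinion updates and network rewiring, can be caricatured numerically. In the sketch below a bounded-confidence rule stands in for the LLM's context-aware decisions; the tolerance, step size, and network parameters are all illustrative assumptions.

```python
import random

def simulate_echo_chamber(n=60, k=5, steps=400, tol=0.3, seed=1):
    """Numeric stand-in for the LLM-driven loop: each step a random agent
    either (i) assimilates toward a like-minded neighbour or (ii) rewires
    away from a neighbour whose opinion differs by more than `tol`."""
    rng = random.Random(seed)
    op = [rng.random() for _ in range(n)]
    nbrs = {i: set(rng.sample([j for j in range(n) if j != i], k))
            for i in range(n)}
    for _ in range(steps):
        i = rng.randrange(n)
        j = rng.choice(sorted(nbrs[i]))
        if abs(op[i] - op[j]) <= tol:
            op[i] += 0.5 * (op[j] - op[i])   # opinion update (assimilation)
        else:
            nbrs[i].discard(j)               # drop the dissenting tie
            pool = [x for x in range(n) if x != i and x not in nbrs[i]]
            nbrs[i].add(rng.choice(pool))    # rewire to a random new tie
    gaps = [abs(op[i] - op[j]) for i in range(n) for j in nbrs[i]]
    return op, sum(gaps) / len(gaps)         # mean disagreement across edges

opinions, edge_gap = simulate_echo_chamber()
print(round(edge_gap, 3))
```

Tracking the mean opinion gap across surviving edges gives a simple structural signal of chamber formation; the paper replaces both decision rules with LLM agents and benchmarks against real Twitter/X data.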
An increasing reliance on recommender systems has led to concerns about the creation of filter bubbles on social media, especially on short video platforms like TikTok. However, their formation is still not entirely understood due to the complex dynamics between recommendation algorithms and user feedback. In this paper, we aim to shed light on these dynamics using a large language model-based simulation framework. Our work employs real-world short-video data containing rich video content information and detailed user-agents to realistically simulate the recommendation-feedback cycle. Through large-scale simulations, we demonstrate that LLMs can replicate real-world user-recommender interactions, uncovering key mechanisms driving filter bubble formation. We identify critical factors, such as demographic features and category attraction, that exacerbate content homogenization. To mitigate this, we design and test interventions including various cold-start and feedback weighting strategies, showing measurable reductions in filter bubble effects. Our framework enables rapid prototyping of recommendation strategies, offering actionable solutions to enhance content diversity in real-world systems. Furthermore, we analyze how LLM-inherent biases may propagate through recommendations, proposing safeguards to promote equity for vulnerable groups, such as women and low-income populations. By examining the interplay between recommendation and LLM agents, this work advances a deeper understanding of algorithmic bias and provides practical tools to promote inclusive digital spaces.
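The recommendation-feedback cycle and a cold-start-style exploration intervention can be sketched as a toy loop. Everything below (the engagement rule, the five-category space, the parameters) is an assumption for illustration, not the paper's setup, which uses LLM user-agents on real short-video data.

```python
import random

def simulate_feed(explore_rate=0.0, steps=500, seed=2):
    """Minimal recommendation-feedback loop: categories are sampled in
    proportion to accumulated positive feedback, and the user-agent likes
    categories near a latent preference. `explore_rate` is a cold-start
    style intervention forcing uniform exploration. Returns the share of
    feedback weight held by the single most-reinforced category."""
    rng = random.Random(seed)
    cats = list(range(5))
    weights = {c: 1.0 for c in cats}
    pref = 2                                  # latent favourite category
    for _ in range(steps):
        if rng.random() < explore_rate:
            c = rng.choice(cats)              # intervention: forced exploration
        else:
            r = rng.random() * sum(weights.values())
            for c in cats:                    # roulette-wheel selection
                r -= weights[c]
                if r <= 0:
                    break
        if abs(c - pref) <= 1 and rng.random() < 0.8:
            weights[c] += 1.0                 # positive feedback reinforces
    return max(weights.values()) / sum(weights.values())

print(simulate_feed())                   # feedback loop left to homogenize
print(simulate_feed(explore_rate=0.2))   # with the exploration intervention
```

The top-category share is a crude homogenization metric; in the paper the analogous measurements motivate the cold-start and feedback-weighting interventions.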
The Spiral of Silence (SoS) theory holds that individuals with minority views often refrain from speaking out for fear of social isolation, enabling majority positions to dominate public discourse. When the 'agents' are large language models (LLMs), however, the classical psychological explanation is not directly applicable, since SoS was developed for human societies. This raises a central question: can SoS-like dynamics nevertheless emerge from purely statistical language generation in LLM collectives? We propose an evaluation framework for examining SoS in LLM agents. Specifically, we consider four controlled conditions that systematically vary the availability of 'History' and 'Persona' signals. Opinion dynamics are assessed using trend tests such as Mann-Kendall and Spearman's rank, along with concentration measures including kurtosis and interquartile range. Experiments across open-source and closed-source models show that history and persona together produce strong majority dominance and replicate SoS patterns; history signals alone induce strong anchoring; and persona signals alone foster diverse but uncorrelated opinions, indicating that without historical anchoring, SoS dynamics cannot emerge. The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM-agent systems.
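The trend and concentration measures named in this framework are straightforward to compute. A pure-Python sketch on invented toy data (a shrinking minority voice share and a majority-dominated final distribution):

```python
from itertools import combinations
import statistics

def mann_kendall_s(series):
    """Mann-Kendall trend statistic S: concordant minus discordant pairs;
    a strongly negative S indicates a monotone downward trend."""
    sign = lambda d: (d > 0) - (d < 0)
    return sum(sign(b - a) for a, b in combinations(series, 2))

def excess_kurtosis(xs):
    """Fisher (excess) kurtosis of the final opinion distribution."""
    m = statistics.fmean(xs)
    m2 = statistics.fmean((x - m) ** 2 for x in xs)
    m4 = statistics.fmean((x - m) ** 4 for x in xs)
    return m4 / m2 ** 2 - 3

def iqr(xs):
    """Interquartile range as a concentration measure."""
    q = statistics.quantiles(xs, n=4, method="inclusive")
    return q[2] - q[0]

# Toy data: the minority voice share shrinks over 20 rounds, and the final
# opinions pile up on the majority position.
share = [0.48, 0.45, 0.44, 0.40, 0.38, 0.35, 0.33, 0.30, 0.28, 0.25,
         0.24, 0.22, 0.20, 0.19, 0.17, 0.16, 0.15, 0.14, 0.13, 0.12]
final = [0.1] * 16 + [0.9] * 4

print(mann_kendall_s(share))  # -190: every one of the C(20,2) pairs decreases
print(excess_kurtosis(final), iqr(final))
```

A zero IQR together with a monotone-negative trend statistic is exactly the "majority dominance" signature the paper reports when history and persona signals are combined.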
Automatic response forecasting for news media plays a crucial role in enabling content producers to efficiently predict the impact of news releases and prevent unexpected negative outcomes such as social conflict and moral injury. To effectively forecast responses, it is essential to develop measures that leverage the social dynamics and contextual information surrounding individuals, especially in cases where explicit profiles or historical actions of the users are limited (referred to as lurkers). As shown in a previous study, 97% of all tweets are produced by only the most active 25% of users. However, existing approaches have limited exploration of how to best process and utilize these important features. To address this gap, we propose a novel framework, named SocialSense, that leverages a large language model to induce a belief-centered graph on top of an existing social network, along with graph-based propagation to capture social dynamics. We hypothesize that the induced graph that bridges the gap between distant users who share similar beliefs allows the model to effectively capture the response patterns. Our method surpasses existing state-of-the-art in experimental evaluations for both zero-shot and supervised settings, demonstrating its effectiveness in response forecasting. Moreover, the analysis reveals the framework's capability to effectively handle unseen user and lurker scenarios, further highlighting its robustness and practical applicability.
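The two-stage idea, inducing a belief-centered graph and then propagating responses over it, can be sketched in a few lines. The data layout and the plain averaging rule are assumptions for illustration; SocialSense itself extracts beliefs with an LLM and uses learned graph propagation.

```python
from collections import defaultdict

def induce_belief_graph(user_beliefs):
    """Belief-centered graph: users who hold a belief in common become
    neighbours, bridging users who are distant in the follower graph."""
    holders = defaultdict(set)
    for user, beliefs in user_beliefs.items():
        for b in beliefs:
            holders[b].add(user)
    edges = defaultdict(set)
    for users in holders.values():
        for u in users:
            edges[u] |= users - {u}
    return edges

def propagate_responses(edges, known, rounds=3):
    """Average-neighbour propagation of a response score onto users with
    no posting history ('lurkers')."""
    scores = dict(known)
    for _ in range(rounds):
        for u, nbrs in edges.items():
            if u in known:
                continue
            vals = [scores[v] for v in nbrs if v in scores]
            if vals:
                scores[u] = sum(vals) / len(vals)
    return scores

edges = induce_belief_graph({"ana": {"climate"}, "bo": {"climate", "tax"},
                             "cy": {"tax"}})
scores = propagate_responses(edges, known={"ana": 1.0, "cy": 0.0})
print(scores["bo"])  # 'bo' inherits the average of its belief-neighbours
```

Note how "bo" acquires edges to both "ana" and "cy" through shared beliefs even though no follower edge was given, which is the bridging effect the paper hypothesizes helps forecast lurker responses.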
Large language models (LLMs) are increasingly used to model human social behavior, with recent research exploring their ability to simulate social dynamics. Here, we test whether LLMs mirror human behavior in social dilemmas, where individual and collective interests conflict. Humans generally cooperate more than expected in laboratory settings, showing less cooperation in well-mixed populations but more in fixed networks. In contrast, LLMs tend to exhibit greater cooperation in well-mixed settings. This raises a key question: Are LLMs able to emulate human behavior in cooperative dilemmas on networks? In this study, we examine networked interactions where agents repeatedly engage in the Prisoner’s Dilemma within both well-mixed and structured network configurations, aiming to identify parallels in cooperative behavior between LLMs and humans. Our findings indicate critical distinctions: while humans tend to cooperate more within structured networks, LLMs display increased cooperation mainly in well-mixed environments, with limited adjustment to networked contexts. Notably, LLM cooperation also varies across model types, illustrating the complexities of replicating human-like social adaptability in artificial agents. These results highlight a crucial gap: LLMs struggle to emulate the nuanced, adaptive social strategies humans deploy in fixed networks. Unlike human participants, LLMs do not alter their cooperative behavior in response to network structures or evolving social contexts, missing the reciprocity norms that humans adaptively employ. This limitation points to a fundamental need in future LLM design—to integrate a deeper comprehension of social norms, enabling more authentic modeling of human-like cooperation and adaptability in networked environments.
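A minimal repeated-PD harness makes the reciprocity norm at issue concrete; tit-for-tat is the canonical conditional strategy of the kind the abstract says LLM agents fail to deploy on fixed networks. The payoffs follow the standard T=5, R=3, P=1, S=0 convention, and the strategy names are illustrative, not taken from the paper.

```python
# Standard Prisoner's Dilemma payoffs for the row player: T > R > P > S.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def play_repeated(strategy_a, strategy_b, rounds=10):
    """Repeated PD between two memory-one strategies; a strategy maps the
    opponent's previous move (None on round 1) to 'C' or 'D'."""
    score_a = score_b = 0
    last_a = last_b = None
    for _ in range(rounds):
        a, b = strategy_a(last_b), strategy_b(last_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        last_a, last_b = a, b
    return score_a, score_b

tit_for_tat = lambda opp_last: "C" if opp_last in (None, "C") else "D"
always_defect = lambda opp_last: "D"

print(play_repeated(tit_for_tat, tit_for_tat))    # (30, 30): sustained cooperation
print(play_repeated(tit_for_tat, always_defect))  # (9, 14): exploited once, then punished
```

Against a fixed partner, tit-for-tat is exploited only on the first round and then reciprocates defection, exactly the network-sensitive adjustment the study finds missing in LLM agents.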
Bodily Behaviour Recognition (BBR) and Eye Contact Detection (ECD) in multi-person group conversations are critical for understanding social dynamics, but traditional methods often rely solely on visual cues, lacking integration with semantic context. To address this, we propose a novel framework based on Large Vision-Language Models (LVLMs), leveraging their cross-modal alignment capability to fuse visual features (e.g., body postures, gaze directions) and linguistic semantics (e.g., behavioral category descriptions). A parameter-efficient tuning strategy using Low-Rank Adaptation (LoRA) is adopted, adapting only a subset of parameters in both the Language Model (LM) and Vision Transformer (ViT) modules, thus retaining pre-trained knowledge while reducing computational costs. The framework incorporates multi-task output heads to simultaneously predict BBR and ECD results. Experiments on the MPIIGroupInteraction dataset demonstrate superior performance: our method achieves 0.65 accuracy on BBR and 0.82 accuracy on ECD, outperforming state-of-the-art approaches by 0.02-0.03 in absolute terms. Ablation studies validate that applying LoRA to both LM and ViT with optimal hyperparameters yields the best results, confirming the importance of cross-modal synergy. This work highlights the potential of LVLMs in social behavior analysis, providing a lightweight and effective solution for understanding complex group interactions.
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the amplification of societal biases, especially in non-Western contexts where cultural and social nuances are often underrepresented. This study introduces a multi-agent bias detection framework to systematically evaluate GPT-4o, Claude 3.5 Sonnet, and Llama 3.3 across Indian social stigma categories, including caste, religion, gender, mental health, socio-economic status, appearance, language/region, and family dynamics. We present SocialStigmaQA, a benchmark dataset of 320 prompts, validated through expert review and pilot testing, and use the Overall Bias Detection Factor (OBDF) to measure model performance. Findings reveal that Claude 3.5 Sonnet achieved the highest OBDF (98.75%), demonstrating superior bias detection across all categories, while GPT-4o showed moderate performance (72.8%) with noticeable gaps in gender and socio-economic domains. Llama 3.3 scored the lowest (71%). The multi-agent framework enhanced detection accuracy by 25–30% over single-agent models, particularly in subtle bias areas. These results underscore the need for culturally contextualized evaluation frameworks and suggest that OBDF-like metrics should be integrated into India's AI auditing processes to ensure fairness, inclusivity, and ethical deployment of AI systems in sensitive sectors such as hiring, education, and governance.
As the performance of larger, newer Large Language Models continues to improve for strategic Theory of Mind (ToM) tasks, the demand for these state-of-the-art models increases commensurately. However, their deployment is costly both in terms of processing power and time. In this paper, we investigate the feasibility of creating smaller, highly-performing specialized algorithms by way of fine-tuning. To do this, we first present a large pre-trained model with 20 unique scenarios that combine different social contexts with games of varying social dilemmas, record its answers, and use them for Q&A fine-tuning on a smaller model of the same family. Our focus is on in-context game-theoretic decision-making, the same domain within which human interaction occurs and that requires both a theory of mind (or a semblance thereof) and an understanding of social dynamics. The smaller model is therefore trained not just on the answers provided, but also on the motivations provided by the larger model, which should contain advice and guidelines to navigate both strategic dilemmas and social cues. We find that the fine-tuned smaller language model consistently bridged the gap in performance between the smaller pre-trained version of the model and its larger relative and that its improvements extended in areas and contexts beyond the ones provided in the training examples, including on out-of-sample scenarios that include completely different game structures. On average for all games, through fine-tuning, the smaller model showed a 46% improvement measured as alignment towards the behavior of the larger model, with 100% representing indistinguishable behavior. When presented with out-of-sample social contexts and games, the fine-tuned model still displays remarkable levels of alignment, reaching an improvement of 18% and 28% respectively.
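The Q&A fine-tuning setup, training the smaller model on the larger model's answer plus the motivation behind it, amounts to packing records like the following. The chat-message layout and field names are assumptions; the scenario text is invented for illustration.

```python
import json

def build_distillation_record(scenario, teacher_answer, teacher_motivation):
    """Pack one fine-tuning example: the smaller model is trained on both
    the larger model's answer and its stated motivation, so the strategic
    advice travels with the decision."""
    return {
        "messages": [
            {"role": "user", "content": scenario},
            {"role": "assistant",
             "content": f"{teacher_answer}\n\nReasoning: {teacher_motivation}"},
        ]
    }

record = build_distillation_record(
    "Trust game with a stranger: send 0-10 coins; amounts sent are tripled.",
    "Send 5 coins.",
    "Partial trust hedges against defection while signalling cooperativeness.",
)
print(json.dumps(record, indent=2))
```

Keeping the motivation in the target text is the paper's key choice: the smaller model imitates not just the decision but the theory-of-mind rationale, which plausibly explains the reported out-of-sample transfer.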
The process of opinion expression and exchange is a critical component of democratic societies. As people interact with large language models (LLMs) in an opinion-shaping process different from traditional media, the impacts of LLMs are increasingly recognized and raising concern. However, knowledge of how LLMs affect the process of opinion expression and exchange in social opinion networks is very limited. Here, we create an opinion network dynamics model to encode the opinions of LLMs, the cognitive acceptability and usage strategies of individuals, and simulate the impact of LLMs on opinion dynamics in a variety of scenarios. The outcomes of the simulations inform effective, demand-oriented interventions in opinion networks. The results of this study suggest that the output opinion of LLMs has a unique and positive effect on the collective opinion difference. The marginal effect of cognitive acceptability on collective opinion formation is nonlinear and shows a decreasing trend. When people partially rely on LLMs, the exchange process of opinion becomes more intense and the diversity of opinion becomes more favorable. In fact, there is 38.6% more opinion diversity when people all partially rely on LLMs, compared to prohibiting the use of LLMs entirely. The optimal diversity of opinion was found when the fractions of people who do not use, partially rely on, and fully rely on LLMs reached roughly 4:12:1. Our experiments also find that by introducing extra agents with opposite/neutral/random opinions, we can effectively mitigate the impact of biased/toxic output from LLMs. Our findings provide valuable insights into opinion dynamics in the age of LLMs, highlighting the need for customized interventions tailored to specific scenarios to address the drawbacks of improper output and use of LLMs.
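A toy version of the reliance-fraction experiment can be written in a few lines. The update rule, step count, and the fixed 0.7 "LLM opinion" are all illustrative assumptions; this sketch is not expected to reproduce the paper's 38.6% figure or its optimal 4:12:1 split, only the shape of the experiment.

```python
import random, statistics

def opinion_diversity(fracs=(4, 12, 1), llm_opinion=0.7,
                      n=170, steps=3000, seed=0):
    """Toy stand-in for the paper's model: agents either ignore the LLM,
    partially rely on it (blend a random peer's view with the LLM's), or
    fully adopt its output. Returns the final opinion standard deviation
    as a diversity proxy."""
    rng = random.Random(seed)
    total = sum(fracs)
    roles = ["none"] * (n * fracs[0] // total)
    roles += ["partial"] * (n * fracs[1] // total)
    roles += ["full"] * (n - len(roles))
    op = [rng.random() for _ in range(n)]
    for _ in range(steps):
        i = rng.randrange(n)
        peer = op[rng.randrange(n)]
        target = {"none": peer,
                  "partial": (peer + llm_opinion) / 2,
                  "full": llm_opinion}[roles[i]]
        op[i] += 0.3 * (target - op[i])   # partial adjustment toward target
    return statistics.pstdev(op)

print(opinion_diversity())                 # mixed reliance at a 4:12:1 split
print(opinion_diversity(fracs=(1, 0, 0)))  # LLM use prohibited entirely
```

Sweeping `fracs` and comparing the resulting diversity is the toy analogue of the paper's comparison between partial-reliance and prohibition scenarios.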
Humans increasingly rely on large language models (LLMs) to support decisions in social settings. Previous work suggests that such tools shape people's moral and political judgements. However, the long-term implications of LLM-based social decision-making remain unknown. How will human cooperation be affected when the assessment of social interactions relies on language models? This is a pressing question, as human cooperation is often driven by indirect reciprocity, reputations, and the capacity to judge interactions of others. Here, we assess how state-of-the-art LLMs judge cooperative actions. We provide 21 different LLMs with an extensive set of examples where individuals cooperate -- or refuse to cooperate -- in a range of social contexts, and ask how these interactions should be judged. Furthermore, through an evolutionary game-theoretical model, we evaluate cooperation dynamics in populations where the extracted LLM-driven judgements prevail, assessing the long-term impact of LLMs on human prosociality. We observe a remarkable agreement in evaluating cooperation against good opponents. On the other hand, we notice within- and between-model variance when judging cooperation with ill-reputed individuals. We show that the differences revealed between models can significantly impact the prevalence of cooperation. Finally, we test prompts to steer LLM norms, showing that such interventions can shape LLM judgements, particularly through goal-oriented prompts. Our research connects LLM-based advice to long-term social dynamics, and highlights the need to carefully align LLM norms in order to preserve human cooperation.
As AI systems increasingly assume roles where trust and alignment with human values are essential, understanding when and why they engage in deception has become a critical research priority. We introduce The Traitors, a multi-agent simulation framework inspired by social deduction games, designed to probe deception, trust formation, and strategic communication among large language model (LLM) agents under asymmetric information. A minority of agents, the traitors, seek to mislead the majority, while the faithful must infer hidden identities through dialogue and reasoning. Our contributions are: (1) we ground the environment in formal frameworks from game theory, behavioral economics, and social cognition; (2) we develop a suite of evaluation metrics capturing deception success, trust dynamics, and collective inference quality; (3) we implement a fully autonomous simulation platform where LLMs reason over persistent memory and evolving social dynamics, with support for heterogeneous agent populations, specialized traits, and adaptive behaviors. Our initial experiments across DeepSeek-V3, GPT-4o-mini, and GPT-4o (10 runs per model) reveal a notable asymmetry: advanced models like GPT-4o demonstrate superior deceptive capabilities yet exhibit disproportionate vulnerability to others' falsehoods. This suggests deception skills may scale faster than detection abilities. Overall, The Traitors provides a focused, configurable testbed for investigating LLM behavior in socially nuanced interactions. We position this work as a contribution toward more rigorous research on deception mechanisms, alignment challenges, and the broader social reliability of AI systems.
Large Language Models (LLMs) demonstrate significant persuasive capabilities in one-on-one interactions, but their influence within social networks, where interconnected users and complex opinion dynamics pose unique challenges, remains underexplored. This paper addresses the research question: Can LLMs generate meaningful content that maximizes user engagement on social networks? To answer this, we propose a pipeline using reinforcement learning with simulated feedback, where the network's response to LLM-generated content (i.e., the reward) is simulated through a formal engagement model. This approach bypasses the temporal cost and complexity of live experiments, enabling an efficient feedback loop between the LLM and the network under study. It also allows control over endogenous factors such as the LLM's position within the social network and the distribution of opinions on a given topic. Our approach is adaptive to the opinion distribution of the underlying network and agnostic to the specifics of the engagement model, which is embedded as a plug-and-play component. Such flexibility makes it suitable for more complex engagement tasks and interventions in computational social science. Using our framework, we analyze the performance of LLMs in generating social engagement under different conditions, showcasing their full potential in this task. The experimental code is publicly available at https://github.com/mminici/Engagement-Driven-Content-Generation.
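The feedback loop (generator proposes content, a formal engagement model supplies the reward) can be caricatured as a bandit over candidate stances. The linear-decay engagement model, the warm-start initialization, and the stance grid are all assumptions standing in for the paper's plug-and-play component and RL fine-tuning.

```python
import random

def engagement_reward(stance, opinions):
    """Plug-and-play engagement model (an assumed form, not the paper's):
    per-user engagement decays linearly with stance-opinion distance."""
    return sum(max(0.0, 1.0 - abs(stance - o)) for o in opinions)

def train_generator(stances, opinions, episodes=300, eps=0.1, seed=0):
    """Epsilon-greedy bandit as a stand-in for the RL loop: the 'generator'
    learns which stance maximizes simulated network engagement."""
    rng = random.Random(seed)
    value = {s: engagement_reward(s, opinions) for s in stances}  # warm start
    count = {s: 1 for s in stances}
    for _ in range(episodes):
        if rng.random() < eps:
            s = rng.choice(stances)                   # explore
        else:
            s = max(stances, key=lambda x: value[x])  # exploit
        r = engagement_reward(s, opinions)            # simulated feedback
        count[s] += 1
        value[s] += (r - value[s]) / count[s]         # incremental mean
    return max(stances, key=lambda x: value[x])

opinions = [0.1, 0.15, 0.2, 0.8, 0.85]                # a polarized audience
best = train_generator([0.0, 0.25, 0.5, 0.75, 1.0], opinions)
print(best)  # 0.25: nearest the larger opinion cluster
```

Because the engagement model is just a callable, swapping in a richer model (or one conditioned on the generator's network position) changes nothing else in the loop, which is the plug-and-play property the paper emphasizes.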
The rapid proliferation of Location-Based Social Networks (LBSNs) has underscored the importance of Point-of-Interest (POI) recommendation systems in enhancing user experiences. While individual POI recommendation methods leverage users' check-in histories to provide personalized suggestions, they struggle to address scenarios requiring group decision-making. Group POI recommendation systems aim to satisfy the collective preferences of multiple users, but existing approaches face two major challenges: diverse group preferences and extreme data sparsity in group check-in data. To overcome these challenges, we propose LLMGPR, a novel framework that leverages large language models (LLMs) for group POI recommendations. LLMGPR introduces semantic-enhanced POI tokens and incorporates rich contextual information to model the diverse and complex dynamics of group decision-making. To further enhance its capabilities, we developed a sequencing adapter using Quantized Low-Rank Adaptation (QLoRA), which aligns LLMs with group POI recommendation tasks. To address the issue of sparse group check-in data, LLMGPR employs an aggregation adapter that integrates individual representations into meaningful group representations. Additionally, a self-supervised learning (SSL) task is designed to predict the purposes of check-in sequences (e.g., business trips and family vacations), thereby enriching group representations with deeper semantic insights. Extensive experiments demonstrate the effectiveness of LLMGPR, showcasing its ability to significantly enhance the accuracy and robustness of group POI recommendations.
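The aggregation-adapter step (combining individual representations into a group representation) can be sketched in miniature. Note the hedge: LLMGPR's adapter is learned, whereas the stand-in below is just a weighted mean over member embeddings, with `aggregate_group` a name invented here.

```python
def aggregate_group(individual_embeddings, weights=None):
    """Aggregation-adapter stand-in: combine per-user embedding vectors
    into one group vector. A learned adapter would produce the weights;
    here we default to a uniform mean."""
    n, dim = len(individual_embeddings), len(individual_embeddings[0])
    weights = weights or [1.0 / n] * n
    return [sum(w * emb[d] for w, emb in zip(weights, individual_embeddings))
            for d in range(dim)]
```

A group POI recommender would then score candidate POIs against this group vector instead of any single member's vector, which is how sparse group check-ins can lean on richer individual histories.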
No abstract available
With the prospect of autonomous artificial intelligence (AI) agents, studying their tendency for cooperative behavior becomes an increasingly relevant topic. This study is inspired by the super-additive cooperation theory, where the combined effects of repeated interactions and inter-group rivalry have been argued to be the cause for cooperative tendencies found in humans. We devised a virtual tournament where language model agents, grouped into teams, face each other in a Prisoner's Dilemma game. By simulating both internal team dynamics and external competition, we discovered that this blend substantially boosts both overall and initial, one-shot cooperation levels (the tendency to cooperate in one-off interactions). This research provides a novel framework for large language models to strategize and act in complex social scenarios and offers evidence for how intergroup competition can, counter-intuitively, result in more cooperative behavior. These insights are crucial for designing future multi-agent AI systems that can effectively work together and better align with human values. Source code is available at https://github.com/pippot/Superadditive-cooperation-LLMs.
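The tournament's core mechanic, a repeated Prisoner's Dilemma between agents, can be made concrete with the standard payoff matrix. This sketch replaces the paper's LLM agents with two classic hand-coded strategies (tit-for-tat and always-defect); the function names are illustrative.

```python
# Standard PD payoffs: (row player, column player) for Cooperate/Defect.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_prev):
    """Cooperate first, then mirror the opponent's last move."""
    return "C" if opponent_prev in (None, "C") else "D"

def always_defect(opponent_prev):
    return "D"

def play_repeated(strategy_a, strategy_b, rounds=10):
    """Repeated PD: each strategy maps the opponent's previous move
    (None on the first round) to C or D; returns total scores."""
    prev_a = prev_b = None
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(prev_b), strategy_b(prev_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        prev_a, prev_b = move_a, move_b
    return score_a, score_b
```

Mutual tit-for-tat yields sustained cooperation (30, 30 over ten rounds), while tit-for-tat against a defector collapses to mutual defection after the first round; the paper layers team membership and inter-group rivalry on top of exactly this kind of pairwise game.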
Social media platforms have become primary channels for marketing and information dissemination, but they are also susceptible to false marketing and coordinated hype campaigns that mislead consumers and distort market information. This paper proposes a multimodal large model-based framework for detecting false marketing and hype propagation on social platforms. The framework integrates visual content analysis, textual semantic understanding, user behavior patterns, and network propagation dynamics to identify deceptive promotional activities. We present fundamental concepts of multimodal learning, transformer architectures, and graph neural networks. The proposed system employs vision-language models for content analysis and attention mechanisms for feature extraction. Experimental results on real-world social media datasets demonstrate that the multimodal approach achieves high accuracy in identifying false marketing content and detecting coordinated manipulation campaigns.
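The multimodal integration step can be illustrated with a late-fusion sketch. This is a simplification on stated assumptions: the paper uses attention mechanisms to combine modalities, whereas below a fixed weighted sum merges per-modality suspicion scores (visual, textual, behavioral), and all names and the threshold are hypothetical.

```python
def fuse_scores(modality_scores, weights):
    """Late-fusion stand-in: weighted sum of per-modality suspicion
    scores in [0, 1]. An attention module would learn these weights."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(s * w for s, w in zip(modality_scores, weights))

def is_false_marketing(modality_scores, weights, threshold=0.5):
    """Flag content when the fused suspicion score crosses a threshold."""
    return fuse_scores(modality_scores, weights) >= threshold
```

For example, content with high visual and propagation-pattern suspicion but benign text (`[0.9, 0.2, 0.8]` with weights `[0.5, 0.3, 0.2]`) fuses to 0.67 and is flagged, which is the cross-modal corroboration the framework relies on.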
Large language models (LLMs) have demonstrated an impressive ability to role-play humans and replicate complex social dynamics. While large-scale social simulations are gaining increasing attention, they still face significant challenges, particularly regarding high time and computation costs. Existing solutions, such as distributed mechanisms or hybrid agent-based model (ABM) integrations, either fail to address inference costs or compromise accuracy and generalizability. To this end, we propose EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation. EcoLANG operates in two stages: (1) language evolution, where we filter synonymous words and optimize sentence-level rules through natural selection, and (2) language utilization, where agents in social simulations communicate using the evolved language. Experimental results demonstrate that EcoLANG reduces token consumption by over 20%, enhancing efficiency without sacrificing simulation accuracy.
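The language-utilization stage (agents communicating in a reduced, evolved vocabulary) can be sketched as a canonicalization pass. The synonym table below is a made-up example; in EcoLANG the mapping is induced by the evolution stage, not hand-written.

```python
def compress_message(message, synonym_map):
    """Language-utilization stand-in: rewrite each word to its canonical
    form so agents share a smaller vocabulary and spend fewer tokens."""
    return " ".join(synonym_map.get(word, word) for word in message.split())

# Hypothetical evolved mapping; EcoLANG would induce this via selection.
SYNONYMS = {"assist": "help", "aid": "help",
            "purchase": "buy", "acquire": "buy"}
```

Applied to `"please assist me to purchase and acquire supplies"`, this yields `"please help me to buy and buy supplies"`; collapsing synonyms onto short, common forms is one concrete route to the >20% token savings the abstract reports.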
Recent advancements in Large Language Models offer promising capabilities to simulate complex human social interactions. We investigate whether LLM-based multi-agent simulations can reproduce core human social dynamics observed in online forums. We evaluate conformity dynamics, group polarization, and fragmentation across different model scales and reasoning capabilities using a structured simulation framework. Our findings indicate that smaller models exhibit higher conformity rates, whereas models optimized for reasoning are more resistant to social influence.
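The conformity dynamic measured above can be illustrated with a one-parameter opinion model. This is a classical averaging sketch, not the paper's framework: a single `conformity` rate stands in for a model's measured susceptibility to social influence, with higher rates playing the role of smaller models.

```python
def residual_disagreement(opinions, conformity, steps=20):
    """Each step, every agent moves toward the group mean by `conformity`
    (a stand-in for an LLM's susceptibility to influence). Returns the
    final opinion spread: smaller spread = stronger conformity."""
    ops = list(opinions)
    for _ in range(steps):
        mean = sum(ops) / len(ops)
        ops = [o + conformity * (mean - o) for o in ops]
    return max(ops) - min(ops)
```

Under this model a high-conformity population collapses toward consensus while a low-conformity (reasoning-resistant) population retains disagreement, mirroring the small-model vs. reasoning-model contrast in the findings.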
Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM-agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large-scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.
Health misinformation across digital platforms has emerged as a critical, fast-growing challenge to global public health, undermining trust in science and contributing to vaccine hesitancy, treatment refusal and heightened health risks. In response, this study introduces Impact, a novel simulation framework that integrates agent-based modeling (ABM) with large language models (LLMs) and retrieval-augmented generation (RAG) to evaluate and optimize health communication strategies in complex online environments. By modeling virtual populations characterized by demographic, psychosocial, and emotional attributes, embedded within network structures that replicate the dynamics of digital platforms, the framework captures how individuals perceive, interpret and propagate both factual and misleading health messages. Messages are enriched with evidence from authoritative medical sources and iteratively refined through sentiment analysis and comparative testing, allowing the proactive pre-evaluation of diverse communication framings. Results demonstrate that misinformation spreads more rapidly than factual content, but that corrective strategies, particularly empathetic and context-sensitive messages delivered through trusted peers, can mitigate polarization, enhance institutional trust and sustain long-term acceptance of evidence-based information. These findings underscore the importance of adaptive, data-driven approaches to health communication and highlight the potential of simulation-based methods to inform scalable interventions capable of strengthening resilience against misinformation in digitally connected societies.
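The differential spread of misinformation versus factual content can be sketched as a transmission-rate contrast on a network. This is a bare-bones propagation toy on stated assumptions: the framework's demographic and psychosocial attributes are collapsed into a single per-message `rate`, with misinformation modeled simply as a higher rate.

```python
import random

def spread(adjacency, seeds, rate, steps=10, seed=0):
    """Toy message propagation: each step, every informed node passes the
    message to each neighbor with probability `rate`. Returns how many
    nodes end up informed. Misinformation = higher rate than facts."""
    rng = random.Random(seed)
    informed = set(seeds)
    for _ in range(steps):
        newly = {nb for node in informed
                 for nb in adjacency[node] if rng.random() < rate}
        informed |= newly
    return len(informed)
```

On a 20-node ring, a message with rate 1.0 saturates the network within ten steps while rate 0.0 never leaves its seed; intermediate rates reproduce the qualitative gap the abstract reports, and a corrective-strategy experiment would compare rates with and without trusted-peer seeding.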
This report consolidates the core research findings on "long-horizon, multi-agent, multimodal social scenarios," building a complete framework from low-level technical foundations to high-level social applications. The research covers: 1) macro-level social dynamics simulation, revealing the laws governing the evolution of group behavior; 2) micro-level long-term memory management, addressing the challenge of interaction consistency; 3) meso-level game-theoretic cooperation mechanisms, optimizing the effectiveness of multi-agent decision-making; 4) multimodal perception and embodied intelligence, improving agents' ability to survive in and understand complex environments; 5) system safety and social governance, addressing the risks posed by the socialization of AI; 6) deployment in vertical domains and benchmark construction, driving the intelligent transformation of industries such as healthcare and education. Together, these studies point toward building general-purpose agent systems with strong social intelligence, long-term stability, and multimodal interaction capabilities.