Long-Horizon Multi-Agent Multimodal Social Scenarios
Macro-Level Social Simulation and Group Dynamics Evolution
This group of papers explores using large language models (LLMs) as proxies for humans to simulate the macro-level dynamics of large-scale social systems, covering complex social phenomena such as opinion polarization, the emergence of social norms, the spiral of silence, population migration, and disaster risk perception.
- Sense and Sensitivity: Evaluating the simulation of social dynamics via Large Language Models(Da Ju, Adina Williams, Brian Karrer, Maximilian Nickel, 2024, ArXiv)
- Simulating Human Society with Large Language Model Agents: City, Social Media, and Economic System(Chen Gao, Fengli Xu, Xu Chen, Xiang Wang, Xiangnan He, Yong Li, 2024, Companion Proceedings of the ACM Web Conference 2024)
- Exploring the Potential of Conversational AI Support for Agent-Based Social Simulation Model Design(Peer-Olaf Siebers, 2024, J. Artif. Soc. Soc. Simul.)
- Quantifying the Lifelong Impact of Resilience Interventions via Agent-Based LLM Simulation(Vivienne L'Ecuyer Ming, 2025, ArXiv)
- RELATE-Sim: Leveraging Turning Point Theory and LLM Agents to Predict and Understand Long-Term Relationship Dynamics through Interactive Narrative Simulations(Matthew Yue, Zhikun Xu, Vivek Gupta, Thao Ha, Liesal Sharabi, Ben Zhou, 2025, ArXiv)
- Multimodal LLM-Based Agent for Human Behavior Simulation: Modeling Return Migration Dynamics(Xiaoluan Liu, Xinyu Lin, Fangbin Qiao, 2025, Data Intelligence)
- GenSim: A General Social Simulation Platform with Large Language Model based Agents(Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, Bolin Ding, Jingren Zhou, Jun Wang, J. Wen, 2024, No journal)
- Simulating Filter Bubble on Short-video Recommender System with Large Language Model Agents(Nicholas Sukiennik, Haoyu Wang, Zailin Zeng, Chen Gao, Yong Li, 2025, ArXiv)
- Decoding the Silent Majority: Inducing Belief Augmented Social Graph with Large Language Model for Response Forecasting(Chenkai Sun, Jinning Li, Y. Fung, Hou Pong Chan, Tarek F. Abdelzaher, ChengXiang Zhai, Heng Ji, 2023, No journal)
- Understanding Online Polarization Through Human-Agent Interaction in a Synthetic LLM-Based Social Network(Tim Donkers, Jürgen Ziegler, 2025, No journal)
- Spiral of Silence in Large Language Model Agents(Mingze Zhong, Meng Fang, Zijing Shi, Yuxuan Huang, Shunfeng Zheng, Yali Du, Ling Chen, Jun Wang, 2025, ArXiv)
- Emergent Social Dynamics of LLM Agents in the El Farol Bar Problem(Ryosuke Takata, A. Masumori, Takashi Ikegami, 2025, ArXiv)
- Synthetic Social Media Influence Experimentation Via an Agentic Reinforcement Learning Large Language Model Bot(Bailu Jin, Weisi Guo, 2024, J. Artif. Soc. Soc. Simul.)
- The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems(Prateek Gupta, Qiankun Zhong, Hiromu Yakura, Thomas F. Eisenmann, Iyad Rahwan, 2025, ArXiv)
- Emergence of Social Norms in Generative Agent Societies: Principles and Architecture(Siyue Ren, Zhiyao Cui, Ruiqi Song, Zhen Wang, Shuyue Hu, 2024, No journal)
- Towards Simulating Social Influence Dynamics with LLM-Based Multi-Agents(Hsien-Tsung Lin, Pei-Cing Huang, Chan-Tung Ku, Chan Hsu, Pei-Xuan Shieh, Yihuang Kang, 2025, 2025 IEEE International Conference on Information Reuse and Integration and Data Science (IRI))
- Simulating theory and society: How multi-agent artificial intelligence modeling contributes to renewal and critique in social theory(F. Shults, 2025, Theory and Society)
- MF-LLM: Simulating Collective Decision Dynamics via a Mean-Field Large Language Model Framework(Qirui Mi, Mengyue Yang, Xiangning Yu, Zhiyu Zhao, Cheng Deng, Bo An, Haifeng Zhang, Xu Chen, Jun Wang, 2025, ArXiv)
- Quantifying the Impact of Large Language Models on Collective Opinion Dynamics(Chao Li, Xingye Su, Haoying Han, Cong Xue, Chunmo Zheng, C. Fan, 2023, ArXiv)
- Multi-Stage Simulation of Residents' Disaster Risk Perception and Decision-Making Behavior: An Exploratory Study on Large Language Model-Driven Social-Cognitive Agent Framework(Xinjie Zhao, Hao Wang, Chengxiao Dai, Jiacheng Tang, Kaixin Deng, Zhihua Zhong, Fanying Kong, Shiyun Wang, So Morikawa, 2025, Syst.)
- SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users(Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, Guanying Li, Ling Yan, Yao Hu, Siming Chen, Yu Wang, Jingxuan Huang, Jiebo Luo, Shiping Tang, Libo Wu, Baohua Zhou, Zhongyu Wei, 2025, ArXiv)
- PopSim: Social Network Simulation for Social Media Popularity Prediction(Yijun Liu, Wu Liu, Xiaoyan Gu, Allen He, Weiping Wang, Yongdong Zhang, 2025, ArXiv)
- Harnessing Large Language Models for Group POI Recommendations(Jing Long, Liang Qu, Junliang Yu, Tong Chen, Quoc Viet Hung Nguyen, Hongzhi Yin, 2024, Proceedings of the 34th ACM International Conference on Information and Knowledge Management)
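The polarization and consensus phenomena surveyed above are classically studied with simple numeric opinion-dynamics models; the LLM-agent works in this section essentially replace the numeric update rule with model-generated behavior. As a point of reference, here is a minimal sketch of the classical Deffuant bounded-confidence model (all parameter values are illustrative, and this is a textbook baseline rather than the method of any cited paper):

```python
import random

def deffuant_step(opinions, eps=0.2, mu=0.5):
    """One interaction of the Deffuant bounded-confidence model: two random
    agents move toward each other only if their opinions differ by less
    than the confidence bound eps."""
    i, j = random.sample(range(len(opinions)), 2)
    if abs(opinions[i] - opinions[j]) < eps:
        delta = mu * (opinions[j] - opinions[i])
        opinions[i] += delta
        opinions[j] -= delta
    return opinions

random.seed(0)
ops = [random.random() for _ in range(100)]
for _ in range(20000):
    deffuant_step(ops)

# With a small confidence bound, opinions typically settle into a few
# separated clusters rather than a single consensus -- a toy analogue
# of the polarization effects studied above.
clusters = sorted({round(o, 1) for o in ops})
```

The key qualitative behavior is that agents outside each other's confidence bound never interact, so initial disagreement can harden into persistent clusters.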
Long-Horizon Memory Architectures and Interaction Consistency Maintenance
This line of work targets the memory bottleneck agents face in long-horizon interaction, covering external memory mechanisms (RAG), reflective memory, hierarchical storage (STM/LTM), and dynamic pruning techniques to preserve persona consistency across sessions.
- In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents(Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long T. Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, Tomas Pfister, 2025, ArXiv)
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory(P. Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, Deshraj Yadav, 2025, No journal)
- Exploring and Controlling Diversity in LLM-Agent Conversation(Kuanchao Chu, Yi-Pei Chen, Hideki Nakayama, 2024, No journal)
- SupportPlay: A Multi-Agent Role-Playing System for Personalized and Sustained Multimodal Emotional Support Conversation(Geng Tu, Bingbing Wang, Erik Cambria, Wenjie Li, Ruifeng Xu, 2025, Companion Proceedings of the ACM on Web Conference 2025)
- TeleMem: Building Long-Term and Multimodal Memory for Agentic AI(Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Luxi Lin, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, Xuelong Li, 2025, ArXiv)
- Memory Management Strategies for Maintaining Long-Term Dialogue Coherence and Personalization in Generation Chatbots(Vignyanand Penumatcha, 2025, 2025 5th International Conference on Emerging Research in Electronics, Computer Science and Technology (ICERECT))
- SALM: A Multi-Agent Framework for Language Model-Driven Social Network Simulation(Gaurav Koley, 2025, ArXiv)
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding(Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li, 2024, ArXiv)
- Toward Conversational Agents with Context and Time Sensitive Long-term Memory(Nick Alonso, Tomas Figliolia, A. Ndirango, Beren Millidge, 2024, ArXiv)
- MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation(Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao, 2025, No journal)
- Think-in-Memory: Recalling and Post-thinking Enable LLMs with Long-Term Memory(Lei Liu, Xiaoyan Yang, Yue Shen, Binbin Hu, Zhiqiang Zhang, Jinjie Gu, Guannan Zhang, 2023, ArXiv)
- Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents(Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong, 2026, ArXiv)
- Evaluating Very Long-Term Conversational Memory of LLM Agents(Adyasha Maharana, Dong-Ho Lee, S. Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang, 2024, ArXiv)
- MIRIX: Multi-Agent Memory System for LLM-Based Agents(Yu Wang, Xi Chen, 2025, ArXiv)
- Prompted LLMs as Chatbot Modules for Long Open-domain Conversation(Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, Kangwook Lee, 2023, No journal)
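The STM/LTM hierarchy with salience-based pruning described above can be sketched as a two-tier store: a bounded short-term buffer whose overflow is either promoted to long-term memory or pruned by a time-decayed salience score. This is a minimal illustration, not the architecture of any particular cited system; all class names, thresholds, and the keyword-overlap retrieval are hypothetical simplifications.

```python
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    text: str
    salience: float
    t: float = field(default_factory=time.time)

class HierarchicalMemory:
    """Two-tier memory: a bounded STM deque; items evicted from STM are
    promoted to LTM if their decayed salience clears a threshold,
    otherwise they are pruned (forgotten)."""
    def __init__(self, stm_size=4, promote_threshold=0.5, half_life=3600.0):
        self.stm = deque(maxlen=stm_size)
        self.ltm = []
        self.promote_threshold = promote_threshold
        self.half_life = half_life

    def _score(self, item):
        # Salience decays exponentially with age (half_life in seconds).
        age = time.time() - item.t
        return item.salience * 0.5 ** (age / self.half_life)

    def add(self, text, salience):
        if len(self.stm) == self.stm.maxlen:
            oldest = self.stm[0]          # about to be evicted by append()
            if self._score(oldest) >= self.promote_threshold:
                self.ltm.append(oldest)   # promote; otherwise pruned
        self.stm.append(MemoryItem(text, salience))

    def recall(self, query, k=3):
        # Toy retrieval: rank all items by word overlap with the query;
        # real systems would use embedding similarity (RAG).
        pool = list(self.stm) + self.ltm
        q = set(query.lower().split())
        pool.sort(key=lambda m: len(q & set(m.text.lower().split())), reverse=True)
        return [m.text for m in pool[:k]]
```

Persona-relevant facts ("user likes jazz") survive eviction via promotion, while low-salience chatter is pruned — the basic mechanism behind cross-session consistency in the systems surveyed above.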
Multi-Agent Games, Cooperation Strategies, and Decision Coordination
These papers focus on the strategy choices of multiple agents in social dilemmas, diplomatic negotiation, and non-cooperative games, covering power dynamics, deception detection, trust formation, multi-perspective debate mechanisms, and reinforcement-learning-driven co-evolution.
- Will Systems of LLM Agents Cooperate: An Investigation into a Social Dilemma(Richard Willis, Yali Du, Joel Z. Leibo, Michael Luck, 2025, ArXiv)
- How large language models judge and influence human cooperation(Alexandre S. Pires, Laurens Samson, S. Ghebreab, Fernando P. Santos, 2025, ArXiv)
- Navigating Social Dilemmas with LLM-based Agents via Consideration of Future Consequences(D. Nguyen, Hung Le, Kien Do, Sunil Gupta, S. Venkatesh, T. Tran, 2025, No journal)
- I Want to Break Free! Persuasion and Anti-Social Behavior of LLMs in Multi-Agent Settings with Social Hierarchy(G. Campedelli, Nicolò Penzo, Massimo Stefan, Roberto Dessì, Marco Guerini, Bruno Lepri, Jacopo Staiano, 2024, ArXiv)
- Emergent Social Learning via Multi-agent Reinforcement Learning(Kamal Ndousse, Douglas Eck, S. Levine, Natasha Jaques, 2020, No journal)
- Affect-Aware Agents for Emergent Social Conflict in Games(Weilun Deng, 2025, Applied and Computational Engineering)
- Socialized Learning and Emergent Behaviors in Multi-Agent Systems based on Multimodal Large Language Models(Sureyya Akin, Shruti T. Tiwari, R. Bhattacharya, Sagar A. Raman, Kiran Mohanty, Sita Krishnan, 2025, ArXiv)
- EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation(Xinyi Mou, Chen Qian, Wei Liu, Xuanjing Huang, Zhongyu Wei, 2025, ArXiv)
- Competing LLM Agents in a Non-Cooperative Game of Opinion Polarisation(Amin Qasmi, Usman Naseem, Mehwish Nasim, 2025, 2025 IEEE International Conference on Big Data (BigData))
- The Traitors: Deception and Trust in Multi-Agent Language Model Simulations(Pedro M. P. Curvo, 2025, ArXiv)
- Static network structure cannot stabilize cooperation among large language model agents(Jingxin Han, B. Battu, Ivan Romic, Talal Rahwan, Petter Holme, 2024, PLOS One)
- Richelieu: Self-Evolving LLM-Based Agents for AI Diplomacy(Zhenyu Guan, Xiangyu Kong, Fangwei Zhong, Yizhou Wang, 2024, ArXiv)
- WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate(Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng, 2025, ArXiv)
- OmniNova:A General Multimodal Agent Framework(Pengfei Du, 2025, ArXiv)
- Global Reward to Local Rewards: Multimodal-Guided Decomposition for Improving Dialogue Agents(Dong Won Lee, Hae Won Park, Yoon Kim, Cynthia Breazeal, Louis-philippe Morency, 2024, No journal)
- Improving Multi-Agent Debate with Sparse Communication Topology(Yunxuan Li, Y. Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, Eugene Ie, 2024, No journal)
- AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis(Callie C. Liao, Duoduo Liao, Sai Surya Gadiraju, 2025, ArXiv)
- MV-Debate: Multi-view Agent Debate with Dynamic Reflection Gating for Multimodal Harmful Content Detection in Social Media(Rui Lu, Jinhe Bi, Yunpu Ma, Feng Xiao, Yuntao Du, Yijun Tian, 2025, ArXiv)
- Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety(Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu, 2025, ArXiv)
- Super-additive Cooperation in Language Model Agents(Filippo Tonini, Lukas Galke, 2025, ArXiv)
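Many of the social-dilemma studies above build on the iterated prisoner's dilemma. A minimal sketch with the standard payoff matrix and two classic hand-coded strategies follows; the cited works replace these policies with LLM agents, but the game mechanics are the same:

```python
# Standard prisoner's dilemma payoffs: (my_move, their_move) -> my payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(opponent_moves):
    # Cooperate first, then mirror the opponent's previous move.
    return opponent_moves[-1] if opponent_moves else "C"

def always_defect(opponent_moves):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    """Iterate the game; each strategy sees only the opponent's history."""
    moves_a, moves_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(moves_b)
        b = strategy_b(moves_a)
        moves_a.append(a)
        moves_b.append(b)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
    return score_a, score_b
```

Mutual tit-for-tat sustains cooperation (30 points each over 10 rounds), while defection against it yields a one-round gain followed by mutual punishment — the tension that the LLM-agent studies probe at scale.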
Multimodal Social Perception, Embodied Intelligence, and Environment Interaction
This strand studies how agents integrate audio-visual signals to recognize social norms, emotional states, and cultural biases, and examines task planning and symbol emergence for embodied agents in physical or virtual environments (e.g., households, cities, and soccer fields).
- LVLM-HBA: Large Vision-Language Model with Cross-Modal Alignment for Human Behavior Analysis(Jun Yu, Xilong Lu, Lingsi Zhu, Qiang Ling, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Towards creating a conversational memory for long-term meeting support: predicting memorable moments in multi-party conversations through eye-gaze(Maria Tsfasman, Kristian Fenech, Morita Tarvirdians, András Lőrincz, C. Jonker, Catharine Oertel, 2022, Proceedings of the 2022 International Conference on Multimodal Interaction)
- Leveraging Recurrent Neural Networks for Multimodal Recognition of Social Norm Violation in Dialog(Tiancheng Zhao, Ran Zhao, Zhao Meng, Justine Cassell, 2016, ArXiv)
- No Robot is an Island: An Always-On Cognitive Architecture for Social Context Awareness in Dynamic Environments*(Dario Pasquali, Luca Garello, G. Belgiovine, O. Eldardeer, Linda Lastrico, Francesco Rea, Fulvio Mastrogiovanni, G. Sandini, A. Sciutti, 2025, 2025 IEEE International Conference on Development and Learning (ICDL))
- A modular architecture for creating multimodal embodied agents with an episodic Knowledge Graph as an explainable and controllable long-term memory(Thomas Baier, Selene Baez Santamaria, Piek Vossen, 2025, Dialogue Discourse)
- Higher-Order Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions(Minwoo Kang, Suhong Moon, Seungyong Lee, Ayush Raj, Joseph Suh, David M. Chan, 2025, ArXiv)
- Larger Encoders, Smaller Regressors: Exploring Label Dimensionality Reduction and Multimodal Large Language Models as Feature Extractors for Predicting Social Perception(Iván Martín-Fernández, Sergio Esteban-Romero, Jaime Bellver-Soler, F. Fernández-Martínez, M. Gil-Martín, 2024, Proceedings of the 5th on Multimodal Sentiment Analysis Challenge and Workshop: Social Perception and Humor)
- Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting(Ana Rita Ortigoso, Gabriel Vieira, Daniel Fuentes, Luís Frazão, Nuno Costa, António Pereira, 2025, ArXiv)
- Multimodal emotion estimation and emotional synthesize for interaction virtual agent(Minghao Yang, J. Tao, Hao Li, Kaihui Mu, 2012, 2012 IEEE 2nd International Conference on Cloud Computing and Intelligence Systems)
- PRISM: A Personality-Driven Multi-Agent Framework for Social Media Simulation(Zhixiang Lu, Xueyuan Deng, Yiran Liu, Yulong Li, Qiang Yan, Imran Razzak, Jionglong Su, 2025, ArXiv)
- Decoding cultural tapestries: A deep dive into Indian social stigma patterns in large language models(Sridhar Jonnala, Rushikesh Tade, N. Thomas, 2025, Journal of Asian Scientific Research)
- Human Behavior Atlas: Benchmarking Unified Psychological and Social Behavior Understanding(Keane Ong, Wei Dai, Carol Li, Dewei Feng, Hengzhi Li, Jingyao Wu, Jiaee Cheong, Rui Mao, G. Mengaldo, Erik Cambria, P. Liang, 2025, ArXiv)
- Multi levels semantic architecture for multimodal interaction(S. Dourlens, A. Ramdane-Cherif, É. Monacelli, 2013, Applied Intelligence)
- SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems(Dong Zhang, Zhaowei Li, Pengyu Wang, Xin Zhang, Yaqian Zhou, Xipeng Qiu, 2024, ArXiv)
- CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation(Nicolas Bougie, Narimasa Watanabe, 2025, No journal)
- Demonstrating EMMA: Embodied MultiModal Agent for Language-guided Action Execution in 3D Simulated Environments(Alessandro Suglia, Bhathiya Hemanthage, Malvina Nikandrou, G. Pantazopoulos, Amit Parekh, Arash Eshghi, Claudio Greco, Ioannis Konstas, Oliver Lemon, Verena Rieser, 2022, No journal)
- HoME: a Household Multimodal Environment(Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, L. Celotti, Florian Strub, J. Rouat, H. Larochelle, Aaron C. Courville, 2017, ArXiv)
- GraphCortex: A Visual Language Model-Guided Knowledge Graph Based Reasoning Framework for Robotic Long-Term Task Planning(Shaozhuo Huang, Nan Li, Jue Zhang, 2025, 2025 2nd International Conference on Intelligent Computing and Robotics (ICICR))
- Simulation for All: A Step-by-Step Cookbook for Developing Human-Centered Multi-Agent Transportation Simulators(S. Azimi, Arash Tavakoli, 2025, IEEE Access)
- CH-MARL: A Multimodal Benchmark for Cooperative, Heterogeneous Multi-Agent Reinforcement Learning(Vasu Sharma, P. Goyal, Kaixiang Lin, Govind Thattai, Qiaozi Gao, G. Sukhatme, 2022, ArXiv)
- Symbol Emergence as an Interpersonal Multimodal Categorization(Y. Hagiwara, Hiroyoshi Kobayashi, Akira Taniguchi, T. Taniguchi, 2019, Frontiers in Robotics and AI)
- Designing a Data Corpus of Collaborative Group Tasks with the Members from Unbalanced Cultural Backgrounds(Kaihua Ding, Hung-Hsuan Huang, Nicolas Berberich, Mineya Kaseda, K. Kuwabara, T. Nishida, 2019, Proceedings of the 7th International Conference on Human-Agent Interaction)
- CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart(Bowen Zhao, Tianhao Cheng, Yuejie Zhang, Ying Cheng, Rui Feng, Xiaobo Zhang, 2024, Proceedings of the 32nd ACM International Conference on Multimedia)
- Towards an intelligent framework for multimodal affective data analysis(Soujanya Poria, E. Cambria, A. Hussain, G. Huang, 2015, Neural networks : the official journal of the International Neural Network Society)
- Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions(Jihyoung Jang, Minwook Bae, Minji Kim, Dilek Hakkani-Tur, Hyounghun Kim, 2025, No journal)
- SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph(Yuxing Long, Binyuan Hui, Fulong Ye, Yanyang Li, Zhuoxin Han, Caixia Yuan, Yongbin Li, Xiaojie Wang, 2023, No journal)
- TongSIM: A General Platform for Simulating Intelligent Machines(Zhe Sun, Kunlun Wu, Chuanjian Fu, Ze Song, L. Shi, Ziheng Xue, Bohan Jing, Ying-Jie Yang, Xiaomeng Gao, Aijia Li, Tianyu Guo, Huiying Li, Xueyuan Yang, Rongkai Liu, Xinyi He, Yuxi Wang, Yue Li, Mingyuan Liu, Yujie Lu, Hong-Kai Xie, Shiyun Zhao, Bo Dai, Wei Wang, Tao Yuan, Song Zhu, Yujia Peng, Zhenliang Zhang, 2025, ArXiv)
- MineLand: Simulating Large-Scale Multi-Agent Interactions with Limited Multimodal Senses and Physical Needs(Xianhao Yu, Jiaqi Fu, Renjia Deng, Wenjuan Han, 2024, ArXiv)
- EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM(Shuang Ao, Flora D. Salim, Simon Khan, 2025, ArXiv)
- Multimodal Embodied Interactive Agent for Cafe Scene(Yang Liu, Xinshuai Song, Kaixuan Jiang, Weixing Chen, Jing-Hua Luo, Guanbin Li, Liang Lin, 2024, ArXiv)
- Spatial-Temporal Aligned Multi-Agent Learning for Visual Dialog Systems(Yong Zhuang, Tong Yu, Junda Wu, Shiqu Wu, Shuai Li, 2022, Proceedings of the 30th ACM International Conference on Multimedia)
- Event2Tracking: Reconstructing Multi-Agent Soccer Trajectories Using Long-Term Multimodal Context(Harry Hughes, Michael Horton, Xinyu Wei, Harshala Gammulle, C. Fookes, S. Sridharan, P. Lucey, 2025, No journal)
Social Safety Governance, Risk Prevention, and Ethical Bias
These works examine safety threats in multi-agent systems (such as infectious jailbreaks), misinformation diffusion, echo-chamber effects, hate-speech detection, and governance mechanisms for controversial multimodal content.
- Multimodal Safety Evaluation in Generative Agent Social Simulations(Alhim Vera, Karen Sanchez, Carlos Hinojosa, Haidar Bin Hamid, Donghoon Kim, Bernard Ghanem, 2025, ArXiv)
- Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast(Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin, 2024, No journal)
- Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions(Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao, 2024, No journal)
- From Misinformation to Resilient Communication: Strategic Simulation of Social Network Dynamics in the Pharmaceutical Industry(Filippo Ghisi, Marco Gotelli, Vittorio Solina, Flavio Tonelli, 2025, Applied Sciences)
- Generative Agents for Multimodal Controversy Detection(Tianjiao Xu, Jinfei Gao, Keyi Kong, Jianhua Yin, Tian Gan, Liqiang Nie, 2024, No journal)
- Large Language Model Driven Agents for Simulating Echo Chamber Formation(Chenhao Gu, Ling Luo, Zainab R. Zaidi, S. Karunasekera, 2025, ArXiv)
- Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs(Firoj Alam, Md. Rafiul Biswas, Uzair Shah, W. Zaghouani, Georgios Mikros, 2024, ArXiv)
- Multimodal Large Model-based False Marketing and Hype Propagation Detection on Social Platforms(Yitong Yang, Fang Lin, Yuheng Li, 2025, 2025 7th International Conference on Frontier Technologies of Information and Computer (ICFTIC))
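The "infectious jailbreak" threat studied in Agent Smith can be pictured with a stylized contact-spread model: one compromised agent exposes a few random peers to an adversarial payload each round, so the compromised population can grow multiplicatively. This sketch is purely illustrative, not the paper's actual attack; all parameter values are assumptions.

```python
import random

def simulate_spread(n_agents=64, contacts_per_round=2, p_transmit=0.9,
                    rounds=8, seed=1):
    """Stylized spread model: each round, every compromised agent contacts
    a few random peers, each of which is compromised with probability
    p_transmit. Returns the infection curve over rounds."""
    random.seed(seed)
    infected = {0}  # a single initially-jailbroken agent
    curve = [len(infected)]
    for _ in range(rounds):
        newly = set()
        for _agent in infected:
            for peer in random.sample(range(n_agents), contacts_per_round):
                if peer not in infected and random.random() < p_transmit:
                    newly.add(peer)
        infected |= newly
        curve.append(len(infected))
    return curve
```

The qualitative takeaway matches the section's concern: without containment, per-round growth is proportional to the number of already-compromised agents, so defenses must act early.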
Vertical-Domain Applications and Benchmarks for Interactive Systems
This section showcases concrete applications of multi-agent systems in education, healthcare, finance, government services, and digital ecosystems (e-commerce, recommender systems), and presents domain-specific evaluation benchmark platforms (e.g., OS operation, clinical diagnosis).
- Research on a virtual teacher personalized interaction model integrating affective computing and multi-agent systems(Rili Dang, Noorazman Abd Samad, 2025, Future Technology)
- Agent-to-Agent (A2A) Protocol Integrated Digital Twin System with AgentIQ for Multimodal AI Fitness Coaching and Personalized Well-Being(Kamran Gholizadeh HamlAbadi, M. Vahdati, Fedwa Laamarti, Abdulmotaleb El Saddik, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Design of an Immersive Basketball Tactical Training System Based on Digital Twins and Federated Learning(Xiongce Lv, Ye Tao, Yifang Zhang, Yang Xue, 2025, Applied Sciences)
- Research on an AI Interview Evaluation System Integrating Multi-Agent Systems and Virtual Digital Humans(Jiayi Wu, Jiaqi Zhang, Li Gao, Jialiang Feng, Bo Meng, Yifan Wu, Mingming Gong, 2025, Journal of Big Data and Computing)
- Social Governance Oriented Multimodal Situation Perception and Bilateral Collaborative Scheduling Simulation(Yanxing Chen, Jun Wu, Renjie Li, Ran Xu, Yaqin Li, Youcheng Yang, 2024, 2024 17th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI))
- MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications(Aleksandr Algazinov, Matt Laing, Paul Laban, 2025, ArXiv)
- HyLECA: A Framework for Developing Hybrid Long-term Engaging Controlled Conversational Agents(Erkan Basar, Divyaa Balaji, Linwei He, Iris Hendrickx, E. Krahmer, Gert-Jan de Bruijn, Tibor Bosse, 2023, Proceedings of the 5th International Conference on Conversational User Interfaces)
- LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation(Yijun Liu, Wu Liu, Xiaoyan Gu, Yong Rui, Xiaodong He, Yongdong Zhang, 2024, ArXiv)
- LLM-Empowered Creator Simulation for Long-Term Evaluation of Recommender Systems Under Information Asymmetry(Xiaopeng Ye, Chen Xu, ZhongXiang Sun, Jun Xu, Gang Wang, Zhenhua Dong, Jirong Wen, 2025, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)
- Exploring Recommender System Evaluation: A Multi-Modal User Agent Framework for A/B Testing(Wenlin Zhang, Xiangyang Li, Qiyuan Ge, Kuicai Dong, Pengyue Jia, Xiaopeng Li, Zijian Zhang, Maolin Wang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, 2026, ArXiv)
- Engagement-Driven Content Generation with Large Language Models(Erica Coppolillo, Federico Cinus, Marco Minici, Francesco Bonchi, Giuseppe Manco, 2024, Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2)
- Multi-Agent Multimodal Models for Multicultural Text to Image Generation(Parth Bhalerao, Mounika Yalamarty, B. Trinh, Oana Ignat, 2025, ArXiv)
- Digital Player: Evaluating Large Language Models based Human-like Agent in Games(Jiawei Wang, Kai Wang, Shaojie Lin, Runze Wu, Bihan Xu, Ling Jiang, Shiwei Zhao, Renyu Zhu, Haoyu Liu, Zhipeng Hu, Zhong Fan, Le Li, Tangjie Lyu, Changjie Fan, 2025, ArXiv)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents(Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, X. Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip H. S. Torr, Bernard Ghanem, G. Li, 2024, No journal)
- AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments(Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, Michael Moor, 2024, ArXiv)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments(Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, T. Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu, 2024, ArXiv)
- Active Agent Oriented Multimodal Interface System(O. Hasegawa, K. Itou, Takio Kurita, S. Hayamizu, Kazuyo Tanaka, Kazuhiko Yamamoto, N. Otsu, 1995, No journal)
- A Framework for Supporting Multimodal Conversational Characters in a Multi-agent System(Yasmine Arafa, Abe Mamdani, 2000, No journal)
- Impact of mindset types and social community compositions on opinion dynamics: A large language model-based multi-agent simulation study(Guozhu Ding, Zuer Liu, Shan Li, Jie Cao, Z. Ye, 2025, Comput. Hum. Behav.)
- A Multi-Agent Digital Twin Framework for AI-Driven Fitness Coaching(M. Vahdati, Kamran Gholizadeh HamlAbadi, Fedwa Laamarti, Abdulmotaleb El Saddik, 2025, Proceedings of the 2025 ACM International Conference on Interactive Media Experiences)
- Modeling Multi-Party Interaction in Couples Therapy: A Multi-Agent Simulation Approach(Canwen Wang, A. Chen, Catherine Bao, Siwei Jin, Y. Chan, Jessica R Mindel, Sijia Xie, Holly Swartz, Tongshuang Wu, Robert E. Kraut, Haiyi Zhu, 2026, ArXiv)
- Build a Multimodal Interaction and Multi-Agent Collaborative Decision-Making Mechanism Enhanced by Large Models in the Intelligent Decision-Making System for Distribution Network Production(Wei Zhang, Song Wang, Shuai Zhang, Yuanyuan Lei, L. Bao, 2025, International Journal of Computational Intelligence and Applications)
- 3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark(Ivan Sviridov, Amina Miftakhova, Artemiy Tereshchenko, Galina Zubkova, Pavel Blinov, Andrey Savchenko, 2025, No journal)
- FinArena: A Human-Agent Collaboration Framework for Financial Market Analysis and Forecasting(Congluo Xu, Zhaobin Liu, Ziyang Li, 2025, ArXiv)
- Research on the Laws of Multimodal Perception and Cognition from a Cross-cultural Perspective - Taking Overseas Chinese Gardens as an Example(Ran Chen, Xueqi Yao, Jing Zhao, Shuhan Xu, Sirui Zhang, Yijun Mao, 2023, ArXiv)
- MAXplain: A Multi-Agent System for Interactive Multimodal Hate Speech Detection(Nils Riekers, Marten Risius, Tong Chen, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Leveraging Long Short-Term User Preference in Conversational Recommendation via Multi-agent Reinforcement Learning(Yang Deng, Yaliang Li, Bolin Ding, W. Lam, 2023, IEEE Transactions on Knowledge and Data Engineering)
- Indirect Agent Interaction within an Approach for a Robust Transport Control in Dynamic and Multimodal Logistics Networks(Heiko Jung, S. Weissbach, J. Kappler, 2011, Electron. Commun. Eur. Assoc. Softw. Sci. Technol.)
- A Multi-agent Based Testbed for Agent Interface Evaluation(Chung-Min Wu, M. Hsieh, Chin-Hsing Luo, 2008, 2008 Eighth International Conference on Intelligent Systems Design and Applications)
- Emotional Intelligence in Artificial Agents: Leveraging Deep Multimodal Big Data for Contextual Social Interaction and Adaptive Behavioral Modelling(V. Annapareddy, Jeevani Singireddy, Botlagunta Preethish Nanan, Phanish Lakarasu, J. Burugulla, 2025, SSRN Electronic Journal)
- Human-like Social Compliance in Large Language Models: Unifying Sycophancy and Conformity through Signal Competition Dynamics(Long Zhang, Wei-neng Chen, 2025, ArXiv)
- Bridging the behavior-neural gap: A multimodal AI reveals the brain's geometry of emotion more accurately than human self-reports(Changde Du, Yizhuo Lu, Zhongyu Huang, Yi Sun, Zisen Zhou, Shaozheng Qin, Huiguang He, 2025, ArXiv)
- Large Model Strategic Thinking, Small Model Efficiency: Transferring Theory of Mind in Large Language Models(Nunzio Lorè, Sepehr Ilami, Babak Heydari, 2024, ArXiv)
- Human-Autonomous System Interaction Graphical Notation (HASIGN): How Do We Design for Human-Multi-AGV Interaction in Manufacturing Intralogistics?(Rana El Khoury, Denis Zatyagov, Igor Rybalskii, Karl Kruusamäe, Jonas S. I. Rieder, Walter Quadrini, Thomas Trautner, Martijn Verbeij, Cecilia Colloseus, Doris Aschenbrenner, 2026, ACM Transactions on Autonomous and Adaptive Systems)
- Nadine: A large language model‐driven intelligent social robot with affective capabilities and human‐like memory(Hangyeol Kang, Maher Ben Moussa, N. Thalmann, 2024, Computer Animation and Virtual Worlds)
- Two people walk into a bar: dynamic multi-party social interaction with a robot agent(Mary Ellen Foster, Andre Gaschler, M. Giuliani, Amy Isard, M. Pateraki, Ronald P. A. Petrick, 2012, No journal)
- DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents(Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, Edward Choi, 2024, ArXiv)
- Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties(Philipp J. Schneider, Lin Tian, Marian-Andrei Rizoiu, 2025, ArXiv)
This report consolidates the core research under the theme of "Long-Horizon Multi-Agent Multimodal Social Scenarios," building a complete picture from low-level technical foundations to high-level social applications. The research covers: 1) macro-level social dynamics simulation, revealing how group behavior evolves; 2) micro-level long-term memory management, resolving the problem of interaction consistency; 3) meso-level game-theoretic and cooperative mechanisms, improving the effectiveness of multi-agent decision-making; 4) multimodal perception and embodied intelligence, strengthening agents' ability to operate in and understand complex environments; 5) system safety and social governance, addressing the risks of AI socialization; and 6) vertical-domain deployment and benchmark construction, driving intelligent transformation in healthcare, education, and other industries. Together, these studies point toward general-purpose agent systems with strong social intelligence, long-term stability, and multimodal interaction capabilities.
A total of 139 related references.
Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
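The Reflector-weighted aggregation step can be illustrated with a simple weighted vote over Solver answers. WISE's actual modified Dawid-Skene algorithm additionally models per-agent reliability across debate rounds, so the following is only a simplified stand-in, with hypothetical answers and weights:

```python
from collections import defaultdict

def weighted_vote(solutions, weights):
    """Pick the answer with the largest total reflector-assigned weight --
    a simplified stand-in for Dawid-Skene-style aggregation."""
    tally = defaultdict(float)
    for answer, w in zip(solutions, weights):
        tally[answer] += w
    return max(tally, key=tally.get)

# Three Solvers answer a visual puzzle; the Reflector trusts the
# multimodal agent (first entry) most. Values are hypothetical.
answers = ["B", "C", "C"]
weights = [0.9, 0.3, 0.4]
winner = weighted_vote(answers, weights)  # high-weight minority wins
```

With these illustrative weights, the single trusted agent's answer "B" (weight 0.9) overrides the unweighted majority "C" (combined weight 0.7), which is the basic motivation for weighting agents by verified reliability rather than counting votes.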
While Vision-Language Models (VLMs) hold promise for tasks requiring extensive collaboration, traditional multi-agent simulators have facilitated rich explorations of an interactive artificial society that reflects collective behavior. However, these existing simulators face significant limitations. Firstly, they struggle with handling large numbers of agents due to high resource demands. Secondly, they often assume agents possess perfect information and limitless capabilities, hindering the ecological validity of simulated social interactions. To bridge this gap, we propose MineLand, a multi-agent Minecraft simulator that introduces three key features: large-scale scalability, limited multimodal senses, and physical needs. Our simulator supports 64 or more agents. Agents have limited visual, auditory, and environmental awareness, forcing them to actively communicate and collaborate to fulfill physical needs like food and resources. Additionally, we introduce an AI agent framework, Alex, inspired by multitasking theory, enabling agents to handle intricate coordination and scheduling. Our experiments demonstrate that the simulator, the corresponding benchmark, and the AI agent framework contribute to more ecological and nuanced collective behavior. The source code of MineLand and Alex is openly available at https://github.com/cocacola-lab/MineLand.
Multi-agent debate has proven effective in improving large language model quality for reasoning and factuality tasks. While various role-playing strategies in multi-agent debates have been explored, in terms of the communication among agents, existing approaches adopt a brute-force algorithm -- each agent can communicate with all other agents. In this paper, we systematically investigate the effect of communication connectivity in multi-agent systems. Our experiments on GPT and Mistral models reveal that multi-agent debates leveraging sparse communication topology can achieve comparable or superior performance while significantly reducing computational costs. Furthermore, we extend the multi-agent debate framework to multimodal reasoning and alignment labeling tasks, showcasing its broad applicability and effectiveness. Our findings underscore the importance of communication connectivity on enhancing the efficiency and effectiveness of the "society of minds" approach.
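A sparse debate topology of the kind studied here can be sketched as a ring, where each agent revises its answer using only its two neighbours' answers (n messages per round) instead of all n-1 peers' (n(n-1) messages). The update rule below is a toy majority heuristic, not the paper's method; it only illustrates how the topology restricts information flow:

```python
def ring_topology(n):
    """Sparse topology: agent i hears only its two ring neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def debate_round(answers, topology, update):
    # Each agent revises its answer given only its visible peers' answers.
    return [update(answers[i], [answers[j] for j in topology[i]])
            for i in range(len(answers))]

def majority_update(own, peer_answers):
    # Toy stand-in for an LLM revising its answer: adopt the most common
    # answer among itself and the peers it can see.
    pool = [own] + peer_answers
    return max(set(pool), key=pool.count)
```

Even with only local communication, a single dissenting answer can be corrected in one round once its neighbours agree, which is the intuition behind sparse topologies matching fully connected debate at a fraction of the cost.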
Accessibility remains a critical concern in today's society, as many technologies are not developed to support the full range of user needs. Existing multi-agent systems (MAS) often cannot provide comprehensive assistance for users in need due to the lack of customization stemming from closed-source designs. Consequently, individuals with disabilities frequently encounter significant barriers when attempting to interact with digital environments. We introduce MATE, a multimodal accessibility MAS, which performs modality conversions based on the user's needs. The system assists people with disabilities by ensuring that data is converted into an understandable format. For instance, if the user cannot see well and receives an image, the system converts this image to its audio description. MATE can be applied to a wide range of domains, industries, and areas, such as healthcare, and can become a useful assistant for various groups of users. The system supports multiple types of models, ranging from LLM API calling to using custom machine learning (ML) classifiers. This flexibility ensures that the system can be adapted to various needs and is compatible with a wide variety of hardware. Since the system is expected to run locally, it ensures the privacy and security of sensitive information. In addition, the framework can be effectively integrated with institutional technologies (e.g., digital healthcare services) for real-time user assistance. Furthermore, we introduce ModCon-Task-Identifier, a model that is capable of extracting the precise modality conversion task from the user input. Numerous experiments show that ModCon-Task-Identifier consistently outperforms other LLMs and statistical models on our custom data. Our code and data are publicly available at https://github.com/AlgazinovAleksandr/Multi-Agent-MATE.
Human communication is a complex and diverse process that not only involves multiple factors such as language, commonsense, and cultural backgrounds but also requires the participation of multimodal information, such as speech. Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society. However, current LLM-based multi-agent systems mainly rely on text as the primary medium. Can we leverage LLM-based multi-agent systems to simulate human communication? In this paper, we propose SpeechAgents, a multi-modal LLM-based multi-agent system designed for simulating human communication. SpeechAgents utilizes a multi-modal LLM as the control center for each individual agent and employs multi-modal signals as the medium for messages exchanged among agents. Additionally, we propose Multi-Agent Tuning to enhance the multi-agent capabilities of LLMs without compromising general abilities. To strengthen and evaluate the effectiveness of human communication simulation, we build the Human-Communication Simulation Benchmark. Experimental results demonstrate that SpeechAgents can simulate human communication dialogues with consistent content, authentic rhythm, and rich emotions, and demonstrates excellent scalability even with up to 25 agents, which can apply to tasks such as drama creation and audio novel generation. Code and models will be open-sourced at https://github.com/0nutation/SpeechAgents
The believable simulation of multi-user behavior is crucial for understanding complex social systems. Recently, large language model (LLM)-based AI agents have made significant progress, enabling them to achieve human-like intelligence across various tasks. However, real human societies are often dynamic and complex, involving numerous individuals engaging in multimodal interactions. In this paper, taking e-commerce scenarios as an example, we present LMAgent, a very large-scale multimodal agent society based on multimodal LLMs. In LMAgent, besides freely chatting with friends, the agents can autonomously browse, purchase, and review products, and even perform live-streaming e-commerce. To simulate this complex system, we introduce a self-consistency prompting mechanism to augment agents' multimodal capabilities, resulting in significantly improved decision-making performance over existing multi-agent systems. Moreover, we propose a fast memory mechanism combined with the small-world model to enhance system efficiency, which supports more than 10,000 agent simulations in a society. Experiments on agents' behavior show that these agents achieve comparable performance to humans on behavioral indicators. Furthermore, compared with existing LLM-based multi-agent systems, richer and more valuable phenomena emerge, such as herd behavior, which demonstrates the potential of LMAgent for credible large-scale social behavior simulations.
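The small-world idea behind the fast memory mechanism can be sketched with a Watts-Strogatz-style construction (our illustration; LMAgent's exact construction may differ, and all names here are ours): agents mostly interact with ring neighbours, plus a few random long-range shortcuts.

```python
# Sketch of a small-world contact graph for an agent society.
import random

def watts_strogatz(n: int, k: int, p: float, seed: int = 0) -> set:
    """Ring lattice of n agents with k neighbours per side; each edge is
    rewired to a random agent with probability p (and resampled on any
    self-loop or duplicate, so exactly n*k unique edges come out)."""
    rng = random.Random(seed)
    edges = set()
    for i in range(n):
        for j in range(1, k + 1):
            b = (i + j) % n                      # default ring neighbour
            if rng.random() < p:                 # occasional long-range shortcut
                b = rng.randrange(n)
            while b == i or (i, b) in edges or (b, i) in edges:
                b = rng.randrange(n)             # resample on collision
            edges.add((i, b))
    return edges

edges = watts_strogatz(n=100, k=2, p=0.1)
# Each agent keeps only a handful of contacts instead of 99, so the message
# fan-out (and memory lookups per step) stays bounded as the society grows.
```

This is why a small-world wiring helps scale to 10,000+ agents: interaction cost per step grows with the (constant) degree rather than with the population size, while the shortcuts keep the society's effective diameter small.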
This study focuses on category formation for individual agents and the dynamics of symbol emergence in a multi-agent system through semiotic communication. In this study, semiotic communication refers to exchanging signs composed of the signifier (i.e., words) and the signified (i.e., categories). We define the generation and interpretation of signs associated with the categories formed through the agent's own sensory experience or by exchanging signs with other agents as basic functions of semiotic communication. From the viewpoint of language evolution and symbol emergence, the organization of a symbol system in a multi-agent system (i.e., agent society) is considered a bottom-up, dynamic process in which individual agents share the meaning of signs and categorize sensory experience. A constructive computational model can explain the mutual dependency of the two processes and has mathematical support that guarantees a symbol system's emergence and sharing within the multi-agent system. In this paper, we describe a new computational model that represents symbol emergence in a two-agent system based on a probabilistic generative model for multimodal categorization. It models semiotic communication via a probabilistic rejection based on the receiver's own belief. We have found that the dynamics by which cognitively independent agents create a symbol system through their semiotic communication can be regarded as the inference process of a hidden variable in an interpersonal multimodal categorizer, i.e., the complete system can be regarded as a single agent performing multimodal categorization using the sensors of all agents, if we define the rejection probability based on the Metropolis-Hastings algorithm. The validity of the proposed model and algorithm for symbol emergence, i.e., forming and sharing signs and categories, is also verified in an experiment with two agents observing daily objects in a real-world environment.
In the experiment, we compared three communication algorithms: no communication, no rejection, and the proposed algorithm. The experimental results demonstrate that our model reproduces the phenomena of symbol emergence, which does not require a teacher who would know a pre-existing symbol system. Instead, the multi-agent system can form and use a symbol system without having pre-existing categories.
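The Metropolis-Hastings-style acceptance rule described above can be sketched in a few lines (our toy illustration, not the paper's model; the belief table and all names are ours): the receiver accepts a speaker's proposed sign with probability proportional to how plausible that sign is under the receiver's own belief, relative to its current sign.

```python
# Toy MH-style sign acceptance based on the receiver's own belief.
import random

def mh_accept(p_proposed: float, p_current: float, rng: random.Random) -> bool:
    """Accept the speaker's sign with probability min(1, p_proposed/p_current),
    judged entirely by the receiver's own belief."""
    if p_current == 0.0:
        return True  # anything beats a sign the receiver gives zero belief
    ratio = min(1.0, p_proposed / p_current)
    return rng.random() < ratio

rng = random.Random(0)
belief = {"wa": 0.7, "yo": 0.3}   # toy P_r(sign | category) for one category
accepted = mh_accept(belief["wa"], belief["yo"], rng)
# 0.7/0.3 > 1, so this particular proposal is always accepted.
```

The key property is that no agent ever inspects the other's internal belief: acceptance depends only on the receiver's own probabilities, which is what lets the pair jointly behave like a single sampler over shared signs.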
Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through a temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via an Assessor Agent. It includes 2,996 cases across 34 diagnoses from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open- and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.
This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar's Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.
Large Language Models (LLMs) demonstrate impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of existing data and models. Meanwhile, multi-agent models have shown strong capabilities in solving complex tasks. In this paper, we evaluate the performance of LLMs in a multi-agent interaction setting for the novel task of multicultural image generation. Our key contributions are: (1) We introduce MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas; (2) We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics, offering valuable insights for future research. Our dataset and models are available at https://github.com/OanaIgnat/MosAIG.
In this demo, we present SupportPlay, a multi-agent role-playing system for emotional support conversation (ESC) that addresses the limitations of existing methods in providing personalized, sustained, and multimodal support. SupportPlay generates potential seeker profiles from existing ESC datasets and employs GPT-powered agents to play various seeker roles, enabling the supporter to learn personalized memories for each seeker. Subsequently, the supporter can induce general memory from these memories for real user interactions while learning the user's personalized memory in a similar manner. Through continuous memory management, including retrieval, storage, reflection, and forgetting, SupportPlay delivers tailored emotional support across interactions. By integrating text, speech, and video, SupportPlay creates immersive emotional support experiences.
In the past decade, social media platforms have been used for information dissemination and consumption. While a major portion of the content is posted to promote citizen journalism and public awareness, some content is posted to mislead users. Among different content types such as text, images, and videos, memes (text overlaid on images) are particularly prevalent and can serve as powerful vehicles for propaganda, hate, and humor. In the current literature, there have been efforts to individually detect such content in memes. However, the study of their intersection is very limited. In this study, we explore the intersection between propaganda and hate in memes using a multi-agent LLM-based approach. We extend the propagandistic meme dataset with coarse- and fine-grained hate labels. Our findings suggest that there is an association between propaganda and hate in memes. We provide detailed experimental results that can serve as a baseline for future studies. We will make the experimental resources publicly available to the community (https://github.com/firojalam/propaganda-and-hateful-memes).
Modeling human behavior in urban environments is fundamental for social science, behavioral studies, and urban planning. Prior work often relies on rigid, hand-crafted rules, limiting its ability to simulate nuanced intentions, plans, and adaptive behaviors. Addressing these challenges, we envision an urban simulator (CitySim), capitalizing on breakthroughs in human-level intelligence exhibited by large language models. In CitySim, agents generate realistic daily schedules using a recursive value-driven approach that balances mandatory activities, personal habits, and situational factors. To enable long-term, lifelike simulations, we endow agents with beliefs, long-term goals, and spatial memory for navigation. CitySim exhibits closer alignment with real humans than prior work, both at micro and macro levels. Additionally, we conduct insightful experiments by modeling tens of thousands of agents and evaluating their collective behaviors under various real-world scenarios, including estimating crowd density, predicting place popularity, and assessing well-being. Our results highlight CitySim as a scalable, flexible testbed for understanding and forecasting urban phenomena.
Establishing the long-term, causal impact of psychological interventions on life outcomes is a grand challenge for the social sciences, caught between the limitations of correlational longitudinal studies and short-term randomized controlled trials (RCTs). This paper introduces Large-Scale Agent-based Longitudinal Simulation (LALS), a framework that resolves this impasse by simulating multi-decade, counterfactual life trajectories. The methodology employs a "digital clone" design where 2,500 unique LLM-based agent personas (grounded in a curated corpus of 3,917 empirical research articles) are each cloned across a 2x2 factorial experiment. Specifically, the simulation models the efficacy of extended psychological resilience training (Intervention vs. Control) administered either in childhood or as a young adult (age 6 vs. age 18). Comparing digital clones enables exceptionally precise causal inference. The simulation provides a quantitative, causal estimate of a resilience intervention's lifelong effects, revealing significant reductions in mortality, a lower incidence of dementia, and a substantial increase in accumulated wealth. Crucially, the results uncover a developmental window: the intervention administered at age 6 produced more than double the positive impact on lifetime wealth compared to the same intervention at age 18. These benefits were most pronounced for agents from low-socioeconomic backgrounds, highlighting a powerful buffering effect. The LALS framework serves as a "computational wind tunnel" for social science, offering a new paradigm for generating and testing causal hypotheses about the complex, lifelong dynamics that shape human capital and well-being.
Artificial agents with the aid of large language models (LLMs) are effective in various real-world scenarios but struggle to cooperate in social dilemmas. When making decisions under the strain of choosing between long-term consequences and short-term benefits in commonly shared resources, LLM-based agents often exploit the environment, leading to early depletion. Inspired by the concept of consideration of future consequences (CFC), which is well known in social psychology, we propose a framework that enables LLM-based agents to consider future consequences, resulting in a new kind of agent that we term the CFC-Agent. We enable the CFC-Agent to act toward different levels of consideration for future consequences. Our first set of experiments, where the LLM is directly asked to make decisions, shows that agents considering future consequences exhibit sustainable behaviour and achieve high common rewards for the population. Extensive experiments in complex environments showed that the CFC-Agent can manage a sequence of LLM calls for reasoning and engage in communication to cooperate with others to better resolve the commons dilemma. Finally, our analysis showed that considering future consequences not only affects the final decision but also improves the conversations between LLM-based agents toward a better resolution of social dilemmas.
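How a CFC weight can flip a harvesting decision is easy to see in a toy commons model. The sketch below is our illustration, not the paper's CFC-Agent; the regrowth rule, parameters, and names are all assumptions made for the example.

```python
# Toy commons dilemma: does weighting future consequences change the harvest?
def resource_after(stock: float, harvest: float, growth: float = 0.2) -> float:
    """Regrowth of a common pool after harvesting (20% per period)."""
    remaining = max(stock - harvest, 0.0)
    return remaining * (1.0 + growth)

def cfc_value(stock: float, harvest: float, cfc_weight: float,
              horizon: int = 10) -> float:
    """Immediate payoff plus a CFC-weighted sum of future attainable payoffs."""
    value = harvest
    s = resource_after(stock, harvest)
    for _ in range(horizon):
        future_harvest = min(harvest, s)   # can only take what is left
        value += cfc_weight * future_harvest
        s = resource_after(s, future_harvest)
    return value

stock = 100.0
# A myopic agent (cfc_weight=0) prefers the largest harvest; a future-oriented
# agent (cfc_weight=1) prefers the sustainable one, since over-harvesting
# collapses the pool after a single period.
myopic_best = max([20.0, 80.0], key=lambda h: cfc_value(stock, h, 0.0))
cfc_best = max([20.0, 80.0], key=lambda h: cfc_value(stock, h, 1.0))
```

In this toy, harvesting 80 yields 80 once and then nothing, while harvesting 20 yields roughly 20 per period for the whole horizon, so the future-weighted value favours restraint, which mirrors the sustainable behaviour the abstract reports.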
Social norms play a crucial role in guiding agents towards understanding and adhering to standards of behavior, thus reducing social conflicts within multi-agent systems (MASs). However, current LLM-based (or generative) MASs lack the capability to be normative. In this paper, we propose a novel architecture, named CRSEC, to empower the emergence of social norms within generative MASs. Our architecture consists of four modules: Creation & Representation, Spreading, Evaluation, and Compliance. This addresses several important aspects of the emergent processes all in one: (i) where social norms come from, (ii) how they are formally represented, (iii) how they spread through agents' communications and observations, (iv) how they are examined with a sanity check and synthesized in the long term, and (v) how they are incorporated into agents' planning and actions. Our experiments deployed in the Smallville sandbox game environment demonstrate the capability of our architecture to establish social norms and reduce social conflicts within generative MASs. The positive outcomes of our human evaluation, conducted with 30 evaluators, further affirm the effectiveness of our approach. Our project can be accessed via the following link: https://github.com/sxswz213/CRSEC.
Contemporary approaches to agent-based modeling (ABM) of social systems have traditionally emphasized rule-based behaviors, limiting their ability to capture nuanced dynamics; language models (LMs) make it possible to move beyond predefined rules by leveraging contextual understanding of human social interaction. This paper presents SALM (Social Agent LM Framework), a novel approach for integrating language models into social network simulation that achieves unprecedented temporal stability in multi-agent scenarios. Our primary contributions include: (1) a hierarchical prompting architecture enabling stable simulation beyond 4,000 timesteps while reducing token usage by 73%, (2) an attention-based memory system achieving 80% cache hit rates (95% CI [78%, 82%]) with sub-linear memory growth of 9.5%, and (3) formal bounds on personality stability. Through extensive validation against SNAP ego networks, we demonstrate the first LLM-based framework capable of modeling long-term social phenomena while maintaining empirically validated behavioral fidelity.
Diplomacy is one of the most sophisticated activities in human society, involving complex interactions among multiple parties that require skills in social reasoning, negotiation, and long-term strategic planning. Previous AI agents have demonstrated their ability to handle multi-step games and large action spaces in multi-agent tasks. However, diplomacy involves a staggering magnitude of decision spaces, especially considering the negotiation stage required. While recent agents based on large language models (LLMs) have shown potential in various applications, they still struggle with extended planning periods in complex multi-agent settings. Leveraging recent technologies for LLM-based agents, we aim to explore AI's potential to create a human-like agent capable of executing comprehensive multi-agent missions by integrating three fundamental capabilities: 1) strategic planning with memory and reflection; 2) goal-oriented negotiation with social reasoning; and 3) augmenting memory through self-play games for self-evolution without a human in the loop.
We introduce a novel non-cooperative game to analyse opinion formation and resistance, incorporating principles from social psychology such as confirmation bias, resource constraints, and influence penalties. Our simulation features Large Language Model (LLM) agents competing to influence a population, with penalties imposed for generating messages that propagate or counter misinformation. This framework integrates resource optimisation into the agents' decision-making process. Our findings demonstrate that while weaker confirmation bias strengthens opinion alignment within groups, it also exacerbates overall polarisation. Conversely, stronger confirmation bias leads to fragmented opinions and limited shifts in individual beliefs. Investing heavily in a high-resource debunking strategy can initially align the population with the debunking agent, but risks rapid resource depletion and diminished long-term influence.
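The confirmation-bias effect this abstract describes can be illustrated with a minimal bounded-confidence update (our sketch; the paper's game additionally models LLM agents, resources, and penalties): agents only assimilate messages within an acceptance window, and a narrower window corresponds to stronger confirmation bias.

```python
# Deffuant-style bounded-confidence opinion update.
def update_opinion(opinion: float, message: float,
                   epsilon: float, mu: float = 0.5) -> float:
    """Move toward the message only if it lies within the acceptance
    window `epsilon` (smaller epsilon = stronger confirmation bias)."""
    if abs(opinion - message) <= epsilon:
        return opinion + mu * (message - opinion)
    return opinion  # rejected: outside the confirmation-bias window

strong_bias = update_opinion(0.9, 0.1, epsilon=0.1)  # distant debunk ignored
weak_bias = update_opinion(0.9, 0.1, epsilon=1.0)    # opinion moves halfway
```

Under strong bias a distant debunking message has no effect, so opinions fragment into clusters that never interact; under weak bias the same message shifts the opinion substantially, aligning groups internally while still allowing polarisation between them.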
With the rapid advancement of Large Language Models (LLMs), LLM-based autonomous agents have shown the potential to function as digital employees, such as digital analysts, teachers, and programmers. In this paper, we develop an application-level testbed based on the open-source strategy game "Unciv", which has millions of active players, to enable researchers to build a "data flywheel" for studying human-like agents in the "digital players" task. This "Civilization"-like game features expansive decision-making spaces along with rich linguistic interactions such as diplomatic negotiations and acts of deception, posing significant challenges for LLM-based agents in terms of numerical reasoning and long-term planning. Another challenge for "digital players" is to generate human-like responses for social interaction, collaboration, and negotiation with human players. The open-source project can be found at https://github.com/fuxiAIlab/CivAgent.
With the rapid advancement of large language models (LLMs), recent years have witnessed many promising studies on leveraging LLM-based agents to simulate human social behavior. While prior work has demonstrated significant potential across various domains, much of it has focused on specific scenarios involving a limited number of agents and has lacked the ability to adapt when errors occur during simulation. To overcome these limitations, we propose a novel LLM-agent-based simulation platform called GenSim, which: (1) abstracts a set of general functions to simplify the simulation of customized social scenarios; (2) supports one hundred thousand agents to better simulate large-scale populations in real-world contexts; (3) incorporates error-correction mechanisms to ensure more reliable and long-term simulations. To evaluate our platform, we assess both the efficiency of large-scale agent simulations and the effectiveness of the error-correction mechanisms. To our knowledge, GenSim represents an initial step toward a general, large-scale, and correctable social simulation platform based on LLM agents, promising to further advance the field of social science.
In this work, we describe our approach to developing an intelligent and robust social robotic system for the Nadine social robot platform. We achieve this by integrating large language models (LLMs) and skillfully leveraging the powerful reasoning and instruction-following capabilities of these types of models to achieve advanced human-like affective and cognitive capabilities. This approach is novel compared to current state-of-the-art LLM-based agents, which do not implement human-like long-term memory or sophisticated emotional capabilities. We built a social robot system that generates appropriate behaviors through multimodal input processing, brings up episodic memories matched to the recognized user, and simulates the emotional states of the robot induced by the interaction with the human partner. In particular, we introduce an LLM-agent framework for social robots, social robotics reasoning and acting, which serves as a core component of the interaction module in our system. This design advances social robots and aims to increase the quality of human-robot interaction.
In the digital era, the rapid propagation of fake news and rumors via social networks brings notable societal challenges and impacts public opinion regulation. Traditional fake news modeling typically forecasts the general popularity trends of different groups or numerically represents opinion shifts. However, these methods often oversimplify real-world complexities and overlook the rich semantic information of news text. The advent of large language models (LLMs) makes it possible to model the subtle dynamics of opinion. Consequently, in this work, we introduce a Fake news Propagation Simulation framework (FPS) based on LLMs, which studies the trends and control of fake news propagation in detail. Specifically, each agent in the simulation represents an individual with a distinct personality. They are equipped with both short-term and long-term memory, as well as a reflective mechanism to mimic human-like thinking. Every day, they engage in random opinion exchanges, reflect on their thinking, and update their opinions. Our simulation results uncover patterns in fake news propagation related to topic relevance and individual traits, aligning with real-world observations. Additionally, we evaluate various intervention strategies and demonstrate that early and appropriately frequent interventions strike a balance between governance cost and effectiveness, offering valuable insights for practical applications. Our study underscores the significant utility and potential of LLMs in combating fake news.
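The intervention-timing finding can be illustrated with a mean-field contagion toy (our drastic simplification; FPS itself simulates individual LLM agents with memory and reflection, and all parameters here are assumptions): believers convert the undecided until a debunking intervention starts pulling them back.

```python
# Mean-field toy of rumor spread with a debunking intervention.
def simulate(days: int, intervention_day: int,
             beta: float = 0.4, gamma: float = 0.5) -> float:
    """Fraction of believers after `days`, with debunking from `intervention_day`."""
    believers = 0.01  # initial fraction believing the rumor
    for day in range(days):
        growth = beta * believers * (1.0 - believers)              # contagion
        decay = gamma * believers if day >= intervention_day else 0.0
        believers = min(max(believers + growth - decay, 0.0), 1.0)
    return believers

early = simulate(days=60, intervention_day=5)
late = simulate(days=60, intervention_day=55)
# Intervening early all but extinguishes the rumor; intervening late leaves
# a sizeable fraction of believers despite the same per-day debunking effort.
```

Because spread is multiplicative, the cost of delay compounds: once most of the population believes the rumor, even a debunking rate larger than the contagion rate needs many days to undo it, which is consistent with the abstract's case for early intervention.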
We describe an approach for aligning an LLM-based dialogue agent for long-term social dialogue, where only a single global score is given by the user at the end of the session. In this paper, we propose using denser, naturally occurring multimodal communicative signals as local implicit feedback to improve turn-level utterance generation. Our approach (dubbed GELI) learns a local, turn-level reward model by decomposing the human-provided Global Explicit (GE) session-level reward, using Local Implicit (LI) multimodal reward signals to crossmodally shape the reward decomposition step. This decomposed reward model is then used as part of the RLHF pipeline to improve an LLM-based dialogue agent. We run quantitative and qualitative human studies on two large-scale datasets to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval-augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues in personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on average, across up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements, but these models still substantially lag behind human performance.
Maintaining the long-term sustainability of recommender systems (RS) is crucial. Traditional RS evaluation methods primarily focus on the user's immediate feedback (e.g., clicks); however, they often overlook the long-term effects introduced by content creators. In the real world, content creators can strategically create and upload new items to the platform by analyzing users' feedback and preference trends. Although previous studies have attempted to model creator behaviors, they often overlook that such behaviors occur under conditions of information asymmetry. This asymmetry arises because creators mainly access the user feedback on the items they produce, while the platform has access to the full spectrum of feedback data. However, existing RS simulators often fail to consider such a condition, making long-term RS evaluation inaccurate. To bridge this gap, we propose a Large Language Model (LLM)-empowered creator simulation agent named CreAgent. By utilizing the belief mechanism from game theory and the fast-and-slow thinking framework, we can simulate creators' behaviors well under information asymmetry. Furthermore, to enhance CreAgent's simulation ability, we utilize Proximal Policy Optimization to fine-tune CreAgent. Our credibility validation experiments demonstrate that our simulation environment effectively aligns with the behaviors of real-world platforms and creators, thereby enhancing the reliability of long-term evaluations in RS. Furthermore, leveraging this simulator, we can examine whether RS algorithms, such as fairness- and diversity-aware methods, contribute to improving long-term performance for different stakeholders.
Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.
A growing body of multi-agent studies with LLMs explores how norms and cooperation emerge in mixed-motive scenarios, where pursuing individual gain can undermine the collective good. While prior work has explored these dynamics in both richly contextualized simulations and simplified game-theoretic environments, most LLM systems featuring common-pool resource (CPR) games provide agents with explicit reward functions directly tied to their actions. In contrast, human cooperation often emerges without explicit knowledge of the payoff structure or how individual actions translate into long-run outcomes, relying instead on heuristics, communication, and enforcement. We introduce a CPR simulation framework that removes explicit reward signals and embeds cultural-evolutionary mechanisms: social learning (adopting strategies and beliefs from successful peers) and norm-based punishment, grounded in Ostrom's principles of resource governance. Agents also individually learn from the consequences of harvesting, monitoring, and punishing via environmental feedback, enabling norms to emerge endogenously. We establish the validity of our simulation by reproducing key findings from existing studies on human behavior. Building on this, we examine norm evolution across a 2×2 grid of environmental and social initialisations (resource-rich vs. resource-scarce; altruistic vs. selfish) and benchmark how agentic societies composed of different LLMs perform under these conditions. Our results reveal systematic model differences in sustaining cooperation and norm formation, positioning the framework as a rigorous testbed for studying emergent norms in mixed-motive LLM societies. Such analysis can inform the design of AI systems deployed in social and organizational contexts, where alignment with cooperative norms is critical for stability, fairness, and effective governance of AI-mediated environments.
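As a rough illustration of the mechanisms the abstract names (social learning plus norm-based punishment in a CPR game without explicit reward functions), here is a minimal numeric sketch. All constants (`regen`, `norm_cap`, `fine`) and the imitate-the-richest rule are assumptions for illustration, not the paper's actual implementation.

```python
import random

def simulate_cpr(n_agents=8, rounds=40, regen=1.15, capacity=100.0,
                 norm_cap=3.0, fine=2.0, seed=1):
    """Toy common-pool resource game: agents are never shown a reward
    function; they imitate the most successful peer (social learning),
    and over-harvesters are fined by peers (norm-based punishment)."""
    rng = random.Random(seed)
    stock = capacity
    rates = [rng.uniform(1.0, 6.0) for _ in range(n_agents)]  # harvest rates
    wealth = [0.0] * n_agents
    for _ in range(rounds):
        for i in range(n_agents):
            take = min(rates[i], stock / n_agents)  # bounded by the pool
            stock -= take
            wealth[i] += take
            if rates[i] > norm_cap:      # peers punish norm violators
                wealth[i] -= fine
        stock = min(capacity, stock * regen)  # the resource regenerates
        # Social learning: copy the richest peer's rate with small mutation.
        best = max(range(n_agents), key=lambda j: wealth[j])
        rates = [max(0.0, rates[best] + rng.gauss(0, 0.2))
                 for _ in range(n_agents)]
    return stock, rates

stock, rates = simulate_cpr()
```

Varying the initial stock and the initial rate distribution reproduces the paper's 2×2 grid of environmental and social initialisations in miniature.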
As autonomous agents become more prevalent, understanding their collective behaviour in strategic interactions is crucial. This study investigates the emergent cooperative tendencies of systems of Large Language Model (LLM) agents in a social dilemma. Unlike previous research where LLMs output individual actions, we prompt state-of-the-art LLMs to generate complete strategies for iterated Prisoner's Dilemma. Using evolutionary game theory, we simulate populations of agents with different strategic dispositions (aggressive, cooperative, or neutral) and observe their evolutionary dynamics. Our findings reveal that different LLMs exhibit distinct biases affecting the relative success of aggressive versus cooperative strategies. This research provides insights into the potential long-term behaviour of systems of deployed LLM-based autonomous agents and highlights the importance of carefully considering the strategic environments in which they operate.
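The study's evolutionary setup can be sketched concretely: strategies play iterated Prisoner's Dilemma against each other, and population shares evolve under discrete replicator dynamics. The two fixed strategies below stand in for LLM-generated ones, which the paper elicits from models; everything else follows the standard textbook construction.

```python
def ipd_payoffs(strat_a, strat_b, rounds=50):
    """Average per-round payoffs for two iterated Prisoner's Dilemma
    strategies; each strategy maps the opponent's last move to C or D."""
    R, S, T, P = 3, 0, 5, 1  # standard PD payoff values
    a_hist, b_hist = "C", "C"
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strat_a(b_hist), strat_b(a_hist)
        if a == "C" and b == "C":
            score_a += R; score_b += R
        elif a == "C" and b == "D":
            score_a += S; score_b += T
        elif a == "D" and b == "C":
            score_a += T; score_b += S
        else:
            score_a += P; score_b += P
        a_hist, b_hist = a, b
    return score_a / rounds, score_b / rounds

def tit_for_tat(last):   # cooperative disposition
    return last

def always_defect(last):  # aggressive disposition
    return "D"

def replicator_step(shares, strats):
    # Discrete replicator dynamics: a strategy's share grows in
    # proportion to its average payoff against the current mix.
    fit = [sum(shares[j] * ipd_payoffs(strats[i], strats[j])[0]
               for j in range(len(strats))) for i in range(len(strats))]
    mean = sum(s * f for s, f in zip(shares, fit))
    return [s * f / mean for s, f in zip(shares, fit)]

shares, strats = [0.5, 0.5], [tit_for_tat, always_defect]
for _ in range(30):
    shares = replicator_step(shares, strats)
```

Here the reciprocal strategy takes over the population; the paper's point is that which disposition wins depends on the strategies a given LLM tends to generate.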
Most dating technologies optimize for getting together, not staying together. We present RELATE-Sim, a theory-grounded simulator that models how couples behave at consequential turning points (exclusivity talks, conflict-and-repair episodes, relocations) rather than via static traits. Two persona-aligned LLM agents (one per partner) interact under a centralized Scene Master that frames each turning point as a compact set of realistic options, advances the narrative, and infers interpretable state changes and an auditable commitment estimate after each scene. On a longitudinal dataset of 71 couples with two-year follow-ups, simulation-aware predictions outperform a personas-only baseline while surfacing actionable markers (e.g., repair attempts acknowledged, clarity shifts) that explain why trajectories diverge. RELATE-Sim shifts relationship research's focus from matchmaking to maintenance, providing a transparent, extensible platform for understanding and forecasting long-term relationship dynamics.
Can generative agents be trusted in multimodal environments? Despite advances in large language and vision-language models that enable agents to act autonomously and pursue goals in rich settings, their ability to reason about safety, coherence, and trust across modalities remains limited. We introduce a reproducible simulation framework for evaluating agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in text-visual scenarios; (2) detection of unsafe activities across multiple categories of social situations; and (3) social dynamics, measured as interaction counts and acceptance ratios of social exchanges. Agents are equipped with layered memory, dynamic planning, multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks. Experiments show that while agents can detect direct multimodal contradictions, they often fail to align local revisions with global safety, reaching only a 55 percent success rate in correcting unsafe plans. Across eight simulation runs with three models - Claude, GPT-4o mini, and Qwen-VL - five agents achieved average unsafe-to-safe conversion rates of 75, 55, and 58 percent, respectively. Overall performance ranged from 20 percent in multi-risk scenarios with GPT-4o mini to 98 percent in localized contexts such as fire/heat with Claude. Notably, 45 percent of unsafe actions were accepted when paired with misleading visuals, showing a strong tendency to overtrust images. These findings expose critical limitations in current architectures and provide a reproducible platform for studying multimodal safety, coherence, and social dynamics.
Social governance scenarios often involve multiple modalities of information, and effectively utilizing multimodal information to understand the situational context is an important issue. Building on situation perception, simulating how a situation develops is also beneficial for selecting appropriate response strategies. This paper takes the handling of public petitions in social governance scenarios as an example, integrating multimodal information such as event descriptions, locations, times, and images for event clustering. Based on the event clustering, we construct LLM prompts and perform situational understanding through multi-layer perceptrons. Additionally, we propose an LLM-based dual-agent method for Bilateral Collaborative Scheduling Simulation, with the first LLM representing the petitioners and the second LLM representing the civil servants. The LLM playing the role of the civil servants cooperates with various departments according to an operation set, with the goal of satisfying the petitioners. Through the interaction between these entities, we can evaluate potentially optimal strategies. This paper constructs a real-world dataset of public petitions and validates the proposed method. We verify the effectiveness of the method through both quantitative and qualitative analyses.
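The dual-agent scheduling loop can be sketched with stubs in place of the two LLM roles. The satisfaction criteria, operation names, and departments below are hypothetical; the point is only the interaction shape: the civil-servant agent dispatches operations, the petitioner agent judges the state, and strategies are compared by how quickly they reach satisfaction.

```python
def petitioner_reply(state):
    # Stub for the petitioner-role LLM: satisfied once the complaint
    # has been inspected and compensated (hypothetical criteria).
    done = state["actions_done"]
    return "satisfied" if {"inspect", "compensate"} <= done else "unsatisfied"

def civil_servant_step(state, operation_set):
    # Stub for the civil-servant-role LLM: dispatches the next pending
    # operation from the strategy's operation set to its department.
    for op, dept in operation_set:
        if op not in state["actions_done"]:
            state["actions_done"].add(op)
            state["log"].append(f"{dept}: {op}")
            return

def run_simulation(operation_set, max_steps=10):
    # Interaction loop between the two roles; fewer steps to
    # satisfaction indicates a better scheduling strategy.
    state = {"actions_done": set(), "log": []}
    for step in range(1, max_steps + 1):
        civil_servant_step(state, operation_set)
        if petitioner_reply(state) == "satisfied":
            return step, state["log"]
    return max_steps, state["log"]

# Two candidate strategies, compared by steps until satisfaction.
direct = [("inspect", "inspection dept"), ("compensate", "finance dept")]
detour = [("mediate", "community office")] + direct
steps_direct, _ = run_simulation(direct)
steps_detour, _ = run_simulation(detour)
```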
Traditional agent-based models (ABMs) of opinion dynamics often fail to capture the psychological heterogeneity driving online polarization due to simplistic homogeneity assumptions. This limitation obscures the critical interplay between individual cognitive biases and information propagation, thereby hindering a mechanistic understanding of how ideological divides are amplified. To address this challenge, we introduce the Personality-Refracted Intelligent Simulation Model (PRISM), a hybrid framework coupling stochastic differential equations (SDE) for continuous emotional evolution with a personality-conditional partially observable Markov decision process (PC-POMDP) for discrete decision-making. In contrast to continuous trait approaches, PRISM assigns distinct Myers-Briggs Type Indicator (MBTI) based cognitive policies to multimodal large language model (MLLM) agents, initialized via data-driven priors from large-scale social media datasets. PRISM achieves superior personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks. This framework effectively replicates emergent phenomena such as rational suppression and affective resonance, offering a robust tool for analyzing complex social media ecosystems.
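PRISM's continuous emotional dynamics are specified as a stochastic differential equation; one plausible minimal form is a mean-reverting (Ornstein-Uhlenbeck) process integrated with the Euler-Maruyama scheme, as sketched below. The coefficients and the OU form itself are illustrative assumptions, not the paper's actual SDE.

```python
import math
import random

def simulate_emotion(x0=0.0, mu=0.0, theta=0.8, sigma=0.3,
                     dt=0.01, steps=5000, seed=42):
    """Euler-Maruyama integration of dX = theta*(mu - X) dt + sigma dW,
    a mean-reverting SDE standing in for an agent's continuous
    emotional state between discrete (PC-POMDP) decisions."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))      # Brownian increment
        x += theta * (mu - x) * dt + sigma * dw  # drift + diffusion
        path.append(x)
    return path

# An agitated agent (x0 = 2.0) relaxes toward its emotional baseline.
path = simulate_emotion(x0=2.0)
```

In a hybrid model like PRISM, the discrete decision process would read this continuous state at each decision point; here we only show the continuous half.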
Multimodal controversy detection, which involves determining whether a given video and its associated comments are controversial, plays a pivotal role in risk management on social video platforms. Existing methods typically provide only classification results, failing to identify what aspects are controversial and why, thereby lacking detailed explanations. To address this limitation, we propose a novel Agent-based Multimodal Controversy Detection architecture, termed AgentMCD. This architecture leverages Large Language Models (LLMs) as generative agents to simulate human behavior and improve explainability. AgentMCD employs a multi-aspect reasoning process, where multiple judges conduct evaluations from diverse perspectives to derive a final decision. Furthermore, a multi-agent simulation process is incorporated, wherein agents act as audiences, offering opinions and engaging in free discussions after watching videos. This hybrid framework enables comprehensive controversy evaluation and significantly enhances explainability. Experiments conducted on the MMCD dataset demonstrate that our proposed architecture outperforms existing LLM-based baselines in both high-resource and low-resource comment scenarios, while maintaining superior explainability.
This study explores the complex relationship between perceptual and cognitive interactions in multimodal data analysis, with a specific emphasis on spatial experience design in overseas Chinese gardens. We find that evaluation content and images on social media can reflect individuals' concerns and sentiment responses, providing a rich database for cognitive research that contains both sentimental and image-based cognitive information. Leveraging deep learning techniques, we analyze textual and visual data from social media, thereby unveiling the relationship between people's perceptions and sentiment cognition within the context of overseas Chinese gardens. In addition, our study introduces a multi-agent system (MAS) alongside AI agents; each agent explores the laws of aesthetic cognition through chat-scene simulation combined with web search. This study goes beyond the traditional approach of translating perceptions into sentiment scores, extending the research methodology to directly analyze texts and dig deeper into opinion data. It provides new perspectives for understanding aesthetic experience and its impact on architecture and landscape design across diverse cultural contexts, an essential contribution to the fields of cultural communication and aesthetic understanding.
As artificial intelligence (AI) rapidly advances, especially in multimodal large language models (MLLMs), research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action feedback rather than conventionally labeled datasets. Yet, most existing simulation platforms remain narrowly designed, each tailored to specific tasks. A versatile, general-purpose training environment that can support everything from low-level embodied navigation to high-level composite activities, such as multi-agent social simulation and human-AI collaboration, remains largely unavailable. To bridge this gap, we introduce TongSIM, a high-fidelity, general-purpose platform for training and evaluating embodied agents. TongSIM offers practical advantages by providing over 100 diverse, multi-room indoor scenarios as well as an open-ended, interaction-rich outdoor town simulation, ensuring broad applicability across research needs. Its comprehensive evaluation framework and benchmarks enable precise assessment of agent capabilities, such as perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning. With features like customized scenes, task-adaptive fidelity, diverse agent types, and dynamic environmental simulation, TongSIM delivers flexibility and scalability for researchers, serving as a unified platform that accelerates training, evaluation, and advancement toward general embodied intelligence.
ChatGPT, the AI-powered chatbot with a massive user base of hundreds of millions, has become a global phenomenon. However, the use of Conversational AI Systems (CAISs) like ChatGPT for research in the field of Social Simulation is still limited. Specifically, there is no evidence of its usage in Agent-Based Social Simulation (ABSS) model design. This paper takes a crucial first step toward exploring the untapped potential of this emerging technology in the context of ABSS model design. The research presented here demonstrates how CAISs can facilitate the development of innovative conceptual ABSS models in a concise timeframe and with minimal required upfront case-based knowledge. By employing advanced prompt engineering techniques and adhering to the Engineering ABSS framework, we have constructed a comprehensive prompt script that enables the design of conceptual ABSS models with or by the CAIS. A proof-of-concept application of the prompt script, used to generate the conceptual ABSS model for a case study on the impact of adaptive architecture in a museum environment, illustrates the practicality of the approach. Despite occasional inaccuracies and conversational divergence, the CAIS proved to be a valuable companion for ABSS modellers.
Accurately modeling and simulating complex human mobility is pivotal for evidence-based socioeconomic planning, yet remains under-explored in the era of Large Language Models (LLMs). We introduce the Return Migration Simulation (RMS) task, which focuses on predicting individual decisions to move from urban back to rural regions—a process critical for understanding urban–rural dynamics and formulating balanced development policies. The key to the RMS task lies in the in-depth reasoning over multimodal features to capture human intention and predict the individual decision. To this end, we present RMS-Agent, an LLM-powered agent endowed with latent reasoning capability. RMS-Agent first encodes multimodal features through the heterogeneous data tokenizer, where we specifically design a tabular tokenizer to convert structured table features into dense vectors compatible with the LLM. To achieve comprehensive and in-depth reasoning, we propose using multiple meta-queries to probe the LLM to reason and uncover latent intention and predict migration decision. Extensive experiments on three real-world datasets demonstrate that RMS-Agent significantly outperforms competitive machine-learning and deep-learning baselines across accuracy, F1, and AUC metrics, verifying its capacity to capture nuanced migration drivers. To summarize, this work (i) formulates a novel return migration simulation task, (ii) proposes a generalizable LLM-based agent architecture for multimodal latent reasoning, and (iii) provides a comprehensive benchmark with substantial empirical exploration for this socially significant problem, laying the groundwork for richer human-mobility modeling with LLMs in the future.
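The abstract's tabular tokenizer (structured table features converted to dense vectors compatible with an LLM) can be sketched as follows. This is an illustrative stand-in, not RMS-Agent's design: the fixed random projections replace learned embedding weights, and the schema, column names, and hashing trick are all hypothetical.

```python
import numpy as np

def tabular_tokenize(row, schema, dim=8, seed=0):
    """Toy tabular tokenizer: embed each structured column as a dense
    vector so a table row becomes a short token sequence an LLM could
    attend to. Random projections stand in for learned weights."""
    rng = np.random.default_rng(seed)
    tokens = []
    for col, kind in schema.items():
        w = rng.normal(size=(dim,))  # fixed per-column projection
        if kind == "numeric":
            tokens.append(w * float(row[col]))  # scale by the value
        else:
            # Categorical: hash the value into a pseudo-embedding.
            h = abs(hash((col, row[col]))) % (2**31)
            tokens.append(np.random.default_rng(h).normal(size=(dim,)))
    return np.stack(tokens)  # shape: (n_columns, dim)

schema = {"age": "numeric", "income": "numeric", "hometown": "categorical"}
row = {"age": 34, "income": 5200, "hometown": "rural_A"}
toks = tabular_tokenize(row, schema)
```

In the paper's pipeline, vectors like these would be fed to the LLM together with the meta-queries that probe for latent migration intention.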
The escalating frequency and complexity of natural disasters highlight the urgent need for deeper insights into how individuals and communities perceive and respond to risk information. Yet, conventional research methods—such as surveys, laboratory experiments, and field observations—often struggle with limited sample sizes, external validity concerns, and difficulties in controlling for confounding variables. These constraints hinder our ability to develop comprehensive models that capture the dynamic, context-sensitive nature of disaster decision-making. To address these challenges, we present a novel multi-stage simulation framework that integrates Large Language Model (LLM)-driven social–cognitive agents with well-established theoretical perspectives from psychology, sociology, and decision science. This framework enables the simulation of three critical phases—information perception, cognitive processing, and decision-making—providing a granular analysis of how demographic attributes, situational factors, and social influences interact to shape behavior under uncertain and evolving disaster conditions. A case study focusing on pre-disaster preventive measures demonstrates its effectiveness. By aligning agent demographics with real-world survey data across 5864 simulated scenarios, we reveal nuanced behavioral patterns closely mirroring human responses, underscoring the potential to overcome longstanding methodological limitations and offer improved ecological validity and flexibility to explore diverse disaster environments and policy interventions. While acknowledging the current constraints, such as the need for enhanced emotional modeling and multimodal inputs, our framework lays a foundation for more nuanced, empirically grounded analyses of risk perception and response patterns. By seamlessly blending theory, advanced LLM capabilities, and empirical alignment strategies, this research not only advances the state of computational social simulation but also provides valuable guidance for developing more context-sensitive and targeted disaster management strategies.
Evaluating large language models (LLMs) in clinical scenarios is crucial to assessing their potential clinical utility. Existing benchmarks rely heavily on static question-answering, which does not accurately depict the complex, sequential nature of clinical decision-making. Here, we introduce AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments that include patient interactions, multimodal data collection under incomplete information, and the usage of various tools, resulting in an in-depth evaluation across nine medical specialties and seven languages. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Overall, we observe that agents sourced from Claude-3.5 outperform other LLM backbones in most settings. Nevertheless, we see stark differences in the LLMs' ability to make use of tools, such as experiential learning, adaptive retrieval, and reflection cycles. Strikingly, Llama-3 shows up to 92% relative improvements with the notebook tool that allows for writing and editing notes that persist across cases. To further scrutinize our clinical simulations, we leverage real-world electronic health records, perform a clinical reader study, perturb agents with biases, and explore novel patient-centric metrics that this interactive environment enables for the first time.
As artificial agents develop beyond mere tools and begin to perform roles traditionally associated with humans, expectations of their performance are evolving in kind. Not only must agents be able to accomplish their tasks, but they must also do so in a manner that observers would consider socially or contextually appropriate. For social interaction where the agent and human are co-performers, adherence to social cues that signal emergent aspects of a relationship such as intimacy or status is paramount to the experience of the interacting humans. For autonomous agents that function alone, adaptive behavioral modeling and user state awareness are critical to the impact of the agent's actions on humans. Such contextual social behavior is a requirement for complex applications including physically located social robots, virtual avatars emerging in gaming, online social environments, or customer service interactions, and proactive virtual assistants. Humans have sophisticated socio-emotional capacities that enable them to behaviorally coordinate their interactions with others, inferring mental states that may lie far beyond explicit observable cues. Furthermore, emotional expressions are multimodal and are the result of a complex interaction between inherent affective states and interaction context. The Human Centered Intelligent Systems conceptual framework describes a pathway whereby artificial agents may also achieve aspects of this intelligence through rich user state modeling based on deep multimodal analysis of big data that can capture social behavior and interaction context. In this chapter, we describe this "user-state" modeling approach and exemplify its applicability to a spectrum of agent applications.
Accurately predicting the popularity of user-generated content (UGC) is essential for advancing social media analytics and recommendation systems. Existing approaches typically follow an inductive paradigm, where researchers train static models on historical data for popularity prediction. However, UGC propagation is inherently a dynamic process, and static modeling based on historical features fails to capture the complex interactions and nonlinear evolution. In this paper, we propose PopSim, a novel simulation-based paradigm for social media popularity prediction (SMPP). Unlike the inductive paradigm, PopSim leverages a large language model (LLM)-based multi-agent social network sandbox to simulate UGC propagation dynamics for popularity prediction. Specifically, to effectively model the UGC propagation process in the network, we design a social-mean-field-based agent interaction mechanism, which models the dual-channel and bidirectional individual-population interactions, enhancing agents' global perception and decision-making capabilities. In addition, we propose a multi-source information aggregation module that transforms heterogeneous social metadata into a uniform formulation for LLMs. Finally, propagation dynamics with multimodal information are fused to provide comprehensive popularity prediction. Extensive experiments on real-world datasets demonstrate that PopSim consistently outperforms state-of-the-art methods, reducing prediction error by an average of 8.82% and offering a new perspective for research on the SMPP task.
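The social-mean-field idea (each individual's next state depends on both its own state and an attention-weighted population aggregate) admits a very small numeric sketch. The update rule, `alpha`, and the attention weights below are illustrative assumptions, not PopSim's actual mechanism.

```python
def mean_field_step(engagements, attention, alpha=0.6):
    """One toy social-mean-field update: each agent blends its own
    engagement with the attention-weighted population mean field,
    capturing the individual-population dual channel."""
    field = sum(a * e for a, e in zip(attention, engagements)) / sum(attention)
    return [alpha * e + (1 - alpha) * field for e in engagements]

engagements = [0.9, 0.1, 0.4, 0.2]   # per-agent engagement with a post
attention = [2.0, 1.0, 1.0, 1.0]     # influence weights in the field
for _ in range(50):
    engagements = mean_field_step(engagements, attention)
```

Because each step contracts every agent toward the (invariant) weighted mean, the population converges to a consensus engagement level; in a full simulator that aggregate would feed the popularity estimate.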
The rise of social media has fundamentally transformed how people engage in public discourse and form opinions. While these platforms offer unprecedented opportunities for democratic engagement, they have been implicated in increasing social polarization and the formation of ideological echo chambers. Previous research has primarily relied on observational studies of social media data or theoretical modeling approaches, leaving a significant gap in our understanding of how individuals respond to and are influenced by polarized online environments. Here we present a novel experimental framework for investigating polarization dynamics that allows human users to interact with LLM-based artificial agents in a controlled social network simulation. Through a user study with 122 participants, we demonstrate that this approach can successfully reproduce key characteristics of polarized online discourse while enabling precise manipulation of environmental factors. Our results provide empirical validation of theoretical predictions about online polarization, showing that polarized environments significantly increase perceived emotionality and group identity salience while reducing expressed uncertainty. These findings extend previous observational and theoretical work by providing causal evidence for how specific features of online environments influence user perceptions and behaviors. More broadly, this research introduces a powerful new methodology for studying social media dynamics, offering researchers unprecedented control over experimental conditions while maintaining ecological validity.
With the rapid development of the new power system, the complexity of distribution network operation has put forward higher requirements for the real-time performance, accuracy, and intelligence of production command. Traditional decision-making systems face bottlenecks in heterogeneous data fusion, human–computer interaction efficiency, and the scientific rigor of decision-making. To address these challenges, this paper proposes an intelligent decision-making system for distribution network production enhanced by large language models (LLMs). The core contributions of this system are as follows: (1) A three-layer heterogeneous intelligent architecture integrating perception, cognition, and execution is constructed to realize the complete process from multimodal data input to closed-loop control; (2) An LLM-driven multimodal fusion and interaction mechanism is designed to uniformly encode unstructured information such as SCADA time-series data, on-site images, and voice commands into high-dimensional semantic features, realizing comprehensive situation awareness (SA) and natural human–computer interaction; (3) A multi-agent collaborative decision-making framework based on an improved contract net protocol (CNP) is proposed. The command agent empowered by the LLM decomposes and schedules complex tasks, driving each professional agent to perform parallel optimization. The system is verified in a typical distribution network fault-handling scenario. The results show that compared with the traditional manual method, the proposed system reduces the end-to-end decision-making time from more than 30 min to within 4 min, a reduction of more than 85%; the comprehensive accuracy of its fault location and recovery strategy exceeds 98%, and the generated recovery strategy is significantly superior to the static rule-based expert system in terms of safety and economy. This research provides an innovative paradigm and technical path for the in-depth application of large-model technology in the field of power critical infrastructure.
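The contract net protocol at the heart of contribution (3) follows an announce/bid/award cycle, which can be sketched minimally as below. The agent names, skills, and cost figures are hypothetical, and this plain CNP omits whatever improvements the paper adds.

```python
def contract_net_allocate(tasks, agents):
    """Minimal contract-net round: the command agent announces each
    task, capable specialist agents bid their estimated cost, and the
    cheapest bidder is awarded the contract."""
    awards = {}
    for task, skill in tasks.items():
        bids = {name: costs[skill] for name, costs in agents.items()
                if skill in costs}          # only capable agents bid
        if bids:
            awards[task] = min(bids, key=bids.get)  # award lowest bid
    return awards

# Hypothetical specialist agents and their per-skill cost estimates.
agents = {
    "fault_locator":  {"locate": 2.0, "isolate": 5.0},
    "switch_crew":    {"isolate": 3.0, "restore": 4.0},
    "dispatch_agent": {"restore": 2.5},
}
tasks = {"find_fault": "locate", "isolate_feeder": "isolate",
         "restore_load": "restore"}
awards = contract_net_allocate(tasks, agents)
```

In the paper's framework an LLM-empowered command agent would first decompose a fault-handling goal into such tasks before this allocation step runs in parallel.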
Multimodal hate speech detection targets offensive content expressed through combinations of modalities such as text and images, content that often evades detection when each modality is analyzed separately. We introduce MAXplain, an interactive framework that addresses these challenges via a configurable LLM-based multi-agent architecture. Specialized agents handle distinct subtasks and exchange information through structured dialogues, enabling intrinsic explainability and improved accuracy. The web interface supports human-in-the-loop interaction, including real-time adjustment of agent behaviors and evaluation rules. A browser plugin enables direct inspection of online content. While demonstrated for hate speech detection, MAXplain also supports rapid prototyping for other multimodal tasks.
The rise of Multi-Agent Systems (MAS) in Artificial Intelligence (AI), especially integrated with Large Language Models (LLMs), has greatly facilitated the resolution of complex tasks. However, current systems are still facing challenges of inter-agent communication, coordination, and interaction with heterogeneous tools and resources. Most recently, the Model Context Protocol (MCP) by Anthropic and Agent-to-Agent (A2A) communication protocol by Google have been introduced, and to the best of our knowledge, very few applications exist where both protocols are employed within a single MAS framework. We present a pilot study of AgentMaster, a novel modular multi-protocol MAS framework with self-implemented A2A and MCP, enabling dynamic coordination, flexible communication, and rapid development with faster iteration. Through a unified conversational interface, the system supports natural language interaction without prior technical expertise and responds to multimodal queries for tasks including information retrieval, question answering, and image analysis. The experiments are validated through both human evaluation and quantitative metrics, including BERTScore F1 (96.3%) and LLM-as-a-Judge G-Eval (87.1%). These results demonstrate robust automated inter-agent coordination, query decomposition, task allocation, dynamic routing, and domain-specific relevant responses. Overall, our proposed framework contributes to the potential capabilities of domain-specific, cooperative, and scalable conversational AI powered by MAS.
This research develops a novel virtual teacher personalized interaction model integrating multimodal affective computing with multi-agent coordination mechanisms to address fundamental limitations in emotional intelligence and adaptive capabilities within contemporary educational technology systems. A three-layer distributed architecture was implemented, incorporating synchronized multimodal emotion recognition through confidence-weighted fusion of facial, vocal, and textual data streams, Byzantine Fault Tolerant consensus algorithms for coordinated multi-agent decision-making, and dynamic personality adaptation mechanisms based on Big Five psychological modeling. Experimental validation employed 500 participants across diverse educational contexts using established emotion recognition benchmarks supplemented with domain-specific educational interaction datasets. The multimodal emotion fusion component achieved 91.2% recognition accuracy, with overall system performance reaching 89.7% under realistic educational conditions while demonstrating substantial educational effectiveness improvements, including 43% higher learner engagement scores, 37% emotional satisfaction enhancement, 30% learning effectiveness increase, and 40% knowledge retention improvement compared to traditional virtual teaching approaches. Multi-agent coordination exhibited superior decision quality with 31% improvement over single-agent baselines, though personality adaptation effectiveness varied significantly across learner populations, with 88% success rates for extraverted individuals compared to 65% for high-neuroticism learners. The integrated approach successfully bridges the emotional intelligence gap in virtual educational systems through sophisticated technological convergence, establishing theoretical foundations for distributed educational intelligence while revealing important implementation challenges. This research enables the development of emotionally responsive virtual teachers capable of sustained personalized instruction across diverse educational contexts, though deployment requires careful consideration of privacy protection and institutional adaptation requirements for broader educational technology transformation.
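The confidence-weighted fusion step can be sketched as follows: each modality contributes its class probabilities scaled by its own confidence. The modality names, confidence values, and emotion labels are illustrative; the system's actual trained fusion network is not reproduced here.

```python
def fuse_emotions(predictions):
    """Confidence-weighted fusion of per-modality emotion scores:
    every modality votes with its class probabilities, scaled by its
    normalized confidence, and the top fused class wins."""
    fused = {}
    total_conf = sum(p["conf"] for p in predictions.values())
    for p in predictions.values():
        weight = p["conf"] / total_conf
        for emotion, score in p["scores"].items():
            fused[emotion] = fused.get(emotion, 0.0) + weight * score
    return max(fused, key=fused.get), fused

# Hypothetical per-modality outputs for one learner interaction.
predictions = {
    "face":  {"conf": 0.9, "scores": {"frustrated": 0.7, "engaged": 0.3}},
    "voice": {"conf": 0.5, "scores": {"frustrated": 0.4, "engaged": 0.6}},
    "text":  {"conf": 0.6, "scores": {"frustrated": 0.2, "engaged": 0.8}},
}
label, fused = fuse_emotions(predictions)
```

Note that the high-confidence facial channel does not automatically win: because each per-modality distribution sums to one and the weights are normalized, the fused scores remain a proper distribution over emotions.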
Current AI interview systems face challenges in achieving natural interaction, conducting multidimensional assessments, and ensuring interpretability. This paper proposes and validates an intelligent evaluation framework integrating multi-agent collaboration with 3D virtual digital humans (VDH). The system enables complex question-answering and multidimensional evaluation through the coordinated operation of four functional agents: resume analysis, interview skills, written-test training, and job recommendation. Leveraging a 3D virtual digital human interviewer, the system supports multimodal fusion interaction encompassing voice, text, and facial expressions. It automatically generates interview questions, processes multimodal data, and produces multidimensional scoring alongside interpretable feedback reports. Experiments demonstrate that the system significantly enhances the immersion and naturalness of human-machine interaction while strengthening the objectivity, professionalism, and explainability of evaluations. It provides an innovative solution and theoretical foundation for intelligent talent assessment.
Couples therapy, or relationship counseling, helps partners resolve conflicts, improve satisfaction, and foster psychological growth. Traditional approaches to training couples therapists, such as textbooks and roleplay, often fail to capture the complexity and emotional nuance of real couple dynamics. We present a novel multimodal, multi-agent simulation system that models multi-party interactions in couples therapy. Informed by our systematic research, this system creates a low-stakes environment for trainee therapists to gain valuable practical experience dealing with the critical demand-withdraw communication cycle across six couple-interaction stages. In an evaluation study involving 21 US-based licensed therapists, participants blind to conditions identified the engineered agent behaviors (i.e., the stages and the demand-withdraw cycle) and rated overall realism and agent responses higher for the experimental system than the baseline. As the first known multi-agent framework for training couples therapists, our work builds the foundation for future research that fuses HCI technologies with couples therapy.
AI-based fitness coaching systems are typically monolithic and opaque, limiting adaptability, transparency, and embodied interaction. We propose a protocol-integrated Digital Twin (DT) architecture that reimagines fitness coaching as a distributed, explainable, and emotionally adaptive ecosystem. The framework adopts a CrewAI-inspired multi-agent design, where specialized agents for posture analysis, speech, physiological sensing, and personalized recommendation collaborate through the Agent-to-Agent (A2A) protocol to enable secure and interoperable task delegation. Context is maintained through short- and long-term memory modules, while the Model Context Protocol (MCP) supports flexible tool and model invocation across heterogeneous AI resources. Transparency and efficiency are ensured with NVIDIA AgentIQ and LangSmith, which provide token-level observability, workflow profiling, and trajectory evaluation. Real-time coaching feedback is synthesized into multimodal outputs (text, speech, and embodied avatars) using Audio2Face and Omniverse, creating expressive and emotionally engaging interactions. This paper presents a blueprint for protocol-driven, multimodal DTs in health and well-being. By combining interoperability, observability, and embodied feedback, it advances beyond centralized assistants toward distributed, memory-aware, and emotionally intelligent digital coaches, laying the foundation for next-generation human-AI collaboration in multimedia health applications.
We introduce DTAIFC, a modular Digital Twin AI Fitness Coaching system that delivers personalized feedback through multimodal interaction. The system combines OpenPose-based skeletal tracking with a Crew-inspired multi-agent architecture to analyze user posture and provide biomechanically grounded coaching in natural language and voice. At its core, an Orchestrator Agent coordinates Feedback and Recommendation Agents, leveraging short-term memory (Redis) for real-time session context and long-term memory (PostgreSQL) for user-specific historical insight. Language generation is powered by GPT-4, enabling adaptive, context-aware feedback through prompt-driven reasoning. DTAIFC operates asynchronously through a lightweight web interface, supporting input via static images, voice commands, and text queries. Unlike real-time systems that depend on continuous video or wearables, DTAIFC offers a scalable, privacy-conscious solution for intelligent fitness guidance in virtual environments. This framework establishes a new paradigm for memory-augmented, agentic AI coaching, advancing the integration of digital twins in human-centered applications.
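The short-/long-term memory split that DTAIFC describes can be sketched in a few lines. In this illustrative sketch, plain in-memory structures stand in for the paper's Redis (short-term) and PostgreSQL (long-term) stores; the class and method names are our own and not taken from the system:

```python
from collections import deque

class MemoryAugmentedCoach:
    """Toy orchestrator memory: a bounded recent-context buffer plus a
    persistent per-user history, mimicking the Redis/PostgreSQL split."""

    def __init__(self, short_term_window: int = 5):
        self.short_term = deque(maxlen=short_term_window)  # recent session turns
        self.long_term = {}  # user_id -> list of consolidated session logs

    def observe(self, event: str) -> None:
        # Oldest turns fall out automatically once the window is full.
        self.short_term.append(event)

    def end_session(self, user_id: str) -> None:
        # Consolidate the session into long-term memory, then clear context.
        self.long_term.setdefault(user_id, []).append(list(self.short_term))
        self.short_term.clear()

    def build_context(self, user_id: str) -> dict:
        # An orchestrator agent would fold this context into the LLM prompt.
        return {
            "recent": list(self.short_term),
            "history_sessions": len(self.long_term.get(user_id, [])),
        }
```

The bounded `deque` gives the "real-time session context" behavior cheaply; a production system would add retrieval over the long-term store rather than just counting sessions.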
Multi-agent systems (MAS) built on multimodal large language models exhibit strong collaboration and performance. However, their growing openness and interaction complexity pose serious risks, notably jailbreak and adversarial attacks. Existing defenses typically rely on external guard modules, such as dedicated safety agents, to handle unsafe behaviors. Unfortunately, this paradigm faces two challenges: (1) standalone agents offer limited protection, and (2) their independence leads to single-point failure: if compromised, system-wide safety collapses. Naively increasing the number of guard agents further raises cost and complexity. To address these challenges, we propose Evo-MARL, a novel multi-agent reinforcement learning (MARL) framework that enables all task agents to jointly acquire defensive capabilities. Rather than relying on external safety modules, Evo-MARL trains each agent to simultaneously perform its primary function and resist adversarial threats, ensuring robustness without increasing system overhead or single-node failure. Furthermore, Evo-MARL integrates evolutionary search with parameter-sharing reinforcement learning to co-evolve attackers and defenders. This adversarial training paradigm internalizes safety mechanisms and continually enhances MAS performance under co-evolving threats. Experiments show that Evo-MARL reduces attack success rates by up to 22% while boosting accuracy by up to 5% on reasoning tasks, demonstrating that safety and utility can be jointly improved.
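The attacker-defender co-evolution idea can be illustrated with a deliberately tiny loop. Here scalar "strengths" replace real policies, and every detail (truncation selection, mutation size, fitness as margin against the strongest opponent) is an invented stand-in for Evo-MARL's actual evolutionary-search-plus-RL procedure:

```python
import random

def co_evolve(generations: int = 20, pop: int = 8, seed: int = 0):
    """Toy attacker-defender co-evolution: both populations are repeatedly
    ranked against the other side's current best, truncated, and mutated."""
    rng = random.Random(seed)
    defenders = [rng.random() for _ in range(pop)]  # defensive strength in [0, 1]
    attackers = [rng.random() for _ in range(pop)]  # attack strength in [0, 1]
    for _ in range(generations):
        # Fitness: margin against the strongest current opponent, so the
        # two populations co-adapt rather than optimizing in isolation.
        defenders.sort(key=lambda d: d - max(attackers), reverse=True)
        attackers.sort(key=lambda a: a - max(defenders), reverse=True)
        # Truncation selection: keep the top half, refill with mutated copies.
        defenders = defenders[: pop // 2] + [
            min(1.0, d + rng.uniform(0.0, 0.1)) for d in defenders[: pop // 2]]
        attackers = attackers[: pop // 2] + [
            min(1.0, a + rng.uniform(0.0, 0.1)) for a in attackers[: pop // 2]]
    return max(defenders), max(attackers)
```

The point of the sketch is the structure of the loop, not the numbers: in the real framework the "mutation" step is reinforcement learning with shared parameters, and fitness is measured on tasks and attack outcomes.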
As cities evolve toward more complex and multimodal transportation systems, the need for human-centered multi-agent simulation tools has never been more urgent. Yet most existing platforms remain limited—they often separate different types of road users, rely on scripted or pre-defined behaviors, largely overlook public transit users as active, embodied participants, and are rarely designed with accessibility in mind for non-technical users. To address this gap, the following paper presents the detailed specifications of a multi-agent simulation platform designed to support real-time, human-centered, and immersive studies of all road users, accompanied by open-sourced scripts for replication. Using high fidelity immersive virtual environments, our platform enables interaction across public transit users, pedestrians, cyclists, automated vehicles, and drivers. The architecture is modular, extensible, and designed for accessibility. The system integrates hardware-specific modules—including an omnidirectional treadmill for pedestrians, a seating arrangement for public transit users, a smart trainer for cyclists, and an actuated cockpit for drivers. Additionally, the platform simultaneously collects multimodal physiological, neurological, and behavioral data through embedded sensing devices such as functional near-infrared spectroscopy (fNIRS) for brain activity, eye tracking, and wrist-based biosensors. To show the usability of this system, we present three use cases across various areas of road user research. Simulation for All aims to pave the way for lowering the barrier to entry for high-fidelity transportation simulation, support experimentation across disciplines, and advance our understanding of multimodal mobility in increasingly complex urban environments. The codebase for the platform is available at https://osf.io/6sa8q, enabling replication and adaptation for transportation research applications.
The integration of Large Language Models (LLMs) with specialized tools presents new opportunities for intelligent automation systems. However, orchestrating multiple LLM-driven agents to tackle complex tasks remains challenging due to coordination difficulties, inefficient resource utilization, and inconsistent information flow. We present OmniNova, a modular multi-agent automation framework that combines language models with specialized tools such as web search, crawling, and code execution capabilities. OmniNova introduces three key innovations: (1) a hierarchical multi-agent architecture with distinct coordinator, planner, supervisor, and specialist agents; (2) a dynamic task routing mechanism that optimizes agent deployment based on task complexity; and (3) a multi-layered LLM integration system that allocates appropriate models to different cognitive requirements. Our evaluations across 50 complex tasks in research, data analysis, and web interaction domains demonstrate that OmniNova outperforms existing frameworks in task completion rate (87\% vs. baseline 62\%), efficiency (41\% reduced token usage), and result quality (human evaluation score of 4.2/5 vs. baseline 3.1/5). We contribute both a theoretical framework for multi-agent system design and an open-source implementation that advances the state-of-the-art in LLM-based automation systems.
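OmniNova's dynamic task routing, allocating agents and model tiers by task complexity, can be sketched as a simple dispatch function. The tier names, thresholds, and agent lists below are hypothetical illustrations, not the framework's actual configuration:

```python
def route_task(task: str, complexity: float) -> dict:
    """Route a task to an agent pipeline and model tier based on an
    estimated complexity score in [0, 1] (thresholds are invented)."""
    if complexity < 0.3:
        # Simple lookups: a single specialist with a lightweight model.
        tier, agents = "light", ["specialist"]
    elif complexity < 0.7:
        # Moderate tasks: planning plus execution.
        tier, agents = "standard", ["planner", "specialist"]
    else:
        # Complex tasks: full hierarchy with supervision.
        tier, agents = "heavy", ["coordinator", "planner", "supervisor", "specialist"]
    return {"task": task, "model_tier": tier, "agents": agents}
```

This mirrors the abstract's claim that different cognitive requirements get different models: cheap models handle routine steps, and the full coordinator/planner/supervisor/specialist hierarchy is only spun up when needed.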
In recommender systems, online A/B testing is a crucial method for evaluating the performance of different models. However, conducting online A/B testing often presents significant challenges, including substantial economic costs, user experience degradation, and considerable time requirements. With the powerful capacity of Large Language Models, LLM-based agents show great potential to replace traditional online A/B testing. Nonetheless, current agents fail to simulate the perception process and interaction patterns, due to the lack of real environments and visual perception capability. To address these challenges, we introduce a multi-modal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing, enabling multimodal and multi-page interactions that align with real user behavior on online platforms. The designed agent leverages multimodal information perception, fine-grained user preferences, and integrates profiles, action memory retrieval, and a fatigue system to simulate complex human decision-making. We validated the potential of the agent as an alternative to traditional A/B testing from three perspectives: model, data, and features. Furthermore, we found that the data generated by A/B Agent can effectively enhance the capabilities of recommendation models. Our code is publicly available at https://github.com/Applied-Machine-Learning-Lab/ABAgent.
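A fatigue system of the kind the A/B Agent abstract mentions can be modeled minimally as decaying willingness to keep browsing. The exponential form, rates, and threshold below are our own toy formulation, not the paper's mechanism:

```python
import math

def continue_probability(items_viewed: int, base_interest: float = 0.9,
                         fatigue_rate: float = 0.15) -> float:
    """Toy fatigue model: interest in continuing the session decays
    exponentially with the number of items already viewed."""
    return base_interest * math.exp(-fatigue_rate * items_viewed)

def simulate_session(max_items: int = 50, threshold: float = 0.3) -> int:
    """The simulated user leaves once continue-probability drops below
    the threshold; returns the number of items viewed before leaving."""
    for n in range(max_items):
        if continue_probability(n) < threshold:
            return n
    return max_items
```

Coupling such a fatigue term with profile- and memory-driven preferences is what lets an agent population reproduce session-length attrition patterns that a preference-only simulator would miss.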
No abstract available
Designing effective human-robot interaction (HRI) for multi-Autonomous Guided Vehicle (AGV) systems in manufacturing remains a significant challenge. While existing modeling tools offer formal representations for system behavior or task flow, they lack support for explicitly modeling multimodal, bidirectional communication between humans and distributed autonomous agents. In this paper, we introduce HASIGN: Human-Autonomous System Interaction Graphical Notation in order to make the design space of human-multi-AGV interaction explicit and tractable. Developed through a Research through Design approach, HASIGN integrates agent roles, interaction modalities, and temporal intent communication into a unified and practical representation. We apply HASIGN to five diverse industrial case studies drawn from ongoing research and development projects. These cases demonstrate the notation's flexibility and domain suitability, while uncovering unexplored areas in the interaction design space. Rather than aiming to replace existing modeling tools, HASIGN complements them by focusing on human-centered communication in autonomous systems. This paper contributes a visual design tool for practitioners and researchers, and lays the foundation for further evaluation, standardization, and adoption in the context of human-centered autonomous intelligent systems (HCAIS).
Existing interactive learning systems usually train models on simulators as surrogates for real users. Due to the limited amount of user data, trained simulators may lead to biased results as they fail to faithfully represent real users. One solution is to model users as agents, and then simultaneously train the interactive system and user agents by multi-agent reinforcement learning (MARL) frameworks. However, developing efficient MARL frameworks for modern interactive multimodal systems is still challenging. First, given the existence of multimodal data, how to develop accurate multimodal fusion within and between agents in each interaction is challenging and unclear. Second, interactions between users and systems are complex and it is challenging to track and synchronize the interactions over time. The above multimodal fusion between agents and synchronization over time becomes even more challenging when the amount of user data is limited. To jointly address these challenges and achieve more sample-efficient learning, we propose a novel spatial-temporal aligned (STA) multi-agent reinforcement learning framework to better align the multimodal data within and between agents over time. Based on our framework, we develop sample-efficient visual dialog systems. Through extensive experiments and analysis, we validate the effectiveness of our STA multi-agent reinforcement learning framework in visual dialog systems.
To improve stock trend predictions and support personalized investment decisions, this paper proposes FinArena, a novel Human-Agent collaboration framework. Inspired by the mixture of experts (MoE) approach, FinArena combines multimodal financial data analysis with user interaction. The human module features an interactive interface that captures individual risk preferences, allowing personalized investment strategies. The machine module utilizes a Large Language Model-based (LLM-based) multi-agent system to integrate diverse data sources, such as stock prices, news articles, and financial statements. To address hallucinations in LLMs, FinArena employs an adaptive Retrieval-Augmented Generation (RAG) method for processing unstructured news data. Finally, a universal expert agent makes investment decisions based on the features extracted from multimodal data and investors' individual risk preferences. Extensive experiments show that FinArena surpasses both traditional and state-of-the-art benchmarks in stock trend prediction and yields promising results in trading simulations across various risk profiles. These findings highlight FinArena's potential to enhance investment outcomes by aligning strategic insights with personalized risk considerations.
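The decision step, a universal expert aggregating per-source signals under a risk preference, can be sketched as a risk-weighted vote. The signal names, weighting scheme, and thresholds are invented for illustration and are loosely MoE-flavored rather than FinArena's actual rule:

```python
def aggregate_experts(signals: dict, risk_tolerance: float) -> str:
    """Combine expert scores in [-1, 1] into a decision; risk tolerance
    shifts weight from slow, stable signals to fast, volatile ones."""
    weights = {
        "price_trend": 1.0,
        "news_sentiment": risk_tolerance,       # risk-seekers trust fast signals
        "fundamentals": 2.0 - risk_tolerance,   # risk-averse lean on fundamentals
    }
    score = sum(weights[k] * signals[k] for k in weights)
    if score > 0.5:
        return "buy"
    if score < -0.5:
        return "sell"
    return "hold"
```

The key design point the abstract suggests is that the same expert outputs yield different actions for different users, so personalization lives in the aggregation weights rather than in the experts themselves.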
Multimodal Question Answering (MMQA) is crucial as it enables comprehensive understanding and accurate responses by integrating insights from diverse data representations such as tables, charts, and text. Most existing research in MMQA focuses on only two modalities, such as image-text QA, table-text QA, and chart-text QA, and studies that investigate the joint analysis of text, tables, and charts remain notably scarce. In this paper, we present CT2C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts, meticulously compiled from 200 selectively sourced webpages. Our dataset simulates real webpages and serves as a strong test of a model's capability to analyze and reason with multimodal data, because the answer to a question could appear in various modalities, or even potentially not exist at all. Additionally, we present AED (Allocating, Expert and Decision), a multi-agent system implemented through collaborative deployment, information interaction, and collective decision-making among different agents. Specifically, the Assignment Agent is in charge of selecting and activating expert agents, including those proficient in text, tables, and charts. The Decision Agent bears the responsibility of delivering the final verdict, drawing upon the analytical insights provided by these expert agents. We execute a comprehensive analysis, comparing AED with various state-of-the-art models in MMQA, including GPT-4. The experimental outcomes demonstrate that current methodologies, including GPT-4, are yet to meet the benchmarks set by our dataset.
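The Allocating-Expert-Decision flow can be condensed into a small function. The confidence-based activation and highest-confidence vote below are our own heuristics standing in for AED's actual agents:

```python
def answer_mmqa(question: str, modal_hits: dict) -> str:
    """modal_hits maps modality -> (answer, confidence) as returned by the
    text/table/chart expert agents. Returns the selected final answer."""
    # Allocating step: only keep experts whose modality produced a signal.
    active = {m: hit for m, hit in modal_hits.items() if hit[1] > 0.0}
    if not active:
        # CT2C-QA deliberately includes unanswerable questions.
        return "no answer"
    # Decision step: pick the expert answer with the highest confidence.
    modality = max(active, key=lambda m: active[m][1])
    return active[modality][0]
```

A real Decision Agent would reason over the experts' rationales rather than a single scalar, but the structure (allocate, consult experts, adjudicate) is the same.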
No abstract available
No abstract available
No abstract available
To address the challenges of dynamic adversarial scenario modeling distortion, insufficient cross-institutional data privacy protection, and simplistic evaluation systems in collegiate basketball tactical education, this study proposes and validates an immersive instructional system integrating digital twin and federated learning technologies. The four-tier architecture (sensing layer, digital twin layer, federated layer, and interaction layer) synthesizes multimodal data (motion trajectories and physiological signals) with Multi-Agent Reinforcement Learning (MARL) to enable virtual–physical integrated tactical simulation and real-time error correction. Experimental results demonstrate that the experimental group achieved 35.2% higher tactical execution accuracy (TEA) (p < 0.01), 1.8 s faster decision making (p < 0.05), and 47% improved team coordination efficiency compared to the controls. The hierarchical federated learning framework (trajectory ε = 0.8; physiology ε = 0.3) maintained model precision loss at 2.4% while optimizing communication efficiency by 23%, ensuring privacy preservation. A novel three-dimensional “Skill–Creativity–Load” evaluation system revealed a 22% increase in unconventional tactical applications (p = 0.013) through the Tactical Creativity Index (TCI). By implementing lightweight federated architecture with dynamic cognitive offloading mechanisms, the system enables resource-constrained institutions to achieve 87% of the pedagogical effectiveness observed in elite programs, offering an innovative solution to reconcile educational equity with technological ethics. Future research should focus on long-term skill transfer, multimodal adaptive learning, and ethical framework development to advance intelligent sports education from efficiency-oriented paradigms to competency-based transformation.
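The per-modality privacy budgets the abstract reports (trajectory ε = 0.8, physiology ε = 0.3) follow the standard differential-privacy pattern of adding more noise to more sensitive streams. The sketch below uses the textbook Laplace mechanism; the function names and the way it would slot into the federated pipeline are our assumptions:

```python
import math
import random

def privatize(value: float, sensitivity: float, epsilon: float,
              rng: random.Random) -> float:
    """Laplace mechanism: add Laplace(0, sensitivity/epsilon) noise.
    Smaller epsilon (e.g. physiology, 0.3) means a wider noise scale and
    stronger privacy than a larger one (e.g. trajectories, 0.8)."""
    scale = sensitivity / epsilon
    # Inverse-CDF sampling of Laplace(0, scale) from a uniform draw.
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    noise = -scale * sign * math.log(1.0 - 2.0 * abs(u))
    return value + noise
```

In a hierarchical federated setting each client would privatize its modality-specific updates with the matching ε before aggregation, which is consistent with the abstract's reported trade-off between precision loss and privacy.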
No abstract available
No abstract available
When working in a group, it is essential to understand each other’s viewpoints to increase group cohesion and meeting productivity. This can be challenging in teams: participants might be left misunderstood and the discussion could be going around in circles. To tackle this problem, previous research on group interactions has addressed topics such as dominance detection, group engagement, and group creativity. Conversational memory, however, remains a widely unexplored area in the field of multimodal analysis of group interaction. The ability to track what each participant or a group as a whole find memorable from each meeting would allow a system or agent to continuously optimise its strategy to help a team meet its goals. In the present paper, we therefore investigate what participants take away from each meeting and how it is reflected in group dynamics. As a first step toward such a system, we recorded a multimodal longitudinal meeting corpus (MEMO), which comprises a first-party annotation of what participants remember from a discussion and why they remember it. We investigated whether participants of group interactions encode what they remember non-verbally and whether we can use such non-verbal multimodal features to automatically predict what groups are likely to remember. We devise a coding scheme to cluster participants’ memorisation reasons into higher-level constructs. We find that low-level multimodal cues, such as gaze and speaker activity, can predict conversational memorability. We also find that non-verbal signals can indicate when a memorable moment starts and ends. We could predict four levels of conversational memorability with an average accuracy of 44%. We also showed that reasons related to participants’ personal feelings and experiences are the most frequently mentioned grounds for remembering meeting segments.
An increasingly large amount of multimodal content is posted on social media websites such as YouTube and Facebook every day. To cope with the growth of such multimodal data, there is an urgent need to develop an intelligent multi-modal analysis framework that can effectively extract information from multiple modalities. In this paper, we propose a novel multimodal information extraction agent, which infers and aggregates the semantic and affective information associated with user-generated multimodal data in contexts such as e-learning, e-health, automatic video content tagging and human-computer interaction. In particular, the developed intelligent agent adopts an ensemble feature extraction approach by exploiting the joint use of tri-modal (text, audio and video) features to enhance the multimodal information extraction process. In preliminary experiments using the eNTERFACE dataset, our proposed multi-modal system is shown to achieve an accuracy of 87.95%, outperforming the best state-of-the-art system by more than 10%, or in relative terms, a 56% reduction in error rate.
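A common baseline for combining tri-modal predictions is late fusion, averaging per-modality label probabilities. The equal-weight averaging below is that generic baseline, not necessarily the ensemble scheme this particular agent used:

```python
def late_fusion(tri_modal_scores: dict, weights=None) -> dict:
    """tri_modal_scores: modality -> {label: probability}. Returns fused
    label probabilities as a (optionally weighted) average across modalities."""
    weights = weights or {m: 1.0 for m in tri_modal_scores}
    total = sum(weights.values())
    labels = next(iter(tri_modal_scores.values())).keys()
    return {
        lab: sum(weights[m] * tri_modal_scores[m][lab]
                 for m in tri_modal_scores) / total
        for lab in labels
    }
```

Feature-level (early) fusion, concatenating text, audio, and video features before classification, is the main alternative; ensemble approaches like the one described often combine both.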
No abstract available
The ability to represent emotion plays a significant role in human cognition and social interaction, yet the high-dimensional geometry of this affective space and its neural underpinnings remain debated. A key challenge, the 'behavior-neural gap,' is the limited ability of human self-reports to predict brain activity. Here we test the hypothesis that this gap arises from the constraints of traditional rating scales and that large-scale similarity judgments can more faithfully capture the brain's affective geometry. Using AI models as 'cognitive agents,' we collected millions of triplet odd-one-out judgments from a multimodal large language model (MLLM) and a language-only model (LLM) in response to 2,180 emotionally evocative videos. We found that the emergent 30-dimensional embeddings from these models are highly interpretable and organize emotion primarily along categorical lines, yet in a blended fashion that incorporates dimensional properties. Most remarkably, the MLLM's representation predicted neural activity in human emotion-processing networks with the highest accuracy, outperforming not only the LLM but also, counterintuitively, representations derived directly from human behavioral ratings. This result supports our primary hypothesis and suggests that sensory grounding (learning from rich visual data) is critical for developing a truly neurally-aligned conceptual framework for emotion. Our findings provide compelling evidence that MLLMs can autonomously develop rich, neurally-aligned affective representations, offering a powerful paradigm to bridge the gap between subjective experience and its neural substrates. Project page: https://reedonepeck.github.io/ai-emotion.github.io/.
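The triplet odd-one-out task used to collect these judgments has a simple operational form: given three items, pick the one least similar to the other two. The dot-product similarity over toy 2-D embeddings below is our own minimal illustration of the task, not the models' actual judgment process:

```python
def odd_one_out(embeddings: dict, triplet: tuple) -> str:
    """Return the item whose removal leaves the most similar pair, i.e.
    the item least similar to the other two."""
    a, b, c = triplet

    def sim(x, y):
        # Inner-product similarity over the toy embedding vectors.
        return sum(p * q for p, q in zip(embeddings[x], embeddings[y]))

    # Each candidate's score is the similarity of the *remaining* pair.
    pair_sim = {a: sim(b, c), b: sim(a, c), c: sim(a, b)}
    return max(pair_sim, key=pair_sim.get)
```

Aggregating millions of such choices constrains a similarity structure far more richly than per-item rating scales, which is the methodological point the abstract builds on.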
Robots operating in dynamic environments must continuously perceive, learn, and adapt to function autonomously. In line with the Artificial Cognition (ACo) approach, we present an Always-On cognitive architecture that enables robots to continuously perceive and build a self-supervised, emergent representation of the environment to support future proactive behavior. The architecture combines sensor fusion, efficient multimodal in-memory representation of perception, and the self-organization of personal experiences through memory consolidation. We validated the system in both our laboratory and at the Humanoids 2024 conference, where it autonomously learned to distinguish between lively episodes, characterized by high social activity, and calm episodes of minimal interaction. This foundational distinction lays the groundwork for context-awareness and proactive behavior. This work represents a step toward truly autonomous, self-improving robotic agents, paving the way for more intelligent and continuously learning cognitive systems.
Can large language model (LLM) agents reproduce the complex social dynamics that characterize human online behavior (shaped by homophily, reciprocity, and social validation), and what memory and learning mechanisms enable such dynamics to emerge? We present a multi-agent LLM simulation framework in which agents repeatedly interact, evaluate one another, and adapt their behavior through in-context learning accelerated by a coaching signal. To model human social behavior, we design behavioral reward functions that capture core drivers of online engagement, including social interaction, information seeking, self-presentation, coordination, and emotional support. These rewards align agent objectives with empirically observed user motivations, enabling the study of how network structures and group formations emerge from individual decision-making. Our experiments show that coached LLM agents develop stable interaction patterns and form emergent social ties, yielding network structures that mirror properties of real online communities. By combining behavioral rewards with in-context adaptation, our framework establishes a principled testbed for investigating collective dynamics in LLM populations and reveals how artificial agents may approximate or diverge from human-like social behavior.
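A behavioral reward of the kind this abstract describes is, at its simplest, a weighted sum over engagement drivers observed in an interaction. The driver names follow the abstract; the weighted-sum form and any specific weights are our own illustration:

```python
def behavioral_reward(event: dict, weights: dict) -> float:
    """Score one interaction event: each engagement driver observed in the
    event (a value in [0, 1]) contributes its weighted amount."""
    drivers = ["social_interaction", "information_seeking",
               "self_presentation", "coordination", "emotional_support"]
    return sum(weights.get(d, 0.0) * event.get(d, 0.0) for d in drivers)
```

Varying the weight vector per agent gives each simulated user a distinct motivation profile, which is what lets homophily-like tie formation emerge from individually scored interactions.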
We investigate the emergent social dynamics of Large Language Model (LLM) agents in a spatially extended El Farol Bar problem, observing how they autonomously navigate this classic social dilemma. In these simulations, the LLM agents generated a spontaneous motivation to go to the bar and changed their decision-making as they became a collective. We also observed that the LLM agents did not solve the problem completely, but rather behaved more like humans. These findings reveal a complex interplay between external incentives (prompt-specified constraints such as the 60% threshold) and internal incentives (culturally-encoded social preferences derived from pre-training), demonstrating that LLM agents naturally balance formal game-theoretic rationality with social motivations that characterize human behavior. These findings suggest that LLM agents can realize a model of group decision-making that previous game-theoretic problem settings could not capture.
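For readers unfamiliar with the setup, the classic (non-LLM) El Farol baseline against which such agents are compared is a one-line prediction rule: go only if your memory of past attendance predicts the bar will be under capacity. This numeric rule is the standard textbook formulation, not the paper's LLM agents:

```python
def el_farol_round(memories: list, capacity: float = 0.6) -> list:
    """One round of the classic El Farol rule. memories holds, per agent,
    a list of past attendance rates; each agent attends (True) only if its
    mean prediction is below the capacity threshold."""
    decisions = []
    for past_rates in memories:
        prediction = sum(past_rates) / len(past_rates) if past_rates else 0.0
        decisions.append(prediction < capacity)
    return decisions
```

The interplay the abstract highlights is precisely what this rule lacks: LLM agents layer socially motivated preferences from pre-training on top of the threshold incentive, so their attendance dynamics deviate from the purely predictive baseline.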
This paper introduces the Multimodal Socialized Learning Framework (M-S2L), designed to foster emergent social intelligence in AI agents by integrating Multimodal Large Language Models (M-LLMs) with social learning mechanisms. The framework equips agents with multimodal perception (vision and text) and structured action capabilities, enabling physical manipulation and grounded multimodal communication (e.g., text with visual pointers). M-S2L combines direct reinforcement learning with two novel social learning pathways: multimodal observational learning and communication-driven learning from feedback, augmented by an episodic memory system for long-term social context. We evaluate M-S2L in a Collaborative Assembly Environment (CAE), where agent teams must construct complex devices from ambiguous blueprints under informational asymmetry. Across tasks of increasing complexity, M-S2L agents consistently outperform Text-Only and No-Social-Learning baselines in Task Completion Rate and Time to Completion, particularly in dynamic problem-solving scenarios. Ablation studies confirm the necessity of both multimodality and socialized learning. Our analysis reveals the emergence of efficient communication protocols integrating visual pointers with concise text, alongside rapid role specialization leading to stable labor division. Qualitative case studies demonstrate agents' abilities for shared awareness, dynamic re-planning, and adaptive problem-solving, suggesting a nascent form of machine social cognition. These findings indicate that integrating multimodal perception with explicit social learning is critical for developing human-like collaborative intelligence in multi-agent systems.
As LLM-based agents become increasingly autonomous and will more freely interact with each other, studying the interplay among them becomes crucial to anticipate emergent phenomena and potential risks. In this work, we provide an in-depth analysis of the interactions among agents within a simulated hierarchical social environment, drawing inspiration from the Stanford Prison Experiment. Leveraging 2,400 conversations across six LLMs (i.e., LLama3, Orca2, Command-r, Mixtral, Mistral2, and gpt4.1) and 240 experimental scenarios, we analyze persuasion and anti-social behavior between a guard and a prisoner agent with differing objectives. We first document model-specific conversational failures in this multi-agent power dynamic context, thereby narrowing our analytic sample to 1,600 conversations. Among models demonstrating successful interaction, we find that goal setting significantly influences persuasiveness but not anti-social behavior. Moreover, agent personas, especially the guard's, substantially impact both successful persuasion by the prisoner and the manifestation of anti-social actions. Notably, we observe the emergence of anti-social conduct even in absence of explicit negative personality prompts. These results have important implications for the development of interactive LLM agents and the ongoing discussion of their societal impact.
Affect-aware agents are one way to make social conflict between non-player characters in games feel more believable. Instead of relying only on fixed scripts, these agents maintain an internal model of emotion and social preferences that reacts to in-game events and to how other characters behave. When anger, gratitude, fear, or ambition change over time, the result may be alliance, betrayal, sacrifice, or a struggle for power between NPCs or between an NPC and the player. This paper presents a narrative literature review on how AI can recognize or simulate such emotions and social preferences to support emergent conflict. Prior work is organized into three strands: affective and motivation modeling for individual characters, social preferences and multi-agent learning for groups of agents, and drama or experience management systems that control the pacing and intensity of conflict events. For each strand, the review examines how state is represented, how conflict is triggered, and how player experience and safety are evaluated. On the basis of this review, a simple mechanism view is proposed that links emotional state and social values to typical conflict behaviors, and key design choices are identified that can make conflicts both understandable and controllable. The goal is to provide game designers and AI practitioners with practical ideas for using affect-aware agents when building games that rely on emergent social conflict.
Using intelligent systems to perceive psychological and social behaviors, that is, the underlying affective, cognitive, and pathological states that are manifested through observable behaviors and social interactions, remains a challenge due to their complex, multifaceted, and personalized nature. Existing work tackling these dimensions through specialized datasets and single-task systems often misses opportunities for scalability, cross-task transfer, and broader generalization. To address this gap, we curate Human Behavior Atlas, a unified benchmark of diverse behavioral tasks designed to support the development of foundation models for understanding psychological and social behaviors. Human Behavior Atlas comprises over 100,000 samples spanning text, audio, and visual modalities, covering tasks on affective states, cognitive states, pathologies, and social processes. Our unification efforts can reduce redundancy and cost, enable training to scale efficiently across tasks, and enhance generalization of behavioral features across domains. On Human Behavior Atlas, we train three models: Omnisapiens-7B SFT, Omnisapiens-7B BAM, and Omnisapiens-7B RL. We show that training on Human Behavior Atlas enables models to consistently outperform existing multimodal LLMs across diverse behavioral tasks. Pretraining on Human Behavior Atlas also improves transfer to novel behavioral datasets, with the targeted use of behavioral descriptors yielding meaningful performance gains. The benchmark, models, and codes can be found at: https://github.com/MIT-MI/human_behavior_atlas.
Designing reliable automatic models for social perception can contribute to a better understanding of human behavior, enabling more trustworthy experiences in online multimedia communication environments. However, predicting social attributes from video data remains challenging due to the complex interplay of visual, auditory, and linguistic cues. In this paper, we address this challenge by investigating the effectiveness of Multimodal Large Language Models (MM-LLMs) for feature extraction in the MuSe-Perception challenge. Firstly, our analysis of the novel LMU-ELP dataset has revealed high correlations between certain perceptual dimensions, motivating using a single regression model for all 16 social attributes to be predicted for a set of speakers appearing in recorded video clips. We demonstrate that dimensionality reduction through Principal Component Analysis (PCA) can be applied to the label space without relevant performance loss. Secondly, by employing frozen MM-LLMs as feature extractors, we explore their ability to capture perception-related information. We extract sequence embeddings from the Qwen-VL and Qwen-Audio models and train a Multi-Layer Perceptron over the attention-pooled vectors for each encoder, obtaining a mean Pearson correlation of 0.22 using the average predictions for both models. Our best result of 0.31 is achieved by training the same architecture over the baseline vit-ver and w2v-msp features, which motivates further exploration on how to effectively leverage advanced MM-LLMs as feature extractors. Lastly, a post hoc analysis of our results highlights the limitations of Pearson correlation for evaluating regression performance in this context. In particular, a similar Pearson coefficient can be obtained with two very different prediction sets displaying different levels of variability. We take this result as a call to action in exploring alternative metrics to assess the regression performance for the task.
Social learning is a key component of human and animal intelligence. By taking cues from the behavior of experts in their environment, social learners can acquire sophisticated behavior and rapidly adapt to new circumstances. This paper investigates whether independent reinforcement learning (RL) agents in a multi-agent environment can learn to use social learning to improve their performance. We find that in most circumstances, vanilla model-free RL agents do not use social learning. We analyze the reasons for this deficiency, and show that by imposing constraints on the training environment and introducing a model-based auxiliary loss we are able to obtain generalized social learning policies which enable agents to: i) discover complex skills that are not learned from single-agent training, and ii) adapt online to novel environments by taking cues from experts present in the new environment. In contrast, agents trained with model-free RL or imitation learning generalize poorly and do not succeed in the transfer tasks. By mixing multi-agent and solo training, we can obtain agents that use social learning to gain skills that they can deploy when alone, even out-performing agents trained alone from the start.
Social norms are shared rules that govern and facilitate social interaction. Violating such social norms via teasing and insults may serve to upend power imbalances or, on the contrary, reinforce solidarity and rapport in conversation; such rapport is highly situated and context-dependent. In this work, we investigate the task of automatically identifying the phenomenon of social norm violation in discourse. Towards this goal, we leverage the power of recurrent neural networks and the multimodal information present in the interaction, and propose a predictive model to recognize social norm violations. Using long-term temporal and contextual information, our model achieves an F1 score of 0.705. Implications of our work for developing a socially aware agent are discussed.
Soccer is a rich testbed for studying multi-agent adversarial systems. In this work we focus on the task of reconstructing the noisy trajectories of soccer agents (players and the ball). Previous works that model the behaviours of agents in soccer are limited in two respects: (i) they only focus on short-term context windows (less than or equal to 10 seconds) which are not suitable for reconstructing trajectories impacted by long-term noise, and (ii) they exclusively rely on trajectory context, and do not leverage soccer's auxiliary data streams that can provide additional context. Our Event2Tracking model addresses these limitations. First, our architecture models soccer's long-term structure by processing long-term trajectories (60 seconds in duration). Secondly, our architecture is multimodal. Specifically, it fuses soccer tracking data with event data (which specifies the high-level semantic events that transpire in a game), providing rich context that cannot strictly be inferred from the raw trajectories. We evaluate our method empirically using a reconstruction loss metric. Compared to state-of-the-art approaches, our method substantially improves the accuracy of the ball's and players' reconstructed trajectories.
We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent.
Existing multimodal conversation agents have shown impressive abilities to locate absolute positions or retrieve attributes in simple scenarios, but they fail to perform well when complex relative positions and information alignments are involved, which poses a bottleneck in response quality. In this paper, we propose a Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph (SPRING), with the ability to reason over multi-hop spatial relations and connect them with visual attributes in crowded situated scenarios. Specifically, we design two types of Multimodal Question Answering (MQA) tasks to pretrain the agent. All QA pairs utilized during pretraining are generated from novel Incremental Layout Graphs (ILG). QA-pair difficulty labels automatically annotated by the ILG are used to drive MQA-based curriculum learning. Experimental results verify SPRING's effectiveness, showing that it significantly outperforms state-of-the-art approaches on both the SIMMC 1.0 and SIMMC 2.0 datasets. We release our code and data at https://github.com/LYX0501/SPRING.
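Difficulty-ordered pretraining of this kind can be sketched in a few lines; the staging below is a generic curriculum-learning scheme with illustrative names and data, not SPRING's actual schedule:

```python
def curriculum_batches(qa_pairs, num_stages):
    # qa_pairs: (question, answer, difficulty) triples; the difficulty
    # labels play the role of the automatic ILG annotations. Easier
    # examples are scheduled into earlier training stages.
    ordered = sorted(qa_pairs, key=lambda qa: qa[2])
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]

pairs = [("q1", "a1", 3), ("q2", "a2", 1), ("q3", "a3", 2), ("q4", "a4", 5)]
stages = curriculum_batches(pairs, 2)
print([[q for q, _, _ in s] for s in stages])  # [['q2', 'q3'], ['q1', 'q4']]
```

A trainer would then iterate over the stages in order, so the model sees the easiest QA pairs first and the hardest ones last.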
Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.
Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves a 26% relative improvement in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves an around 2% higher overall score than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to the full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% of the token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.
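The extract-consolidate-retrieve loop described above can be sketched minimally; the store, the keys, and the word-overlap scoring here are illustrative stand-ins, not Mem0's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    # Hypothetical store: maps a fact key to its latest consolidated value.
    facts: dict = field(default_factory=dict)

    def consolidate(self, key: str, value: str) -> None:
        # New information about a known key supersedes the stale entry
        # instead of accumulating contradictory duplicates.
        self.facts[key] = value

    def retrieve(self, query: str, k: int = 3):
        # Toy relevance score: word overlap between the query and
        # "key value"; a real system would use embedding similarity.
        q = set(query.lower().split())
        scored = sorted(
            self.facts.items(),
            key=lambda kv: -len(q & set(f"{kv[0]} {kv[1]}".lower().split())),
        )
        return scored[:k]

store = MemoryStore()
store.consolidate("user city", "Paris")
store.consolidate("user city", "Berlin")  # update supersedes the old value
store.consolidate("user pet", "a cat named Miso")
print(store.retrieve("which city does the user live in", k=1))
# [('user city', 'Berlin')]
```

The consolidation step is what distinguishes a memory system from plain RAG over the transcript: the stale "Paris" entry is overwritten rather than retrieved alongside the update.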
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performances on several long-horizon video understanding benchmarks, an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.
Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal reasoning. To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.
How can flexibility and control over the interpretation of multimodal signals by embodied agents be balanced? Flexibility means that agents respond fluently in any context, whereas control means that responses are transparent and faithful to goals and principles that are explicitly defined. This paper describes a modular platform to create multimodal interactive agents using an event bus on which signals and interpretations are posted as a sequence in time, but also provides control options to drive the interaction given specific intentions and goals. Different sensors and interpretation components can be integrated by defining their input and output topics in the event bus, which results in an open multimodal sequence-driven workflow for further interpretations. In addition, our platform allows us to define higher-level intents that control sequence patterns to achieve a goal. A key component is an episodic Knowledge Graph (eKG) that acts as a long-term symbolic memory to aggregate and connect these interpretations. This eKG establishes coherence and continuity across different interactions. Intents and the eKG make it possible to define different (embodied) agents and compare their behavior without having to implement complex software components for multimodal sensor data and design the control over their dependencies. In this paper, we explain the broad range of components that we developed and integrated into various interactive agents. We also explain how the interaction is recorded as multimodal data and how it results in an aggregated memory in the eKG. By analyzing the recorded interaction, we can compare agents and agent components and study their interactive behavior with people and other agents.
Long-term task planning for robots remains a key focus in robotics research. At present, long-term planning faces challenges such as cumulative errors in sequences and the limited scope of semantic modalities in planning. Traditional planning relies on static expert rule databases, which lack flexibility. While knowledge graphs based solely on text can effectively map task relationships, their reasoning capabilities are constrained by issues such as modal uniformity and delays in updating static knowledge bases. To address these issues, this paper proposes GraphCortex, a “knowledge graph-VLM-agent” tripartite fusion paradigm for long-term task planning. Inspired by the human brain’s superior temporal sulcus, inferior parietal lobe, and intraparietal sulcus, GraphCortex integrates visual and motion trajectory data into pure text knowledge graphs, achieving multimodal task nodes and aligning visual, action, and textual semantics. In experiments involving long-term tasks, GraphCortex significantly enhances the long-term capabilities of imitation learning under the single-demonstration paradigm, offering a practical solution to challenges like semantic modality uniformity, fragmented task semantics, and the difficulty of autonomously evolving experiential knowledge in robot long-term task planning.
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate the long-term interaction between the agent and a human to compose the majority of the test process. With the pseudo-dialogue history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the X-Turn Pass-Rate metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% during 3 turns and 10 turns of dialogue respectively, their performance drops as the dialogue progresses, which underscores the difficulty of maintaining consistency in the long term.
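One plausible formalization of a pass-rate metric over dialogue durations, with invented judge verdicts (the paper's exact X-Turn Pass-Rate definition may differ):

```python
def x_turn_pass_rate(verdicts_per_dialogue, turns):
    # verdicts_per_dialogue: for each test dialogue, the turn index at
    # which judges first identified the agent as a machine (None = never).
    # A dialogue "passes" at duration `turns` if no detection happened
    # within the first `turns` turns.
    passed = sum(
        1 for detected_at in verdicts_per_dialogue
        if detected_at is None or detected_at > turns
    )
    return passed / len(verdicts_per_dialogue)

# Illustrative verdicts for 10 dialogues.
detections = [None, 2, 5, None, 8, 1, None, 12, 4, None]
print(x_turn_pass_rate(detections, 3))   # 0.8
print(x_turn_pass_rate(detections, 10))  # 0.5
```

Evaluating the same verdicts at growing turn counts reproduces the qualitative finding above: the pass rate can only decay as the dialogue gets longer, since each additional turn is another chance to be detected.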
Controlling diversity in LLM-agent simulations is essential for balancing stability in structured tasks with variability in open-ended interactions. However, we observe that dialogue diversity tends to degrade over long-term simulations. To explore the role of prompt design in this phenomenon, we modularized the utterance generation prompt and found that reducing contextual information leads to more diverse outputs. Based on this insight, we propose Adaptive Prompt Pruning (APP), a novel method that allows users to control diversity via a single parameter, lambda. APP dynamically prunes prompt segments based on attention scores and is compatible with existing diversity control methods. We demonstrate that APP effectively modulates diversity through extensive experiments and propose a method to balance the control trade-offs. Our analysis reveals that all prompt components impose constraints on diversity, with the Memory being the most influential. Additionally, high-attention contents consistently suppress output diversity.
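A sketch of attention-based prompt pruning controlled by a single parameter; here we drop the highest-attention segments first, consistent with the observation above that high-attention content suppresses diversity, but the segment names, scores, and the exact pruning criterion are illustrative, not APP's actual algorithm:

```python
def adaptive_prompt_prune(segments, lam):
    # segments: list of (text, attention_score) pairs. lam in [0, 1] is a
    # stand-in for the paper's single control parameter: lam=0 keeps the
    # full prompt; larger lam prunes more of the high-attention segments.
    keep = max(1, round(len(segments) * (1 - lam)))
    # Select the `keep` lowest-attention segments, preserving prompt order.
    by_attention = sorted(segments, key=lambda s: s[1])[:keep]
    kept = {text for text, _ in by_attention}
    return [text for text, _ in segments if text in kept]

segments = [
    ("persona description", 0.9),
    ("conversation memory", 0.7),
    ("scene details", 0.3),
    ("formatting rules", 0.1),
]
print(adaptive_prompt_prune(segments, 0.0))  # all four segments kept
print(adaptive_prompt_prune(segments, 0.5))  # two lowest-attention remain
```

Because lam is a single scalar, it can be swept or scheduled alongside existing diversity controls such as temperature, which matches the compatibility claim in the abstract.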
Conversational recommender systems (CRS) endow traditional recommender systems with the capability of dynamically obtaining users’ short-term preferences for items and attributes through interactive dialogues. There are three core challenges for CRS: the intelligent decisions for what attributes to ask, which items to recommend, and when to ask or recommend, at each conversation turn. Previous methods mainly leverage reinforcement learning (RL) to learn conversational recommendation policies for solving one or two of these three decision-making problems in CRS with separated conversation and recommendation components. These approaches restrict the scalability and generality of CRS and fall short of preserving a stable training procedure. In light of these challenges, we tackle these three decision-making problems in CRS as a unified policy learning task. In order to leverage the different features that are important to each sub-problem and facilitate better unified policy learning in CRS, we propose two novel multi-agent RL-based frameworks, namely Independent and Hierarchical Multi-Agent UNIfied COnversational RecommeNders (IMA-UNICORN and HMA-UNICORN), respectively. Specifically, two low-level agents enrich the state representations for attribute prediction and item recommendation by combining long-term user preference information from the historical interaction data with short-term user preference information from the conversation history. A high-level meta agent is responsible for coordinating the low-level agents to adaptively make the final decision. Experimental results on four benchmark CRS datasets and a real-world E-Commerce application show that the proposed frameworks significantly outperform state-of-the-art methods. Extensive analyses further demonstrate the superior scalability of the MARL frameworks on multi-round conversational recommendation.
Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable performance in long-term human-machine interactions, which basically relies on iterative recalling and reasoning over history to generate high-quality responses. However, such repeated recall-reason steps easily produce biased thoughts, i.e., inconsistent reasoning results when recalling the same history for different questions. On the contrary, humans can keep thoughts in memory and recall them without repeated reasoning. Motivated by this human capability, we propose a novel memory mechanism called TiM (Think-in-Memory) that enables LLMs to maintain an evolved memory for storing historical thoughts along the conversation stream. The TiM framework consists of two crucial stages: (1) before generating a response, an LLM agent recalls relevant thoughts from memory, and (2) after generating a response, the LLM agent post-thinks and incorporates both historical and new thoughts to update the memory. Thus, TiM can eliminate the issue of repeated reasoning by saving the post-thinking thoughts as the history. Besides, we formulate basic principles for organizing the thoughts in memory based on well-established operations (i.e., insert, forget, and merge), allowing for dynamic updates and evolution of the thoughts. Furthermore, we introduce Locality-Sensitive Hashing into TiM to achieve efficient retrieval for long-term conversations. We conduct qualitative and quantitative experiments on real-world and simulated dialogues covering a wide range of topics, demonstrating that equipping existing LLMs with TiM significantly enhances their performance in generating responses for long-term interactions.
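The Locality-Sensitive Hashing ingredient can be sketched with the standard random-hyperplane construction; the dimensions, bit count, and stored thought below are illustrative, not necessarily TiM's exact scheme:

```python
import random

random.seed(0)
DIM, BITS = 8, 4
# Random hyperplanes shared by inserts and queries; the sign of the dot
# product with each plane contributes one bit of the hash.
PLANES = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def lsh_bucket(vec):
    # Locality-sensitive hash: nearby embeddings tend to share a bucket,
    # so recall only scans one bucket instead of the whole memory.
    return tuple(
        int(sum(p * x for p, x in zip(plane, vec)) > 0) for plane in PLANES
    )

memory = {}  # bucket -> list of stored thoughts

def insert_thought(vec, thought):
    memory.setdefault(lsh_bucket(vec), []).append(thought)

def recall(vec):
    return memory.get(lsh_bucket(vec), [])

v = [0.5] * DIM
insert_thought(v, "user prefers morning meetings")
print(recall(v))  # ['user prefers morning meetings']
```

Embeddings that are close in angle usually hash to the same bucket, so a slightly perturbed query vector would typically recall the same thought; the forget and merge operations would then edit that bucket's list in place.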
Generation-based chatbots, which read and write natural language and create new sentences, have advanced significantly through large language models (LLMs). Nevertheless, reaching long-term coherence and successful personalization remains difficult because of shortcomings in memory management. This work is a detailed study of memory systems, covering short-term memory (STM), long-term memory (LTM), episodic memory (EM), and semantic memory (SM), and their contributions to continuity of context in dialogue systems. It points out the fundamental problems of contextual drift, catastrophic forgetting (CF), and scaling. The paper also examines personalization methods such as user profiling (UP), dynamic embeddings (DE), context-aware generation (CAG), and continual learning (CL), as well as privacy-preserving methods like federated learning (FL). By reviewing existing approaches and hypothesizing integrated ones, this work contributes to the further improvement of intelligent, memory-capable, and user-adaptive conversational agents. The research also includes a comparative analysis of the latest research contributions, along with recommendations for future work on adaptability and human-like interaction in chatbot systems. Such knowledge is critical for developing chatbots that can remember user context over time and adapt to user preferences. This study serves as a foundation for developing more emotionally intelligent and context-aware dialogue systems.
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents that integrates forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities (utterances, turns, and sessions) into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on the LLM's cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows a more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
There has recently been growing interest in conversational agents with long-term memory, which has led to the rapid development of language models that use retrieval-augmented generation (RAG). Until recently, most work on RAG has focused on information retrieval from large databases of texts, like Wikipedia, rather than information from long-form conversations. In this paper, we argue that effective retrieval from long-form conversational data faces two unique problems compared to static database retrieval: 1) time/event-based queries, which require the model to retrieve information about previous conversations based on time or the order of a conversational event (e.g., the third conversation on Tuesday), and 2) ambiguous queries that require surrounding conversational context to understand. To better develop RAG-based agents that can deal with these challenges, we generate a new dataset of ambiguous and time-based questions that build upon a recent dataset of long-form, simulated conversations, and demonstrate that standard RAG-based approaches handle such questions poorly. We then develop a novel retrieval model that combines chain-of-table search methods, standard vector-database retrieval, and a prompting method to disambiguate queries, and demonstrate that this approach substantially improves over current methods at solving these tasks. We believe that this new dataset and more advanced RAG agent can act as a key benchmark and stepping stone towards effective memory-augmented conversational agents that can be used in a wide variety of AI applications.
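A time/event-based query such as "the third conversation on Tuesday" is naturally answered by metadata filtering rather than vector similarity; a minimal sketch with an invented session log:

```python
from datetime import datetime

# Hypothetical session log: (timestamp, transcript summary).
sessions = [
    (datetime(2024, 5, 6, 9), "kickoff planning"),
    (datetime(2024, 5, 7, 8), "budget review"),
    (datetime(2024, 5, 7, 12), "hiring discussion"),
    (datetime(2024, 5, 7, 17), "travel booking"),
    (datetime(2024, 5, 8, 10), "status update"),
]

def nth_session_on_weekday(sessions, weekday, n):
    # Resolve a time/event-based query like "the third conversation on
    # Tuesday" by filtering on weekday (Mon=0 ... Sun=6) and taking the
    # n-th match in chronological order, instead of relying on semantic
    # similarity, which carries no notion of ordinals or dates.
    matches = sorted(
        (ts, text) for ts, text in sessions if ts.weekday() == weekday
    )
    return matches[n - 1][1] if len(matches) >= n else None

# 2024-05-07 is a Tuesday; its third session is the travel booking one.
print(nth_session_on_weekday(sessions, weekday=1, n=3))  # travel booking
```

Embedding-only retrieval would rank these sessions by topical similarity to the query words, none of which mention travel, which is exactly why such queries trip up standard RAG.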
No abstract available
As chatbots continue to evolve toward human-like, real-world interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, placing emphasis on the "eyes" of human perception while neglecting the "ears", namely auditory aspects. Moreover, these studies often center around static interactions that focus on discussing the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots with "eyes and ears" capable of more immersive interactions with humans. As part of this effort, we introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation (M3C), and propose a novel multimodal conversation model featuring multimodal memory retrieval. Our model, trained on M3C, demonstrates the ability to seamlessly engage in long-term conversations with multiple speakers in complex, real-world-like settings, effectively processing visual and auditory inputs to understand and respond appropriately. Human evaluations highlight the model's strong performance in maintaining coherent and dynamic interactions, demonstrating its potential for advanced multimodal conversational agents.
We present HyLECA, an open-source framework designed for the development of long-term engaging controlled conversational agents. HyLECA’s dialogue manager employs a hybrid architecture, combining rule-based methods for controlled dialogue flows with retrieval-based and generation-based approaches to enhance the utterance variability and flexibility. The motivation behind HyLECA lies in enhancing user engagement and enjoyment in task-oriented chatbots by leveraging the natural language generation capabilities of open-domain large language models within the confines of predetermined dialogue flows. Moreover, we discuss the technical capabilities, potential applications, relevance, and adaptability of the system. Lastly, we report preliminary findings from integrating state-of-the-art large language models in simulating a conversation centred on smoking cessation.
In this paper, we propose MPC (Modular Prompted Chatbot), a new approach for creating high-quality conversational agents without the need for fine-tuning. Our method utilizes pre-trained large language models (LLMs) as individual modules for long-term consistency and flexibility, by using techniques such as few-shot prompting, chain-of-thought (CoT), and external memory. Our human evaluation results show that MPC is on par with fine-tuned chatbot models in open-domain conversations, making it an effective solution for creating consistent and engaging chatbots.
According to current estimates, the population of Japan will decrease from the current 126.4 million to less than 100 million by 2053 [1]. In order to supplement the decreasing labor force, the Japanese government has stated that it will relax the regulations on introducing foreign workers. Thus, an increase in foreign workers in Japanese society can be expected in the near future. In such a situation, foreign workers are the minority and have to work collaboratively with Japanese colleagues, who are the majority. Our project aims at developing an environment for both foreign and Japanese people to practice collaborative work in an unbalanced situation, that is, one where Japanese people are the majority and the task itself favors Japanese participants. This paper describes our design of an experiment to acquire multimodal sensory data in collaborative tasks with an unbalanced composition of cultural backgrounds. These data are intended to be used for developing foreign/Japanese behavior generation and detection models in a training environment with virtual agents. To our knowledge, no such dataset is available, and we believe the data collected can provide valuable resources for developing tools to support unbalanced groups.
The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%. All framework code, agent code, and task datasets are publicly available at https://github.com/camel-ai/crab.
Social media has evolved into a complex multimodal environment where text, images, and other signals interact to shape nuanced meanings, often concealing harmful intent. Identifying such intent, whether sarcasm, hate speech, or misinformation, remains challenging due to cross-modal contradictions, rapid cultural shifts, and subtle pragmatic cues. To address these challenges, we propose MV-Debate, a multi-view agent debate framework with dynamic reflection gating for unified multimodal harmful content detection. MV-Debate assembles four complementary debate agents (a surface analyst, a deep reasoner, a modality contrast agent, and a social contextualist) to analyze content from diverse interpretive perspectives. Through iterative debate and reflection, the agents refine responses under a reflection-gain criterion, ensuring both accuracy and efficiency. Experiments on three benchmark datasets demonstrate that MV-Debate significantly outperforms strong single-model and existing multi-agent debate baselines. This work highlights the promise of multi-agent debate in advancing reliable social intent detection in safety-critical online contexts.
Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.
This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general scenario is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using a simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness of agents, our findings first indicate that these agents are prone to environmental distractions. Furthermore, we implement an adversarial environment injection and analyze the approach to improve faithfulness, calling for a collective focus on this important topic.
No abstract available
A multimodal large language model (MLLM) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. Nonetheless, red-teaming efforts have revealed that adversarial images/prompts can jailbreak an MLLM and cause unaligned behaviors. In this work, we report an even more severe safety issue in multi-agent environments, referred to as infectious jailbreak. It entails the adversary simply jailbreaking a single agent, and without any further intervention from the adversary, (almost) all agents will become infected exponentially fast and exhibit harmful behaviors. To validate the feasibility of infectious jailbreak, we simulate multi-agent environments containing up to one million LLaVA-1.5 agents, and employ randomized pair-wise chat as a proof-of-concept instantiation for multi-agent interaction. Our results show that feeding an (infectious) adversarial image into the memory of any randomly chosen agent is sufficient to achieve infectious jailbreak. Finally, we derive a simple principle for determining whether a defense mechanism can provably restrain the spread of infectious jailbreak, but how to design a practical defense that meets this principle remains an open question to investigate. Our project page is available at https://sail-sg.github.io/Agent-Smith/.
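The exponentially fast spread described in this abstract is easy to reproduce in a toy simulation. The sketch below is illustrative only: it assumes randomized pairwise chat per round and a transmission probability of 1 whenever one partner carries the adversarial image, which is a simplification of the paper's setup.

```python
import random

def simulate_infection(n_agents: int, rounds: int, seed: int = 0) -> list:
    """Toy model of infectious jailbreak spread: agents chat in random pairs
    each round, and an agent holding the adversarial image in memory passes
    it to its partner (transmission probability 1, an assumption)."""
    rng = random.Random(seed)
    infected = [False] * n_agents
    infected[0] = True  # the adversary jailbreaks exactly one agent
    counts = [1]
    for _ in range(rounds):
        order = list(range(n_agents))
        rng.shuffle(order)
        for i in range(0, n_agents - 1, 2):
            a, b = order[i], order[i + 1]
            if infected[a] or infected[b]:
                infected[a] = infected[b] = True
        counts.append(sum(infected))
    return counts

counts = simulate_infection(n_agents=1024, rounds=15)
print(counts)  # the infected count roughly doubles each round until saturation
```

Because every infected agent meets one partner per round, the infected population roughly doubles until it saturates, which is the qualitative behavior the paper reports at a scale of up to a million agents.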
We introduce HoME: a Household Multimodal Environment for artificial agents to learn from vision, audio, semantics, physics, and interaction with objects and other agents, all within a realistic context. HoME integrates over 45,000 diverse 3D house layouts based on the SUNCG dataset, a scale which may facilitate learning, generalization, and transfer. HoME is an open-source, OpenAI Gym-compatible platform extensible to tasks in reinforcement learning, language grounding, sound-based navigation, robotics, multi-agent learning, and more. We hope HoME better enables artificial agents to learn as humans do: in an interactive, multimodal, and richly contextualized setting.
Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.
This article argues that a relatively novel methodology called multi-agent artificial intelligence modeling can play an important role in helping scholars fulfill eight desiderata for a “good” social scientific theory (conceptual clarity, logical consistency, empirical groundedness, parsimony, generativity, testability, insightfulness, and usefulness). The unique contributions of this methodology include its use of psychologically realistic agents in sociologically realistic networks that interact with each other and their simulated environment within an “artificial society.” These simulation tools utilize artificial intelligence in a way that enables scholars to causally generate the emergence of macro-level societal phenomena of interest from the micro-level behaviors and meso-level interactions of simulated agents that represent real world populations. These social digital twins can also integrate multiple social theories within a single causal architecture, providing a unique opportunity for the revolutionizing of our best theoretical frameworks.
We propose a multimodal (vision-and-language) benchmark for cooperative and heterogeneous multi-agent learning. We introduce a benchmark multimodal dataset with tasks involving collaboration between multiple simulated heterogeneous robots in a rich multi-room home environment. We provide an integrated learning framework, multimodal implementations of state-of-the-art multi-agent reinforcement learning techniques, and a consistent evaluation protocol. Our experiments investigate the impact of different modalities on multi-agent learning performance. We also introduce a simple message passing method between agents. The results suggest that multimodality introduces unique challenges for cooperative multi-agent learning and there is significant room for advancing multi-agent reinforcement learning methods in such settings.
We demonstrate EMMA, an embodied multimodal agent which has been developed for the Alexa Prize SimBot challenge. The agent acts within a 3D simulated environment for household tasks. EMMA is a unified and multimodal generative model aimed at solving embodied tasks. In contrast to previous work, our approach treats multiple multimodal tasks as a single multimodal conditional text generation problem, where a model learns to output text given both language and visual input. Furthermore, we showcase that a single generative agent can solve tasks with visual inputs of varying length, such as answering questions about static images, or executing actions given a sequence of previous frames and dialogue utterances. The demo system will allow users to interact conversationally with EMMA in embodied dialogues in different 3D environments from the TEACh dataset.
No abstract available
No abstract available
Large language models have increasingly been proposed as a powerful replacement for classical agent-based models (ABMs) to simulate social dynamics. By using LLMs as a proxy for human behavior, the hope of this new approach is to be able to simulate significantly more complex dynamics than with classical ABMs and gain new insights in fields such as social science, political science, and economics. However, due to the black box nature of LLMs, it is unclear whether LLM agents actually execute the intended semantics that are encoded in their natural language instructions, and whether the resulting dynamics of interactions are meaningful. To study this question, we propose a new evaluation framework that grounds LLM simulations within the dynamics of established reference models of social science. By treating LLMs as a black-box function, we evaluate their input-output behavior relative to this reference model, which allows us to evaluate detailed aspects of their behavior. Our results show that, while it is possible to engineer prompts that approximate the intended dynamics, the quality of these simulations is highly sensitive to the particular choice of prompts. Importantly, simulations are even sensitive to arbitrary variations such as minor wording changes and whitespace. This puts into question the usefulness of current versions of LLMs for meaningful simulations, as without a reference model, it is impossible to determine a priori what impact seemingly meaningless changes in a prompt will have on the simulation.
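The evaluation idea, treating the LLM agent as a black-box update function and scoring its rollout against a reference model, can be sketched as follows. DeGroot averaging stands in for the reference dynamics (the abstract does not name the specific models used), and the two lambda "agents" are hypothetical stand-ins for a faithful prompt and a slightly reworded one.

```python
import statistics

def degroot_step(opinions, weights):
    """Reference model (DeGroot learning, chosen here for illustration):
    each agent's next opinion is a weighted average of all opinions."""
    return [sum(w * o for w, o in zip(row, opinions)) for row in weights]

def simulation_error(agent_update, opinions, weights, steps=10):
    """Treat the LLM agent as a black-box update function and score its
    rollout against the reference dynamics (mean absolute deviation)."""
    ref, sim, errs = list(opinions), list(opinions), []
    for _ in range(steps):
        ref = degroot_step(ref, weights)
        sim = [agent_update(o, sim) for o in sim]  # old `sim` used throughout
        errs.append(statistics.mean(abs(r - s) for r, s in zip(ref, sim)))
    return statistics.mean(errs)

uniform = [[1 / 3] * 3 for _ in range(3)]
faithful = lambda own, all_ops: statistics.mean(all_ops)         # ideal prompt
drifting = lambda own, all_ops: statistics.mean(all_ops) + 0.05  # reworded prompt

print(simulation_error(faithful, [0.0, 0.5, 1.0], uniform))
print(simulation_error(drifting, [0.0, 0.5, 1.0], uniform))
```

The small constant bias of the "drifting" agent compounds over the rollout, which is the kind of prompt sensitivity the paper measures against its reference models.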
This tutorial will delve into the fascinating realm of simulating human society using Large Language Model (LLM)-driven agents, exploring their applications in cities, social media, and economic systems. Through this tutorial, participants will gain insights into the integration of LLMs into human society simulation, providing a comprehensive understanding of how these models can accurately represent human interactions, decision-making processes, and societal dynamics from cities to social media and to economic systems. The tutorial will introduce the essential background, discuss the motivation and challenges, and elaborate on the recent advances.
Understanding the dynamics of public opinion evolution on online social platforms is crucial for understanding influence mechanisms and the provenance of information. Traditional influence analysis is typically divided into qualitative assessments of personal attributes (e.g., psychology of influence) and quantitative evaluations of influence power mechanisms (e.g., social network analysis). One challenge faced by researchers is the ethics of real-world experimentation and the lack of social influence data. In this study, we provide a novel simulated environment that combines agentic intelligence with Large Language Models (LLMs) to test topic-specific influence mechanisms ethically. Our framework contains agents that generate posts, form opinions on specific topics, and socially follow/unfollow each other based on the outcome of discussions. This simulation allows researchers to observe the evolution of how opinions form and how influence leaders emerge. Using our own framework, we design an opinion leader that utilizes Reinforcement Learning (RL) to adapt its linguistic interaction with the community to maximize its influence and followers over time. Our current findings reveal that constraining the action space and incorporating self-observation are key factors for achieving stable and consistent opinion leader generation for topic-specific influence. This demonstrates the simulation framework's capacity to create agents that can adapt to complex and unpredictable social dynamics. The work is important in an age of increasing online influence on social attitudes and emerging technologies.
The increasing integration of Large Language Models (LLMs) into decision-making frameworks has exposed significant vulnerabilities to social compliance, specifically sycophancy and conformity. However, a critical research gap exists regarding the fundamental mechanisms that enable external social cues to systematically override a model's internal parametric knowledge. This study introduces the Signal Competition Mechanism, a unified framework validated by assessing behavioral correlations across 15 LLMs and performing latent-space probing on three representative open-source models. The analysis demonstrates that sycophancy and conformity originate from a convergent geometric manifold, hereafter termed the compliance subspace, which is characterized by high directional similarity in internal representations. Furthermore, the transition to compliance is shown to be a deterministic process governed by a linear boundary, where the Social Emotional Signal effectively suppresses the Information Calibration Signal. Crucially, we identify a "Transparency-Truth Gap," revealing that while internal confidence provides an inertial barrier, it remains permeable and insufficient to guarantee immunity against intense social pressure. By formalizing the Integrated Epistemic Alignment Framework, this research provides a blueprint for transitioning from instructional adherence to robust epistemic integrity.
The rise of echo chambers on social media platforms has heightened concerns about polarization and the reinforcement of existing beliefs. Traditional approaches for simulating echo chamber formation have often relied on predefined rules and numerical simulations, which, while insightful, may lack the nuance needed to capture complex, real-world interactions. In this paper, we present a novel framework that leverages large language models (LLMs) as generative agents to simulate echo chamber dynamics within social networks. The novelty of our approach is that it incorporates both opinion updates and network rewiring behaviors driven by LLMs, allowing for a context-aware and semantically rich simulation of social interactions. Additionally, we utilize real-world Twitter (now X) data to benchmark the LLM-based simulation against actual social media behaviors, providing insights into the accuracy and realism of the generated opinion trends. Our results demonstrate the efficacy of LLMs in modeling echo chamber formation, capturing both structural and semantic dimensions of opinion clustering. This work contributes to a deeper understanding of social influence dynamics and offers a new tool for studying polarization in online communities.
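The two LLM-driven behaviors this framework combines, opinion updates and network rewiring, can be caricatured numerically. In the sketch below a bounded-confidence rule stands in for the LLM's context-aware decisions; the tolerance, step size, and network parameters are all illustrative assumptions.

```python
import random

def simulate_echo_chamber(n=60, k=5, steps=400, tol=0.3, seed=1):
    """Numeric stand-in for the LLM-driven loop: each step a random agent
    either (i) assimilates toward a like-minded neighbour or (ii) rewires
    away from a neighbour whose opinion differs by more than `tol`."""
    rng = random.Random(seed)
    op = [rng.random() for _ in range(n)]
    nbrs = {i: set(rng.sample([j for j in range(n) if j != i], k))
            for i in range(n)}
    for _ in range(steps):
        i = rng.randrange(n)
        j = rng.choice(sorted(nbrs[i]))
        if abs(op[i] - op[j]) <= tol:
            op[i] += 0.5 * (op[j] - op[i])   # opinion update (assimilation)
        else:
            nbrs[i].discard(j)               # drop the dissenting tie
            pool = [x for x in range(n) if x != i and x not in nbrs[i]]
            nbrs[i].add(rng.choice(pool))    # rewire to a random new tie
    gaps = [abs(op[i] - op[j]) for i in range(n) for j in nbrs[i]]
    return op, sum(gaps) / len(gaps)         # mean disagreement across edges

opinions, edge_gap = simulate_echo_chamber()
print(round(edge_gap, 3))
```

Tracking the mean opinion gap across surviving edges gives a simple structural signal of chamber formation; the paper replaces both decision rules with LLM agents and benchmarks against real Twitter/X data.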
An increasing reliance on recommender systems has led to concerns about the creation of filter bubbles on social media, especially on short video platforms like TikTok. However, their formation is still not entirely understood due to the complex dynamics between recommendation algorithms and user feedback. In this paper, we aim to shed light on these dynamics using a large language model-based simulation framework. Our work employs real-world short-video data containing rich video content information and detailed user-agents to realistically simulate the recommendation-feedback cycle. Through large-scale simulations, we demonstrate that LLMs can replicate real-world user-recommender interactions, uncovering key mechanisms driving filter bubble formation. We identify critical factors, such as demographic features and category attraction, that exacerbate content homogenization. To mitigate this, we design and test interventions including various cold-start and feedback weighting strategies, showing measurable reductions in filter bubble effects. Our framework enables rapid prototyping of recommendation strategies, offering actionable solutions to enhance content diversity in real-world systems. Furthermore, we analyze how LLM-inherent biases may propagate through recommendations, proposing safeguards to promote equity for vulnerable groups, such as women and low-income populations. By examining the interplay between recommendation and LLM agents, this work advances a deeper understanding of algorithmic bias and provides practical tools to promote inclusive digital spaces.
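The recommendation-feedback cycle and a cold-start-style exploration intervention can be sketched as a toy loop. Everything below (the engagement rule, the five-category space, the parameters) is an assumption for illustration, not the paper's setup, which uses LLM user-agents on real short-video data.

```python
import random

def simulate_feed(explore_rate=0.0, steps=500, seed=2):
    """Minimal recommendation-feedback loop: categories are sampled in
    proportion to accumulated positive feedback, and the user-agent likes
    categories near a latent preference. `explore_rate` is a cold-start
    style intervention forcing uniform exploration. Returns the share of
    feedback weight held by the single most-reinforced category."""
    rng = random.Random(seed)
    cats = list(range(5))
    weights = {c: 1.0 for c in cats}
    pref = 2                                  # latent favourite category
    for _ in range(steps):
        if rng.random() < explore_rate:
            c = rng.choice(cats)              # intervention: forced exploration
        else:
            r = rng.random() * sum(weights.values())
            for c in cats:                    # roulette-wheel selection
                r -= weights[c]
                if r <= 0:
                    break
        if abs(c - pref) <= 1 and rng.random() < 0.8:
            weights[c] += 1.0                 # positive feedback reinforces
    return max(weights.values()) / sum(weights.values())

print(simulate_feed())                   # feedback loop left to homogenize
print(simulate_feed(explore_rate=0.2))   # with the exploration intervention
```

The top-category share is a crude homogenization metric; in the paper the analogous measurements motivate the cold-start and feedback-weighting interventions.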
The Spiral of Silence (SoS) theory holds that individuals with minority views often refrain from speaking out for fear of social isolation, enabling majority positions to dominate public discourse. When the 'agents' are large language models (LLMs), however, the classical psychological explanation is not directly applicable, since SoS was developed for human societies. This raises a central question: can SoS-like dynamics nevertheless emerge from purely statistical language generation in LLM collectives? We propose an evaluation framework for examining SoS in LLM agents. Specifically, we consider four controlled conditions that systematically vary the availability of 'History' and 'Persona' signals. Opinion dynamics are assessed using trend tests such as Mann-Kendall and Spearman's rank, along with concentration measures including kurtosis and interquartile range. Experiments across open-source and closed-source models show that history and persona together produce strong majority dominance and replicate SoS patterns; history signals alone induce strong anchoring; and persona signals alone foster diverse but uncorrelated opinions, indicating that without historical anchoring, SoS dynamics cannot emerge. The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM-agent systems.
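The trend and concentration measures named in this framework are straightforward to compute. A pure-Python sketch on invented toy data (a shrinking minority voice share and a majority-dominated final distribution):

```python
from itertools import combinations
import statistics

def mann_kendall_s(series):
    """Mann-Kendall trend statistic S: concordant minus discordant pairs;
    a strongly negative S indicates a monotone downward trend."""
    sign = lambda d: (d > 0) - (d < 0)
    return sum(sign(b - a) for a, b in combinations(series, 2))

def excess_kurtosis(xs):
    """Fisher (excess) kurtosis of the final opinion distribution."""
    m = statistics.fmean(xs)
    m2 = statistics.fmean((x - m) ** 2 for x in xs)
    m4 = statistics.fmean((x - m) ** 4 for x in xs)
    return m4 / m2 ** 2 - 3

def iqr(xs):
    """Interquartile range as a concentration measure."""
    q = statistics.quantiles(xs, n=4, method="inclusive")
    return q[2] - q[0]

# Toy data: the minority voice share shrinks over 20 rounds, and the final
# opinions pile up on the majority position.
share = [0.48, 0.45, 0.44, 0.40, 0.38, 0.35, 0.33, 0.30, 0.28, 0.25,
         0.24, 0.22, 0.20, 0.19, 0.17, 0.16, 0.15, 0.14, 0.13, 0.12]
final = [0.1] * 16 + [0.9] * 4

print(mann_kendall_s(share))  # -190: every one of the C(20,2) pairs decreases
print(excess_kurtosis(final), iqr(final))
```

A zero IQR together with a monotone-negative trend statistic is exactly the "majority dominance" signature the paper reports when history and persona signals are combined.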
Automatic response forecasting for news media plays a crucial role in enabling content producers to efficiently predict the impact of news releases and prevent unexpected negative outcomes such as social conflict and moral injury. To effectively forecast responses, it is essential to develop measures that leverage the social dynamics and contextual information surrounding individuals, especially in cases where explicit profiles or historical actions of the users are limited (referred to as lurkers). As shown in a previous study, 97% of all tweets are produced by only the most active 25% of users. However, existing approaches have limited exploration of how to best process and utilize these important features. To address this gap, we propose a novel framework, named SocialSense, that leverages a large language model to induce a belief-centered graph on top of an existing social network, along with graph-based propagation to capture social dynamics. We hypothesize that the induced graph that bridges the gap between distant users who share similar beliefs allows the model to effectively capture the response patterns. Our method surpasses existing state-of-the-art in experimental evaluations for both zero-shot and supervised settings, demonstrating its effectiveness in response forecasting. Moreover, the analysis reveals the framework's capability to effectively handle unseen user and lurker scenarios, further highlighting its robustness and practical applicability.
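The two-stage idea, inducing a belief-centered graph and then propagating responses over it, can be sketched in a few lines. The data layout and the plain averaging rule are assumptions for illustration; SocialSense itself extracts beliefs with an LLM and uses learned graph propagation.

```python
from collections import defaultdict

def induce_belief_graph(user_beliefs):
    """Belief-centered graph: users who hold a belief in common become
    neighbours, bridging users who are distant in the follower graph."""
    holders = defaultdict(set)
    for user, beliefs in user_beliefs.items():
        for b in beliefs:
            holders[b].add(user)
    edges = defaultdict(set)
    for users in holders.values():
        for u in users:
            edges[u] |= users - {u}
    return edges

def propagate_responses(edges, known, rounds=3):
    """Average-neighbour propagation of a response score onto users with
    no posting history ('lurkers')."""
    scores = dict(known)
    for _ in range(rounds):
        for u, nbrs in edges.items():
            if u in known:
                continue
            vals = [scores[v] for v in nbrs if v in scores]
            if vals:
                scores[u] = sum(vals) / len(vals)
    return scores

edges = induce_belief_graph({"ana": {"climate"}, "bo": {"climate", "tax"},
                             "cy": {"tax"}})
scores = propagate_responses(edges, known={"ana": 1.0, "cy": 0.0})
print(scores["bo"])  # 'bo' inherits the average of its belief-neighbours
```

Note how "bo" acquires edges to both "ana" and "cy" through shared beliefs even though no follower edge was given, which is the bridging effect the paper hypothesizes helps forecast lurker responses.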
Large language models (LLMs) are increasingly used to model human social behavior, with recent research exploring their ability to simulate social dynamics. Here, we test whether LLMs mirror human behavior in social dilemmas, where individual and collective interests conflict. Humans generally cooperate more than expected in laboratory settings, showing less cooperation in well-mixed populations but more in fixed networks. In contrast, LLMs tend to exhibit greater cooperation in well-mixed settings. This raises a key question: Are LLMs able to emulate human behavior in cooperative dilemmas on networks? In this study, we examine networked interactions where agents repeatedly engage in the Prisoner’s Dilemma within both well-mixed and structured network configurations, aiming to identify parallels in cooperative behavior between LLMs and humans. Our findings indicate critical distinctions: while humans tend to cooperate more within structured networks, LLMs display increased cooperation mainly in well-mixed environments, with limited adjustment to networked contexts. Notably, LLM cooperation also varies across model types, illustrating the complexities of replicating human-like social adaptability in artificial agents. These results highlight a crucial gap: LLMs struggle to emulate the nuanced, adaptive social strategies humans deploy in fixed networks. Unlike human participants, LLMs do not alter their cooperative behavior in response to network structures or evolving social contexts, missing the reciprocity norms that humans adaptively employ. This limitation points to a fundamental need in future LLM design—to integrate a deeper comprehension of social norms, enabling more authentic modeling of human-like cooperation and adaptability in networked environments.
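A minimal repeated-PD harness makes the reciprocity norm at issue concrete; tit-for-tat is the canonical conditional strategy of the kind the abstract says LLM agents fail to deploy on fixed networks. The payoffs follow the standard T=5, R=3, P=1, S=0 convention, and the strategy names are illustrative, not taken from the paper.

```python
# Standard Prisoner's Dilemma payoffs for the row player: T > R > P > S.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def play_repeated(strategy_a, strategy_b, rounds=10):
    """Repeated PD between two memory-one strategies; a strategy maps the
    opponent's previous move (None on round 1) to 'C' or 'D'."""
    score_a = score_b = 0
    last_a = last_b = None
    for _ in range(rounds):
        a, b = strategy_a(last_b), strategy_b(last_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        last_a, last_b = a, b
    return score_a, score_b

tit_for_tat = lambda opp_last: "C" if opp_last in (None, "C") else "D"
always_defect = lambda opp_last: "D"

print(play_repeated(tit_for_tat, tit_for_tat))    # (30, 30): sustained cooperation
print(play_repeated(tit_for_tat, always_defect))  # (9, 14): exploited once, then punished
```

Against a fixed partner, tit-for-tat is exploited only on the first round and then reciprocates defection, exactly the network-sensitive adjustment the study finds missing in LLM agents.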
Bodily Behaviour Recognition (BBR) and Eye Contact Detection (ECD) in multi-person group conversations are critical for understanding social dynamics, but traditional methods often rely solely on visual cues, lacking integration with semantic context. To address this, we propose a novel framework based on Large Vision-Language Models (LVLMs), leveraging their cross-modal alignment capability to fuse visual features (e.g., body postures, gaze directions) and linguistic semantics (e.g., behavioral category descriptions). A parameter-efficient tuning strategy using Low-Rank Adaptation (LoRA) is adopted, adapting only a subset of parameters in both the Language Model (LM) and Vision Transformer (ViT) modules, thus retaining pre-trained knowledge while reducing computational costs. The framework incorporates multi-task output heads to simultaneously predict BBR and ECD results. Experiments on the MPIIGroupInteraction dataset demonstrate superior performance: our method achieves 0.65 accuracy on BBR and 0.82 accuracy on ECD, outperforming state-of-the-art approaches by 0.02-0.03 in absolute terms. Ablation studies validate that applying LoRA to both LM and ViT with optimal hyperparameters yields the best results, confirming the importance of cross-modal synergy. This work highlights the potential of LVLMs in social behavior analysis, providing a lightweight and effective solution for understanding complex group interactions.
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the amplification of societal biases, especially in non-Western contexts where cultural and social nuances are often underrepresented. This study introduces a multi-agent bias detection framework to systematically evaluate GPT-4o, Claude 3.5 Sonnet, and Llama 3.3 across Indian social stigma categories, including caste, religion, gender, mental health, socio-economic status, appearance, language/region, and family dynamics. We present SocialStigmaQA, a benchmark dataset of 320 prompts, validated through expert review and pilot testing, and use the Overall Bias Detection Factor (OBDF) to measure model performance. Findings reveal that Claude 3.5 Sonnet achieved the highest OBDF (98.75%), demonstrating superior bias detection across all categories, while GPT-4o showed moderate performance (72.8%) with noticeable gaps in gender and socio-economic domains. Llama 3.3 scored the lowest (71%). The multi-agent framework enhanced detection accuracy by 25–30% over single-agent models, particularly in subtle bias areas. These results underscore the need for culturally contextualized evaluation frameworks and suggest that OBDF-like metrics should be integrated into India's AI auditing processes to ensure fairness, inclusivity, and ethical deployment of AI systems in sensitive sectors such as hiring, education, and governance.
As the performance of larger, newer Large Language Models continues to improve for strategic Theory of Mind (ToM) tasks, the demand for these state-of-the-art models increases commensurately. However, their deployment is costly both in terms of processing power and time. In this paper, we investigate the feasibility of creating smaller, highly-performing specialized algorithms by way of fine-tuning. To do this, we first present a large pre-trained model with 20 unique scenarios that combine different social contexts with games of varying social dilemmas, record its answers, and use them for Q&A fine-tuning on a smaller model of the same family. Our focus is on in-context game-theoretic decision-making, the same domain within which human interaction occurs and that requires both a theory of mind (or a semblance thereof) and an understanding of social dynamics. The smaller model is therefore trained not just on the answers provided, but also on the motivations provided by the larger model, which should contain advice and guidelines to navigate both strategic dilemmas and social cues. We find that the fine-tuned smaller language model consistently bridged the gap in performance between the smaller pre-trained version of the model and its larger relative and that its improvements extended in areas and contexts beyond the ones provided in the training examples, including on out-of-sample scenarios that include completely different game structures. On average for all games, through fine-tuning, the smaller model showed a 46% improvement measured as alignment towards the behavior of the larger model, with 100% representing indistinguishable behavior. When presented with out-of-sample social contexts and games, the fine-tuned model still displays remarkable levels of alignment, reaching an improvement of 18% and 28% respectively.
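The Q&A fine-tuning setup, training the smaller model on the larger model's answer plus the motivation behind it, amounts to packing records like the following. The chat-message layout and field names are assumptions; the scenario text is invented for illustration.

```python
import json

def build_distillation_record(scenario, teacher_answer, teacher_motivation):
    """Pack one fine-tuning example: the smaller model is trained on both
    the larger model's answer and its stated motivation, so the strategic
    advice travels with the decision."""
    return {
        "messages": [
            {"role": "user", "content": scenario},
            {"role": "assistant",
             "content": f"{teacher_answer}\n\nReasoning: {teacher_motivation}"},
        ]
    }

record = build_distillation_record(
    "Trust game with a stranger: send 0-10 coins; amounts sent are tripled.",
    "Send 5 coins.",
    "Partial trust hedges against defection while signalling cooperativeness.",
)
print(json.dumps(record, indent=2))
```

Keeping the motivation in the target text is the paper's key choice: the smaller model imitates not just the decision but the theory-of-mind rationale, which plausibly explains the reported out-of-sample transfer.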
The process of opinion expression and exchange is a critical component of democratic societies. As people interact with large language models (LLMs) in an opinion-shaping process different from traditional media, the impacts of LLMs are increasingly recognized and raising concern. However, knowledge of how LLMs affect the process of opinion expression and exchange in social opinion networks is very limited. Here, we create an opinion network dynamics model to encode the opinions of LLMs, the cognitive acceptability and usage strategies of individuals, and simulate the impact of LLMs on opinion dynamics in a variety of scenarios. The outcomes of the simulations inform effective, demand-oriented interventions in opinion networks. The results of this study suggest that the output opinion of LLMs has a unique and positive effect on the collective opinion difference. The marginal effect of cognitive acceptability on collective opinion formation is nonlinear and shows a decreasing trend. When people partially rely on LLMs, the exchange process of opinion becomes more intense and the diversity of opinion becomes more favorable. In fact, there is 38.6% more opinion diversity when people all partially rely on LLMs, compared to prohibiting the use of LLMs entirely. The optimal diversity of opinion was found when the fractions of people who do not use, partially rely on, and fully rely on LLMs reached roughly 4:12:1. Our experiments also find that by introducing extra agents with opposite/neutral/random opinions, we can effectively mitigate the impact of biased/toxic output from LLMs. Our findings provide valuable insights into opinion dynamics in the age of LLMs, highlighting the need for customized interventions tailored to specific scenarios to address the drawbacks of improper output and use of LLMs.
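A toy version of the reliance-fraction experiment can be written in a few lines. The update rule, step count, and the fixed 0.7 "LLM opinion" are all illustrative assumptions; this sketch is not expected to reproduce the paper's 38.6% figure or its optimal 4:12:1 split, only the shape of the experiment.

```python
import random, statistics

def opinion_diversity(fracs=(4, 12, 1), llm_opinion=0.7,
                      n=170, steps=3000, seed=0):
    """Toy stand-in for the paper's model: agents either ignore the LLM,
    partially rely on it (blend a random peer's view with the LLM's), or
    fully adopt its output. Returns the final opinion standard deviation
    as a diversity proxy."""
    rng = random.Random(seed)
    total = sum(fracs)
    roles = ["none"] * (n * fracs[0] // total)
    roles += ["partial"] * (n * fracs[1] // total)
    roles += ["full"] * (n - len(roles))
    op = [rng.random() for _ in range(n)]
    for _ in range(steps):
        i = rng.randrange(n)
        peer = op[rng.randrange(n)]
        target = {"none": peer,
                  "partial": (peer + llm_opinion) / 2,
                  "full": llm_opinion}[roles[i]]
        op[i] += 0.3 * (target - op[i])   # partial adjustment toward target
    return statistics.pstdev(op)

print(opinion_diversity())                 # mixed reliance at a 4:12:1 split
print(opinion_diversity(fracs=(1, 0, 0)))  # LLM use prohibited entirely
```

Sweeping `fracs` and comparing the resulting diversity is the toy analogue of the paper's comparison between partial-reliance and prohibition scenarios.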
Humans increasingly rely on large language models (LLMs) to support decisions in social settings. Previous work suggests that such tools shape people's moral and political judgements. However, the long-term implications of LLM-based social decision-making remain unknown. How will human cooperation be affected when the assessment of social interactions relies on language models? This is a pressing question, as human cooperation is often driven by indirect reciprocity, reputations, and the capacity to judge interactions of others. Here, we assess how state-of-the-art LLMs judge cooperative actions. We provide 21 different LLMs with an extensive set of examples where individuals cooperate -- or refuse to cooperate -- in a range of social contexts, and ask how these interactions should be judged. Furthermore, through an evolutionary game-theoretical model, we evaluate cooperation dynamics in populations where the extracted LLM-driven judgements prevail, assessing the long-term impact of LLMs on human prosociality. We observe a remarkable agreement in evaluating cooperation against good opponents. On the other hand, we notice within- and between-model variance when judging cooperation with ill-reputed individuals. We show that the differences revealed between models can significantly impact the prevalence of cooperation. Finally, we test prompts to steer LLM norms, showing that such interventions can shape LLM judgements, particularly through goal-oriented prompts. Our research connects LLM-based advice to long-term social dynamics, and highlights the need to carefully align LLM norms in order to preserve human cooperation.
As AI systems increasingly assume roles where trust and alignment with human values are essential, understanding when and why they engage in deception has become a critical research priority. We introduce The Traitors, a multi-agent simulation framework inspired by social deduction games, designed to probe deception, trust formation, and strategic communication among large language model (LLM) agents under asymmetric information. A minority of agents, the traitors, seek to mislead the majority, while the faithful must infer hidden identities through dialogue and reasoning. Our contributions are: (1) we ground the environment in formal frameworks from game theory, behavioral economics, and social cognition; (2) we develop a suite of evaluation metrics capturing deception success, trust dynamics, and collective inference quality; (3) we implement a fully autonomous simulation platform where LLMs reason over persistent memory and evolving social dynamics, with support for heterogeneous agent populations, specialized traits, and adaptive behaviors. Our initial experiments across DeepSeek-V3, GPT-4o-mini, and GPT-4o (10 runs per model) reveal a notable asymmetry: advanced models like GPT-4o demonstrate superior deceptive capabilities yet exhibit disproportionate vulnerability to others' falsehoods. This suggests deception skills may scale faster than detection abilities. Overall, The Traitors provides a focused, configurable testbed for investigating LLM behavior in socially nuanced interactions. We position this work as a contribution toward more rigorous research on deception mechanisms, alignment challenges, and the broader social reliability of AI systems.
Large Language Models (LLMs) demonstrate significant persuasive capabilities in one-on-one interactions, but their influence within social networks, where interconnected users and complex opinion dynamics pose unique challenges, remains underexplored. This paper addresses the research question: Can LLMs generate meaningful content that maximizes user engagement on social networks? To answer this, we propose a pipeline using reinforcement learning with simulated feedback, where the network's response to LLM-generated content (i.e., the reward) is simulated through a formal engagement model. This approach bypasses the temporal cost and complexity of live experiments, enabling an efficient feedback loop between the LLM and the network under study. It also allows control over endogenous factors such as the LLM's position within the social network and the distribution of opinions on a given topic. Our approach is adaptive to the opinion distribution of the underlying network and agnostic to the specifics of the engagement model, which is embedded as a plug-and-play component. Such flexibility makes it suitable for more complex engagement tasks and interventions in computational social science. Using our framework, we analyze the performance of LLMs in generating social engagement under different conditions, showcasing their full potential in this task. The experimental code is publicly available at https://github.com/mminici/Engagement-Driven-Content-Generation.
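The feedback loop (generator proposes content, a formal engagement model supplies the reward) can be caricatured as a bandit over candidate stances. The linear-decay engagement model, the warm-start initialization, and the stance grid are all assumptions standing in for the paper's plug-and-play component and RL fine-tuning.

```python
import random

def engagement_reward(stance, opinions):
    """Plug-and-play engagement model (an assumed form, not the paper's):
    per-user engagement decays linearly with stance-opinion distance."""
    return sum(max(0.0, 1.0 - abs(stance - o)) for o in opinions)

def train_generator(stances, opinions, episodes=300, eps=0.1, seed=0):
    """Epsilon-greedy bandit as a stand-in for the RL loop: the 'generator'
    learns which stance maximizes simulated network engagement."""
    rng = random.Random(seed)
    value = {s: engagement_reward(s, opinions) for s in stances}  # warm start
    count = {s: 1 for s in stances}
    for _ in range(episodes):
        if rng.random() < eps:
            s = rng.choice(stances)                   # explore
        else:
            s = max(stances, key=lambda x: value[x])  # exploit
        r = engagement_reward(s, opinions)            # simulated feedback
        count[s] += 1
        value[s] += (r - value[s]) / count[s]         # incremental mean
    return max(stances, key=lambda x: value[x])

opinions = [0.1, 0.15, 0.2, 0.8, 0.85]                # a polarized audience
best = train_generator([0.0, 0.25, 0.5, 0.75, 1.0], opinions)
print(best)  # 0.25: nearest the larger opinion cluster
```

Because the engagement model is just a callable, swapping in a richer model (or one conditioned on the generator's network position) changes nothing else in the loop, which is the plug-and-play property the paper emphasizes.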
The rapid proliferation of Location-Based Social Networks (LBSNs) has underscored the importance of Point-of-Interest (POI) recommendation systems in enhancing user experiences. While individual POI recommendation methods leverage users' check-in histories to provide personalized suggestions, they struggle to address scenarios requiring group decision-making. Group POI recommendation systems aim to satisfy the collective preferences of multiple users, but existing approaches face two major challenges: diverse group preferences and extreme data sparsity in group check-in data. To overcome these challenges, we propose LLMGPR, a novel framework that leverages large language models (LLMs) for group POI recommendations. LLMGPR introduces semantic-enhanced POI tokens and incorporates rich contextual information to model the diverse and complex dynamics of group decision-making. To further enhance its capabilities, we developed a sequencing adapter using Quantized Low-Rank Adaptation (QLoRA), which aligns LLMs with group POI recommendation tasks. To address the issue of sparse group check-in data, LLMGPR employs an aggregation adapter that integrates individual representations into meaningful group representations. Additionally, a self-supervised learning (SSL) task is designed to predict the purposes of check-in sequences (e.g., business trips and family vacations), thereby enriching group representations with deeper semantic insights. Extensive experiments demonstrate the effectiveness of LLMGPR, showcasing its ability to significantly enhance the accuracy and robustness of group POI recommendations.
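The aggregation-adapter step (combining individual representations into a group representation) can be sketched in miniature. Note the hedge: LLMGPR's adapter is learned, whereas the stand-in below is just a weighted mean over member embeddings, with `aggregate_group` a name invented here.

```python
def aggregate_group(individual_embeddings, weights=None):
    """Aggregation-adapter stand-in: combine per-user embedding vectors
    into one group vector. A learned adapter would produce the weights;
    here we default to a uniform mean."""
    n, dim = len(individual_embeddings), len(individual_embeddings[0])
    weights = weights or [1.0 / n] * n
    return [sum(w * emb[d] for w, emb in zip(weights, individual_embeddings))
            for d in range(dim)]
```

A group POI recommender would then score candidate POIs against this group vector instead of any single member's vector, which is how sparse group check-ins can lean on richer individual histories.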
No abstract available
With the prospect of autonomous artificial intelligence (AI) agents, studying their tendency for cooperative behavior becomes an increasingly relevant topic. This study is inspired by the super-additive cooperation theory, where the combined effects of repeated interactions and inter-group rivalry have been argued to be the cause for cooperative tendencies found in humans. We devised a virtual tournament where language model agents, grouped into teams, face each other in a Prisoner's Dilemma game. By simulating both internal team dynamics and external competition, we discovered that this blend substantially boosts both overall and initial, one-shot cooperation levels (the tendency to cooperate in one-off interactions). This research provides a novel framework for large language models to strategize and act in complex social scenarios and offers evidence for how intergroup competition can, counter-intuitively, result in more cooperative behavior. These insights are crucial for designing future multi-agent AI systems that can effectively work together and better align with human values. Source code is available at https://github.com/pippot/Superadditive-cooperation-LLMs.
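The tournament's core mechanic, a repeated Prisoner's Dilemma between agents, can be made concrete with the standard payoff matrix. This sketch replaces the paper's LLM agents with two classic hand-coded strategies (tit-for-tat and always-defect); the function names are illustrative.

```python
# Standard PD payoffs: (row player, column player) for Cooperate/Defect.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_prev):
    """Cooperate first, then mirror the opponent's last move."""
    return "C" if opponent_prev in (None, "C") else "D"

def always_defect(opponent_prev):
    return "D"

def play_repeated(strategy_a, strategy_b, rounds=10):
    """Repeated PD: each strategy maps the opponent's previous move
    (None on the first round) to C or D; returns total scores."""
    prev_a = prev_b = None
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(prev_b), strategy_b(prev_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        prev_a, prev_b = move_a, move_b
    return score_a, score_b
```

Mutual tit-for-tat yields sustained cooperation (30, 30 over ten rounds), while tit-for-tat against a defector collapses to mutual defection after the first round; the paper layers team membership and inter-group rivalry on top of exactly this kind of pairwise game.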
Social media platforms have become primary channels for marketing and information dissemination, but they are also susceptible to false marketing and coordinated hype campaigns that mislead consumers and distort market information. This paper proposes a multimodal large model-based framework for detecting false marketing and hype propagation on social platforms. The framework integrates visual content analysis, textual semantic understanding, user behavior patterns, and network propagation dynamics to identify deceptive promotional activities. We present fundamental concepts of multimodal learning, transformer architectures, and graph neural networks. The proposed system employs vision-language models for content analysis and attention mechanisms for feature extraction. Experimental results on real-world social media datasets demonstrate that the multimodal approach achieves high accuracy in identifying false marketing content and detecting coordinated manipulation campaigns.
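The multimodal integration step can be illustrated with a late-fusion sketch. This is a simplification on stated assumptions: the paper uses attention mechanisms to combine modalities, whereas below a fixed weighted sum merges per-modality suspicion scores (visual, textual, behavioral), and all names and the threshold are hypothetical.

```python
def fuse_scores(modality_scores, weights):
    """Late-fusion stand-in: weighted sum of per-modality suspicion
    scores in [0, 1]. An attention module would learn these weights."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(s * w for s, w in zip(modality_scores, weights))

def is_false_marketing(modality_scores, weights, threshold=0.5):
    """Flag content when the fused suspicion score crosses a threshold."""
    return fuse_scores(modality_scores, weights) >= threshold
```

For example, content with high visual and propagation-pattern suspicion but benign text (`[0.9, 0.2, 0.8]` with weights `[0.5, 0.3, 0.2]`) fuses to 0.67 and is flagged, which is the cross-modal corroboration the framework relies on.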
Large language models (LLMs) have demonstrated an impressive ability to role-play humans and replicate complex social dynamics. While large-scale social simulations are gaining increasing attention, they still face significant challenges, particularly regarding high time and computation costs. Existing solutions, such as distributed mechanisms or hybrid agent-based model (ABM) integrations, either fail to address inference costs or compromise accuracy and generalizability. To this end, we propose EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation. EcoLANG operates in two stages: (1) language evolution, where we filter synonymous words and optimize sentence-level rules through natural selection, and (2) language utilization, where agents in social simulations communicate using the evolved language. Experimental results demonstrate that EcoLANG reduces token consumption by over 20%, enhancing efficiency without sacrificing simulation accuracy.
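The language-utilization stage (agents communicating in a reduced, evolved vocabulary) can be sketched as a canonicalization pass. The synonym table below is a made-up example; in EcoLANG the mapping is induced by the evolution stage, not hand-written.

```python
def compress_message(message, synonym_map):
    """Language-utilization stand-in: rewrite each word to its canonical
    form so agents share a smaller vocabulary and spend fewer tokens."""
    return " ".join(synonym_map.get(word, word) for word in message.split())

# Hypothetical evolved mapping; EcoLANG would induce this via selection.
SYNONYMS = {"assist": "help", "aid": "help",
            "purchase": "buy", "acquire": "buy"}
```

Applied to `"please assist me to purchase and acquire supplies"`, this yields `"please help me to buy and buy supplies"`; collapsing synonyms onto short, common forms is one concrete route to the >20% token savings the abstract reports.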
Recent advancements in Large Language Models offer promising capabilities to simulate complex human social interactions. We investigate whether LLM-based multi-agent simulations can reproduce core human social dynamics observed in online forums. We evaluate conformity dynamics, group polarization, and fragmentation across different model scales and reasoning capabilities using a structured simulation framework. Our findings indicate that smaller models exhibit higher conformity rates, whereas models optimized for reasoning are more resistant to social influence.
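The conformity dynamic measured above can be illustrated with a one-parameter opinion model. This is a classical averaging sketch, not the paper's framework: a single `conformity` rate stands in for a model's measured susceptibility to social influence, with higher rates playing the role of smaller models.

```python
def residual_disagreement(opinions, conformity, steps=20):
    """Each step, every agent moves toward the group mean by `conformity`
    (a stand-in for an LLM's susceptibility to influence). Returns the
    final opinion spread: smaller spread = stronger conformity."""
    ops = list(opinions)
    for _ in range(steps):
        mean = sum(ops) / len(ops)
        ops = [o + conformity * (mean - o) for o in ops]
    return max(ops) - min(ops)
```

Under this model a high-conformity population collapses toward consensus while a low-conformity (reasoning-resistant) population retains disagreement, mirroring the small-model vs. reasoning-model contrast in the findings.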
Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM-agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large-scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.
Health misinformation across digital platforms has emerged as a critical, fast-growing challenge to global public health, undermining trust in science and contributing to vaccine hesitancy, treatment refusal and heightened health risks. In response, this study introduces Impact, a novel simulation framework that integrates agent-based modeling (ABM) with large language models (LLMs) and retrieval-augmented generation (RAG) to evaluate and optimize health communication strategies in complex online environments. By modeling virtual populations characterized by demographic, psychosocial, and emotional attributes, embedded within network structures that replicate the dynamics of digital platforms, the framework captures how individuals perceive, interpret and propagate both factual and misleading health messages. Messages are enriched with evidence from authoritative medical sources and iteratively refined through sentiment analysis and comparative testing, allowing the proactive pre-evaluation of diverse communication framings. Results demonstrate that misinformation spreads more rapidly than factual content, but that corrective strategies, particularly empathetic and context-sensitive messages delivered through trusted peers, can mitigate polarization, enhance institutional trust and sustain long-term acceptance of evidence-based information. These findings underscore the importance of adaptive, data-driven approaches to health communication and highlight the potential of simulation-based methods to inform scalable interventions capable of strengthening resilience against misinformation in digitally connected societies.
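The differential spread of misinformation versus factual content can be sketched as a transmission-rate contrast on a network. This is a bare-bones propagation toy on stated assumptions: the framework's demographic and psychosocial attributes are collapsed into a single per-message `rate`, with misinformation modeled simply as a higher rate.

```python
import random

def spread(adjacency, seeds, rate, steps=10, seed=0):
    """Toy message propagation: each step, every informed node passes the
    message to each neighbor with probability `rate`. Returns how many
    nodes end up informed. Misinformation = higher rate than facts."""
    rng = random.Random(seed)
    informed = set(seeds)
    for _ in range(steps):
        newly = {nb for node in informed
                 for nb in adjacency[node] if rng.random() < rate}
        informed |= newly
    return len(informed)
```

On a 20-node ring, a message with rate 1.0 saturates the network within ten steps while rate 0.0 never leaves its seed; intermediate rates reproduce the qualitative gap the abstract reports, and a corrective-strategy experiment would compare rates with and without trusted-peer seeding.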
This report consolidates the core research findings on "long-horizon, multi-agent, multimodal social scenarios," building a complete framework from low-level technical foundations to high-level social applications. The research covers: 1) macro-level social dynamics simulation, revealing the laws governing the evolution of group behavior; 2) micro-level long-term memory management, addressing the challenge of interaction consistency; 3) meso-level game-theoretic cooperation mechanisms, optimizing the effectiveness of multi-agent decision-making; 4) multimodal perception and embodied intelligence, improving agents' ability to survive in and understand complex environments; 5) system safety and social governance, addressing the risks posed by the socialization of AI; 6) deployment in vertical domains and benchmark construction, driving the intelligent transformation of industries such as healthcare and education. Together, these studies point toward building general-purpose agent systems with strong social intelligence, long-term stability, and multimodal interaction capabilities.