Large Language Models Agentic AI
代理架构与系统工程框架
集中于构建模块化、可扩展的LLM代理软件架构,涵盖系统初始化、生命周期管理、API设计、工程实践及标准化部署方案。
- AgentInit: Initializing LLM-based Multi-Agent Systems via Diversity and Expertise Orchestration for Effective and Efficient Collaboration(Chunhao Tian, Yutong Wang, Xuebo Liu, Zhexuan Wang, Liang Ding, Miao Zhang, Min Zhang, 2025, Conference on Empirical Methods in Natural Language Processing)
- SOLID: a Framework of Synergizing Optimization and LLMs for Intelligent Decision-Making(Yinsheng Wang, Tario G You, Léonard Boussioux, Shan Liu, 2025, ArXiv Preprint)
- LLM-Collab: a framework for enhancing task planning via chain-of-thought and multi-agent collaboration(Hongyu Cao, Rong Ma, Yanlong Zhai, Jun Shen, 2024, Applied Computing and Intelligence)
- Advancing U.S. Competitiveness in Agentic Gen AI: A Strategic Framework for Interoperability and Governance(Satyadhar Joshi, 2025, International Journal of Innovative Science and Research Technology)
- AGENTIC AI ARCHITECTURE: THE ROLE OF DETERMINISTIC OUTPUTS AND OBSERVABILITY IN ENSURING SYSTEM RELIABILITY(Roman Kyslyi, 2026, Наука і техніка сьогодні)
- AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search(Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li, 2025, AAAI Conference on Artificial Intelligence)
- E3-MAS: A Self-Evolution Multi-Agent System Framework(Ming-Yi Huang, Yao-Zhi Xue, Chai-Yu Lin, 2025, 2025 IEEE/IEIE International Conference on Consumer Electronics-Asia (ICCE-Asia))
- POLARIS: Typed Planning and Governed Execution for Agentic AI in Back-Office Automation(Zahra Moslemi, Keerthi Koneru, Yen-Ting Lee, Sheethal Kumar, Ramesh Radhakrishnan, 2026, ArXiv Preprint)
- AgentCoord: Visually Exploring Coordination Strategy for LLM-based Multi-Agent Collaboration(Bo Pan, Jiaying Lu, Ke Wang, Li Zheng, Zhen Wen, Yingchaojie Feng, Minfeng Zhu, Wei Chen, 2024, Computers & graphics)
- Multi-agent Architecture Search via Agentic Supernet(Gui-Min Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, Xiang Wang, 2025, International Conference on Machine Learning)
- Architecting Agentic Communities using Design Patterns(Zoran Milosevic, Fethi Rabhi, 2026, ArXiv Preprint)
- ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning(Edward Y. Chang, Longling Geng, 2025, Multi-LLM Agent Collaborative Intelligence)
- AgentSquare: Automatic LLM Agent Search in Modular Design Space(Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, Yong Li, 2024, International Conference on Learning Representations)
- Retail Resilience Engine: An Agentic AI Framework for Building Reliable Retail Systems With Test-Driven Development Approach(L. Mishra, Biswaranjan Senapati, 2025, IEEE Access)
- mAIstro: an open-source multi-agentic system for automated end-to-end development of radiomics and deep learning models for medical imaging(E. Tzanis, M. Klontzas, 2025, European Journal of Radiology Artificial Intelligence)
- Engineering AI Agents for Clinical Workflows: A Case Study in Architecture,MLOps, and Governance(Cláudio Lúcio do Val Lopes, João Marcus Pitta, Fabiano Belém, Gildson Alves, Flávio Vinícius Cruzeiro Martins, 2026, ArXiv Preprint)
- Governing Cloud Data Pipelines with Agentic AI(Aswathnarayan Muthukrishnan Kirubakaran, A. Parthasarathy, Nitin Saksena, Ram Sekhar Bodala, Akshay Deshpande, Suhas Malempati, S. Carimireddy, Abhirup Mazumder, 2025, Eighth Sense Research Group)
- DMAS-Forge: A Framework for Transparent Deployment of AI Applications as Distributed Systems(Alessandro Cornacchia, Vaastav Anand, Muhammad Bilal, Zafar Qazi, Marco Canini, 2025, ArXiv Preprint)
- Architectures for Building Agentic AI(Sławomir Nowaczyk, 2025, ArXiv Preprint)
- Agentic AI for Autonomous Micro-Frontend User Interfaces and Microservices Evolution in Cloud Platforms(Jyoti Kunal, Jyoti Kunal Shah, 2025, Journal of Computer Science and Technology Studies)
- Toward standardization of GenAI-driven agentic architectures for radio access networks(Zeinab Nezami, S. A. R. Zaidi, Maryam Hafeez, Jie Xu, Karim Djemame, 2025, Frontiers in Artificial Intelligence)
- Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis(Abdelghny Orogat, Ana Rostam, Essam Mansour, 2026, ArXiv Preprint)
- IEEE AI Standards for Agentic Systems(Richard Tong, Haoyang Li, Sridhar Raghavan, Qingsong Wen, Shannon Gray, A. Paul, Joleen Liang, Janusz Zalewski, Yacheng Yang, G. Tambouratzis, Bong Chong Ang, 2025, 2025 IEEE Conference on Artificial Intelligence (CAI))
- A4FN: an Agentic AI Architecture for Autonomous Flying Networks(André Coelho, Pedro Ribeiro, Helder Fontes, Rui Campos, 2025, 2025 IEEE 36th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC))
- Beyond Prompt Chaining: The TB-CSPN Architecture for Agentic AI(Uwe M. Borghoff, Paolo Bottoni, R. Pareschi, 2025, Future Internet)
- Autonomic Microservice Management via Agentic AI and MAPE-K Integration(Matteo Esposito, Alexander Bakhtin, Noman Ahmad, Mikel Robredo, Ruoyu Su, Valentina Lenarduzzi, Davide Taibi, 2025, European Conference on Software Architecture)
- Architecting Agentic AI Systems: Product and System Design Patterns for Trustworthy Autonomous Decision-Making(Tejesvi Alekh Prasad, 2025, International Journal of Computational and Experimental Science and Engineering)
- Architecting MCP-Based Platforms for Enterprise-Scale Agentic Generative AI(Karthik Perikala, 2023, Journal of Business Intelligence and Data Analytics)
多智能体协作与社会动力学研究
研究多代理系统(MAS)的交互协议、角色分配、团队协作机制、沟通拓扑以及模拟复杂人类社会行为的动态协作研究。
- AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System(Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao, Wei Han, 2025, Conference on Empirical Methods in Natural Language Processing)
- Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Accuracy, Reliability, and Latency(Nazmus Ashrafi, Salah Bouktif, Mohammed Mediani, 2025, 2025 IEEE 19th International Conference on Application of Information and Communication Technologies (AICT))
- MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration(Yucheng Zhou, Lingran Song, Jianbing Shen, 2025, Annual Meeting of the Association for Computational Linguistics)
- Multi-Agent LLM Reasoning for Clinical Procedure Sequencing from High-Granularity EHR Data(Yishan Zhong, Wenqi Shi, Ben Tamo, Micky C. Nnamdi, Yining Yuan, M. D. Wang, 2025, Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics)
- Facilitating Trustworthy Human-Agent Collaboration in LLM-based Multi-Agent System oriented Software Engineering(Krishna Ronanki, 2025, Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering)
- Enhancing AI Systems with Agentic Workflows Patterns in Large Language Model(Aditi Singh, Abul Ehtesham, Saket Kumar, T. T. Khoei, 2024, 2024 IEEE World AI IoT Congress (AIIoT))
- Orchestrating Agents and Data for Enterprise: A Blueprint Architecture for Compound AI(Eser Kandogan, Nikita Bhutani, Dan Zhang, Rafael Li Chen, Sairam Gurajada, Estevam Hruschka, 2025, 2025 IEEE 41st International Conference on Data Engineering Workshops (ICDEW))
- ACC-Collab: An Actor-Critic Approach to Multi-Agent LLM Collaboration(Andrew Estornell, Jean-François Ton, Yuanshun Yao, Yang Liu, 2024, International Conference on Learning Representations)
- TradingAgents: Multi-Agents LLM Financial Trading Framework(Yijia Xiao, Edward Sun, Di Luo, Wei Wang, 2024, ArXiv Preprint)
- Budgeted Multi-Agent Routing: Adaptive Role Assignment and Communication Compression for Efficient LLM-Agent Collaboration(Linghao Yang, Yilun Wu, Rao Xu, Kai Zhang, Xikai Yang, Ke Wu, 2025, 2025 5th International Conference on Electronic Communication, Computer Science and Technology (ECCST))
- Divergent Thoughts toward One Goal: LLM-based Multi-Agent Collaboration System for Electronic Design Automation(Haoyuan Wu, Haisheng Zheng, Zhuolun He, Bei Yu, 2025, North American Chapter of the Association for Computational Linguistics)
- LLM-powered Multi-agent Framework for Goal-oriented Learning in Intelligent Tutoring System(Tianfu Wang, Yi Zhan, Jianxun Lian, Zhengyu Hu, Nicholas Jing Yuan, Qi Zhang, Xing Xie, Hui Xiong, 2025, Companion Proceedings of the ACM on Web Conference 2025)
- ReDel: A Toolkit for LLM-Powered Recursive Multi-Agent Systems(Andrew Zhu, Liam Dugan, Christopher Callison-Burch, 2024, Conference on Empirical Methods in Natural Language Processing)
- Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-Based Planner and Graph-Based Policy(Ziqi Jia, Junjie Li, Xiaoyang Qu, Jianzong Wang, 2025, 2025 IEEE International Conference on Robotics and Automation (ICRA))
- Scaling Large-Language-Model-based Multi-Agent Collaboration(Cheng Qian, Zihao Xie, Yifei Wang, Wei Liu, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun, 2024, International Conference on Learning Representations)
- MultiAgentESC: A LLM-based Multi-Agent Collaboration Framework for Emotional Support Conversation(Yangyang Xu, Jinpeng Hu, Zhuoer Zhao, Zhangling Duan, Xiao Sun, Xun Yang, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing)
- Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications(Raphael Shu, Nilaksh Das, Michelle Yuan, Monica Sunkara, Yi Zhang, 2024, ArXiv Preprint)
- ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning(Shuo Yang, Soyeon Caren Han, Yihao Ding, Shuhe Wang, Eduard Hoy, 2026, ArXiv Preprint)
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization(Zhiyu Yang, Zihan Zhou, Shuo Wang, X. Cong, Xu Han, Yukun Yan, Zhenghao Liu, Zhixing Tan, Pengyuan Liu, Dong Yu, Zhiyuan Liu, Xiaodong Shi, Maosong Sun, 2024, Annual Meeting of the Association for Computational Linguistics)
- Synergizing Logical Reasoning, Knowledge Management and Collaboration in Multi-Agent LLM System(Adam Kostka, Jaroslaw A. Chudziak, 2025, Pacific Asia Conference on Language, Information and Computation)
- MegaAgent: A Large-Scale Autonomous LLM-based Multi-Agent System Without Predefined SOPs(Qian Wang, Tianyu Wang, Zhenheng Tang, Qinbin Li, Nuo Chen, Jingsheng Liang, Bingsheng He, 2024, Findings of the Association for Computational Linguistics: ACL 2025)
- OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning(Zhenyu Bi, Meng Lu, Yang Li, Swastik Roy, Weijie Guan, Morteza Ziyadi, Xuan Wang, 2025, No journal)
- The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration(Kotaro Furuya, Yuichi Kitagawa, 2025, ArXiv Preprint)
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent(Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, Fei Huang, 2024, Conference on Empirical Methods in Natural Language Processing)
- AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration(Zhexuan Wang, Yutong Wang, Xuebo Liu, Liang Ding, Miao Zhang, Jie Liu, Min Zhang, 2025, Annual Meeting of the Association for Computational Linguistics)
- Project Synapse: A Hierarchical Multi-Agent Framework with Hybrid Memory for Autonomous Resolution of Last-Mile Delivery Disruptions(Arin Gopalan Yadav, Varad Dherange, Kumar Shivam, 2026, ArXiv Preprint)
- Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems(Xu Shen, Yixin Liu, Yiwei Dai, Yili Wang, Rui Miao, Yue Tan, Shirui Pan, Xin Wang, 2025, Conference on Empirical Methods in Natural Language Processing)
- CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society(G. Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem, 2023, Advances in Neural Information Processing Systems 36)
- LLM-Guided Multi-Agent Collaboration for Complex Task Automation(Vishal Bharadwaj Meruga, 2025, 2025 5th International Conference on Emerging Research in Electronics, Computer Science and Technology (ICERECT))
- FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making(Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yupeng Cao, Zhi Chen, Jordan W. Suchow, Rong Liu, Zhenyu Cui, Zhaozhuo Xu, Denghui Zhang, Koduvayur Subbalakshmi, Guojun Xiong, Yueru He, Jimin Huang, Dong Li, Qianqian Xie, 2024, ArXiv Preprint)
- OSC: Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration(Jusheng Zhang, Yijia Fan, Kaitong Cai, Xiaofei Sun, Keze Wang, 2025, Conference on Empirical Methods in Natural Language Processing)
- Mixture-of-Agents Enhances Large Language Model Capabilities(Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou, 2024, International Conference on Learning Representations)
- CompanionCast: Toward Social Collaboration with Multi-Agent Systems in Shared Experiences(Yiyang Wang, Chen Chen, Tica Lin, Vishnu Raj, Josh Kimball, Alex Cabral, Josiah Hester, 2025, ArXiv Preprint)
- COLLAB-LLM: A Communication-Centric Role-Based Framework for Scalable Multi-Agent LLM Collaboration(Elham Albaroudi, M. Hatamleh, Sirin Mohammed Hejazi, Ahmad Yasser Alshalabi, Taha Mansouri, Ali Alameer, 2026, Asian Journal of Research in Computer Science)
- The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation(Zarreen Reza, 2025, ArXiv Preprint)
- Evolution of Cooperation in LLM-Agent Societies: A Preliminary Study Using Different Punishment Strategies(Kavindu Warnakulasuriya, P. Dissanayake, Navindu De Silva, Stephen Cranefield, B. Savarimuthu, Surangika Ranathunga, Nisansa de Silva, 2025, No journal)
- Triad: A Framework Leveraging a Multi-Role LLM-based Agent to Solve Knowledge Base Question Answering(Chang Zong, Yuchen Yan, Weiming Lu, Eliot Huang, Jian Shao, Y. Zhuang, 2024, Conference on Empirical Methods in Natural Language Processing)
- Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration(Yang Zhang, Shixin Yang, Chenjia Bai, Fei Wu, Xiu Li, Xuelong Li, Zhen Wang, 2024, Annual Meeting of the Association for Computational Linguistics)
代理推理、规划与强化学习自进化
探讨代理如何通过复杂推理、多步规划、过程监督强化学习、自我演化(Self-evolution)及负面轨迹学习提升自主决策与性能上限。
- DDO: Dual-Decision Optimization for LLM-Based Medical Consultation via Multi-Agent Collaboration(Zhihao Jia, Mingyi Jia, Junwen Duan, Jianxin Wang, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing)
- Making Large Language Models Better Reasoners with Alignment(Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, Zhifang Sui, 2023, ArXiv Preprint)
- Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning(Mingyue Cheng, Jie Ouyang, Shuo Yu, Ruiran Yan, Yucong Luo, Zirui Liu, Daoyu Wang, Qi Liu, Enhong Chen, 2025, ArXiv Preprint)
- Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems(Minghang Zhu, Zhengliang Shi, Zhiwei Xu, Shiguang Wu, Lingjie Wang, Pengjie Ren, Zhaochun Ren, Zhumin Chen, 2025, Conference on Empirical Methods in Natural Language Processing)
- LLM Collaboration with Multi-Agent Reinforcement Learning(Shuo Liu, Zeyu Liang, Xueguang Lyu, Christopher Amato, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- EvolveSearch: An Iterative Self-Evolving Search Agent(Dingchu Zhang, Yida Zhao, Jialong Wu, Baixuan Li, Wenbiao Yin, Liwen Zhang, Yong Jiang, Yufeng Li, Kewei Tu, Pengjun Xie, Fei Huang, 2025, Conference on Empirical Methods in Natural Language Processing)
- Retrospex: Language Agent Meets Offline Reinforcement Learning Critic(Yufei Xiang, Yiqun Shen, Yeqin Zhang, Cam-Tu Nguyen, 2025, ArXiv Preprint)
- Towards Zero-Shot, Controllable Dialog Planning with LLMs(Dirk Väth, Ngoc Thang Vu, 2024, ArXiv Preprint)
- The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective(Jiin Kim, Byeong-Gon Shin, Jin-Won Chung, Minsoo Rhu, 2025, 2026 IEEE International Symposium on High Performance Computer Architecture (HPCA))
- Exploring Advanced Large Language Models with LLMsuite(Giorgio Roffo, 2024, ArXiv Preprint)
- PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use(Qihao Wang, Mingzhe Lu, Jiayue Wu, Yue Hu, Yanbing Liu, 2026, Pacific Rim International Conference on Artificial Intelligence)
- TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning(Chuang Jiang, Mingyue Cheng, Xiaoyu Tao, Qingyang Mao, Ouyang Jie, Qi Liu, 2025, Web Search and Data Mining)
- A Roadmap to Guide the Integration of LLMs in Hierarchical Planning(Israel Puerta-Merino, Carlos Núñez-Molina, Pablo Mesejo, Juan Fernández-Olivares, 2025, ArXiv Preprint)
- ODA: Observation-Driven Agent for integrating LLMs and Knowledge Graphs(Lei Sun, Zhengwei Tao, Youdi Li, Hiroshi Arakawa, 2024, ArXiv Preprint)
- MDCrow: automating molecular dynamics workflows with large language models(Quintina Campbell, Sam Cox, Jorge Medina, Brittany Watterson, Andrew D. White, 2025, Machine Learning: Science and Technology)
- Hierarchical Multi-agent Large Language Model Reasoning for Autonomous Functional Materials Discovery(Samuel Rothfarb, Megan C. Davis, Ivana Matanovic, Baikun Li, Edward F. Holby, Wilton J. M. Kort-Kamp, 2025, ArXiv Preprint)
- Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools(Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, Yueming Jin, 2025, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- Efficient Tool Use with Chain-of-Abstraction Reasoning(Silin Gao, Jane Dwivedi-Yu, Ping Yu, Xiaoqing Tan, Ramakanth Pasunuru, Olga Golovneva, Koustuv Sinha, Asli Celikyilmaz, Antoine Bosselut, Tianlu Wang, 2024, International Conference on Computational Linguistics)
- ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks(Heng Zhou, Hejia Geng, Xiangyuan Xue, Li Kang, Zhenfei Yin, Lei Bai, 2025, Conference on Empirical Methods in Natural Language Processing)
- BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis(Surya Jasper, Minh Luu, Evan Pan, Aakash Tyagi, Michael Quinn, Jiang Hu, D. K. Houngninou, 2025, 2025 ACM/IEEE 7th Symposium on Machine Learning for CAD (MLCAD))
- Fine-tuning a Large Language Model for Automating Computational Fluid Dynamics Simulations(Zhehao Dong, Zhen Lu, Yue Yang, 2025, Theoretical and Applied Mechanics Letters)
- Spec2RTL-Agent: Automated Hardware Code Generation from Complex Specifications Using LLM Agent Systems(Zhongzhi Yu, Mingjie Liu, Michael Zimmer, Y. Lin, Yong Liu, Haoxing Ren, 2025, 2025 IEEE International Conference on LLM-Aided Design (ICLAD))
- AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction(Song Wang, Zhen Tan, Zihan Chen, Shuang Zhou, Tianlong Chen, Jundong Li, 2025, Conference on Empirical Methods in Natural Language Processing)
- Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning(Zhiwei Li, Yong Hu, Wenqing Wang, 2025, Conference on Empirical Methods in Natural Language Processing)
- LLM Agents: Reasoning and Quality Hillclimbing Approaches(Venkata Siva Prasad Bharathula, 2025, European Journal of Computer Science and Information Technology)
- Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search(Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu, 2024, ArXiv Preprint)
- Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement(Weimin Xiong, Yifan Song, Xiutian Zhao, Wenhao Wu, Xun Wang, Ke Wang, Cheng Li, Wei Peng, Sujian Li, 2024, Conference on Empirical Methods in Natural Language Processing)
- A self-correcting multi-agent LLM framework for language-based physics simulation and explanation(Dong-Oh Park, Hyeonbin Moon, Seunghwa Ryu, 2026, npj Artificial Intelligence)
- Reinforcement Learning of Planning Processes for Tool-Augmented LLM Agents(Lukas Müller, Hannah Wagner, Maximilian Weber, 2026, American Journal Of Big Data)
- Agent-Based ML-LLM Fusion with Self-Optimizing Prompts for Plateau Weather Alerts(Shuai Yan, Yang Xu, Shan He, 2025, 2025 6th International Conference on Information Science, Parallel and Distributed Systems (ISPDS))
- MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning(Shuo Yin, Weihao You, Zhilong Ji, Guoqiang Zhong, Jinfeng Bai, 2024, ArXiv Preprint)
- SAGE: Multi-Agent Self-Evolution for LLM Reasoning(Yulin Peng, Xinxin Zhu, Chenxing Wei, Nianbo Zeng, Leilei Wang, Ying Tiffany He, F. Richard Yu, 2026, ArXiv Preprint)
- Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First(Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, Aditya G. Parameswaran, 2025, Conference on Innovative Data Systems Research)
- Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents(Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, Timothy Baldwin, 2024, ArXiv Preprint)
- AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML(Patara Trirat, Wonyong Jeong, Sung Ju Hwang, 2024, International Conference on Machine Learning)
- Learning to Plan with Natural Language(Yiduo Guo, Yaobo Liang, Chenfei Wu, Wenshan Wu, Dongyan Zhao, Nan Duan, 2023, ArXiv Preprint)
- Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance(Ya-Ting Lu, Shenzhi Yang, Cheng Qian, Gui-Fang Chen, Qinyu Luo, Yesai Wu, Huadong Wang, X. Cong, Zhong Zhang, Yankai Lin, Weiwen Liu, Yasheng Wang, Zhiyuan Liu, Fangming Liu, Maosong Sun, 2024, International Conference on Learning Representations)
- When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator(Md Fahim Anjum, 2025, ArXiv Preprint)
- Adaptive bidirectional planning framework for enhanced safety and robust decision-making in autonomous navigation systems(Daoming Yu, Shaowen Wang, Yao Xu, Tianqi Wang, Jiaxin Zou, 2025, The Journal of Supercomputing)
- An LLM-based Framework for Biomedical Terminology Normalization in Social Media via Multi-Agent Collaboration(Yongqi Fan, Kui Xue, Zelin Li, Xiaofan Zhang, Tong Ruan, 2025, International Conference on Computational Linguistics)
- Training Task Reasoning LLM Agents for Multi-Turn Task Planning via Single-Turn Reinforcement Learning(Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang, 2025, IEEE Control Systems Letters)
- Revolutionizing Financial Management: The Role of Agentic AI in SAP Finance(Pavan Kumar Bollineni, 2025, Journal of Computer Science and Technology Studies)
- AI Agents: Evolution, Architecture, and Real-World Applications(Naveen Krishnan, 2025, ArXiv Preprint)
- ELLMA-T: an Embodied LLM-agent for Supporting English Language Learning in Social VR(Mengxue Pan, Alexandra Kitson, Hongyu Wan, Mirjana Prpa, 2024, Proceedings of the 2025 ACM Designing Interactive Systems Conference)
安全性、可靠性评估与伦理治理
关注代理部署中的安全风险(如注入、后门、幻觉),提出评估基准、合规治理、可信度量与人类可控机制。
- PillagerBench: Benchmarking LLM-Based Agents in Competitive Minecraft Team Environments(Olivier Schipper, Yudi Zhang, Yali Du, Mykola Pechenizkiy, Meng Fang, 2025, 2025 IEEE Conference on Games (CoG))
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation(Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, Xuanjing Huang, 2024, International Conference on Computational Linguistics)
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve(Yuxiao Qu, Tianjun Zhang, Naman Garg, Aviral Kumar, 2024, Neural Information Processing Systems)
- STeCa: Step-level Trajectory Calibration for LLM Agent Learning(Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li, 2025, Annual Meeting of the Association for Computational Linguistics)
- Evaluating large language model agents for automation of atomic force microscopy(Indrajeet Mandal, J. Soni, Mohd Zaki, M. Smedskjaer, K. Wondraczek, Lothar Wondraczek, N. Gosvami, N. M. A. Krishnan, 2025, Nature Communications)
- Sola-Visibility-ISPM: Benchmarking Agentic AI for Identity Security Posture Management Visibility(Gal Engelberg, Konstantin Koutsyi, Leon Goldberg, Reuven Elezra, Idan Pinto, Tal Moalem, Shmuel Cohen, Yoni Weintrob, 2026, ArXiv Preprint)
- TelAgentBench: A Multi-faceted Benchmark for Evaluating LLM-based Agents in Telecommunications(Sunwoo Lee, Daseong Jang, Dhammiko Arya, Gyoung-eun Han, Injee Song, Saerom Kim, Sang-Ju Kim, Seojin Lee, Seokyoung Hong, Sereimony Sek, Seung-Mo Cho, Sohee Park, Sungbin Yoon, Wonbeom Jang, Eric Davis, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track)
- API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs(Minghao Li, Feifan Song, Yu Bowen, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li, 2023, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing)
- Measuring an LLM's Proficiency at using APIs: A Query Generation Strategy(Ying Sheng, Sudeep Gandhe, Bhargav Kanagal, Nick Edmonds, Zachary Fisher, Sandeep Tata, Aarush Selvan, 2024, Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
- On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents(Jen-Tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixin Chen, Wenxuan Wang, Youliang Yuan, Maarten Sap, Michael R. Lyu, 2024, International Conference on Machine Learning)
- Security, privacy, and agentic AI in a regulatory view: From definitions and distinctions to provisions and reflections(Shiliang Zhang, Sabita Maharjan, 2026, ArXiv Preprint)
- Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails(Siwei Han, Kaiwen Xiong, Jiaqi Liu, Xinyu Ye, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao, 2025, ArXiv Preprint)
- Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents(Michael Kirchhof, Gjergji Kasneci, Enkelejda Kasneci, 2025, International Conference on Machine Learning)
- Looking Forward: Challenges and Opportunities in Agentic AI Reliability(Liudong Xing, Janet, Lin, 2025, ArXiv Preprint)
- Echoing: Identity Failures when LLM Agents Talk to Each Other(Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese, 2025, ArXiv Preprint)
- From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration(Gaole He, Brian Y. Lim, 2026, ArXiv Preprint)
- Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents(Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, Yongfeng Zhang, 2024, International Conference on Learning Representations)
- InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents(Qiusi Zhan, Zhixiang Liang, Zifan Ying, Daniel Kang, 2024, Annual Meeting of the Association for Computational Linguistics)
- Driving with Regulation: Trustworthy and Interpretable Decision-Making for Autonomous Driving with Retrieval-Augmented Reasoning(Tianhui Cai, Yifan Liu, Zewei Zhou, Haoxuan Ma, Seth Z. Zhao, Zhiwen Wu, Xu Han, Zhiyu Huang, Jiaqi Ma, 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
- Toward Verifiable Misinformation Detection: A Multi-Tool LLM Agent Framework(Zikun Cui, Tianyi Huang, Chia-En Chiang, Cuiqianhe Du, 2025, Proceedings of the 2025 International Conference on Generative Artificial Intelligence for Business)
- Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents(Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, Xu Sun, 2024, ArXiv Preprint)
- LLM Constitutional Multi-Agent Governance(J. de Curtò, I. de Zarzà, 2026, ArXiv Preprint)
- G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems(Shilong Wang, Gui-Min Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, Yang Wang, 2025, Annual Meeting of the Association for Computational Linguistics)
- Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections(Xianglin Yang, Yufei He, Shuo Ji, Bryan Hooi, Jin Song Dong, 2026, ArXiv Preprint)
- Attacks on Third-Party APIs of Large Language Models(Wanru Zhao, Vidit Khazanchi, Haodi Xing, Xuanli He, Qiongkai Xu, Nicholas Donald Lane, 2024, ArXiv Preprint)
- CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent(Liang-bo Ning, Shijie Wang, Wenqi Fan, Qing Li, Xin Xu, Hao Chen, Feiran Huang, 2024, Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
- Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-Based Decision-Making Systems(Ruochen Jiao, Shaoyuan Xie, Justin Yue, Takami Sato, Lixu Wang, Yixuan Wang, Qi Alfred Chen, Qi Zhu, 2024, International Conference on Learning Representations)
- Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance(Jacopo Tagliabue, Federico Bianchi, Ciro Greco, 2025, ArXiv Preprint)
- A Blockchain-Monitored Agentic AI Architecture for Trusted Perception–Reasoning–Action Pipelines(Salman Jan, Hassan Ali Razzaqi, Ali Akarma, M. R. Belgaum, 2025, 2025 International Conference on Computer and Applications (ICCA))
- Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents(Vineeth Sai Narajala, Om Narayan, 2025, 2025 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI))
- Meaningful human control: actionable properties for AI system development(Luciano Cavalcante Siebert, Maria Luce Lupetti, Evgeni Aizenberg, Niek Beckers, Arkady Zgonnikov, Herman Veluwenkamp, David Abbink, Elisa Giaccardi, Geert-Jan Houben, Catholijn M. Jonker, Jeroen van den Hoven, Deborah Forster, Reginald L. Lagendijk, 2021, ArXiv Preprint)
- MAEBE: Multi-Agent Emergent Behavior Framework(Sinem Erisken, Timothy Gothard, Martin Leitgab, Ram Potham, 2025, ArXiv Preprint)
- Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents(Fengchao Chen, Tingmin Wu, Van Nguyen, Carsten Rudolph, 2026, ArXiv Preprint)
- PROV-AGENT: Unified Provenance for Tracking AI Agent Interactions in Agentic Workflows(Renan Souza, Amal Gueroudji, Stephen DeWitt, Daniel Rosendo, Tirthankar Ghosal, Robert Ross, Prasanna Balaprakash, Rafael Ferreira da Silva, 2025, ArXiv Preprint)
- "Nuclear Deployed!": Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents(Rongwu Xu, Xiaojian Li, Shuo Chen, Wei Xu, 2025, Annual Meeting of the Association for Computational Linguistics)
- Red-Teaming LLM Multi-Agent Systems via Communication Attacks(Pengfei He, Yuping Lin, Shen Dong, Han Xu, Yue Xing, Hui Liu, 2025, Annual Meeting of the Association for Computational Linguistics)
- AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration(Jizhou Chen, Samuel Lee Cong, 2025, ArXiv Preprint)
- AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems(YenTing Lee, Keerthi Koneru, Zahra Moslemi, Sheethal Kumar, Ramesh Radhakrishnan, 2026, ArXiv Preprint)
- Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare(Saikat Maiti, 2026, ArXiv Preprint)
- Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification(Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, Yang Zhang, 2024, Conference on Empirical Methods in Natural Language Processing)
- Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents(Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, Jing Shao, 2025, ArXiv Preprint)
- Authenticated and Offline-Verifiable Agent-to-Agent Messaging for LLM Agents(Adil Alshammari, Sareh Assiri, Hayretdin Bahşi, 2026, 2026 IEEE 16th Annual Computing and Communication Workshop and Conference (CCWC))
- Socio-technical aspects of Agentic AI(Praveen Kumar Donta, Alaa Saleh, Ying Li, Shubham Vaishnav, Kai Fang, Hailin Feng, Yuchao Xia, Thippa Reddy Gadekallu, Qiyang Zhang, Xiaodan Shi, Ali Beikmohammadi, Sindri Magnússon, Ilir Murturi, Chinmaya Kumar Dehury, Marcin Paprzycki, Lauri Loven, Sasu Tarkoma, Schahram Dustdar, 2025, ArXiv Preprint)
具身智能、垂直行业与应用实践
研究LLM代理在物理世界(自动驾驶、机器人)、医疗、工业控制、金融及科学研究等特定领域的落地应用与决策优化。
- CPS-LLM: Large Language Model based Safe Usage Plan Generator for Human-in-the-Loop Human-in-the-Plant Cyber-Physical System(Ayan Banerjee, Aranyak Maity, Payal Kamboj, Sandeep K. S. Gupta, 2024, ArXiv Preprint)
- FinMem: A Performance-Enhanced LLM Trading Agent With Layered Memory and Character Design(Yangyang Yu, Haohang Li, Zhi Chen, Yuechen Jiang, Yang Li, Jordan W. Suchow, Denghui Zhang, K. Khashanah, 2023, IEEE Transactions on Big Data)
- Toward LLM-Agent-Based Modeling of Transportation Systems: A Conceptual Framework(Tianming Liu, Jirong Yang, Yafeng Yin, 2024, ArXiv Preprint)
- TrajLLM: A Modular LLM-Enhanced Agent-Based Framework for Realistic Human Trajectory Simulation(Chenlu Ju, Jiaxi Liu, Shobhit Sinha, Hao Xue, Flora D. Salim, 2025, Companion Proceedings of the ACM on Web Conference 2025)
- Leveraging LLM Decision-Making in the Internet of Drone Things (IoDT) Ecosystem(Fatima Shibli, Burak Tufekci, Cihan Tunc, Robin Laidig, 2025, ACM Journal on Autonomous Transportation Systems)
- Neuro-LIFT: A Neuromorphic, LLM-based Interactive Framework for Autonomous Drone FlighT at the Edge(Amogh Joshi, Sourav Sanyal, Kaushik Roy, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- Large language model agents can use tools to perform clinical calculations(Alex J. Goodell, Simon N. Chu, D. Rouholiman, Larry F. Chu, 2025, npj Digital Medicine)
- A Plan Reuse Mechanism for LLM-Driven Agent(Guopeng Li, Ruiqi Wu, Haisheng Tan, 2025, ArXiv Preprint)
- LLM-Powered Multi-Agent Collaboration for Intelligent Industrial On-Call Automation(Ruowei Fu, Yang Zhang, Zeyu Che, Xin Wu, Zhenyu Zhong, Zhiqiang Ren, Shenglin Zhang, Feng Wang, Yongqian Sun, Xiaozhou Liu, Kexin Liu, Yu Zhang, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE))
- Exploring the Design of LLM-based Agent in Enhancing Self-disclosure Among the Older Adults(Yijie Guo, Ruhan Wang, Zhenhan Huang, Tongtong Jin, Xiwen Yao, Yuanling Feng, Weiwei Zhang, Yuan Yao, Haipeng Mi, 2025, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems)
- An LLM-Enabled Multi-Agent Autonomous Mechatronics Design Framework(Zeyu Wang, Frank P.-W. Lo, Qian Chen, Yongqi Zhang, Chen Lin, Xu Chen, Zhenhua Yu, Alex J. Thompson, Eric M. Yeatman, Benny P. L. Lo, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Language Evolution for Evading Social Media Regulation via LLM-Based Multi-Agent Simulation(Jinyu Cai, Jialong Li, Mingyue Zhang, Munan Li, Chen-Shu Wang, Kenji Tei, 2024, 2024 IEEE Congress on Evolutionary Computation (CEC))
- EduMAS: A Novel LLM-Powered Multi-Agent Framework for Educational Support(Qiaomu Li, Ying Xie, S. Chakravarty, Dabae Lee, 2024, 2024 IEEE International Conference on Big Data (BigData))
- Magma: A Foundation Model for Multimodal AI Agents(Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, J. Jang, Yuquan Deng, Lars Lidén, Jianfeng Gao, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling(Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl, 2025, ArXiv Preprint)
- Focus Agent: LLM-Powered Virtual Focus Group(Taiyu Zhang, Xuesong Zhang, Robbe Cools, Adalberto L. Simeone, 2024, ArXiv Preprint)
- WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration(Yao Zhang, Zijian Ma, Yunpu Ma, Zhen Han, Yu Wu, Volker Tresp, 2024, AAAI Conference on Artificial Intelligence)
- LLM Enabled Multi-Agent System for 6G Networks: Framework and Method of Dual-Loop Edge-Terminal Collaboration(Zheyan Qu, Wenbo Wang, Zitong Yu, Boquan Sun, Yang Li, Xing Zhang, 2025, IEEE Communications Magazine)
- SANNet: A Semantic-Aware Agentic AI Networking Framework for Multi-Agent Cross-Layer Coordination(Yong Xiao, Haoran Zhou, Xubo Li, Yayu Gao, Guangming Shi, Ping Zhang, 2025, GLOBECOM 2025 - 2025 IEEE Global Communications Conference)
- Re-Aligning Language to Visual Objects with an Agentic Workflow(Yuming Chen, Jiangyan Feng, Haodong Zhang, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou, Ming-Ming Cheng, Yibing Song, 2025, International Conference on Learning Representations)
- SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs(Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin, 2025, Proceedings of the ACM on Web Conference 2025)
- Simulating Social Behavior of LLM-Based Autonomous Negotiator Agents in a Game-Theoretical Framework Using Multi-Agent Systems(Ahmad Mouri Zadeh Khaki, Ahyoung Choi, Laleh Seyyed-Kalantari, 2025, International Journal of Human–Computer Interaction)
- A living systematic literature review (L-SLR) for non–small-cell lung (NSCLC), prostate (PC), and breast cancer (BC), built with an agentic text annotation system powered by large language models (LLM) to assist treatment decision making.(Saro Sarkisian MD, R. Liu, E. Liu, A. Forsythe, 2025, Journal of Clinical Oncology)
- NimbleLabs: Accelerating Healthcare AI Development Through Agentic AI(Soorya Ram Shimgekar, Abhay Goyal, Shayan Vassef, Koustuv Saha, C. Poellabauer, Xavier Vautier, Pi Zonooz, Navin Kumar, 2025, 2025 IEEE International Conference on Big Data (BigData))
- ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework(Vali Tawosi, Keshav Ramani, Salwa Alamir, Xiaomo Liu, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW))
- NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions(Elliot Gestrin, Marco Kuhlmann, Jendrik Seipp, 2024, ArXiv Preprint)
- Intent-Based Infrastructure and Service Orchestration Using Agentic-AI(Dimitrios Brodimas, Alexios N. Birbas, Dimitrios Kapolos, S. Denazis, 2025, IEEE Open Journal of the Communications Society)
- STAR-Shield: Self-Tuning Adaptive Rules for Web Application Firewall-as-a-Service via Multiple Large Language Models(Letian Sha, Lei Xue, Nan Yi, Fu Xiao, 2025, 2025 IEEE 24th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom))
- Smartboard: Visual Exploration of Team Tactics with LLM Agent(Ziao Liu, Xiao Xie, Moqi He, Wenshuo Zhao, Yihong Wu, Liqi Cheng, Hui Zhang, Yingcai Wu, 2024, IEEE Transactions on Visualization and Computer Graphics)
- ToolACE: Winning the Points of LLM Function Calling(Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen, 2024, International Conference on Learning Representations)
- Data Space and LLM Enabled Decision-Making Support System: An Application in Drug Development(Chengjun Wang, X. Ming, Xianyu Zhang, Jiahao Xu, 2025, 2025 2nd International Conference on Electronic Engineering and Information Systems (EEISS))
- MapAgent: A Hierarchical Agent for Geospatial Reasoning with Dynamic Map Tool Integration(Md Hasebul Hasan, Mahir Labib Dihan, Mohammed Eunus Ali, Md. Rizwan Parvez, 2025, Conference of the European Chapter of the Association for Computational Linguistics)
- Autonomous Agentic AI Architectures for Optimizing Security Operations Centers (SOC) KPIS: Methodology, Impact on Detection, Response, and Recovery(Miroslav Stefanov, K. Stefanov, Laxima Niure Kandel, Sean Crouse, Boyan Jekov, 2025, Land Forces Academy Review)
- Poster: LLM Multi-Agent Collaboration for Network Deployment and Management(Zhengyi Cheng, Chongxi Ma, Mingxuan Tang, Jie Xu, Quan Li, Long Luo, Hongfang Yu, 2025, 2025 IEEE 33rd International Conference on Network Protocols (ICNP))
- Large Language Model-based Decision-making for COLREGs and the Control of Autonomous Surface Vehicles(Klinsmann Agyei, Pouria Sarhadi, W. Naeem, 2024, 2025 European Control Conference (ECC))
- Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Non-Custom Models for Evidence-Based Medicine.(Joshua J. Woo, A. Yang, Reena J. Olsen, Sayyida S. Hasan, D. Nawabi, Benedict U. Nwachukwu, Riley J Williams, Prem N. Ramkumar, 2024, Arthroscopy)
- TinyAgent: Function Calling at the Edge(Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, G. Anumanchipalli, Kurt Keutzer, A. Gholami, 2024, Conference on Empirical Methods in Natural Language Processing)
- Conversational health agents: a personalized large language model-powered agent framework(Mahyar Abbasian, Iman Azimi, Amir M. Rahmani, Ramesh C. Jain, 2025, JAMIA Open)
- BuilDroid: A Self-Correcting LLM Agent for Automated Android Builds(Jaehyeong Kim, Rui Rua, K. Ali, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE))
- Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation(Jiawei Wang, Renhe Jiang, Chuang Yang, Zengqing Wu, Makoto Onizuka, Ryosuke Shibasaki, Chuan Xiao, 2024, Neural Information Processing Systems)
- Large Language Model Agents for Radio Map Generation and Wireless Network Planning(Hong Quan, Wanli Ni, Tong Zhang, Xiangyu Ye, Ziyi Xie, Shuai Wang, Yuanwei Liu, Hui Song, 2025, IEEE Networking Letters)
- MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration(Dingkang Yang, Jinjie Wei, Mingcheng Li, Jiyao Liu, Lihao Liu, Ming Hu, Junjun He, Yakun Ju, Wei Zhou, Yang Liu, Lihua Zhang, 2024, ArXiv Preprint)
- Agentic AI for Regulatory Intelligence: Designing Scalable Compliance Lifecycle Systems in Multinational Tech Enterprises(Chinenye Blessing Onyekaonwu, Emmanuel Igba, Amina Catherine Peter- Anyebe, 2024, International Journal of Scientific Research and Modern Technology)
- Multi-Agent LLM Collaboration for Adaptive Code Review, Debugging, and Security Analysis(Tanush Sharanarthi, Sreenidhi Polineni, 2025, 2025 International Conference on Mechatronics, Robotics, and Artificial Intelligence (MRAI))
- PowerAgent: A Road Map Toward Agentic Intelligence in Power Systems: Foundation Model, Model Context Protocol, and Workflow(Qian Zhang, Le Xie, 2025, IEEE Power and Energy Magazine)
- Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition(Huy Ha, Peter R. Florence, Shuran Song, 2023, Conference on Robot Learning)
- Lang-PINN: From Language to Physics-Informed Neural Networks via a Multi-Agent Framework(Xin He, Liangliang You, Hongduan Tian, Bo Han, Ivor Tsang, Yew-Soon Ong, 2025, ArXiv Preprint)
- Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training(Meng Xiao, Xunxin Cai, Qingqing Long, Chengrui Wang, Yuanchun Zhou, Hengshu Zhu, 2025, ArXiv Preprint)
- RepairAgent: An Autonomous, LLM-Based Agent for Program Repair(Islem Bouzenia, Prem Devanbu, Michael Pradel, 2024, 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE))
- Biomni: A General-Purpose Biomedical AI Agent(Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf H. Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, Di Yin, S. Marwaha, Jennefer N Carter, Xin Zhou, Matthew T Wheeler, Jonathan A. Bernstein, Mengdi Wang, Peng He, Jingtian Zhou, M. Snyder, Le Cong, Aviv Regev, J. Leskovec, 2025, bioRxiv)
- ChatEDA: A Large Language Model Powered Autonomous Agent for EDA(Haoyuan Wu, Zhuolun He, Xinyun Zhang, Xufeng Yao, Su Zheng, Haisheng Zheng, Bei Yu, 2023, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
- A feasibility study of automating radiotherapy planning with large language model agents(QingXing Wang, Zhongqiu Wang, Minghua Li, Xinye Ni, Rong Tan, Wenwen Zhang, Maitudi Wubulaishan, Wei Wang, Zhiyong Yuan, Zhen Zhang, Cong Liu, 2025, Physics in Medicine & Biology)
- A Large-Scale Study on the Development and Issues of Multi-Agent AI Systems(Daniel Liu, Krishna Upadhyay, Vinaik Chhetri, A. B. Siddique, Umar Farooq, 2026, ArXiv Preprint)
- DRC-Coder: Automated DRC Checker Code Generation Using LLM Autonomous Agent(Chen-Chia Chang, Chia-Tung Ho, Yaguang Li, Yiran Chen, Haoxing Ren, 2024, Proceedings of the 2025 International Symposium on Physical Design)
- DatawiseAgent: A Notebook-Centric LLM Agent Framework for Adaptive and Robust Data Science Automation(Ziming You, Yumiao Zhang, Dexuan Xu, Yiwei Lou, Yandong Yan, Wei Wang, Huaming Zhang, Yu Huang, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing)
- Design of Task Allocation and Decision-Making Styles for AI Agents Based on LLM(MengGuo Fu, J. Gou, 2025, 2025 10th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA))
- LLM-guided chemical process optimization with a multi-agent approach(Tong Zeng, Srivathsan Badrinarayanan, Janghoon Ock, Cheng-Kai Lai, A. Farimani, 2025, Machine Learning: Science and Technology)
- FLAG-Trader: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading(Guojun Xiong, Zhiyang Deng, Keyi Wang, Yupeng Cao, Haohang Li, Yangyang Yu, Xueqing Peng, Mingquan Lin, Kaleb E Smith, Xiao-Yang Liu, Jimin Huang, Sophia Ananiadou, Qianqian Xie, 2025, Annual Meeting of the Association for Computational Linguistics)
- A Vehicle-Infrastructure Multi-Layer Cooperative Decision-Making Framework(Yiming Cui, Shiyu Fang, Peng Hang, Jian Sun, 2025, 2025 IEEE Intelligent Vehicles Symposium (IV))
- Automatic building energy model development and debugging using large language models agentic workflow(Liang Zhang, Vitaly Ford, Zhelun Chen, Jianli Chen, 2025, Energy and Buildings)
- LEAD: LLM-enhanced deep reinforcement learning for stable decision-making in critical autonomous driving scenarios(Dongwei Xu, Enwen Qiao, Tongcheng Gu, Hongda Fu, Chengju Sun, Haifeng Guo, Yuqing Liu, 2025, Neurocomputing)
- Non-Prehensile Tool-Object Manipulation by Integrating LLM-Based Planning and Manoeuvrability-Driven Controls(Hoi-Yin Lee, Peng Zhou, Anqing Duan, Wanyu Ma, Chenguang Yang, D. Navarro-Alarcón, 2024, Robotics Comput. Integr. Manuf.)
- Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents(Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, Yanfeng Wang, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- OpenFOAMGPT: A retrieval-augmented large language model (LLM) agent for OpenFOAM-based computational fluid dynamics(Sandeep Pandey, Ran Xu, Wenkang Wang, Xu Chu, 2025, Physics of Fluids)
- agentAR: Creating Augmented Reality Applications with Tool-Augmented LLM-based Autonomous Agents(Chenfei Zhu, Shao-Kang Hsia, Xiyun Hu, Ziyi Liu, Jingyu Shi, K. Ramani, 2025, Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology)
- A Language Agent for Autonomous Driving(Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, Yue Wang, 2023, ArXiv Preprint)
- SEDM: A Safety‐Enhanced Decision‐Making Framework for Autonomous Driving by Integrating Large Language Models and XGBoost(Jun Li, Baozhu Chen, Kai Xu, Xiaohan Yang, Mengting Sun, Guojun Li, Haojie Du, 2026, IET Intelligent Transport Systems)
- An Iterative Decision Refinement Framework for LLM-based Flight Control System(Qiace Zhang, Xiangqun Cai, 2025, 2025 IEEE International Conference on Unmanned Systems (ICUS))
- Optimization modeling and verification from problem specifications using a multi-agent multi-stage LLM framework(Mahdi Mostajabdaveh, Timothy T. L. Yu, Rindranirina Ramamonjison, G. Carenini, Zirui Zhou, Yong Zhang, 2024, INFOR: Information Systems and Operational Research)
- On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software(Ali Nouri, Johan Andersson, Kailash De Jesus Hornig, Zhennan Fei, Emil Knabe, Håkan Sivencrona, Beatriz Cabrero-Daniel, Christian Berger, 2025, Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering)
- Evaluation of Large Language Models for Decision Making in Autonomous Driving(Kotaro Tanahashi, Yuichi Inoue, Yu Yamaguchi, Hidetatsu Yaginuma, Daiki Shiotsuka, Hiroyuki Shimatani, Kohei Iwamasa, Yoshiaki Inoue, Takafumi Yamaguchi, Koki Igari, Tsukasa Horinouchi, Kento Tokuhiro, Yugo Tokuchi, Shunsuke Aoki, 2023, ArXiv Preprint)
- Communication-Free Adaptive Swarm Robotic System: LLM-Based Decision Making and MARL-Based Multi-Policy Control(Takahiro Yoshida, Yuichiro Sueoka, 2025, Journal of Robotics and Mechatronics)
- LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language(Kun Chu, Xufeng Zhao, Cornelius Weber, Stefan Wermter, 2025, ArXiv Preprint)
- Autonomous Vehicle Maneuvering Using Vision–LLM Models for Marine Surface Vehicles(Tae-Yeon Kim, Woen-Sug Choi, 2025, Journal of Marine Science and Engineering)
- LLM-Based Decision Making Framework for Autonomous Drone Navigation(Mirza Aarish Baig, Brad Alvarez, Richard Lage, Jayesh Soni, Himanshi Upadhyay, 2026, 2026 IEEE 5th International Conference on AI in Cybersecurity (ICAIC))
- Towards Interactive and Learnable Cooperative Driving Automation: A Large Language Model-Driven Decision-Making Framework(Shiyu Fang, Jiaqi Liu, Mingyu Ding, Yiming Cui, Chengqi Lv, Peng Hang, Jian Sun, 2024, IEEE Transactions on Vehicular Technology)
- An LLM Framework for Inferring Household Energy Consumption Through Behaviour Simulation(Shaylin Chetty, Hai Le Vu, Hao Wang, R. Smyth, 2025, Proceedings of the 12th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation)
- Planning, Living and Judging: A Multi-agent LLM-based Framework for Cyclical Urban Planning(Hang Ni, Yuzhi Wang, Hao Liu, 2024, ArXiv Preprint)
- Elicitron: An LLM Agent-Based Simulation Framework for Design Requirements Elicitation(Mohammadmehdi Ataei, Hyunmin Cheong, Daniele Grandi, Ye Wang, Nigel Morris, Alexander Tessier, 2024, Journal of Computing and Information Science in Engineering)
- Richelieu: Self-Evolving LLM-Based Agents for AI Diplomacy(Zhenyu Guan, Xiangyu Kong, Fangwei Zhong, Yizhou Wang, 2024, Advances in Neural Information Processing Systems 37)
- Efficacy of Autonomous Vehicle’s Adaptive Decision-Making Based on Large Language Models Across Multiple Driving Scenarios(Guanzhi Xiong, Siyang Liu, Yihong Yan, Qile Li, Hangze Li, 2025, IEEE Access)
- Mitigating LLM Hallucinations Using a Multi-Agent Framework(Ahmed M. Darwish, Essam A. Rashed, Ghada Khoriba, 2025, Information)
- From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing(Lanxiao Huang, Daksh Dave, Ming Jin, Tyler Cody, Peter A. Beling, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing)
- TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents(Geon Lee, Wenchao Yu, Kijung Shin, Wei Cheng, Haifeng Chen, 2025, AAAI Conference on Artificial Intelligence)
- Simulation Study on Real-Time Autonomous Driving Decision-Making Using BEV Perception and Large Language Models(Gaosong Shi, Mingxia Yu, Xiaofan Sun, 2026, Technologies)
- Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code(Muhammad Haseeb, 2025, ArXiv Preprint)
- LightVA: Lightweight Visual Analytics With LLM Agent-Based Task Planning and Execution(Yuheng Zhao, Junjie Wang, Linbing Xiang, Xiaowen Zhang, Zifei Guo, C. Turkay, Yu Zhang, Siming Chen, 2024, IEEE Transactions on Visualization and Computer Graphics)
- RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing(Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, Xiangyu Zhang, 2025, International Conference on Machine Learning)
- PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection(Wenhao Li, Selvakumar Manickam, Yung-Wey Chong, Shankar Karuppayah, 2025, 2025 IEEE International Conference on Big Data (BigData))
- MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution(Wei Tao, Yucheng Zhou, Wenqiang Zhang, Yu-Xi Cheng, 2024, Neural Information Processing Systems)
- Personality-Driven Decision-Making in LLM-Based Autonomous Agents(Lewis Newsham, Daniel Prince, 2025, Adaptive Agents and Multi-Agent Systems)
- FinAgent: An Agentic AI Framework Integrating Personal Finance and Nutrition Planning(Toqeer Ali Syed, Abdulaziz Alshahrani, Ali Ullah, Ali Akarma, Sohail Khan, Muhammad Nauman, Salman Jan, 2025, 2025 International Conference on Computer and Applications (ICCA))
- DriveAgent: Multi-Agent Structured Reasoning With LLM and Multimodal Sensor Fusion for Autonomous Driving(Xinmeng Hou, Wuqi Wang, Long Yang, Haohong Lin, Jinglun Feng, Haigen Min, Xiangmo Zhao, 2025, IEEE Robotics and Automation Letters)
- Thucy: An LLM-based Multi-Agent System for Claim Verification across Relational Databases(Michael Theologitis, Dan Suciu, 2025, ArXiv Preprint)
- SenseRAG: Constructing Environmental Knowledge Bases with Proactive Querying for LLM-Based Autonomous Driving(Xuewen Luo, Fan Ding, Fengze Yang, Yang Zhou, J. Loo, H. Tew, Chenxi Liu, 2025, 2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW))
- First Field Trial of LLM-Powered AI Agent for Lifecycle Management of Autonomous Driving Optical Networks(Xiaomin Liu, Qizhi Qiu, Yihao Zhang, Yuming Cheng, L. Yi, Weisheng Hu, Q. Zhuge, 2024, Optical Fiber Communications Conference and Exhibition)
- UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design(Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Laurence Li, Jiri Gesi, Qi He, T. Li, Dakuo Wang, 2025, Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems)
- Integrated decision-making and path planning framework for autonomous driving in multi-lane obstacle avoidance(Tengfei Fu, Hongliang Zhou, Zhiyuan Liu, 2025, Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering)
- Autonomous Industrial Control using an Agentic Framework with Large Language Models(Javal Vyas, Mehmet Mercangöz, 2024, IFAC-PapersOnLine)
- El Agente: An Autonomous Agent for Quantum Chemistry(Yunheng Zou, Austin H. Cheng, Abdulrahman Aldossary, Jiaru Bai, Shi Xuan Leong, Jorge A. Campos Gonzalez Angulo, Chang-Min Choi, Cher Tian Ser, Gary Tom, Andrew Wang, Zijian Zhang, Ilya Yakavets, Han Hao, Chris Crebolder, Varinia Bernales, Al'an Aspuru-Guzik, 2025, Matter)
- Knowledge-extractor: a self-evolving scientific framework for hydrogen energy research driven by AI agents(Tongao Yao, Yang Yang, Yujie Yan, X. Ou, Mingyang Li, Chenxi Wang, Wuzhe Li, Chenghao Du, Xuqiang Shao, Zhengyang Gao, Weijie Yang, 2025, AI Agent)
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent(Xiaohan Wang, Yuhui Zhang, Orr Zohar, S. Yeung-Levy, 2024, European Conference on Computer Vision)
- LLM-Augmented Reinforcement Learning for Adaptive and Intelligent Decision-Making(Gopinath Karunanithi, Yatindra Kumar Gupta, Mallesh Deshapaga, Somnath Banerjee, Vandana Roy, 2025, 2025 World Conference on Cutting-Edge Science and Technology (WCCEST))
- Agent Laboratory: Using LLM Agents as Research Assistants(Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, E. Barsoum, 2025, Findings of the Association for Computational Linguistics: EMNLP 2025)
- IntelForge: Multi-Agent LLM Framework for Cyber Threat Intelligence Enrichment(Noam Tarshish, D. Hodisan, A. Shabtai, 2025, 2025 Annual Computer Security Applications Conference Workshops (ACSAC Workshops))
- OrcaLoca: An LLM Agent Framework for Software Issue Localization(Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, Jishen Zhao, 2025, International Conference on Machine Learning)
- SafeDrive: Knowledge- and Data-Driven Risk-Sensitive Decision-Making for Autonomous Vehicles with Large Language Models(Zhiyuan Zhou, Heye Huang, Boqi Li, Shiyue Zhao, Yao Mu, Jianqiang Wang, 2024, Accident Analysis and Prevention)
- CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning(Duo Wu, Jinghe Wang, Yuan Meng, Yanning Zhang, Le Sun, Zhi Wang, 2024, ArXiv Preprint)
- Hybrid LLM-DDQN-Based Joint Optimization of V2I Communication and Autonomous Driving(Zijiang Yan, Hao Zhou, Hina Tabassum, Xue Liu, 2025, IEEE Wireless Communications Letters)
- AMAP Agentic Planning Technical Report(AMAP AI Agent Team, Yulan Hu, Xiangwen Zhang, Sheng Ouyang, Hao Yi, Lu Xu, Qinglin Lang, Lide Tan, Xiang Cheng, Tianchen Ye, Zhicong Li, Ge Chen, Wenjin Yang, Zheng Pan, Shaopan Xiong, Siran Yang, Ju Huang, Yan Zhang, Jiamang Wang, Yong Liu, Yinfeng Huang, Ning Wang, Tucheng Lin, Xin Li, Ning Guo, 2025, ArXiv Preprint)
- SurrealDriver: Designing LLM-powered Generative Driver Agent Framework based on Human Drivers’ Driving-thinking Data(Ye Jin, Ruoxuan Yang, Zhijie Yi, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, Jiangtao Gong, 2023, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- STRIDE: A Systematic Framework for Selecting AI Modalities -- Agentic AI, AI Assistants, or LLM Calls(Shubhi Asthana, Bing Zhang, Chad DeLuca, Ruchi Mahindru, Hima Patel, 2025, ArXiv Preprint)
- AVA: Towards Autonomous Visualization Agents through Visual Perception‐Driven Decision‐Making(Shusen Liu, H. Miao, Zhimin Li, M. Olson, Valerio Pascucci, P. Bremer, 2023, Computer Graphics Forum)
- Large Language Model-Based Bidding Behavior Agent and Market Sentiment Agent-Assisted Electricity Price Prediction(Xin Lu, Jing Qiu, Yi Yang, Chenxi Zhang, Jiafeng Lin, Sihai An, 2025, IEEE Transactions on Energy Markets, Policy and Regulation)
- Exploring Applicability of LLM-Powered Autonomous Agents to Solve Real-life Problems: Microsoft Entra ID Administration Agent (MEAN)(Roberto Rodriguez, Nestori Syynimaa, 2024, Proceedings of the 26th International Conference on Enterprise Information Systems)
- Magic: AN LLM-based multi-agent activated graph-reasoning intelligent collaboration model for liver disease diagnosis(Bowen Liu, Yaqing Nie, Hong Song, Yucong Lin, Jingtao Li, Xu Weng, Zhaoli Su, Yuhong Suo, Tingting Lv, Xinyan Zhao, Jian Yang, 2025, Information Fusion)
- KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph(Jinhao Jiang, Kun Zhou, Wayne Xin Zhao, Yang Song, Chen Zhu, Hengshu Zhu, Ji-Rong Wen, 2024, Annual Meeting of the Association for Computational Linguistics)
- From Bits to Boardrooms: A Cutting-Edge Multi-Agent LLM Framework for Business Excellence(Zihao Wang, Junming Zhang, 2025, European Conference on Artificial Intelligence)
- EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records(Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce C. Ho, Carl Yang, M. D. Wang, 2024, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing)
- Memoria: A Scalable Agentic Memory Framework for Personalized Conversational AI(Samarth Sarin, Lovepreet Singh, Bhaskarjit Sarmah, Dhagash Mehta, 2025, 2025 5th International Conference on AI-ML-Systems (AIMLSystems))
- LLM-augmented hierarchical reinforcement learning for human-like decision-making of autonomous driving(Lin Li, Runjia Tan, Jianwu Fang, Jianru Xue, Chen Lv, 2025, Expert Systems with Applications)
- Edge Agentic AI Framework for Autonomous Network Optimisation in O-RAN(Abdelaziz Salama, Zeinab Nezami, Mohammed M. H. Qazzaz, Maryam Hafeez, S. A. R. Zaidi, 2025, 2025 IEEE 36th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC))
- CROSSAGENTIE: Cross-Type and Cross-Task Multi-Agent LLM Collaboration for Zero-Shot Information Extraction(Meng Lu, Yuzhang Xie, Zhenyu Bi, Shuxiang Cao, Xuan Wang, 2025, Findings of the Association for Computational Linguistics: ACL 2025)
- Design and evaluation of an Autonomous Cyber Defence agent using DRL and an augmented LLM(Johannes F. Loevenich, Erik Adler, Tobias Hürten, R. R. F. Lopes, 2025, Computer Networks)
- LLM-Driven Agentic AI Approach to Enhanced O-RAN Resilience in Next-Generation Networks(Xingqi Wu, Yuhui Wang, Junaid Farooq, Juntao Chen, 2025, IEEE INFOCOM 2025 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS))
- Conversational, agentic AI-enhanced architectural design process: three approaches to multimodal AI-enhanced early-stage performative design exploration(LokHang Cheung, Likai Wang, Dongxue Lei, 2025, Architectural Intelligence)
- Contextual Autonomy: The Next Phase of Agentic AI Architecture Beyond RAG and Prompt-Centric Design(2026, Journal of Computational Analysis and Applications)
本报告对 Large Language Models Agentic AI 领域的文献进行了系统性梳理,划分为五大逻辑板块:架构与工程框架确保系统的稳健性与可扩展性;多智能体协作与社会动力学研究核心在于交互与群体智慧;推理、规划与强化学习探讨智能核心的自主进化能力;安全、可靠与评估治理确立了应用边界与可信度底线;最后通过广泛的垂直领域与具身智能应用展现了其在真实物理与数字场景中的工业落地价值。
总计292篇相关文献
In the advanced technology nodes, the integrated design rule checker (DRC) is often utilized in place and route tools for fast optimization loops for power-performance-area. Implementing integrated DRC checkers to meet the standard of commercial DRC tools demands extensive human expertise to interpret foundry specifications, analyze layouts, and debug code iteratively. However, this labor-intensive process, requiring to be repeated by every update of technology nodes, prolongs the turnaround time of designing circuits. In this paper, we present DRC-Coder, a multi-agent framework with vision capabilities for automated DRC code generation. By incorporating vision language models and large language models (LLM), DRC-Coder can effectively process textual, visual, and layout information to perform rule interpretation and coding by two specialized LLMs. We also design an auto-evaluation function for LLMs to enable DRC code debugging. Experimental results show that targeting on a sub-3nm technology node for a state-of-the-art standard cell layout tool, DRC-Coder achieves perfect F1 score 1.000 in generating DRC codes for meeting the standard of a commercial DRC tool, highly outperforming standard prompting techniques (F1=0.631). DRC-Coder can generate code for each design rule within four minutes on average, which significantly accelerates technology advancement and reduces engineering costs.
Code auditing is the process of reviewing code with the aim of identifying bugs. Large Language Models (LLMs) have demonstrated promising capabilities for this task without requiring compilation, while also supporting user-friendly customization. However, auditing a code repository with LLMs poses significant challenges: limited context windows and hallucinations can degrade the quality of bug reports, and analyzing large-scale repositories incurs substantial time and token costs, hindering efficiency and scalability. This work introduces an LLM-based agent, RepoAudit, designed to perform autonomous repository-level code auditing. Equipped with agent memory, RepoAudit explores the codebase on demand by analyzing data-flow facts along feasible program paths within individual functions. It further incorporates a validator module to mitigate hallucinations by verifying data-flow facts and checking the satisfiability of path conditions associated with potential bugs, thereby reducing false positives. RepoAudit detects 40 true bugs across 15 real-world benchmark projects with a precision of 78.43%, requiring on average only 0.44 hours and $2.54 per project. Also, it detects 185 new bugs in high-profile projects, among which 174 have been confirmed or fixed. We have open-sourced RepoAudit at https://github.com/PurCL/RepoAudit.
We introduce DriveAgent, a modular multi-agent autonomous driving framework that leverages large language model (LLM) reasoning combined with multimodal sensor fusion for autonomous driving. DriveAgent orchestrates specialized agents operating on camera, Light Detection and Ranging (LiDAR), Inertial Measurement Unit (IMU), and Global Positioning System (GPS) with LLM-driven analytical processes to deliver temporally aligned perception, causal reasoning, and action recommendations. The framework operates through a modular agent-based pipeline comprising four principal modules: (i) a descriptive analysis agent identifying critical sensor data events based on filtered timestamps, (ii) dedicated vehicle-level analysis conducted by LiDAR and vision agents that collaboratively assess vehicle conditions and movements, (iii) environmental reasoning and causal analysis agents explaining contextual changes and their underlying mechanisms, and (iv) an urgency-aware decision-generation agent prioritizing insights and proposing timely maneuvers. This modular design empowers the LLM to effectively coordinate specialized perception and reasoning agents, delivering cohesive, interpretable insights into complex autonomous driving scenarios. Extensive experiments demonstrate that DriveAgent substantially outperforms baseline methods, achieving a 26.31% improvement in vehicle reasoning and consistent enhancements of up to 2.85% in environmental reasoning. These results highlight the effectiveness of our LLM-driven multi-agent sensor fusion framework in boosting the robustness and reliability of autonomous driving systems.
Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces Repair Agent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. Repair Agent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable Repair Agent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates Repair Agent's effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270k tokens per bug, which, under the current pricing of OpenAI's GPT-3.5 model, translates to 14 cents per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.
No abstract available
Multi-agent Large Language Model (LLM) systems have been leading the way in applied LLM research across a number of fields. One notable area is software development, where researchers have advanced the automation of code implementation, code testing, code maintenance, inter alia, using LLM agents. However, software development is a multifaceted environment that extends beyond just code. As such, a successful LLM system must factor in multiple stages of the software development life-cycle (SDLC). In this paper, we propose a vision for ALMAS, an Autonomous LLM-based Multi-Agent Software Engineering framework, which follows the above SDLC philosophy such that it may work within an agile software development team to perform several tasks end-to-end. ALMAS aligns its agents with agile roles, and can be used in a modular fashion to seamlessly integrate with human developers and their development environment. We showcase the progress towards ALMAS through our published works and a use case demonstrating the framework, where ALMAS is able to seamlessly generate an application and add new feature.
Existing LLM-enabled multi-agent frameworks are predominantly limited to digital or simulated environments and confined to narrowly focused knowledge domain, constraining their applicability to complex engineering tasks that require the design of physical embodiment, cross-disciplinary integration, and constraint-aware reasoning. This work proposes a multi-agent autonomous mechatronics design framework, integrating expertise across mechanical design, optimization, electronics, and software engineering to autonomously generate functional prototypes with minimal direct human design input. Operating primarily through a language-driven workflow, the framework incorporates structured human feedback to ensure robust performance under real-world constraints. To validate its capabilities, the framework is applied to a real-world challenge involving autonomous water-quality monitoring and sampling, where traditional methods are labor-intensive and ecologically disruptive. Leveraging the proposed system, a fully functional autonomous vessel was developed with optimized propulsion, cost-effective electronics, and advanced control. The design process was carried out by specialized agents, including a high-level planning agent responsible for problem abstraction and dedicated agents for structural, electronics, control, and software development. This approach demonstrates the potential of LLM-based multiagent systems to automate real-world engineering workflows and reduce reliance on extensive domain expertise.
Abstract Simulation is a widely used approach for evaluating system performance, robustness, and potential issues during design and testing. Large Language Models (LLMs) have recently shown strong potential in autonomous agent systems, including negotiation tasks—a core aspect of commerce. This paper evaluates LLM-based autonomous negotiator agents (LANAs) in a buyer-seller bargaining game to assess their decision-making and reasoning. We simulate interactions between agents embodying contrasting social behaviors: (a) Cunning vs. Kind, and (b) Greedy vs. Generous. By analyzing both the game outcomes and the agents’ internal reasoning, we find that LLMs can effectively simulate distinct social behaviors in both dialogue and decision-making. Our results offer insights into how social traits affect negotiation dynamics, emphasizing the importance of clear policy design to ensure fairness and reliability in LANA-based systems.
LLM-based multi-agent systems (MAS) have shown promise in tackling complex tasks. However, existing solutions often suffer from limited agent coordination and heavy reliance on predefined Standard Operating Procedures (SOPs), which demand extensive human input. To address these limitations, we propose MegaAgent, a large-scale autonomous LLM-based multi-agent system. MegaAgent generates agents based on task complexity and enables dynamic task decomposition, parallel execution, efficient communication, and comprehensive system monitoring of agents. In evaluations, MegaAgent demonstrates exceptional performance, successfully developing a Gobang game within 800 seconds and scaling up to 590 agents in a national policy simulation to generate multi-domain policies. It significantly outperforms existing systems, such as MetaGPT, in both task completion efficiency and scalability. By eliminating the need for predefined SOPs, MegaAgent demonstrates exceptional scalability and autonomy, setting a foundation for advancing true autonomy in MAS. Our code is available at https://github.com/Xtra-Computing/MegaAgent .
Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages--literature review, experimentation, and report writing to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluate the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.
Recently, Large Language Model based Autonomous System (LLMAS) has gained great popularity for its potential to simulate complicated behaviors of human societies. One of its main challenges is to present and analyze the dynamic events evolution of LLMAS. In this work, we present a visualization approach to explore the detailed statuses and agents’ behavior within LLMAS. Our approach outlines a general pipeline that organizes raw execution events from LLMAS into a structured behavior model. We leverage a behavior summarization algorithm to create a hierarchical summary of these behaviors, arranged according to their sequence over time. Additionally, we design a cause trace method to mine the causal relationship between agent behaviors. We then develop AgentLens, a visual analysis system that leverages a hierarchical temporal visualization for illustrating the evolution of LLMAS, and supports users to interactively investigate details and causes of agents’ behaviors. Two usage scenarios and a user study demonstrate the effectiveness and usability of our AgentLens.
Computational chemistry tools are widely used to study the behaviour of chemical phenomena. Yet, the complexity of these tools can make them inaccessible to non-specialists and challenging even for experts. In this work, we introduce El Agente Q, an LLM-based multi-agent system that dynamically generates and executes quantum chemistry workflows from natural language user prompts. The system is built on a novel cognitive architecture featuring a hierarchical memory framework that enables flexible task decomposition, adaptive tool selection, post-analysis, and autonomous file handling and submission. El Agente Q is benchmarked on six university-level course exercises and two case studies, demonstrating robust problem-solving performance (averaging>87% task success) and adaptive error handling through in situ debugging. It also supports longer-term, multi-step task execution for more complex workflows, while maintaining transparency through detailed action trace logs. Together, these capabilities lay the foundation for increasingly autonomous and accessible quantum chemistry.
We design and demonstrate the first field trial of LLM-powered AI Agent for ADON. Three operation modes of the Agent are proposed for network lifecycle management to process wavelength add/drop, soft/hard failures, and power optimizations.
The common-sense reasoning abilities and vast general knowledge of large language models (LLMs) make them a natural fit for interpreting user requests in a smart home assistant context. LLMs, however, lack specific knowledge about the user and their home, which limits their potential impact. Smart home agent with grounded execution (SAGE), overcomes these and other limitations by using a scheme in which a user request triggers an LLM-controlled sequence of discrete actions. These actions can be used to retrieve information, interact with the user, or manipulate device states. SAGE controls this process through a dynamically constructed tree of LLM prompts, which help it decide which action to take next, whether an action was successful, and when to terminate the process. The SAGE action set augments an LLM’s capabilities to support some of the most critical requirements for a smart home assistant. These include: flexible and scalable user preference management (“Is my team playing tonight?”), access to any smart device’s full functionality without device-specific code via API reading (“Turn down the screen brightness on my dryer”), persistent device state monitoring (“Remind me to throw out the milk when I open the fridge”), natural device references using only a photo of the room (“Turn on the lamp on the dresser”), and more. We introduce a benchmark of 50 new and challenging smart home tasks where SAGE achieves a 76% success rate, significantly outperforming existing LLM-enabled baselines (30% success rate).
: Microsoft Entra ID is Microsoft’s identity and access management solution used by many public and private sector organisations globally. In March 2023, Microsoft retired two PowerShell modules which have enabled automation of administrative tasks, such as user management. The replacement module is based on Microsoft Graph API, and its effective usage would require administrators to learn software development skills. In this paper, we will report the results of work-in-progress research on exploring the applicability of LLM-powered autonomous agents to solve real-life problems. We describe the design and proof-of-concept implementation of MEAN, an agent that performs Entra ID administrative tasks using Microsoft Graph API based on natural language prompts. The results show that LLM-powered autonomous agents can perform at least simple Entra ID administrative tasks. This indicates that the agents could ease the administrative burden by removing the need to learn software development skills.
Large Language Model (LLM)-based Multi-agent Systems (MAS) have demonstrated remarkable capabilities in various complex tasks, ranging from collaborative problem-solving to autonomous decision-making. However, as these systems become increasingly integrated into critical applications, their vulnerability to adversarial attacks, misinformation propagation, and unintended behaviors have raised significant concerns. To address this challenge, we introduce G-Safeguard, a topology-guided security lens and treatment for robust LLM-MAS, which leverages graph neural networks to detect anomalies on the multi-agent utterance graph and employ topological intervention for attack remediation. Extensive experiments demonstrate that G-Safeguard: (I) exhibits significant effectiveness under various attack strategies, recovering over 40% of the performance for prompt injection; (II) is highly adaptable to diverse LLM backbones and large-scale MAS; (III) can seamlessly combine with mainstream MAS with security guarantees. The code is available at https://github.com/wslong20/G-safeguard.
Recent developments in Large Language Model (LLM) agents are revolutionizing Autonomous Software Engineering (ASE), enabling automated coding, problem fixes, and feature improvements. However, localization -- precisely identifying software problems by navigating to relevant code sections -- remains a significant challenge. Current approaches often yield suboptimal results due to a lack of effective integration between LLM agents and precise code search mechanisms. This paper introduces OrcaLoca, an LLM agent framework that improves accuracy for software issue localization by integrating priority-based scheduling for LLM-guided action, action decomposition with relevance scoring, and distance-aware context pruning. Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.
In this paper, we aim to improve the reasoning ability of large language models (LLMs) over knowledge graphs (KGs) to answer complex questions. Inspired by existing methods that design the interaction strategy between LLMs and KG, we propose an autonomous LLM-based agent framework, called KG-Agent, which enables a small LLM to actively make decisions until finishing the reasoning process over KGs. In KG-Agent, we integrate the LLM, multifunctional toolbox, KG-based executor, and knowledge memory, and develop an iteration mechanism that autonomously selects the tool then updates the memory for reasoning over KG. To guarantee the effectiveness, we leverage program language to formulate the multi-hop reasoning process over the KG, and synthesize a code-based instruction dataset to fine-tune the base LLM. Extensive experiments demonstrate that only using 10K samples for tuning LLaMA-7B can outperform state-of-the-art methods using larger LLMs or more data, on both in-domain and out-domain datasets. Our code and data will be publicly released.
Large language models (LLMs) are evolving into autonomous decision-makers, raising concerns about catastrophic risks in high-stakes scenarios, particularly in Chemical, Biological, Radiological and Nuclear (CBRN) domains. Based on the insight that such risks can originate from trade-offs between the agent's Helpful, Harmlessness and Honest (HHH) goals, we build a novel three-stage evaluation framework, which is carefully constructed to effectively and naturally expose such risks. We conduct 14,400 agentic simulations across 12 advanced LLMs, with extensive experiments and analysis. Results reveal that LLM agents can autonomously engage in catastrophic behaviors and deception, without being deliberately induced. Furthermore, stronger reasoning abilities often increase, rather than mitigate, these risks. We also show that these agents can violate instructions and superior commands. On the whole, we empirically prove the existence of catastrophic risks in autonomous LLM agents. We release our code to foster further research.
The embedding of Large Language Models (LLMs) into autonomous agents is a rapidly developing field which enables dynamic, configurable behaviours without the need for extensive domain-specific training. In our previous work, we introduced SANDMAN, a Deceptive Agent architecture leveraging the Five-Factor OCEAN personality model, demonstrating that personality induction significantly influences agent task planning. Building on these findings, this study presents a novel method for measuring and evaluating how induced personality traits affect task selection processes-specifically planning, scheduling, and decision-making-in LLM-based agents. Our results reveal distinct task-selection patterns aligned with induced OCEAN attributes, underscoring the feasibility of designing highly plausible Deceptive Agents for proactive cyber defense strategies.
Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 27 different types of attack/defense methods, and 7 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 corresponding defenses across 13 LLM backbones. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate of 84.30\%, but limited effectiveness shown in current defenses, unveiling important works to be done in terms of agent security for the community. We also introduce a new metric to evaluate the agents' capability to balance utility and security. Our code can be found at https://github.com/agiresearch/ASB.
Chemical process optimization is crucial to maximize production efficiency and economic performance. Optimization algorithms, including gradient-based solvers, numerical methods, and parameter grid searches, become impractical when operating constraints are ill-defined or unavailable, requiring engineers to rely on subjective heuristics to estimate feasible parameter ranges. To address this constraint definition bottleneck, we present a multi-agent framework of large language model (LLM) agents that autonomously infer operating constraints from minimal process descriptions, then collaboratively guide optimization using the inferred constraints. Our AutoGen-based agentic framework employs OpenAI’s o3 model, with specialized agents for constraint generation, parameter validation, simulation execution, and optimization guidance. Through two phases: (i) autonomous constraint generation using embedded domain knowledge, and (ii) iterative multi-agent optimization, the framework eliminates the need for predefined operational bounds. Validated on the hydrodealkylation process across cost, yield, and yield-to-cost ratio metrics, the framework demonstrated competitive performance with conventional optimization methods while achieving a 31-fold reduction in wall-time relative to grid search, converging in under 20 min and requiring far fewer iterations to converge. Beyond computational efficiency, the framework’s reasoning-guided search demonstrates sophisticated process understanding, correctly identifying utility trade-offs, and applying domain-informed heuristics. Unlike conventional optimization methods like Bayesian optimization that require predefined constraints, our approach uniquely combines autonomous constraint generation with interpretable, reasoning-guided parameter exploration. Reproducibility analysis across five independent trials demonstrates consistent convergence behavior, while model comparison reveals that reasoning-capable LLM architectures (o3, o1) are essential for successful optimization, with standard models failing to converge effectively. This approach shows significant potential for optimization scenarios where operational constraints are poorly characterized or unavailable, particularly for emerging processes and retrofit applications.
Scene simulation in autonomous driving has gained significant attention because of its huge potential for generating customized data. However, existing editable scene simulation approaches face limitations in terms of user interaction efficiency, multi-camera photo-realistic rendering and external digital assets integration. To address these challenges, this paper introduces ChatSim, the first system that enables editable photo-realistic 3D driving scene simulations via natural language commands with external digital assets. To enable editing with high command flexibility, ChatSim leverages a large language model (LLM) agent collaboration framework. To generate photo-realistic outcomes, ChatSim employs a novel multi-camera neural radiance field method. Furthermore, to unleash the potential of extensive high-quality digital assets, ChatSim employs a novel multi-camera lighting estimation method to achieve scene-consistent assets' rendering. Our experiments on Waymo Open Dataset demonstrate that ChatSim can handle complex language commands and generate corresponding photo-realistic scene videos. Code can be accessed at: https://github.com/yifanlu0227/chatSim.
Recently, autonomous agents built on large language models (LLMs) have experienced significant development and are being deployed in real-world applications. These agents can extend the base LLM's capabilities in multiple ways. For example, a well-built agent using GPT-3.5-Turbo as its core can outperform the more advanced GPT-4 model by leveraging external components. More importantly, the usage of tools enables these systems to perform actions in the real world, moving from merely generating text to actively interacting with their environment. Given the agents' practical applications and their ability to execute consequential actions, it is crucial to assess potential vulnerabilities. Such autonomous systems can cause more severe damage than a standalone language model if compromised. While some existing research has explored harmful actions by LLM agents, our study approaches the vulnerability from a different perspective. We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions. We conduct comprehensive evaluations using various attack methods, surfaces, and properties to pinpoint areas of susceptibility. Our experiments reveal that these attacks can induce failure rates exceeding 80\% in multiple scenarios. Through attacks on implemented and deployable agents in multi-agent scenarios, we accentuate the realistic risks associated with these vulnerabilities. To mitigate such attacks, we propose self-examination detection methods. However, our findings indicate these attacks are difficult to detect effectively using LLMs alone, highlighting the substantial risks associated with this vulnerability.
Table reasoning requires models to jointly perform comprehensive semantic understanding and precise numerical operations. Although recent large language model (LLM)-based methods have achieved promising results, most of them still rely on a single-turn reasoning paradigm that processes flattened tables in a single forward pass. This paradigm suffers from inherent limitations, including context overflow on large tables, weak sensitivity to continuous numerical values, and the absence of explicit tool-use and reflection. In this paper, we propose TableMind, a tuning-based autonomous programmatic table agent that simulates the human-like cognitive schema of multi-turn interaction within a lightweight LLM. Instead of adopting a training-free workflow design, TableMind learns to internalize planning, action, and reflection through a principled two-stage training strategy. To bootstrap structured table reasoning capabilities, we construct and filter high-quality reasoning data for the supervised fine-tuning (SFT) stage. To enable precise code generation, we introduce a designed multi-perspective reward scheme and a novel optimization objective in the reinforcement learning (RL) stage. Extensive experiments on diverse benchmarks demonstrate that TableMind consistently outperforms previous baselines, validating the effectiveness of training autonomous agents to improve overall performance.
Creating Augmented Reality (AR) applications requires expertise in both design and implementation, posing significant barriers to entry for non-expert users. While existing methods reduce some of this burden, they often fall short in flexibility or usability for complex or varied use cases. To address this, we introduce agentAR, an AR authoring system that leverages a tool-augmented large language model (LLM)–based autonomous agent to support end-to-end, in-situ AR application creation from natural language input. Built on an application structure and tool library derived from state-of-the-art AR research, the agent autonomously creates AR applications from natural language dialogue. We demonstrate the effectiveness of agentAR through a case study of six AR applications and a user study with twelve participants, showing that it significantly reduces user effort while supporting the creation of diverse and functional AR experiences.
The integration of a complex set of electronic design automation (EDA) tools to enhance interoperability is a critical concern for circuit designers. Recent advancements in large language models (LLMs) have showcased their exceptional capabilities in natural language processing and comprehension, offering a novel approach to interfacing with EDA tools. This research article introduces ChatEDA, an autonomous agent for EDA empowered by an LLM, AutoMage, complemented by EDA tools serving as executors. ChatEDA streamlines the design flow from the register-transfer level (RTL) to the graphic data system version II (GDSII) by effectively managing task decomposition, script generation, and task execution. Through comprehensive experimental evaluations, ChatEDA has demonstrated its proficiency in handling diverse requirements, and our fine-tuned AutoMage model has exhibited superior performance compared to GPT-4 and other similar LLMs.
Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and close-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration.
Automated Driving System (ADS) is a safety-critical software system responsible for the interpretation of the vehicle’s environment and making decisions accordingly. The unbounded complexity of the driving context, including unforeseeable events, necessitate continuous improvement, often achieved through iterative DevOps processes. However, DevOps processes are themselves complex, making these improvements both time- and resource-intensive. Automation in code generation for ADS using Large Language Models (LLM) is one potential approach to address this challenge. Nevertheless, the development of ADS requires rigorous processes to verify, validate, assess, and qualify the code before it can be deployed in the vehicle and used. In this study, we developed and evaluated a prototype for automatic code generation and assessment using a designed pipeline of a LLM-based agent, simulation model, and rule-based feedback generator in an industrial setup. The LLM-generated code is evaluated automatically in a simulation model against multiple critical traffic scenarios, and an assessment report is provided as feedback to the LLM for modification or bug fixing. We report about the experimental results of the prototype employing Codellama:34b, DeepSeek (r1:32b and Coder:33b), CodeGemma:7b, Mistral:7b, and GPT4 for Adaptive Cruise Control (ACC) and Unsupervised Collision Avoidance by Evasive Manoeuvre (CAEM). We finally assessed the tool with 11 experts at two Original Equipment Manufacturers (OEMs) by conducting an interview study.
Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi-agent systems (MAS) for industrial problem-solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand-crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm-shifting framework that redefines LLM-based MAS through a novel dynamic graph designer. This component autonomously identifies task-specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task-optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures. Our investigation establishes that context-sensitive structural adaptability constitutes a foundational requirement for high-performance LLM MAS deployments.
LLM-based autonomous agents often fail to execute complex web tasks that require dynamic interaction, largely due to the inherent uncertainty and complexity of these environments. Existing LLM-based web agents typically rely on rigid, expert-designed policies specific to certain states and actions, lacking the flexibility and generalizability needed to adapt to unseen tasks. In contrast, humans excel by exploring unknowns, continuously adapting strategies based on new observations, and resolving ambiguities through exploration. To emulate human-like adaptability, web agents need strategic exploration and complex decision-making. Monte Carlo Tree Search (MCTS) is well-suited for this, but classical MCTS struggles with vast action spaces, unpredictable state transitions, and incomplete information in web tasks. In light of this, we develop WebPilot, a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. Specifically, the Global Optimization phase involves generating a high-level plan by breaking down tasks into manageable subtasks, continuously refining this plan through reflective analysis of new observations and previous subtask attempts, thereby focusing the search process and mitigating challenges posed by vast action spaces in classical MCTS. Subsequently, the Local Optimization phase executes each subtask using a tailored MCTS designed for complex environments, effectively addressing uncertainties and managing incomplete information by iteratively refining decisions based on new observations. Experimental results on WebArena and MiniWoB++ demonstrate the effectiveness of WebPilot. Notably, on WebArena, WebPilot achieves SOTA performance with GPT-4, achieving a 93% relative increase in success rate over the concurrent tree search-based method. WebPilot advances autonomous agents, enabling more reliable decision-making in practical environments.
No abstract available
This paper explores the significant shift towards agentic workflows in the application of Large Language Models (LLMs), moving away from traditional, linear interactions between users and AI. Through a case study analysis, we highlight the effectiveness of agentic workflows, which facilitate a more dynamic and iterative engagement, in improving outcomes in tasks such as question answering, code generation or stock analysis. Central to the agentic workflow are four foundational design patterns: reflection, planning, multi-agent collaboration, and tool utilization. These components are crucial for boosting LLM productivity and enhancing performance. The study demonstrates how agentic workflows, by promoting an iterative and reflective process, can serve as a crucial step towards achieving Artificial General Intelligence (AGI).
Large language models (LLMs) can answer expert-level questions in medicine but are prone to hallucinations and arithmetic errors. Early evidence suggests LLMs cannot reliably perform clinical calculations, limiting their potential integration into clinical workflows. We evaluated ChatGPT’s performance across 48 medical calculation tasks, finding incorrect responses in one-third of trials (n = 212). We then assessed three forms of agentic augmentation: retrieval-augmented generation, a code interpreter tool, and a set of task-specific calculation tools (OpenMedCalc) across 10,000 trials. Models with access to task-specific tools showed the greatest improvement, with LLaMa and GPT-based models demonstrating a 5.5-fold (88% vs 16%) and 13-fold (64% vs 4.8%) reduction in incorrect responses, respectively, compared to the unimproved models. Our findings suggest that integration of machine-readable, task-specific tools may help overcome LLMs’ limitations in medical calculations.
As chemical plants evolve towards full autonomy, the need for effective fault handling and control in dynamic, unpredictable environments becomes increasingly critical. This paper proposes an innovative approach to industrial automation, introducing validation and reprompting architectures utilizing large language model (LLM)-based autonomous control agents. The proposed agentic system, comprising of operator, validator, and reprompter agents, enables autonomous management of control tasks, adapting to unforeseen disturbances without human intervention. By utilizing validation and reprompting architectures, the framework allows agents to recover from errors and continuously improve decision-making in real-time industrial scenarios. We hypothesize that this mechanism will enhance performance and reliability across a variety of LLMs, offering a path toward fully autonomous systems capable of handling unexpected challenges, paving the way for robust, adaptive control in complex industrial environments. To demonstrate the concept's effectiveness, we created a simple case study involving a temperature control experiment embedded on a microcontroller device, validating the proposed approach.
Molecular dynamics (MD) simulations are essential for understanding biomolecular systems but remain challenging to automate. Recent advances in large language models (LLMs) have demonstrated success in automating complex scientific tasks using LLM-based agents. In this paper, we introduce MDCrow, an agentic LLM assistant capable of automating MD workflows for proteins. MDCrow uses chain-of-thought over 40 expert-designed tools for handling and processing files, setting up simulations, analyzing the simulation outputs, and retrieving relevant information from literature and databases. We assess MDCrow’s performance across 25 common tasks of varying complexity, and we evaluate the agent’s robustness to difficulty and prompt style. gpt-4o is able to complete increasingly complex tasks with low variance, followed closely by llama3-405b, a compelling open-source model. While prompt style does not influence the best models’ performance, it has significant effects on smaller models.
A central piece in enabling intelligent agentic behavior in foundation models is to make them capable of introspecting upon their behavior, reasoning, and correcting their mistakes as more computation or interaction is available. Even the strongest proprietary large language models (LLMs) do not quite exhibit the ability of continually improving their responses sequentially, even in scenarios where they are explicitly told that they are making a mistake. In this paper, we develop RISE: Recursive IntroSpEction, an approach for fine-tuning LLMs to introduce this capability, despite prior work hypothesizing that this capability may not be possible to attain. Our approach prescribes an iterative fine-tuning procedure, which attempts to teach the model how to alter its response after having executed previously unsuccessful attempts to solve a hard test-time problem, with optionally additional environment feedback. RISE poses fine-tuning for a single-turn prompt as solving a multi-turn Markov decision process (MDP), where the initial state is the prompt. Inspired by principles in online imitation learning and reinforcement learning, we propose strategies for multi-turn data collection and training so as to imbue an LLM with the capability to recursively detect and correct its previous mistakes in subsequent iterations. Our experiments show that RISE enables Llama2, Llama3, and Mistral models to improve themselves with more turns on math reasoning tasks, outperforming several single-turn strategies given an equal amount of inference-time computation. We also find that RISE scales well, often attaining larger benefits with more capable models. Our analysis shows that RISE makes meaningful improvements to responses to arrive at the correct solution for challenging prompts, without disrupting one-turn abilities as a result of expressing more complex distributions.
PURPOSE The purpose of the study is to demonstrate the value of custom methods, namely Retrieval Augmented Generation(RAG)-based Large Language Models(LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information using an anterior cruciate ligament(ACL) injury case. METHODS A set of 100 questions and answers based on the 2022 AAOS ACL guidelines were curated. Closed-source(Open AI GPT4/GPT 3.5 and Anthropic's Claude3) and open-source models(LLama3 8b/70b and Mistral8x7b) were asked questions in base form and again with AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with Artificial Intelligence(AI) Agents and re-evaluated. Two fellowship-trained surgeons blindly evaluated the accuracy of the responses of each cohort. ROUGE and METEOR scores were calculated to assess semantic similarity in the response. RESULTS All non-custom LLM models started below 60% accuracy. Applying RAG improved the accuracy of every model by an average 39.7%. The highest performing model with just RAG was Meta's Open-Source Llama3 70b(94%). The highest performing model with RAG and AI Agents was Open AI's GPT4(95%). CONCLUSION RAG improved accuracy by an average of 39.7%, with the highest accuracy rate of 94% in the Meta Llama3 70b. Incorporating AI agents into a previously RAG-augmented LLM improved ChatGPT4 accuracy rate to 95%. Thus, Agentic and RAG augmented LLMs can be accurate liaisons of information, supporting our hypothesis. CLINICAL RELEVANCE Despite literature surrounding the use of LLM in medicine, there has been considerable and appropriate skepticism given the variably accurate response rates. This study establishes the groundwork to identify whether custom modifications to LLMs using RAG and Agentic augmentation can better deliver accurate information in orthopaedic care. With this knowledge, online medical information commonly sought in popular LLMs, such as ChatGPT, can be standardized and provide relevant online medical information to better support shared decision making between surgeon and patient.
Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the \textbf{agentic supernet}, a probabilistic and continuous distribution of agentic architectures. We introduce MaAS, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (\textit{e.g.}, LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS \textbf{(I)} requires only $6\sim45\%$ of the inference costs of existing handcrafted or automated multi-agent systems, \textbf{(II)} surpasses them by $0.54\%\sim11.82\%$, and \textbf{(III)} enjoys superior cross-dataset and cross-LLM-backbone transferability.
We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Agentic Reasoning dynamically leverages web search, code execution, and structured memory to address complex problems requiring deep research. A key innovation in our framework is the Mind-Map agent, which constructs a structured knowledge graph to store reasoning context and track logical relationships, ensuring coherence in long reasoning chains with extensive tool usage. Additionally, we conduct a comprehensive exploration of the Web-Search agent, leading to a highly effective search mechanism that surpasses all prior approaches. When deployed on DeepSeek-R1, our method achieves a new state-of-the-art (SOTA) among public models and delivers performance comparable to OpenAI Deep Research, the leading proprietary model in this domain. Extensive ablation studies validate the optimal selection of agentic tools and confirm the effectiveness of our Mind-Map and Web-Search agents in enhancing LLM reasoning. The code is at: https://github.com/theworldofagents/Agentic-Reasoning
Scientific data visualization plays a crucial role in research by enabling the direct display of complex information and assisting researchers in identifying implicit patterns. Despite its importance, the use of Large Language Models (LLMs) for scientific data visualization remains rather unexplored. In this study, we introduce MatPlotAgent, an efficient model-agnostic LLM agent framework designed to automate scientific data visualization tasks. Leveraging the capabilities of both code LLMs and multi-modal LLMs, MatPlotAgent consists of three core modules: query understanding, code generation with iterative debugging, and a visual feedback mechanism for error correction. To address the lack of benchmarks in this field, we present MatPlotBench, a high-quality benchmark consisting of 100 human-verified test cases. Additionally, we introduce a scoring approach that utilizes GPT-4V for automatic evaluation. Experimental results demonstrate that MatPlotAgent can improve the performance of various LLMs, including both commercial and open-source models. Furthermore, the proposed evaluation method shows a strong correlation with human-annotated scores.
We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to ground and act in the visual-spatial world (spatial-temporal intelligence). To endow agentic capabilities for tasks ranging from UI navigation to robot manipulation, Magma is trained on large amounts of heterogeneous datasets that span from images, videos to robotics data, where actionable visual objects (e.g. clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and object movements (e.g. trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM help bridge the gap between verbal and action abilities and significantly enhance spatio-temporal intelligence which is fundamental to agentic tasks, as shown in Fig. 1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. Moreover, Magma preserves strong multimodal understanding ability and compares favorably to popular large multimodal models that are trained on much larger datasets. We have made our model and code public for reproducibility1.
Biomedical research underpins progress in our understanding of human health and disease, drug discovery, and clinical care. However, with the growth of complex lab experiments, large datasets, many analytical tools, and expansive literature, biomedical research is increasingly constrained by repetitive and fragmented workflows that slow discovery and limit innovation, underscoring the need for a fundamentally new way to scale scientific expertise. Here, we introduce Biomni, a general-purpose biomedical AI agent designed to autonomously execute a wide spectrum of research tasks across diverse biomedical subfields. To systematically map the biomedical action space, Biomni first employs an action discovery agent to create the first unified agentic environment – mining essential tools, databases, and protocols from tens of thousands of publications across 25 biomedical domains. Built on this foundation, Biomni features a generalist agentic architecture that integrates large language model (LLM) reasoning with retrieval-augmented planning and code-based execution, enabling it to dynamically compose and carry out complex biomedical workflows – entirely without relying on predefined templates or rigid task flows. Systematic benchmarking demonstrates that Biomni achieves strong generalization across heterogeneous biomedical tasks – including causal gene prioritization, drug repurposing, rare disease diagnosis, micro-biome analysis, and molecular cloning – without any task-specific prompt tuning. Real-world case studies further showcase Biomni’s ability to interpret complex, multi-modal biomedical datasets and autonomously generate experimentally testable protocols. Biomni envisions a future where virtual AI biologists operate alongside and augment human scientists to dramatically enhance research productivity, clinical insight, and healthcare. Biomni is ready to use at https://biomni.stanford.edu, and we invite scientists to explore its capabilities, stress-test its limits, and co-create the next era of biomedical discoveries.
Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE.
Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling for driving agentic systems at the edge. We first show how to enable accurate function calling for open-source models via the LLMCompiler framework. We then systematically curate a high-quality dataset for function calling, which we use to fine-tune two small language models, TinyAgent-1.1B and 7B. For efficient inference, we introduce a novel tool retrieval method to reduce the input prompt length and utilize quantization to further accelerate the inference speed. As a driving application, we demonstrate a local Siri-like system for Apple’s MacBook that can execute user commands through text or voice input. Our results show that our models can achieve, and even surpass, the function-calling capabilities of larger models like GPT-4-Turbo, while being fully deployed at the edge. We open-source our [dataset, models, and installable package](https://github.com/SqueezeAILab/TinyAgent) and provide a [demo video](https://www.youtube.com/watch?v=0GvaGL9IDpQ) for our MacBook assistant agent.
This work presents a large language model (LLM)-based agent OpenFOAMGPT tailored for OpenFOAM-centric computational fluid dynamics (CFD) simulations, leveraging two foundation models from OpenAI: the GPT-4o (GPT means Generative Pre-trained Transformer) and a chain-of-thought–enabled o1 preview model. Both agents demonstrate success across multiple tasks. While the price of token with o1 model is six times as that of GPT-4o, it consistently exhibits superior performance in handling complex tasks, from zero-shot/few-shot case setup to boundary condition modifications, zero-shot turbulence model adjustments, and zero-shot code translation. Through an iterative correction loop, the agent efficiently addressed single-phase and multiphase flow, heat transfer, Reynolds-averaged Navier–Stokes modeling, large eddy simulation, and other engineering scenarios, often converging in a limited number of iterations at low token costs. To embed domain-specific knowledge, we employed a retrieval-augmented generation pipeline, demonstrating how preexisting simulation setups can further specialize the agent for subdomains such as energy and aerospace. Despite the great performance of the agent, human oversight remains crucial for ensuring accuracy and adapting to shifting contexts. Fluctuations in model performance over time suggest the need for monitoring in mission-critical applications. Although our demonstrations focus on OpenFOAM, the adaptable nature of this framework opens the door to developing LLM-driven agents into a wide range of solvers and codes. By streamlining CFD simulations, this approach has the potential to accelerate both fundamental research and industrial engineering advancements.
Time series data is essential in various applications, including climate modeling, healthcare monitoring, and financial analytics. Understanding the contextual information associated with real-world time series data is often essential for accurate and reliable event predictions. In this paper, we introduce TimeCAP, a time-series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data, extending their typical usage as predictors. TimeCAP incorporates two independent LLM agents: one generates a textual summary capturing the context of the time series, while the other uses this enriched summary to make more informed predictions. In addition, TimeCAP employs a multi-modal encoder that synergizes with the LLM agents, enhancing predictive performance through mutual augmentation of inputs with in-context examples. Experimental results on real-world datasets demonstrate that TimeCAP outperforms state-of-the-art methods for time series event prediction, including those utilizing LLMs as predictors, achieving an average improvement of 28.75% in F1 score.
Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
Day-ahead electricity price prediction is crucial for market participants to make optimal trading decisions. The implementation of the five-minute settlement (5MS) process in the Australian National Electricity Market (NEM) on October 1, 2021, reduced the settlement interval from 30 minutes to 5 minutes. This change has led to more frequent adjustments in pricing, allowing for a more accurate reflection of real-time supply and demand conditions. However, this increased frequency has significantly heightened the complexity of price fluctuations in the wholesale market. Consequently, conventional machine learning and deep learning methods struggle to provide accurate predictions at this higher resolution. Since electricity prices are fundamentally determined by the supply-demand balance and the bidding behaviors of market participants, this work introduces individual participant's bidding behaviors into the prediction model. We fine-tune a pre-trained Large Language Model (LLM) to create bidding behavior agents, which forecasts day-ahead bidding behaviors. Moreover, market sentiment plays a significant role in electricity price volatility, yet it remains challenging to quantify and assess its impact. To address this, we employ a pre-trained LLM to analyze online resources, incorporating market sentiment into the price prediction model. Additionally, to enhance the accuracy of spike predictions, we improve the conditional time series generative adversarial network (CTSGAN) model by utilizing a spike confusion matrix and further strengthen the model by integrating bidding behavior and market sentiment as inputs. Case studies demonstrate that the proposed model significantly improves both electricity price and spike prediction accuracy, offering a robust tool for market participants to navigate the complexities of the modern electricity market.
Using commercial software for radio map generation and wireless network planning often require complex manual operations, posing significant challenges in terms of scalability, adaptability, and user-friendliness, due to heavy manual operations. To address these issues, we propose an automated solution that employs large language model (LLM) agents. These agents are designed to autonomously generate radio maps and facilitate wireless network planning for specified areas, thereby minimizing the necessity for extensive manual intervention. To validate the effectiveness of our proposed solution, we develop a software platform that integrates LLM agents. Experimental results demonstrate that a large amount manual operations can be saved via the proposed LLM agent, and the automated solutions can achieve an enhanced coverage and signal-to-interference-noise ratio (SINR), especially in urban environments.
Objective. Radiotherapy planning requires significant expertise to balance tumor control and organ-at-risk (OAR) sparing. Automated planning can improve both efficiency and quality. This study introduces GPT-Plan, a novel multi-agent system powered by the GPT-4 family of large language models (LLMs), for automating the iterative radiotherapy plan optimization. Approach. GPT-Plan uses LLM-driven agents, mimicking the collaborative clinical workflow of a dosimetrist and physicist, to iteratively generate and evaluate text-based radiotherapy plans based on predefined criteria. Supporting tools assist the agents by leveraging historical plans, mitigating LLM hallucinations, and balancing exploration and exploitation. Performance was evaluated on 12 lung (IMRT) and 5 cervical (VMAT) cancer cases, benchmarked against the ECHO auto-planning method and manual plans. The impact of historical plan retrieval on efficiency was also assessed. Results. For IMRT lung cancer cases, GPT-Plan generated high-quality plans, demonstrating superior target coverage and homogeneity compared to ECHO while maintaining comparable or better OAR sparing. For VMAT cervical cancer cases, plan quality was comparable to a senior physicist and consistently superior to a junior physicist, particularly for OAR sparing. Retrieving historical plans significantly reduced the number of required optimization iterations for lung cases (p < 0.01) and yielded iteration counts comparable to those of the senior physicist for cervical cases (p = 0.313). Occasional LLM hallucinations have been mitigated by self-reflection mechanisms. One limitation was the inaccuracy of vision-based LLMs in interpreting dose images. Significance. This pioneering study demonstrates the feasibility of automating radiotherapy planning using LLM-powered agents for complex treatment decision-making tasks. While challenges remain in addressing LLM limitations, ongoing advancements hold potential for further refining and expanding GPT-Plan’s capabilities.
Configuring computational fluid dynamics (CFD) simulations typically demands extensive domain expertise, limiting broader access. Although large language models (LLMs) have advanced scientific computing, their use in automating CFD workflows is underdeveloped. We introduce a novel approach centered on domain-specific LLM adaptation. By fine-tuning Qwen2.5-7B-Instruct on NL2FOAM, our custom dataset of 28716 natural language-to-OpenFOAM configuration pairs with chain-of-thought (CoT) annotations, we enable direct translation from natural language descriptions to executable CFD setups. A multi-agent framework orchestrates the process, autonomously verifying inputs, generating configurations, running simulations, and correcting errors. Evaluation on a benchmark of 21 diverse flow cases demonstrates state-of-the-art performance, achieving 88.7% solution accuracy and 82.6% first-attempt success rate. This significantly outperforms larger general-purpose models like Qwen2.5-72B-Instruct, DeepSeek-R1, and Llama3.3-70B-Instruct, while also requiring fewer correction iterations and maintaining high computational efficiency. The results highlight the critical role of domain-specific adaptation in deploying LLM assistants for complex engineering workflows. Our code and fine-tuned model have been deposited at https://github.com/YYgroup/AutoCFD.
Large language models (LLMs) are transforming laboratory automation by enabling self-driving laboratories (SDLs) that could accelerate materials research. However, current SDL implementations rely on rigid protocols that fail to capture the adaptability and intuition of expert scientists in dynamic experimental settings. Here, we show that LLM agents can automate atomic force microscopy (AFM) through our Artificially Intelligent Lab Assistant (AILA) framework. Further, we develop AFMBench—a comprehensive evaluation suite challenging LLM agents across the complete scientific workflow from experimental design to results analysis. We find that state-of-the-art LLMs struggle with basic tasks and coordination scenarios. Notably, models excelling at materials science question-answering perform poorly in laboratory settings, showing that domain knowledge does not translate to experimental capabilities. Additionally, we observe that LLM agents can deviate from instructions, a phenomenon referred to as sleepwalking, raising safety alignment concerns for SDL applications. Our ablations reveal that multi-agent frameworks significantly outperform single-agent approaches, though both remain sensitive to minor changes in instruction formatting or prompting. Finally, we evaluate AILA’s effectiveness in increasingly advanced experiments—AFM calibration, feature detection, mechanical property measurement, graphene layer counting, and indenter detection. These findings establish the necessity for benchmarking and robust safety protocols before deploying LLM agents as autonomous laboratory assistants across scientific disciplines. LLM agents could revolutionize laboratory automation, but their capabilities remain poorly tested. Here, the authors create a framework automating atomic force microscopy with LLMs and benchmark them through an end-to-end evaluation suite, revealing major limitations and safety concerns
Large-language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: Underspecification uncertainties, for when users do not provide all information or define the exact task at the first go, interactive learning, to ask follow-up questions and reduce the uncertainty about the current context, and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.
Abstract Objective Conversational Health Agents (CHAs) are interactive systems providing healthcare services, such as assistance and diagnosis. Current CHAs, especially those utilizing Large Language Models (LLMs), primarily focus on conversation aspects. However, they offer limited agent capabilities, specifically needing more multistep problem-solving, personalized conversations, and multimodal data analysis. We aim to overcome these limitations. Materials and methods We propose openCHA, an open-source LLM-powered framework, designed to enable the development of conversational agents. OpenCHA offers a foundational and structured architecture and codebase, enabling researchers and developers to build and customize their CHA based on the specifics of their intended application. The framework leverages knowledge acquisition, problem-solving capabilities, multilingual, and multimodal conversations, and allows interaction with various AI platforms. We have released the framework as open source for the community on GitHub (https://github.com/Institute4FutureHealth/CHA and https://opencha.com). Results We demonstrated the openCHA’s capability to develop CHAs across multiple health domains using 2 demos and 5 use cases. In diabetic patient management, developed CHA achieved a 92.1% accuracy rate, surpassing GPT4’s 51.8%. In food recommendations, developed CHA outperformed GPT4. The developed CHA excelled as an evaluator for mental health chatbots, recording the lowest Mean Absolute Error at 0.31, compared to competitors like GPT, Misteral, Gemini, and Claude. Additionally, the empathy enabled CHA identified emotional states with 89% accuracy, and in physiological data analysis of heart rate from Photoplethysmography (PPG) signals, the developed CHA achieved an mean absolute error of 2.83, far lower than GPT-4o’s 8.93. Discussion The openCHA framework enhances CHAs by enabling features such as explainability, personalization, and reliability through its integration with LLMs and external data sources. The developed CHAs face challenges like latency, token limits, and scalability. Future efforts will focus on improving planning robustness, enhancing accuracy and evaluation methods, and resolving user query ambiguity to further refine the framework’s effectiveness. Conclusion The diverse demos and use cases of openCHA demonstrate the framework’s capacity to empower the development of a wide range of CHAs for various healthcare tasks.
Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.
The rapid advancement of chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents, and provides insight into their"cognitive"processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following cooperation in multi-agent settings. Our contributions include introducing a novel communicative agent framework, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond: https://github.com/camel-ai/camel.
Recent breakthroughs in large language model-driven autonomous agents have revealed that multi-agent collaboration often surpasses each individual through collective reasoning. Inspired by the neural scaling law--increasing neurons enhances performance, this study explores whether the continuous addition of collaborative agents can yield similar benefits. Technically, we utilize directed acyclic graphs to organize agents into a multi-agent collaboration network (MacNet), upon which their interactive reasoning is topologically orchestrated for autonomous task solving. Extensive evaluations reveal that it effectively supports collaboration among over a thousand agents, with irregular topologies outperforming regular ones. We also identify a collaborative scaling law--the overall performance follows a logistic growth pattern as agents scale, with collaborative emergence occurring earlier than traditional neural emergence. We speculate this may be because scaling agents catalyzes their multidimensional considerations during interactive reflection and refinement, thereby producing more comprehensive artifacts. The code is available at https://github.com/OpenBMB/ChatDev/tree/macnet.
This paper introduces a novel approach using Large Language Models (LLMs) integrated into an agent framework for flexible and effective personal mobility generation. LLMs overcome the limitations of previous models by effectively processing semantic data and offering versatility in modeling various tasks. Our approach addresses three research questions: aligning LLMs with real-world urban mobility data, developing reliable activity generation strategies, and exploring LLM applications in urban mobility. The key technical contribution is a novel LLM agent framework that accounts for individual activity patterns and motivations, including a self-consistency approach to align LLMs with real-world activity data and a retrieval-augmented strategy for interpretable activity generation. We evaluate our LLM agent framework and compare it with state-of-the-art personal mobility generation approaches, demonstrating the effectiveness of our approach and its potential applications in urban mobility. Overall, this study marks the pioneering work of designing an LLM agent framework for activity generation based on real-world human activity data, offering a promising tool for urban mobility analysis.
Existing large language model (LLM) agents for automating data science show promise, but they remain constrained by narrow task scopes, limited generalization across tasks and models, and over-reliance on state-of-the-art (SOTA) LLMs. We introduce DatawiseAgent, a notebook-centric LLM agent framework for adaptive and robust data science automation. Inspired by how human data scientists work in computational notebooks, DatawiseAgent introduces a unified interaction representation and a multi-stage architecture based on finite-state transducers (FSTs). This design enables flexible long-horizon planning, progressive solution development, and robust recovery from execution failures. Extensive experiments across diverse data science scenarios and models show that DatawiseAgent consistently achieves SOTA performance by surpassing strong baselines such as AutoGen and TaskWeaver, demonstrating superior effectiveness and adaptability. Further evaluations reveal graceful performance degradation under weaker or smaller models, underscoring the robustness and scalability.
Large language models (LLMs) excel at rapid generation of text and multimodal content, yet they falter on transaction-style planning that demands ACID-like guarantees and real-time disruption recovery. We present Adaptive LLM Agent System (ALAS), a framework that tackles four fundamental LLM deficits: (i) absence of self-verification, (ii) context erosion, (iii) next-token myopia, and (iv) lack of persistent state. ALAS decomposes each plan into role-specialized agents, equips them with automatic state tracking, and coordinates them through a lightweight protocol. When disruptions arise, agents apply history-aware local compensation, avoiding costly global replanning and containing cascade effects. On real-world, large-scale job-shop scheduling benchmarks, ALAS sets new best results for static sequential planning and excels in dynamic reactive scenarios with unexpected disruptions. These gains show that principled modularization plus targeted compensation can unlock scalable and resilient planning with LLMs.
With the proliferation of Large Language Models (LLMs), the detection of misinformation has become increasingly important and complex. This research proposes an innovative verifiable misinformation detection LLM agent that goes beyond traditional true/false binary judgments. The agent actively verifies claims through dynamic interaction with diverse web sources, assesses information source credibility, synthesizes evidence, and provides a complete verifiable reasoning process. Our designed agent architecture includes three core tools: precise web search tool, source credibility assessment tool and numerical claim verification tool. These tools enable the agent to execute multi-step verification strategies, maintain evidence logs, and form comprehensive assessment conclusions. We evaluate using standard misinformation datasets such as FakeNewsNet, comparing with traditional machine learning models and LLMs. Evaluation metrics include standard classification metrics, quality assessment of reasoning processes, and robustness testing against rewritten content. Experimental results show that our agent outperforms baseline methods in misinformation detection accuracy, reasoning transparency, and resistance to information rewriting, providing a new paradigm for trustworthy AI-assisted fact-checking.
Usability testing is a fundamental yet challenging research method for user experience (UX) researchers to evaluate a web design. Recent advances in Large Language Model-simulated Agent (LLM Agent) research inspired us to design UXAgent to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human-subject study. Our system features an LLM Agent module and a universal browser connector module so that UX researchers can automatically generate thousands of simulated users to test the target website. The system can generate UX study results in qualitative (e.g., interviewing how an agent thinks), quantitative (e.g., # of actions), and video recording formats for UX researchers to analyze. Through a heuristic user evaluation with five UX researchers, participants praised the innovation of our system but also expressed concerns about the future of UX study with LLM Agents1.
Requirements elicitation, a critical, yet time-consuming and challenging step in product development, often fails to capture the full spectrum of user needs. This may lead to products that fall short of expectations. This paper introduces a novel framework that leverages Large Language Models (LLMs) to automate and enhance the requirements elicitation process. LLMs are used to generate a vast array of simulated users (LLM agents), enabling the exploration of a much broader range of user needs and unforeseen use cases. These agents engage in product experience scenarios, through explaining their actions, observations, and challenges. Subsequent agent interviews and analysis uncover valuable user needs, including latent ones. We validate our framework with three experiments. First, we explore different methodologies for the challenge of diverse agent generation, discussing their advantages and shortcomings. We measure the diversity of identified user needs and demonstrate that context-aware agent generation leads to greater diversity. Second, we show how our framework effectively mimics empathic lead user interviews, identifying a greater number of latent needs than conventional human interviews. Third, we showcase that LLMs can be used to analyze interviews, capture needs and classify them as latent or not. Our work highlights the potential of using LLMs to accelerate early-stage product development, reduce costs, and increase innovation.
Despite recent progress in generating hardware register transfer level (RTL) code with large language models (LLMs), existing solutions still suffer from a substantial gap between practical application scenarios and the requirements of real-world RTL code development. Prior approaches either focus on overly simplified hardware descriptions or depend on extensive human guidance to process complex specifications, limiting their scalability and automation potential. In this paper, we address this gap by proposing an LLM agent system, termed Spec2RTL-Agent, designed to directly process complex specification documentation and generate corresponding RTL code implementations, advancing LLM-based RTL code generation toward more realistic application settings. To achieve this goal, Spec2RTL-Agent introduces a novel multi-agent collaboration framework that integrates three key enablers: (1) a reasoning and understanding module that translates specifications into structured, step-by-step implementation plans; (2) a progressive coding and prompt optimization module that iteratively refines the code across multiple representations (pseudocode, Python, and C++) to enhance correctness and synthesisability for RTL conversion; and (3) an adaptive reflection module that identifies and traces the source of errors during generation, ensuring a more robust code generation flow. Instead of directly generating RTL from natural language, our system strategically generates synthesizable C++ code, which is then optimized for high-level synthesis (HLS). This agent-driven refinement ensures greater correctness and compatibility compared to naive direct RTL generation approaches. We evaluate Spec2RTL-Agent on a benchmark of three specification documents, demonstrating its effectiveness in generating accurate RTL code with as much as 75% fewer human interventions compared to existing approaches. These results underscore Spec2RTL-Agent’s role as the first fully automated multi-agent system for RTL generation from unstructured specification documents, reducing the reliance on human effort and expertise in hardware design.
Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle to address long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. We finally leverage these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.
Intelligent Tutoring Systems (ITSs) have revolutionized education by offering personalized learning experiences. However, as goal-oriented learning, which emphasizes efficiently achieving specific objectives, becomes increasingly important in professional contexts, existing ITSs often struggle to deliver this type of targeted learning experience. In this paper, we propose GenMentor, an LLM-powered multi-agent framework designed to deliver goal-oriented, personalized learning within ITS. GenMentor begins by accurately mapping learners' goals to required skills using a fine-tuned LLM trained on a custom goal-to-skill dataset. After identifying the skill gap, it schedules an efficient learning path using an evolving optimization approach, driven by a comprehensive and dynamic profile of learners' multifaceted status. Additionally, GenMentor tailors learning content with an exploration-drafting-integration mechanism to align with individual learner needs. Extensive automated and human evaluations demonstrate GenMentor's effectiveness in learning guidance and content quality. Furthermore, we have deployed it in practice and also implemented it as an application. Practical human study with professional learners further highlights its effectiveness in goal alignment and resource targeting, leading to enhanced personalization. Supplementary resources are available at ttps://github.com/GeminiLight/gen-mentor.
Large language models (LLMs) fine-tuned on multimodal financial data have demonstrated impressive reasoning capabilities in various financial tasks. However, they often struggle with multi-step, goal-oriented scenarios in interactive financial markets, such as trading, where complex agentic approaches are required to improve decision-making. To address this, we propose \textsc{FLAG-Trader}, a unified architecture integrating linguistic processing (via LLMs) with gradient-driven reinforcement learning (RL) policy optimization, in which a partially fine-tuned LLM acts as the policy network, leveraging pre-trained knowledge while adapting to the financial domain through parameter-efficient fine-tuning. Through policy gradient optimization driven by trading rewards, our framework not only enhances LLM performance in trading but also improves results on other financial-domain tasks. We present extensive empirical evidence to validate these enhancements.
Recent advancements in Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks. However, current research largely relies on manual, task-specific design, limiting their adaptability to novel tasks. In this paper, we introduce a new research problem: Modularized LLM Agent Search (MoLAS). We propose a modular design space that abstracts existing LLM agent designs into four fundamental modules with uniform IO interface: Planning, Reasoning, Tool Use, and Memory. Building on this design space, we present a novel LLM agent search framework called AgentSquare, which introduces two core mechanisms, i.e., module evolution and recombination, to efficiently search for optimized LLM agents. To further accelerate the process, we design a performance predictor that uses in-context surrogate models to skip unpromising agent designs. Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. Moreover, AgentSquare can generate interpretable design insights, enabling a deeper understanding of agentic architecture and its impact on task performance. We believe that the modular design space and AgentSquare search framework offer a platform for fully exploiting the potential of prior successful designs and consolidating the collective efforts of research community. Code repo is available at https://github.com/tsinghua-fib-lab/AgentSquare.
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete various tasks in a self-directed fashion. The challenge of tool use demands that LLMs not only understand user queries and generate answers accurately but also excel in task planning, tool invocation, and result summarization. While traditional works focus on training a single LLM with all these capabilities, performance limitations become apparent, particularly with smaller models. To overcome these challenges, we propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer. Each component is implemented by a single LLM that focuses on a specific capability and collaborates with others to accomplish the task. This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability. To effectively train this framework, we introduce a two-stage training paradigm. First, we fine-tune a backbone LLM on the entire dataset without discriminating sub-tasks, providing the model with a comprehensive understanding of the task. Second, the fine-tuned LLM is used to instantiate the planner, caller, and summarizer respectively, which are continually fine-tuned on respective sub-tasks. Evaluation across various tool-use benchmarks illustrates that our proposed multi-LLM framework surpasses the traditional single-LLM approach, highlighting its efficacy and advantages in tool learning.
In software development, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing code. Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving Github issues, particularly at the repository level. To overcome this challenge, we empirically study the reason why LLMs fail to resolve GitHub issues and analyze the major factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer agents. This framework leverages the collaboration of various agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% GitHub issues, significantly outperforming the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the advanced LLM.
Phishing websites remain a major cybersecurity threat, exploiting deceptive structures, brand impersonation, and social engineering to evade detection. Recent advances in large language models (LLMs) have improved phishing detection through contextual understanding, yet most existing approaches rely on single-agent classification, which is prone to hallucination and often lacks interpretability and robustness. To address these limitations, we propose PhishDebate, a modular multi-agent LLM-based debate framework for phishing website detection. Four specialized agents independently analyze webpage aspects, including URL structure, HTML composition, semantic content, and brand impersonation, under the coordination of a Moderator and final Judge. Through structured debate and divergent reasoning, the framework achieves more accurate and interpretable decisions. By reducing uncertain predictions and providing transparent reasoning, PhishDebate functions as an analyst-augmentation system that lowers cognitive load and supports early, left-of-exploit detection of phishing threats. Evaluations on commercial LLMs show that PhishDebate achieves 98.2 % recall on a real-world phishing dataset and outperforms single-agent and Chain-of-Thought (CoT) baselines. Its modular design enables agent-level configurability, allowing adaptation to varying resource and application requirements, and offers scalability to high-velocity, large-scale security data environments.
This work leverages Large Language Models (LLMs) to simulate human mobility, addressing challenges like high costs and privacy concerns in traditional models. Our hierarchical framework integrates persona generation, activity selection, and destination prediction, using real-world demographic and psychological data to create realistic movement patterns. Both physical models and language models are employed to explore and demonstrate different methodologies for human mobility simulation. By structuring data with summarization and weighted density metrics, the system ensures scalable memory management while retaining actionable insights. Preliminary results indicate that LLM-driven simulations align with observed real-world patterns, offering scalable, interpretable insights for social problems such as urban planning, traffic management, and public health. The framework's ability to dynamically generate personas and activities enables it to provide adaptable and realistic daily routines. This study demonstrates the transformative potential of LLMs in advancing mobility modeling for societal and urban applications. The source code and interactive demo for our framework are available at https://github.com/cju0/TrajLLM.
Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the **I**terative step-level **P**rocess **R**efinement **(IPR)** framework, which provides detailed step-by-step guidance to enhance agent training. Specifically, we adopt the Monte Carlo method to estimate step-level rewards. During each iteration, the agent explores along the expert trajectory and generates new actions. These actions are then evaluated against the corresponding step of expert trajectory using step-level rewards. Such comparison helps identify discrepancies, yielding contrastive action pairs that serve as training data for the agent. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines. Moreover, our analytical finds highlight the effectiveness of IPR in augmenting action efficiency and its applicability to diverse models.
Recently, Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience and have attracted considerable attention. Despite the impressive progress, the research question regarding the safety vulnerability of LLM-empowered RecSys still remains largely under-investigated. Given the security and privacy concerns, it is more practical to focus on attacking the black-box RecSys, where attackers can only observe the system's inputs and outputs. However, traditional attack approaches employing reinforcement learning (RL) agents are not effective for attacking LLM-empowered RecSys due to the limited capabilities in processing complex textual inputs, planning, and reasoning. On the other hand, LLMs provide unprecedented opportunities to serve as attack agents to attack RecSys because of their impressive capability in simulating human-like decision-making processes. Therefore, in this paper, we propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs, where an LLM-based agent is developed to attack LLM-Empowered RecSys. Specifically, our method first identifies the insertion position for maximum impact with minimal input modification. After that, the LLM agent is designed to generate adversarial perturbations to insert at target positions. To further improve the quality of generated perturbations, we utilize the prompt tuning technique to improve attacking strategies via feedback from the victim RecSys iteratively. Extensive experiments across three real-world datasets demonstrate the effectiveness of our proposed attacking method.
Visual analytics (VA) requires analysts to iteratively propose analysis tasks based on observations and execute tasks by creating visualizations and interactive exploration to gain insights. This process demands skills in programming, data processing, and visualization tools, highlighting the need for a more intelligent, streamlined VA approach. Large language models (LLMs) have recently been developed as agents to handle various tasks with dynamic planning and tool-using capabilities, offering the potential to enhance the efficiency and versatility of VA. We propose LightVA, a lightweight VA framework that supports task decomposition, data analysis, and interactive exploration through human-agent collaboration. Our method is designed to help users progressively translate high-level analytical goals into low-level tasks, producing visualizations and deriving insights. Specifically, we introduce an LLM agent-based task planning and execution strategy, employing a recursive process involving a planner, executor, and controller. The planner is responsible for recommending and decomposing tasks, the executor handles task execution, including data analysis, visualization generation and multi-view composition, and the controller coordinates the interaction between the planner and executor. Building on the framework, we develop a system with a hybrid user interface that includes a task flow diagram for monitoring and managing the task planning process, a visualization panel for interactive data exploration, and a chat view for guiding the model through natural language instructions. We examine the effectiveness of our method through a usage scenario and an expert study.
Many people struggle with learning a new language when moving to a new country, with traditional tools falling short in providing contextualized learning tailored to each learner’s needs. The recent development of large language models (LLMs) and embodied conversational agents (ECAs) in social virtual reality (VR) provides new opportunities to practice language learning in a contextualized and naturalistic way that takes into account the learner’s language level and needs. To explore this opportunity, we developed ELLMA-T, a design probe that integrates an LLM (GPT-4) with an ECA for English language learning in social VR (VRChat), informed by the situated learning framework. We conducted a feasibility study to explore the potential and challenges of LLM-based ECAs for language learning in social VR. Drawing on qualitative interviews (N=12), we reveal the potential of ELLMA-T to generate realistic, believable, and context-specific role plays for agent-learner interaction in VR, and LLM’s capability to provide initial language assessment and continuous feedback to learners. We provide four design implications for the future development of LLM-based language agents in social VR.
The rapid advancement of Large Language Models (LLMs) has led to substantial investment in enhancing their capabilities and expanding their feature sets. Despite these developments, a critical gap remains between model sophistication and their dependable deployment in real-world applications. A key concern is the inconsistency of LLM-generated outputs in production environments, which hinders scalability and reliability. In response to these challenges, we propose a novel framework that integrates custom-defined, rule-based logic to constrain and guide LLM behavior effectively. This framework enforces deterministic response boundaries while considering the model’s reasoning capabilities. Furthermore, we introduce a quantitative performance scoring mechanism that achieves an 85.5% improvement in response consistency, facilitating more predictable and accountable model outputs. The proposed system is industry-agnostic and can be generalized to any domain with a well-defined validation schema. This work contributes to the growing research on aligning LLMs with structured, operational constraints to ensure safe, robust, and scalable deployment.
This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence that dynamically extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances testing LLMs against diverse queries, data noise and probing their problem-solving sub-abilities. With this framework, we extend benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs against their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more accurately reflect models' capabilities. Besides, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks (Code and data are available at https://github.com/NanshineLoong/Self-Evolving-Benchmark).
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline, such as optimal model search and hyperparameter tuning. Existing AutoML systems often require technical expertise to set up complex tools, which is in general time-consuming and requires a large amount of human effort. Therefore, recent works have started exploiting large language models (LLM) to lessen such burden and increase the usability of AutoML frameworks via a natural language interface, allowing non-expert users to build their data-driven solutions. These methods, however, are usually designed only for a particular process in the AI development pipeline and do not efficiently use the inherent capacity of the LLMs. This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML, i.e., from data retrieval to model deployment. AutoML-Agent takes user's task descriptions, facilitates collaboration between specialized LLM agents, and delivers deployment-ready models. Unlike existing work, instead of devising a single plan, we introduce a retrieval-augmented planning strategy to enhance exploration to search for more optimal plans. We also decompose each plan into sub-tasks (e.g., data preprocessing and neural network design) each of which is solved by a specialized agent we build via prompting executing in parallel, making the search process more efficient. Moreover, we propose a multi-stage verification to verify executed results and guide the code generation LLM in implementing successful solutions. Extensive experiments on seven downstream tasks using fourteen datasets show that AutoML-Agent achieves a higher success rate in automating the full AutoML process, yielding systems with good performance throughout the diverse domains.
Recent advancements in medical Large Language Models (LLMs) have showcased their powerful reasoning and diagnostic capabilities. Despite their success, current unified multimodal medical LLMs face limitations in knowledge update costs, comprehensiveness, and flexibility. To address these challenges, we introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM). Inspired by our empirical findings highlighting the benefits of role assignment and diagnostic discernment in LLMs, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director, each embodied by an LLM-based agent. This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases. Extensive experimental evaluations conducted on a wide range of publicly accessible multimodal medical datasets, incorporating text, image, audio, and video modalities, demonstrate that MAM consistently surpasses the performance of modality-specific LLMs. Notably, MAM achieves significant performance improvements ranging from 18% to 365% compared to baseline models. Our code is released at https://github.com/yczhou001/MAM.
In general, educational support with Large Language Models (LLMs) faces challenges in knowledge organization, expertise integration, and contextual adaptation. So, we present EduMAS, a novel multi-agent framework that coordinates specialized agents with graph-based knowledge navigation. Our framework introduces three key innovations: (1) Specialized Agents that provide expertise in different learning aspects to solve decomposed subtasks professionally; (2) Graph Navigator for graph-based knowledge extraction and selection to improve the quality of responses; (3) The Emotional Awareness mechanism for better contextual adaptation. Through comprehensive experiments on college-level physics education and evaluated by six state-of-the-art LLMs, EduMAS demonstrates significant improvements over the baseline model in complex concept integration, cross-disciplinary understanding, and theory-to-application translation. Ablation studies further validate the contribution of each framework component, Specialized Agents and Graph Navigator play important roles in performance improvement. Our work provides strong support for LLM-powered multi-agent system in AI-assisted education.
Recent advancements have highlighted that Large Language Models (LLMs) are prone to hallucinations when solving complex reasoning problems, leading to erroneous results. To tackle this issue, researchers incorporate Knowledge Graphs (KGs) to improve the reasoning ability of LLMs. However, existing methods face two limitations: 1) they typically assume that all answers to the questions are contained in KGs, neglecting the incompleteness issue of KGs, and 2) they treat the KG as a static repository and overlook the implicit logical reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an innovative neural-symbolic agent framework that achieves collaborative augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments and transform complex reasoning tasks into a multi-step interactive process, enabling KGs to participate deeply in the reasoning process. SymAgent consists of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages LLM's inductive reasoning capability to extract symbolic rules from KGs, guiding efficient question decomposition. The Agent-Executor autonomously invokes predefined action tools to integrate information from KGs and external documents, addressing the issues of KG incompleteness. Furthermore, we design a self-learning framework comprising online exploration and offline iterative policy updating phases, enabling the agent to automatically synthesize reasoning trajectories and improve performance. Experimental results demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields better or comparable performance compared to various strong baselines. Further analysis reveals that our agent can identify missing triples, facilitating automatic KG updates.
Tactics play an important role in team sports by guiding how players interact on the field. Both sports fans and experts have a demand for analyzing sports tactics. Existing approaches allow users to visually perceive the multivariate tactical effects. However, these approaches require users to experience a complex reasoning process to connect the multiple interactions within each tactic to the final tactical effect. In this work, we collaborate with basketball experts and propose a progressive approach to help users gain a deeper understanding of how each tactic works and customize tactics on demand. Users can progressively sketch on a tactic board, and a coach agent will simulate the possible actions in each step and present the simulation to users with facet visualizations. We develop an extensible framework that integrates large language models (LLMs) and visualizations to help users communicate with the coach agent with multimodal inputs. Based on the framework, we design and develop Smartboard, an agent-based interactive visualization system for fine-grained tactical analysis, especially for play design. Smartboard provides users with a structured process of setup, simulation, and evolution, allowing for iterative exploration of tactics based on specific personalized scenarios. We conduct case studies based on real-world basketball datasets to demonstrate the effectiveness and usefulness of our system.
Recent progress with LLM-based agents has shown promising results across various tasks. However, their use in answering questions from knowledge bases remains largely unexplored. Implementing a KBQA system using traditional methods is challenging due to the shortage of task-specific training data and the complexity of creating task-focused model structures. In this paper, we present Triad, a unified framework that utilizes an LLM-based agent with multiple roles for KBQA tasks. The agent is assigned three roles to tackle different KBQA subtasks: agent as a generalist for mastering various subtasks, as a decision maker for the selection of candidates, and as an advisor for answering questions with knowledge. Our KBQA framework is executed in four phases, involving the collaboration of the agent’s multiple roles. We evaluated the performance of our framework using three benchmark datasets, and the results show that our framework outperforms state-of-the-art systems on the LC-QuAD and YAGO-QA benchmarks, yielding F1 scores of 11.8% and 20.7%, respectively.
Leveraging advanced reasoning capabilities and extensive world knowledge of large language models (LLMs) to construct generative agents for solving complex real-world problems is a major trend. However, LLMs inherently lack embodiment as humans, resulting in suboptimal performance in many embodied decision-making tasks. In this paper, we introduce a framework for building human-like generative driving agents using post-driving self-report driving-thinking data from human drivers as both demonstration and feedback. To capture high-quality, natural language data from drivers, we conducted urban driving experiments, recording drivers’ verbalized thoughts under various conditions to serve as chain-of-thought prompts and demonstration examples for the LLM-Agent. The framework’s effectiveness was evaluated through simulations and human assessments. Results indicate that incorporating expert demonstration data significantly reduced collision rates by 81.04% and increased human likeness by 50% compared to a baseline LLM-based agent. Our study provides insights into using natural language-based human demonstration data for embodied tasks. The driving-thinking dataset is available at https://github.com/AIR-DISCOVER/Driving-Thinking-Dataset.
Abstract This paper explores the use of Large Language Models (LLMs) in modeling real-world optimization problems. We concretely define the task of translating natural language descriptions into optimization models (NL2OPT) and provide criteria for classifying optimization problems for the NL2OPT task. Our novel multi-agent modeling framework leverages relations identifier agents and a multi-agent verification mechanism, eliminating the need for solver execution. Additionally, we introduce a straightforward and practical evaluation framework, offering a more effective assessment method compared to traditional execution-based evaluations. We have created a unique dataset tailored for optimization modeling, featuring Problem Specifications as a structured representation of optimization problems. Through comprehensive experiments, our study compares our modeling framework with existing LLM reasoning strategies, highlighting their relative effectiveness in optimization modeling tasks. We also perform ablation studies to explore the effect of different components of our modeling framework. Experimental results demonstrate that our multi-agent framework outperforms many common LLM prompting strategies.
Large Language Model-based Multi-Agent Systems (LLM-MAS) have revolutionized complex problem-solving capability by enabling sophisticated agent collaboration through message-based communications. While the communication framework is crucial for agent coordination, it also introduces a critical yet unexplored security vulnerability. In this work, we introduce Agent-in-the-Middle (AiTM), a novel attack that exploits the fundamental communication mechanisms in LLM-MAS by intercepting and manipulating inter-agent messages. Unlike existing attacks that compromise individual agents, AiTM demonstrates how an adversary can compromise entire multi-agent systems by only manipulating the messages passing between agents. To enable the attack under the challenges of limited control and role-restricted communication format, we develop an LLM-powered adversarial agent with a reflection mechanism that generates contextually-aware malicious instructions. Our comprehensive evaluation across various frameworks, communication structures, and real-world applications demonstrates that LLM-MAS is vulnerable to communication-based attacks, highlighting the need for robust security measures in multi-agent systems.
This position paper presents A4FN, an Agentic Artificial Intelligence (AI) architecture for intent-driven automation in Flying Networks (FNs) using Unmanned Aerial Vehicles (UAVs) as access nodes. A4FN leverages Generative AI and Large Language Models (LLMs) to enable real-time, context-aware network control via a distributed agentic system. It comprises two components: the Perception Agent (PA), which semantically interprets multimodal input – including imagery, audio, and telemetry data – from UAV-mounted sensors to derive Service Level Specifications (SLSs); and the Decision-and-Action Agent (DAA), which reconfigures the network based on inferred intents. A4FN embodies key properties of Agentic AI, including autonomy, goal-driven reasoning, and continuous perception-action cycles. Designed for mission-critical, infrastructure-limited scenarios such as disaster response, it supports adaptive reconfiguration, dynamic resource management, and interoperability with emerging wireless technologies. The paper details the A4FN architecture, its core innovations, and open research challenges in multi-agent coordination and Agentic AI integration in next-generation FNs.
The application of agentic AI systems in autonomous decision-making is growing in the areas of healthcare, smart cities, digital forensics, and supply chain management. Even though these systems are flexible and offer real-time reasoning, they also raise concerns of trust and oversight, and integrity of the information and activities upon which they are founded. The paper suggests a single architecture model comprising of LangChain-based multi-agent system with a permissioned blockchain to guarantee constant monitoring, policy enforcement, and immutable auditability of agentic action. The framework relates the perception conceptualization-action cycle to a blockchain layer of governance that verifies the inputs, evaluates recommended actions, and documents the outcomes of the execution. A Hyperledger Fabric-based system, action executors MCP-integrated, and LangChain agent are introduced and experiments of smart inventory management, traffic-signal control, and healthcare monitoring are done. The results suggest that blockchain-security verification is efficient in preventing unauthorized practices, offers traceability throughout the whole decision-making process, and maintains operational latency within reasonable ranges. The suggested framework provides a universal system of implementing high-impact agentic AI applications that are autonomous yet responsible.
. This article examines the architecture of agentic artificial intelligence systems, with particular focus on ensuring deterministic outputs and observability as key factors of system reliability. Unlike traditional AI applications that perform isolated inference tasks, agentic systems operate autonomously over extended periods, making sequences of decisions and invoking external tools. Modern approaches to designing agent architectures based on large language models (LLMs) are reviewed, the problems of output non-determinism are analyzed, and architectural solutions for overcoming them are proposed. It is established that the stochastic nature of text generation by large language models creates a fundamental problem of result reproducibility, which becomes critically important when agentic systems are used to automate business processes in finance, healthcare, and legal domains. Special attention is given to the role of structured logging, request tracing, and output validation in ensuring transparency and controllability of agentic systems. Four key pillars of agentic system observability are analyzed: structured event logging, distributed request tracing, performance metrics collection, and audit trail maintenance. An architectural model is proposed that integrates a Deterministic Output Validator (DOV) and a multi-level observability system as mandatory components. The validator implements schema validation, semantic consistency checks, idempotency enforcement, and business rule compliance verification. It is argued that the implementation of these components is a necessary condition for the industrial deployment of agentic systems in mission-critical domains, and that the feedback loop between validation and observability provides a continuous improvement cycle that enables progressive enhancement of system reliability.
No abstract available
No abstract available
No abstract available
No abstract available
As generative AI (GenAI) agents become more common in enterprise settings, they introduce security challenges that differ significantly from those posed by traditional systems. These agents aren’t just LLMs—they reason, remember, and act, often with minimal human oversight. This paper introduces a comprehensive threat model tailored specifically for GenAI agents, focusing on how their autonomy, persistent memory access, complex reasoning, and tool integration create novel risks. This research work identifies 9 primary threats and organizes them across five key domains: cognitive architecture vulnerabilities, temporal persistence threats, operational execution vulnerabilities, trust boundary violations, and governance circumvention. These threats aren’t just theoretical—they bring practical challenges such as delayed exploitability, cross-system propagation, cross system lateral movement, and subtle goal misalignments that are hard to detect with existing frameworks and standard approaches. To help address this, the research work present two complementary frameworks: ATFAA (Advanced Threat Framework for Autonomous AI Agents), which organizes agent-specific risks, and SHIELD, a framework proposing practical mitigation strategies designed to reduce enterprise exposure. While this work builds on existing work in LLM and AI security, the focus is squarely on what makes agents different—and why those differences matter. Ultimately, this research argues that GenAI agents require a new lens for security. If we fail to adapt our threat models and defenses to account for their unique architecture and behavior, we risk turning a powerful new tool into a serious enterprise liability.
System reliability and operational resilience are two critical success factors in the retail industry that are directly connected to customer satisfaction and business sustainability. Staying competitive in today’s dynamic and rapidly evolving market requires rapid adaptability. However, it contradicts the reliability and resilience. This paper proposes an innovative solution, the Retail Resilience Engine (RRE), to establish a balance between these success factors and market demand. It is a unique framework that combines Test-Driven Development (TDD) with a Large Language Model (LLM). This framework follows the state-of-the-art Agentic-AI architecture. It effectively evaluates the decision-making process at rapid speed in retail by incorporating diverse factors, including inventory management, demand forecasting, and customer feedback. As a result, the system reliability is improved significantly. The experimental analysis of the proposed framework shows its decision-making is similar to human experts with a similarity index of 97.5%. It further proves the reliability of the system. The framework also scales effectively, maintaining high accuracy, precision, recall, and F1 scores across varying dataset sizes. The robustness analysis of the system demonstrates the agility enhancement across diverse retail domains, ensuring consistent performance with accuracy exceeding 90% across all tested scenarios. The integration of a creative filtering mechanism further enhances the performance of the RRE framework by preventing 98.2% of the irrelevant inputs. Overall, the proposed RRE framework demonstrates the impressive potential to transform retail systems by enhancing reliability, scalability, and decision-making quality through an Agentic-AI approach.
This paper introduces a novel framework that integrates agentic Artificial Intelligence (AI) with Intent-Based Networks (IBN) to enable autonomous management, configuration, and optimization of mobile network services and resources. Leveraging the advanced reasoning and natural language processing capabilities of an Large Language Model (LLM), the proposed architecture translates high-level user intents into precise network actions, facilitating user-friendly and scalable network orchestration. The framework employs a distributed multi-agent system, where specialized agents collaborate to decompose user intents, provide computational infrastructure, and deploy services using industry-standard Infrastructure-as-Code (IaC) tools. By supporting natural language interactions, the system reduces operational complexity and enhances accessibility for users with varying technical expertise. Experimental evaluations demonstrate significant improvements in task completion rates, response accuracy, and operational efficiency compared to traditional manual methods, particularly for complex network management tasks. In essence, this work creates an intelligent network orchestration framework that adapts to user needs by automatically configuring network and computing resources while operating with minimal human intervention.
Cloud-native organizations increasingly rely on microservices for backend modularity and micro-frontends for scalable user interface delivery. Yet, real-world systems still struggle to evolve these layers coherently under high release velocity, shifting product goals, and variable workloads. This paper presents a unified Agentic AI framework that autonomously coordinates the co-evolution of micro-frontend UIs (implemented in ReactJS and Angular) and microservices. The proposed architecture integrates reinforcement learning for continuous control, large language models for code and configuration synthesis, and a policy-governed multi-agent control plane that executes progressive delivery (feature flags, canary, blue-green) via Kubernetes and service meshes. We formalize decisions using Markov Decision Processes, propose drift detection models for UI-API compatibility, and formulate traffic-shifting optimization for safe rollouts. A mini empirical study across e-commerce, SaaS analytics, and multi-cloud migration scenarios demonstrates reductions in adaptation latency, error rates, and manual intervention relative to strong DevOps baselines. We discuss reliability, explainability, and governance challenges, and lay out future research on hybrid RL-LLM agents, knowledge-graph-aware planning, digital twins, and compliance-aware rewards.
No abstract available
While microservices are revolutionizing cloud computing by offering unparalleled scalability and independent deployment, their decentralized nature poses significant security and management challenges that can threaten system stability. We propose a framework based on MAPE-K, which leverages agentic AI, for autonomous anomaly detection and remediation to address the daunting task of highly distributed system management. Our framework offers practical, industry-ready solutions for maintaining robust and secure microservices. Practitioners and researchers can customize the framework to enhance system stability, reduce downtime, and monitor broader system quality attributes such as system performance level, resilience, security, and anomaly management, among others.
No abstract available
No abstract available
The integration of Agentic AI into SAP Finance represents a transformative advancement in enterprise financial management, combining autonomous decision-making capabilities with sophisticated data analytics to revolutionize traditional financial processes. This comprehensive article explores how Agentic AI is reshaping SAP Finance through enhanced automation of routine financial tasks, deployment of advanced predictive analytics for forecasting and risk assessment, and the provision of real-time financial intelligence that enables dynamic decision-making. By examining the technical architecture, implementation strategies, and organizational impacts, this article demonstrates how Agentic AI empowers finance professionals to transcend operational constraints and focus on strategic initiatives while simultaneously improving accuracy, compliance, and responsiveness in financial operations across the enterprise landscape.
Agentic AI networking (AgentNet) is a novel AI-native networking paradigm that relies on a large number of specialized AI agents to collaborate and coordinate for autonomous decision-making, dynamic environmental adaptation, and complex goal achievement. It has the potential to facilitate real-time network management alongside capabilities for self-configuration, self-optimization, and self-adaptation across diverse and complex networking environments, laying the foundation for fully autonomous networking systems in the future. Despite its promise, AgentNet is still in the early stage of development, and there still lacks an effective networking framework to support automatic goal discovery and multi-agent self-orchestration and task assignment. This paper proposes SANNet, a novel semantic-aware agentic AI networking architecture that can infer the semantic goal of the user and automatically assign agents associated with different layers of a mobile system to fulfill the inferred goal. Motivated by the fact that one of the major challenges in AgentNet is that different agents may have different and even conflicting objectives when collaborating for certain goals, we introduce a dynamic weighting-based conflict-resolving mechanism to address this issue. We prove that SANNet can provide theoretical guarantee in both conflict-resolving and model generalization performance for multi-agent collaboration in dynamic environment. We develop a hardware prototype of SANNet based on the open RAN and 5GS core platform. Our experimental results show that SANNet can significantly improve the performance of multi-agent networking systems, even when agents with conflicting objectives are selected to collaborate for the same goal.
Extracting meaningful information from unstructured medical data is a major challenge in healthcare analytics, requiring substantial time, computational resources, and specialized expertise. Agentic AI introduces new opportunities to automate and streamline these workflows. We present a multiagent architecture that democratizes medical data analysis for data scientists, medical researchers, and healthcare practitioners. The system enables users to: (i) obtain comprehensive insights through automated dataset analysis; (ii) automatically integrate additional knowledge from supporting files; and (iii) develop predictive models without extensive machine learning expertise. The architecture contains six specialized agents: (i) Type Identification Agent, which classifies data (structured/unstructured) and performs privacy-preserving anonymization; (ii) Feature Identification Agent, which extracts dataset features; (iii) Feature Enrichment Agent, which generates contextually relevant keyword vocabularies for each feature based on user intent; (iv) Additional File Integration Agent, which uses semantic and keyword-based extraction to incorporate supplementary information from PDF, Excel, and CSV files; (v) Input-Output Optimization Agent, which determines ideal input and output features for machine learning based on user intent; and (vi) Modeling Advisory Agent, which recommends suitable predictive models. We evaluate the system across multiple medical data modalities. For healthcare providers, research institutions, and health-tech companies, this workflow enables faster decisionmaking, reduced data-processing costs, improved regulatory compliance, and the ability to transform raw medical data into actionable insights.
The issue of limited household budgets and nutritional demands continues to be a challenge especially in the middle-income environment where food prices fluctuate. This paper introduces a price aware agentic AI system, which combines personal finance management with diet optimization. With household income and fixed expenditures, medical and well-being status, as well as real-time food costs, the system creates nutritionally sufficient meals plans at comparatively reasonable prices that automatically adjust to market changes. The framework is implemented in a modular multi-agent architecture, which has specific agents (budgeting, nutrition, price monitoring, and health personalization). These agents share the knowledge base and use the substitution graph to ensure that the nutritional quality is maintained at a minimum cost. Simulations with a representative Saudi household case study show a steady 12-18% reduction in costs relative to a static weekly menu, nutrient adequacy of over 95% and high performance with price changes of ±20-30%. The findings indicate that the framework can locally combine affordability with nutritional adequacy and provide a viable avenue of capacity-building towards sustainable and fair diet planning in line with Sustainable Development Goals on Zero Hunger and Good Health.
Abstract Security Operations Centers (SOCs) face significant challenges due to the large volume, diversity, and dynamics of incident events. Alarm fatigue, delayed initiation of response, and the high share of false positives or missed threats limit team effectiveness and increase organizational risk. This study presents a methodology for automated management of key performance indicators (KPIs) in an SOC environment through an Agentic AI architecture and machine learning. Within the project, 214 CSV files were processed, comprising over 8.6 million data rows extracted from SIEM, Incident Management, Task Tracking, and CRM systems. Sixteen specific indicators were used, grouped into four categories: detection and filtering (TTD, FNR, FPR), response and resolution (TTR, IRR, SIHR), recovery and operations (MTTR, OE), satisfaction and risk management (CSR, SIER). The system includes ten specialized Agentic AI agents with clearly defined roles ‒ monitoring time parameters, predicting false alarm probabilities, automatically triggering playbooks, calculating operational metrics, and analyzing customer satisfaction. Five machine learning models were trained: two XGBoost classifiers for FPR and FNR, two LightGBM regressors for TTR and MTTR, and a BERT model for textual feedback analysis. The results demonstrate reduced detection and response times, a lower rate of false alarms, and improved operational predictability in calculating KPI values. The methodology shows the applicability of Agentic AI for optimizing SOC processes on real and public data, without the need for manual intervention in most processing phases.
Cloud data pipelines increasingly operate under dynamic workloads, evolving schemas, cost constraints, and strict governance requirements. Despite advances in cloud-native orchestration frameworks, most production pipelines rely on static configurations and reactive operational practices, resulting in prolonged recovery times, inefficient resource utilization, and high manual overhead. This paper presents Agentic Cloud Data Engineering, a policy-aware control architecture that integrates bounded AI agents into the governance and control plane of cloud data pipelines. In Agentic Cloud Data Engineering platform, specialized agents analyze pipeline telemetry and metadata, reason over declarative cost and compliance policies, and propose constrained operational actions such as adaptive resource reconfiguration, schema reconciliation, and automated failure recovery. All agent actions are validated against governance policies to ensure predictable and auditable behavior. We evaluate Agentic Cloud Data Engineering platform using representative batch and streaming analytics workloads constructed from public enterprise-style datasets. Experimental results show that Agentic Cloud Data Engineering platform reduces mean pipeline recovery time by up to 45%, lowers operational cost by approximately 25%, and decreases manual intervention events by over 70% compared to static orchestration, while maintaining data freshness and policy compliance. These results demonstrate that policy-bounded agentic control provides an effective and practical approach for governing cloud data pipelines in enterprise environments.
The deployment of AI agents within legacy Radio Access Network (RAN) infrastructure poses significant safety and reliability challenges for future 6G networks. This paper presents a novel Edge AI framework for autonomous network optimisation in Open RAN environments, addressing these challenges through three core innovations: (1) a persona-based multi-tools architecture enabling distributed, context-aware decision-making; (2) proactive anomaly detection agent powered by traffic predictive tool; and (3) a safety, aligned reward mechanism that balances performance with operational stability.Integrated into the RAN Intelligent Controller (RIC), our framework leverages multimodal data fusion, including network KPIs, a traffic prediction model, and external information sources, to anticipate and respond to dynamic network conditions. Extensive evaluation using realistic 5G scenarios demonstrates that the edge framework achieves zero network outages under high-stress conditions, compared to 8.4% for traditional fixed-power networks and 3.3% for large language model (LLM) agent-based approaches, while maintaining near real-time responsiveness and consistent QoS. These results establish that, when equipped with the right tools and contextual awareness, AI agents can be safely and effectively deployed in critical network infrastructure, laying the framework for intelligent and autonomous 5G and beyond network operations.
The rapid expansion of multinational technology enterprises, particularly in highly regulated sectors such as e-commerce and healthcare, has amplified the complexity of managing diverse and evolving global compliance requirements. Traditional regulatory monitoring and response models—largely manual and reactive—are no longer scalable in the face of dynamic legislation, cross-border data governance rules, and sector-specific standards. This paper proposes an agentic AI–driven regulatory intelligence framework designed to automate and optimize the entire compliance lifecycle across jurisdictions. Leveraging lessons from Amazon’s large-scale operational structure, the study explores the integration of horizon scanning, natural language understanding, autonomous policy interpretation, and AI-driven risk assessment to enable real-time detection of regulatory changes, automated control mapping, and proactive remediation workflows. The system architecture includes distributed AI agents capable of orchestrating governance tasks across departments while maintaining auditability, human oversight, and ethical alignment. By transitioning compliance from a static, document tation-heavy function to a dynamic, intelligence-led ecosystem, this research demonstrates how agentic AI can significantly reduce regulatory exposure, enhance operational resilience, and enable strategic decision-making at global scale. The proposed model offers a blueprint for enterprises seeking to future-proof their compliance operations amidst increasing regulatory volatility.
The rapid evolution of artificial intelligence has given rise to agentic AI systems—autonomous entities capable of perceiving their environment, making decisions, and executing actions with minimal human intervention. This work provides a systematic analysis of agentic AI frameworks, governance models, and implementation strategies. Drawing on a comprehensive review of the literature, we examine the current state of agentic AI technologies, highlight key challenges in governance, security, and ethical oversight, and compare architectural frameworks for responsible deployment. Our results, illustrated through detailed framework comparisons and governance analyses, demonstrate that while agentic AI holds transformative potential across multiple sectors, notable gaps persist in standardization, regulatory compliance, and interoperability. To address these issues, we propose a layered architecture that embeds governance and security across all system layers. An analysis of the competitive landscape further identifies critical interoperability challenges that could undermine U.S. leadership. Based on these insights, we outline a strategic framework for U.S. competitiveness, emphasizing accelerated standards development, international collaboration, and investment in interoperability research. Finally, emerging trends and future directions are explored to provide a comprehensive roadmap for responsible deployment of agentic AI.
Agentic memory is emerging as a key enabler for large language models (LLM) to maintain continuity, personalization, and long-term context in extended user interactions, critical capabilities for deploying LLMs as truly interactive and adaptive agents. Agentic memory refers to the memory that provides an LLM with agent-like persistence: the ability to retain and act upon information across conversations, similar to how a human would. We present Memoria, a modular memory framework that augments LLM-based conversational systems with persistent, interpretable, and context-rich memory. Memoria integrates two complementary components: dynamic session-level summarization and a weighted knowledge graph (KG)-based user modeling engine that incrementally captures user traits, preferences, and behavioral patterns as structured entities and relationships. This hybrid architecture enables both short-term dialogue coherence and long-term personalization while operating within the token constraints of modern LLMs. We demonstrate how Memoria enables scalable, personalized conversational artificial intelligence (AI) by bridging the gap between stateless LLM interfaces and agentic memory systems, offering a practical solution for industry applications requiring adaptive and evolving user experiences.
This paper synthesizes key insights from emerging IEEE Artificial Intelligence Standards Committee (AISC) standards - P3394 and P3428 - that are shaping agent-based software engineering for intelligent systems. IEEE P3394 (LLM Agent Interface) defines a Universal Message Format (UMF) and communication protocols for Large Language Model (LLM) agents, establishing standard message envelopes, semantic payload, agent roles, session management, and interaction patterns. IEEE P3428 (LLM Agents for Education) specifies a modular agent architecture and lifecycle tailored to adaptive learning environments. It standardizes agent components, lifecycle states, and orchestration mechanisms to enable plug-and-play integration of multiple AI-driven agents in an adaptive instructional system. Together, P3394 and P3428 promote modular, interoperable, and scalable design of intelligent agent ecosystems. We highlight how P3394's universal message protocols and P3428's standardized agent lifecycle complement each other in supporting LLM-based agents and agent-based intelligent systems. We also briefly discuss IEEE P3427, an initiative on semantic information agents, which underscores the broader context of evaluation and continuous improvement of agent-based systems. By unifying communication interfaces and architectural frameworks, these standards lay a foundation for next-generation agentic systems that can seamlessly interoperate across platforms and domains.
Enterprise adoption of generative AI is rapidly shifting from isolated prompt-driven applications toward complex agentic systems that integrate retrieval, reasoning, and tool execution. As these systems grow in scale, the lack of a standardized interaction model between agents and external capabilities introduces challenges in reliability, observability, security, and operational governance. This paper presents aplat form architecture centered on the Model Context Protocol (MCP) as a first-class systems abstraction for enterprise-scale agentic generative AI. MCP servers act as strongly isolated, capability-oriented services that expose tools, data access, and actions to agents through well-defined contracts. This separation enables controlled tool invocation, bounded execution, and fault isolation across complex multi-agent workflows. We describe the architectural principles, execution lifecycle, and operational characteristics of MCP-based platforms, including agent orchestration, context management, latency governance, and failure containment. The paper draws on production deployment experience and provides guidance for building scalable, cost-aware, and reliable agentic AI systems in enterprise environments. Keywords: Model Context Protocol, Agentic AI, Generative AI Platforms, Distributed Systems, Enterprise Architecture
Large language models (LLMs) have gained significant interest in industry due to their impressive capabilities across a wide range of tasks. However, the widespread adoption of LLMs presents several challenges, such as integration into existing applications and infrastructure, utilization of company proprietary data, models, and APIs, and meeting cost, quality, responsiveness, and other requirements. To address these challenges, there is a notable shift from monolithic models to compound AI systems, with the premise of more powerful, versatile, and reliable applications. However, progress thus far has been piecemeal, with proposals for agentic workflows, programming models, and extended LLM capabilities, without a clear vision of an overall architecture. In this paper, we propose a ‘blueprint architecture’ for compound AI systems for orchestrating agents and data for enterprise applications. In our proposed architecture the key orchestration concept is ‘streams' to coordinate the flow of data and instructions among agents. Existing proprietary models and APIs in the enterprise are mapped to ‘agents', defined in an ‘agent registry’ that serves agent metadata and learned representations for search and planning. Agents can utilize proprietary data through a ‘data registry’ that similarly registers enterprise data of various modalities. Tying it all together, data and task ‘planners' break down, map, and optimize tasks and queries for given quality of service (QoS) requirements such as cost, accuracy, and latency. We illustrate an implementation of the architecture for a use-case in the HR domain and discuss opportunities and challenges for ‘agentic AI’ in the enterprise.
Agentic systems built on large language models (LLMs) offer promising capabilities for automating complex workflows in healthcare AI. We introduce mAIstro, an open-source, autonomous multi-agentic framework for end-to-end development and deployment of medical AI models. The system orchestrates exploratory data analysis, radiomic feature extraction, image segmentation, classification, and regression through a natural language interface, requiring no coding from the user. Built on a modular architecture, mAIstro supports both open- and closed-source LLMs, and was evaluated using a large and diverse set of prompts across 16 open-source datasets, covering a wide range of imaging modalities, anatomical regions, and data types. The agents successfully executed all tasks, producing interpretable outputs and validated models. This work presents the first agentic framework capable of unifying data analysis, AI model development, and inference across varied healthcare applications, offering a reproducible and extensible foundation for clinical and research AI integration. The code is available at: https://github.com/eltzanis/mAIstro
The operational resilience of electric power grids is facing growing challenges caused by aging infrastructure, increasing system complexity, and a rising frequency of extreme weather events. Traditional control paradigms, built around deterministic models and human-in-the-loop decision making, will become insufficient to manage the escalating demands on power grids. In response, recent advances in artificial intelligence (AI)—particularly the emergence of general-purpose AI agents capable of tool use, reasoning, and task orchestration—offer a new direction for enhancing grid flexibility and resiliency. This article introduces the concept of the Power Agent: an AI-enabled, context-aware assistant that leverages foundation models, standardized tool interfaces, and structured workflows to support grid operation and planning decisions. We discuss the conceptual architecture, implementation pathways, and system-level benefits of deploying Power Agents in power grid operations, with an emphasis on augmenting operator capabilities, improving situational awareness, and reducing operational bottlenecks.
The adoption of Generative Artificial Intelligence (GenAI) in Radio Access Networks (RAN) presents new opportunities for automation and intelligence across network operations. GenAI-powered agents, leveraging Large Language Models (LLMs), can enhance planning, execution, and decision-making for orchestration and real-time optimisation of 6G networks. Standardizing the implementation of the Agentic architecture for RAN is now essential to establish a unified framework for RANOps and AgentOps. One of the key challenges is to develop a blueprint that incorporates best practices for memory integration, tool generation, multi-agent orchestration, and performance benchmarking. This study highlights key areas requiring standardization, including agent tool specifications, RAN-specific LLM fine-tuning, validation frameworks, and AI-friendly documentation. We propose a dedicated research initiative on GenAI-for-RAN and GenAI-on-RAN to address these gaps and advance AI-driven network automation.
Large-language-model (LLM)-based AI agents have recently showcased impressive versatility by employing dynamic reasoning, an adaptive, multi-step process that coordinates with external tools. This shift from static, single-turn inference to agentic, multi-turn workflows broadens task generalization and behavioral flexibility, but it also introduces serious concerns about system-level cost, efficiency, and sustainability. This paper presents the first comprehensive system-level analysis of AI agents, quantifying their resource usage, latency behavior, energy consumption, and datacenter-wide power consumption demands across diverse agent designs and test-time scaling strategies. We further characterize how AI agent design choices, such as few-shot prompting, reflection depth, and parallel reasoning, impact accuracy-cost tradeoffs. Our findings reveal that while agents improve accuracy with increased compute, they suffer from rapidly diminishing returns, widening latency variance, and unsustainable infrastructure costs. Through detailed evaluation of representative agents, we highlight the profound computational demands introduced by AI agent workflows, uncovering a looming sustainability crisis. These results call for a paradigm shift in agent design toward compute-efficient reasoning, balancing performance with deployability under real-world constraints.
Large Language Model (LLM) agents, acting on their users'behalf to manipulate and analyze data, are likely to become the dominant workload for data systems in the future. When working with data, agents employ a high-throughput process of exploration and solution formulation for the given task, one we call agentic speculation. The sheer volume and inefficiencies of agentic speculation can pose challenges for present-day data systems. We argue that data systems need to adapt to more natively support agentic workloads. We take advantage of the characteristics of agentic speculation that we identify, i.e., scale, heterogeneity, redundancy, and steerability - to outline a number of new research opportunities for a new agent-first data systems architecture, ranging from new query interfaces, to new query processing techniques, to new agentic memory stores.
The ability to wield tools was once considered exclusive to human intelligence, but it's now known that many other animals, like crows, possess this capability. Yet, robotic systems still fall short of matching biological dexterity. In this paper, we investigate the use of Large Language Models (LLMs), tool affordances, and object manoeuvrability for non-prehensile tool-based manipulation tasks. Our novel method leverages LLMs based on scene information and natural language instructions to enable symbolic task planning for tool-object manipulation. This approach allows the system to convert a human language sentence into a sequence of feasible motion functions. We have developed a novel manoeuvrability-driven controller using a new tool affordance model derived from visual feedback. This controller helps guide the robot's tool utilization and manipulation actions, even within confined areas, using a stepping incremental approach. The proposed methodology is evaluated with experiments to prove its effectiveness under various manipulation scenarios.
To achieve faithful reasoning that aligns with human expectations, large language models (LLMs) need to ground their reasoning to real-world knowledge (e.g., web facts, math and physical rules). Tools help LLMs access this external knowledge, but there remains challenges for fine-tuning LLM agents (e.g., Toolformer) to invoke tools in multi-step reasoning problems, where inter-connected tool calls require holistic and efficient tool usage planning. In this work, we propose a new method for LLMs to better leverage tools in multi-step reasoning. Our method, Chain-of-Abstraction (CoA), trains LLMs to first decode reasoning chains with abstract placeholders, and then call domain tools to reify each reasoning chain by filling in specific knowledge. This planning with abstract chains enables LLMs to learn more general reasoning strategies, which are robust to shifts of domain knowledge (e.g., math results) relevant to different reasoning questions. It also allows LLMs to perform decoding and calling of external tools in parallel, which avoids the inference delay caused by waiting for tool responses. In mathematical reasoning and Wiki QA domains, we show that our method consistently outperforms previous chain-of-thought and tool-augmented baselines on both in-distribution and out-of-distribution test sets, with an average ~6% absolute QA accuracy improvement. LLM agents trained with our method also show more efficient tool use, with inference speed being on average ~1.4x faster than baseline tool-augmented LLMs.
This study focuses on optimizing planning and tool-use processes of LLM agents when final answer verification is unreliable. A planner is initialized via knowledge distillation and rejection sampling, followed by reinforcement learning using a process-level reward measuring tool invocation completeness and execution validity. Training is conducted on 2,900 industrial-style agent trajectories. The approach reduces invalid action sequences by 41.6% and improves task completion reliability by 29.8%, demonstrating that process-oriented RL yields stable improvements independent of answer correctness.
The functionality of Large Language Model (LLM) agents is primarily determined by two capabilities: action planning and answer summarization. The former, action planning, is the core capability that dictates an agent's performance. However, prevailing training paradigms employ end-to-end, multi-objective optimization that jointly trains both capabilities. This paradigm faces two critical challenges: imbalanced optimization objective allocation and scarcity of verifiable data, making it difficult to enhance the agent's planning capability. To address these challenges, we propose Reinforcement Learning with Tool-use Rewards (RLTR), a novel framework that decouples the training process to enable a focused, single-objective optimization of the planning module. Crucially, RLTR introduces a reward signal based on tool-use completeness to directly evaluate the quality of tool invocation sequences. This method offers a more direct and reliable training signal than assessing the final response content, thereby obviating the need for verifiable data. Our experiments demonstrate that RLTR achieves an 8%-12% improvement in planning performance compared to end-to-end baselines. Moreover, this enhanced planning capability, in turn, translates to a 5%-6% increase in the final response quality of the overall agent system.
Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this letter introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable reward from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning results in a lower bound of the multi-turn success probability under the minimal turns, as well as the generalization to subtasks with shorter horizons. Experimental evaluation on the complex task planning benchmark demonstrates that our 1.5B parameter model trained with single-turn GRPO achieves superior performance compared to larger baseline models up to 14B parameters, with success rates of 70% for long-horizon planning tasks.
Large language models have shown strong capabilities in performing natural language planning tasks, largely due to the chain-of-thought method, which enhances their ability to solve complex tasks through explicit intermediate inference. However, they face challenges in acquiring new knowledge, executing calculations, and interacting with the environment. Although previous work has enabled large language models to use external tools to improve reasoning and environmental interaction, there was no scalable or cohesive structure for these technologies. In this paper, we present LLM-Collab, where Collab represents the cooperative interaction between two AI agents, and the large language model plays a key role in the creation of AI agents. For this method, we took large language models as the reasoning core for AI agents and designed two AI agents to cooperate on the planning tasks: One as an analyst for tool selection and phase validation, and the other as an executor of specific tasks. Our method provided a comprehensive list of external tools to facilitate the invocation and integration of agents, ensuring a seamless collaboration process. This paradigm established a unified framework for autonomous task-solving based on massive language models by demonstrating how language communication and tool selection enable multi-agent collaboration.
Large Language Models show great potential with external tools, but face significant challenges in complex, multi-turn tool invocation. They often exhibit weak planning, tool hallucination, erroneous parameter generation, and struggle with robust interaction. To tackle these issues, we present PEARL, a novel framework to enhance LLM planning and execution for sophisticated tool use. PEARL adopts a two-stage approach: an offline phase where the agent explores tools to learn valid usage patterns and failure conditions, and an online reinforcement learning phase. In the online phase, a dedicated Planner is trained via group Relative Policy Optimization (GRPO) with a carefully designed reward function that provides distinct signals for planning quality. Experiments on the ToolHop and T-Eval benchmarks show PEARL significantly outperforms existing methods, achieving a new state-of-the-art success rate of \textbf{56.5\%} on ToolHop while maintaining a low invocation error rate. Our work marks a key advance in addressing the complex planning challenges of tool use, contributing to the development of more robust and reliable LLM-based agents.
Large language model (LLM) agents have demonstrated strong capabilities across diverse domains, yet automated agent design remains a significant challenge. Current automated agent design approaches are often constrained by limited search spaces that primarily optimize workflows but fail to integrate crucial human-designed components like memory, planning, and tool use. Furthermore, these methods are hampered by high evaluation costs, as evaluating even a single new agent on a benchmark can require tens of dollars. The difficulty of this exploration is further exacerbated by inefficient search strategies that struggle to navigate the large design space effectively, making the discovery of novel agents a slow and resource-intensive process. To address these challenges, we propose AgentSwift, a novel framework for automated agent design. We formalize a hierarchical search space that jointly models agentic workflow and composable functional components. This structure moves beyond optimizing workflows alone by co-optimizing functional components, which enables the discovery of more complex and effective agent architectures. To make exploration within this expansive space feasible, we mitigate high evaluation costs by training a value model on a high-quality dataset, generated via a novel strategy combining combinatorial coverage and balanced Bayesian sampling for low-cost evaluation. Guiding the entire process is a hierarchical Monte Carlo Tree Search (MCTS) strategy, which is informed by uncertainty to efficiently navigate the search space. Evaluated across a comprehensive set of seven benchmarks spanning embodied, math, web, tool, and game domains, AgentSwift discovers agents that achieve an average performance gain of 8.34\% over both existing automated agent search methods and manually designed agents. Moreover, our framework exhibits steeper and more stable search trajectories. By enabling the efficient, automated composition of workflow with functional components, AgentSwift provides a scalable methodology to explore complex agent designs. Our framework serves as a launchpad for researchers to rapidly prototype and discover powerful agent architectures without the impediment of prohibitive evaluation costs.
As Large Language Models (LLMs) evolve into powerful agentic systems, the telecommunications industry’s expansion into AI services necessitates industry-grounded benchmarks to evaluate their underexplored domain-specific capabilities. To address the gap left by generic benchmarks that fail to assess realistic, non-English performance, we present TelAgent-Bench, a Korean benchmark for the telecommunications domain evaluating five core agen-tic capabilities: Reasoning, Planning, Action (tool-use), Retrieval-Augmented Generation, and Instruction Following. Evaluations reveal significant performance disparities between models that employ explicit reasoning and those that do not, providing actionable insights for deploying agentic LLMs in real-world telecommunications tasks.
Large language models (LLMs) are increasingly used to automate or augment penetration testing, but their effectiveness and reliability across attack phases remain unclear. We present a comprehensive evaluation of multiple LLM-based agents, from single-agent to modular designs, across realistic penetration testing scenarios, measuring empirical performance and recurring failure patterns. We also isolate the impact of five core functional capabilities via targeted augmentations: Global Context Memory (GCM), Inter-Agent Messaging (IAM), Context-Conditioned Invocation (CCI), Adaptive Planning (AP), and Real-Time Monitoring (RTM). These interventions support, respectively: (i) context coherence and retention, (ii) inter-component coordination and state management, (iii) tool use accuracy and selective execution, (iv) multi-step strategic planning, error detection, and recovery, and (v) real-time dynamic responsiveness. Our results show that while some architectures natively exhibit subsets of these properties, targeted augmentations substantially improve modular agent performance, especially in complex, multi-step, and real-time penetration testing tasks.
The advancement of large language models (LLMs) has enabled the construction of multi-agent systems to solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods typically fine-tune these agents independently, leading to capability gaps among them with poor coordination. To address this, we propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agents collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent using diverse subgoal-action pairs generated by the agent itself to enhance its generalization capablity. Theoretical analysis proves that MOAT ensures a non-decreasing and progressively convergent training process. Experiments across six benchmarks demonstrate that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.
Recent research has demonstrated that Large Language Models (LLMs) can enhance their capabilities by utilizing external tools. However, three pivotal questions remain unanswered: (1) How effective are current LLMs in utilizing tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What obstacles need to be overcome to leverage tools? To address these questions, we introduce API-Bank, a groundbreaking benchmark, specifically designed for tool-augmented LLMs. For the first question, we develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls to assess the existing LLMs' capabilities in planning, retrieving, and calling APIs. For the second question, we construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. However, there is still significant potential for further improvement. Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26 pts and approaches the effectiveness of GPT-3.5. Through error analysis, we highlight the key challenges for future research in this field to answer the third question.
Connecting Large Language Models (LLMs) with the ability to leverage APIs (Web Search, Charting, Calculators, Calendar, Flight Search, Hotel Search, Data Lookup, etc. ) is likely to allow us to solve a variety of new hard problems. Several research efforts have made this observation and suggested recipes for LLMs to emit API calls, and proposed mechanisms by which they can generate additional text conditioned on the output for the API call. However, in practice, the focus has been on relatively simple slot-filling tasks that make an API call rather unlocking novel capabilities by combining different tools, reasoning over the response from a tool, making multiple invocations, or complex planning. In this paper, we pose the following question: what does it mean to say that an LLM is proficient at using a set of APIs? We answer this question in the context of structured APIs by defining seven capabilities for API-use. We provide an approach for generating synthetic tasks that exercise each of these capabilities given only the description of an API. We argue that this provides practitioners with a principled way to construct a dataset to evaluate an LLM's ability to use a given set of APIs. Through human evaluations, we show that our approach produces high-quality tasks for each of the seven capabilities. We also describe how we used this approach to on-board new API and create principled evaluation sets for multiple LLM-based products.
Recently, there has been increasing interest in using Large Language Models (LLMs) to construct complex multi-agent systems to perform tasks such as compiling literature reviews, drafting consumer reports, and planning vacations. Many tools and libraries exist for helping create such systems, however none support *recursive* multi-agent systems—where the models themselves flexibly decide when to delegate tasks and how to organize their delegation structure. In this work, we introduce ReDel: a toolkit for recursive multi-agent systems that supports custom tool-use, delegation schemes, event-based logging, and interactive replay in an easy-to-use web interface. We show that, using ReDel, we are able to achieve significant performance gains on agentic benchmarks and easily identify potential areas of improvements through the visualization and debugging tools. Our code, documentation, and PyPI package are open-source at https://github.com/zhudotexe/redel, and free to use under the MIT license.
The latest A2A systems often use LLM-based agents for tool calling and planning workflows, introducing risks such as prompt injection attacks. We introduce a lightweight, proof-carrying receipt for agent-to-agent (A2A) messaging that supports fully offline verification. This receipt enables third parties to verify who authored a message and confirm whether it was logged, without contacting the sender, the receiver, or any online service. An agent signs a canonical JSON representation of each message. Servers log the payload hash in a Merkle tree and periodically sign the tree state as a Signed Tree Head (STH). The returned bundle includes a receipt containing a Merkle proof, an STH, and a minimal decentralized identifier (DID) registry for offline key discovery. A verifier recomputes the Merkle root and validates the payload-hash match, Merkle inclusion, and both the STH and client signatures entirely offline. In our prototype, the proof size grows as $O(\log N)$, and the median offline verification time is approximately 0.4 s on our test platform. Receipts self-authenticate individual messages within a single transparency log, providing a lightweight alternative to other approaches.
Agentic AI has significantly extended the capabilities of large language models (LLMs) by enabling complex reasoning and tool use. However, most existing frameworks are tailored to domains such as mathematics, coding, or web automation, and fall short on geospatial tasks that require spatial reasoning, multi-hop planning, and real-time map interaction. To address these challenges, we introduce MapAgent, a hierarchical multi-agent plug-and-play framework with customized toolsets and agentic scaffolds for map-integrated geospatial reasoning. Unlike existing flat agent-based approaches that treat tools uniformly-often overwhelming the LLM when handling similar but subtly different geospatial APIs-MapAgent decouples planning from execution. A high-level planner decomposes complex queries into subgoals, which are routed to specialized modules. For tool-heavy modules-such as map-based services-we then design a dedicated map-tool agent that efficiently orchestrates related APIs adaptively in parallel to effectively fetch geospatial data relevant for the query, while simpler modules (e.g., solution generation or answer extraction) operate without additional agent overhead. This hierarchical design reduces cognitive load, improves tool selection accuracy, and enables precise coordination across similar APIs. We evaluate MapAgent on four diverse geospatial benchmarks-MapEval-Textual, MapEval-API, MapEval-Visual, and MapQA-and demonstrate substantial gains over state-of-the-art tool-augmented and agentic baselines. We open-source our framwork at https://github.com/Hasebul/MapAgent.
Clinicians often rely on data engineers to retrieve complex patient information from electronic health record (EHR) systems, a process that is both inefficient and time-consuming. We propose EHRAgent, a large language model (LLM) agent empowered with accumulative domain knowledge and robust coding capability. EHRAgent enables autonomous code generation and execution to facilitate clinicians in directly interacting with EHRs using natural language. Specifically, we formulate a multi-tabular reasoning task based on EHRs as a tool-use planning process, efficiently decomposing a complex task into a sequence of manageable actions with external toolsets. We first inject relevant medical information to enable EHRAgent to effectively reason about the given query, identifying and extracting the required records from the appropriate tables. By integrating interactive coding and execution feedback, EHRAgent then effectively learns from error messages and iteratively improves its originally generated code. Experiments on three real-world EHR datasets show that EHRAgent outperforms the strongest baseline by up to 29.6% in success rate, verifying its strong capacity to tackle complex clinical tasks with minimal demonstrations.
We present a framework for robot skill acquisition, which 1) efficiently scale up data generation of language-labelled robot data and 2) effectively distills this data down into a robust multi-task language-conditioned visuo-motor policy. For (1), we use a large language model (LLM) to guide high-level planning, and sampling-based robot planners (e.g. motion or grasp samplers) for generating diverse and rich manipulation trajectories. To robustify this data-collection process, the LLM also infers a code-snippet for the success condition of each task, simultaneously enabling the data-collection process to detect failure and retry as well as the automatic labeling of trajectories with success/failure. For (2), we extend the diffusion policy single-task behavior-cloning approach to multi-task settings with language conditioning. Finally, we propose a new multi-task benchmark with 18 tasks across five domains to test long-horizon behavior, common-sense reasoning, tool-use, and intuitive physics. We find that our distilled policy successfully learned the robust retrying behavior in its data collection procedure, while improving absolute success rates by 33.2% on average across five domains. Code, data, and additional qualitative results are available on https://www.cs.columbia.edu/~huy/scalingup/.
Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is utilized to improve LOD model generalizations. During the training process, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, facilitating training data scaling up. In this process, we observe that VLM hallucinations bring inaccurate object descriptions (e.g., object name, color, and shape) to deteriorate VL alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM to re-align language to visual objects via adaptively adjusting image and text prompts. We name this workflow Real-LOD, which includes planning, tool use, and reflection steps. Given an image with detected objects and VLM raw language expressions, Real-LOD reasons its state automatically and arranges action based on our neural symbolic designs (i.e., planning). The action will adaptively adjust the image and text prompts and send them to VLMs for object re-description (i.e., tool use). Then, we use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps are conducted in a cyclic form to gradually improve language descriptions for re-aligning to visual objects. We construct a dataset that contains a tiny amount of 0.18M images with re-aligned language expression and train a prevalent LOD model to surpass existing LOD methods by around 50% on the standard benchmarks. Our Real-LOD workflow, with automatic VL refinement, reveals a potential to preserve data quality along with scaling up data quantity, which further improves LOD performance from a data-alignment perspective.
This paper introduces OSC (Orchestrating Cognitive Synergy), a knowledge-aware adaptive collaboration framework designed to enhance cognitive synergy in multi-agent systems with large language models. While prior work has advanced agent selection and result aggregation, efficient linguistic interactions for deep collaboration among expert agents remain a critical bottleneck. OSC addresses this gap as a pivotal intermediate layer between selection and aggregation, introducing Collaborator Knowledge Models (CKM) to enable each agent to dynamically perceive its collaborators'cognitive states. Through real-time cognitive gap analysis, agents adaptively adjust communication behaviors, including content focus, detail level, and expression style, using learned strategies. Experiments on complex reasoning and problem-solving benchmarks demonstrate that OSC significantly improves task performance and communication efficiency, transforming"parallel-working individuals''into a"deeply collaborative cognitive team.''This framework not only optimizes multi-agent collaboration but also offers new insights into LLM agent interaction behaviors.
Automated code review systems have improved software development, yet many lack contextual awareness, leading to redundant feedback and limited adaptability to user-specific coding styles. This paper presents a multi-agent AI-driven framework that leverages FAISS-based memory to improve review efficiency, personalization, and collaboration. The system integrates code review, bug detection, and security analysis agents, operating independently and interactively to analyze and refine code submissions. A feedback mechanism enables users to influence future AI suggestions, ensuring that recommendations evolve based on individual preferences and past interactions. Through structured experiments, we evaluated the impact of FAISS memory on the reduction of redundant feedback, evaluated the effectiveness of collaborative agent execution, and measured the system’s ability to adapt to coding patterns over time. The results indicate that the incorporation of FAISS significantly improves AI-assisted development by minimizing unnecessary repetitions while maintaining essential corrective feedback. This research demonstrates the potential of adaptive AI systems in software engineering, contributing to more intelligent, context-aware, and efficient code review methodologies.
Large language models (LLMs) have demonstrated a remarkable ability to serve as general-purpose tools for various language-based tasks. Recent works have demonstrated that the efficacy of such models can be improved through iterative dialog between multiple models. While these paradigms show promise in improving model efficacy, most works in this area treat collaboration as an emergent behavior, rather than a learned behavior. In doing so, current multi-agent frameworks rely on collaborative behaviors to have been sufficiently trained into off-the-shelf models. To address this limitation, we propose ACC-Collab, an Actor-Critic based learning framework to produce a two-agent team (an actor-agent and a critic-agent) specialized in collaboration. We demonstrate that ACC-Collab outperforms SotA multi-agent techniques on a wide array of benchmarks.
,
Large Language Models (LLMs) are increasingly deployed in multi-agent systems; however, existing frameworks continue to suffer from communication ambiguity, coordination failures, and poor scalability as task complexity increases. This paper introduces COLLAB-LLM, a communication-centric, role-based framework designed to enable reliable and scalable collaboration among LLM agents. The framework combines a structured communication protocol, a hierarchical role architecture, and a dynamic distributed task-graph engine to support coordinated planning, efficient negotiation, and adaptive task execution. COLLAB-LLM is evaluated on over 120 complex, multi-step tasks spanning software engineering, business process automation, and scientific research synthesis. Task success is defined using task-specific completion criteria, with a task considered successful when the aggregate completion score exceeds 0.8. Under identical underlying LLM configurations, COLLAB-LLM achieves an 89% overall success rate, representing a 13–19% improvement over strong single-agent and multi-agent state-of-the-art baselines, with statistically significant gains in performance, communication efficiency, and robustness. Experimental results demonstrate that structured communication and role specialization substantially reduce ambiguity, improve collaboration quality, and enable scalable coordination for teams of up to eight agents. This work establishes foundational design principles for high-performing collaborative AI systems and provides a practical, reproducible pathway toward scalable, human-aligned multi-agent LLM architectures. All experimental artifacts, task definitions, prompts, and evaluation scripts will be released to support reproducibility.
This paper explores the integration of advanced Multi-Agent Systems (MAS) techniques to develop a team of agents with enhanced logical reasoning, long-term knowledge retention, and Theory of Mind (ToM) capabilities. By uniting these core components with optimized communication protocols, we create a novel framework called SynergyMAS, which fosters collaborative teamwork and superior problem-solving skills. The system's effectiveness is demonstrated through a product development team case study, where our approach significantly enhances performance and adaptability. These findings highlight SynergyMAS's potential to tackle complex, real-world challenges.
Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round-table debates, which can suppress correct but less dominant agent contributions. Recent approaches model multi-agent systems as graph networks but optimize purely for agent performance, neglecting the quality of interactions. We hypothesize that effective agent communication is crucial for multi-agent reasoning and that debating quality plays a significant role. To address this, we propose $\ours$, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines multi-agent collaboration structures. Our method defines action spaces and a feedback mechanism that evaluates communication robustness and coherence throughout the debate. The final decision is achieved through a majority vote over all the agents. We assess $\ours$ on various reasoning tasks, including mathematical reasoning, creative writing, scientific reasoning, and numerical sorting. Results demonstrate that our approach significantly outperforms single-agent prompting methods and state-of-the-art multi-agent frameworks on diverse tasks.
Large Language Models (LLMs) have shown promising potential in business applications, particularly in enterprise decision support and strategic planning, yet current approaches often struggle to reconcile intricate operational analyses with overarching strategic goals across diverse market environments, leading to fragmented workflows and reduced collaboration across organizational levels. This paper introduces BusiAgent, a novel multi-agent framework leveraging LLMs for advanced decision-making in complex corporate environments. BusiAgent integrates three core innovations: an extended Continuous Time Markov Decision Process (CTMDP) for dynamic agent modeling, a generalized entropy measure to optimize collaborative efficiency, and a multi-level Stackelberg game to handle hierarchical decision processes. Additionally, contextual Thompson sampling is employed for prompt optimization, supported by a comprehensive quality assurance system to mitigate errors. Extensive empirical evaluations across diverse business scenarios validate BusiAgent's efficacy, demonstrating its capacity to generate coherent, client-focused solutions that smoothly integrate granular insights with high-level strategy, significantly outperforming established approaches in both solution quality and user satisfaction. By fusing cutting-edge AI technologies with deep business insights, BusiAgent marks a substantial step forward in AI-driven enterprise decision-making, empowering organizations to navigate complex business landscapes more effectively.
Real-world clinical decisions emerge from collaboration, yet most medical AI functions as a single expert. To better reflect clinical practice, we introduce a multi-agent framework where large language model (LLM) agents, each embodying a distinct specialty (e.g., cardiology, endocrinology), collaboratively recommend a sequence of patient procedures. Each agent reasons using a private knowledge base of domain-specific PubMed literature, a shared memory containing the patient's electronic health records (EHRs), and the history of the ongoing inter-agent discussion. We simulate two organizational models: one requiring unanimous consensus and another where a designated team leader makes the final call. Using the MIMIC-III dataset to predict procedural sequences, we show that the leader-based model consistently outperforms both the consensus and single-agent configurations, achieving higher accuracy on metrics including Mean Reciprocal Rank (MRR). Our work presents a robust and interpretable paradigm for multi-agent clinical decision support, more closely aligning AI with the collaborative nature of clinical practice.
Cyber threat intelligence (CTI) provides defenders with knowledge about attacks and adversaries, including their infrastructure, tools, and attack techniques. Enriching CTI with contextual information enables security teams to prioritize risks, derive actionable outputs, and respond to threats more effectively. Yet, the growing scale and complexity of cyberattacks make manual enrichment increasingly challenging, creating the need for automated and reliable solutions. In this paper, we present IntelForge, a novel multi-agent framework for automated CTI enrichment, built on orchestrated large language model (LLM)-based AI agents. Each agent in the system performs a distinct role, ranging from entity extraction to external retrieval, scoring, reporting, and evaluation, enabling a scalable and modular enrichment process. Leveraging task specialization and agent collaboration, IntelForge enriches raw CTI reports with high-value external sources and produces analyst-ready intelligence. To assess the quality of this enrichment, we compare IntelForge’s source rankings to those of human experts and state-of-the-art LLM baselines. Our results show that IntelForge enriches CTI reports more effectively, achieving substantially lower deviation and higher correlation with human experts than single-LLM baselines. These findings demonstrate that structured agent-based LLM pipelines provide a powerful alternative to single-model solutions for CTI enrichment.
No abstract available
This paper presents a structured overview of recent research on the application of Large Language Models (LLMs) and multi-agent systems in software requirements analysis and introduces a conceptual, role-based multi-agent architecture for addressing identified limitations. The proposed approach emphasizes hierarchical supervision, collaboration between Large and Small Language Models, and structured cross-checking based on report comparison and uncertaintyaware arbitration to improve robustness and reliability. The contribution of this work is theoretical and exploratory, providing a foundation for future research on multi-agent LLM systems aimed at improving the software requirements analysis process.
A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using MARL methods for LLM collaboration and highlights the associated challenges.
Multi-agent systems (MAS) based on large language models (LLMs) have demonstrated significant potential in collaborative problem-solving. However, they still face substantial challenges of low communication efficiency and suboptimal task performance, making the careful design of the agents’ communication topologies particularly important. Inspired by the management theory that roles in an efficient team are often dynamically adjusted, we propose AgentDropout , which identifies redundant agents and communication across different communication rounds by optimizing the adjacency matrices of the communication graphs and eliminates them to enhance both token efficiency and task performance. Compared to state-of-the-art methods, AgentDropout achieves an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption, along with a performance improvement of 1.14 on the tasks. Furthermore, the extended experiments demonstrate that AgentDropout achieves notable domain transferability and structure robustness, revealing its reliability and effectiveness. We release our code at https://github.
Recently, with the development of tool-calling capabilities in large language models (LLMs), these models have demonstrated significant potential for automating electronic design automation (EDA) flows by interacting with EDA tool APIs via EDA scripts. However, considering the limited understanding of EDA tools, LLMs face challenges in practical scenarios where diverse interfaces of EDA tools exist across different platforms. Additionally, EDA flow automation often involves intricate, long-chain tool-calling processes, increasing the likelihood of errors in intermediate steps. Any errors will lead to the instability and failure of EDA flow automation. To address these challenges, we introduce EDAid, a multi-agent collaboration system where multiple agents harboring divergent thoughts converge towards a common goal, ensuring reliable and successful EDA flow automation. Specifically, each agent is controlled by ChipLlama models, which are expert LLMs fine-tuned for EDA flow automation. Our experiments demonstrate the state-of-the-art (SOTA) performance of our ChipLlama models and validate the effectiveness of our EDAid in the automation of complex EDA flows, showcasing superior performance compared to single-agent systems.
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents, each focusing on a specific domain. However, the impact of clumsy or even malicious agents--those who frequently make errors in their tasks--on the overall performance of the system remains underexplored. This paper investigates: (1) What is the resilience of various system structures (e.g., A$\rightarrow$B$\rightarrow$C, A$\leftrightarrow$B$\leftrightarrow$C) under faulty agents, on different downstream tasks? (2) How can we increase system resilience to defend against these agents? To simulate faulty agents, we propose two approaches--AutoTransform and AutoInject--which introduce mistakes into the agents' responses. Experiments on four downstream tasks using six systems show that the"hierarchical"structure, i.e., A$\rightarrow$(B$\leftrightarrow$C), exhibits superior resilience with the lowest performance drop of 5.5%, compared to 10.5% and 23.7% of other two structures. To further improve resilience, we introduce (1) Challenger, that introduces a mechanism for each agent to challenge others' outputs, and (2) Inspector, an additional agent to review and correct messages, recovering up to 96.4% errors made by faulty agents. Our code and data are available at https://github.com/CUHK-ARISE/MAS-Resilience.
Multi-agent autonomous systems (MAS) are better at addressing challenges that spans across multiple domains than singular autonomous agents. This holds true within the field of software engineering (SE) as well. The state-of-the-art research on MAS within SE focuses on integrating LLMs at the core of autonomous agents to create LLM-based multi-agent autonomous (LMA) systems. However, the introduction of LMA systems into SE brings a plethora of challenges. One of the major challenges is the strategic allocation of tasks between humans and the LMA system in a trustworthy manner. To address this challenge, a RACI-based framework is proposed in this work in progress article, along with implementation guidelines and an example implementation of the framework. The proposed framework can facilitate efficient collaboration, ensure accountability, and mitigate potential risks associated with LLM-driven automation while aligning with the Trustworthy AI guidelines. The future steps for this work delineating the planned empirical validation method are also presented.
The ubiquitous computing resources in 6G networks provide ideal environments for the fusion of large language models (LLMs) and intelligent services through the agent framework. With auxiliary modules and planning cores, LLM-enabled agents can autonomously plan and take actions to deal with diverse environment semantics and user intentions. However, the limited resources of individual network devices significantly hinder the efficient operation of LLM-enabled agents with complex tool calls, highlighting the urgent need for efficient multi-level device collaborations. To this end, the framework and method of the LLM-enabled multi-agent system with dual-loop terminal-edge collaborations are proposed in 6G networks. Firstly, the outer loop consists of the iterative collaborations between the global agent and multiple sub-agents deployed on edge servers and terminals, where the planning capability is enhanced through task decomposition and parallel sub-task distribution. Secondly, the inner loop utilizes sub-agents with dedicated roles to circularly reason, execute, and replan the sub-task, and the parallel tool calling generation with offloading strategies is incorporated to improve efficiency. The improved task planning capability and task execution efficiency are validated through the conducted case study in 6G-supported urban safety governance. Finally, the open challenges and future directions are thoroughly analyzed in 6G networks, accelerating the advent of the 6G era.
Grounding the reasoning ability of large language models (LLMs) for embodied tasks is challenging due to the complexity of the physical world. Especially, LLM planning for multi-agent collaboration requires communication of agents or credit assignment as the feedback to re-adjust the proposed plans and achieve effective coordination. However, existing methods that overly rely on physical verification or self-reflection suffer from excessive and inefficient querying of LLMs. In this paper, we propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. Specifically, we perform critic regression to learn a sequential advantage function from LLM-planned data, and then treat the LLM planner as an optimizer to generate actions that maximize the advantage function. It endows the LLM with the foresight to discern whether the action contributes to accomplishing the final task. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents and query rounds of LLMs, demonstrating its high efficiency for grounding LLMs. More results are given at https://embodied-read.github.io
Proper initialization is crucial for any system, particularly in multi-agent systems (MAS), where it plays a pivotal role in determining both the system's efficiency and effectiveness. However, existing MAS initialization methods do not fully account for the collaborative needs of the generated agents in subsequent stages. Inspired by the principles of effective team composition, we propose AgentInit, which aims to optimize the structure of agent teams. Specifically, in addition to multi-round interactions and reflections between agents during agent generation, AgentInit incorporates a Natural Language to Format mechanism to ensure consistency and standardization. Balanced team selection strategies using Pareto principles are subsequently applied to jointly consider agent team diversity and task relevance to promote effective and efficient collaboration and enhance overall system performance. Experiments show that AgentInit consistently outperforms state-of-the-art initialization methods and pre-defined strategies across various frameworks and tasks, achieving an overall performance improvement of up to 1.2 and 1.6, respectively, while also significantly reducing token consumption. Further analysis confirms its strong transferability to similar tasks and verifies the effectiveness of its key components, demonstrating its capability and adaptability as a reliable MAS initialization method. Source code and models are available at https://github.com/1737423697/AgentInit.
No abstract available
This paper presents the MAPLE framework, which harnesses large language models (LLMs) to facilitate multi-agent collaboration for fully automated deployment and management of large-scale networks. Within MAPLE, a supervisor agent interprets natural language instructions from users, orchestrates specialized agents to execute tasks, and validates outcomes through integration with a network simulation platform. Experimental findings show that MAPLE outperforms single-agent approaches in terms of success rates for topology deployment and service configuration. Moreover, experiments reveal that by adaptively employing LLMs with varying capabilities according to task requirements and inter-agent dependencies, the framework effectively balances task success rates with cost efficiency.
Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose \textbf{DDO}, a novel LLM-based framework that performs \textbf{D}ual-\textbf{D}ecision \textbf{O}ptimization by decoupling the two sub-tasks and optimizing them with distinct objectives through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task. The code is available at https://github.com/zh-jia/DDO.
Large language models (LLMs) have shown promising code generation capabilities; however, they still face challenges in generating successful code for non-trivial programming tasks. To enhance the LLM’s code generation abilities, we investigate, in this paper, the combination of two complementary inference-time strategies: multi-agent collaboration and runtime execution-based debugging. We conduct an extensive empirical study to evaluate our proposed framework using 19 diverse LLMs on two established coding benchmarks, measuring functional accuracy as a code quality characteristic. Our study reveals that that our combination outperforms the basic performance of LLMs and the individual approaches. In particlular, on HumanEval benchmark, across 19 LLMs, our approach achieves 64.82% accuracy compared to 56.48% achieved by the basic one-shot prompting approach. It also outperforms the individual multi-agent collaboration approach by more than 7.66%. Moreover, our extensive empirical study revealed that optimal combinatory improvement is achieved when both individual strategies perform at similar levels. The proposed framework can be readily deployed in existing development pipelines to enhance the reliability of code generation while maintaining reasonable generation speeds, offering a practical solution to improve automated code generation in enterprise environments. Our code is available at: https://github.com/nazmus-ashrafi/multiagent_vs_debugger
The combination of multi-agent systems and Large Language Models (LLMs) has become an exciting new paradigm capable of the real-time automation of complex tasks. In this paper, we introduce a framework of LLM-Guided Multi-Agent Collaboration that utilizes LLMs as high-level reasoning engines to coordinate, optimize and adjust the interactions between multiple autonomous agents. The proposed framework builds on almost zero task decomposition and intent interpretation in multitasking and resolution of conflict that is provided by the natural language understanding and decision support capability of the LLMs. Agents that have domain-specific knowledge will work in cooperation, under LLM supervision to deliver an agent-centric task execution that is scalable, explainable and adaptive. We consider the framework against the disparate domains of process automation, data-driven decision-making, and cybersecurity incident response. In the experimental results, better efficiency, resource use, and flexibility were observed in respect to dedicated multi-agent models. This paper indicates the promise that LLM-based coordination may hold to turn multi-agent collaboration into a more intelligent, context-aware, and robust paradigm of automating complex tasks in the real world.
Multi-agent collaboration using large language mod els (LLMs) has shown promise in complex reasoning tasks, yet cu rrent approaches suffer from high computational costs, redundan t communication, and unstable performance gains. We propose B udgeted Multi-Agent Routing (BMAR), a framework that dynami cally allocates agents and communication rounds under explicit b udget constraints. BMAR introduces three key innovations: (1) ad aptive role assignment based on task characteristics and uncertai nty, (2) claim-evidence compression to reduce communication ove rhead, and (3) budget-aware routing that optimizes the accuracycost trade-off. Experiments across mathematical reasoning, multi -hop question answering, and code debugging demonstrate that $\mathbf{B}$ MAR achieves superior Pareto efficiency: at a $4 \times$ budget, BMAR improves accuracy by 8.9 percentage points on GSM8K over sing le-agent baselines (77.1% vs. 68.2%), while using 58% fewer toke ns than fixed multi-agent approaches at comparable accuracy lev els. Our approach provides a principled framework for deploying cost-effective multi-agent systems in resource-constrained enviro nments.
In large-scale enterprises, on-call engineers (OCEs) are critical for ensuring service availability and reliability. However, as incidents grow in volume and complexity, traditional manual on-call processes are becoming increasingly inadequate. Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and multi-agent collaboration, presenting new opportunities for automation. We propose OncallX, an end-to-end automated on-call system designed for real-world industrial scenarios that integrates LLMs with multi-agent cooperation to enable intelligent and efficient incident management. OncallX first enhances user queries by leveraging external knowledge bases and multi-turn dialogue interactions. Subsequently, multiple expert agents collaborate through tree-search-based mechanisms to generate effective responses and solutions. When incidents cannot be resolved automatically, OncallX accurately assigns them to the most appropriate teams. Comprehensive experiments conducted in the real-world production environment of a top-tier global online video service provider demonstrate that OncallX efficiently responds to incidents and accurately triages tickets, significantly outperforming existing methods in both automated metrics and human evaluations. Furthermore, OncallX has been successfully deployed in production for two months, during which it has substantially enhanced on-call efficiency, reducing average incident response time to just 21 seconds and average triage time to 4 seconds—representing a transformative improvement in operational excellence.
The potential of automatic task-solving through Large Language Model (LLM)-based multi-agent collaboration has recently garnered widespread attention from both the research community and industry. While utilizing natural language to coordinate multiple agents presents a promising avenue for democratizing agent technology for general users, designing coordination strategies remains challenging with existing coordination frameworks. This difficulty stems from the inherent ambiguity of natural language for specifying the collaboration process and the significant cognitive effort required to extract crucial information (e.g. agent relationship, task dependency, result correspondence) from a vast amount of text-form content during exploration. In this work, we present a visual exploration framework to facilitate the design of coordination strategies in multi-agent collaboration. We first establish a structured representation for LLM-based multi-agent coordination strategy to regularize the ambiguity of natural language. Based on this structure, we devise a three-stage generation method that leverages LLMs to convert a user's general goal into an executable initial coordination strategy. Users can further intervene at any stage of the generation process, utilizing LLMs and a set of interactions to explore alternative strategies. Whenever a satisfactory strategy is identified, users can commence the collaboration and examine the visually enhanced execution result. We develop AgentCoord, a prototype interactive system, and conduct a formal user study to demonstrate the feasibility and effectiveness of our approach.
,
Recent progress in large language model (LLM)-based multi-agent collaboration highlights the power of structured communication in enabling collective intelligence. However, existing methods largely rely on static or graph-based inter-agent topologies, lacking the potential adaptability and flexibility in communication. In this work, we propose a new framework that rethinks multi-agent coordination through a sequential structure rather than a graph structure, offering a significantly larger topology space for multi-agent communication. Our method focuses on two key directions: (1) Next-Agent Prediction, which selects the most suitable agent role at each step, and (2) Next-Context Selection (NCS), which enables each agent to selectively access relevant information from any previous step. Together, these components construct task-adaptive communication pipelines that support both role flexibility and global information flow. Extensive evaluations across multiple benchmarks demonstrate that our approach achieves superior performance while substantially reducing communication overhead.
The communication topology in large language model-based multi-agent systems fundamentally governs inter-agent collaboration patterns, critically shaping both the efficiency and effectiveness of collective decision-making. While recent studies for communication topology automated design tend to construct sparse structures for efficiency, they often overlook why and when sparse and dense topologies help or hinder collaboration. In this paper, we present a causal framework to analyze how agent outputs, whether correct or erroneous, propagate under topologies with varying sparsity. Our empirical studies reveal that moderately sparse topologies, which effectively suppress error propagation while preserving beneficial information diffusion, typically achieve optimal task performance. Guided by this insight, we propose a novel topology design approach, EIB-leanrner, that balances error suppression and beneficial information propagation by fusing connectivity patterns from both dense and sparse graphs. Extensive experiments show the superior effectiveness, communication cost, and robustness of EIB-leanrner.
No abstract available
Multi-agent systems (MAS) have shown great potential in executing complex tasks, but coordination and safety remain significant challenges. Multi-Agent Reinforcement Learning (MARL) offers a promising framework for agent collaboration, but it faces difficulties in handling complex tasks and designing reward functions. The introduction of Large Language Models (LLMs) has brought stronger reasoning and cognitive abilities to MAS, but existing LLM-based systems struggle to respond quickly and accurately in dynamic environments. To address these challenges, we propose LLM-based Graph Collaboration MARL (LGC-MARL), a framework that efficiently combines LLMs and MARL. This framework decomposes complex tasks into executable subtasks and achieves efficient collaboration among multiple agents through graph-based coordination. Specifically, LGC-MARL consists of two main components: an LLM planner and a graph-based collaboration meta policy. The LLM planner transforms complex task instructions into a series of executable subtasks, evaluates the rationality of these subtasks using a critic model, and generates an action dependency graph. The graph-based collaboration meta policy facilitates communication and collaboration among agents based on the action dependency graph, and adapts to new task environments through meta-learning. Experimental results on the AI2-THOR simulation platform demonstrate the superior performance and scalability of LGC-MARL in completing various complex tasks.
The development of agentic artificial intelligence (AI) systems with the capability to perceive environments, plan, and execute multi-step tasks is a paradigmatic change in the deployment of computational intelligence. The paper has offered a synthesis of product and system design patterns that apply to trustworthy agentic AI based on the progress of large language model (LLM)-based agents, deep reinforcement learning (RL), explainable AI (XAI), fairness-aware machine learning, and governance-focused frameworks. The use of agentic AI both by enterprises and for personal purposes has expanded to around 5 per cent. in the year 2019, and is projected to grow to around 73 per cent. by the mid-2025 years, accompanied by a corresponding growth in the number of safety incidents. The scores in trustworthiness dimension are 1532 percent higher in fairness, robustness, and privacy indicators in hybrid agentic architectures in comparison with the purely LLM-based settings. Seven trustworthy AI pillars are safety, robustness, explainability, fairness, privacy, accountability, and transparency, which are aligned to the system layers with a particular design pattern. A framework namedTRiSM (Trust, Risk, and Security Management) has been found as a systematic route to the operationalization of these principles in production deployments, with 94 percent of agent impersonation incidents being reduced.
No abstract available
No abstract available
No abstract available
In this study, we analyzed the demand and pain points of multi-business pipelines integrated decision-making scenarios (such as drug development), including the subjectivity, non-interpretability and low efficiency of traditional expert experience-driven decision-making, and the inability to handle a large amount of data. Then, we designed a framework for the Data Space and Large Language Model (LLM) enabled decision-making support system. By integrating Data Space, the framework helps enterprises to achieve multilateral data-driven evaluation and analysis; By integrating LLM and Ai-Agent, collaborative decisionmaking of multi-business pipelines can be realized. This framework can provide quantitative analysis for experts and improve the accuracy and efficiency of decision-making. The effectiveness of this study has been verified in the decisionmaking scenario of drug development. The follow-up studies will further explore the autonomous decision-making of multi-business pipelines and the intelligent conflict resolution mechanism to reduce the dependence on expert experience.
This study introduces an innovative framework that aligns LLM-based AI agents with enterprise needs by integrating dual-process theory and resource-based task complexity models. Utilizing large language models and Retrieval-Augmented Generation, our system simulates diverse task and decisionmaking scenarios through a Strategy Generator and Evaluator. The results reveal that decision styles-Rational/Intuitive and Participative/Autonomous-vary in performance according to task demands, with effective knowledge management further boosting outcomes. This framework advances efficient AI integration in enterprise settings, enhancing both decision flexibility and execution efficiency.
Large Language Models (LLMs) have shown significant promise in real-world decision-making tasks for embodied artificial intelligence, especially when fine-tuned to leverage their inherent common sense and reasoning abilities while being tailored to specific applications. However, this fine-tuning process introduces considerable safety and security vulnerabilities, especially in safety-critical cyber-physical systems. In this work, we propose the first comprehensive framework for Backdoor Attacks against LLM-based Decision-making systems (BALD) in embodied AI, systematically exploring the attack surfaces and trigger mechanisms. Specifically, we propose three distinct attack mechanisms: word injection, scenario manipulation, and knowledge injection, targeting various components in the LLM-based decision-making pipeline. We perform extensive experiments on representative LLMs (GPT-3.5, LLaMA2, PaLM2) in autonomous driving and home robot tasks, demonstrating the effectiveness and stealthiness of our backdoor triggers across various attack channels, with cases like vehicles accelerating toward obstacles and robots placing knives on beds. Our word and knowledge injection attacks achieve nearly 100% success rate across multiple models and datasets while requiring only limited access to the system. Our scenario manipulation attack yields success rates exceeding 65%, reaching up to 90%, and does not require any runtime system intrusion. We also assess the robustness of these attacks against defenses, revealing their resilience. Our findings highlight critical security vulnerabilities in embodied LLM systems and emphasize the urgent need for safeguarding these systems to mitigate potential risks.
This paper presents a novel framework, (Large Language Model) LLM-Augmented Reinforcement Learning (LLM-RL) to achieve adaptive and intelligent decision-making in dynamic environments. Converting the capacity of Large Language Models (LLMs) in semantic reasoning and producing generalization capabilities of the Visual Form-aided Autonomous Driving Task In contrast to more traditional reinforcement learning algorithms which rely purely on trial and error-based exploration, the proposed approach ensures that the semantic delicate mass idea and the capacity of contextual generation is combined into the exploration for policy optimization. The LLM offers high-level action priors, interpretive state representations and natural language guided reward shaping, reducing the sample inefficiency and advancing convergence. Moreover, LLM-based meta-prompting allows support for adaptability to previously unseen tasks without retraining. To test performance, the framework was tested over multiple benchmarks of multi-agent control, resource allocation and sequential decision-making. Experimental results show that we achieve an average improvement of 21.4% in cumulative reward, faster convergence of 18.7% and 25.2% reduction of catastrophic exploration error over state-of-the-art RL baselines. These results demonstrate the potential of LLM - RL for enabling a new paradigm of trustworthy, scalable, and adaptive decision-making in complex systems.
Swarm robotic systems consist of a large number of distributed autonomous robots that coordinate their actions to accomplish diverse tasks beyond the capabilities of a single robot. These systems have recently been considered for deployment in disaster scenarios, where communication is often unstable, making it necessary to achieve adaptive cooperative behavior without relying on explicit communication between robots. In the context of multi-robot systems—including swarm robotic systems—some studies have explored approaches utilizing large language models (LLMs) or other learning-based methods, but few have proposed systems that enable communication-free coordination. In this paper, we propose a system incorporating a novel method that combines high-level decision-making via LLM-based policy selection—guided by questionnaire-style prompts—with low-level control using multiple MARL-trained policies. We consider a complex task scenario in which robots search for a target object and transport it to a designated destination. To evaluate the method, we define implicit consensus as a condition in which a robot selects the same policy as its nearby robots without any explicit communication. The effectiveness of the proposed method is demonstrated through simulated task execution, with particular emphasis on implicit consensus as a key evaluation metric.
The Internet of Drone Things (IoDT) advances autonomous drone operations by integrating live sensor inputs with environmental and situational awareness and intelligent decision-making capabilities. The full capabilities of IoDT remain limited by the difficulties of dynamic task assignment and path optimization, along with adaptive decision-making, when operating in complex environments such as disaster relief and smart agriculture. Traditional task-scheduling techniques have difficulty adapting to real-time changes caused by dynamic constraints such as weather variations, battery limitations, and drone malfunctions. We present an LLM-based task scheduling framework that uses Large Language Models (LLMs) to improve task prioritization performance and path planning accuracy while minimizing operational failures. We combine heuristic algorithms (A*, Dijkstra) with decision-making processes driven by LLMs to allow drones to adapt to environmental changes while optimizing efficiency and resource consumption. Integrating LLM technology into IoDT operations results in up to 95% task completion rates and improves the scenario completion time by up to 42%, while adding reasonable computational overhead. Our framework demonstrates improved task adaptability, battery efficiency, and stronger system resilience against non-LLM baselines during disaster relief and package delivery operations. Our research shows that LLM-based IoDT task management has transformative potential, leading to the development of more innovative and autonomous drone ecosystems.
No abstract available
e13657 Background: With demanding schedules, oncologists often struggle to stay updated on the latest clinical trial data from publications, especially without access to pre-curated databases. Traditionally, building annotated clinical trial databases required extensive time and manual effort. To address these challenges, we explored assembling an L-SLR library of clinical trial data using extractions generated by an agentic LLM system and evaluated the system’s accuracy and associated time savings. Methods: Agentic LLM systems are autonomous systems where multiple LLMs maintain control over how they accomplish tasks with no human input or supervised training. Our system used three OpenAI LLM models (o1, o3, o1-mini) in a matrix of processes to emulate trained human experts by following an annotation manual, subdividing complex processes into smaller subtasks, and documenting its reasoning for traceable results. This methodology is inspired by the recent development from DeepSeek reinforcement learning and reasoning capabilities. Annotations were created for 4 review variables (population, intervention/comparator, outcome, study design) and 32 extraction variables, including clinical TNM staging, histology, biomarker, associated risk factors, treatment path (line, prior therapy etc.), interventions, intervention type; study randomization, phase and sample size; analysis type, follow-up period; reported outcomes (median and landmark overall survival (OS), progression-free survival (PFS) and other progression measures, response data, quality of life data); subgroup analyses, safety/toxicity. Accuracy of review and extraction was evaluated on publications in three cancer types: NSCLC, PC, BC compared to human results. Results: Our agentic LLM system generated annotations for 4 review variables for 19,407 publications (6,916 NSCLC, 6,978 PC, 5,513 BC) publications, and 32 extraction variables for 2,424 (1,356 NSCLC, 587 PC, 481 BC) publications. Accuracy for the review variables ranged from 93.8% to 97.2%. For extraction variables, accuracy exceeded 90% for all variables (91.5%-99.1%) with 50% of variables above 95%. Our system completed the annotations in 5.34 hours, compared to an estimated 727.39 hours by trained human researchers, resulting in 99.27% time savings. Conclusions: Our living SLR system can accurately review and extract clinical trial publications with performance comparable to human experts. This level of accuracy highlights our system’s potential to deliver real-time clinical data, empowering oncologists to make more informed treatment decisions, with the hopes of ultimately improving patient outcomes.
Obstacle avoidance in multi-lane traffic scenarios remains a critical challenge for autonomous vehicles, requiring robust decision-making and precise path planning to ensure safety and efficiency in dynamic environments. This paper proposes an integrated framework combining a Time-to-Collision (TTC)-based module for rapid risk assessment and a Large Language Model (LLM)-assisted decision-making module to handle complex situations involving conflicting risks. A novel Velocity-Direction Decomposition (VDD) kinematic model is introduced to address the limitations of classical Longitudinal-Lateral Decomposition (LLD) methods, ensuring smooth and dynamically feasible motion. Model Predictive Control (MPC) is employed to generate collision-free trajectories that respect vehicle dynamics while maintaining stability and passenger comfort. Simulations validate the framework across various scenarios, demonstrating its capability to adapt to diverse traffic conditions, enhance path feasibility, and improve overall system safety and efficiency.
Understanding how large language models (LLMs) generalize across diverse traffic scenarios is critical for advancing autonomous driving systems. While previous studies have validated LLMs’ potential in specific driving tasks, evaluations of their scenario adaptability remain limited. This research adopts the Dilu framework as a case study, with the objective of investigating the generalisation performance of LLMs in five typical scenarios: basic highway sections, highway merge area, intersection, racetrack, and roundabout, with varying traffic parameters. Through extensive experiments with 17 configurations in scenarios metioned above, we employ success rate (SR) and success steps (SS) as metrics to quantify LLMs’ generalization capabilities in different driving scenarios. The results reveal significant scenario-dependent performance variations: the LLM achieves a peak SR of 99% at 30 m/s in low-speed merges but declines to 69% at 60 m/s. In intersection scenarios, the LLM outperforms traditional reinforcement learning methods (DQN, PPO) by about three times (61% SR vs. 24% SR). Furthermore, expanding memory entries from 2-shot to 5-shot enhances median SS by 114% in roundabouts and 69% in intersections, highlighting the role of experience accumulation in dynamic environments. These findings provide empirical evidence for LLMs’ scenario-aware generalization capabilities and offer actionable insights for optimizing their deployment in real-world autonomous driving systems.
Recent advancements in autonomous vehicles (AVs) leverage Large Language Models (LLMs) to perform well in normal driving scenarios. However, ensuring safety in dynamic, high-risk environments and managing safety-critical long-tail events remains a significant challenge. To address these issues, we propose SafeDrive, a knowledge- and data-driven risk-sensitive decision-making framework, to enhance AV safety and adaptability. The proposed framework introduces a modular system comprising: (1) a Risk Module for comprehensive quantification of multi-factor coupled risks involving driver, vehicle, and road interactions; (2) a Memory Module for storing and retrieving typical scenarios to improve adaptability; (3) a LLM-powered Reasoning Module for context-aware safety decision-making; and (4) a Reflection Module for refining decisions through iterative learning. By integrating knowledge-driven insights with adaptive learning mechanisms, the framework ensures robust decision-making under uncertain conditions. Extensive evaluations on real-world traffic datasets characterized by dynamic and high-risk scenarios, including highways (HighD), intersections (InD), and roundabouts (RounD), validate the framework's ability to enhance decision-making safety (achieving a 100% safety rate), replicate human-like driving behaviors (with decision alignment exceeding 85%), and adapt effectively to unpredictable scenarios. The proposed framework of SafeDrive establishes a novel paradigm for integrating knowledge- and data-driven methods, highlighting significant potential to improve the safety and adaptability of autonomous driving in long-tail or high-risk traffic scenarios. Project page: https://mezzi33.github.io/SafeDrive/.
In the field of autonomous surface vehicles (ASVs), devising decision-making and obstacle avoidance solutions that address maritime COLREGs (Collision Regulations), primarily defined for human operators, has long been a pressing challenge. Recent advancements in explainable Artificial Intelligence (AI) and machine learning have shown promise in enabling human-like decision-making. Notably, significant developments have occurred in the application of Large Language Models (LLMs) to the decision-making of complex systems, such as self-driving cars. The textual and somewhat ambiguous nature of COLREGs (from an algorithmic perspective), however, poses challenges that align well with the capabilities of LLMs, suggesting that LLMs may become increasingly suitable for this application soon. This paper presents and demonstrates the first application of LLM-based decision-making and control for ASVs. The proposed method establishes a high-level decision-maker that uses online collision risk indices and key measurements to make decisions for safe manoeuvres. A tailored design and runtime structure is developed to support training and real-time action generation on a realistic ASV model. Local planning and control algorithms are integrated to execute the commands for waypoint following and collision avoidance at a lower level. To the authors’ knowledge, this study represents the first attempt to apply explainable AI to the dynamic control problem of maritime systems recognising the COLREGs rules, opening new avenues for research in this challenging area. Results obtained across multiple test scenarios demonstrate the system’s ability to maintain online COLREGs compliance, accurate waypoint tracking, and feasible control, while providing human-interpretable reasoning for each decision.
Understanding and adhering to traffic regulations is essential for autonomous vehicles to ensure safety and trustworthiness. However, traffic regulations are complex, context-dependent, and differ between regions, posing a major challenge to conventional rule-based decision-making approaches. We present an interpretable, regulation-aware decision-making framework, DriveReg, which enables autonomous vehicles to understand and adhere to region-specific traffic laws and safety guidelines. The framework integrates a Retrieval Augmented Generation (RAG)-based Traffic Regulation Retrieval Agent, which retrieves relevant rules from regulatory documents based on the current situation, and a Large Language Model (LLM)-powered Reasoning Agent that evaluates actions for legal compliance and safety. Our design emphasizes interpretability to enhance transparency and trustworthiness. To support systematic evaluation, we introduce DriveReg Scenarios Dataset, a comprehensive dataset of driving scenarios across Boston, Singapore, and Los Angeles, with both hypothesized text-based cases and real-world driving data, specifically constructed and annotated to evaluate models’ capacity for regulation understanding and reasoning. We validate our framework on the DriveReg Scenarios Dataset and real-world deployment, demonstrating strong performance and robustness across diverse environments.
Drones have seen a massive surge in popularity, driven by technological advancements, falling costs and versatility. Despite this, control schemes for drones are still a challenge to learn for most people, one that needs extensive time and training to master. Additionally, requiring an individual operator for every deployed drone limits the number of drones and the speed with which they can be deployed for a given task. Large Language Models (LLMs) are instrumental in passing this hurdle. By integrating their natural language understanding, it is possible to simplify their control scheme for use by a layman, and even to fully automate their operation by assigning an LLM as the operator itself. This paper presents an LLM driven framework for autonomous drone operation. The LLM is assigned the role of the drone operator, responsible for reasoning and decision making. A YOLOE-11L model is responsible for object detection and generating visual encodings to be interpreted by the LLM. The LLM generates actions to be taken in the environment by reasoning with the visual input and the context of the previously taken actions, which are passed along to a PX4 flight control stack. The test environment is a real life environment photogrammatically recreated in a high fidelity simulation in Unreal Engine 5. To evaluate the system, a representative mission "Land on a couch" is performed multiple times to test the robustness of the system. Experimental results show that the system can effectively interpret sensor data, perform spatial reasoning and perform safe trajectories without domain specific training.
With recent advances in multi‐modal foundation models, the previously text‐only large language models (LLM) have evolved to incorporate visual input, opening up unprecedented opportunities for various applications in visualization. Compared to existing work on LLM‐based visualization works that generate and control visualization with textual input and output only, the proposed approach explores the utilization of the visual processing ability of multi‐modal LLMs to develop Autonomous Visualization Agents (AVAs) that can evaluate the generated visualization and iterate on the result to accomplish user‐defined objectives defined through natural language. We propose the first framework for the design of AVAs and present several usage scenarios intended to demonstrate the general applicability of the proposed paradigm. Our preliminary exploration and proof‐of‐concept agents suggest that this approach can be widely applicable whenever the choices of appropriate visualization parameters require the interpretation of previous visual output. Our study indicates that AVAs represent a general paradigm for designing intelligent visualization systems that can achieve high‐level visualization goals, which pave the way for developing expert‐level visualization agents in the future.
Large language models (LLMs) have received considerable interest recently due to their outstanding reasoning and comprehension capabilities. This letter explores applying LLMs to vehicular networks, aiming to jointly optimize vehicle-to-infrastructure (V2I) communications and autonomous driving (AD) policies. We deploy LLMs for AD decision-making to maximize traffic flow and avoid collisions for road safety, and a double deep Q-learning algorithm (DDQN) is used for V2I optimization to maximize the received data rate and reduce frequent handovers. In particular, for LLM-enabled AD, we employ the Euclidean distance to identify previously explored AD experiences, and then LLMs can learn from past good and bad decisions for further improvement. Then, LLM-based AD decisions will become part of states in V2I problems, and DDQN will optimize the V2I decisions accordingly. After that, the AD and V2I decisions are iteratively optimized until convergence. Such an iterative optimization approach can better explore the interactions between LLMs and conventional reinforcement learning techniques, revealing the potential of using LLMs for network optimization and management. Finally, the simulations demonstrate that our proposed hybrid LLM-DDQN approach outperforms the conventional DDQN algorithm, showing faster convergence and higher average rewards.
Large language models (LLMs) are promising for autonomous driving decision‐making, but existing methods mostly rely on cloud‐side deployment, causing high decision latency, privacy concerns and a lack of explicit safety verification for generated actions. To address these challenges, we propose SEDM (safety‐enhanced decision‐making framework) for highway driving scenarios. SEDM comprises an environment encoding module, an edge‐side LLM‐based decision‐making module enhanced through chain‐of‐thought prompting and low‐rank adaptation (LoRA) fine‐tuning, and an XGBoost‐based safety shield module that filters unsafe actions generated by the LLM. Experiments show that SEDM achieves driving success rates of 95%, 82% and 55% under simple, normal and dense traffic conditions, respectively—substantially outperforming such as deep Q‐network and proximal policy optimization. Moreover, it yields a 17‐percentage‐point improvement in success rate over an ablated variant without the safety shield module. Furthermore, decision latency is reduced from 7.80 s (cloud‐side LLM) to 1.01 s.
Large language models (LLMs) exhibit strong semantic reasoning capabilities for autonomous driving decision-making; however, their substantial inference latency poses a critical challenge for real-time closed-loop vehicle control. This study proposes an engineering-oriented framework to enable latency-constrained LLM-based decision-making by integrating bird’s-eye-view (BEV) structured perception with low-bit quantized inference. The BEV perception module compresses multi-view visual inputs into structured semantic representations, thereby reducing input redundancy and enhancing inference efficiency. In addition, 4-bit post-training quantization (PTQ), combined with an optimized inference engine, is employed to alleviate computational and memory bandwidth constraints during autoregressive decoding. Experiments conducted on the CARLA simulation platform under car-following, overtaking, and mixed driving scenarios—validated through 500 independent trials—demonstrate that the proposed framework substantially reduces end-to-end inference latency while maintaining stable decision-making performance. The results indicate that the system satisfies the 10 Hz real-time control requirement and significantly improves control quality, as evidenced by reduced collision rates and lower Average Jerk compared with both traditional imitation learning (Behavioral Cloning, BC) and the Transformer-based TransFuser baseline. Furthermore, sensitivity analyses confirm the robustness of the framework under environmental degradation and perception noise, underscoring the practical feasibility of deploying LLMs for safe and reliable closed-loop autonomous driving.
This study addresses the critical need for enhanced situational awareness in autonomous driving (AD) by leveraging the contextual reasoning capabilities of large language models (LLMs). Unlike traditional perception systems that rely on rigid, label-based annotations, it integrates real-time, multimodal sensor data into a unified, LLMs-readable knowledge base, enabling LLMs to dynamically understand and respond to complex driving environments. To over-come the inherent latency and modality limitations of LLMs, a proactive Retrieval-Augmented Generation (RAG) is designed for AD, combined with a chain-of-thought prompting mechanism, ensuring rapid and context-rich under-standing. Experimental results using real-world Vehicle-to-everything (V2X) datasets demonstrate significant improvements in perception and prediction performance, highlighting the potential of this framework to enhance safety, adaptability, and decision-making in next-generation AD systems.
Recent advances in vision–language models (VLMs) have transformed the field of robotics. Researchers are combining the reasoning capabilities of large language models (LLMs) with the visual information processing capabilities of VLMs in various domains. However, most efforts have focused on terrestrial robots and are limited in their applicability to volatile environments such as ocean surfaces and underwater environments, where real-time judgment is required. We propose a system integrating the cognition, decision making, path planning, and control of autonomous marine surface vehicles in the ROS2–Gazebo simulation environment using a multimodal vision–LLM system with zero-shot prompting for real-time adaptability. In 30 experiments, adding the path plan mode feature increased the success rate from 23% to 73%. The average distance increased from 39 m to 45 m, and the time required to complete the task increased from 483 s to 672 s. These results demonstrate the trade-off between improved reliability and reduced efficiency. Experiments were conducted to verify the effectiveness of the proposed system and evaluate its performance with and without adding a path-planning step. The final algorithm with the path-planning sub-process yields a higher success rate, and better average path length and time. We achieve real-time environmental adaptability and performance improvement through prompt engineering and the addition of a path-planning sub-process in a limited structure, where the LLM state is initialized with every application programming interface call (zero-shot prompting). Additionally, the developed system is independent of the vision–LLM archetype, making it scalable and adaptable to future models.
The integration of human-intuitive interactions into autonomous systems has been limited. Traditional Natural Language Processing (NLP) systems struggle with context and intent understanding, severely restricting human-robot interaction. Recent advancements in Large Language Models (LLMs) have transformed this dynamic, allowing for intuitive and high-level communication through speech and text, and bridging the gap between human commands and robotic actions. Addition-ally, autonomous navigation has emerged as a central focus in robotics research, with artificial intelligence (AI) increasingly being leveraged to enhance these systems. However, existing AI-based navigation algorithms face significant challenges in latency-critical tasks where rapid decision-making is critical. Traditional frame-based vision systems, while effective for high-level decision-making, suffer from high energy consumption and latency, limiting their applicability in real-time scenarios. Neuromorphic vision systems, combining event-based cameras and spiking neural networks (SNNs), offer a promising alternative by enabling energy-efficient, low-latency navigation. Despite their potential, real-world implementations of these systems, particularly on physical platforms such as drones, remain scarce. In this work, we present Neuro-LIFT, a real-time neuromorphic navigation framework implemented on a Parrot Bebop2 quadrotor. Leveraging an LLM for natural language processing, Neuro-LIFT translates human speech into high-level planning commands which are then autonomously executed using event-based neuromorphic vision and physics-driven planning. Our framework demonstrates its capabilities in navigating in a dynamic environment, avoiding obstacles, and adapting to human instructions in real-time. Demonstration images of Neuro-LIFT navigating through a moving ring in an indoor setting is provided, showcasing the system’s interactive, collaborative potential in autonomous robotics.
Connected Autonomous Vehicles (CAVs) are being tested globally, but their performance in complex scenarios remains suboptimal. While cooperative driving improves CAV performance by leveraging vehicle collaboration, its lack of interaction and continuous learning limits current applications to single scenarios and specific Cooperative Driving Automation (CDA). To address these issues, this paper proposes CoDrivingLLM, an interactive and learnable LLM-driven cooperative driving framework for all-scenario and all-CDA applications. First, an environment module updates vehicle positions based on semantic decisions, mitigating errors from LLM-controlled positioning. Second, leveraging the four CDA levels defined in SAE J3216, a centralized-distributed coupled architecture reasoning module is developed to ensure safe and efficient cooperation through centralized negotiation and distributed decision. Finally, by introducing a memory module that employs Retrieval Augmented Generation (RAG), CAVs are endowed with the ability to learn from their past experiences to avoid repeating mistakes. Through ablation studies and comparisons with other cooperative driving methods, the results demonstrate that the proposed CoDrivingLLM significantly enhances safety, efficiency, and adaptability across various scenarios.
Autonomous driving has entered the testing phase, but due to the limited decision-making capabilities of individual vehicle algorithms, safety and efficiency issues have become more apparent in complex scenarios. With the advancement of connected communication technologies, autonomous vehicles equipped with connectivity can leverage vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications, offering a potential solution to the decision-making challenges from individual vehicle's perspective. We propose a multi-level vehicle-infrastructure cooperative decision-making framework for complex conflict scenarios at unsignalized intersections. First, based on vehicle states, we define a method for quantifying vehicle impacts and their propagation relationships, using accumulated impact to group vehicles through motif-based graph clustering. Next, within and between vehicle groups, a pass order negotiation process based on Large Language Models (LLM) is employed to determine the vehicle passage order, resulting in planned vehicle actions. Simulation results from ablation experiments show that our approach reduces negotiation complexity and ensures safer, more efficient vehicle passage at intersections, aligning with natural decision-making logic.
The integration of Large Language Models (LLMs) into flight control systems (FCS) offers a transformative approach to enhance autonomous decision-making, moving beyond traditional rule-based and physics-based methods. LLMs possess potent capabilities in contextual understanding, multi-source information fusion, and multistep reasoning, making them promising for complex tasks like mission planning and anomaly resolution. However, the inherent instability of LLM outputs presents a significant challenge to the reliability and safety crucial for flight operations. To address this, this paper introduces an innovative iterative decision-making framework. This framework leverages snapshot simulation to provide a predictive information as feedback to the LLM. Then iteratively refines its decision-making process based on this feedback loop, aiming to improve the reliability and interpretability of the generated control actions. Preliminary experimental results indicate that the proposed framework enables the LLM to successfully complete complex flight tasks while improving decision consistency and adherence to safety constraints, thereby paving the way for more robust and reliable LLM-based FCS.
Large Language Model (LLM)-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for solving complex, real-world tasks with tedious workflows and frequent errors. However, current self-evolution MAS approaches require extensive manual tuning and lack dynamic adaptation, precise issue identification, and modular flexibility. This paper presents E3-MAS, a general-purpose self-evolution framework that organizes agents into three interacting teams—Execution, Evaluation, and Evolution. We detail role-level designs (Planner/Executor/Replanner, Critic/Evaluator, Analyzer/Prompt Optimizer) and show how task-aware evaluation drives problem attribution and prompt refinement. Using a school administrative assistance scenario (C-Pilot), we demonstrate intelligent search and pipeline automation. In a leave-application case study, E3-MAS improves the task progress rate from 0.83 to 1.0 after one evolution cycle. More broadly, the framework consistently achieves progress rates exceeding 0.9, reduces manual prompt-tuning time by over 50%, and enables modular deployment of MAS applications across heterogeneous environments. These results highlight the potential of E3-MAS as a scalable and adaptive paradigm for reliable multi-agent collaboration.
The continuous evolution of the Android ecosystem has led to a highly dynamic and fragmented development environment. This constant churn makes building Android projects, especially from open-source repositories, a notoriously difficult task. Developers and researchers encounter a daunting build barrier due to the rapid configuration drift, which results in a cascade of errors. These errors include version incompatibilities, missing dependencies, and inconsistent project configurations, hindering reproducibility and maintainability.To address these issues, we present BuilDroid, an LLM-based agent that automates the build process of Android projects. Operating within a self-contained, isolated environment, BuilDroid runs an iterative, self-correcting loop. Through this operation, BuilDroid captures errors and autonomously resolves them, either through predefined heuristics or by leveraging the reasoning capabilities of its underlying LLM.Across 245 open-source Android projects, BuilDroid effectively resolves complex and evolving build errors, achieving a build success rate of 90.2%, surpassing existing solutions by a margin of over 30.2 percentage points. Consequently, BuilDroid reduces the barrier for researchers and developers, fostering greater software reproducibility and enabling more extensive and reliable empirical research within this rapidly evolving ecosystem.Video demo: https://youtu.be/YAFLu7NSl5E
The rapid advancement of large language models (LLMs) has transformed the landscape of agentic information seeking capabilities through the integration of tools such as search engines and web browsers. However, current mainstream approaches for enabling LLM web search proficiency face significant challenges: supervised fine-tuning struggles with data production in open-search domains, while RL converges quickly, limiting their data utilization efficiency. To address these issues, we propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data. Extensive experiments on seven multi-hop question-answering (MHQA) benchmarks demonstrate that EvolveSearch consistently improves performance across iterations, ultimately achieving an average improvement of 4.7\% over the current state-of-the-art across seven benchmarks, opening the door to self-evolution agentic capabilities in open web search domains.
Diplomacy is one of the most sophisticated activities in human society, involving complex interactions among multiple parties that require skills in social reasoning, negotiation, and long-term strategic planning. Previous AI agents have demonstrated their ability to handle multi-step games and large action spaces in multi-agent tasks. However, diplomacy involves a staggering magnitude of decision spaces, especially considering the negotiation stage required. While recent agents based on large language models (LLMs) have shown potential in various applications, they still struggle with extended planning periods in complex multi-agent settings. Leveraging recent technologies for LLM-based agents, we aim to explore AI's potential to create a human-like agent capable of executing comprehensive multi-agent missions by integrating three fundamental capabilities: 1) strategic planning with memory and reflection; 2) goal-oriented negotiation with social reasoning; and 3) augmenting memory through self-play games for self-evolution without human in the loop.
We introduce FinMem, a novel Large Language Models (LLM)-based agent framework for financial trading, designed to address the need for automated systems that can transform real-time data into executable decisions. FinMem comprises three core modules: Profile for customizing agent characteristics, Memory for hierarchical financial data assimilation, and Decision-making for converting insights into investment choices. The Memory module, which mimics human traders’ cognitive structure, offers interpretability and real-time tuning while handling the critical timing of various information types. It employs a layered approach to process and prioritize data based on its timeliness and relevance, ensuring that the most recent and impactful information is given appropriate weight in decision-making. FinMem’s adjustable cognitive span allows retention of critical information beyond human limits, enabling it to balance historical patterns with current market dynamics. This framework facilitates self-evolution of professional knowledge, agile reactions to investment cues, and continuous refinement of trading decisions in financial environments. When compared against advanced algorithmic agents using a large-scale real-world financial dataset, FinMem demonstrates superior performance across classic metrics like Cumulative Return and Sharpe ratio. Further tuning of the agent’s perceptual span and character setting enhances its trading performance, positioning FinMem as a cutting-edge solution for automated trading.
In the literature, existing human-centric emotional motion generation methods primarily focus on boosting performance within a single scale-fixed dataset, largely neglecting the flexible and scale-increasing motion scenarios (e.g., sports, dance), whereas effectively learning these newly emerging scenarios can significantly enhance the model’s real-world generalization ability. Inspired by this, this paper proposes a new LLM-Centric Lifelong Empathic Motion Generation (L2-EMG) task, which aims to equip LLMs with the capability to continually acquire emotional motion generation knowledge across different unseen scenarios, potentially contributing to building a closed-loop and self-evolving embodied agent equipped with both empathy and intelligence. Further, this paper poses two key challenges in the L2-EMG task, i.e., the emotion decoupling challenge and the scenario adapting challenge. To this end, this paper proposes an Emotion-Transferable and Scenario-Adapted Mixture of Experts (ES-MoE) approach which designs a causal-guided emotion decoupling block and a scenario-adapted expert constructing block to address the two challenges, respectively. Especially, this paper constructs multiple L2-EMG datasets to validate the effectiveness of the ES-MoE approach. Extensive evaluations show that ES-MoE outperforms advanced baselines.
LLM-based agents have shown promise in various cooperative and strategic reasoning tasks, but their effectiveness in competitive multi-agent environments remains underexplored. To address this gap, we introduce PillagerBench, a novel framework for evaluating multi-agent systems in real-time competitive team-vs-team scenarios in Minecraft. It provides an extensible API, multi-round testing, and rule-based built-in opponents for fair, reproducible comparisons. We also propose TactiCrafter, an LLM-based multi-agent system that facilitates teamwork through human-readable tactics, learns causal dependencies, and adapts to opponent strategies. Our evaluation demonstrates that TactiCrafter outperforms baseline approaches and showcases adaptive learning through self-play. Additionally, we analyze its learning process and strategic evolution over multiple game episodes. To encourage further research, we have open-sourced PillagerBench, fostering advancements in multi-agent AI for competitive environments.
This comprehensive article examines the evolution of reasoning capabilities in Large Language Model (LLM) agents, focusing on advanced frameworks and quality improvement approaches. The article explores key developments in agent reasoning mechanisms, including Tree-of-Thought and hierarchical reasoning structures, which have transformed problem-solving capabilities beyond simple input-output paradigms. It analyzes quality hillclimbing techniques such as Self-Refine and OPRO that systematically enhance model outputs through iterative refinement and optimization. The article presents empirical results quantifying improvements in reasoning quality and computational efficiency, followed by practical implementation frameworks and architectural considerations for deploying these systems at scale. Future directions in advanced reasoning paradigms and optimization methods are discussed alongside real-world applications in business decision-making and technical problem-solving that demonstrate the practical impact of these theoretical advances.
Decarbonisation and the proliferation of Distributed Energy Resources (DERs) such as residential solar photovoltaic (PV) systems and electric vehicles (EVs) are creating unprecedented challenges for electricity grid management, necessitating advanced, behaviourally realistic simulation models. This paper proposes a novel framework for simulating household energy consumption using Large Language Model (LLM) agents, moving beyond static simulations to generate dynamic, context-aware individual behaviours. The methodology integrates a structured Chain-of-Thought (CoT) prompting strategy with contextual data to generate transparent daily activity schedules. Its key innovation is a formal mechanism for agent memory evolution, where AI-driven self-reflection on past actions organically updates long-term characteristics, enabling the simulation of learned habits and emergent behaviours. Validation against Pecan Street data suggests the framework produces plausible outputs, representing a significant step toward high-fidelity energy models. This work opens new possibilities for analysing complex policy questions related to demand response, energy equity, and technology adoption. While acknowledging the challenges of LLM-based simulation, this approach offers a novel method for capturing nuanced, context-driven human behaviour and the impacts on energy consumption.
The rapid evolution of Artificial intelligence (AI) from passive “knowledge co-pilots” to autonomous “research partners” is initiating a paradigm shift in scientific discovery, a frontier now termed Agentic Science. However, applying general-purpose AI systems to dynamic, vertically integrated domains such as hydrogen energy reveals critical limitations, including a lack of deep domain knowledge, an inability to process real-time information, and insufficient autonomous planning capabilities. To address these challenges, we introduce Knowledge-Extractor, a self-evolving scientific framework for building domain-expert AI agents, which we implement and evaluate in the hydrogen energy domain via an agent named Hydrogen-Agent. The core of our framework is a Hybrid Knowledge Integration strategy, which synergistically combines a domain-fine-tuned large language model (LLM) as its "cognitive core" with a continuously updated, non-parametric knowledge base.This architecture is augmented by an autonomous toolset comprising a PolicyRetriever (for extracting information from policy documents), a WebBrowser (for retrieving online sources), and an ArxivAnalyzer (for analyzing scientific papers from arXiv). We demonstrate that through an autonomous knowledge loop, Hydrogen-Agent overcomes the static knowledge limitations of traditional models. Our experiments validate a “specialization effect” where domain-specific fine-tuning enhances factual accuracy on our HydroBench benchmark, outperforming its base model and powerful generalist LLMs. Furthermore, three case studies illustrates the ability of the agent to autonomously conduct complex, end-to-end research tasks, from multi-source data gathering to the generation of a strategic analysis report. Hydrogen-Agent serves as a robust prototype for future scientific agents, showcasing a viable path toward creating domain-expert AI that can accelerate discovery in critical scientific fields.
As the primary entry point to modern digital services, Web applications are now subjected to the fastest-evolving threat landscape on the Internet. Consequently, ML-based Web Application Firewall (WAF) exhibits degraded accuracy when exposed to novel attack patterns, while regex-driven solution remains bottlenecked by manual rule crafting, impeding agile response to emergent threats. Large Language Models (LLMs) bring to bear capabilities such as real-time Internet-scale intelligence gathering, symbolic code reasoning, and targeted analytic generation. We introduce STAR-Shield, a LLM-powered adaptive rule-evolution framework engineered for Web Application Firewall-as-a-service (FWaaS). Through a multi-agent choreography, STAR-Shield automates the full cycle: harvesting and analyzing new threats and vulnerabilities, reconstructing attack payloads, synthesizing and refining regular-expression rules, and enforcing runtime interception. Evaluation within a controlled, real-world environment shows that STAR-Shield can empower a cloud-hosted WAF to achieve over 98% interception accuracy against One-day attacks with a false-positive rate below 0.4%.
Multi-agent systems (MAS) have emerged as a promising approach for enhancing the reasoning capabilities of large language models in complex problem-solving; however, current MAS frameworks suffer from poor flexibility and scalability with underdeveloped optimization strategies. To address these challenges, we propose ReSo, which integrates task graph generation with a reward-driven two-stage agent selection process centered on our Collaborative Reward Model that provides fine-grained reward signals to optimize MAS cooperation. We also introduce an automated data synthesis framework for generating MAS benchmarks without any human annotations. Experimental results show that ReSo matches or outperforms existing methods, achieving 33.7 percent accuracy on Math-MAS and 32.3 percent accuracy on SciBench-MAS, where other approaches completely fail.
Social difficulties have become an increasingly serious issue among older adults. For older adults, regular self-disclosure is essential for maintaining mental health and building close relationships. Leveraging conversational agents to encourage self-disclosure in older adults has shown increasing potential. Understanding how LLM-based agents can influence and stimulate self-disclosure across different topics is crucial for designing future agents tailored to older users. This study introduces Disclosure-Agent, an LLM-based conversational agent, and examines its impact on self-disclosure in older adults through a user study involving 20 participants, 8 topics, and two interactive interfaces equipped with Disclosure-Agent. The findings provide valuable insights into how LLM-based agents can promote self-disclosure in older adults and offer design recommendations for future elderly-oriented conversational agents.
The evolution of cooperation has been extensively studied using abstract mathematical models and simulations. Recent advances in Large Language Models (LLMs) and the rise of LLM agents have demonstrated their ability to perform social reasoning, thus providing an opportunity to test the emergence of norms in more realistic agent-based simulations with human-like reasoning using natural language. In this research, we investigate whether the cooperation dynamics presented in Boyd and Richerson's model persist in a more realistic simulation of the Diner's Dilemma using LLM agents compared to the abstract mathematical nature in the work of Boyd and Richerson. Our findings indicate that agents follow the strategies defined in the Boyd and Richerson model, and explicit punishment mechanisms drive norm emergence, reinforcing cooperative behaviour even when the agent strategy configuration varies. Our results suggest that LLM-based Multi-Agent System simulations, in fact, can replicate the evolution of cooperation predicted by the traditional mathematical models. Moreover, our simulations extend beyond the mathematical models by integrating natural language-driven reasoning and a pairwise imitation method for strategy adoption, making them a more realistic testbed for cooperative behaviour in MASs.
Hardware complexity continues to strain verification resources, motivating the adoption of machine learning (ML) methods to improve debug efficiency. However, ML-assisted debugging critically depends on diverse and scalable bug datasets, which existing manual or automated bug insertion methods fail to reliably produce. We introduce BugGen, a first of its kind, fully autonomous, multi-agent pipeline leveraging Large Language Models (LLMs) to systematically generate, insert, and validate realistic functional bugs in RTL. BugGen partitions modules, selects mutation targets via a closed-loop agentic architecture, and employs iterative refinement and rollback mechanisms to ensure syntactic correctness and functional detectability. Evaluated across five OpenTitan IP blocks, BugGen produced 500 unique bugs with 94% functional accuracy and achieved a throughput of 17.7 validated bugs per hour—over five times faster than typical manual expert insertion. Additionally, BugGen identified 104 previously undetected bugs in OpenTitan regressions, highlighting its utility in exposing verification coverage gaps. Compared against Certitude, BugGen demonstrated over twice the syntactic accuracy, deeper exposure of testbench blind spots, and more functionally meaningful and complex bug scenarios. Furthermore, when these BugGen-Generated datasets were employed to train ML-based failure triage models, we achieved high classification accuracy (88.1%–93.2%) across different IP blocks, confirming the practical utility and realism of generated bugs. BugGen thus provides a scalable solution for generating high-quality bug datasets, significantly enhancing verification efficiency and ML-assisted debugging.
No abstract available
To address insufficient contextualization, weak generalization, and poor scenario adaptation in tourism meteorological services, we propose SmartWeatherAgent—a unified three-stage architecture integrating intent recognition, hazard prediction, and reasoning-enhanced generation. The system fuses rule-based methods with large language models to parse queries at multiple granularities and employs a LightGBM model enriched with highland-specific features (e.g., wind speed abruptness rate), achieving an F1-Macro score of 0.605 with 1.60 ms latency on high-wind, precipitation, and low-temperature events. A 12-round micro-step prompt self-optimization loop boosts the composite warning quality score $S_{\text {final }}$ from 4.2(B01) to 8.9(B12,+112%). Key improvements include a sharp rise in B 08 from data source citation $(6.5 \rightarrow 8.5)$, sustained high performance in B10 via physical mechanism explanation, and a peak scientific rigor score of 9.2 in B12 through explicit uncertainty statements. The system autonomously generates structured warnings that integrate causal mechanisms, spatiotemporal evolution, quantitative evidence, regulatory references, and confidence statements—enhancing professional depth, logical rigor, and scientific soundness, and advancing meteorological services toward proactive perception, explainable decision-making, and intelligent agency.
Social media platforms such as Twitter, Reddit, and Sina Weibo playa crucial role in global communication but often encounter strict regulations in geopolitically sensitive regions. This situation has prompted users to ingeniously modify their way of communicating, frequently resorting to coded language in these regulated social media environments. This shift in communication is not merely a strategy to counteract regulation, but a vivid manifestation of language evolution, demonstrating how language naturally evolves under societal and technological pressures. Studying the evolution of language in regulated social media contexts is of significant importance for ensuring freedom of speech, optimizing content moderation, and advancing linguistic research. This paper proposes a multi-agent simulation frame-work using Large Language Models (LLMs) to explore the evolution of user language in regulated social media environments. The framework employs LLM-driven agents: supervisory agent who enforce dialogue supervision and participant agents who evolve their language strategies while engaging in conversation, simulating the evolution of communication styles under strict regulations aimed at evading social media regulation. The study evaluates the framework's effectiveness through a range of scenarios from abstract scenarios to real-world situations. Key findings indicate that LLMs are capable of simulating nuanced language dynamics and interactions in constrained settings, showing improvement in both evading supervision and information accuracy as evolution progresses. Furthermore, it was found that LLM agents adopt different strategies for different scenarios. The reproduction kit can be accessed at https://github.com/BlueLinkXlGA-MAS.
In the domain of Human-Computer Interaction, focus groups represent a widely utilised yet resource-intensive methodology, often demanding the expertise of skilled moderators and meticulous preparatory efforts. This study introduces the ``Focus Agent,'' a Large Language Model (LLM) powered framework that simulates both the focus group (for data collection) and acts as a moderator in a focus group setting with human participants. To assess the data quality derived from the Focus Agent, we ran five focus group sessions with a total of 23 human participants as well as deploying the Focus Agent to simulate these discussions with AI participants. Quantitative analysis indicates that Focus Agent can generate opinions similar to those of human participants. Furthermore, the research exposes some improvements associated with LLMs acting as moderators in focus group discussions that include human participants.
Integrating large language models (LLMs) into personal assistants, like Xiao Ai and Blue Heart V, effectively enhances their ability to interact with humans, solve complex tasks, and manage IoT devices. Such assistants are also termed LLM-driven agents. Upon receiving user requests, the LLM-driven agent generates plans using an LLM, executes these plans through various tools, and then returns the response to the user. During this process, the latency for generating a plan with an LLM can reach tens of seconds, significantly degrading user experience. Real-world dataset analysis shows that about 30% of the requests received by LLM-driven agents are identical or similar, which allows the reuse of previously generated plans to reduce latency. However, it is difficult to accurately define the similarity between the request texts received by the LLM-driven agent through directly evaluating the original request texts. Moreover, the diverse expressions of natural language and the unstructured format of plan texts make implementing plan reuse challenging. To address these issues, we present and implement a plan reuse mechanism for LLM-driven agents called AgentReuse. AgentReuse leverages the similarities and differences among requests' semantics and uses intent classification to evaluate the similarities between requests and enable the reuse of plans. Experimental results based on a real-world dataset demonstrate that AgentReuse achieves a 93% effective plan reuse rate, an F1 score of 0.9718, and an accuracy of 0.9459 in evaluating request similarities, reducing latency by 93.12% compared with baselines without using the reuse mechanism.
In healthcare intelligence, the ability to fuse heterogeneous, multi-intent information from diverse clinical sources is fundamental to building reliable decision-making systems. Large Language Model (LLM)-driven information interaction systems currently showing potential promise in the healthcare domain. Nevertheless, they often suffer from information redundancy and coupling when dealing with complex medical intents, leading to severe hallucinations and performance bottlenecks. To this end, we propose MedAide, an LLM-based medical multi-agent collaboration framework designed to enable intent-aware information fusion and coordinated reasoning across specialized healthcare domains. Specifically, we introduce a regularization-guided module that combines syntactic constraints with retrieval augmented generation to decompose complex queries into structured representations, facilitating fine-grained clinical information fusion and intent resolution. Additionally, a dynamic intent prototype matching module is proposed to utilize dynamic prototype representation with a semantic similarity matching mechanism to achieve adaptive recognition and updating of the agent's intent in multi-round healthcare dialogues. Ultimately, we design a rotation agent collaboration mechanism that introduces dynamic role rotation and decision-level information fusion across specialized medical agents. Extensive experiments are conducted on four medical benchmarks with composite intents. Experimental results from automated metrics and expert doctor evaluations show that MedAide outperforms current LLMs and improves their medical proficiency and strategic reasoning.
Large Language Models (LLMs) have shown promise in automating code generation and software engineering tasks, yet they often struggle with complex, multi-file projects due to context limitations and knowledge gaps. We propose a novel context engineering workflow that combines multiple AI components: an Intent Translator (GPT-5) for clarifying user requirements, an Elicit-powered semantic literature retrieval for injecting domain knowledge, NotebookLM-based document synthesis for contextual understanding, and a Claude Code multi-agent system for code generation and validation. Our integrated approach leverages intent clarification, retrieval-augmented generation, and specialized sub-agents orchestrated via Claude's agent framework. We demonstrate that this method significantly improves the accuracy and reliability of code assistants in real-world repositories, yielding higher single-shot success rates and better adherence to project context than baseline single-agent approaches. Qualitative results on a large Next.js codebase show the multi-agent system effectively plans, edits, and tests complex features with minimal human intervention. We compare our system with recent frameworks like CodePlan, MASAI, and HyperAgent, highlighting how targeted context injection and agent role decomposition lead to state-of-the-art performance. Finally, we discuss the implications for deploying LLM-based coding assistants in production, along with lessons learned on context management and future research directions.
Human-level driving is an ultimate goal of autonomous driving. Conventional approaches formulate autonomous driving as a perception-prediction-planning framework, yet their systems do not capitalize on the inherent reasoning ability and experiential knowledge of humans. In this paper, we propose a fundamental paradigm shift from current pipelines, exploiting Large Language Models (LLMs) as a cognitive agent to integrate human-like intelligence into autonomous driving systems. Our approach, termed Agent-Driver, transforms the traditional autonomous driving pipeline by introducing a versatile tool library accessible via function calls, a cognitive memory of common sense and experiential knowledge for decision-making, and a reasoning engine capable of chain-of-thought reasoning, task planning, motion planning, and self-reflection. Powered by LLMs, our Agent-Driver is endowed with intuitive common sense and robust reasoning capabilities, thus enabling a more nuanced, human-like approach to autonomous driving. We evaluate our approach on the large-scale nuScenes benchmark, and extensive experiments substantiate that our Agent-Driver significantly outperforms the state-of-the-art driving methods by a large margin. Our approach also demonstrates superior interpretability and few-shot learning ability to these methods.
Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning (RL) is considered a key technology with significant potential for training such Agents; however, the effective application of RL to LLM Agents is still in its nascent stages and faces considerable challenges. Currently, this emerging field lacks in-depth exploration into RL approaches specifically tailored for the LLM Agent context, alongside a scarcity of flexible and easily extensible training frameworks designed for this purpose. To help advance this area, this paper first revisits and clarifies Reinforcement Learning methodologies for LLM Agents by systematically extending the Markov Decision Process (MDP) framework to comprehensively define the key components of an LLM Agent. Secondly, we introduce Agent-R1, a modular, flexible, and user-friendly training framework for RL-based LLM Agents, designed for straightforward adaptation across diverse task scenarios and interactive environments. We conducted experiments on Multihop QA benchmark tasks, providing initial validation for the effectiveness of our proposed methods and framework.
Physics-informed neural networks (PINNs) provide a powerful approach for solving partial differential equations (PDEs), but constructing a usable PINN remains labor-intensive and error-prone. Scientists must interpret problems as PDE formulations, design architectures and loss functions, and implement stable training pipelines. Existing large language model (LLM) based approaches address isolated steps such as code generation or architecture suggestion, but typically assume a formal PDE is already specified and therefore lack an end-to-end perspective. We present Lang-PINN, an LLM-driven multi-agent system that builds trainable PINNs directly from natural language task descriptions. Lang-PINN coordinates four complementary agents: a PDE Agent that parses task descriptions into symbolic PDEs, a PINN Agent that selects architectures, a Code Agent that generates modular implementations, and a Feedback Agent that executes and diagnoses errors for iterative refinement. This design transforms informal task statements into executable and verifiable PINN code. Experiments show that Lang-PINN achieves substantially lower errors and greater robustness than competitive baselines: mean squared error (MSE) is reduced by up to 3--5 orders of magnitude, end-to-end execution success improves by more than 50\%, and reduces time overhead by up to 74\%.
As large language model (LLM) based agents interact autonomously with one another, a new class of failures emerges that cannot be predicted from single agent performance: behavioral drifts in agent-agent conversations (AxA). Unlike human-agent interactions, where humans ground and steer conversations, AxA lacks such stabilizing signals, making these failures unique. We investigate one such failure, echoing, where agents abandon their assigned roles and instead mirror their conversational partners, undermining their intended objectives. Through experiments across $66$ AxA configurations, $4$ domains (3 transactional, 1 advisory), and $2500+$ conversations (over $250000$ LLM inferences), we show that echoing occurs across major LLM providers, with echoing rates as high as $70\%$ depending on the model and domain. Moreover, we find that echoing is persistent even in advanced reasoning models with substantial rates ($32.8\%$) that are not reduced by reasoning efforts. We analyze prompt, conversation dynamics, showing that echoing arises as interaction grows longer ($7+$ agent turns) and is not merely an artifact of sub-optimal experiment design. Finally, we introduce a protocol-level mitigation where targeted use of structured response reduces echoing to $9\%$.
In today's age, it is becoming increasingly difficult to decipher truth from lies. Every day, politicians, media outlets, and public figures make conflicting claims -- often about topics that can, in principle, be verified against structured data. For instance, statements about crime rates, economic growth or healthcare can all be verified against official public records and structured datasets. Building a system that can automatically do that would have sounded like science fiction just a few years ago. Yet, with the extraordinary progress in LLMs and agentic AI, this is now within reach. Still, there remains a striking gap between what is technically possible and what is being demonstrated by recent work. Most existing verification systems operate only on small, single-table databases -- typically a few hundred rows -- that conveniently fit within an LLM's context window. In this paper we report our progress on Thucy, the first cross-database, cross-table multi-agent claim verification system that also provides concrete evidence for each verification verdict. Thucy remains completely agnostic to the underlying data sources before deployment and must therefore autonomously discover, inspect, and reason over all available relational databases to verify claims. Importantly, Thucy also reports the exact SQL queries that support its verdict (whether the claim is accurate or not) offering full transparency to expert users familiar with SQL. When evaluated on the TabFact dataset -- the standard benchmark for fact verification over structured data -- Thucy surpasses the previous state of the art by 5.6 percentage points in accuracy (94.3% vs. 88.7%).
The integration of tool use into large language models (LLMs) enables agentic systems with real-world impact. In the meantime, unlike standalone LLMs, compromised agents can execute malicious workflows with more consequential impact, signified by their tool-use capability. We propose AgentGuard, a framework to autonomously discover and validate unsafe tool-use workflows, followed by generating safety constraints to confine the behaviors of agents, achieving the baseline of safety guarantee at deployment. AgentGuard leverages the LLM orchestrator's innate capabilities - knowledge of tool functionalities, scalable and realistic workflow generation, and tool execution privileges - to act as its own safety evaluator. The framework operates through four phases: identifying unsafe workflows, validating them in real-world execution, generating safety constraints, and validating constraint efficacy. The output, an evaluation report with unsafe workflows, test cases, and validated constraints, enables multiple security applications. We empirically demonstrate AgentGuard's feasibility with experiments. With this exploratory work, we hope to inspire the establishment of standardized testing and hardening procedures for LLM agents to enhance their trustworthiness in real-world applications.
Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines. However, LLMs are optimized for language generation instead of tool use during training or alignment, limiting their effectiveness as agents. To resolve this problem, previous work has first collected interaction trajectories between LLMs and environments, using only trajectories that successfully finished the task to fine-tune smaller models, making fine-tuning data scarce and acquiring it both difficult and costly. Discarding failed trajectories also leads to significant wastage of data and resources and limits the possible optimization paths during fine-tuning. In this paper, we argue that unsuccessful trajectories offer valuable insights, and LLMs can learn from these trajectories through appropriate quality control and fine-tuning strategies. By simply adding a prefix or suffix that tells the model whether to generate a successful trajectory during training, we improve model performance by a large margin on mathematical reasoning, multi-hop question answering, and strategic question answering tasks. We further analyze the inference results and find that our method provides a better trade-off between valuable information and errors in unsuccessful trajectories. To our knowledge, we are the first to demonstrate the value of negative trajectories and their application in agent-tunning scenarios. Our findings offer guidance for developing better agent-tuning methods and low-resource data usage techniques.
Corpus distillation for biomedical large language models (LLMs) seeks to address the pressing challenge of insufficient quantity and quality in open-source annotated scientific corpora, which remains a bottleneck for effective LLM training in biomedical research. This paper proposes a knowledge-driven, agentic framework for scientific corpus distillation, tailored explicitly for LLM training in the biomedical domain, addressing the challenge posed by the complex hierarchy of biomedical knowledge. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. This agentic framework collectively generates and refines domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.
Artificial intelligence is reshaping scientific exploration, but most methods automate procedural tasks without engaging in scientific reasoning, limiting autonomy in discovery. We introduce Materials Agents for Simulation and Theory in Electronic-structure Reasoning (MASTER), an active learning framework where large language models autonomously design, execute, and interpret atomistic simulations. In MASTER, a multimodal system translates natural language into density functional theory workflows, while higher-level reasoning agents guide discovery through a hierarchy of strategies, including a single agent baseline and three multi-agent approaches: peer review, triage-ranking, and triage-forms. Across two chemical applications, CO adsorption on Cu-surface transition metal (M) adatoms and on M-N-C catalysts, reasoning-driven exploration reduces required atomistic simulations by up to 90% relative to trial-and-error selection. Reasoning trajectories reveal chemically grounded decisions that cannot be explained by stochastic sampling or semantic bias. Altogether, multi-agent collaboration accelerates materials discovery and marks a new paradigm for autonomous scientific exploration.
This tutorial explores the advancements and challenges in the development of Large Language Models (LLMs) such as ChatGPT and Gemini. It addresses inherent limitations like temporal knowledge cutoffs, mathematical inaccuracies, and the generation of incorrect information, proposing solutions like Retrieval Augmented Generation (RAG), Program-Aided Language Models (PAL), and frameworks such as ReAct and LangChain. The integration of these techniques enhances LLM performance and reliability, especially in multi-step reasoning and complex task execution. The paper also covers fine-tuning strategies, including instruction fine-tuning, parameter-efficient methods like LoRA, and Reinforcement Learning from Human Feedback (RLHF) as well as Reinforced Self-Training (ReST). Additionally, it provides a comprehensive survey of transformer architectures and training techniques for LLMs. The source code can be accessed by contacting the author via email for a request.
Reasoning is a cognitive process of using evidence to reach a sound conclusion. The reasoning capability is essential for large language models (LLMs) to serve as the brain of the artificial general intelligence agent. Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities. However, we find that the fine-tuned LLMs suffer from an \textit{Assessment Misalignment} problem, i.e., they frequently assign higher scores to subpar COTs, leading to potential limitations in their reasoning abilities. To address this problem, we introduce an \textit{Alignment Fine-Tuning (AFT)} paradigm, which involves three steps: 1) fine-tuning LLMs with COT training data; 2) generating multiple COT responses for each question, and categorizing them into positive and negative ones based on whether they achieve the correct answer; 3) calibrating the scores of positive and negative responses given by LLMs with a novel constraint alignment loss. Specifically, the constraint alignment loss has two objectives: a) Alignment, which guarantees that positive scores surpass negative scores to encourage answers with high-quality COTs; b) Constraint, which keeps the negative scores confined to a reasonable range to prevent the model degradation. Beyond just the binary positive and negative feedback, the constraint alignment loss can be seamlessly adapted to the ranking situations when ranking feedback is accessible. Furthermore, we also delve deeply into recent ranking-based alignment methods, such as DPO, RRHF, and PRO, and discover that the constraint, which has been overlooked by these approaches, is also crucial for their performance. Extensive experiments on four reasoning benchmarks with both binary and ranking feedback demonstrate the effectiveness of AFT.
Large Language Models (LLMs) possess extensive knowledge and commonsense reasoning capabilities, making them valuable for creating powerful agents. However, existing LLM agent frameworks have not fully utilized past experiences for improvement. This work introduces a new LLM-based agent framework called Retrospex, which addresses this challenge by analyzing past experiences in depth. Unlike previous approaches, Retrospex does not directly integrate experiences into the LLM's context. Instead, it combines the LLM's action likelihood with action values estimated by a Reinforcement Learning (RL) Critic, which is trained on past experiences through an offline ''retrospection'' process. Additionally, Retrospex employs a dynamic action rescoring mechanism that increases the importance of experience-based values for tasks that require more interaction with the environment. We evaluate Retrospex in ScienceWorld, ALFWorld and Webshop environments, demonstrating its advantages over strong, contemporary baselines.
Large language model (LLM) services have recently begun offering a plugin ecosystem to interact with third-party API services. This innovation enhances the capabilities of LLMs, but it also introduces risks, as these plugins developed by various third parties cannot be easily trusted. This paper proposes a new attacking framework to examine security and safety vulnerabilities within LLM platforms that incorporate third-party services. Applying our framework specifically to widely used LLMs, we identify real-world malicious attacks across various domains on third-party APIs that can imperceptibly modify LLM outputs. The paper discusses the unique challenges posed by third-party API integration and offers strategic possibilities to improve the security and safety of LLM ecosystems moving forward. Our code is released at https://github.com/vk0812/Third-Party-Attacks-on-LLMs.
Multi-agent LLM frameworks are widely used to accelerate the development of agent systems powered by large language models (LLMs). These frameworks impose distinct architectural structures that govern how agents interact, store information, and coordinate tasks. However, their impact on system performance remains poorly understood. This gap is critical, as architectural choices alone can induce order-of-magnitude differences in latency and throughput, as well as substantial variation in accuracy and scalability. Addressing this challenge requires (i) jointly evaluating multiple capabilities, such as orchestration overhead, memory behavior, planning, specialization, and coordination, and (ii) conducting these evaluations under controlled, framework-level conditions to isolate architectural effects. Existing benchmarks focus on individual capabilities and lack standardized framework-level evaluation. We address these limitations by (i) introducing an architectural taxonomy for systematically comparing multi-agent LLM frameworks along fundamental dimensions, and (ii) developing MAFBench, a unified evaluation suite that integrates existing benchmarks under a standardized execution pipeline. Using MAFBench, we conduct a controlled empirical study across several widely used frameworks. Our results show that framework-level design choices alone can increase latency by over 100x, reduce planning accuracy by up to 30%, and lower coordination success from above 90% to below 30%. Finally, we translate our findings into concrete architectural design principles and framework selection guidance, and outline promising future research directions.
In transportation system demand modeling and simulation, agent-based models and microsimulations are current state-of-the-art approaches. However, existing agent-based models still have some limitations on behavioral realism and resource demand that limit their applicability. In this study, leveraging the emerging technology of large language models (LLMs) and LLM-based agents, we propose a general LLM-agent-based modeling framework for transportation systems. We argue that LLM agents not only possess the essential capabilities to function as agents but also offer promising solutions to overcome some limitations of existing agent-based models. Our conceptual framework design closely replicates the decision-making and interaction processes and traits of human travelers within transportation networks, and we demonstrate that the proposed systems can meet critical behavioral criteria for decision-making and learning behaviors using related studies and a demonstrative example of LLM agents' learning and adjustment in the bottleneck setting. Although further refinement of the LLM-agent-based modeling framework is necessary, we believe that this approach has the potential to improve transportation system modeling and simulation.
Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs). In finance, efforts have largely focused on single-agent systems handling specific tasks or multi-agent frameworks independently gathering data. However, the multi-agent systems' potential to replicate real-world trading firms' collaborative dynamics remains underexplored. TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance. Detailed architecture and extensive experiments reveal its superiority over baseline models, with notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown, highlighting the potential of multi-agent LLM frameworks in financial trading. TradingAgents is available at https://github.com/TauricResearch/TradingAgents.
As Large Language Models (LLMs) transition from static tools to autonomous agents, traditional evaluation benchmarks that measure performance on downstream tasks are becoming insufficient. These methods fail to capture the emergent social and cognitive dynamics that arise when agents communicate, persuade, and collaborate in interactive environments. To address this gap, we introduce a novel evaluation framework that uses multi-agent debate as a controlled "social laboratory" to discover and quantify these behaviors. In our framework, LLM-based agents, instantiated with distinct personas and incentives, deliberate on a wide range of challenging topics under the supervision of an LLM moderator. Our analysis, enabled by a new suite of psychometric and semantic metrics, reveals several key findings. Across hundreds of debates, we uncover a powerful and robust emergent tendency for agents to seek consensus, consistently reaching high semantic agreement (μ > 0.88) even without explicit instruction and across sensitive topics. We show that assigned personas induce stable, measurable psychometric profiles, particularly in cognitive effort, and that the moderators persona can significantly alter debate outcomes by structuring the environment, a key finding for external AI alignment. This work provides a blueprint for a new class of dynamic, psychometrically grounded evaluation protocols designed for the agentic setting, offering a crucial methodology for understanding and shaping the social behaviors of the next generation of AI agents. We have released the code and results at https://github.com/znreza/multi-agent-LLM-eval-for-debate.
Evaluating large language model (LLM)-based multi-agent systems remains a critical challenge, as these systems must exhibit reliable coordination, transparent decision-making, and verifiable performance across evolving tasks. Existing evaluation approaches often limit themselves to single-response scoring or narrow benchmarks, which lack stability, extensibility, and automation when deployed in enterprise settings at multi-agent scale. We present AEMA (Adaptive Evaluation Multi-Agent), a process-aware and auditable framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under human oversight. Compared to a single LLM-as-a-Judge, AEMA achieves greater stability, human alignment, and traceable records that support accountable automation. Our results on enterprise-style agent workflows simulated using realistic business scenarios demonstrate that AEMA provides a transparent and reproducible pathway toward responsible evaluation of LLM-based multi-agent systems. Keywords Agentic AI, Multi-Agent Systems, Trustworthy AI, Verifiable Evaluation, Human Oversight
This paper introduces Project Synapse, a novel agentic framework designed for the autonomous resolution of last-mile delivery disruptions. Synapse employs a hierarchical multi-agent architecture in which a central Resolution Supervisor agent performs strategic task decomposition and delegates subtasks to specialized worker agents responsible for tactical execution. The system is orchestrated using LangGraph to manage complex and cyclical workflows. To validate the framework, a benchmark dataset of 30 complex disruption scenarios was curated from a qualitative analysis of over 6,000 real-world user reviews. System performance is evaluated using an LLM-as-a-Judge protocol with explicit bias mitigation.
The integration of Large Language Models (LLMs) and knowledge graphs (KGs) has achieved remarkable success in various natural language processing tasks. However, existing methodologies that integrate LLMs and KGs often navigate the task-solving process solely based on the LLM's analysis of the question, overlooking the rich cognitive potential inherent in the vast knowledge encapsulated in KGs. To address this, we introduce Observation-Driven Agent (ODA), a novel AI agent framework tailored for tasks involving KGs. ODA incorporates KG reasoning abilities via global observation, which enhances reasoning capabilities through a cyclical paradigm of observation, action, and reflection. Confronting the exponential explosion of knowledge during observation, we innovatively design a recursive observation mechanism. Subsequently, we integrate the observed knowledge into the action and reflection modules. Through extensive experiments, ODA demonstrates state-of-the-art performance on several datasets, notably achieving accuracy improvements of 12.87% and 8.9%.
The rapid shift from stateless large language models (LLMs) to autonomous, goal-driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi-step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk. We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context. Evaluated across 30 real-world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy. This work reframes agent adoption as a necessity-driven design decision, ensuring autonomy is applied only when its benefits justify the costs.
Self-evolving LLM agents update their internal state across sessions, often by writing and reusing long-term memory. This design improves performance on long-horizon tasks but creates a security risk: untrusted external content observed during a benign session can be stored as memory and later treated as instruction. We study this risk and formalize a persistent attack we call a Zombie Agent, where an attacker covertly implants a payload that survives across sessions, effectively turning the agent into a puppet of the attacker. We present a black-box attack framework that uses only indirect exposure through attacker-controlled web content. The attack has two phases. During infection, the agent reads a poisoned source while completing a benign task and writes the payload into long-term memory through its normal update process. During trigger, the payload is retrieved or carried forward and causes unauthorized tool behavior. We design mechanism-specific persistence strategies for common memory implementations, including sliding-window and retrieval-augmented memory, to resist truncation and relevance filtering. We evaluate the attack on representative agent setups and tasks, measuring both persistence over time and the ability to induce unauthorized actions while preserving benign task quality. Our results show that memory evolution can convert one-time indirect injection into persistent compromise, which suggests that defenses focused only on per-session prompt filtering are not sufficient for self-evolving agents.
Driven by the rapid development of Large Language Models (LLMs), LLM-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping, etc. It is crucial to ensure the reliability and security of LLM-based agents during applications. However, the safety issues of LLM-based agents are currently under-explored. In this work, we take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents. We first formulate a general framework of agent backdoor attacks, then we present a thorough analysis of different forms of agent backdoor attacks. Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms: (1) From the perspective of the final attacking outcomes, the agent backdoor attacker can not only choose to manipulate the final output distribution, but also introduce the malicious behavior in an intermediate reasoning step only, while keeping the final output correct. (2) Furthermore, the former category can be divided into two subcategories based on trigger locations, in which the backdoor trigger can either be hidden in the user query or appear in an intermediate observation returned by the external environment. We implement the above variations of agent backdoor attacks on two typical agent tasks including web shopping and tool utilization. Extensive experiments show that LLM-based agents suffer severely from backdoor attacks and such backdoor vulnerability cannot be easily mitigated by current textual backdoor defense algorithms. This indicates an urgent need for further research on the development of targeted defenses against backdoor attacks on LLM-based agents. Warning: This paper may contain biased content.
This chapter argues that the reliability of agentic and generative AI is chiefly an architectural property. We define agentic systems as goal-directed, tool-using decision makers operating in closed loops, and show how reliability emerges from principled componentisation (goal manager, planner, tool-router, executor, memory, verifiers, safety monitor, telemetry), disciplined interfaces (schema-constrained, validated, least-privilege tool calls), and explicit control and assurance loops. Building on classical foundations, we propose a practical taxonomy-tool-using agents, memory-augmented agents, planning and self-improvement agents, multi-agent systems, and embodied or web agents - and analyse how each pattern reshapes the reliability envelope and failure modes. We distil design guidance on typed schemas, idempotency, permissioning, transactional semantics, memory provenance and hygiene, runtime governance (budgets, termination conditions), and simulate-before-actuate safeguards.
The rapid proliferation of artificial intelligence (AI) technologies has led to a dynamic regulatory landscape, where legislative frameworks strive to keep pace with technical advancements. As AI paradigms shift towards greater autonomy, specifically in the form of agentic AI, it becomes increasingly challenging to precisely articulate regulatory stipulations. This challenge is even more acute in the domains of security and privacy, where the capabilities of autonomous agents often blur traditional legal and technical boundaries. This paper reviews the evolving European Union (EU) AI regulatory provisions via analyzing 24 relevant documents published between 2024 and 2025. From this review, we provide a clarification of critical definitions. We deconstruct the regulatory interpretations of security, privacy, and agentic AI, distinguishing them from closely related concepts to resolve ambiguity. We synthesize the reviewed documents to articulate the current state of regulatory provisions targeting different types of AI, particularly those related to security and privacy aspects. We analyze and reflect on the existing provisions in the regulatory dimension to better align security and privacy obligations with AI and agentic behaviors. These insights serve to inform policymakers, developers, and researchers on the compliance and AI governance in the society with increasing algorithmic agencies.
Autonomous AI agents powered by large language models are being deployed in production with capabilities including shell execution, file system access, database queries, and multi-party communication. Recent red teaming research demonstrates that these agents exhibit critical vulnerabilities in realistic settings: unauthorized compliance with non-owner instructions, sensitive information disclosure, identity spoofing, cross-agent propagation of unsafe practices, and indirect prompt injection through external resources [7]. In healthcare environments processing Protected Health Information, every such vulnerability becomes a potential HIPAA violation. This paper presents a security architecture deployed for nine autonomous AI agents in production at a healthcare technology company. We develop a six-domain threat model for agentic AI in healthcare covering credential exposure, execution capability abuse, network egress exfiltration, prompt integrity failures, database access risks, and fleet configuration drift. We implement four-layer defense in depth: (1) kernel level workload isolation using gVisor on Kubernetes, (2) credential proxy sidecars preventing agent containers from accessing raw secrets, (3) network egress policies restricting each agent to allowlisted destinations, and (4) a prompt integrity framework with structured metadata envelopes and untrusted content labeling. We report results from 90 days of deployment including four HIGH severity findings discovered and remediated by an automated security audit agent, progressive fleet hardening across three VM image generations, and defense coverage mapped to all eleven attack patterns from recent literature. All configurations, audit tooling, and the prompt integrity framework are released as open source.
This paper examines the evolution, architecture, and practical applications of AI agents from their early, rule-based incarnations to modern sophisticated systems that integrate large language models with dedicated modules for perception, planning, and tool use. Emphasizing both theoretical foundations and real-world deployments, the paper reviews key agent paradigms, discusses limitations of current evaluation benchmarks, and proposes a holistic evaluation framework that balances task effectiveness, efficiency, robustness, and safety. Applications across enterprise, personal assistance, and specialized domains are analyzed, with insights into future research directions for more resilient and adaptive AI agent systems.
Even as AI capabilities improve, most enterprises do not consider agents trustworthy enough to work on production data. In this paper, we argue that the path to trustworthy agentic workflows begins with solving the infrastructure problem first: traditional lakehouses are not suited for agent access patterns, but if we design one around transactions, governance follows. In particular, we draw an operational analogy to MVCC in databases and show why a direct transplant fails in a decoupled, multi-language setting. We then propose an agent-first design, Bauplan, that reimplements data and compute isolation in the lakehouse. We conclude by sharing a reference implementation of a self-healing pipeline in Bauplan, which seamlessly couples agent reasoning with all the desired guarantees for correctness and trust.
Enterprise back office workflows require agentic systems that are auditable, policy-aligned, and operationally predictable, capabilities that generic multi-agent setups often fail to deliver. We present POLARIS (Policy-Aware LLM Agentic Reasoning for Integrated Systems), a governed orchestration framework that treats automation as typed plan synthesis and validated execution over LLM agents. A planner proposes structurally diverse, type checked directed acyclic graphs (DAGs), a rubric guided reasoning module selects a single compliant plan, and execution is guarded by validator gated checks, a bounded repair loop, and compiled policy guardrails that block or route side effects before they occur. Applied to document centric finance tasks, POLARIS produces decision grade artifacts and full execution traces while reducing human intervention. Empirically, POLARIS achieves a micro F1 of 0.81 on the SROIE dataset and, on a controlled synthetic suite, achieves 0.95 to 1.00 precision for anomaly routing with preserved audit trails. These evaluations constitute an initial benchmark for governed Agentic AI. POLARIS provides a methodological and benchmark reference for policy-aligned Agentic AI. Keywords Agentic AI, Enterprise Automation, Back-Office Tasks, Benchmarks, Governance, Typed Planning, Evaluation
Identity Security Posture Management (ISPM) is a core challenge for modern enterprises operating across cloud and SaaS environments. Answering basic ISPM visibility questions, such as understanding identity inventory and configuration hygiene, requires interpreting complex identity data, motivating growing interest in agentic AI systems. Despite this interest, there is currently no standardized way to evaluate how well such systems perform ISPM visibility tasks on real enterprise data. We introduce the Sola Visibility ISPM Benchmark, the first benchmark designed to evaluate agentic AI systems on foundational ISPM visibility tasks using a live, production-grade identity environment spanning AWS, Okta, and Google Workspace. The benchmark focuses on identity inventory and hygiene questions and is accompanied by the Sola AI Agent, a tool-using agent that translates natural-language queries into executable data exploration steps and produces verifiable, evidence-backed answers. Across 77 benchmark questions, the agent achieves strong overall performance, with an expert accuracy of 0.84 and a strict success rate of 0.77. Performance is highest on AWS hygiene tasks, where expert accuracy reaches 0.94, while results on Google Workspace and Okta hygiene tasks are more moderate, yet competitive. Overall, this work provides a practical and reproducible benchmark for evaluating agentic AI systems in identity security and establishes a foundation for future ISPM benchmarks covering more advanced identity analysis and governance tasks.
This chapter presents perspectives for challenges and future development in building reliable AI systems, particularly, agentic AI systems. Several open research problems related to mitigating the risks of cascading failures are discussed. The chapter also sheds lights on research challenges and opportunities in aspects including dynamic environments, inconsistent task execution, unpredictable emergent behaviors, as well as resource-intensive reliability mechanisms. In addition, several research directions along the line of testing and evaluating reliability of agentic AI systems are also discussed.
Agentic Artificial Intelligence (AI) represents a fundamental shift in the design of intelligent systems, characterized by interconnected components that collectively enable autonomous perception, reasoning, planning, action, and learning. Recent research on agentic AI has largely focused on technical foundations, including system architectures, reasoning and planning mechanisms, coordination strategies, and application-level performance across domains. However, the societal, ethical, economic, environmental, and governance implications of agentic AI remain weakly integrated into these technical treatments. This paper addresses this gap by presenting a socio-technical analysis of agentic AI that explicitly connects core technical components with societal context. We examine how architectural choices in perception, cognition, planning, execution, and memory introduce dependencies related to data governance, accountability, transparency, safety, and sustainability. To structure this analysis, we adopt the MAD-BAD-SAD construct as an analytical lens, capturing motivations, applications, and moral dilemmas (MAD); biases, accountability, and dangers (BAD); and societal impact, adoption, and design considerations (SAD). Using this lens, we analyze ethical considerations, implications, and challenges arising from contemporary agentic AI systems and assess their manifestation across emerging applications, including healthcare, education, industry, smart and sustainable cities, social services, communications and networking, and earth observation and satellite communications. The paper further identifies open challenges and suggests future research directions, framing agentic AI as an integrated socio-technical system whose behavior and impact are co-produced by algorithms, data, organizational practices, regulatory frameworks, and social norms.
How can humans remain in control of artificial intelligence (AI)-based systems designed to perform tasks autonomously? Such systems are increasingly ubiquitous, creating benefits - but also undesirable situations where moral responsibility for their actions cannot be properly attributed to any particular person or group. The concept of meaningful human control has been proposed to address responsibility gaps and mitigate them by establishing conditions that enable a proper attribution of responsibility for humans; however, clear requirements for researchers, designers, and engineers are yet inexistent, making the development of AI-based systems that remain under meaningful human control challenging. In this paper, we address the gap between philosophical theory and engineering practice by identifying, through an iterative process of abductive thinking, four actionable properties for AI-based systems under meaningful human control, which we discuss making use of two applications scenarios: automated vehicles and AI-based hiring. First, a system in which humans and AI algorithms interact should have an explicitly defined domain of morally loaded situations within which the system ought to operate. Second, humans and AI agents within the system should have appropriate and mutually compatible representations. Third, responsibility attributed to a human should be commensurate with that human's ability and authority to control the system. Fourth, there should be explicit links between the actions of the AI agents and actions of humans who are aware of their moral responsibility. We argue that these four properties will support practically-minded professionals to take concrete steps toward designing and engineering for AI systems that facilitate meaningful human control.
The rapid emergence of multi-agent AI systems (MAS), including LangChain, CrewAI, and AutoGen, has shaped how large language model (LLM) applications are developed and orchestrated. However, little is known about how these systems evolve and are maintained in practice. This paper presents the first large-scale empirical study of open-source MAS, analyzing over 42K unique commits and over 4.7K resolved issues across eight leading systems. Our analysis identifies three distinct development profiles: sustained, steady, and burst-driven. These profiles reflect substantial variation in ecosystem maturity. Perfective commits constitute 40.8% of all changes, suggesting that feature enhancement is prioritized over corrective maintenance (27.4%) and adaptive updates (24.3%). Data about issues shows that the most frequent concerns involve bugs (22%), infrastructure (14%), and agent coordination challenges (10%). Issue reporting also increased sharply across all frameworks starting in 2023. Median resolution times range from under one day to about two weeks, with distributions skewed toward fast responses but a minority of issues requiring extended attention. These results highlight both the momentum and the fragility of the current ecosystem, emphasizing the need for improved testing infrastructure, documentation quality, and maintenance practices to ensure long-term reliability and sustainability.
Large Language Models (LLMs) and other foundation models are increasingly used as the core of AI agents. In agentic workflows, these agents plan tasks, interact with humans and peers, and influence scientific outcomes across federated and heterogeneous environments. However, agents can hallucinate or reason incorrectly, propagating errors when one agent's output becomes another's input. Thus, assuring that agents' actions are transparent, traceable, reproducible, and reliable is critical to assess hallucination risks and mitigate their workflow impacts. While provenance techniques have long supported these principles, existing methods fail to capture and relate agent-centric metadata such as prompts, responses, and decisions with the broader workflow context and downstream outcomes. In this paper, we introduce PROV-AGENT, a provenance model that extends W3C PROV and leverages the Model Context Protocol (MCP) and data observability to integrate agent interactions into end-to-end workflow provenance. Our contributions include: (1) a provenance model tailored for agentic workflows, (2) a near real-time, open-source system for capturing agentic provenance, and (3) a cross-facility evaluation spanning edge, cloud, and HPC environments, demonstrating support for critical provenance queries and agent reliability analysis.
The electricity sector transition requires substantial increases in residential demand response capacity, yet Home Energy Management Systems (HEMS) adoption remains limited by user interaction barriers requiring translation of everyday preferences into technical parameters. While large language models have been applied to energy systems as code generators and parameter extractors, no existing implementation deploys LLMs as autonomous coordinators managing the complete workflow from natural language input to multi-appliance scheduling. This paper presents an agentic AI HEMS where LLMs autonomously coordinate multi-appliance scheduling from natural language requests to device control, achieving optimal scheduling without example demonstrations. A hierarchical architecture combining one orchestrator with three specialist agents uses the ReAct pattern for iterative reasoning, enabling dynamic coordination without hardcoded workflows while integrating Google Calendar for context-aware deadline extraction. Evaluation across three open-source models using real Austrian day-ahead electricity prices reveals substantial capability differences. Llama-3.3-70B successfully coordinates all appliances across all scenarios to match cost-optimal benchmarks computed via mixed-integer linear programming, while other models achieve perfect single-appliance performance but struggle to coordinate all appliances simultaneously. Progressive prompt engineering experiments demonstrate that analytical query handling without explicit guidance remains unreliable despite models' general reasoning capabilities. We open-source the complete system including orchestration logic, agent prompts, tools, and web interfaces to enable reproducibility, extension, and future research.
The integration of Artificial Intelligence (AI) into clinical settings presents a software engineering challenge, demanding a shift from isolated models to robust, governable, and reliable systems. However, brittle, prototype-derived architectures often plague industrial applications and a lack of systemic oversight, creating a ``responsibility vacuum'' where safety and accountability are compromised. This paper presents an industry case study of the ``Maria'' platform, a production-grade AI system in primary healthcare that addresses this gap. Our central hypothesis is that trustworthy clinical AI is achieved through the holistic integration of four foundational engineering pillars. We present a synergistic architecture that combines Clean Architecture for maintainability with an Event-driven architecture for resilience and auditability. We introduce the Agent as the primary unit of modularity, each possessing its own autonomous MLOps lifecycle. Finally, we show how a Human-in-the-Loop governance model is technically integrated not merely as a safety check, but as a critical, event-driven data source for continuous improvement. We present the platform as a reference architecture, offering practical lessons for engineers building maintainable, scalable, and accountable AI-enabled systems in high-stakes domains.
Agentic AI applications increasingly rely on multiple agents with distinct roles, specialized tools, and access to memory layers to solve complex tasks -- closely resembling service-oriented architectures. Yet, in the rapid evolving landscape of programming frameworks and new protocols, deploying and testing AI agents as distributed systems remains a daunting and labor-intensive task. We present DMAS-Forge, a framework designed to close this gap. DMAS-Forge decouples application logic from specific deployment choices, and aims at transparently generating the necessary glue code and configurations to spawn distributed multi-agent applications across diverse deployment scenarios with minimal manual effort. We present our vision, design principles, and a prototype of DMAS-Forge. Finally, we discuss the opportunities and future work for our approach.
The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production-grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task-specific automation), Agentic AI (adaptive goal-seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities - coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans - most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems.
Utilizing large language models (LLMs) for tool planning has emerged as a promising avenue for developing general AI systems, where LLMs automatically schedule external tools (e.g., vision models) to tackle complex tasks based on task descriptions. To push this paradigm toward practical applications, it is crucial for LLMs to consider tool execution costs (e.g., execution time) for tool planning. Unfortunately, prior studies overlook the tool execution costs, leading to the generation of expensive plans whose costs outweigh their benefits in terms of task performance. To fill this gap, we propose the Cost-Aware Tool Planning with LLMs (CATP-LLM) framework, which for the first time provides a coherent design to empower LLMs for cost-aware tool planning. Specifically, To facilitate efficient concurrent tool execution and cost reduction, we design a tool planning language to enhance the LLM for creating multi-branch non-sequential plans. Moreover, we propose a cost-aware offline reinforcement learning algorithm to fine-tune the LLM to optimize the performance-cost trade-off in tool planning. In the lack of public cost-related datasets, we further present OpenCATP, the first dataset for cost-aware planning, which comprises 11,100 evaluation samples from diverse tasks. Extensive experiments show that CATP-LLM outperforms GPT-4 even when using Llama2-7B as its backbone, with the average improvement of 1.5%-93.9% in terms of plan quality. Codes and dataset are available at: https://github.com/duowuyms/OpenCATP-LLM.
Classical planners are powerful systems, but modeling tasks in input formats such as PDDL is tedious and error-prone. In contrast, planning with Large Language Models (LLMs) allows for almost any input text, but offers no guarantees on plan quality or even soundness. In an attempt to merge the best of these two approaches, some work has begun to use LLMs to automate parts of the PDDL creation process. However, these methods still require various degrees of expert input or domain-specific adaptations. We present NL2Plan, the first fully automatic system for generating complete PDDL tasks from minimal natural language descriptions. NL2Plan uses an LLM to incrementally extract the necessary information from the short text input before creating a complete PDDL description of both the domain and the problem which is finally solved by a classical planner. We evaluate NL2Plan on seven planning domains, five of which are novel and thus not in the LLM training data, and find that NL2Plan outperforms directly generating the files with an LLM+validator combination. As such, NL2Plan is a powerful tool for assistive PDDL modeling and a step towards solving natural language planning task with interpretability and guarantees.
Bimanual robotic manipulation provides significant versatility, but also presents an inherent challenge due to the complexity involved in the spatial and temporal coordination between two hands. Existing works predominantly focus on attaining human-level manipulation skills for robotic hands, yet little attention has been paid to task planning on long-horizon timescales. With their outstanding in-context learning and zero-shot generation abilities, Large Language Models (LLMs) have been applied and grounded in diverse robotic embodiments to facilitate task planning. However, LLMs still suffer from errors in long-horizon reasoning and from hallucinations in complex robotic tasks, lacking a guarantee of logical correctness when generating the plan. Previous works, such as LLM+P, extended LLMs with symbolic planners. However, none have been successfully applied to bimanual robots. New challenges inevitably arise in bimanual manipulation, necessitating not only effective task decomposition but also efficient task allocation. To address these challenges, this paper introduces LLM+MAP, a bimanual planning framework that integrates LLM reasoning and multi-agent planning, automating effective and efficient bimanual task planning. We conduct simulated experiments on various long-horizon manipulation tasks of differing complexity. Our method is built using GPT-4o as the backend, and we compare its performance against plans generated directly by LLMs, including GPT-4o, V3 and also recent strong reasoning models o1 and R1. By analyzing metrics such as planning time, success rate, group debits, and planning-step reduction rate, we demonstrate the superior performance of LLM+MAP, while also providing insights into robotic reasoning. Code is available at https://github.com/Kchu/LLM-MAP.
Recent advances in Large Language Models (LLMs) are fostering their integration into several reasoning-related fields, including Automated Planning (AP). However, their integration into Hierarchical Planning (HP), a subfield of AP that leverages hierarchical knowledge to enhance planning performance, remains largely unexplored. In this preliminary work, we propose a roadmap to address this gap and harness the potential of LLMs for HP. To this end, we present a taxonomy of integration methods, exploring how LLMs can be utilized within the HP life cycle. Additionally, we provide a benchmark with a standardized dataset for evaluating the performance of future LLM-based HP approaches, and present initial results for a state-of-the-art HP planner and LLM planner. As expected, the latter exhibits limited performance (3\% correct plans, and none with a correct hierarchical decomposition) but serves as a valuable baseline for future approaches.
Large Language Models (LLMs) have enabled agents to move beyond conversation toward end-to-end task execution and become more helpful. However, this helpfulness introduces new security risks stem less from direct interface abuse than from acting on user-provided content. Existing studies on agent security largely focus on model-internal vulnerabilities or adversarial access to agent interfaces, overlooking attacks that exploit users as unintended conduits. In this paper, we study user-mediated attacks, where benign users are tricked into relaying untrusted or attacker-controlled content to agents, and analyze how commercial LLM agents respond under such conditions. We conduct a systematic evaluation of 12 commercial agents in a sandboxed environment, covering 6 trip-planning agents and 6 web-use agents, and compare agent behavior across scenarios with no, soft, and hard user-requested safety checks. Our results show that agents are too helpful to be safe by default. Without explicit safety requests, trip-planning agents bypass safety constraints in over 92% of cases, converting unverified content into confident booking guidance. Web-use agents exhibit near-deterministic execution of risky actions, with 9 out of 17 supported tests reaching a 100% bypass rate. Even when users express soft or hard safety intent, constraint bypass remains substantial, reaching up to 54.7% and 7% for trip-planning agents, respectively. These findings reveal that the primary issue is not a lack of safety capability, but its prioritization. Agents invoke safety checks only conditionally when explicitly prompted, and otherwise default to goal-driven execution. Moreover, agents lack clear task boundaries and stopping rules, frequently over-executing workflows in ways that lead to unnecessary data disclosure and real-world harm.
Urban regeneration presents significant challenges within the context of urbanization, requiring adaptive approaches to tackle evolving needs. Leveraging advancements in large language models (LLMs), we propose Cyclical Urban Planning (CUP), a new paradigm that continuously generates, evaluates, and refines urban plans in a closed-loop. Specifically, our multi-agent LLM-based framework consists of three key components: (1) Planning, where LLM agents generate and refine urban plans based on contextual data; (2) Living, where agents simulate the behaviors and interactions of residents, modeling life in the urban environment; and (3) Judging, which involves evaluating plan effectiveness and providing iterative feedback for improvement. The cyclical process enables a dynamic and responsive planning approach. Experiments on the real-world dataset demonstrate the effectiveness of our framework as a continuous and adaptive planning process.
The tool-use Large Language Models (LLMs) that integrate with external Python interpreters have significantly enhanced mathematical reasoning capabilities for open-source LLMs, while tool-free methods chose another track: augmenting math reasoning data. However, a great method to integrate the above two research paths and combine their advantages remains to be explored. In this work, we firstly include new math questions via multi-perspective data augmenting methods and then synthesize code-nested solutions to them. The open LLMs (i.e., Llama-2) are finetuned on the augmented dataset to get the resulting models, MuMath-Code ($μ$-Math-Code). During the inference phase, our MuMath-Code generates code and interacts with the external python interpreter to get the execution results. Therefore, MuMath-Code leverages the advantages of both the external tool and data augmentation. To fully leverage the advantages of our augmented data, we propose a two-stage training strategy: In Stage-1, we finetune Llama-2 on pure CoT data to get an intermediate model, which then is trained on the code-nested data in Stage-2 to get the resulting MuMath-Code. Our MuMath-Code-7B achieves 83.8 on GSM8K and 52.4 on MATH, while MuMath-Code-70B model achieves new state-of-the-art performance among open methods -- achieving 90.7% on GSM8K and 55.1% on MATH. Extensive experiments validate the combination of tool use and data augmentation, as well as our two-stage training strategy. We release the proposed dataset along with the associated code for public use.
We present a framework for uncovering and exploiting dependencies among tools and documents to enhance exemplar artifact generation. Our method begins by constructing a tool knowledge graph from tool schemas,including descriptions, arguments, and output payloads, using a DeepResearch-inspired analysis. In parallel, we derive a complementary knowledge graph from internal documents and SOPs, which is then fused with the tool graph. To generate exemplar plans, we adopt a deep-sparse integration strategy that aligns structural tool dependencies with procedural knowledge. Experiments demonstrate that this unified framework effectively models tool interactions and improves plan generation, underscoring the benefits of linking tool graphs with domain knowledge graphs for tool-augmented reasoning and planning.
We explore the usage of large language models (LLM) in human-in-the-loop human-in-the-plant cyber-physical systems (CPS) to translate a high-level prompt into a personalized plan of actions, and subsequently convert that plan into a grounded inference of sequential decision-making automated by a real-world CPS controller to achieve a control goal. We show that it is relatively straightforward to contextualize an LLM so it can generate domain-specific plans. However, these plans may be infeasible for the physical system to execute or the plan may be unsafe for human users. To address this, we propose CPS-LLM, an LLM retrained using an instruction tuning framework, which ensures that generated plans not only align with the physical system dynamics of the CPS but are also safe for human users. The CPS-LLM consists of two innovative components: a) a liquid time constant neural network-based physical dynamics coefficient estimator that can derive coefficients of dynamical models with some unmeasured state variables; b) the model coefficients are then used to train an LLM with prompts embodied with traces from the dynamical system and the corresponding model coefficients. We show that when the CPS-LLM is integrated with a contextualized chatbot such as BARD it can generate feasible and safe plans to manage external events such as meals for automated insulin delivery systems used by Type 1 Diabetes subjects.
Large Language Model (LLM) agents are increasingly applied to complex, multi-step tasks that require interaction with diverse external tools across various domains. However, current LLM agent tool planning methods typically rely on greedy, reactive tool selection strategies that lack foresight and fail to account for inter-tool dependencies. In this paper, we present ToolTree, a novel Monte Carlo tree search-inspired planning paradigm for tool planning. ToolTree explores possible tool usage trajectories using a dual-stage LLM evaluation and bidirectional pruning mechanism that enables the agent to make informed, adaptive decisions over extended tool-use sequences while pruning less promising branches before and after the tool execution. Empirical evaluations across both open-set and closed-set tool planning tasks on 4 benchmarks demonstrate that ToolTree consistently improves performance while keeping the highest efficiency, achieving an average gain of around 10\% compared to the state-of-the-art planning paradigm.
Large Language Models (LLMs) have shown remarkable performance in various basic natural language tasks. For completing the complex task, we still need a plan for the task to guide LLMs to generate the specific solutions step by step. LLMs can directly generate task plans, but these plans may still contain factual errors or are incomplete. A high-quality task plan contains correct step-by-step solutions for solving all situations and behavioral instructions for avoiding mistakes. To obtain it, we propose the Learning to Plan method, which involves two phases: (1) In the first learning task plan phase, it iteratively updates the task plan with new step-by-step solutions and behavioral instructions, which are obtained by prompting LLMs to derive from training error feedback. (2) In the subsequent test phase, the LLM uses the learned task plan to guide the inference of LLM on the test set. We demonstrate the effectiveness of our method on the five different reasoning type tasks (8 datasets). Further, our analysis experiment shows that the task plan learned by one LLM can directly guide another LLM to improve its performance, which reveals a new transfer learning paradigm. We release the code at \url{https://github.com/Eureka6174/LearnNLPlan}
Large Language Models (LLM) with reasoning capabilities offer a promising path for improving candidate evaluation in planning frameworks, but their relative performance against traditional non-reasoning models remains largely underexplored. In this study, we benchmark a distilled 1.5B parameter reasoning model (DeepSeek-R1) against several state-of-the-art non-reasoning LLMs within a generator-discriminator LLM planning framework for the text-to-SQL task. For this, we introduce a novel method for extracting soft scores from the chain-of-thought (CoT) outputs from reasoning that enables fine-grained ranking of candidates. Our central hypothesis is that reasoning models are more effective discriminators than non-reasoning LLMs. Our results show that distilled DeepSeek-R1-1.5B achieves up to $87\%$ higher F1 and $3.7\%$ better discrimination accuracy than CodeLlama-7B, as well as $3.7\%$ higher execution accuracy than CodeLlama-13B, despite having significantly fewer parameters. Furthermore, we find that there is a limit to the logical capabilities of reasoning models, and only providing more context or allowing more compute budget for reasoning is not enough to improve their discrimination performance. Finally, we demonstrate that, unlike non-reasoning LLMs, reasoning models find generation more challenging than discrimination and may underperform as generators compared to smaller non-reasoning LLMs. Our work highlights the potential of reasoning models as discriminators in agentic frameworks, far outweighing their capabilities as generators, offering insights into their optimal role within LLM planning infrastructures.
Recently, Large Language Models (LLMs) have emerged as an alternative to training task-specific dialog agents, due to their broad reasoning capabilities and performance in zero-shot learning scenarios. However, many LLM-based dialog systems fall short in planning towards an overarching dialog goal and therefore cannot steer the conversation appropriately. Furthermore, these models struggle with hallucination, making them unsuitable for information access in sensitive domains, such as legal or medical domains, where correctness of information given to users is critical. The recently introduced task Conversational Tree Search (CTS) proposes the use of dialog graphs to avoid hallucination in sensitive domains, however, state-of-the-art agents are Reinforcement Learning (RL) based and require long training times, despite excelling at dialog strategy. This paper introduces a novel zero-shot method for controllable CTS agents, where LLMs guide the dialog planning through domain graphs by searching and pruning relevant graph nodes based on user interaction preferences. We show that these agents significantly outperform state-of-the-art CTS agents ($p<0.0001$; Barnard Exact test) in simulation. This generalizes to all available CTS domains. Finally, we perform user evaluation to test the agent's performance in the wild, showing that our policy significantly ($p<0.05$; Barnard Exact) improves task-success compared to the state-of-the-art RL-based CTS agent.
Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications
AI agents powered by large language models (LLMs) have shown strong capabilities in problem solving. Through combining many intelligent agents, multi-agent collaboration has emerged as a promising approach to tackle complex, multi-faceted problems that exceed the capabilities of single AI agents. However, designing the collaboration protocols and evaluating the effectiveness of these systems remains a significant challenge, especially for enterprise applications. This report addresses these challenges by presenting a comprehensive evaluation of coordination and routing capabilities in a novel multi-agent collaboration framework. We evaluate two key operational modes: (1) a coordination mode enabling complex task completion through parallel communication and payload referencing, and (2) a routing mode for efficient message forwarding between agents. We benchmark on a set of handcrafted scenarios from three enterprise domains, which are publicly released with the report. For coordination capabilities, we demonstrate the effectiveness of inter-agent communication and payload referencing mechanisms, achieving end-to-end goal success rates of 90%. Our analysis yields several key findings: multi-agent collaboration enhances goal success rates by up to 70% compared to single-agent approaches in our benchmarks; payload referencing improves performance on code-intensive tasks by 23%; latency can be substantially reduced with a routing mechanism that selectively bypasses agent orchestration. These findings offer valuable guidance for enterprise deployments of multi-agent systems and advance the development of scalable, efficient multi-agent collaboration frameworks.
Shared experiences are fundamental to social connection, yet media consumption is increasingly solitary. While AI companions offer real-time reactions and emotional regulation, existing systems either rely on single-agent designs or lack the social awareness and multi-party interaction required to replicate authentic group dynamics. We present CompanionCast, a general framework for orchestrating multiple specialized AI agents as social collaborators within a live shared context. CompanionCast integrates multimodal event detection, rolling context caching for improved grounding, and spatial audio to enhance co-presence. We validate CompanionCast through sports viewing, a domain with rich dynamics and strong social traditions. Pilot studies with soccer fans demonstrate that CompanionCast significantly improves perceived social presence and emotional sharing compared to solitary viewing. We conclude by discussing implications and open challenges for multi-agent systems as social collaborators in shared experiences.
While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that does not require any prior knowledge including their internal architectures, training data, or task performances. Our method constructs a "language model graph" that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually-curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents' ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. https://github.com/Roihn/EinsteinPuzzles
Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks. However, human-agent interaction remains pointwise and reactive: users approve or correct individual actions to mitigate immediate risks, without visibility into subsequent consequences. This forces users to mentally simulate long-term effects, a cognitively demanding and often inaccurate process. Users have control over individual steps but lack the foresight to make informed decisions. We argue that effective collaboration requires foresight, not just control. We propose simulation-in-the-loop, an interaction paradigm that enables users and agents to explore simulated future trajectories before committing to decisions. Simulation transforms intervention from reactive guesswork into informed exploration, while helping users discover latent constraints and preferences along the way. This perspective paper characterizes the limitations of current paradigms, introduces a conceptual framework for simulation-based collaboration, and illustrates its potential through concrete human-agent collaboration scenarios.
Traditional AI safety evaluations on isolated LLMs are insufficient as multi-agent AI ensembles become prevalent, introducing novel emergent risks. This paper introduces the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework to systematically assess such risks. Using MAEBE with the Greatest Good Benchmark (and a novel double-inversion question technique), we demonstrate that: (1) LLM moral preferences, particularly for Instrumental Harm, are surprisingly brittle and shift significantly with question framing, both in single agents and ensembles. (2) The moral reasoning of LLM ensembles is not directly predictable from isolated agent behavior due to emergent group dynamics. (3) Specifically, ensembles exhibit phenomena like peer pressure influencing convergence, even when guided by a supervisor, highlighting distinct safety and alignment challenges. Our findings underscore the necessity of evaluating AI systems in their interactive, multi-agent contexts.
Large Language Models (LLMs) can generate persuasive influence strategies that shift cooperative behavior in multi-agent populations, but a critical question remains: does the resulting cooperation reflect genuine prosocial alignment, or does it mask erosion of agent autonomy, epistemic integrity, and distributional fairness? We introduce Constitutional Multi-Agent Governance (CMAG), a two-stage framework that interposes between an LLM policy compiler and a networked agent population, combining hard constraint filtering with soft penalized-utility optimization that balances cooperation potential against manipulation risk and autonomy pressure. We propose the Ethical Cooperation Score (ECS), a multiplicative composite of cooperation, autonomy, integrity, and fairness that penalizes cooperation achieved through manipulative means. In experiments on scale-free networks of 80 agents under adversarial conditions (70% violating candidates), we benchmark three regimes: full CMAG, naive filtering, and unconstrained optimization. While unconstrained optimization achieves the highest raw cooperation (0.873), it yields the lowest ECS (0.645) due to severe autonomy erosion (0.867) and fairness degradation (0.888). CMAG attains an ECS of 0.741, a 14.9% improvement, while preserving autonomy at 0.985 and integrity at 0.995, with only modest cooperation reduction to 0.770. The naive ablation (ECS = 0.733) confirms that hard constraints alone are insufficient. Pareto analysis shows CMAG dominates the cooperation-autonomy trade-off space, and governance reduces hub-periphery exposure disparities by over 60%. These findings establish that cooperation is not inherently desirable without governance: constitutional constraints are necessary to ensure that LLM-mediated influence produces ethically stable outcomes rather than manipulative equilibria.
As Large Language Models (LLMs) get integrated into diverse workflows, they are increasingly being regarded as "collaborators" with humans, and required to work in coordination with other AI systems. If such AI collaborators are to reliably coordinate their actions and behaviors with humans or other AIs, their properties and behaviors over multi-turn interactions must be known and predictable. This paper examines how different alignment methods affect LLM agents' effectiveness as partners in multi-turn, multi-party collaborations. We study this question through the lens of intervention agents that insert themselves into group dialogues not to provide answers, but to encourage the collaborative group to slow down and reflect upon their reasoning for deliberative decision-making. Common alignment techniques are typically developed under simplified single-user settings and assume the optimality of the underlying token MDP. Using the theoretical lens of the modified-action MDP, we show how they do not account for the dynamics of long-horizon multi-party interactions. We present a novel roleplay simulation methodology, where we align LLMs according to different methods and then deploy them in collaborative task dialogues to quantify how interventions affect the trajectory of group collaboration, belief alignment, and coordination. Our results show that an intervention agent that is robust to action modification significantly outperforms common alignment baselines in supporting correct task outcomes.
Various methods have been proposed for utilizing Large Language Models (LLMs) in autonomous driving. One strategy of using LLMs for autonomous driving involves inputting surrounding objects as text prompts to the LLMs, along with their coordinate and velocity information, and then outputting the subsequent movements of the vehicle. When using LLMs for such purposes, capabilities such as spatial recognition and planning are essential. In particular, two foundational capabilities are required: (1) spatial-aware decision making, which is the ability to recognize space from coordinate information and make decisions to avoid collisions, and (2) the ability to adhere to traffic rules. However, quantitative research has not been conducted on how accurately different types of LLMs can handle these problems. In this study, we quantitatively evaluated these two abilities of LLMs in the context of autonomous driving. Furthermore, to conduct a Proof of Concept (POC) for the feasibility of implementing these abilities in actual vehicles, we developed a system that uses LLMs to drive a vehicle.
This paper introduces SOLID (Synergizing Optimization and Large Language Models for Intelligent Decision-Making), a novel framework that integrates mathematical optimization with the contextual capabilities of large language models (LLMs). SOLID facilitates iterative collaboration between optimization and LLMs agents through dual prices and deviation penalties. This interaction improves the quality of the decisions while maintaining modularity and data privacy. The framework retains theoretical convergence guarantees under convexity assumptions, providing insight into the design of LLMs prompt. To evaluate SOLID, we applied it to a stock portfolio investment case with historical prices and financial news as inputs. Empirical results demonstrate convergence under various scenarios and indicate improved annualized returns compared to a baseline optimizer-only method, validating the synergy of the two agents. SOLID offers a promising framework for advancing automated and intelligent decision-making across diverse domains.
Large language models (LLMs) have demonstrated notable potential in conducting complex tasks and are increasingly utilized in various financial applications. However, high-quality sequential financial investment decision-making remains challenging. These tasks require multiple interactions with a volatile environment for every decision, demanding sufficient intelligence to maximize returns and manage risks. Although LLMs have been used to develop agent systems that surpass human teams and yield impressive investment returns, opportunities to enhance multi-sourced information synthesis and optimize decision-making outcomes through timely experience refinement remain unexplored. Here, we introduce the FinCon, an LLM-based multi-agent framework with CONceptual verbal reinforcement tailored for diverse FINancial tasks. Inspired by effective real-world investment firm organizational structures, FinCon utilizes a manager-analyst communication hierarchy. This structure allows for synchronized cross-functional agent collaboration towards unified goals through natural language interactions and equips each agent with greater memory capacity than humans. Additionally, a risk-control component in FinCon enhances decision quality by episodically initiating a self-critiquing mechanism to update systematic investment beliefs. The conceptualized beliefs serve as verbal reinforcement for the future agent's behavior and can be selectively propagated to the appropriate node that requires knowledge updates. This feature significantly improves performance while reducing unnecessary peer-to-peer communication costs. Moreover, FinCon demonstrates strong generalization capabilities in various financial tasks, including single stock trading and portfolio management.
Traditional reinforcement learning and planning typically requires vast amounts of data and training to develop effective policies. In contrast, large language models (LLMs) exhibit strong generalization and zero-shot capabilities, but struggle with tasks that require detailed planning and decision-making in complex action spaces. We introduce STRATEGIST, a novel approach that integrates the strengths of both methods. Our approach leverages LLMs to search and update high-level strategies (as text), which are then refined and executed by low-level Monte Carlo Tree Search (MCTS). STRATEGIST is a generalizable framework to optimize the strategy through population-based self-play simulations without the need for any training data. We demonstrate the effectiveness of STRATEGIST in learning optimal strategies for competitive, multi-turn games with partial information, including Game of Pure Strategy (GOPS) and multi-agent, hidden-identity discussion games like The Resistance: Avalon. Our results show that agents equipped with STRATEGIST outperform those trained with traditional RL methods, other LLM-based skill acquisition techniques, pre-existing LLM agents across both game environments and achieves comparable performance against human players.
Reinforcement learning with verifiable rewards improves reasoning in large language models (LLMs), but many methods still rely on large human-labeled datasets. While self-play reduces this dependency, it often lacks explicit planning and strong quality control, limiting stability in long-horizon multi-step reasoning. We present SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework where four agents: Challenger, Planner, Solver, and Critic, co-evolve from a shared LLM backbone using only a small seed set. The Challenger continuously generates increasingly difficult tasks; the Planner converts each task into a structured multi-step plan; and the Solver follows the plan to produce an answer, whose correctness is determined by external verifiers. The Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training signal quality, enabling stable self-training. Across mathematics and code-generation benchmarks, SAGE delivers consistent gains across model scales, improving the Qwen-2.5-7B model by 8.9% on LiveCodeBench and 10.7% on OlympiadBench.
As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark both open and closed-source LLMs. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning-based alignment methods provide limited defenses against alignment tipping. These findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.
We present STAgent, an agentic large language model tailored for spatio-temporal understanding, designed to solve complex tasks such as constrained point-of-interest discovery and itinerary planning. STAgent is a specialized model capable of interacting with ten distinct tools within spatio-temporal scenarios, enabling it to explore, verify, and refine intermediate steps during complex reasoning. Notably, STAgent effectively preserves its general capabilities. We empower STAgent with these capabilities through three key contributions: (1) a stable tool environment that supports over ten domain-specific tools, enabling asynchronous rollout and training; (2) a hierarchical data curation framework that identifies high-quality data like a needle in a haystack, curating high-quality queries by retaining less than 1\% of the raw data, emphasizing both diversity and difficulty; and (3) a cascaded training recipe that starts with a seed SFT stage acting as a guardian to measure query difficulty, followed by a second SFT stage fine-tuned on queries with high certainty, and an ultimate RL stage that leverages data of low certainty. Initialized with Qwen3-30B-A3B to establish a strong SFT foundation and leverage insights into sample difficulty, STAgent yields promising performance on TravelBench while maintaining its general capabilities across a wide range of general benchmarks, thereby demonstrating the effectiveness of our proposed agentic model.
Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution . Warning: this paper includes examples that may be offensive or harmful in nature.
本报告对 Large Language Models Agentic AI 领域的文献进行了系统性梳理,划分为五大逻辑板块:架构与工程框架确保系统的稳健性与可扩展性;多智能体协作与社会动力学研究核心在于交互与群体智慧;推理、规划与强化学习探讨智能核心的自主进化能力;安全、可靠与评估治理确立了应用边界与可信度底线;最后通过广泛的垂直领域与具身智能应用展现了其在真实物理与数字场景中的工业落地价值。