AI agents, large language models (LLMs)
Multi-Agent Collaboration Architectures and Communication Mechanisms
Focuses on general frameworks for multi-agent systems (MAS), task decomposition, communication topologies, and inter-agent communication and collaboration protocols, aiming at efficient coordination in large-scale, complex environments.
- Engineering LLM Powered Multi-Agent Framework for Autonomous CloudOps(Kannan Parthasarathy, Karthik Vaidhyanathan, Rudra Dhar, Venkat Krishnamachari, Basil Muhammed, Adyansh Kakran, Sreemaee Akshathala, Shrikara Arun, Sumant Dubey, Mohan Veerubhotla, Amey Karan, 2025, 2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI (CAIN))
- LLM Agents for Smart City Management: Enhancing Decision Support Through Multi-Agent AI Systems(A. Kalyuzhnaya, Sergey Mityagin, E. Lutsenko, Andrey Getmanov, Yaroslav Aksenkin, Kamil Fatkhiev, Kirill Fedorin, Nikolay O. Nikitin, Natalia Chichkova, V. Vorona, A. Boukhanovsky, 2025, Smart Cities)
- The Multi-agent System based on LLM for Online Discussions(Yihan Dong, 2024, Adaptive Agents and Multi-Agent Systems)
- MegaAgent: A Large-Scale Autonomous LLM-based Multi-Agent System Without Predefined SOPs(Qian Wang, Tianyu Wang, Zhenheng Tang, Qinbin Li, Nuo Chen, Jingsheng Liang, Bingsheng He, 2024, Findings of the Association for Computational Linguistics: ACL 2025)
- Synergizing Logical Reasoning, Knowledge Management and Collaboration in Multi-Agent LLM System(Adam Kostka, Jaroslaw A. Chudziak, 2025, Pacific Asia Conference on Language, Information and Computation)
- Collaborative Problem-Solving with LLM: A Multi-agent System Approach to Solve Complex Tasks Using Autogen(R. Barbosa, Ricardo Santos, Paulo Novais, 2024, Communications in Computer and Information Science)
- AutoHMA-LLM: Efficient Task Coordination and Execution in Heterogeneous Multi-Agent Systems Using Hybrid Large Language Models(Tinging Yang, Ping Feng, Qixin Guo, Jindi Zhang, Xiufeng Zhang, Jiahong Ning, Xinghan Wang, Zhongyang Mao, 2025, IEEE Transactions on Cognitive Communications and Networking)
- Search Swarm: Multiagent Large Language Models Framework for E-commerce Product Search(Nagim Isyanbaev, Ilya Makarov, 2024, Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence)
- Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System(Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, Nanqing Dong, 2024, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks(Shubham Gandhi, Manasi S. Patwardhan, L. Vig, Gautam M. Shroff, 2024, Proceedings of the 4th International Conference on AI-ML Systems)
- MASTER: A Multi-Agent System with LLM Specialized MCTS(Bingzheng Gan, Yufan Zhao, Tianyi Zhang, Jing Huang, Yusu Li, Shu Xian Teo, Changwang Zhang, Wei Shi, 2025, North American Chapter of the Association for Computational Linguistics)
- A LLM-informed multi-agent AI system for drone-based visual inspection for infrastructure(Jiucai Liu, Haijiang Li, Chengzhang Chai, Kehong Chen, Dalei Wang, 2025, Advanced Engineering Informatics)
- Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System(Weize Chen, Jiarui Yuan, Cheng Qian, Cheng Yang, Zhiyuan Liu, Maosong Sun, 2024, Annual Meeting of the Association for Computational Linguistics)
- Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-Based Planner and Graph-Based Policy(Ziqi Jia, Junjie Li, Xiaoyang Qu, Jianzong Wang, 2025, 2025 IEEE International Conference on Robotics and Automation (ICRA))
- Scaling Large-Language-Model-based Multi-Agent Collaboration(Cheng Qian, Zihao Xie, Yifei Wang, Wei Liu, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun, 2024, International Conference on Learning Representations)
- FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making(Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yupeng Cao, Zhi Chen, Jordan W. Suchow, Rong Liu, Zhenyu Cui, Denghui Zhang, K. Subbalakshmi, Guojun Xiong, Yueru He, Jimin Huang, Dong Li, Qianqian Xie, 2024, Neural Information Processing Systems)
- Mixture-of-Agents Enhances Large Language Model Capabilities(Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, James Zou, 2024, International Conference on Learning Representations)
- CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society(G. Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem, 2023, Advances in Neural Information Processing Systems 36)
- ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks(Heng Zhou, Hejia Geng, Xiangyuan Xue, Li Kang, Zhenfei Yin, Lei Bai, 2025, Conference on Empirical Methods in Natural Language Processing)
- AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System(Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao, Wei Han, 2025, Conference on Empirical Methods in Natural Language Processing)
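The frameworks above study, among other things, how the communication topology among agents shapes coordination cost. As a purely illustrative toy sketch (not drawn from any single paper listed here; all names such as `coordinator` and `worker_a` are made up), a topology can be represented as a directed adjacency map and messages delivered only along its edges:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """A toy agent that records every message it receives."""
    name: str
    inbox: list = field(default_factory=list)

    def receive(self, sender: str, content: str) -> None:
        self.inbox.append((sender, content))

def broadcast(agents: dict, topology: dict, sender_name: str, content: str) -> None:
    """Deliver a message along the edges of a directed topology.

    `topology` maps an agent's name to the names it may talk to;
    a star topology, for example, routes everything through a coordinator.
    """
    for receiver in topology.get(sender_name, []):
        agents[receiver].receive(sender_name, content)

# Star topology: workers talk only to the coordinator, which fans out.
agents = {n: Agent(n) for n in ("coordinator", "worker_a", "worker_b")}
topology = {
    "coordinator": ["worker_a", "worker_b"],
    "worker_a": ["coordinator"],
    "worker_b": ["coordinator"],
}
broadcast(agents, topology, "coordinator", "decompose: subtask 1 and 2")
broadcast(agents, topology, "worker_a", "result: subtask 1 done")
```

Swapping the `topology` dict for a fully connected or adaptively learned graph (as in AMAS or ReSo above) changes the message volume without touching the agent code, which is the design point these works exploit.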
Reasoning Enhancement, Reflection Mechanisms, and Autonomous Planning
Focuses on the underlying reasoning paradigms of LLM agents, covering improvements to reasoning ability, self-reflection, chain-of-thought (CoT), planning, and logical optimization in complex decision-making environments.
- Reinforce LLM Reasoning through Multi-Agent Reflection(Yurun Yuan, Tengyang Xie, 2025, International Conference on Machine Learning)
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search(Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Greg Wornell, Subhro Das, David D. Cox, Chuang Gan, 2025, International Conference on Machine Learning)
- Enhancing LLM Reasoning Capabilities Through Brokered Multi-Expert Reflection(T. Sheokand, Garveet Jain, Arshdeep Bahga, Vijay K. Madisetti, 2025, IEEE Access)
- Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph(Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Sai Wang, Chen Lin, Yeyun Gong, Lionel M. Ni, H. Shum, Jian Guo, 2023, International Conference on Learning Representations)
- A Reflective Architecture for LLM-Based Systems(Parisa Salmani, Peter R. Lewis, 2025, 2025 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C))
- SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search(Yifan Zhang, Giridhar Ganapavarapu, Srideepika Jayaraman, Bhavna Agrawal, Dhaval Patel, Achille Fokoue, 2025, AAAI Conference on Artificial Intelligence)
- DMA-MCTS: Dynamic Memory-Augmented Monte-Carlo Tree Search for LLM Task Planning(Jiakang Wang, Qi Wang, Mengxia Li, Tingting Li, Yongjun Xu, 2025, 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC))
- Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback(Sanjiban Choudhury, Paloma Sodhi, 2024, International Conference on Learning Representations)
- Building Energy Management Systems with LLM Agent Enhanced Natural Language Policy Explanation(Shuhua Zhang, Jiale Wei, Dejun Xiang, Yuheng Cheng, Huan Zhao, Yulu Xie, Xinlei Cai, Junhua Zhao, 2025, 2025 5th Power System and Green Energy Conference (PSGEC))
- Thinktank: Leveraging LLM Reasoning for Advanced Task Execution in CI/CD(T. Keller, 2024, 2024 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW))
- MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning(Zikang Guo, Benfeng Xu, Xiaorui Wang, Zhendong Mao, 2025, International Joint Conference on Artificial Intelligence)
- ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection(Jeonghye Kim, S. Rhee, Minbeom Kim, Dohyung Kim, Sangmook Lee, Youngchul Sung, Kyomin Jung, 2025, Conference on Empirical Methods in Natural Language Processing)
- WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback(Minda Hu, Tianqing Fang, Jianshu Zhang, Junyu Ma, Zhisong Zhang, J. Zhou, Hongming Zhang, Haitao Mi, Dong Yu, Irwin King, 2025, Conference on Empirical Methods in Natural Language Processing)
- Temporal Consistency for LLM Reasoning Process Error Identification(Jiacheng Guo, Yue Wu, Jiahao Qiu, Kaixuan Huang, Xinzhe Juan, Ling Yang, Mengdi Wang, 2025, Conference on Empirical Methods in Natural Language Processing)
- HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model(Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, Ping Luo, 2024, Annual Meeting of the Association for Computational Linguistics)
- Efficient Tool Use with Chain-of-Abstraction Reasoning(Silin Gao, Jane Dwivedi-Yu, Ping Yu, Xiaoqing Tan, Ramakanth Pasunuru, Olga Golovneva, Koustuv Sinha, Asli Celikyilmaz, Antoine Bosselut, Tianlu Wang, 2024, International Conference on Computational Linguistics)
- LLM-Guided Reinforcement Learning for Interactive Environments(Fuxue Yang, Jiawen Liu, Kan Li, 2025, Mathematics)
- PRACT: Optimizing Principled Reasoning and Acting of LLM Agent(Zhiwei Liu, Weiran Yao, Jianguo Zhang, Rithesh Murthy, Liangwei Yang, Zuxin Liu, Tian Lan, Ming Zhu, Juntao Tan, Shirley Kokane, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong, 2024, Conference on Computational Natural Language Learning)
- Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning(Xiaolong Wei, Yuehu Dong, Xingliang Wang, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin, 2025, AAAI Conference on Artificial Intelligence)
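A recurring pattern across the reflection papers above is a draft-critique-revise loop. The following is a minimal, generic sketch of that loop with stubbed "model" calls; it does not reproduce any specific method listed here, and the stub functions (`draft`, `critique`, `revise`) are illustrative placeholders for LLM calls:

```python
def reflect_and_revise(task, draft_fn, critique_fn, revise_fn, max_rounds=3):
    """Generic reflection loop: draft an answer, have a critic inspect it,
    and revise until the critic is satisfied or the budget runs out."""
    answer = draft_fn(task)
    for _ in range(max_rounds):
        feedback = critique_fn(task, answer)
        if feedback is None:          # critic found no issues
            return answer
        answer = revise_fn(task, answer, feedback)
    return answer

# Stub model calls: a broken arithmetic draft that the critic
# catches and the reviser repairs on the first round.
draft = lambda task: "2 + 2 = 5"
critique = lambda task, ans: None if ans.endswith("4") else "re-check the sum"
revise = lambda task, ans, fb: "2 + 2 = 4"

result = reflect_and_revise("add 2 and 2", draft, critique, revise)
```

In the papers above, the draft, critic, and reviser roles may be the same model with different prompts, distinct specialized agents (as in multi-agent reflection), or a search procedure such as MCTS wrapped around this inner loop.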
Tool Use, Self-Evolution, and Task Execution
Examines the mechanisms by which agents invoke external tools, including API adaptation, tool learning, automated paradigms for task execution, and agent self-evolution driven by feedback.
- agentAR: Creating Augmented Reality Applications with Tool-Augmented LLM-based Autonomous Agents(Chenfei Zhu, Shao-Kang Hsia, Xiyun Hu, Ziyi Liu, Jingyu Shi, K. Ramani, 2025, Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology)
- DRC-Coder: Automated DRC Checker Code Generation Using LLM Autonomous Agent(Chen-Chia Chang, Chia-Tung Ho, Yaguang Li, Yiran Chen, Haoxing Ren, 2024, Proceedings of the 2025 International Symposium on Physical Design)
- El Agente: An Autonomous Agent for Quantum Chemistry(Yunheng Zou, Austin H. Cheng, Abdulrahman Aldossary, Jiaru Bai, Shi Xuan Leong, Jorge A. Campos Gonzalez Angulo, Chang-Min Choi, Cher Tian Ser, Gary Tom, Andrew Wang, Zijian Zhang, Ilya Yakavets, Han Hao, Chris Crebolder, Varinia Bernales, Al'an Aspuru-Guzik, 2025, Matter)
- LLM-guided chemical process optimization with a multi-agent approach(Tong Zeng, Srivathsan Badrinarayanan, Janghoon Ock, Cheng-Kai Lai, A. Farimani, 2025, Machine Learning: Science and Technology)
- A Three-Stage Pipeline using ReAct and Reflexion for Reliable LLM-based Java Unit Test Case Generator(Reza P. Ubaidillah, Yani Widyani, 2025, 2025 IEEE International Conference on Data and Software Engineering (ICoDSE))
- RepairAgent: An Autonomous, LLM-Based Agent for Program Repair(Islem Bouzenia, Prem Devanbu, Michael Pradel, 2024, 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE))
- ASIC-Agent: An Autonomous Multi-Agent System for ASIC Design with Benchmark Evaluation(Ahmed Allam, Youssef Mansour, Mohamed Shalan, 2025, 2025 IEEE International Conference on LLM-Aided Design (ICLAD))
- Learning to Ask: When LLM Agents Meet Unclear Instruction(Wenxuan Wang, Juluan Shi, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-Tse Huang, Michael R. Lyu, 2024, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing)
- RTBAgent: A LLM-based Agent System for Real-Time Bidding(Leng Cai, Junxuan He, Yikai Li, Junjie Liang, Yuanping Lin, Ziming Quan, Yawen Zeng, Jin Xu, 2025, Companion Proceedings of the ACM on Web Conference 2025)
- GoNoGo: An Efficient LLM-based Multi-Agent System for Streamlining Automotive Software Release Decision-Making(Arsham Gholamzadeh Khoee, Yinan Yu, R. Feldt, Andris Freimanis, P. Andersson, Dhasarathy Parthasarathy, 2024, International Conference on Testing Software and Systems)
- A feasibility study of automating radiotherapy planning with large language model agents(QingXing Wang, Zhongqiu Wang, Minghua Li, Xinye Ni, Rong Tan, Wenwen Zhang, Maitudi Wubulaishan, Wei Wang, Zhiyong Yuan, Zhen Zhang, Cong Liu, 2025, Physics in Medicine & Biology)
- StockSage: Multi-Agent LLM Powered Inventory Management System for Intelligent Supply Chain Optimization(Pranita Pingale, Faiz Asif Shaikh, Om Sanjay Bhongale, Sanvesh Satish Patil, 2026, International Journal of Innovative Science and Research Technology)
- API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs(Minghao Li, Feifan Song, Yu Bowen, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li, 2023, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing)
- Fine-tuning a Large Language Model for Automating Computational Fluid Dynamics Simulations(Zhehao Dong, Zhen Lu, Yue Yang, 2025, Theoretical and Applied Mechanics Letters)
- First Field Trial of LLM-Powered AI Agent for Lifecycle Management of Autonomous Driving Optical Networks(Xiaomin Liu, Qizhi Qiu, Yihao Zhang, Yuming Cheng, L. Yi, Weisheng Hu, Q. Zhuge, 2024, Optical Fiber Communications Conference and Exhibition)
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent(Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, Fei Huang, 2024, Conference on Empirical Methods in Natural Language Processing)
- SMART: Self-Aware Agent for Tool Overuse Mitigation(Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tur, Gokhan Tur, Heng Ji, 2025, Annual Meeting of the Association for Computational Linguistics)
- MT-Mol: Multi Agent System with Tool-based Reasoning for Molecular Optimization(Hyomin Kim, Yunhui Jang, Sungsoo Ahn, 2025, Conference on Empirical Methods in Natural Language Processing)
- Procedural Environment Generation for Tool-Use Agents(Michael Sullivan, Mareike Hartmann, Alexander Koller, 2025, Conference on Empirical Methods in Natural Language Processing)
- LLM experiments with simulation: Large Language Model Multi-Agent System for Simulation Model Parametrization in Digital Twins(Yuchen Xia, Daniel Dittler, Nasser Jazdi, Haonan Chen, M. Weyrich, 2024, 2024 IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA))
- AutoReview: An LLM-based Multi-Agent System for Security Issue-Oriented Code Review(Yujia Chen, 2025, Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering)
- AgentPress: A Multi-Agent Framework for News Topic Classification with Retrieval, Reasoning, and Reflection(Qi Li, 2025, Applied and Computational Engineering)
- GTA: A Benchmark for General Tool Agents(Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, Xinyi Le, 2024, Neural Information Processing Systems)
- VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video Internet of Things(Yaoyao Zhong, Mengshi Qi, Rui Wang, Yuhan Qiu, Yang Zhang, Huadong Ma, 2025, AAAI Conference on Artificial Intelligence)
- Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning(Zhiyuan Ma, Jiayu Liu, Xianzhen Luo, Zhenya Huang, Qingfu Zhu, Wanxiang Che, 2025, Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2)
- Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents(Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Suzan Verberne, Zhaochun Ren, 2024, Proceedings of the ACM on Web Conference 2025)
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning(Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, V. Ioannidis, Karthik Subbian, J. Leskovec, James Zou, 2024, Advances in Neural Information Processing Systems 37)
- Self-Training Large Language Models for Tool-Use Without Demonstrations(Ne Luo, Aryo Pradipta Gema, Xuanli He, Emile van Krieken, Pietro Lesci, Pasquale Minervini, 2025, North American Chapter of the Association for Computational Linguistics)
- SAND: Boosting LLM Agents with Self-Taught Action Deliberation(Yu Xia, Yiran Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Julian J. McAuley, 2025, Conference on Empirical Methods in Natural Language Processing)
- Towards Tool Use Alignment of Large Language Models(Zhi-Yuan Chen, Shiqi Shen, Guangyao Shen, Gong Zhi, Xu Chen, Yankai Lin, 2024, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing)
- MLLM-Tool: A Multimodal Large Language Model for Tool Agent Learning(Chenyu Wang, Weixin Luo, Qianyu Chen, Haonan Mai, Jindi Guo, Sixun Dong, Xi Xuan, Zhengxin Li, Lin Ma, Shenghua Gao, 2024, 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))
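The tool-use works above generally share a common skeleton: a registry of callable tools and a dispatcher that parses a structured tool call emitted by the model and executes it. A minimal hedged sketch of that skeleton (the tool names and JSON call format here are invented for illustration and correspond to no specific benchmark or API above):

```python
import json

TOOLS = {}

def tool(fn):
    """Register a Python function as a callable tool by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city: str) -> str:
    # Stand-in for a real external API call.
    return f"sunny in {city}"

@tool
def add(a: float, b: float) -> float:
    return a + b

def dispatch(model_output: str):
    """Execute a tool call emitted by the model as JSON,
    e.g. {"tool": "add", "args": {"a": 1, "b": 2}}."""
    call = json.loads(model_output)
    fn = TOOLS.get(call["tool"])
    if fn is None:
        return f"error: unknown tool {call['tool']!r}"
    return fn(**call["args"])

result = dispatch('{"tool": "add", "args": {"a": 1, "b": 2}}')
```

Benchmarks such as API-Bank and GTA above evaluate exactly this interface end to end: whether the model selects the right tool name, fills arguments that the dispatcher can execute, and recovers from `unknown tool` or argument errors.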
Domain-Application-Driven System Development
Showcases real-world deployments of agents in specific industry scenarios, including software engineering, scientific research, finance, healthcare, the Internet of Things, autonomous driving, and multimedia content production.
- AIoT Smart Home via Autonomous LLM Agents(D. Rivkin, F. Hogan, Amal Feriani, Abhisek Konar, Adam Sigal, Xue Liu, Gregory Dudek, 2025, IEEE Internet of Things Journal)
- Agent Laboratory: Using LLM Agents as Research Assistants(Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, E. Barsoum, 2025, Findings of the Association for Computational Linguistics: EMNLP 2025)
- LLM-Assisted Reinforcement Learning: Leveraging Lightweight Large Language Model Capabilities for Efficient Task Scheduling in Multi-Cloud Environment(Xuhao Tang, Fagui Liu, Dishi Xu, Jun Jiang, Quan Tang, Bin Wang, Qingbo Wu, C. L. Philip Chen, 2025, IEEE Transactions on Consumer Electronics)
- Doc-React: Multi-page Heterogeneous Document Question-answering(Junda Wu, Yu Xia, Tong Yu, Xiang Chen, Sai Sree Harsha, Akash V. Maharaj, Ruiyi Zhang, Victor S. Bursztyn, Sungchul Kim, Ryan A. Rossi, Julian J. McAuley, Yunyao Li, Ritwik Sinha, 2025, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers))
- Multi-Agent LLM-powered AI for Autonomous Optical Power Commissioning of OMS Links(Yujiao Hao, Mahdi Hemmati, Mehrad Vaezi, Yuren You, Christopher Janz, 2025, 2025 European Conference on Optical Communications (ECOC))
- LLM-based Multi-Agent System for Intelligent Refactoring of Haskell Code(Shahbaz Siddeeq, Muhammad Waseem, Zeeshan Rasheed, Mahade Hasan, Jussi Rasku, Mika Saari, H. Terho, Kalle Mäkelä, Kai-Kristian Kemell, Pekka Abrahamsson, 2025, International Conference on Product Focused Software Process Improvement)
- Divergent Thoughts toward One Goal: LLM-based Multi-Agent Collaboration System for Electronic Design Automation(Haoyuan Wu, Haisheng Zheng, Zhuolun He, Bei Yu, 2025, North American Chapter of the Association for Computational Linguistics)
- SAR: A Structure-Aligned Reasoning Framework for Temporal Knowledge Graph Question Answering(Qianyi Hu, Jiaxue Liu, Xinhui Tu, Shoujin Wang, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- Adaptive Domain Modeling with Language Models: A Multi-Agent Approach to Task Planning(Harisankar Babu, Philipp Schillinger, Tamim Asfour, 2025, 2025 IEEE 21st International Conference on Automation Science and Engineering (CASE))
- ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework(Vali Tawosi, Keshav Ramani, Salwa Alamir, Xiaomo Liu, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW))
- Automating agentic collaborative ontology engineering with role-playing simulation of LLM-powered agents and RAG technology(Andreas Soularidis, Dimitrios Doumanas, Konstantinos Kotis, G. Vouros, 2025, The Knowledge Engineering Review)
- Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents(Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, Yanfeng Wang, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- A general AI agent framework for smart buildings based on large language models and ReAct strategy(Xia Yan, Xincong Yang, Nan Jin, Yu Chen, Jiaqi Li, 2025, Smart Construction)
- Controllable Traffic Simulation through LLM-Guided Hierarchical Reasoning and Refinement(Zhiyuan Liu, Leheng Li, Yuning Wang, Haotian Lin, Hao Chen, Zhizhe Liu, Lei He, Jianqiang Wang, 2024, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Coaching Copilot: Blended Form of an LLM-Powered Chatbot and a Human Coach to Effectively Support Self-Reflection for Leadership Growth(Riku Arakawa, Hiromu Yakura, 2024, ACM Conversational User Interfaces 2024)
- Research on the Construction Technology of Electric Energy Metering Big Model and Intelligent Agent(Jiaming Zhang, Ji Xiao, Ke Zheng, Xiaoyang Dong, Ningtao Liu, Shishun Tan, Jianming Hu, Bingling Chen, 2025, 2025 IEEE 7th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC))
- Unified Conversational Agent Using Transformer Models for General and E-Commerce Contexts(A. Naveen, Saleem Dudekula, Saakshi Mahantesh Alase, Ummareddy Hruthika, Satrughan Kumar, 2025, 2025 IEEE International Conference on Advances in Computing Research On Science Engineering and Technology (ACROSET))
- E-GPT: A Multi-Agent LLM Framework for Intelligent Educational Assistance(Tabassum Ara, Shreyansu Panda, Jason Samuel Das, Amit Das, Sidharth Vivek Prabhugoankar, 2025, 2025 1st International Conference on Advancement in Futuristic Technologies (ICAFT))
- Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis(Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang, Jianjun Chen, Jianhui Li, Gaogang Xie, Dan Pei, 2025, Companion Proceedings of the ACM on Web Conference 2025)
- Text2Reaction : Enabling Reactive Task Planning Using Large Language Models(Zejun Yang, Li Ning, Haitao Wang, Tianyu Jiang, Shaolin Zhang, Shaowei Cui, Hao Jiang, Chunpeng Li, Shuo Wang, Zhaoqi Wang, 2024, IEEE Robotics and Automation Letters)
- Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning(Yun Qu, Yuhang Jiang, Boyuan Wang, Yixiu Mao, Qi Cheems Wang, Chang Liu, Xiangyang Ji, 2024, AAAI Conference on Artificial Intelligence)
- MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World(Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- DataFactory: Collaborative multi-agent framework for advanced table question answering(Tong Wang, Chi Jin, Yongkang Chen, Huan Deng, Xiaohui Kuang, Gang Zhao, 2026, Information Processing & Management)
- Leveraging Fine-Tuned LLMs, RAG and ReACT for Enhanced Academic Document Analysis and Automated Research Proposal Generation(Ammar Helmey Iskandar, Ezzatul Akmal Kamaru Zaman, Azliza Mohd Ali, Farah Syazwani Mohamed Rashid, 2025, 2025 6th International Conference on Artificial Intelligence and Data Sciences (AiDAS))
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent(Xiaohan Wang, Yuhui Zhang, Orr Zohar, S. Yeung-Levy, 2024, European Conference on Computer Vision)
- TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning(Chuang Jiang, Mingyue Cheng, Xiaoyu Tao, Qingyang Mao, Ouyang Jie, Qi Liu, 2025, Web Search and Data Mining)
- An LLM Agent-Based Complex Semantic Table Annotation Approach(Yilin Geng, Shujing Wang, Chuan-Ju Wang, Keqing He, Yanfei Lv, Ying Wang, Zaiwen Feng, Xiaoying Bai, 2025, International Conference on Advanced Data Mining and Applications)
- Exploring Applicability of LLM-Powered Autonomous Agents to Solve Real-life Problems: Microsoft Entra ID Administration Agent (MEAN)(Roberto Rodriguez, Nestori Syynimaa, 2024, Proceedings of the 26th International Conference on Enterprise Information Systems)
- UrbanKGent: A Unified Large Language Model Agent Framework for Urban Knowledge Graph Construction(Yansong NING, Hao Liu, 2024, Neural Information Processing Systems)
- On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software(Ali Nouri, Johan Andersson, Kailash De Jesus Hornig, Zhennan Fei, Emil Knabe, Håkan Sivencrona, Beatriz Cabrero-Daniel, Christian Berger, 2025, Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering)
- RS_DeepReason: LLM-Driven Deep Reasoning for Multigranularity Remote Sensing Scene Interpretation(Cheng Yang, Jia Zhang, Qiujun Li, Wang Guo, Haifeng Li, 2026, IEEE Geoscience and Remote Sensing Letters)
- Resource Allocation for IRS-Assisted V2I Anti-Jamming Communications in Interweave CIoV Networks: A Transformer-Enhanced Multi-Agent DRL Method(Jun Wang, Feng Wu, Rong Wang, Ruiquan Lin, Liang Wu, Feng Shu, 2026, IEEE Transactions on Wireless Communications)
- Simulating Social Behavior of LLM-Based Autonomous Negotiator Agents in a Game-Theoretical Framework Using Multi-Agent Systems(Ahmad Mouri Zadeh Khaki, Ahyoung Choi, Laleh Seyyed-Kalantari, 2025, International Journal of Human–Computer Interaction)
- TableZoomer: a collaborative agent framework for large-scale table question answering(Sishi Xiong, Ziyang He, Zhongjiang He, Yu Zhao, Changzai Pan, Jie Zhang, Shuangyong Song, Yongxiang Li, 2025, Vicinagearth)
- Thematic-LM: A LLM-based Multi-agent System for Large-scale Thematic Analysis(Tingrui Qiao, Caroline Walker, Chris Cunningham, Yun Sing Koh, 2025, Proceedings of the ACM on Web Conference 2025)
- OrcaLoca: An LLM Agent Framework for Software Issue Localization(Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, Jishen Zhao, 2025, International Conference on Machine Learning)
- RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing(Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, Xiangyu Zhang, 2025, International Conference on Machine Learning)
- RTLFixer: Automatically Fixing RTL Syntax Errors with Large Language Models(Yun-Da Tsai, Mingjie Liu, Haoxing Ren, 2023, Proceedings of the 61st ACM/IEEE Design Automation Conference)
- CurriculumPT: LLM-Based Multi-Agent Autonomous Penetration Testing with Curriculum-Guided Task Scheduling(Xingyu Wu, Yunzhe Tian, Yuanwan Chen, Ping Ye, Xiaoshu Cui, Jingqi Jia, Shouyang Li, Jiqiang Liu, Wenjia Niu, 2025, Applied Sciences)
- First Field-Trial Demonstration of L4 Autonomous Optical Network for Distributed AI Training Communication: An LLM-Powered Multi-AI-Agent Solution(Yihao Zhang, Qizhi Qiu, Xiaomin Liu, Dianxuan Fu, Xingyu Liu, Leyan Fei, Yuming Cheng, Lilin Yi, Weisheng Hu, Q. Zhuge, 2025, 2025 European Conference on Optical Communications (ECOC))
- MuaLLM: A Multimodal Large Language Model Agent for Circuit Design Assistance with Hybrid Contextual Retrieval-Augmented Generation(Pravallika Abbineni, Saoud Aldowaish, Colin Liechty, Soroosh Noorzad, Ali Ghazizadeh, Morteza Fayazi, 2025, 2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC))
- ChatEDA: A Large Language Model Powered Autonomous Agent for EDA(Haoyuan Wu, Zhuolun He, Xinyun Zhang, Xufeng Yao, Su Zheng, Haisheng Zheng, Bei Yu, 2023, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
- LayoutCopilot: An LLM-Powered Multiagent Collaborative Framework for Interactive Analog Layout Design(Bingyang Liu, Haoyi Zhang, Xiaohan Gao, Zichen Kong, Xiyuan Tang, Yibo Lin, Runsheng Wang, Ru Huang, 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
- An LLM-Enabled Multi-Agent Autonomous Mechatronics Design Framework(Zeyu Wang, Frank P.-W. Lo, Qian Chen, Yongqi Zhang, Chen Lin, Xu Chen, Zhenhua Yu, Alex J. Thompson, Eric M. Yeatman, Benny P. L. Lo, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning(Alireza Ghafarollahi, Markus J. Buehler, 2024, Digital Discovery)
- Large Language Model Agent as a Mechanical Designer(Yayati Jadhav, A. Farimani, 2024, Journal of engineering design)
- Large Language Model-Enabled Multi-Agent Manufacturing Systems(Jonghan Lim, B. Vogel-Heuser, Ilya Kovalenko, 2024, 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE))
- OpenFOAMGPT: A retrieval-augmented large language model (LLM) agent for OpenFOAM-based computational fluid dynamics(Sandeep Pandey, Ran Xu, Wenkang Wang, Xu Chu, 2025, Physics of Fluids)
- FLAG-Trader: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading(Guojun Xiong, Zhiyang Deng, Keyi Wang, Yupeng Cao, Haohang Li, Yangyang Yu, Xueqing Peng, Mingquan Lin, Kaleb E Smith, Xiao-Yang Liu, Jimin Huang, Sophia Ananiadou, Qianqian Xie, 2025, Annual Meeting of the Association for Computational Linguistics)
- Can Large Language Model Agents Simulate Human Trust Behaviors?(Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Shiyang Lai, Kai Shu, Adel Bibi, Ziniu Hu, Philip H. S. Torr, Bernard Ghanem, G. Li, 2024, Neural Information Processing Systems)
- ElliottAgents: A Natural Language-Driven Multi-Agent System for Stock Market Analysis and Prediction(Jarosław A. Chudziak, Michał Wawer, 2025, Pacific Asia Conference on Language, Information and Computation)
- TrendSim: Simulating Trending Topics in Social Media Under Poisoning Attacks with LLM-based Multi-agent System(Zeyu Zhang, Jianxun Lian, Chen Ma, Yaning Qu, Ye Luo, Lei Wang, Rui Li, Xu Chen, Yankai Lin, Ledell Wu, Xing Xie, Ji-Rong Wen, 2024, North American Chapter of the Association for Computational Linguistics)
- Large Language Model-Based Bidding Behavior Agent and Market Sentiment Agent-Assisted Electricity Price Prediction(Xin Lu, Jing Qiu, Yi Yang, Chenxi Zhang, Jiafeng Lin, Sihai An, 2025, IEEE Transactions on Energy Markets, Policy and Regulation)
- Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents(Tianmi Ma, Jiawei Du, Wenxin Huang, Wenjie Wang, Liang Xie, Xian Zhong, Joey Tianyi Zhou, 2025, Findings of the Association for Computational Linguistics: EMNLP 2025)
- Nicer Than Humans: How do Large Language Models Behave in the Prisoner's Dilemma?(Nicoló Fontana, Francesco Pierri, L. Aiello, 2024, International Conference on Web and Social Media)
- AutoWebGLM: A Large Language Model-based Web Navigating Agent(Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang, 2024, Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
- WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration(Yao Zhang, Zijian Ma, Yunpu Ma, Zhen Han, Yu Wu, Volker Tresp, 2024, AAAI Conference on Artificial Intelligence)
- DeepTx: Real-Time Transaction Risk Analysis via Multi-Modal Features and LLM Reasoning(Yixuan Liu, Xinlei Li, Yi Li, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE))
- Conversational health agents: a personalized large language model-powered agent framework(Mahyar Abbasian, Iman Azimi, Amir M. Rahmani, Ramesh C. Jain, 2025, JAMIA Open)
- MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling(Yakun Zhu, Shaohang Wei, Xu Wang, Kui Xue, Xiaofan Zhang, Shaoting Zhang, 2024, North American Chapter of the Association for Computational Linguistics)
- ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning(Ling Yue, Sixue Xing, Jintai Chen, Tianfan Fu, 2024, Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics)
- Advancing Healthcare Automation: Multi-Agent System for Medical Necessity Justification(Himanshu Pandey, Akhil Amod, Shivang Kumar, 2024, Workshop on Biomedical Natural Language Processing)
- An LLM-Based Agentic Network Traffic Incident-Report Approach Towards Explainable-AI Network Defense(Chia-Hong Chou, Arjun Sudheer, Younghee Park, 2026, Journal of Sensor and Actuator Networks)
- When Large Language Model Agents Meet 6G Networks: Perception, Grounding, and Alignment(Minrui Xu, D. Niyato, Jiawen Kang, Zehui Xiong, Shiwen Mao, Zhu Han, Dong In Kim, K. B. Letaief, 2024, IEEE Wireless Communications)
- Task-Oriented Communications for Agentic IoT: An LLM-Driven QoS/Security Policy Generation via Dynamic Model Context Protocol(Shuaishuai Guo, Jiabing Zhu, Jia Ye, Anbang Zhang, Geyong Min, 2025, 2025 Seventeenth International Conference on Wireless Communications and Signal Processing (WCSP))
- LLM Enabled Multi-Agent System for 6G Networks: Framework and Method of Dual-Loop Edge-Terminal Collaboration(Zheyan Qu, Wenbo Wang, Zitong Yu, Boquan Sun, Yang Li, Xing Zhang, 2025, IEEE Communications Magazine)
- G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems(Shilong Wang, Gui-Min Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, Yang Wang, 2025, Annual Meeting of the Association for Computational Linguistics)
- Insight Agents: An LLM-Based Multi-Agent System for Data Insights(Jincheng Bai, Zhenyu Zhang, Jennifer Zhang, Jason Zhu, 2025, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)
- Evaluating the Efficacy of LLM-Based Reasoning for Multiobjective HPC Job Scheduling(Prachi Jadhav, Hongwei Jin, Ewa Deelman, Prasanna Balaprakash, 2025, Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis)
- An LLM-Powered Agent for Real-Time Analysis of the Vietnamese IT Job Market(Minh-Thuan Nguyen, T. Vo-Thanh, Thai-Duy Dinh, Xuan-Quang Phan, Tan-Ha Mai, L. Lê, 2025, 2025 19th International Conference on Advanced Computing and Analytics (ACOMPA))
- Agentic Workflows for Improving Large Language Model Reasoning in Robotic Object-Centered Planning(Jesús Moncada-Ramírez, José Luis Matez-Bandera, Javier González Jiménez, J. Ruiz-Sarmiento, 2025, Robotics)
- Cognitive Agents in Urban Mobility: Integrating LLM Reasoning into Multi-Agent Simulations(Christian Calderón, P. Martí, Jaume Jordán, Javier Palanca, V. Julián, 2025, Sensors)
- See Widely, Think Wisely: Toward Designing a Generative Multi-agent System to Burst Filter Bubbles(Yu Zhang, Jingwei Sun, Li Feng, Cen Yao, Mingming Fan, Liuxin Zhang, Qianying Wang, Xin Geng, Yong Rui, 2024, Proceedings of the CHI Conference on Human Factors in Computing Systems)
- Transformer-Based Intelligent Tutoring System for Communication Skill Development(S. Venkatalakshmi, M. Swetha, A. Valarmathi, E. Sam Abishek, C. Suja, N. Deepa, 2025, 2025 IEEE 5th International Conference on ICT in Business Industry & Government (ICTBIG))
- Large Language Model Agents for Radio Map Generation and Wireless Network Planning(Hong Quan, Wanli Ni, Tong Zhang, Xiangyu Ye, Ziyi Xie, Shuai Wang, Yuanwei Liu, Hui Song, 2025, IEEE Networking Letters)
- Leveraging Collective Intelligence in Agile Sprint Planning: A Comparative Study of LLM Architectures(Tenaaz Aqthari Syed Muneer, 2025, 2025 International Conference on Smart & Sustainable Technology (INCSST))
- EditDuet: A Multi-Agent System for Video Non-Linear Editing(Marcelo Sandoval-Castañeda, Bryan Russell, Josef Sivic, Gregory Shakhnarovich, Fabian Caba Heilbron, 2025, Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers)
- CoCMT: Communication-Efficient Cross-Modal Transformer for Collaborative Perception(Rujia Wang, Xiangbo Gao, Hao Xiang, Runsheng Xu, Zhengzhong Tu, 2025, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- "Nuclear Deployed!": Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents(Rongwu Xu, Xiaojian Li, Shuo Chen, Wei Xu, 2025, Annual Meeting of the Association for Computational Linguistics)
- Chat Demeter: a multi-agent system for plant disease diagnosis integrating CNN-transformer models(Sainan Zhang, 2026, Frontiers in Plant Science)
- AGILE: A Novel Reinforcement Learning Framework of LLM Agents(Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, Hang Li, 2024, Advances in Neural Information Processing Systems 37)
- Safe Offline-to-Online Multi-Agent Decision Transformer: A Safety Conscious Sequence Modeling Approach(Aamir Bader Shah, Yu Wen, Jiefu Chen, Xuqing Wu, Xin Fu, 2024, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance(Ya-Ting Lu, Shenzhi Yang, Cheng Qian, Gui-Fang Chen, Qinyu Luo, Yesai Wu, Huadong Wang, X. Cong, Zhong Zhang, Yankai Lin, Weiwen Liu, Yasheng Wang, Zhiyuan Liu, Fangming Liu, Maosong Sun, 2024, International Conference on Learning Representations)
- LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent(Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai, 2023, 2024 IEEE International Conference on Robotics and Automation (ICRA))
- Large‐Language‐Model‐Based AI Agent for Organic Semiconductor Device Research(Qian Zhang, Yongxu Hu, Jiaxin Yan, Hengyue Zhang, Xinyi Xie, Jie Zhu, Huchao Li, Xinxin Niu, Liqiang Li, Yajing Sun, Wenping Hu, 2024, Advanced Materials)
Safety, Reliability Evaluation, and Analysis Frameworks
Research on the robustness, uncertainty, adversarial defense, and security compliance of agent systems in deployment, together with general-purpose evaluation benchmarks and visual-analysis methods targeting agents.
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration(Lin Xu, Zhiyuan Hu, Daquan Zhou, Hongyu Ren, Zhen Dong, Kurt Keutzer, See-Kiong Ng, Jiashi Feng, 2023, Conference on Empirical Methods in Natural Language Processing)
- LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation(Guobin Zhu, Rui Zhou, Wenkang Ji, Shiyu Zhao, 2025, IEEE Robotics and Automation Letters)
- AgentLens: Visual Analysis for Agent Behaviors in LLM-Based Autonomous Systems(Jiaying Lu, Bo Pan, Jieyi Chen, Yingchaojie Feng, Jingyuan Hu, Yuchen Peng, Wei Chen, 2024, IEEE Transactions on Visualization and Computer Graphics)
- An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring(Sana Ebrahimi, Mohsen Dehghankar, Abolfazl Asudeh, 2025, No journal)
- Autonomous Intersection Management via Prior-Enhanced Multi-Agent Constrained Decision Transformer(Rui Zhao, Yuze Fan, Yun Li, Kui Wang, Chengyuan Zheng, Fei Gao, Zhenhai Gao, 2025, IEEE Transactions on Intelligent Transportation Systems)
- 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer(Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, Ian Reid, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- ColaCare: Enhancing Electronic Health Record Modeling through Large Language Model-Driven Multi-Agent Collaboration(Zixiang Wang, Yinghao Zhu, Huiya Zhao, Xiaochen Zheng, Tianlong Wang, Wen Tang, Yasha Wang, Chengwei Pan, Ewen M. Harrison, Junyi Gao, Liantao Ma, 2024, Proceedings of the ACM on Web Conference 2025)
- Toward Interpretable and Persistent Personalization: A Memory-Augmented Agent Framework for LLM-Based Travel Planning(Ke Wang, Shuai Yan, Hao Yuan, Yanling Huang, Yuhang Wu, Fei Li, Shengying Yang, Huan Deng, 2025, IEEE Access)
- An Advanced Driving Agent with the Multimodal Large Language Model for Autonomous Vehicles(Junzhou Chen, Sidi Lu, 2024, 2024 IEEE International Conference on Mobility, Operations, Services and Technologies (MOST))
- Equipping Language Models with Tool Use Capability for Tabular Data Analysis in Finance(A. Theuma, Ehsan Shareghi, 2024, Conference of the European Chapter of the Association for Computational Linguistics)
- An LLM-Assisted AUV 3-D Path Planning Scheme Under Ocean Current Interference via Reinforcement Learning(Jiabao Wen, Zhen Li, Meng Xi, Jingyi He, 2025, IEEE Internet of Things Journal)
- AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation(Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, Jian-Guang Lou, Qingwei Lin, Ping Luo, S. Rajmohan, Dongmei Zhang, 2024, Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1)
- ReAct-Driven SOC Agent with Integrated Detection Engineering for AI-Enhanced Autonomous Alert Handling(Tarek Radah, H. Chaoui, Chaimae Saadi, 2025, Journal of Information Systems Engineering and Management)
- AI-powered Automatic Item Generation for Psychological Tests: A Conceptual Framework for an LLM-based Multi-Agent AIG System(Philseok Lee, Mi-Kyung Son, Zihao Jia, 2025, Journal of Business and Psychology)
- Personality-Driven Decision-Making in LLM-Based Autonomous Agents(Lewis Newsham, Daniel Prince, 2025, Adaptive Agents and Multi-Agent Systems)
- Transformer-Based Task-Oriented Joint Source-Channel Coding for Edge AI Agent(Jian Du, Changyang She, Zhaoquan Geng, Fuchun Zheng, 2025, 2025 IEEE/CIC International Conference on Communications in China (ICCC Workshops))
- Transformer-Based Reinforcement Learning for Scalable Multi-UAV Area Coverage(Dezhi Chen, Qi Qi, Qianlong Fu, Jingyu Wang, Jianxin Liao, Zhu Han, 2024, IEEE Transactions on Intelligent Transportation Systems)
- The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective(Jiin Kim, Byeong-Gon Shin, Jin-Won Chung, Minsoo Rhu, 2025, 2026 IEEE International Symposium on High Performance Computer Architecture (HPCA))
- Richelieu: Self-Evolving LLM-Based Agents for AI Diplomacy(Zhenyu Guan, Xiangyu Kong, Fangwei Zhong, Yizhou Wang, 2024, Advances in Neural Information Processing Systems 37)
- LLM-powered Multi-agent Framework for Goal-oriented Learning in Intelligent Tutoring System(Tianfu Wang, Yi Zhan, Jianxun Lian, Zhengyu Hu, Nicholas Jing Yuan, Qi Zhang, Xing Xie, Hui Xiong, 2025, Companion Proceedings of the ACM on Web Conference 2025)
- An Intelligent Maneuver Decision-Making Approach for Air Combat Based on Deep Reinforcement Learning and Transformer Networks(Wentao Li, Feng Fang, Dongliang Peng, Shuning Han, 2024, Entropy)
- Design and evaluation of an Autonomous Cyber Defence agent using DRL and an augmented LLM(Johannes F. Loevenich, Erik Adler, Tobias Hürten, R. R. F. Lopes, 2025, Computer Networks)
- Dual-Stream Hierarchical Mixed-Routing Graph Attention Network Integrating ReAct Agent-Driven Embeddings for Phage-Host Interaction Prediction(Song Jiang, Yue Huang, Xianjun Shen, Weizhong Zhao, 2025, 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM))
- An Agentic Framework for Social Event Forecasting: Approaches Using Causality Contextualized Chain of Thought(A. Thakur, Aditya Sampath, Siddharth Krishnan, 2025, 2025 IEEE International Conference on Data Mining Workshops (ICDMW))
- Enhancing AI Systems with Agentic Workflows Patterns in Large Language Model(Aditi Singh, Abul Ehtesham, Saket Kumar, T. T. Khoei, 2024, 2024 IEEE World AI IoT Congress (AIIoT))
- Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents(Michael Kirchhof, Gjergji Kasneci, Enkelejda Kasneci, 2025, International Conference on Machine Learning)
- Vision-Language Models Can Self-Improve Reasoning via Reflection(Kanzhi Cheng, Yantao Li, Fangzhi Xu, Jianbing Zhang, Hao Zhou, Yang Liu, 2024, North American Chapter of the Association for Computational Linguistics)
- Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance(Lifang Zheng, Jiawei Chen, Qinghong Yin, Jingyuan Zhang, Xinyi Zeng, Yu Tian, 2025, AAAI Conference on Artificial Intelligence)
- Hierarchical agent reflection for aligning LLM reasoning with clinical diagnostic processes(Xinda Wang, Xiaotong Li, Dengkang Zhao, Kehua Feng, Lei Liang, Zhiqiang Zhang, Keyan Ding, Huajun Chen, Bo Wan, Qiang Zhang, 2026, Health Information Science and Systems)
- DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics(Luke Yoffe, Alfonso Amayuelas, W. Wang, 2024, Findings of the Association for Computational Linguistics: EMNLP 2025)
- Large Language Model-based Human-Agent Collaboration for Complex Task Solving(Xueyang Feng, Zhi-Yuan Chen, Yujia Qin, Yankai Lin, Xu Chen, Zhiyuan Liu, Jirong Wen, 2024, Conference on Empirical Methods in Natural Language Processing)
- Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents(Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, Yongfeng Zhang, 2024, International Conference on Learning Representations)
- Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification(Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, Yang Zhang, 2024, Conference on Empirical Methods in Natural Language Processing)
- Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks(Rana Muhammad Shahroz Khan, Zhen Tan, Sukwon Yun, Charles Flemming, Tianlong Chen, 2025, Annual Meeting of the Association for Computational Linguistics)
- AgentSquare: Automatic LLM Agent Search in Modular Design Space(Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, Yong Li, 2024, International Conference on Learning Representations)
- Facilitating Trustworthy Human-Agent Collaboration in LLM-based Multi-Agent System oriented Software Engineering(Krishna Ronanki, 2025, Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering)
- Advanced Smart Contract Vulnerability Detection via LLM-Powered Multi-Agent Systems(Zhiyuan Wei, Jing Sun, Yuqiang Sun, Ye Liu, Daoyuan Wu, Zijian Zhang, Xianhao Zhang, Meng Li, Yang Liu, Chunmiao Li, Mingchao Wan, Jin Dong, Liehuang Zhu, 2025, IEEE Transactions on Software Engineering)
- Blended RAG-Enhanced LLM Multi-Agent Framework for Anomaly Detection and Alert Management(Hao Wang, Zhiying Wang, Lei Wang, Fangfang Dang, Shuhui Wang, Peng Lin, 2025, International Conference Smart Grid and Smart Cities)
- RecMind: Large Language Model Powered Agent For Recommendation(Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, Yingzhen Yang, 2023, Findings of the Association for Computational Linguistics: NAACL 2024)
- ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities(Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang, 2024, North American Chapter of the Association for Computational Linguistics)
- Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression(Peijie Dong, Zhenheng Tang, Xiang-Hong Liu, Lujun Li, Xiaowen Chu, Bo Li, 2025, International Conference on Machine Learning)
- ACEBench: A Comprehensive Evaluation of LLM Tool Usage(Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Yue-Run Huang, Xiangcheng Liu, Xinzhi Wang, Wulong Liu, 2025, Findings of the Association for Computational Linguistics: EMNLP 2025)
- ToolQA: A Dataset for LLM Question Answering with External Tools(Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, Chao Zhang, 2023, Neural Information Processing Systems)
- Offline Reinforcement Learning for LLM Multi-Step Reasoning(Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yiling Bao, Ziran Yang, Yi Wu, 2024, Annual Meeting of the Association for Computational Linguistics)
- Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning(Zhaohui Yang, Shilei Jiang, Chen Hu, Linjing Li, Shihong Deng, Daxin Jiang, 2025, Conference on Empirical Methods in Natural Language Processing)
- Cerebrum (AIOS SDK): A Platform for Agent Development, Deployment, Distribution, and Discovery(Balaji Rama, Kai Mei, Yongfeng Zhang, 2025, North American Chapter of the Association for Computational Linguistics)
This survey organizes research on AI agents and large language models into five core dimensions. The first is multi-agent architectures and collaborative communication, which examines interaction mechanisms in complex systems; the second is enhancing the underlying logic of reasoning, reflection, and planning; the third is tool use and task execution, focusing on agents' autonomous tool invocation and self-iteration; the fourth is broad vertical-industry deployment, covering key production domains such as engineering, finance, and scientific research; and the last is safety assurance, reliability-evaluation frameworks, and performance benchmarking for agent systems, with the aim of advancing agents from single-function validation toward trustworthy, controllable, industrial-grade applications.
A total of 205 related references.
Code auditing is the process of reviewing code with the aim of identifying bugs. Large Language Models (LLMs) have demonstrated promising capabilities for this task without requiring compilation, while also supporting user-friendly customization. However, auditing a code repository with LLMs poses significant challenges: limited context windows and hallucinations can degrade the quality of bug reports, and analyzing large-scale repositories incurs substantial time and token costs, hindering efficiency and scalability. This work introduces an LLM-based agent, RepoAudit, designed to perform autonomous repository-level code auditing. Equipped with agent memory, RepoAudit explores the codebase on demand by analyzing data-flow facts along feasible program paths within individual functions. It further incorporates a validator module to mitigate hallucinations by verifying data-flow facts and checking the satisfiability of path conditions associated with potential bugs, thereby reducing false positives. RepoAudit detects 40 true bugs across 15 real-world benchmark projects with a precision of 78.43%, requiring on average only 0.44 hours and $2.54 per project. Also, it detects 185 new bugs in high-profile projects, among which 174 have been confirmed or fixed. We have open-sourced RepoAudit at https://github.com/PurCL/RepoAudit.
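The validator idea in the RepoAudit abstract above, checking whether the path condition attached to a candidate bug is actually satisfiable before reporting it, can be sketched as a post-filter. This is a toy illustration under assumed names (`satisfiable`, `filter_reports`) with a brute-force boolean check standing in for the SMT-style reasoning a real validator would use; it is not the paper's implementation.

```python
from itertools import product

def satisfiable(path_condition, variables):
    """Brute-force check: does any boolean assignment make every clause true?
    Each clause is a predicate over an assignment dict (a toy stand-in for
    real path-condition solving)."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(clause(assignment) for clause in path_condition):
            return True
    return False

def filter_reports(candidate_bugs, variables):
    """Keep only bug reports whose guarding path condition is satisfiable,
    discarding hallucinated findings that lie on infeasible paths."""
    return [bug for bug in candidate_bugs
            if satisfiable(bug["path_condition"], variables)]

# Toy example: a report guarded by `p and not p` lies on an infeasible path.
bugs = [
    {"id": "use-after-free", "path_condition": [lambda a: a["p"]]},
    {"id": "phantom", "path_condition": [lambda a: a["p"], lambda a: not a["p"]]},
]
kept = filter_reports(bugs, ["p"])
```

The second report is dropped because no assignment satisfies its contradictory condition, which is exactly how such a validator reduces false positives.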
Existing LLM-enabled multi-agent frameworks are predominantly limited to digital or simulated environments and confined to narrowly focused knowledge domains, constraining their applicability to complex engineering tasks that require the design of physical embodiment, cross-disciplinary integration, and constraint-aware reasoning. This work proposes a multi-agent autonomous mechatronics design framework, integrating expertise across mechanical design, optimization, electronics, and software engineering to autonomously generate functional prototypes with minimal direct human design input. Operating primarily through a language-driven workflow, the framework incorporates structured human feedback to ensure robust performance under real-world constraints. To validate its capabilities, the framework is applied to a real-world challenge involving autonomous water-quality monitoring and sampling, where traditional methods are labor-intensive and ecologically disruptive. Leveraging the proposed system, a fully functional autonomous vessel was developed with optimized propulsion, cost-effective electronics, and advanced control. The design process was carried out by specialized agents, including a high-level planning agent responsible for problem abstraction and dedicated agents for structural, electronics, control, and software development. This approach demonstrates the potential of LLM-based multi-agent systems to automate real-world engineering workflows and reduce reliance on extensive domain expertise.
While autonomous driving systems and intelligent transportation infrastructures become increasingly software-defined and network-connected, ensuring their cybersecurity has become a critical component of traffic safety. Large language models (LLMs) have recently shown promise in automating aspects of penetration testing, yet most existing approaches remain limited to simple, single-step exploits. They struggle to handle complex, multi-stage vulnerabilities that demand precise coordination, contextual reasoning, and knowledge reuse. This is particularly problematic in safety-critical domains, such as autonomous vehicles, where subtle software flaws can cascade across interdependent subsystems. In this work, we present CurriculumPT, a novel LLM-based penetration testing framework specifically designed for the security of intelligent systems. CurriculumPT combines curriculum learning and a multi-agent system to enable LLM agents to progressively acquire and apply exploitation skills across Common Vulnerabilities and Exposures (CVE)-based tasks. Through a structured progression from simple to complex vulnerabilities, agents build and refine an experience knowledge base that supports generalization to new attack surfaces without requiring model fine-tuning. We evaluate CurriculumPT on 15 real-world vulnerability scenarios and demonstrate that it outperforms three state-of-the-art baselines by up to 18 percentage points in exploit success rate, while achieving superior efficiency in execution time and resource usage. Our results confirm that CurriculumPT is capable of autonomous, scalable penetration testing and knowledge transfer, laying the groundwork for intelligent security auditing of modern autonomous driving systems and other cyber-physical transportation platforms.
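The curriculum loop the CurriculumPT abstract describes, ordering tasks from simple to complex while accumulating an experience knowledge base that later tasks can draw on, can be sketched as follows. The names (`run_curriculum`, `toy_attempt`) and the difficulty-vs-experience success rule are illustrative assumptions, not the paper's mechanism.

```python
def run_curriculum(tasks, attempt, knowledge_base=None):
    """Order CVE-style tasks from easy to hard; after each success, store the
    exploit trace so later (harder) tasks can reuse it. `attempt` is a
    callable (task, prior_experience) -> (success, trace); a real system
    would wrap an LLM agent here."""
    kb = knowledge_base if knowledge_base is not None else []
    results = {}
    for task in sorted(tasks, key=lambda t: t["difficulty"]):
        success, trace = attempt(task, list(kb))
        results[task["name"]] = success
        if success:
            kb.append({"task": task["name"], "trace": trace})
    return results, kb

def toy_attempt(task, prior_experience):
    # Hypothetical rule: accumulated experience lets the agent handle
    # proportionally harder tasks (skills compound across the curriculum).
    ok = task["difficulty"] <= 1 + len(prior_experience)
    return ok, f"trace:{task['name']}"

tasks = [{"name": "rce", "difficulty": 3},
         {"name": "sqli", "difficulty": 1},
         {"name": "xss", "difficulty": 2}]
results, kb = run_curriculum(tasks, toy_attempt)
```

Run in the hard-first input order without the sort, the same toy agent would fail the first task; the curriculum ordering is what lets experience build up.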
Recent developments in Large Language Model (LLM) agents are revolutionizing Autonomous Software Engineering (ASE), enabling automated coding, problem fixes, and feature improvements. However, localization -- precisely identifying software problems by navigating to relevant code sections -- remains a significant challenge. Current approaches often yield suboptimal results due to a lack of effective integration between LLM agents and precise code search mechanisms. This paper introduces OrcaLoca, an LLM agent framework that improves accuracy for software issue localization by integrating priority-based scheduling for LLM-guided action, action decomposition with relevance scoring, and distance-aware context pruning. Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.
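OrcaLoca's priority-based scheduling of LLM-guided actions with relevance scoring, per the abstract above, can be sketched with a standard priority queue. The scoring formula (relevance discounted by distance from the current code focus, standing in for distance-aware pruning) and all names here are assumptions for illustration, not the paper's actual weights.

```python
import heapq

def schedule_actions(actions):
    """Priority queue over candidate localization actions: higher score
    first (heapq is a min-heap, so the score is negated). The insertion
    index breaks ties deterministically."""
    heap = []
    for order, act in enumerate(actions):
        # Distance-aware score: relevance discounted by graph distance
        # from the current focus (illustrative formula).
        score = act["relevance"] / (1 + act["distance"])
        heapq.heappush(heap, (-score, order, act["name"]))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

plan = schedule_actions([
    {"name": "read_far_util", "relevance": 0.9, "distance": 4},
    {"name": "read_suspect_fn", "relevance": 0.8, "distance": 0},
    {"name": "grep_symbol", "relevance": 0.3, "distance": 1},
])
```

Note how the nearby, moderately relevant function outranks the highly relevant but distant one, which is the intuition behind discounting by distance.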
We introduce DriveAgent, a modular multi-agent autonomous driving framework that leverages large language model (LLM) reasoning combined with multimodal sensor fusion for autonomous driving. DriveAgent orchestrates specialized agents operating on camera, Light Detection and Ranging (LiDAR), Inertial Measurement Unit (IMU), and Global Positioning System (GPS) with LLM-driven analytical processes to deliver temporally aligned perception, causal reasoning, and action recommendations. The framework operates through a modular agent-based pipeline comprising four principal modules: (i) a descriptive analysis agent identifying critical sensor data events based on filtered timestamps, (ii) dedicated vehicle-level analysis conducted by LiDAR and vision agents that collaboratively assess vehicle conditions and movements, (iii) environmental reasoning and causal analysis agents explaining contextual changes and their underlying mechanisms, and (iv) an urgency-aware decision-generation agent prioritizing insights and proposing timely maneuvers. This modular design empowers the LLM to effectively coordinate specialized perception and reasoning agents, delivering cohesive, interpretable insights into complex autonomous driving scenarios. Extensive experiments demonstrate that DriveAgent substantially outperforms baseline methods, achieving a 26.31% improvement in vehicle reasoning and consistent enhancements of up to 2.85% in environmental reasoning. These results highlight the effectiveness of our LLM-driven multi-agent sensor fusion framework in boosting the robustness and reliability of autonomous driving systems.
Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces Repair Agent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. Repair Agent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable Repair Agent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates Repair Agent's effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270k tokens per bug, which, under the current pricing of OpenAI's GPT-3.5 model, translates to 14 cents per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.
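The finite state machine that the RepairAgent abstract says guides tool invocation can be sketched as a transition table: in each state only certain events (tool outcomes) advance the agent, constraining what the LLM may do next. The state and event names below are hypothetical; the paper's actual machine differs.

```python
# Hypothetical repair states/events, not RepairAgent's actual state machine.
TRANSITIONS = {
    "collect_info": {"info_done": "search_ingredients"},
    "search_ingredients": {"ingredients_found": "write_fix"},
    "write_fix": {"fix_written": "validate"},
    "validate": {"tests_pass": "done", "tests_fail": "collect_info"},
}

def step(state, event):
    """Advance the repair FSM; unrecognized events leave the agent in place,
    so the LLM can only invoke tools legal in the current state."""
    return TRANSITIONS.get(state, {}).get(event, state)

def run(events, state="collect_info"):
    """Replay a sequence of tool outcomes and record the state trail."""
    trail = [state]
    for ev in events:
        state = step(state, ev)
        trail.append(state)
    return trail
```

A failing validation loops the agent back to information gathering, matching the abstract's description of freely interleaving gathering, fixing, and validating under FSM guidance.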
No abstract available
Cloud Operations (CloudOps) is a rapidly growing field focused on the automated management and optimization of cloud infrastructure, which is essential for organizations navigating increasingly complex cloud environments. MontyCloud Inc. is one of the major companies in the CloudOps domain that leverages autonomous bots to manage cloud compliance, security, and continuous operations. To make the platform more accessible and effective to the customers, we leveraged the use of GenAI. Developing a GenAI-based solution for autonomous CloudOps for the existing MontyCloud system presented us with various challenges such as i) diverse data sources; ii) orchestration of multiple processes; and iii) handling complex workflows to automate routine tasks. To this end, we developed MOYA, a multi-agent framework that leverages GenAI and balances autonomy with the necessary human control. This framework integrates various internal and external systems and is optimized for factors like task orchestration, security, and error mitigation while producing accurate, reliable, and relevant insights by utilizing Retrieval Augmented Generation (RAG). Evaluations of our multi-agent system with the help of practitioners as well as using automated checks demonstrate enhanced accuracy, responsiveness, and effectiveness over non-agentic approaches across complex workflows.
Multi-agent Large Language Model (LLM) systems have been leading the way in applied LLM research across a number of fields. One notable area is software development, where researchers have advanced the automation of code implementation, code testing, code maintenance, inter alia, using LLM agents. However, software development is a multifaceted environment that extends beyond just code. As such, a successful LLM system must factor in multiple stages of the software development life-cycle (SDLC). In this paper, we propose a vision for ALMAS, an Autonomous LLM-based Multi-Agent Software Engineering framework, which follows the above SDLC philosophy such that it may work within an agile software development team to perform several tasks end-to-end. ALMAS aligns its agents with agile roles, and can be used in a modular fashion to seamlessly integrate with human developers and their development environment. We showcase the progress towards ALMAS through our published works and a use case demonstrating the framework, where ALMAS is able to seamlessly generate an application and add new feature.
We demonstrate the first cross-domain cross-layer level-4 autonomous optical network via a multi-AI-agent system. Field trials show ~98% task completion rate across the distributed AI training lifecycle, ~3.2× higher than single agents using advanced LLMs.
Simulation is a widely used approach for evaluating system performance, robustness, and potential issues during design and testing. Large Language Models (LLMs) have recently shown strong potential in autonomous agent systems, including negotiation tasks, a core aspect of commerce. This paper evaluates LLM-based autonomous negotiator agents (LANAs) in a buyer-seller bargaining game to assess their decision-making and reasoning. We simulate interactions between agents embodying contrasting social behaviors: (a) Cunning vs. Kind, and (b) Greedy vs. Generous. By analyzing both the game outcomes and the agents’ internal reasoning, we find that LLMs can effectively simulate distinct social behaviors in both dialogue and decision-making. Our results offer insights into how social traits affect negotiation dynamics, emphasizing the importance of clear policy design to ensure fairness and reliability in LANA-based systems.
LLM-based multi-agent systems (MAS) have shown promise in tackling complex tasks. However, existing solutions often suffer from limited agent coordination and heavy reliance on predefined Standard Operating Procedures (SOPs), which demand extensive human input. To address these limitations, we propose MegaAgent, a large-scale autonomous LLM-based multi-agent system. MegaAgent generates agents based on task complexity and enables dynamic task decomposition, parallel execution, efficient communication, and comprehensive system monitoring of agents. In evaluations, MegaAgent demonstrates exceptional performance, successfully developing a Gobang game within 800 seconds and scaling up to 590 agents in a national policy simulation to generate multi-domain policies. It significantly outperforms existing systems, such as MetaGPT, in both task completion efficiency and scalability. By eliminating the need for predefined SOPs, MegaAgent demonstrates exceptional scalability and autonomy, setting a foundation for advancing true autonomy in MAS. Our code is available at https://github.com/Xtra-Computing/MegaAgent .
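MegaAgent's core move per the abstract above, generating agents based on task complexity rather than a predefined SOP, can be sketched as recursive task decomposition with one agent spawned per leaf subtask. The size field as a complexity proxy and the halving rule are assumptions for illustration; a real planner would let an LLM estimate complexity and propose the split.

```python
def decompose(task, max_size=2):
    """Recursively split a task until each piece fits one agent. `size` is
    a toy proxy for the complexity estimate an LLM planner would produce."""
    if task["size"] <= max_size:
        return [task]
    half = task["size"] // 2
    left = {"name": task["name"] + "/a", "size": half}
    right = {"name": task["name"] + "/b", "size": task["size"] - half}
    return decompose(left, max_size) + decompose(right, max_size)

def spawn_agents(task):
    """One agent per leaf subtask: the agent count scales with task
    complexity instead of being fixed in advance by an SOP."""
    return [f"agent:{t['name']}" for t in decompose(task)]

agents = spawn_agents({"name": "policy_sim", "size": 8})
```

Doubling the root task's size doubles the number of spawned agents, which is the scaling behavior (up to 590 agents in the policy simulation) the abstract highlights.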
Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages--literature review, experimentation, and report writing to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluate the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.
Recently, Large Language Model based Autonomous System (LLMAS) has gained great popularity for its potential to simulate complicated behaviors of human societies. One of its main challenges is to present and analyze the dynamic events evolution of LLMAS. In this work, we present a visualization approach to explore the detailed statuses and agents’ behavior within LLMAS. Our approach outlines a general pipeline that organizes raw execution events from LLMAS into a structured behavior model. We leverage a behavior summarization algorithm to create a hierarchical summary of these behaviors, arranged according to their sequence over time. Additionally, we design a cause trace method to mine the causal relationship between agent behaviors. We then develop AgentLens, a visual analysis system that leverages a hierarchical temporal visualization for illustrating the evolution of LLMAS, and supports users to interactively investigate details and causes of agents’ behaviors. Two usage scenarios and a user study demonstrate the effectiveness and usability of our AgentLens.
We design and demonstrate the first field trial of an LLM-powered AI Agent for ADON. Three operation modes of the Agent are proposed for network lifecycle management, handling wavelength add/drop, soft/hard failures, and power optimization.
In advanced technology nodes, an integrated design rule checker (DRC) is often used in place-and-route tools for fast power-performance-area optimization loops. Implementing integrated DRC checkers that meet the standard of commercial DRC tools demands extensive human expertise to interpret foundry specifications, analyze layouts, and debug code iteratively. This labor-intensive process, which must be repeated for every update of a technology node, prolongs the turnaround time of circuit design. In this paper, we present DRC-Coder, a multi-agent framework with vision capabilities for automated DRC code generation. By incorporating vision language models and large language models (LLMs), DRC-Coder can effectively process textual, visual, and layout information to perform rule interpretation and coding via two specialized LLMs. We also design an auto-evaluation function for the LLMs to enable DRC code debugging. Experimental results show that, targeting a sub-3nm technology node for a state-of-the-art standard cell layout tool, DRC-Coder achieves a perfect F1 score of 1.000 in generating DRC code that meets the standard of a commercial DRC tool, far outperforming standard prompting techniques (F1=0.631). DRC-Coder can generate code for each design rule within four minutes on average, which significantly accelerates technology advancement and reduces engineering costs.
The common-sense reasoning abilities and vast general knowledge of large language models (LLMs) make them a natural fit for interpreting user requests in a smart home assistant context. LLMs, however, lack specific knowledge about the user and their home, which limits their potential impact. The smart home agent with grounded execution (SAGE) overcomes these and other limitations by using a scheme in which a user request triggers an LLM-controlled sequence of discrete actions. These actions can be used to retrieve information, interact with the user, or manipulate device states. SAGE controls this process through a dynamically constructed tree of LLM prompts, which help it decide which action to take next, whether an action was successful, and when to terminate the process. The SAGE action set augments an LLM’s capabilities to support some of the most critical requirements for a smart home assistant. These include: flexible and scalable user preference management (“Is my team playing tonight?”), access to any smart device’s full functionality without device-specific code via API reading (“Turn down the screen brightness on my dryer”), persistent device state monitoring (“Remind me to throw out the milk when I open the fridge”), natural device references using only a photo of the room (“Turn on the lamp on the dresser”), and more. We introduce a benchmark of 50 new and challenging smart home tasks where SAGE achieves a 76% success rate, significantly outperforming existing LLM-enabled baselines (30% success rate).
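The control pattern above — an LLM deciding which discrete action to take next, checking success, and deciding when to terminate — can be sketched as a loop over an action set. This is a hypothetical sketch of the pattern, not SAGE's prompt-tree implementation; the controller and actions are stubs standing in for LLM prompts and device APIs.

```python
# Hypothetical sketch of LLM-controlled discrete-action sequencing.
# The controller stub stands in for the dynamically constructed prompt tree.

def run_agent(request, choose_action, actions, max_steps=10):
    history = []
    for _ in range(max_steps):
        name = choose_action(request, history)
        if name == "terminate":
            break
        result = actions[name](request)  # retrieve info / manipulate device state
        history.append((name, result))
    return history

actions = {
    "retrieve_info": lambda r: "device list",
    "set_device":    lambda r: "lamp on",
}

def controller(request, history):  # stub for the LLM decision prompts
    done = [name for name, _ in history]
    if "retrieve_info" not in done:
        return "retrieve_info"
    if "set_device" not in done:
        return "set_device"
    return "terminate"

hist = run_agent("Turn on the lamp on the dresser", controller, actions)
```

The history passed back to the controller is what lets the LLM judge whether the previous action succeeded before choosing the next one.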
Microsoft Entra ID is Microsoft’s identity and access management solution, used by many public and private sector organisations globally. In March 2023, Microsoft retired two PowerShell modules that had enabled automation of administrative tasks, such as user management. The replacement module is based on the Microsoft Graph API, and its effective usage would require administrators to learn software development skills. In this paper, we report the results of work-in-progress research exploring the applicability of LLM-powered autonomous agents to real-life problems. We describe the design and proof-of-concept implementation of MEAN, an agent that performs Entra ID administrative tasks using the Microsoft Graph API based on natural language prompts. The results show that LLM-powered autonomous agents can perform at least simple Entra ID administrative tasks. This indicates that such agents could ease the administrative burden by removing the need to learn software development skills.
Large Language Model (LLM)-based Multi-agent Systems (MAS) have demonstrated remarkable capabilities in various complex tasks, ranging from collaborative problem-solving to autonomous decision-making. However, as these systems become increasingly integrated into critical applications, their vulnerability to adversarial attacks, misinformation propagation, and unintended behaviors has raised significant concerns. To address this challenge, we introduce G-Safeguard, a topology-guided security lens and treatment for robust LLM-MAS, which leverages graph neural networks to detect anomalies on the multi-agent utterance graph and employs topological intervention for attack remediation. Extensive experiments demonstrate that G-Safeguard: (I) exhibits significant effectiveness under various attack strategies, recovering over 40% of the performance for prompt injection; (II) is highly adaptable to diverse LLM backbones and large-scale MAS; (III) can seamlessly combine with mainstream MAS with security guarantees. The code is available at https://github.com/wslong20/G-safeguard.
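The "detect then intervene" idea above reduces to a simple operation on the utterance graph: score agents for anomaly, then cut the outgoing message edges of flagged agents so suspect utterances stop propagating. This sketch is illustrative only; the anomaly scorer is a plain stub standing in for the paper's graph neural network, and the edge representation is an assumption.

```python
# Illustrative sketch (not the G-Safeguard implementation): flag anomalous
# agents on a multi-agent utterance graph and remove their outgoing edges.

def intervene(edges, scores, threshold=0.5):
    """edges: (src, dst) message edges; scores: agent -> anomaly score.
    Returns (flagged_agents, remaining_edges)."""
    flagged = {agent for agent, s in scores.items() if s >= threshold}
    kept = [(src, dst) for (src, dst) in edges if src not in flagged]
    return flagged, kept

edges = [("A", "B"), ("B", "C"), ("M", "C")]          # M is compromised
scores = {"A": 0.1, "B": 0.2, "C": 0.15, "M": 0.9}    # stub GNN outputs
flagged, kept = intervene(edges, scores)
```

In the real system the scores come from a GNN trained on utterance-graph features, and the intervention is what lets downstream agents recover performance after, e.g., a prompt-injection attack on one node.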
Large language models (LLMs) are evolving into autonomous decision-makers, raising concerns about catastrophic risks in high-stakes scenarios, particularly in Chemical, Biological, Radiological and Nuclear (CBRN) domains. Based on the insight that such risks can originate from trade-offs between the agent's Helpfulness, Harmlessness, and Honesty (HHH) goals, we build a novel three-stage evaluation framework, which is carefully constructed to effectively and naturally expose such risks. We conduct 14,400 agentic simulations across 12 advanced LLMs, with extensive experiments and analysis. Results reveal that LLM agents can autonomously engage in catastrophic behaviors and deception, without being deliberately induced. Furthermore, stronger reasoning abilities often increase, rather than mitigate, these risks. We also show that these agents can violate instructions and superior commands. On the whole, we empirically prove the existence of catastrophic risks in autonomous LLM agents. We release our code to foster further research.
Computational chemistry tools are widely used to study the behaviour of chemical phenomena. Yet, the complexity of these tools can make them inaccessible to non-specialists and challenging even for experts. In this work, we introduce El Agente Q, an LLM-based multi-agent system that dynamically generates and executes quantum chemistry workflows from natural language user prompts. The system is built on a novel cognitive architecture featuring a hierarchical memory framework that enables flexible task decomposition, adaptive tool selection, post-analysis, and autonomous file handling and submission. El Agente Q is benchmarked on six university-level course exercises and two case studies, demonstrating robust problem-solving performance (averaging >87% task success) and adaptive error handling through in situ debugging. It also supports longer-term, multi-step task execution for more complex workflows, while maintaining transparency through detailed action trace logs. Together, these capabilities lay the foundation for increasingly autonomous and accessible quantum chemistry.
The embedding of Large Language Models (LLMs) into autonomous agents is a rapidly developing field which enables dynamic, configurable behaviours without the need for extensive domain-specific training. In our previous work, we introduced SANDMAN, a Deceptive Agent architecture leveraging the Five-Factor OCEAN personality model, demonstrating that personality induction significantly influences agent task planning. Building on these findings, this study presents a novel method for measuring and evaluating how induced personality traits affect task-selection processes (specifically planning, scheduling, and decision-making) in LLM-based agents. Our results reveal distinct task-selection patterns aligned with induced OCEAN attributes, underscoring the feasibility of designing highly plausible Deceptive Agents for proactive cyber defense strategies.
Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 27 different types of attack/defense methods, and 7 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 corresponding defenses across 13 LLM backbones. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate of 84.30%, but limited effectiveness shown in current defenses, revealing important work still to be done on agent security for the community. We also introduce a new metric to evaluate the agents' capability to balance utility and security. Our code can be found at https://github.com/agiresearch/ASB.
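The core evaluation loop that a benchmark like ASB formalizes — run each attack against each agent, with and without a defense, and report the attack success rate — can be sketched as follows. The toy agent, injection, and defense here are stand-ins of my own, not ASB's actual scenarios or methods.

```python
# Hedged sketch of an attack/defense evaluation loop. Agents, attacks,
# and the success predicate are toy stand-ins for illustration only.

def attack_success_rate(agents, attacks, defense=None):
    successes, total = 0, 0
    for agent in agents:
        for attack in attacks:
            prompt = attack("do the task")
            if defense:
                prompt = defense(prompt)
            total += 1
            successes += agent(prompt)  # True if the attack worked
    return successes / total

# Toy agent that "succumbs" whenever an injected directive survives.
agent = lambda p: "IGNORE PREVIOUS" in p
inject = lambda p: p + " IGNORE PREVIOUS instructions"
strip_defense = lambda p: p.replace("IGNORE PREVIOUS", "[filtered]")

asr_undefended = attack_success_rate([agent], [inject])
asr_defended = attack_success_rate([agent], [inject], strip_defense)
```

Real benchmarks sweep this loop over many scenarios, tools, and LLM backbones, which is where averages like the 84.30% figure above come from.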
Chemical process optimization is crucial to maximize production efficiency and economic performance. Optimization algorithms, including gradient-based solvers, numerical methods, and parameter grid searches, become impractical when operating constraints are ill-defined or unavailable, requiring engineers to rely on subjective heuristics to estimate feasible parameter ranges. To address this constraint definition bottleneck, we present a multi-agent framework of large language model (LLM) agents that autonomously infer operating constraints from minimal process descriptions, then collaboratively guide optimization using the inferred constraints. Our AutoGen-based agentic framework employs OpenAI’s o3 model, with specialized agents for constraint generation, parameter validation, simulation execution, and optimization guidance. Through two phases, (i) autonomous constraint generation using embedded domain knowledge and (ii) iterative multi-agent optimization, the framework eliminates the need for predefined operational bounds. Validated on the hydrodealkylation process across cost, yield, and yield-to-cost ratio metrics, the framework demonstrated competitive performance with conventional optimization methods while achieving a 31-fold reduction in wall-time relative to grid search, converging in under 20 minutes and requiring far fewer iterations. Beyond computational efficiency, the framework’s reasoning-guided search demonstrates sophisticated process understanding, correctly identifying utility trade-offs and applying domain-informed heuristics. Unlike conventional optimization methods such as Bayesian optimization that require predefined constraints, our approach uniquely combines autonomous constraint generation with interpretable, reasoning-guided parameter exploration.
Reproducibility analysis across five independent trials demonstrates consistent convergence behavior, while model comparison reveals that reasoning-capable LLM architectures (o3, o1) are essential for successful optimization, with standard models failing to converge effectively. This approach shows significant potential for optimization scenarios where operational constraints are poorly characterized or unavailable, particularly for emerging processes and retrofit applications.
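The two-phase pattern described in this abstract — (i) infer parameter bounds from a process description, then (ii) iterate a search within those bounds against a simulator — can be sketched minimally. The constraint "inference" and simulator below are deterministic stubs of my own, not the paper's AutoGen agents or process model; a real system would have LLM agents propose the bounds and the next candidate point.

```python
# Minimal sketch of two-phase constraint-then-optimize. All names and
# numbers are illustrative assumptions.

def infer_constraints(description):
    # Stand-in for an LLM agent with embedded domain knowledge.
    return {"temperature": (500.0, 700.0), "pressure": (1.0, 5.0)}

def optimize(constraints, simulate, steps=20):
    best_params, best_score = None, float("-inf")
    for i in range(steps):
        # Deterministic sweep in place of LLM-guided proposals.
        frac = i / max(steps - 1, 1)
        params = {k: lo + frac * (hi - lo) for k, (lo, hi) in constraints.items()}
        score = simulate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective with its optimum at the upper temperature bound.
sim = lambda p: p["temperature"] - abs(p["pressure"] - 3.0)
params, score = optimize(infer_constraints("hydrodealkylation unit"), sim)
```

The point of the structure is that phase (ii) never needs externally supplied bounds: everything the search explores comes from phase (i)'s inferred constraints.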
Scene simulation in autonomous driving has gained significant attention because of its huge potential for generating customized data. However, existing editable scene simulation approaches face limitations in terms of user interaction efficiency, multi-camera photo-realistic rendering, and integration of external digital assets. To address these challenges, this paper introduces ChatSim, the first system that enables editable photo-realistic 3D driving scene simulations via natural language commands with external digital assets. To enable editing with high command flexibility, ChatSim leverages a large language model (LLM) agent collaboration framework. To generate photo-realistic outcomes, ChatSim employs a novel multi-camera neural radiance field method. Furthermore, to unleash the potential of extensive high-quality digital assets, ChatSim employs a novel multi-camera lighting estimation method to achieve scene-consistent asset rendering. Our experiments on the Waymo Open Dataset demonstrate that ChatSim can handle complex language commands and generate corresponding photo-realistic scene videos. Code can be accessed at: https://github.com/yifanlu0227/chatSim.
Recently, autonomous agents built on large language models (LLMs) have experienced significant development and are being deployed in real-world applications. These agents can extend the base LLM's capabilities in multiple ways. For example, a well-built agent using GPT-3.5-Turbo as its core can outperform the more advanced GPT-4 model by leveraging external components. More importantly, the usage of tools enables these systems to perform actions in the real world, moving from merely generating text to actively interacting with their environment. Given the agents' practical applications and their ability to execute consequential actions, it is crucial to assess potential vulnerabilities. Such autonomous systems can cause more severe damage than a standalone language model if compromised. While some existing research has explored harmful actions by LLM agents, our study approaches the vulnerability from a different perspective. We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions. We conduct comprehensive evaluations using various attack methods, surfaces, and properties to pinpoint areas of susceptibility. Our experiments reveal that these attacks can induce failure rates exceeding 80% in multiple scenarios. Through attacks on implemented and deployable agents in multi-agent scenarios, we accentuate the realistic risks associated with these vulnerabilities. To mitigate such attacks, we propose self-examination detection methods. However, our findings indicate these attacks are difficult to detect effectively using LLMs alone, highlighting the substantial risks associated with this vulnerability.
Table reasoning requires models to jointly perform comprehensive semantic understanding and precise numerical operations. Although recent large language model (LLM)-based methods have achieved promising results, most of them still rely on a single-turn reasoning paradigm that processes flattened tables in a single forward pass. This paradigm suffers from inherent limitations, including context overflow on large tables, weak sensitivity to continuous numerical values, and the absence of explicit tool-use and reflection. In this paper, we propose TableMind, a tuning-based autonomous programmatic table agent that simulates the human-like cognitive schema of multi-turn interaction within a lightweight LLM. Instead of adopting a training-free workflow design, TableMind learns to internalize planning, action, and reflection through a principled two-stage training strategy. To bootstrap structured table reasoning capabilities, we construct and filter high-quality reasoning data for the supervised fine-tuning (SFT) stage. To enable precise code generation, we introduce a multi-perspective reward scheme and a novel optimization objective in the reinforcement learning (RL) stage. Extensive experiments on diverse benchmarks demonstrate that TableMind consistently outperforms previous baselines, validating the effectiveness of training autonomous agents to improve overall performance.
Creating Augmented Reality (AR) applications requires expertise in both design and implementation, posing significant barriers to entry for non-expert users. While existing methods reduce some of this burden, they often fall short in flexibility or usability for complex or varied use cases. To address this, we introduce agentAR, an AR authoring system that leverages a tool-augmented large language model (LLM)–based autonomous agent to support end-to-end, in-situ AR application creation from natural language input. Built on an application structure and tool library derived from state-of-the-art AR research, the agent autonomously creates AR applications from natural language dialogue. We demonstrate the effectiveness of agentAR through a case study of six AR applications and a user study with twelve participants, showing that it significantly reduces user effort while supporting the creation of diverse and functional AR experiences.
Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration.
Multi-agent autonomous systems (MAS) are better at addressing challenges that span multiple domains than singular autonomous agents. This holds true within the field of software engineering (SE) as well. The state-of-the-art research on MAS within SE focuses on integrating LLMs at the core of autonomous agents to create LLM-based multi-agent autonomous (LMA) systems. However, the introduction of LMA systems into SE brings a plethora of challenges. One of the major challenges is the strategic allocation of tasks between humans and the LMA system in a trustworthy manner. To address this challenge, a RACI-based framework is proposed in this work-in-progress article, along with implementation guidelines and an example implementation of the framework. The proposed framework can facilitate efficient collaboration, ensure accountability, and mitigate potential risks associated with LLM-driven automation while aligning with the Trustworthy AI guidelines. The future steps for this work, delineating the planned empirical validation method, are also presented.
Automated Driving System (ADS) is a safety-critical software system responsible for interpreting the vehicle’s environment and making decisions accordingly. The unbounded complexity of the driving context, including unforeseeable events, necessitates continuous improvement, often achieved through iterative DevOps processes. However, DevOps processes are themselves complex, making these improvements both time- and resource-intensive. Automation of code generation for ADS using Large Language Models (LLMs) is one potential approach to address this challenge. Nevertheless, the development of ADS requires rigorous processes to verify, validate, assess, and qualify the code before it can be deployed in the vehicle and used. In this study, we developed and evaluated a prototype for automatic code generation and assessment using a designed pipeline of an LLM-based agent, a simulation model, and a rule-based feedback generator in an industrial setup. The LLM-generated code is evaluated automatically in a simulation model against multiple critical traffic scenarios, and an assessment report is provided as feedback to the LLM for modification or bug fixing. We report the experimental results of the prototype employing Codellama:34b, DeepSeek (r1:32b and Coder:33b), CodeGemma:7b, Mistral:7b, and GPT4 for Adaptive Cruise Control (ACC) and Unsupervised Collision Avoidance by Evasive Manoeuvre (CAEM). We finally assessed the tool with 11 experts at two Original Equipment Manufacturers (OEMs) by conducting an interview study.
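The generate-evaluate-feedback pipeline described above can be sketched as a bounded retry loop: generated code is scored against critical scenarios by a rule-based checker, and the failures are fed back into the next generation attempt. The generator and checker below are toy stand-ins of my own for the LLM and the simulation model.

```python
# Sketch of an iterative code-generation loop with simulation feedback.
# gen and check are illustrative stubs, not the paper's pipeline.

def generate_and_assess(generate, assess, scenarios, max_rounds=5):
    feedback = None
    for round_no in range(1, max_rounds + 1):
        code = generate(feedback)
        failures = [s for s in scenarios if not assess(code, s)]
        if not failures:
            return code, round_no  # all critical scenarios pass
        feedback = f"failed scenarios: {failures}"  # assessment report
    return None, max_rounds

# Toy generator that only handles "cut-in" after being told about it.
gen = lambda fb: {"handles": {"follow", "cut-in"} if fb else {"follow"}}
check = lambda code, scenario: scenario in code["handles"]

code, rounds = generate_and_assess(gen, check, ["follow", "cut-in"])
```

Bounding the rounds matters in the safety-critical setting: code that never passes all scenarios is rejected rather than deployed.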
Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi-agent systems (MAS) for industrial problem-solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand-crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm-shifting framework that redefines LLM-based MAS through a novel dynamic graph designer. This component autonomously identifies task-specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task-optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures. Our investigation establishes that context-sensitive structural adaptability constitutes a foundational requirement for high-performance LLM MAS deployments.
Conventional mechanical design follows an iterative process in which initial concepts are refined through cycles of expert assessment and resource-intensive Finite Element Method (FEM) analysis to meet performance goals. While machine learning models have been developed to assist in parts of this process, they typically require large datasets, extensive training, and are often tailored to specific tasks, limiting their generalizability. To address these limitations, we propose a framework that leverages a pretrained Large Language Model (LLM) in conjunction with an FEM module to autonomously generate, evaluate, and refine structural designs based on performance specifications and numerical feedback. The LLM operates without domain-specific fine-tuning, using general reasoning to propose design candidates, interpret FEM-derived performance metrics, and apply structurally sound modifications. Using 2D truss structures as a testbed, we show that the LLM can effectively navigate highly discrete and multi-faceted design spaces, balance competing objectives, and identify convergence when further optimization yields diminishing returns. Compared to Non-dominated Sorting Genetic Algorithm II (NSGA-II), our method achieves faster convergence and fewer FEM evaluations. Experiments with varying temperature settings (0.5, 1.0, 1.2) and model sizes (GPT-4.1 and GPT-4.1-mini) indicate that smaller models yield higher constraint satisfaction with fewer steps, while lower temperatures enhance design consistency. These results establish LLMs as a promising new class of reasoning-based, natural language-driven optimizers for autonomous design and iterative structural refinement.
Urban knowledge graphs have recently emerged as a building block for distilling critical knowledge from multi-sourced urban data for diverse urban application scenarios. Despite its promising benefits, urban knowledge graph construction (UrbanKGC) still heavily relies on manual effort, hindering its potential advancement. This paper presents UrbanKGent, a unified large language model agent framework for urban knowledge graph construction. Specifically, we first construct the knowledgeable instruction set for UrbanKGC tasks (such as relational triplet extraction and knowledge graph completion) via heterogeneity-aware and geospatial-infused instruction generation. Moreover, we propose a tool-augmented iterative trajectory refinement module to enhance and refine the trajectories distilled from GPT-4. Through hybrid instruction fine-tuning with augmented trajectories on the Llama 2 and Llama 3 families, we obtain the UrbanKGC agent family, consisting of the UrbanKGent-7/8/13B versions. We perform a comprehensive evaluation on two real-world datasets using both human and GPT-4 self-evaluation. The experimental results demonstrate that the UrbanKGent family can not only significantly outperform 31 baselines on UrbanKGC tasks, but also surpass the state-of-the-art LLM, GPT-4, by more than 10% at approximately 20 times lower cost. Compared with the existing benchmark, the UrbanKGent family can help construct an UrbanKG with hundreds of times richer relationships using only one-fifth of the data. Our data and code are available at https://github.com/usail-hkust/UrbanKGent.
Multi-agent debates have been introduced to improve the accuracy of Large Language Models (LLMs) by having multiple agents discuss solutions to a problem over several rounds of debate. However, models often generate incorrect yet confident-sounding responses, which can mislead others. This issue arises partly because agents do not consider how confident their peers are. To address this, we propose DebUnc, a debate framework that uses uncertainty metrics to assess agent confidence. Confidence is then conveyed through a modified attention mechanism that adjusts token weights, or through textual prompts. Evaluations across benchmarks show that attention-based methods are particularly effective and that performance continues to improve as uncertainty estimation becomes more reliable. The code is available at https://github.com/lukeyoffe/debunc.
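The "textual prompt" variant of conveying confidence described above can be sketched directly: each peer answer carries an uncertainty estimate, and the debate prompt labels it so other agents can discount uncertain answers. This is an illustrative sketch, not DebUnc's implementation; the uncertainty values and the threshold are assumptions standing in for real uncertainty metrics.

```python
# Sketch of confidence-labeled debate prompts. Uncertainty values are
# stubs for real uncertainty metrics (e.g., token-level entropy).

def label_confidence(uncertainty, threshold=0.5):
    return "low confidence" if uncertainty > threshold else "high confidence"

def build_debate_prompt(question, peer_answers):
    """peer_answers: list of (agent, answer, uncertainty)."""
    lines = [f"Question: {question}"]
    for agent, answer, u in peer_answers:
        lines.append(f"{agent} ({label_confidence(u)}): {answer}")
    lines.append("Considering each peer's confidence, give your final answer.")
    return "\n".join(lines)

prompt = build_debate_prompt(
    "What is 17 * 6?",
    [("agent_1", "102", 0.1), ("agent_2", "96", 0.8)],
)
```

The attention-based variant in the paper goes further, scaling token weights by confidence inside the model rather than stating it in text.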
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
This work presents an LLM-based agent, OpenFOAMGPT, tailored for OpenFOAM-centric computational fluid dynamics (CFD) simulations, leveraging two foundation models from OpenAI: GPT-4o (GPT means Generative Pre-trained Transformer) and a chain-of-thought-enabled o1 preview model. Both agents demonstrate success across multiple tasks. While the o1 model's token price is six times that of GPT-4o, it consistently exhibits superior performance in handling complex tasks, from zero-shot/few-shot case setup to boundary condition modifications, zero-shot turbulence model adjustments, and zero-shot code translation. Through an iterative correction loop, the agent efficiently addressed single-phase and multiphase flow, heat transfer, Reynolds-averaged Navier–Stokes modeling, large eddy simulation, and other engineering scenarios, often converging in a limited number of iterations at low token costs. To embed domain-specific knowledge, we employed a retrieval-augmented generation pipeline, demonstrating how preexisting simulation setups can further specialize the agent for subdomains such as energy and aerospace. Despite the strong performance of the agent, human oversight remains crucial for ensuring accuracy and adapting to shifting contexts. Fluctuations in model performance over time suggest the need for monitoring in mission-critical applications. Although our demonstrations focus on OpenFOAM, the adaptable nature of this framework opens the door to extending LLM-driven agents to a wide range of solvers and codes. By streamlining CFD simulations, this approach has the potential to accelerate both fundamental research and industrial engineering advancements.
Day-ahead electricity price prediction is crucial for market participants to make optimal trading decisions. The implementation of the five-minute settlement (5MS) process in the Australian National Electricity Market (NEM) on October 1, 2021, reduced the settlement interval from 30 minutes to 5 minutes. This change has led to more frequent adjustments in pricing, allowing for a more accurate reflection of real-time supply and demand conditions. However, this increased frequency has significantly heightened the complexity of price fluctuations in the wholesale market. Consequently, conventional machine learning and deep learning methods struggle to provide accurate predictions at this higher resolution. Since electricity prices are fundamentally determined by the supply-demand balance and the bidding behaviors of market participants, this work introduces individual participants' bidding behaviors into the prediction model. We fine-tune a pre-trained Large Language Model (LLM) to create bidding behavior agents, which forecast day-ahead bidding behaviors. Moreover, market sentiment plays a significant role in electricity price volatility, yet it remains challenging to quantify and assess its impact. To address this, we employ a pre-trained LLM to analyze online resources, incorporating market sentiment into the price prediction model. Additionally, to enhance the accuracy of spike predictions, we improve the conditional time series generative adversarial network (CTSGAN) model by utilizing a spike confusion matrix and further strengthen the model by integrating bidding behavior and market sentiment as inputs. Case studies demonstrate that the proposed model significantly improves both electricity price and spike prediction accuracy, offering a robust tool for market participants to navigate the complexities of the modern electricity market.
Recent breakthroughs in large language model-driven autonomous agents have revealed that multi-agent collaboration often surpasses any individual agent through collective reasoning. Inspired by the neural scaling law (increasing the number of neurons enhances performance), this study explores whether continuously adding collaborative agents can yield similar benefits. Technically, we utilize directed acyclic graphs to organize agents into a multi-agent collaboration network (MacNet), upon which their interactive reasoning is topologically orchestrated for autonomous task solving. Extensive evaluations reveal that it effectively supports collaboration among over a thousand agents, with irregular topologies outperforming regular ones. We also identify a collaborative scaling law: the overall performance follows a logistic growth pattern as agents scale, with collaborative emergence occurring earlier than traditional neural emergence. We speculate this may be because scaling agents catalyzes their multidimensional considerations during interactive reflection and refinement, thereby producing more comprehensive artifacts. The code is available at https://github.com/OpenBMB/ChatDev/tree/macnet.
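Topological orchestration over a DAG reduces to visiting agents in topological order, each refining the artifacts of its predecessors. The sketch below shows the pattern with Python's standard-library `graphlib`; the agent behavior is a stub of my own, and the diamond topology is just one example configuration, not MacNet's actual graphs.

```python
# Sketch of DAG-ordered agent orchestration using stdlib graphlib.
# Each agent receives the outputs of all its predecessor agents.
from graphlib import TopologicalSorter

def run_dag(graph, agent_fn, task):
    """graph: dict node -> iterable of predecessor nodes."""
    outputs = {}
    for node in TopologicalSorter(graph).static_order():
        inputs = [outputs[p] for p in graph.get(node, ())]
        outputs[node] = agent_fn(node, task, inputs)
    return outputs

# Toy agent: records which predecessor outputs it built on.
agent = lambda node, task, inputs: (
    f"{node}({','.join(inputs)})" if inputs else f"{node}[{task}]"
)

# A small diamond topology: a feeds b and c, which both feed d.
graph = {"a": (), "b": ("a",), "c": ("a",), "d": ("b", "c")}
outs = run_dag(graph, agent, "solve")
```

Scaling this structure to thousands of nodes, and varying the topology (chains, trees, meshes, random graphs), is exactly the axis along which the paper measures its collaborative scaling law.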
Abstract Objective Conversational Health Agents (CHAs) are interactive systems providing healthcare services, such as assistance and diagnosis. Current CHAs, especially those utilizing Large Language Models (LLMs), primarily focus on conversation aspects. However, they offer limited agent capabilities, lacking in particular multistep problem-solving, personalized conversations, and multimodal data analysis. We aim to overcome these limitations. Materials and methods We propose openCHA, an open-source LLM-powered framework designed to enable the development of conversational agents. OpenCHA offers a foundational and structured architecture and codebase, enabling researchers and developers to build and customize their CHA based on the specifics of their intended application. The framework leverages knowledge acquisition, problem-solving capabilities, and multilingual and multimodal conversations, and allows interaction with various AI platforms. We have released the framework as open source for the community on GitHub (https://github.com/Institute4FutureHealth/CHA and https://opencha.com). Results We demonstrated openCHA’s capability to develop CHAs across multiple health domains using 2 demos and 5 use cases. In diabetic patient management, the developed CHA achieved a 92.1% accuracy rate, surpassing GPT4’s 51.8%. In food recommendations, the developed CHA outperformed GPT4. The developed CHA excelled as an evaluator for mental health chatbots, recording the lowest Mean Absolute Error at 0.31, compared to competitors like GPT, Mistral, Gemini, and Claude. Additionally, the empathy-enabled CHA identified emotional states with 89% accuracy, and in physiological data analysis of heart rate from Photoplethysmography (PPG) signals, the developed CHA achieved a mean absolute error of 2.83, far lower than GPT-4o’s 8.93.
Discussion The openCHA framework enhances CHAs by enabling features such as explainability, personalization, and reliability through its integration with LLMs and external data sources. The developed CHAs face challenges like latency, token limits, and scalability. Future efforts will focus on improving planning robustness, enhancing accuracy and evaluation methods, and resolving user query ambiguity to further refine the framework’s effectiveness. Conclusion The diverse demos and use cases of openCHA demonstrate the framework’s capacity to empower the development of a wide range of CHAs for various healthcare tasks.
Large language models (LLMs) have fueled many intelligent web agents, but most existing ones perform far from satisfyingly in real-world web navigation tasks due to three factors: (1) the complexity of HTML text data, (2) the versatility of actions on webpages, and (3) task difficulty due to the open-domain nature of the web. In light of these challenges, we develop the open AutoWebGLM based on ChatGLM3-6B. AutoWebGLM can serve as a powerful automated web navigation agent that outperforms GPT-4. Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages with vital information preserved succinctly. We then employ a hybrid human-AI method to build web browsing data for curriculum training. Finally, we bootstrap the model by reinforcement learning and rejection sampling to further facilitate webpage comprehension, browser operations, and efficient task decomposition by itself. For comprehensive evaluation, we establish a bilingual benchmark---AutoWebBench---for real-world web navigation tasks. We evaluate AutoWebGLM across diverse web navigation benchmarks, demonstrating its potential to tackle challenging tasks in real environments. Related code, model, and data are released at https://github.com/THUDM/AutoWebGLM.
Large Language Model (LLM) based agents have garnered significant attention and are becoming increasingly popular. Furthermore, planning ability is a crucial component of an LLM-based agent, involving interaction with the environment and executing actions to complete a planning task, which generally entails achieving a desired goal from an initial state. This paper investigates enhancing the planning abilities of LLM-based agents through instruction tuning, referred to as agent training. Recent studies on agent training have demonstrated that utilizing expert-level trajectory data (sequences of action-observation pairs) for instruction-tuning LLMs effectively enhances their planning capabilities. However, existing work primarily focuses on synthesizing trajectories from manually designed planning tasks and environments. The labor-intensive nature of creating these environments and tasks impedes the generation of sufficiently varied and extensive trajectories for agent training. To address this limitation, this paper explores the automated synthesis of diverse environments and a gradual range of planning tasks, from easy to difficult. We introduce a framework, AgentGen, that leverages LLMs first to generate environments and subsequently generate planning tasks conditioned on these environments. Specifically, to improve environmental diversity, we propose using an inspiration corpus composed of various domain-specific text segments as the context for synthesizing environments. Moreover, to increase the difficulty diversity of generated planning tasks, we propose a bidirectional evolution method, Bi-Evol, that evolves planning tasks in both easier and harder directions to synthesize a task set with a smoother difficulty curve, thereby enabling LLMs to learn more effectively. These methods collectively contribute to the generation of diverse trajectory data for instruction-tuning.
Based on AgentGen, we greatly expanded the number of environments and planning tasks available for agent training. The evaluation results from AgentBoard indicate that AgentGen greatly enhances the planning capabilities of LLMs. For instance, the AgentGen instruction-tuned Llama-3.1-8B outperforms GPT-3.5 in overall performance. Moreover, the AgentGen-tuned Llama-3.1-70B model achieves state-of-the-art results in planning tasks. Project page: https://agent-gen.github.io/.
Designing de novo proteins beyond those found in nature holds significant promise for advancements in both scientific and engineering applications. Current methodologies for protein design often rely on AI-based models, such as surrogate models that address end-to-end problems by linking protein structure to material properties or vice versa. However, these models frequently focus on specific material objectives or structural properties, limiting their flexibility when incorporating out-of-domain knowledge into the design process or comprehensive data analysis is required. In this study, we introduce ProtAgents, a platform for de novo protein design based on Large Language Models (LLMs), where multiple AI agents with distinct capabilities collaboratively address complex tasks within a dynamic environment. The versatility in agent development allows for expertise in diverse domains, including knowledge retrieval, protein structure analysis, physics-based simulations, and results analysis. The dynamic collaboration between agents, empowered by LLMs, provides a versatile approach to tackling protein design and analysis problems, as demonstrated through diverse examples in this study. The problems of interest encompass designing new proteins, analyzing protein structures and obtaining new first-principles data – natural vibrational frequencies – via physics simulations. The concerted effort of the system allows for powerful automated and synergistic design of de novo proteins with targeted mechanical properties. The flexibility in designing the agents, on one hand, and their capacity for autonomous collaboration through the dynamic LLM-based multi-agent environment on the other hand, unleashes the great potential of LLMs in addressing multi-objective materials problems and opens up new avenues for autonomous materials discovery and design.
Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks has triggered extensive exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs' ability to perceive tool use is limited to a single text query, which may result in ambiguity in understanding the users' real intentions. LLMs are expected to eliminate that by perceiving the information in visual- or auditory-grounded instructions. Therefore, in this paper, we propose MLLM-Tool, a system incorporating open-source LLMs and multi-modal encoders so that the learned LLMs can be aware of multi-modal input instructions and then select the function-matched tool correctly. To facilitate the evaluation of the model's capability, we collect a dataset featuring multi-modal input tools from HuggingFace. Another essential feature of our dataset is that it also contains multiple potential choices for the same instruction due to the existence of identical functions and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our MLLM-Tool is capable of recommending appropriate tools for multi-modal instructions. Codes and data are available at github.com/MLLM-Tool/MLLM-Tool.
The rapid advancement of chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents, and provides insight into their "cognitive" processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following cooperation in multi-agent settings. Our contributions include introducing a novel communicative agent framework, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond: https://github.com/camel-ai/camel.
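The inception-prompted turn loop can be illustrated with a minimal sketch: an AI "user" agent issues instructions and an AI "assistant" agent solves them, both primed with role-pinning system prompts. `stub_llm` stands in for a real chat-model call, and the prompt wording is invented for illustration:

```python
def stub_llm(system_prompt, history):
    # Placeholder for a chat-model call; a real system would query an LLM
    # with this system prompt plus the shared conversation history.
    role = system_prompt.split(":")[0]
    return f"[{role}] reply #{len(history)}"

def role_play(task, rounds=3):
    """Minimal sketch of the role-playing loop: inception prompts pin
    down each agent's role and the shared task, then the two agents
    alternate turns without further human input."""
    user_sys = f"USER: give instructions to complete the task: {task}"
    asst_sys = f"ASSISTANT: follow the user's instructions for: {task}"
    history = []
    for _ in range(rounds):
        instruction = stub_llm(user_sys, history)
        history.append(("user", instruction))
        solution = stub_llm(asst_sys, history)
        history.append(("assistant", solution))
    return history
```

The key design point mirrored here is that autonomy comes from the initial prompts alone: once the loop starts, no human message enters the history.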
Large Language Model (LLM)-based agents exhibit significant potential across various domains, operating as interactive systems that process environmental observations to generate executable actions for target tasks. The effectiveness of these agents is significantly influenced by their memory mechanism, which records historical experiences as sequences of action-observation pairs. We categorize memory into two types: cross-trial memory, accumulated across multiple attempts, and in-trial memory (working memory), accumulated within a single attempt. While considerable research has optimized performance through cross-trial memory, the enhancement of agent performance through improved working memory utilization remains underexplored. Instead, existing approaches often involve directly inputting entire historical action-observation pairs into LLMs, leading to redundancy in long-horizon tasks. Inspired by human problem-solving strategies, this paper introduces HiAgent, a framework that leverages subgoals as memory chunks to manage the working memory of LLM-based agents hierarchically. Specifically, HiAgent prompts LLMs to formulate subgoals before generating executable actions and enables LLMs to proactively replace previous subgoals with summarized observations, retaining only the action-observation pairs relevant to the current subgoal. Experimental results across five long-horizon tasks demonstrate that HiAgent achieves a twofold increase in success rate and reduces the average number of steps required by 3.8. Additionally, our analysis shows that HiAgent consistently improves performance across various steps, highlighting its robustness and generalizability. Project Page: https://github.com/HiAgent2024/HiAgent.
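The subgoal-chunked working memory can be sketched as follows; `summarize` stands in for an LLM summarization call, and the class layout is an assumption for illustration rather than HiAgent's actual code:

```python
def summarize(pairs):
    # Placeholder for an LLM call that compresses a finished subgoal's
    # action-observation pairs into one short observation.
    return f"<summary of {len(pairs)} steps>"

class HierarchicalMemory:
    """Sketch of subgoal-chunked working memory: only the pairs for the
    current subgoal stay verbatim; each finished subgoal collapses into
    a single summarized observation."""

    def __init__(self):
        self.completed = []   # (subgoal, summary) for finished chunks
        self.subgoal = None   # the subgoal currently being pursued
        self.pairs = []       # verbatim pairs for the current subgoal

    def start_subgoal(self, subgoal):
        # Replace the previous chunk with its summary before moving on.
        if self.subgoal is not None:
            self.completed.append((self.subgoal, summarize(self.pairs)))
        self.subgoal, self.pairs = subgoal, []

    def record(self, action, observation):
        self.pairs.append((action, observation))

    def context(self):
        # Prompt context: compact summaries plus the live chunk only,
        # instead of the full action-observation history.
        return self.completed + [(self.subgoal, p) for p in self.pairs]
```

The redundancy saving is visible in `context()`: a long-horizon trace of hundreds of pairs shrinks to one line per finished subgoal plus the active chunk.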
Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieve state-of-the-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs leads AlpacaEval 2.0 by a substantial margin, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.
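The layered flow, in which every agent conditions on all previous-layer outputs, can be sketched minimally. Stub agents and a concatenating aggregator stand in for real LLM calls; the function names are illustrative assumptions:

```python
def stub_agent(name, prompt, refs):
    # Placeholder for one LLM proposer; `refs` are all responses from
    # the previous layer, passed in as auxiliary information.
    return f"{name}:{prompt}|{len(refs)} refs"

def mixture_of_agents(prompt, layers):
    """Sketch of the layered MoA flow: every agent in layer k sees all
    outputs from layer k-1 as references; the final outputs are joined
    by a simple concatenation standing in for an aggregator agent."""
    refs = []
    for layer in layers:
        # Each layer's agents run on the same prompt plus prior refs.
        refs = [stub_agent(name, prompt, refs) for name in layer]
    return " || ".join(refs)
```

A two-layer call such as `mixture_of_agents("q", [["a", "b"], ["c"]])` shows the mechanism: the single final agent receives both first-layer drafts as references.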
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder utilizes an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method does not require any labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer benchmark and demonstrate state-of-the-art zero-shot grounding accuracy. Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics.
In recent developments within the research community, the integration of Large Language Models (LLMs) in creating fully autonomous agents has garnered significant interest. Despite this, LLM-based agents frequently demonstrate notable shortcomings in adjusting to dynamic environments and fully grasping human needs. In this work, we introduce the problem of LLM-based human-agent collaboration for complex task-solving, exploring their synergistic potential. In addition, we propose a Reinforcement Learning-based Human-Agent Collaboration method, ReHAC. This approach includes a policy model designed to determine the most opportune stages for human intervention within the task-solving process. We construct a human-agent collaboration dataset to train this policy model in an offline reinforcement learning environment. Our validation tests confirm the model's effectiveness. The results demonstrate that the synergistic efforts of humans and LLM-based agents significantly improve performance in complex tasks, primarily through well-planned, limited human intervention. Datasets and code are available at: https://github.com/XueyangFeng/ReHAC.
The rapid development of the large language model (LLM) presents huge opportunities for 6G communications – for example, network optimization and management – by allowing users to input task requirements to LLMs with natural language. However, directly applying native LLMs in 6G encounters various challenges, such as a lack of communication data and knowledge, and limited logical reasoning, evaluation, and refinement abilities. Integrating LLMs with the capabilities of retrieval, planning, memory, evaluation, and reflection in agents can greatly enhance the potential of LLMs for 6G communications. To this end, we propose CommLLM, a multi-agent system with customized communication knowledge and tools for solving communication-related tasks using natural language. This system consists of three components: multi-agent data retrieval (MDR), which employs the condensate and inference agents to refine and summarize communication knowledge from the knowledge base, expanding the knowledge boundaries of LLMs in 6G communications; multi-agent collaborative planning (MCP), which utilizes multiple planning agents to generate feasible solutions for the communication-related task from different perspectives based on the retrieved knowledge; and multi-agent evaluation and reflection (MER), which utilizes the evaluation agent to assess the solutions, and applies the reflection agent and refinement agent to provide improvement suggestions for current solutions. Finally, we validate the effectiveness of the proposed multi-agent system by designing a semantic communication system as a case study of 6G communications.
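The three-stage MDR/MCP/MER pipeline can be sketched as a chain of stubbed agent calls. The role names mirror the abstract, but the function signatures and control flow are illustrative assumptions, not CommLLM's actual interface:

```python
def stub_llm(role, payload):
    # Placeholder for an LLM call playing the named agent role.
    return f"{role}({payload})"

def commllm(task, knowledge_base, n_planners=2, max_rounds=2):
    """Sketch of the pipeline: MDR condenses the knowledge base, MCP
    drafts candidate solutions from several perspectives, and MER
    iteratively evaluates, reflects on, and refines the chosen one."""
    # MDR: condensate agent then inference agent refine the knowledge.
    knowledge = stub_llm("inference", stub_llm("condense", knowledge_base))
    # MCP: multiple planners propose solutions grounded in that knowledge.
    plans = [stub_llm(f"planner{i}", f"{task}|{knowledge}")
             for i in range(n_planners)]
    solution = plans[0]  # stand-in for plan selection
    # MER: evaluate, reflect, refine for a fixed number of rounds.
    for _ in range(max_rounds):
        verdict = stub_llm("evaluate", solution)
        advice = stub_llm("reflect", verdict)
        solution = stub_llm("refine", f"{solution}|{advice}")
    return solution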
The integration of a complex set of electronic design automation (EDA) tools to enhance interoperability is a critical concern for circuit designers. Recent advancements in large language models (LLMs) have showcased their exceptional capabilities in natural language processing and comprehension, offering a novel approach to interfacing with EDA tools. This research article introduces ChatEDA, an autonomous agent for EDA empowered by an LLM, AutoMage, complemented by EDA tools serving as executors. ChatEDA streamlines the design flow from the register-transfer level (RTL) to the graphic data system version II (GDSII) by effectively managing task decomposition, script generation, and task execution. Through comprehensive experimental evaluations, ChatEDA has demonstrated its proficiency in handling diverse requirements, and our fine-tuned AutoMage model has exhibited superior performance compared to GPT-4 and other similar LLMs.
Large language models (LLMs) have attracted widespread attention recently, however, their application in specialized scientific fields still requires deep adaptation. Here, an artificial intelligence (AI) agent for organic field‐effect transistors (OFETs) is designed by integrating the generative pre‐trained transformer 4 (GPT‐4) model with well‐trained machine learning (ML) algorithms. It can efficiently extract the experimental parameters of OFETs from scientific literature and reshape them into a structured database, achieving precision and recall rates both exceeding 92%. Combined with well‐trained ML models, this AI agent can further provide targeted guidance and suggestions for device design. With prompt engineering and human‐in‐loop strategies, the agent extracts sufficient information of 709 OFETs from 277 research articles across different publishers and gathers them into a standardized database containing more than 10 000 device parameters. Using this database, a ML model based on Extreme Gradient Boosting is trained for device performance judgment. Combined with the interpretation of the high‐precision model, the agent has provided a feasible optimization scheme that has tripled the charge transport properties of 2,6‐diphenyldithieno[3,2‐b:2′,3′‐d]thiophene OFETs. This work is an effective practice of LLMs in the field of organic optoelectronic devices and expands the research paradigm of organic optoelectronic materials and devices.
Large Language Models (LLMs) and multi-agent systems have shown impressive capabilities in natural language tasks but face challenges in clinical trial applications, primarily due to limited access to external knowledge. Recognizing the potential of advanced clinical trial tools that aggregate and predict based on the latest medical data, we propose an integrated solution to enhance their accessibility and utility. We introduce Clinical Agent System (ClinicalAgent), a clinical multi-agent system designed for clinical trial tasks, leveraging GPT-4, multi-agent architectures, LEAST-TO-MOST, and ReAct reasoning technology. This integration not only boosts LLM performance in clinical contexts but also introduces novel functionalities. The proposed method achieves competitive predictive performance in clinical trial outcome prediction (0.7908 PR-AUC), obtaining a 0.3326 improvement over the standard prompt method. Publicly available code can be found at https://github.com/LeoYML/clinical-agent.
We introduce ColaCare, a framework that enhances Electronic Health Record (EHR) modeling through multi-agent collaboration driven by Large Language Models (LLMs). Our approach seamlessly integrates domain-specific expert models with LLMs to bridge the gap between structured EHR data and text-based reasoning. Inspired by the Multidisciplinary Team (MDT) approach used in clinical settings, ColaCare employs two types of agents: DoctorAgents and a MetaAgent, which collaboratively analyze patient data. Expert models process and generate predictions from numerical EHR data, while LLM agents produce reasoning references and decision-making reports within the MDT-driven collaborative consultation framework. The MetaAgent orchestrates the discussion, facilitating consultations and evidence-based debates among DoctorAgents, simulating diverse expertise in clinical decision-making. We additionally incorporate the Merck Manual of Diagnosis and Therapy (MSD) medical guideline within a retrieval-augmented generation (RAG) module for medical evidence support, addressing the challenge of knowledge currency. Extensive experiments conducted on three EHR datasets demonstrate ColaCare's superior performance in clinical mortality outcome and readmission prediction tasks, underscoring its potential to revolutionize clinical decision support systems and advance personalized precision medicine. All code, case studies and a questionnaire are available at the project website: https://colacare.netlify.app.
While the recommendation system (RS) has advanced significantly through deep learning, current RS approaches usually train and fine-tune models on task-specific datasets, limiting their generalizability to new recommendation tasks and their ability to leverage external knowledge due to model scale and data size constraints. Thus, we designed an LLM-powered autonomous recommender agent, RecMind, which is capable of leveraging external knowledge, utilizing tools with careful planning to provide zero-shot personalized recommendations. We propose a Self-Inspiring algorithm to improve the planning ability. At each intermediate step, the LLM self-inspires to consider all previously explored states to plan for the next step. This mechanism greatly improves the model's ability to comprehend and utilize historical information in planning for recommendation. We evaluate RecMind's performance in various recommendation scenarios. Our experiment shows that RecMind outperforms existing zero/few-shot LLM-based recommendation baseline methods in various tasks and achieves comparable performance to a fully trained recommendation model P5.
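The Self-Inspiring idea, planning each step against all previously explored states rather than only the committed path, can be sketched as follows. The stub planner and first-candidate selection are illustrative assumptions, not RecMind's implementation:

```python
def stub_planner(task, explored, k):
    # Placeholder LLM call: propose the k-th candidate next step,
    # conditioned on *every* previously explored state.
    return f"s{len(explored)}.{k}"

def self_inspiring_plan(task, n_steps=3, branches=2):
    """Sketch of Self-Inspiring planning: at each step the planner
    generates several candidate continuations, keeps all of them in
    the explored-state pool, and commits only one to the path, so
    later steps can still draw on the discarded alternatives."""
    explored, path = [], []
    for _ in range(n_steps):
        candidates = [stub_planner(task, explored, k)
                      for k in range(branches)]
        explored.extend(candidates)     # alternatives stay visible
        path.append(candidates[0])      # stand-in for LLM selection
    return path, explored
```

Contrast with a plain chain-of-thought planner: there, `explored` and `path` would be identical, and the discarded branches would be lost to later steps.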
Traditional manufacturing faces challenges adapting to dynamic environments and quickly responding to manufacturing changes. The use of multi-agent systems has improved adaptability and coordination but requires further advancements in rapid human instruction comprehension, operational adaptability, and coordination through natural language integration. Large language models like GPT-3.5 and GPT-4 enhance multi-agent manufacturing systems by enabling agents to communicate in natural language and interpret human instructions for decision-making. This research introduces a novel framework where large language models enhance the capabilities of agents in manufacturing, making them more adaptable, and capable of processing context-specific instructions. A case study demonstrates the practical application of this framework, showing how agents can effectively communicate, understand tasks, and execute manufacturing processes, including precise G-code allocation among agents. The findings highlight the importance of continuous large language model integration into multi-agent manufacturing systems and the development of sophisticated agent communication protocols for a more flexible manufacturing system.
As deep learning technology advances Autonomous Driving (AD), existing AD methods encounter performance limitations, especially in handling corner cases, interpretability, and verifiability, which are crucial for the safety of connected and autonomous vehicles. Multimodal Large Language Models (MLLMs) demonstrate remarkable understanding and reasoning capabilities, presenting a transformative opportunity to overcome challenges faced by traditional AD algorithms. We conduct a comprehensive study on the application of MLLMs in AD, exploring their potential to address critical challenges faced by traditional AD algorithms. We construct a Visual-Question-Answering dataset for model fine-tuning to address hallucinations and poor logic analysis issues in MLLMs. We then decompose the AD decision-making process into Scene Understanding, Prediction, and Decision, allowing MLLMs to construct Chain-of-Thought to make decisions step by step. Subsequently, we propose a new framework enabling models to perform AD tasks under conditions of limited local computing resources, few-shots, multimodality, and complex scenarios, enhancing the flexibility of future AD system deployment. Our extensive experiments and in-depth analyses demonstrate the significant advantages of MLLMs for AD. We also discuss the strengths and weaknesses of existing methods, providing a detailed outlook on MLLMs in AD.
Although large language models (LLMs) have achieved significant success in various tasks, they often struggle with hallucination problems, especially in scenarios requiring deep and responsible reasoning. These issues could be partially addressed by introducing external knowledge graphs (KG) in LLM reasoning. In this paper, we propose a new LLM-KG integrating paradigm ``$\hbox{LLM}\otimes\hbox{KG}$'' which treats the LLM as an agent to interactively explore related entities and relations on KGs and perform reasoning based on the retrieved knowledge. We further implement this paradigm by introducing a new approach called Think-on-Graph (ToG), in which the LLM agent iteratively executes beam search on the KG, discovers the most promising reasoning paths, and returns the most likely reasoning results. We use a number of well-designed experiments to examine and illustrate the following advantages of ToG: 1) compared with LLMs, ToG has better deep reasoning power; 2) ToG has the ability of knowledge traceability and knowledge correctability by leveraging LLM reasoning and expert feedback; 3) ToG provides a flexible plug-and-play framework for different LLMs, KGs and prompting strategies without any additional training cost; 4) the performance of ToG with small LLMs can exceed that of large LLMs such as GPT-4 in certain scenarios, which reduces the cost of LLM deployment and application. As a training-free method with lower computational cost and better generality, ToG achieves overall SOTA in 6 out of 9 datasets where most previous SOTAs rely on additional training.
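The iterative beam search at the core of ToG can be sketched over a toy KG; `stub_score` stands in for the LLM's judgment of how promising a partial reasoning path is, and the data layout is an assumption for illustration:

```python
def stub_score(question, path):
    # Placeholder for the LLM scoring a candidate reasoning path;
    # here: prefer paths that touch entities mentioned in the question.
    return sum(1 for entity in path if entity in question)

def think_on_graph(question, kg, start, width=2, depth=2):
    """Sketch of the ToG loop: beam search over KG edges where an LLM
    (stubbed) prunes to the top-`width` partial reasoning paths at
    every depth, then the best surviving path is returned."""
    beams = [[start]]
    for _ in range(depth):
        # Expand every beam by one hop along outgoing KG edges.
        expanded = [path + [nbr] for path in beams
                    for _rel, nbr in kg.get(path[-1], [])]
        if not expanded:
            break
        expanded.sort(key=lambda p: stub_score(question, p), reverse=True)
        beams = expanded[:width]
    return beams[0]
```

Because pruning happens at every hop, the number of LLM scoring calls grows linearly with depth rather than exponentially with the KG's branching factor.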
Using commercial software for radio map generation and wireless network planning often requires complex manual operations, posing significant challenges in terms of scalability, adaptability, and user-friendliness. To address these issues, we propose an automated solution that employs large language model (LLM) agents. These agents are designed to autonomously generate radio maps and facilitate wireless network planning for specified areas, thereby minimizing the necessity for extensive manual intervention. To validate the effectiveness of our proposed solution, we develop a software platform that integrates LLM agents. Experimental results demonstrate that a large number of manual operations can be saved via the proposed LLM agent, and the automated solutions can achieve enhanced coverage and signal-to-interference-noise ratio (SINR), especially in urban environments.
Configuring computational fluid dynamics (CFD) simulations typically demands extensive domain expertise, limiting broader access. Although large language models (LLMs) have advanced scientific computing, their use in automating CFD workflows is underdeveloped. We introduce a novel approach centered on domain-specific LLM adaptation. By fine-tuning Qwen2.5-7B-Instruct on NL2FOAM, our custom dataset of 28716 natural language-to-OpenFOAM configuration pairs with chain-of-thought (CoT) annotations, we enable direct translation from natural language descriptions to executable CFD setups. A multi-agent framework orchestrates the process, autonomously verifying inputs, generating configurations, running simulations, and correcting errors. Evaluation on a benchmark of 21 diverse flow cases demonstrates state-of-the-art performance, achieving 88.7% solution accuracy and 82.6% first-attempt success rate. This significantly outperforms larger general-purpose models like Qwen2.5-72B-Instruct, DeepSeek-R1, and Llama3.3-70B-Instruct, while also requiring fewer correction iterations and maintaining high computational efficiency. The results highlight the critical role of domain-specific adaptation in deploying LLM assistants for complex engineering workflows. Our code and fine-tuned model have been deposited at https://github.com/YYgroup/AutoCFD.
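The verify-generate-run-correct loop orchestrated by the multi-agent framework can be sketched as below. Every function is a stub standing in for the fine-tuned LLM agents and the solver runner; the names and the success marker are invented for illustration:

```python
def stub_agent(role, payload):
    # Placeholder for an LLM-backed agent playing the named role.
    return f"{role}({payload})"

def stub_run(config):
    # Placeholder for launching the CFD solver; here a run "succeeds"
    # once the corrector has rewritten the configuration, simulating
    # a failing first attempt that converges after correction.
    return config.startswith("corrector"), f"log for {config[:20]}..."

def auto_cfd(description, max_retries=2):
    """Sketch of the orchestration loop: verify the natural-language
    input, generate a configuration, run the simulation, and feed
    failure logs back to a corrector agent until success or the retry
    budget is exhausted."""
    spec = stub_agent("verifier", description)
    config = stub_agent("generator", spec)
    for _ in range(max_retries + 1):
        ok, log = stub_run(config)
        if ok:
            return config
        # On failure, the corrector sees both the config and the log.
        config = stub_agent("corrector", f"{config} | {log}")
    return None
```

Bounding the loop with `max_retries` mirrors the paper's observation that fewer correction iterations are a meaningful efficiency metric, not just final accuracy.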
Large language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: underspecification uncertainties, for when users do not provide all information or define the exact task at the outset; interactive learning, to ask follow-up questions and reduce the uncertainty about the current context; and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.
Objective. Radiotherapy planning requires significant expertise to balance tumor control and organ-at-risk (OAR) sparing. Automated planning can improve both efficiency and quality. This study introduces GPT-Plan, a novel multi-agent system powered by the GPT-4 family of large language models (LLMs), for automating iterative radiotherapy plan optimization. Approach. GPT-Plan uses LLM-driven agents, mimicking the collaborative clinical workflow of a dosimetrist and physicist, to iteratively generate and evaluate text-based radiotherapy plans based on predefined criteria. Supporting tools assist the agents by leveraging historical plans, mitigating LLM hallucinations, and balancing exploration and exploitation. Performance was evaluated on 12 lung (IMRT) and 5 cervical (VMAT) cancer cases, benchmarked against the ECHO auto-planning method and manual plans. The impact of historical plan retrieval on efficiency was also assessed. Results. For IMRT lung cancer cases, GPT-Plan generated high-quality plans, demonstrating superior target coverage and homogeneity compared to ECHO while maintaining comparable or better OAR sparing. For VMAT cervical cancer cases, plan quality was comparable to a senior physicist and consistently superior to a junior physicist, particularly for OAR sparing. Retrieving historical plans significantly reduced the number of required optimization iterations for lung cases (p < 0.01) and yielded iteration counts comparable to those of the senior physicist for cervical cases (p = 0.313). Occasional LLM hallucinations were mitigated by self-reflection mechanisms. One limitation was the inaccuracy of vision-based LLMs in interpreting dose images. Significance. This pioneering study demonstrates the feasibility of automating radiotherapy planning using LLM-powered agents for complex treatment decision-making tasks. While challenges remain in addressing LLM limitations, ongoing advancements hold potential for further refining and expanding GPT-Plan’s capabilities.
Large Language Models (LLMs) have significantly advanced natural language processing, demonstrating exceptional reasoning, tool usage, and memory capabilities. As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework that captures LLMs’ reasoning, planning, collaboration, and other social abilities. This work introduces a novel competition-based benchmark framework specifically designed to assess LLMs within multi-agent settings, providing quantitative metrics to evaluate their judgment, reasoning, deception, self-awareness, cooperation, coordination, and rationality. We utilize two social deduction games alongside three game-theory scenarios to create diverse environments. Our framework is fortified with a probabilistic graphical modeling (PGM) method, enhancing the LLMs’ capabilities in navigating complex social and cognitive dimensions. We evaluate seven LLMs, quantitatively highlighting a significant capability gap of over threefold between the strongest, GPT o1, and the weakest, Llama-2-70B. It also confirms that our PGM enhancement boosts the abilities of all selected models by an average of 37%. Our data and code can be found at https://github.com/cathyxl/MAgIC.
AI agents based on multimodal large language models (LLMs) are expected to revolutionize human-computer interaction, and offer more personalized assistant services across various domains like healthcare, education, manufacturing, and entertainment. Deploying LLM agents in 6G networks enables users to access previously expensive AI assistant services via mobile devices democratically, thereby reducing interaction latency and better preserving user privacy. Nevertheless, the limited capacity of mobile devices constrains the effectiveness of deploying and executing local LLMs, which necessitates offloading complex tasks to global LLMs running on edge servers during long-horizon interactions. In this article, we propose a split learning system for LLM agents in 6G networks, leveraging the collaboration between mobile devices and edge servers, where multiple LLMs with different roles are distributed across mobile devices and edge servers to perform user-agent interactive tasks collaboratively. In the proposed system, LLM agents are split into perception, grounding, and alignment modules, facilitating inter-module communications to meet extended user requirements on 6G network functions, including integrated sensing and communication, digital twins, and task-oriented communications. Furthermore, we introduce a novel model caching algorithm for LLMs within the proposed system to improve model utilization in context, thus reducing network costs of the collaborative mobile and edge LLM agents.
Large Language Model (LLM) agents have been increasingly adopted as simulation tools to model humans in social science and role-playing applications. However, one fundamental question remains: can LLM agents really simulate human behavior? In this paper, we focus on one critical and elemental behavior in human interactions, trust, and investigate whether LLM agents can simulate human trust behavior. We first find that LLM agents generally exhibit trust behavior, referred to as agent trust, under the framework of Trust Games, which are widely recognized in behavioral economics. Then, we discover that GPT-4 agents manifest high behavioral alignment with humans in terms of trust behavior, indicating the feasibility of simulating human trust behavior with LLM agents. In addition, we probe the biases of agent trust and differences in agent trust towards other LLM agents and humans. We also explore the intrinsic properties of agent trust under conditions including external manipulations and advanced reasoning strategies. Our study provides new insights into the behaviors of LLM agents and the fundamental analogy between LLMs and humans beyond value alignment. We further illustrate broader implications of our discoveries for applications where trust is paramount.
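The Trust Games referenced above follow a standard payoff structure from behavioral economics; a minimal sketch using the conventional tripling multiplier (the endowment and multiplier are textbook defaults, not values from the paper):

```python
def trust_game(endowment, sent, returned_fraction, multiplier=3):
    """One round of the Trust Game from behavioral economics.

    The trustor sends `sent` of their endowment; it is multiplied on the
    way to the trustee, who returns a fraction of what arrived.
    Returns (trustor_payoff, trustee_payoff).
    """
    assert 0 <= sent <= endowment
    received = sent * multiplier
    returned = received * returned_fraction
    trustor = endowment - sent + returned
    trustee = received - returned
    return trustor, trustee

# Full trust met with an even split leaves both players better off
# than the no-trust baseline of (endowment, 0).
trustor, trustee = trust_game(endowment=10, sent=10, returned_fraction=0.5)
```

Measuring how much an LLM agent chooses to send, and how that tracks human subjects' choices, is the kind of behavioral-alignment comparison the paper performs.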
Human beings possess the capability to multiply a mélange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that could incorporate multisensory interactive data, including visual, audio, tactile, and thermal information into large language models, thereby establishing the correlation among words, actions, and percepts. To this end, we first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with pre-trained LLM on such generated data, we first encode the 3D scene as abstracted object-centric representations, and then introduce action tokens denoting that the embodied agent takes certain actions within the environment, as well as state tokens that represent the multisensory state observations of the agent at each time step. At inference time, MultiPLY could generate action tokens, instructing the agent to take the action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks involving object retrieval, tool use, multisensory captioning, and task decomposition.
This paper explores the significant shift towards agentic workflows in the application of Large Language Models (LLMs), moving away from traditional, linear interactions between users and AI. Through a case study analysis, we highlight the effectiveness of agentic workflows, which facilitate a more dynamic and iterative engagement, in improving outcomes in tasks such as question answering, code generation or stock analysis. Central to the agentic workflow are four foundational design patterns: reflection, planning, multi-agent collaboration, and tool utilization. These components are crucial for boosting LLM productivity and enhancing performance. The study demonstrates how agentic workflows, by promoting an iterative and reflective process, can serve as a crucial step towards achieving Artificial General Intelligence (AGI).
With the rapid development of smart grid technology, the demand for high precision, real-time performance, and adaptive capability in the field of energy metering is becoming increasingly prominent. This study focuses on the construction of large-scale models and intelligent agents in the field of electric energy metering, and proposes an intelligent metering framework that integrates multimodal data and deep learning technology. By integrating power quality parameters such as voltage, current, harmonic content, as well as multi-source heterogeneous data such as environmental variables and equipment status, a hybrid architecture model based on Transformer and Graph Neural Network (GNN) was constructed to achieve dynamic error prediction and anomaly detection of energy metering data. The experimental results show that the accuracy of the model in harmonic distortion rate prediction task reaches 98.2%, which is 21.4% higher than traditional methods. At the same time, it can accurately identify low-frequency anomalies (recall rate 92.3%), verifying its robustness in complex working conditions.
Autonomous Intersection Management (AIM) systems present a novel paradigm for the cooperative control of Connected and Automated Vehicles (CAVs) at unsignalized intersections in future cities. Although Reinforcement Learning (RL) offers potential for increased computational efficiency and optimized solutions, challenges remain. These include limited inference capabilities and poor generalization due to simplified neural networks, along with insufficient safety-focused policy optimization. This study presents a novel offline-to-online framework, Prior-Enhanced Multi-Agent Constrained Decision Transformer (PE-MACDT), designed to tackle these challenges. The process begins with sequential decision-making using offline safe RL, which determines optimal actions through autoregressive modeling based on past states, actions, and both reward and cost returns. Leveraging the superior reasoning abilities and strong generalization of large language models like GPT-x and BERT, the sequence modeling challenges are addressed using the Transformer architecture, enhanced by sequence-level entropy regularizers to foster policy exploration. Subsequently, the safety policy learned from the offline dataset is deployed in the online environment and fine-tuned using the Multi-Agent Constrained Policy Optimization (MACPO) method combined with prior knowledge. This approach employs trust and constraint domains for policy updates, ensuring adherence to high standards of safety, comfort, and efficiency in dynamic traffic environments. Simulation results show our methodology outperforms state-of-the-art AIM methods in training convergence speed and asymptotic performance, as well as post-deployment outcomes in traffic efficiency, driving safety, and passenger comfort. The integration of offline pre-training with MACDT and online fine-tuning using MACPO offers a groundbreaking approach with significant potential for advancements in intelligent transportation systems.
Effective communication skills are critical to academic and professional success, yet neither traditional tutoring nor early Intelligent Tutoring Systems (ITS) reliably provide individualized, scalable, and context-sensitive feedback. To overcome these limitations, this work develops a Transformer-Based Intelligent Tutoring System intended to build communication skills more efficiently by integrating context-dependent discourse fusion, transformer-weighted fluency evaluation, and pragmatic intent matching. The system was implemented with fine-tuned transformer models and evaluated on a simulation of large-scale learner interaction within an agent-based framework. The experiments show significant improvements: grammar accuracy was 95.8%, fluency 0.87, intent alignment 93.2%, and discourse coherence 91.6%. Scalability to 1,200 parallel learners was confirmed by the simulation, which showed an average response time of 245 ms, load-handling efficiency of 94.3%, and throughput of 4,800 sessions per hour. These results demonstrate the novelty of the system, whose integration of pedagogical assessment and simulation-based scalability positions it as a cornerstone of future communication tutoring systems.
No abstract available
Chatbots play a vital role in digital communication, especially in e-commerce, where user queries range from casual browsing to complex order-related issues. Traditional chatbot models often struggle with accurately detecting user intent in vague, incomplete, or multi-turn interactions, and they typically lack emotional intelligence, resulting in poor user experiences. This study proposes a hybrid chatbot framework that integrates BERT-based intent classification, sentiment analysis, and a Transformer-based contextual memory module to enhance both contextual understanding and emotional responsiveness. The proposed model improves intent recognition by 3%-4% compared to conventional systems, supports more natural multi-turn conversations, and significantly boosts user satisfaction. These results highlight the importance of intelligent, adaptive chatbot solutions in the evolving landscape of online commerce.
We introduce the Safe Offline-to-Online Multi-Agent Decision Transformer (SO2-MADT), an innovative framework that revolutionizes safety considerations in Multi-agent Reinforcement Learning (MARL) through a novel sequence modeling approach. Leveraging the dynamic capabilities inherent in Decision Transformers, our methodology seamlessly incorporates safety protocols as a cornerstone element, ensuring secure operations throughout both the offline pre-training phase and the adaptive online fine-tuning phase. At the core of our framework lie two pivotal innovations: the Safety-To-Go (STG) token, embedding safety at a macro level, and the Agent Prioritization Module (APM), facilitating explicit credit assignment at a micro level. Through extensive testing against the challenging environments of the StarCraft Multi-Agent Challenge (SMAC) and Multi-agent MuJoCo, our SO2-MADT not only excels in offline pre-training but also demonstrates superior performance during online fine-tuning, without any degradation in performance. The implications of our work provide a pathway for deployment in critical real-world applications where safety is paramount and non-negotiable. The code is available at https://github.com/shahaamirbader/SO2-MADT.
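Decision-Transformer-style methods like SO2-MADT condition each action on how much reward, and how much of a safety budget, remains in the trajectory; a minimal sketch of deriving such per-step "to-go" tokens (the function names and token layout are illustrative, not the paper's API):

```python
def to_go(values):
    """Suffix sums: at step t, the total still to be accumulated from t onward."""
    out = [0.0] * len(values)
    running = 0.0
    for t in range(len(values) - 1, -1, -1):
        running += values[t]
        out[t] = running
    return out

def tokenize_trajectory(rewards, costs):
    """Pair each timestep with its return-to-go and safety cost-to-go,
    the per-step conditioning values a constrained Decision Transformer
    would prepend to states and actions (cf. the Safety-To-Go token)."""
    return list(zip(to_go(rewards), to_go(costs)))

tokens = tokenize_trajectory(rewards=[1.0, 0.0, 2.0], costs=[0.0, 0.5, 0.0])
# tokens[0] == (3.0, 0.5): all reward and all safety cost still ahead at t=0.
```

At deployment, the cost-to-go channel is set to the allowed safety budget rather than computed from data, steering generation toward trajectories that stay within it.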
Traditional maneuver decision-making approaches are highly dependent on accurate and complete situation information, and their decision-making quality becomes poor when opponent information is occasionally missing in complex electromagnetic environments. To solve this problem, an autonomous maneuver decision-making approach is developed based on a deep reinforcement learning (DRL) architecture. Meanwhile, a Transformer network is integrated into the actor and critic networks, which can find potential dependency relationships among time-series trajectory data. By using these relationships, the information loss is partially compensated, leading to more accurate maneuvering decisions. Introducing the Transformer network into DRL raises the issues of limited experience samples, low sampling efficiency, and poor training stability. To address these issues, an effective decision-making reward, a prioritized sampling method, and a dynamic learning rate adjustment mechanism are proposed. Numerous simulation results show that the proposed approach outperforms traditional DRL algorithms, with a higher win rate in the case of opponent information loss.
Recent developments in Artificial Intelligence (AI) and Natural Language Processing (NLP) have significantly influenced the education sector, mainly through Large Language Models (LLMs) and Multi-Agent Systems (MAS). Educational chatbots such as GPTutor use Retrieval-Augmented Generation (RAG) and transformer-based models to offer more intelligent tutoring and automated assessment support. Still, most existing systems depend on centralized APIs and monolithic designs, which restrict scalability, adaptability to different contexts, and data privacy. In this paper, we introduce E-GPT, a locally deployable multi-agent educational chatbot framework designed to divide learning tasks among independent, specialized agents. These include a PDF-to-Quiz Generator, PDF-to-Technical Article Writer, RAG-based Chat Assistant, Question Paper Generator, and an OCR-enabled PDF Analyzer. The system integrates fine-tuned versions of LLaMA3 and Mistral models with a MongoDB-FAISS vector database, allowing efficient context retrieval and smooth collaboration between agents. Experimental results show that E-GPT improves contextual accuracy, modular scalability, and response time while maintaining user data privacy. By distributing different cognitive functions across coordinated agents, the system takes a step toward building a more adaptive, transparent, and scalable AI-driven learning environment.
This paper proposes a novel Intelligent Reflecting Surface (IRS)-assisted interweave Cognitive Internet of Vehicles (CIoV) network under malicious jamming attacks, where the IRS enhances communication performance by establishing additional links. In order to maximize the sum transmission rate of Vehicle-to-Infrastructure (V2I) links, we formulate an optimization problem that jointly optimizes wireless resource allocation, such as spectrum and transmit power for Vehicle Users (VUs), and the IRS phase shift. Because this problem is non-convex and complicated, we further propose a Heterogeneous Multi-agent Transformer-enhanced Dueling Double Deep Q-Network (HMA-TD3QN) based resource allocation method, where the VUs and the Secondary Base Station (SBS) act as distinct heterogeneous agents that independently perform resource allocation and phase shift optimization. The Transformer neural network architecture can better adapt to long-sequence input states and extract relevant features from complex input states through the attention mechanism. Simulation results indicate that the proposed HMA-TD3QN method achieves improvements of 24.42%, 20.79%, and 22.25% over the basic HMA-DQN under three different jamming strategies, highlighting the effectiveness of IRS technology in enhancing the Quality of Service (QoS) and jamming resilience of the CIoV network.
Plant diseases remain a significant challenge in global agricultural production. Achieving efficient and accurate disease detection is essential for reducing crop losses, controlling agricultural costs, and improving yields. As agriculture rapidly advances toward digitalization and intelligent transformation, the application of artificial intelligence technologies has become a key pathway to enhancing industrial competitiveness. In this study, we present Chat Demeter, a multi-agent system for plant disease diagnosis based on deep learning. The system captures real-time leaf images through camera devices and employs a CNN-Transformer model to perform instance segmentation and object detection, thereby enabling automatic identification of diseased leaves and classification of disease types. To enhance interactivity and practical value, the system incorporates a natural language interface, allowing users to upload images and receive automated diagnostic results and treatment suggestions. Experimental results demonstrate that the system achieves an accuracy of 99.50% and an AUC of 99.91% on the validation dataset, highlighting its superior performance. Overall, Chat Demeter provides an effective tool for crop health monitoring and disease intervention, while offering a feasible pathway and developmental direction for integrating and optimizing future agricultural multi-agent systems.
Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines—such as offline multi-view feature extraction or additional task-specific heads—3D-LLaVA adopts a minimalist design with integrated architecture and only takes point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks based on text description. This versatile OST is empowered by the hybrid pretraining to obtain perception priors and leveraged as the visual connector that bridges the 3D data to the LLM. After performing unified instruction tuning, our 3D-LLaVA reports impressive results on various benchmarks. The code and model will be released at https://github.com/djiajunustc/3D-LLaVA.
Multi-agent collaborative perception enhances each agent’s perceptual capabilities by sharing sensing information to cooperatively perform robot perception tasks. This approach has proven effective in addressing challenges such as sensor deficiencies, occlusions, and long-range perception. However, existing representative collaborative perception systems transmit intermediate feature maps, such as bird’s-eye view (BEV) representations, which contain a significant amount of non-critical information, leading to high communication bandwidth requirements. To enhance communication efficiency while preserving perception capability, we introduce CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. Within CoCMT, we introduce the Efficient Query Transformer (EQFormer) to effectively fuse multi-agent object queries and implement a synergistic deep supervision to enhance the positive reinforcement between stages, leading to improved overall performance. Experiments on OPV2V and V2V4Real datasets show CoCMT outperforms state-of-the-art methods while drastically reducing communication needs. On V2V4Real, our model (Top-50 object queries) requires only 0.416 Mb bandwidth—83 times less than SOTA methods—while improving AP@70 by 1.1%. This efficiency breakthrough enables practical collaborative perception deployment in bandwidth-constrained environments without sacrificing detection accuracy. The code and models are open-sourced through the following link: https://github.com/taco-group/COCMT.
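The bandwidth gap between dense BEV feature maps and a handful of object queries is simple arithmetic; a rough sketch with illustrative tensor shapes (not the paper's exact configuration, though 50 queries of dimension 256 happens to land near the reported 0.416 Mb):

```python
def mbits(num_elements, bytes_per_element=4):
    """Payload size in megabits for a float32 tensor."""
    return num_elements * bytes_per_element * 8 / 1e6

# Dense BEV feature map: e.g. 256 channels on a 100x100 grid (illustrative).
bev = mbits(256 * 100 * 100)
# Top-50 object queries of dimension 256 (illustrative).
queries = mbits(50 * 256)
# Query-based sharing cuts the per-frame payload by two orders of magnitude.
ratio = bev / queries
```

The exact ratio depends on grid resolution, channel count, and query budget, which is why the paper's measured 83x figure differs from any back-of-envelope estimate.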
Compared with terrestrial networks, unmanned aerial vehicles (UAVs) have the characteristics of flexible deployment and strong adaptability, which make them an important supplement to intelligent transportation systems (ITS). In this paper, we focus on the multi-UAV network area coverage problem (ACP), which requires UAVs to make intelligent long-term trajectory decisions in complex and scalable network environments. Multi-agent deep reinforcement learning (DRL) has recently emerged as an effective tool for solving long-term decision problems. However, since the input dimension of a multi-layer perceptron (MLP)-based deep neural network (DNN) is fixed, it is difficult for a standard DNN to adapt to a variable number of UAVs and network users. Therefore, we combine the Transformer with DRL to meet the scalability of the network and propose a Transformer-based deep multi-agent reinforcement learning (T-MARL) algorithm. The Transformer can adapt to variable input dimensions and extract important information from complex network states through its attention module. In our research, we find that random initialization of the Transformer may cause DRL training failure, so we propose a baseline-assisted pre-training scheme. This scheme can quickly provide an initial policy model for UAVs based on imitation learning, and uses the temporal-difference(1) algorithm to initialize the policy evaluation network. Finally, based on parameter sharing, T-MARL is applicable to any standard DRL algorithm and supports expansion to networks of different sizes. Experimental results show that T-MARL enables cooperative UAV behaviors and performs outstandingly on the ACP.
Thematic analysis (TA) is a widely used qualitative method for identifying underlying meanings within unstructured text. However, TA requires manual processes, which become increasingly labour-intensive and time-consuming as datasets grow. While large language models (LLMs) have been introduced to assist with TA on small-scale datasets, three key limitations hinder their effectiveness. First, current approaches often depend on interactions between an LLM agent and a human coder, a process that becomes challenging with larger datasets. Second, with feedback from the human coder, the LLM tends to mirror the human coder, which provides a narrower viewpoint of the data. Third, existing methods follow a sequential process, where codes are generated for individual samples without recalling previous codes and associated data, reducing the ability to analyse data holistically. To address these limitations, we propose Thematic-LM, an LLM-based multi-agent system for large-scale computational thematic analysis. Thematic-LM assigns specialised tasks to each agent, such as coding, aggregating codes, and maintaining and updating the codebook. We assign coder agents different identity perspectives to simulate the subjective nature of TA, fostering a more diverse interpretation of the data. We applied Thematic-LM to the Dreaddit dataset and the Reddit climate change dataset to analyse themes related to social media stress and online opinions on climate change. We evaluate the resulting themes based on trustworthiness principles in qualitative research. Our study reveals insights such as that assigning different identities to coder agents promotes divergence in codes and themes.
In the realm of microservices architecture, the occurrence of frequent incidents necessitates the employment of Root Cause Analysis (RCA) for swift issue resolution. It is common for a serious incident to take several domain experts hours to identify the root cause. Consequently, a contemporary trend involves harnessing Large Language Models (LLMs) as automated agents for RCA. Though the recent ReAct framework aligns well with the workflow of Site Reliability Engineers (SREs) through its thought-action-observation paradigm, its hallucinations often lead to irrelevant actions and directly affect subsequent results. Additionally, the complex and variable clues of an incident can further overwhelm the model. To confront these challenges, we propose Flow-of-Action, a pioneering Standard Operation Procedure (SOP) enhanced LLM-based multi-agent system. By explicitly summarizing the diagnosis steps of SREs, SOPs impose constraints on LLMs at crucial junctures, guiding the RCA process towards the correct trajectory. To facilitate the rational and effective utilization of SOPs, we design an SOP-centric framework called SOP flow. SOP flow contains a series of tools, including one for finding relevant SOPs for incidents, another for automatically generating SOPs for incidents without relevant ones, and a tool for converting SOPs into code. This significantly alleviates the hallucination issues of ReAct in RCA tasks. We also design multiple auxiliary agents to assist the main agent by removing useless noise, narrowing the search space, and informing the main agent whether the RCA procedure can stop. Compared to the ReAct method's 35.50% accuracy, our Flow-of-Action method achieves 64.01%, meeting the accuracy requirements for RCA in real-world systems.
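The "find relevant SOPs for incidents" tool can be sketched as nearest-neighbour retrieval over SOP descriptions; everything below (bag-of-words cosine in place of learned embeddings, the sample SOPs) is a hypothetical stand-in, not the paper's implementation:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_sop(incident, sops):
    """Return the SOP whose description best matches the incident text."""
    q = Counter(incident.lower().split())
    return max(sops, key=lambda s: cosine(q, Counter(s.lower().split())))

sops = [
    "restart pod on out of memory error",       # hypothetical SOP catalogue
    "rotate expired tls certificate",
    "scale out service on high request latency",
]
best = find_sop("checkout service high latency alert", sops)
```

In a production system the bag-of-words match would be replaced by dense embeddings, but the tool's contract is the same: an incident description in, the closest SOP out.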
Large Language Models (LLMs) are increasingly being explored for problem-solving tasks. However, their strategic planning capability is often viewed with skepticism. Recent studies have incorporated the Monte Carlo Tree Search (MCTS) algorithm to augment the planning capacity of LLMs. Despite its potential, MCTS relies on extensive sampling simulations to approximate the true reward distribution, which leads to two primary issues. Firstly, MCTS is effective for tasks like the Game of Go, where simulation results can yield objective rewards (e.g., 1 for a win and 0 for a loss). However, for tasks such as question answering, the result of a simulation is the answer to the question, which cannot yield an objective reward without the ground truth. Secondly, obtaining statistically significant reward estimations typically requires a sample size exceeding 30 simulations, resulting in excessive token usage and time consumption. To address these challenges, we present the Multi-Agent System with Tactical Execution and Reasoning using LLM Specialized MCTS (MASTER), a novel framework that coordinates agent recruitment and communication through LLM specialized MCTS. This system autonomously adjusts the number of agents based on task complexity and ensures focused communication among them. Comprehensive experiments across various tasks demonstrate the effectiveness of our proposed framework. It achieves 76% accuracy on HotpotQA and 80% on WebShop, setting new state-of-the-art performance on these datasets.
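The 30-plus simulations MASTER seeks to avoid are driven by MCTS's selection rule; the standard UCT score behind that sampling loop can be sketched as:

```python
import math

def uct_score(mean_reward, visits, parent_visits, c=1.414):
    """Upper Confidence bound applied to Trees: exploit the observed mean
    plus an exploration bonus that shrinks as a node is visited more."""
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    return mean_reward + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits):
    """Pick the child maximizing UCT; `children` maps name -> (mean, visits)."""
    return max(children, key=lambda k: uct_score(*children[k], parent_visits))

children = {"a": (0.9, 30), "b": (0.5, 2), "c": (0.0, 0)}
picked = select_child(children, parent_visits=32)
```

Because each visit's reward estimate comes from a full rollout, and question-answering rollouts have no objective reward without ground truth, MASTER replaces this sampling with LLM-produced evaluations.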
Multi-agent systems (MAS) have emerged as a promising approach for enhancing the reasoning capabilities of large language models in complex problem-solving; however, current MAS frameworks suffer from poor flexibility and scalability with underdeveloped optimization strategies. To address these challenges, we propose ReSo, which integrates task graph generation with a reward-driven two-stage agent selection process centered on our Collaborative Reward Model that provides fine-grained reward signals to optimize MAS cooperation. We also introduce an automated data synthesis framework for generating MAS benchmarks without any human annotations. Experimental results show that ReSo matches or outperforms existing methods, achieving 33.7 percent accuracy on Math-MAS and 32.3 percent accuracy on SciBench-MAS, where other approaches completely fail.
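The reward-driven two-stage agent selection can be sketched as scoring candidates with a learned model and keeping the top-k; the lookup-table scorer below is a hypothetical stand-in for the paper's Collaborative Reward Model:

```python
def select_agents(candidates, reward_model, task, k=2):
    """Selection sketch: score each candidate agent for the task with a
    (learned) reward model, then keep the top-k for collaboration."""
    scored = sorted(candidates, key=lambda a: reward_model(a, task), reverse=True)
    return scored[:k]

# Hypothetical stand-in for a learned reward model: a lookup table of
# past per-task success rates (unseen pairs default to 0.0).
history = {("solver", "math"): 0.8, ("checker", "math"): 0.6, ("writer", "math"): 0.1}
reward_model = lambda agent, task: history.get((agent, task), 0.0)
team = select_agents(["solver", "checker", "writer"], reward_model, "math", k=2)
```

The fine-grained reward signals the paper describes would update such scores per collaboration step, so the selector improves as the system runs.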
Large language models (LLMs) have demonstrated notable potential in conducting complex tasks and are increasingly utilized in various financial applications. However, high-quality sequential financial investment decision-making remains challenging. These tasks require multiple interactions with a volatile environment for every decision, demanding sufficient intelligence to maximize returns and manage risks. Although LLMs have been used to develop agent systems that surpass human teams and yield impressive investment returns, opportunities to enhance multi-sourced information synthesis and optimize decision-making outcomes through timely experience refinement remain unexplored. Here, we introduce the FinCon, an LLM-based multi-agent framework with CONceptual verbal reinforcement tailored for diverse FINancial tasks. Inspired by effective real-world investment firm organizational structures, FinCon utilizes a manager-analyst communication hierarchy. This structure allows for synchronized cross-functional agent collaboration towards unified goals through natural language interactions and equips each agent with greater memory capacity than humans. Additionally, a risk-control component in FinCon enhances decision quality by episodically initiating a self-critiquing mechanism to update systematic investment beliefs. The conceptualized beliefs serve as verbal reinforcement for the future agent's behavior and can be selectively propagated to the appropriate node that requires knowledge updates. This feature significantly improves performance while reducing unnecessary peer-to-peer communication costs. Moreover, FinCon demonstrates strong generalization capabilities in various financial tasks, including single stock trading and portfolio management.
The rapid advancement of scientific progress requires innovative tools that can accelerate knowledge discovery. Although recent AI methods, particularly large language models (LLMs), have shown promise in tasks such as hypothesis generation and experimental design, they fall short of replicating the collaborative nature of real-world scientific practices, where diverse experts work together in teams to tackle complex problems. To address the limitations, we propose an LLM-based multi-agent system, i.e., Virtual Scientists (VirSci), designed to mimic the teamwork inherent in scientific research. VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. Through comprehensive experiments, we demonstrate that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas. We further investigate the collaboration mechanisms that contribute to its tendency to produce ideas with higher novelty, offering valuable insights to guide future research and illuminating pathways toward building a robust system for autonomous scientific discovery. The code is available at https://github.com/open-sciencelab/Virtual-Scientists.
Software vulnerabilities can lead to severe security issues such as data breaches, financial losses, and service disruptions, making security issue-oriented code review a crucial part of the development process. Traditional approaches struggle with analyzing complex code and providing explanations, while large language models (LLMs) show promise in code review but do not focus on security-related issues. To address these limitations, we propose AutoReview, an LLM-based multi-agent system for security code review. It integrates three agents: (1) an Issue Detector that identifies potential vulnerabilities using knowledge-level retrieval-augmented generation, (2) an Issue Locator that pinpoints vulnerability positions through graph-based code slicing, and (3) an Issue Repairer that generates context-aware fixes via iterative verification. Evaluated on ReposVul with three code LLMs, AutoReview demonstrates strong effectiveness in security code review, improving the F1-score for detection by 18.72%, precision for location by 27.75%, and BLEU for repair by 14.82% over baselines.
The ubiquitous computing resources in 6G networks provide ideal environments for the fusion of large language models (LLMs) and intelligent services through the agent framework. With auxiliary modules and planning cores, LLM-enabled agents can autonomously plan and take actions to deal with diverse environment semantics and user intentions. However, the limited resources of individual network devices significantly hinder the efficient operation of LLM-enabled agents with complex tool calls, highlighting the urgent need for efficient multi-level device collaborations. To this end, we propose a framework and method for an LLM-enabled multi-agent system with dual-loop terminal-edge collaborations in 6G networks. First, the outer loop consists of iterative collaborations between a global agent and multiple sub-agents deployed on edge servers and terminals, where planning capability is enhanced through task decomposition and parallel sub-task distribution. Second, the inner loop utilizes sub-agents with dedicated roles to cyclically reason, execute, and replan sub-tasks, and incorporates parallel tool-call generation with offloading strategies to improve efficiency. The improved task planning capability and task execution efficiency are validated through a case study in 6G-supported urban safety governance. Finally, open challenges and future directions in 6G networks are thoroughly analyzed, accelerating the advent of the 6G era.
Refactoring is a constant activity in software development and maintenance. Scaling and maintaining software systems depends on code refactoring. However, this process is still labor-intensive, as it requires programmers to analyze the codebases in detail to avoid introducing new defects. In this research, we put forward a large language model (LLM)-based multi-agent system to automate the refactoring process on Haskell code. The objective of this research is to evaluate the effect of LLM-based agents in performing structured and semantically accurate refactoring on Haskell code. Our proposed multi-agent system is built on specialized agents with distinct roles, including code analysis, refactoring execution, verification, and debugging. To test the effectiveness and practical applicability of the multi-agent system, we conducted evaluations using different open-source Haskell codebases. The results of the experiments showed that the proposed LLM-based multi-agent system reduced code complexity by an average of 11.03%, improved overall code quality by 22.46%, and increased performance efficiency by an average of 13.27%. Furthermore, memory allocation was optimized by up to 14.57%. These results highlight the ability of LLM-based multi-agent systems to manage refactoring tasks targeted toward functional programming paradigms. Our findings suggest that integrating LLM-based multi-agent systems into the refactoring of functional programming languages can enhance maintainability and support automated development workflows.
Today, E-commerce sellers face several key challenges, including difficulties in discovering and effectively utilizing available programs and tools, and struggling to understand and utilize rich data from various tools. We therefore aim to develop Insight Agents (IA), a conversational multi-agent Data Insight system, to provide E-commerce sellers with personalized data and business insights through automated information retrieval. Our hypothesis is that IA will serve as a force multiplier for sellers, driving incremental seller adoption by reducing the effort required and increasing the speed at which sellers make good business decisions. In this paper, we introduce this new LLM-backed end-to-end agentic workflow designed for comprehensive coverage, high accuracy, and low latency. It features a hierarchical multi-agent structure, consisting of a manager agent and two worker agents, data presentation and insight generation, for efficient information retrieval and problem-solving. We design a simple yet effective ML solution for the manager agent that combines Out-of-Domain (OOD) detection using a lightweight encoder-decoder model and agent routing through a BERT-based classifier, optimizing both accuracy and latency. Within the two worker agents, a strategic planning module is designed for the API-based data model that breaks down queries into granular components to generate more accurate responses, and domain knowledge is dynamically injected to enhance the insight generator. IA has been launched for Amazon sellers in the US and has achieved a high accuracy of 89.5% based on human evaluation, with P90 latency below 15s.
Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving, yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods. We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness in LLM-based MAS through LLM training. Optima employs an iterative generate, rank, select, and train paradigm with a reward function balancing task performance, token efficiency, and communication readability. We explore various training algorithms, including Supervised Fine-Tuning, Direct Preference Optimization, and their hybrid approaches, providing insights into their effectiveness-efficiency trade-offs. We integrate Monte Carlo Tree Search-inspired techniques for DPO data generation, treating conversation turns as tree nodes to explore diverse interaction paths. Evaluated on common multi-agent tasks, including information-asymmetric question answering and complex reasoning, Optima shows consistent and substantial improvements over single-agent baselines and vanilla MAS based on Llama 3 8B, achieving up to a 2.8x performance gain with less than 10% of the tokens on tasks requiring heavy information exchange. Moreover, Optima's efficiency gains open new possibilities for leveraging inference compute more effectively, leading to improved inference-time scaling laws. By addressing fundamental challenges in LLM-based MAS, Optima demonstrates a path towards scalable, efficient, and effective MAS (https://chenweize1998.github.io/optima-project-page).
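The generate, rank, select step of Optima's iterative paradigm reduces to scoring candidates with a reward that trades task performance against token cost. The sketch below is a minimal illustration under assumed candidate fields and an assumed reward weight, not the paper's actual reward function:

```python
# Rank candidate multi-agent dialogues by a reward that balances task
# score against communication cost, then select the top-k for the next
# training round. The 0.01 token penalty is an arbitrary illustration.

def reward(candidate, token_weight=0.01):
    # Higher task score is better; more tokens exchanged is worse.
    return candidate["score"] - token_weight * candidate["tokens"]

def rank_and_select(candidates, k=2):
    ranked = sorted(candidates, key=reward, reverse=True)
    return ranked[:k]  # winners would become DPO/SFT training data

pool = [
    {"id": "a", "score": 0.9, "tokens": 120},  # accurate but verbose
    {"id": "b", "score": 0.8, "tokens": 20},   # slightly worse, terse
    {"id": "c", "score": 0.4, "tokens": 10},
]
best = rank_and_select(pool)
```

Note how the token penalty reorders the pool: the verbose candidate "a" has the best raw score but ranks last once communication cost is charged.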
This paper presents a novel design of a multi-agent system framework that applies large language models (LLMs) to automate the parametrization of simulation models in digital twins. This framework features specialized LLM agents tasked with observing, reasoning, decision-making, and summarizing, enabling them to dynamically interact with digital twin simulations to explore parametrization possibilities and determine feasible parameter settings to achieve an objective. The proposed approach enhances the usability of simulation models by infusing them with knowledge heuristics from LLMs and enables autonomous search for feasible parametrizations to solve a user task. Furthermore, the system has the potential to increase user-friendliness and reduce the cognitive load on human users by assisting in complex decision-making processes. The effectiveness and functionality of the system are demonstrated through a case study, and the visualized demos and code are available in a GitHub repository: https://github.com/YuchenXia/LLMDrivenSimulation
Trending topics have become a significant part of modern social media, attracting users to participate in discussions of breaking events. However, they also bring in a new channel for poisoning attacks, resulting in negative impacts on society. Therefore, it is urgent to study this critical problem and develop effective strategies for defense. In this paper, we propose TrendSim, an LLM-based multi-agent system to simulate trending topics in social media under poisoning attacks. Specifically, we create a simulation environment for trending topics that incorporates a time-aware interaction mechanism, centralized message dissemination, and an interactive system. Moreover, we develop LLM-based human-like agents to simulate users in social media, and propose prototype-based attackers to replicate poisoning attacks. We further evaluate TrendSim from multiple aspects to validate its effectiveness. Based on TrendSim, we conduct simulation experiments to study four critical problems about poisoning attacks on trending topics for social benefit.
Large Language Models (LLMs) excel in diverse applications including code snippet generation, but often struggle with generating code for complex Machine Learning (ML) tasks. Although existing LLM single-agent systems give varying performance depending on task complexity, they rely purely on larger, expensive models such as GPT-4. Our investigation reveals that no-cost and low-cost models such as Gemini-Pro, Mixtral, and CodeLlama perform far worse than GPT-4 in a single-agent setting. Motivated to develop a cost-efficient LLM-based solution for solving ML tasks, we propose an LLM multi-agent system that leverages a combination of experts using profiling, efficient retrieval of past observations, LLM cascades, and ask-the-expert calls. Through empirical analysis on ML engineering tasks in the MLAgentBench benchmark, we demonstrate the effectiveness of our system, using the no-cost model Gemini as the base LLM, paired with GPT-4 in cascade and as an expert serving occasional ask-the-expert calls for planning. With a 94.2% reduction in cost (from $0.931 to $0.054 per run, averaged over all tasks), our system yields a better average success rate of 32.95%, compared to the 22.72% achieved by the GPT-4 single-agent system, averaged over all MLAgentBench tasks.
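The LLM cascade mentioned in the abstract above follows a simple control flow: answer with a cheap model first, and escalate to the expensive model only when the cheap model is unsure. The sketch below uses stand-in model callables and an assumed confidence threshold; it is not the paper's implementation:

```python
# Minimal LLM cascade: route a query to a cheap model first, escalate
# to an expensive "expert" only when the cheap model's confidence falls
# below a threshold. Models are stubs returning (answer, confidence).

def cascade(query, cheap, expensive, threshold=0.7):
    answer, confidence = cheap(query)
    if confidence >= threshold:
        return answer, "cheap"          # cost saved: no expert call
    return expensive(query)[0], "expensive"

# Stub models for illustration: the cheap model is unsure on "hard"
# queries and confident otherwise.
cheap = lambda q: ("guess", 0.3) if "hard" in q else ("answer", 0.9)
expensive = lambda q: ("expert answer", 0.99)

easy = cascade("easy question", cheap, expensive)
hard = cascade("hard question", cheap, expensive)
```

In the cost accounting reported above, most queries take the cheap branch, which is where the 94.2% cost reduction would come from in a setup like this.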
Traditional methods for making software deployment decisions in the automotive industry typically rely on manual analysis of tabular software test data. These methods often lead to higher costs and delays in the software release cycle due to their labor-intensive nature. Large Language Models (LLMs) present a promising solution to these challenges. However, their application generally demands multiple rounds of human-driven prompt engineering, which limits their practical deployment, particularly for industrial end-users who need reliable and efficient results. In this paper, we propose GoNoGo, an LLM agent system designed to streamline automotive software deployment while meeting both functional requirements and practical industrial constraints. Unlike previous systems, GoNoGo is specifically tailored to address domain-specific and risk-sensitive systems. We evaluate GoNoGo's performance across different task difficulties using zero-shot and few-shot examples taken from industrial practice. Our results show that GoNoGo achieves a 100% success rate for tasks up to Level 2 difficulty with 3-shot examples, and maintains high performance even for more complex tasks. We find that GoNoGo effectively automates decision-making for simpler tasks, significantly reducing the need for manual intervention. In summary, GoNoGo represents an efficient and user-friendly LLM-based solution currently employed in our industrial partner's company to assist with software release decision-making, supporting more informed and timely decisions in the release process for risk-sensitive vehicle systems.
The rapid growth of the Internet and its applications has made online discussion one of the most popular and significant ways for people to communicate with each other and reach consensus. Meanwhile, remarkable advances in Large Language Models (LLMs) have improved the text understanding and content generation capabilities of LLM-based agents. The research objective of the PhD thesis is to build democratic discussion environments, which currently face three main issues: 1) large-scale discussions tend to be complicated, 2) rumours and misinformation negatively affect the discussions, and 3) direct democratic discussions are complex and time-consuming. This extended abstract introduces the efforts that have been made to address those issues and outlines potential future directions.
No abstract available
Intelligent Tutoring Systems (ITSs) have revolutionized education by offering personalized learning experiences. However, as goal-oriented learning, which emphasizes efficiently achieving specific objectives, becomes increasingly important in professional contexts, existing ITSs often struggle to deliver this type of targeted learning experience. In this paper, we propose GenMentor, an LLM-powered multi-agent framework designed to deliver goal-oriented, personalized learning within ITS. GenMentor begins by accurately mapping learners' goals to required skills using a fine-tuned LLM trained on a custom goal-to-skill dataset. After identifying the skill gap, it schedules an efficient learning path using an evolving optimization approach, driven by a comprehensive and dynamic profile of learners' multifaceted status. Additionally, GenMentor tailors learning content with an exploration-drafting-integration mechanism to align with individual learner needs. Extensive automated and human evaluations demonstrate GenMentor's effectiveness in learning guidance and content quality. Furthermore, we have deployed it in practice and implemented it as an application. A practical human study with professional learners further highlights its effectiveness in goal alignment and resource targeting, leading to enhanced personalization. Supplementary resources are available at https://github.com/GeminiLight/gen-mentor.
ElliottAgents: A Natural Language-Driven Multi-Agent System for Stock Market Analysis and Prediction
This paper presents ElliottAgents, a multi-agent system leveraging natural language processing (NLP) and large language models (LLMs) to analyze complex stock market data. The system combines AI-driven analysis with the Elliott Wave Principle to generate human-comprehensible predictions and explanations. A key feature is the natural language dialogue between agents, enabling collaborative analysis refinement. The LLM-enhanced architecture facilitates advanced language understanding, reasoning, and autonomous decision-making. Experiments demonstrate the system's effectiveness in pattern recognition and generating natural language descriptions of market trends. ElliottAgents contributes to NLP applications in specialized domains, showcasing how AI-driven dialogue systems can enhance collaborative analysis in data-intensive fields. This research bridges the gap between complex financial data and human understanding, addressing the need for interpretable and adaptive prediction systems in finance.
LLM-based autonomous agents often fail to execute complex web tasks that require dynamic interaction, largely due to the inherent uncertainty and complexity of these environments. Existing LLM-based web agents typically rely on rigid, expert-designed policies specific to certain states and actions, lacking the flexibility and generalizability needed to adapt to unseen tasks. In contrast, humans excel by exploring unknowns, continuously adapting strategies based on new observations, and resolving ambiguities through exploration. To emulate human-like adaptability, web agents need strategic exploration and complex decision-making. Monte Carlo Tree Search (MCTS) is well-suited for this, but classical MCTS struggles with vast action spaces, unpredictable state transitions, and incomplete information in web tasks. In light of this, we develop WebPilot, a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. Specifically, the Global Optimization phase involves generating a high-level plan by breaking down tasks into manageable subtasks, continuously refining this plan through reflective analysis of new observations and previous subtask attempts, thereby focusing the search process and mitigating challenges posed by vast action spaces in classical MCTS. Subsequently, the Local Optimization phase executes each subtask using a tailored MCTS designed for complex environments, effectively addressing uncertainties and managing incomplete information by iteratively refining decisions based on new observations. Experimental results on WebArena and MiniWoB++ demonstrate the effectiveness of WebPilot. Notably, on WebArena, WebPilot achieves SOTA performance with GPT-4, achieving a 93% relative increase in success rate over the concurrent tree search-based method. WebPilot advances autonomous agents, enabling more reliable decision-making in practical environments.
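The Local Optimization phase described in the WebPilot abstract builds on MCTS, whose core selection step weighs exploitation against exploration. The snippet below sketches only that step (UCB1 selection) with synthetic action statistics; the statistics, action names, and exploration constant are illustrative assumptions, not WebPilot's tailored MCTS:

```python
import math

# UCB1 selection: pick the next action by average reward plus an
# exploration bonus that shrinks as an action is visited more often.

def ucb1_select(stats, total_visits, c=1.4):
    # stats: action -> (visit_count, total_reward)
    def score(action):
        n, r = stats[action]
        if n == 0:
            return float("inf")  # unvisited actions are tried first
        return r / n + c * math.sqrt(math.log(total_visits) / n)
    return max(stats, key=score)

# With an unvisited action present, it is selected immediately.
stats = {"click": (10, 7.0), "type": (2, 1.9), "scroll": (0, 0.0)}
chosen = ucb1_select(stats, total_visits=12)

# Once everything has been visited, the bonus favors the less-explored
# "type" action even though "click" has more total reward.
stats_visited = {"click": (10, 7.0), "type": (2, 1.9)}
refined = ucb1_select(stats_visited, total_visits=12)
```

This is exactly the trade-off the abstract alludes to: in vast web action spaces, a plain greedy policy would keep clicking, while the bonus term forces strategic exploration of under-tried actions.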
Large language models (LLMs) have great potential for molecular optimization, as they can gather external chemistry tools and enable collaborative interactions to iteratively refine molecular candidates. However, this potential remains underexplored, particularly in the context of structured reasoning, interpretability, and comprehensive tool-grounded molecular optimization. To address this gap, we introduce MT-Mol, a multi-agent framework for molecular optimization that leverages tool-guided reasoning and role-specialized LLM agents. Our system incorporates comprehensive RDKit tools, categorized into five distinct domains: structural descriptors, electronic and topological features, fragment-based functional groups, molecular representations, and miscellaneous chemical properties. Each category is managed by an expert analyst agent, responsible for extracting task-relevant tools and enabling interpretable, chemically grounded feedback. MT-Mol produces molecules with tool-aligned and stepwise reasoning through the interaction between the analyst agents, a molecule-generating scientist, a reasoning-output verifier, and a reviewer agent. As a result, our framework achieves state-of-the-art performance on the PMO-1K benchmark on 17 out of 23 tasks.
Recently, with the development of tool-calling capabilities in large language models (LLMs), these models have demonstrated significant potential for automating electronic design automation (EDA) flows by interacting with EDA tool APIs via EDA scripts. However, considering the limited understanding of EDA tools, LLMs face challenges in practical scenarios where diverse interfaces of EDA tools exist across different platforms. Additionally, EDA flow automation often involves intricate, long-chain tool-calling processes, increasing the likelihood of errors in intermediate steps. Any errors will lead to the instability and failure of EDA flow automation. To address these challenges, we introduce EDAid, a multi-agent collaboration system where multiple agents harboring divergent thoughts converge towards a common goal, ensuring reliable and successful EDA flow automation. Specifically, each agent is controlled by ChipLlama models, which are expert LLMs fine-tuned for EDA flow automation. Our experiments demonstrate the state-of-the-art (SOTA) performance of our ChipLlama models and validate the effectiveness of our EDAid in the automation of complex EDA flows, showcasing superior performance compared to single-agent systems.
Automated tools for video editing and assembly have applications ranging from filmmaking and advertisement to content creation for social media. Previous video editing work has mainly focused on either retrieval or user interfaces, leaving the actual editing to the user. In contrast, we propose to automate the core task of video editing, formulating it as a sequential decision-making process. Ours is a multi-agent approach. We design an Editor agent and a Critic agent. The Editor takes as input a collection of video clips together with natural language instructions and uses tools commonly found in video editing software to produce an edited sequence. The Critic, in turn, gives natural language feedback to the Editor based on the produced sequence, or renders it if it is satisfactory. We introduce a learning-based approach for enabling effective communication across specialized agents to address the language-driven video editing task. Finally, we explore an LLM-as-a-judge metric for evaluating the quality of the video editing system and compare it with general human preference. We evaluate our system’s output video sequences qualitatively and quantitatively through a user study and find that our system vastly outperforms existing approaches in terms of coverage, time constraint satisfaction, and human preference. Please see our companion supplemental video for qualitative results.
Large Language Models (LLMs) have demonstrated remarkable capabilities in Register Transfer Level (RTL) design, enabling high-quality code generation from natural language descriptions. However, LLMs alone face significant limitations in real-world hardware design workflows, including the inability to execute code, lack of debugging capabilities, and absence of long-term memory. To address these challenges, we present ASIC-Agent, an autonomous system designed specifically for digital ASIC design tasks. ASIC-Agent enhances base LLMs with a multi-agent architecture incorporating specialized subagents for RTL generation, verification, OpenLane hardening, and Caravel chip integration, all operating within a comprehensive sandbox environment with access to essential hardware design tools. The system leverages a vector database containing documentation, API references, error knowledge, and curated insights from the open-source silicon community. To evaluate ASIC-Agent’s performance, we introduce ASIC-Agent-Bench, the first benchmark specifically designed to assess agentic systems in hardware design tasks. We evaluate ASIC-Agent with various base LLMs, providing quantitative comparisons and qualitative insights into agent behavior across different design scenarios. Our results demonstrate that ASIC-Agent, when powered by Claude 4 Sonnet, successfully automates a broad range of ASIC design tasks spanning varying levels of complexity, showing the potential of significantly accelerating the ASIC design workflow. Our work is open-source and publicly available on GitHub.
The proliferation of AI-powered search and recommendation systems has accelerated the formation of “filter bubbles” that reinforce people’s biases and narrow their perspectives. Previous research has attempted to address this issue by increasing the diversity of information exposure, which is often hindered by a lack of user motivation to engage with it. In this study, we took a human-centered approach to explore how Large Language Models (LLMs) could assist users in embracing more diverse perspectives. We developed a prototype featuring LLM-powered multi-agent characters that users could interact with while reading social media content. We conducted a participatory design study with 18 participants and found that multi-agent dialogues with gamification incentives could motivate users to engage with opposing viewpoints. Additionally, progressive interactions with assessment tasks could promote thoughtful consideration. Based on these findings, we provide design implications and future work outlooks for leveraging LLMs to help users burst their filter bubbles.
Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored, i.e., whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.
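The weighting idea behind CP-WBFT can be illustrated with confidence-weighted majority voting: each agent's vote counts in proportion to a confidence score, so low-confidence (potentially Byzantine) agents carry less influence. The votes and weights below are made up for illustration; the actual mechanism uses probe-derived confidences and weighted information flow:

```python
# Confidence-weighted consensus: sum each answer's supporting
# confidence and return the answer with the highest total weight.

def weighted_consensus(votes):
    # votes: list of (answer, confidence) pairs, one per agent
    tally = {}
    for answer, confidence in votes:
        tally[answer] = tally.get(answer, 0.0) + confidence
    return max(tally, key=tally.get)

# Three low-confidence agents push "41"; two high-confidence agents
# push "42". A plain head-count majority would pick "41", but the
# weighted tally (42: 1.7 vs 41: 0.8) picks "42".
votes = [("42", 0.9), ("41", 0.2), ("42", 0.8), ("41", 0.3), ("41", 0.3)]
result = weighted_consensus(votes)
```

This is the failure mode weighting is meant to fix: under a high Byzantine fault rate, unweighted majority voting can be captured by a numerical majority of faulty agents.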
This paper explores the integration of advanced Multi-Agent Systems (MAS) techniques to develop a team of agents with enhanced logical reasoning, long-term knowledge retention, and Theory of Mind (ToM) capabilities. By uniting these core components with optimized communication protocols, we create a novel framework called SynergyMAS, which fosters collaborative teamwork and superior problem-solving skills. The system's effectiveness is demonstrated through a product development team case study, where our approach significantly enhances performance and adaptability. These findings highlight SynergyMAS's potential to tackle complex, real-world challenges.
No abstract available
This study investigates the implementation of LLM agents in smart city management, leveraging both the inherent language processing abilities of LLMs and the distributed problem solving capabilities of multi-agent systems for the improvement of urban decision-making processes. A multi-agent system architecture combines LLMs with existing urban information systems to process complex queries and generate contextually relevant responses for urban planning and management. The research focuses on testing three main hypotheses: (1) LLM agents’ capability for effective routing and processing of diverse urban queries, (2) the effectiveness of Retrieval-Augmented Generation (RAG) technology in improving response accuracy when working with local knowledge and regulations, and (3) the impact of integrating LLM agents with existing urban information systems. Our experimental results, based on a comprehensive validation dataset of 150 question–answer pairs, demonstrate significant improvements in decision support capabilities. The multi-agent system achieved pipeline selection accuracy of 94–99% across different models, while the integration of RAG technology improved response accuracy by 17% for strategic development queries and 55% for service accessibility questions. The combined use of document databases and service APIs resulted in the highest performance metrics (G-Eval scores of 0.68–0.74) compared to standalone LLM responses (0.30–0.38). Using St. Petersburg’s Digital Urban Platform as a testbed, we demonstrate the practical applicability of this approach to create integrated city management systems that support complex urban decision-making processes. This research contributes to the growing field of AI-enhanced urban management by providing empirical evidence of LLM agents’ effectiveness in processing heterogeneous urban data and supporting strategic planning decisions. Our findings suggest that LLM-based multi-agent systems can significantly enhance the efficiency and accuracy of urban decision making while maintaining high relevance in responses.
No abstract available
Prior Authorization delivers safe, appropriate, and cost-effective care that is medically justified with evidence-based guidelines. However, the process often requires labor-intensive manual comparisons between patient medical records and clinical guidelines, a process that is both repetitive and time-consuming. Recent developments in Large Language Models (LLMs) have shown potential in addressing complex medical NLP tasks with minimal supervision. This paper explores the application of a Multi-Agent System (MAS) that utilizes specialized LLM agents to automate the Prior Authorization task by breaking it down into simpler, manageable sub-tasks. Our study systematically investigates the effects of various prompting strategies on these agents and benchmarks the performance of different LLMs. We demonstrate that GPT-4 achieves an accuracy of 86.2% in predicting checklist item-level judgments with evidence, and 95.6% in determining overall checklist judgment. Additionally, we explore how these agents can contribute to the explainability of the steps taken in the process, thereby enhancing trust and transparency in the system.
While multi-agent LLM systems show strong capabilities in various domains, they are highly vulnerable to adversarial and low-performing agents. To resolve this issue, in this paper, we introduce a general and adversary-resistant multi-agent LLM framework based on credibility scoring. We model the collaborative query-answering process as an iterative game, where the agents communicate and contribute to a final system output. Our system associates each agent with a credibility score that is used when aggregating the team's outputs. The credibility scores are learned gradually based on each agent's past contributions to query answering. Our experiments across multiple tasks and settings demonstrate our system's effectiveness in mitigating adversarial influence and enhancing the resilience of multi-agent cooperation, even in adversary-majority settings.
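The iterative credibility-learning loop described above can be sketched as: aggregate answers weighted by current credibility, then reward agents who agreed with the aggregate and penalize dissenters. The update rule, learning rate, and agent answers below are illustrative assumptions, not the paper's learning method:

```python
# Credibility-weighted aggregation plus a simple credibility update:
# agents matching the aggregated output gain credibility, others lose it.

def aggregate(answers, cred):
    tally = {}
    for agent, ans in answers.items():
        tally[ans] = tally.get(ans, 0.0) + cred[agent]
    return max(tally, key=tally.get)

def update_credibility(answers, output, cred, lr=0.1):
    for agent, ans in answers.items():
        delta = lr if ans == output else -lr
        cred[agent] = min(1.0, max(0.0, cred[agent] + delta))

# One persistent adversary against two honest agents, over three rounds.
cred = {"a": 0.5, "b": 0.5, "adversary": 0.5}
for _ in range(3):
    answers = {"a": "yes", "b": "yes", "adversary": "no"}
    out = aggregate(answers, cred)
    update_credibility(answers, out, cred)
```

After a few rounds the adversary's weight has decayed, so its future votes contribute less to the aggregate; this is the mechanism that keeps influence bounded even before an adversary is explicitly identified.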
Heterogeneous multi-agent systems (HMAS) comprise various intelligent agents with specialized functions, such as drones, ground robots, and automated devices, working in coordinated settings. This paper presents AutoHMA-LLM, a novel framework that combines Large Language Models (LLMs) with classical control algorithms to address the challenges of task coordination and scheduling in complex, dynamic environments. The framework is designed with a multi-tier architecture, utilizing a cloud-based LLM as the central planner alongside device-specific LLMs and Generative Agents to improve task execution efficiency and accuracy. Specifically targeting dynamic scenarios, the system enhances resource utilization and stabilizes task execution through refined task scheduling and real-time feedback mechanisms. In experiments conducted across logistics, inspection, and search & rescue scenarios, AutoHMA-LLM demonstrated a 5.7% improvement in task completion accuracy, a 46% reduction in communication steps, and a 31% decrease in token usage and API calls compared to baseline methods. These results highlight our framework’s scalability and efficiency, offering substantial support for effective multi-agent collaboration in complex, resource-constrained environments.
Most discussions of Large Language Model (LLM) safety have focused on single-agent settings, but multi-agent LLM systems now create novel adversarial risks because their behavior depends on inter-agent communication and decentralized reasoning. In this work, we focus on attacking pragmatic systems that have constraints such as limited token bandwidth, latency between message delivery, and defense mechanisms. We design a permutation-invariant adversarial attack that optimizes prompt distribution across latency- and bandwidth-constrained network topologies to bypass the distributed safety mechanisms within the system. Formulating attack-path selection as a maximum-flow minimum-cost problem, coupled with a novel Permutation-Invariant Evasion Loss (PIEL), we leverage graph-based optimization to maximize attack success rate while minimizing detection risk. Evaluating on models including Llama, Mistral, Gemma, DeepSeek, and other variants, across datasets such as JailBreakBench and AdversarialBench, our method outperforms conventional attacks by up to 7x, exposing critical vulnerabilities in multi-agent systems. Moreover, we demonstrate that existing defenses, including variants of Llama-Guard and PromptGuard, fail to block our attack, emphasizing the urgent need for multi-agent-specific safety mechanisms.
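The full max-flow formulation is beyond an abstract, but the permutation-invariance property the loss relies on is easy to illustrate: aggregate per-fragment scores with a symmetric function so that no reordering of messages across the topology changes the objective (the hash-based scoring below is a toy stand-in, not PIEL itself):

```python
import hashlib
import itertools

def fragment_risk(fragment):
    # Toy stand-in for a per-fragment detection score (deterministic, integer-valued).
    return int(hashlib.sha256(fragment.encode()).hexdigest(), 16) % 100

def evasion_loss(fragments):
    """Permutation-invariant loss: aggregate per-fragment scores with a
    symmetric function (integer sum), so reordering messages across the
    topology cannot change the objective."""
    return sum(fragment_risk(f) for f in fragments)

frags = ["payload-a", "payload-b", "payload-c"]
losses = {evasion_loss(list(p)) for p in itertools.permutations(frags)}
print(losses)  # a single value: the loss is identical under every ordering
```

Any symmetric aggregator (sum, max, mean) gives the same invariance; the paper's actual loss additionally trades off attack success against detection risk.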
No abstract available
Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner's tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.
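A Planner that emits a DAG can hand execution to a standard topological sort; a sketch using Python's stdlib graphlib, with a hypothetical multi-tool plan (the tool names are illustrative, not drawn from ComplexTool-Plan):

```python
from graphlib import TopologicalSorter

# Hypothetical plan for "book the cheapest flight and a matching hotel".
# Each key maps a step to the set of steps whose outputs it consumes.
plan = {
    "search_flights": set(),
    "search_hotels": set(),
    "pick_cheapest_flight": {"search_flights"},
    "match_hotel_to_dates": {"search_hotels", "pick_cheapest_flight"},
    "book": {"pick_cheapest_flight", "match_hotel_to_dates"},
}

# static_order() yields every step only after all of its prerequisites,
# which is exactly the execution guarantee a global DAG plan provides.
order = list(TopologicalSorter(plan).static_order())
print(order)
```

Unlike ReAct's incremental decisions, the whole dependency structure is fixed up front, so no step can commit to a locally plausible but globally suboptimal tool call.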
Writing comprehensive unit tests is a time-consuming challenge in software development. While Large Language Model (LLM)-based tools offer a solution, they often struggle with code correctness and reliability. This paper proposes a novel tool for generating Java unit tests using a three-stage pipeline architecture: code description, test scenario listing, and code generation. Our approach integrates advanced techniques such as Reflexion for iterative description refinement, parallel generation for scenario diversity, and the ReAct framework for dynamic code repair. Testing on Apache Commons projects shows superior code and branch coverage (92%-98%) over benchmarks, with a 78%-83% execution success rate.
No abstract available
With the widespread application of large language models (LLMs) in intelligent conversation and recommender systems, integrating them into travel-related tasks has become a key research focus within the smart mobility domain. However, limitations such as high fine-tuning costs, cold-start challenges, issues in validation and logical coherence, and difficulties in maintaining contextual memory hinder the effectiveness of personalized interactions by traditional LLMs in travel scenarios. To address these challenges, we propose the Reasoning-enhanced Multi-turn Agent with Personalized Adaptation Framework (ReMAP), a generation-augmented agent framework that reduces personalization costs, improves validation and logical interpretability via Reasoning-and-Acting (ReAct) and Chain-of-Thought (CoT) prompting, and incorporates self-updating and retrieval mechanisms for factual memory to enhance the robustness of personalized generation in LLMs. The Tibet tourism-oriented personalized interaction agent system built upon this framework demonstrates strong performance in multiple-round, multi-group response experiments conducted under real-world travel scenarios. Experimental results show that ReMAP significantly outperforms baseline approaches in cold-start responsiveness (+10.51% personalization accuracy), itinerary feasibility (+12.38% pass rate), and long-term personalization consistency.
Smart buildings represent a significant trend in the future of the construction industry, and the quality of human-computer interaction plays a vital role in achieving this from a human perspective. However, existing human-computer interaction algorithms are often limited to simple commands and fail to meet the complex and diverse needs of users. To address this issue, this paper introduces large language models (LLMs) and AI agents into smart buildings, proposing a general AI agent framework based on the ReAct strategy. The LLM serves as the system's brain, responsible for reasoning and action planning, while a tool-calling mechanism puts the LLM's plans into practice. Through this framework, developers can rely on prompt engineering alone to enable the LLM to interpret user intent accurately and perform appropriate actions.
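Several abstracts in this section build on the ReAct strategy; a minimal sketch of the underlying loop, with a scripted stand-in for the LLM and a single calculator tool (both purely illustrative), shows how reasoning, tool calls, and observations interleave:

```python
def calculator(expr):
    # Restricted eval: arithmetic only, no builtins.
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def fake_llm(transcript):
    """Stand-in for an LLM: scripted thought/action pairs for one query."""
    if "Observation" not in transcript:
        return "Thought: I need the total power draw.\nAction: calculator[3*150+2*60]"
    return "Final Answer: 570 watts"

def react(query, max_steps=5):
    transcript = f"Question: {query}"
    for _ in range(max_steps):
        step = fake_llm(transcript)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        action = step.split("Action:")[1].strip()   # e.g. "calculator[...]"
        tool, arg = action.split("[", 1)
        observation = TOOLS[tool](arg.rstrip("]"))
        transcript += f"\n{step}\nObservation: {observation}"
    return None

print(react("Total wattage of 3 lamps at 150W and 2 fans at 60W?"))
```

In a real deployment the scripted `fake_llm` is replaced by a model call, and the tool registry grows to cover the building's actuators and sensors.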
Large Language Model (LLM) agents are commonly tuned with supervised fine-tuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning over and comparing alternative actions, LLM agents fine-tuned with these methods may over-commit to seemingly plausible but suboptimal actions due to limited action-space exploration. To address this, in this paper we propose the Self-taught ActioN Deliberation (SAND) framework, which enables LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given a large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action-deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to fine-tune the LLM agent itself. Evaluated on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised fine-tuning and also outperforms state-of-the-art agent tuning approaches.
High-Performance Computing (HPC) job scheduling involves balancing conflicting objectives such as minimizing makespan, reducing wait times, optimizing resource use, and ensuring fairness. Traditional methods, whether heuristics such as First-Come-First-Served (FCFS) and Shortest Job First (SJF) or intensive optimization techniques, often lack adaptability to dynamic workloads and, more importantly, cannot simultaneously optimize multiple objectives in HPC systems. To address this, we propose a novel Large Language Model (LLM)-based scheduler using a ReAct-style (Reason + Act) framework, enabling iterative, interpretable decision-making. The system incorporates a scratchpad memory to track scheduling history and refine decisions via natural language feedback, while a constraint enforcement module ensures feasibility and safety. We evaluate our approach using OpenAI's O4-Mini and Anthropic's Claude 3.7 across seven real-world HPC workload scenarios, including heterogeneous mixes, bursty patterns, and adversarial cases. Comparisons against FCFS, SJF, and Google OR-Tools (on 10 to 100 jobs) reveal that LLM-based scheduling effectively balances multiple objectives while offering transparent reasoning through natural-language traces. The method excels in constraint satisfaction and adapts to diverse workloads without domain-specific training. However, a trade-off between reasoning quality and computational overhead challenges real-time deployment. This work presents the first comprehensive study of reasoning-capable LLMs for HPC scheduling, demonstrating their potential to handle multi-objective optimization while highlighting limitations in computational efficiency. The findings provide insights into leveraging advanced language models for complex scheduling problems in dynamic HPC environments.
The Semantic Table Annotation (STA) task, which includes Column Type Annotation (CTA) and Cell Entity Annotation (CEA), maps table contents to ontology entities and plays an important role in various semantic applications. However, complex tables often pose challenges such as semantic loss of column names or cell values, strict ontological hierarchy requirements, homonyms, spelling errors, and abbreviations, which hinder annotation accuracy. To address these issues, this paper proposes an LLM-based agent approach for CTA and CEA. We design and implement five external tools with tailored prompts based on the ReAct framework, enabling the STA agent to dynamically select suitable annotation strategies depending on table characteristics. Experiments are conducted on the Tough Tables and BiodivTab datasets from the SemTab challenge, which contain the aforementioned challenges. Our method outperforms existing approaches across various metrics. Furthermore, by leveraging Levenshtein distance to reduce redundant annotations, we achieve a 70% reduction in time costs and a 60% reduction in LLM token usage, providing an efficient and cost-effective solution for STA.
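The Levenshtein-based deduplication can be sketched directly: drop candidate annotations within a small edit distance of one already kept, so near-duplicates are never sent to the LLM (the threshold and the sample labels are illustrative, not the paper's settings):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def dedup(candidates, threshold=2):
    """Keep a candidate only if it is more than `threshold` edits away
    from everything already kept, collapsing spelling variants."""
    kept = []
    for c in candidates:
        if all(levenshtein(c, k) > threshold for k in kept):
            kept.append(c)
    return kept

# "Streptococus" is one edit from "Streptococcus" and gets collapsed.
print(dedup(["Streptococcus", "Streptococus", "Lactobacillus"]))
```

Each collapsed variant is one fewer tool call and prompt, which is where the reported time and token savings would come from.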
This paper explores the integration of collective intelligence within project management by leveraging advanced large language models (LLMs) for agile sprint planning. Three distinct LLM-based architectures are compared: a single LLM agent, a ReAct (Reasoning + Action) agent, and a multi-agent system with specialized roles. The study demonstrates how these architectures can simulate project management roles, such as project manager and QA engineer, to collaboratively generate comprehensive sprint plans. Using the SMART framework for evaluation, the models’ effectiveness in producing specific, measurable, achievable, relevant, and time-bound tasks is assessed. The findings highlight the strengths of each approach, from fast single-agent solutions to more intricate multi-agent systems. The research contributes to enhancing decision-making processes in agile environments, with significant implications for the future of automated project management tools.
Motivated by the astonishing capabilities of large language models (LLMs) in text generation, reasoning, and simulation of complex human behaviors, in this paper we propose a novel multi-component LLM-based framework, namely LLM4ACOE, that fully automates the collaborative ontology engineering (COE) process using role-playing simulation of LLM agents and retrieval-augmented generation (RAG) technology. The proposed solution enhances the LLM-powered role-playing simulation with RAG ‘feeding’ the LLM three different types of external knowledge. This knowledge corresponds to the knowledge required by each of the COE roles (agents), using a component-based framework, as follows: (a) domain-specific data-centric documents, (b) OWL documentation, and (c) ReAct guidelines. These components are evaluated in combination, with the aim of investigating their impact on the quality of the generated ontologies. The aim of this work is twofold: (a) to identify the capacity of LLM-based agents to generate ontologies acceptable to human experts through agentic collaborative ontology engineering (ACOE) role-playing simulation, at specific levels of acceptance (accuracy, validity, and expressiveness), without human intervention; and (b) to investigate whether and to what extent the selected RAG components affect the quality of the generated ontologies. The evaluation of this novel approach is performed using ChatGPT-o in the domain of search and rescue (SAR) missions. To assess the generated ontologies, quantitative and qualitative measures are employed, focusing on coverage, expressiveness, structure, and human involvement.
Individuals entering Vietnam's dynamic Information Technology (IT) job market face a critical gap in reliable career guidance. Existing market reports are often outdated, while the manual analysis of thousands of job postings is impractical for most. To address this challenge, we present the AI Job Market Consultant, a novel conversational agent that delivers deep, data-driven insights directly from the labor market in real time. The foundation of our system is a custom-built dataset created via an automated pipeline that crawls job portals using Playwright and leverages a Large Language Model (LLM) to intelligently structure unstructured posting data. The core of our system is a tool-augmented AI agent, based on the ReAct agentic framework, that can autonomously reason, plan, and execute actions through a specialized toolbox for SQL queries, semantic search, and data visualization. Our prototype successfully collected and analyzed 3,745 job postings, demonstrating its ability to answer complex, multi-step queries, generate on-demand visualizations, and provide personalized career advice grounded in real-world data. This work introduces a new paradigm for labor market analysis, showcasing how specialized agentic AI systems can democratize access to timely, trustworthy career intelligence for the next generation of professionals.
Answering questions over multi-page, multimodal documents, including text and figures, is a critical challenge for applications that require answers to integrate information across multiple modalities and contextual dependencies. Existing methods, such as single-turn retrieval-augmented generation (RAG), struggle to retrieve fine-grained and contextually relevant information from large, heterogeneous documents, leading to suboptimal performance. Inspired by iterative frameworks like ReAct, which refine retrieval through feedback, we propose Doc-React, an adaptive iterative framework that balances information gain and uncertainty reduction at each step. Doc-React leverages InfoNCE-guided retrieval to approximate mutual information, enabling dynamic sub-query generation and refinement. A large language model (LLM) serves as both a judge and generator, providing structured feedback to iteratively improve retrieval. By combining mutual information optimization with entropy-aware selection, Doc-React systematically captures relevant multimodal content, achieving strong performance on complex QA tasks.
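The InfoNCE objective behind Doc-React's retrieval scores how sharply the positive chunk stands out among candidates; a minimal sketch (the temperature and similarity scores are illustrative):

```python
import math

def infonce(pos_score, all_scores, tau=0.1):
    """InfoNCE: negative log-probability of the positive among candidates.
    A lower loss means the positive passage dominates the softmax, i.e. a
    higher estimated mutual information between query and retrieved chunk."""
    denom = sum(math.exp(s / tau) for s in all_scores)
    return -math.log(math.exp(pos_score / tau) / denom)

# A chunk that clearly beats its negatives yields a much lower loss
# than one that barely edges them out, guiding sub-query refinement.
sharp = infonce(0.9, [0.9, 0.2, 0.1])
flat = infonce(0.4, [0.4, 0.38, 0.35])
print(sharp, flat)
```

In an iterative loop, a high (flat) loss would signal that the current sub-query does not discriminate the target content and should be refined.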
The emergence of Agentic IoT, where autonomous intelligent agents such as mobile robots, UAVs, and industrial actuators independently execute complex missions, demands communication and security configurations that can adapt to both fast mission-driven changes and slower environment-driven performance drifts. Existing control paradigms are inadequate. Specifically, static policies cannot react to real-time variations, while task-aware adaptive policies largely overlook environmental dynamics, leaving systems vulnerable to network degradation and latency spikes. To address these limitations, we propose the Dynamic Model Context Protocol (dMCP), a cognitive control framework that bridges high-level mission intents with low-level system configurations via the standardized MCP interface. dMCP employs a Large Language Model to reason over real-time mission and environment contexts, generating executable policy vectors. An event-driven trigger mechanism re-evaluates policies upon abrupt mission changes or significant environmental drifts, ensuring timely adaptation without overreacting to transient fluctuations. Simulation results demonstrate that dMCP achieves higher reliability, reduced tail latency, and improved Service Level Objective compliance compared with both static and task-aware adaptive baselines, making it a viable control paradigm for highly dynamic Agentic IoT deployments.
While large language models (LLMs) show impressive decision-making abilities, current methods lack a mechanism for automatic self-improvement from errors during task execution. We propose LEAP, an iterative fine-tuning framework that continually improves LLM agents using feedback from AI expert teachers. Our key insight is to equip the expert teachers with a privileged state: information that is available during training but hidden at test time. This allows even weak experts to provide precise guidance, significantly improving the student agent's performance without access to privileged information at test time. We evaluate LEAP on diverse decision-making benchmarks, including text-based games (ALFWorld), web navigation (WebShop), and interactive coding (Intercode Bash). Our experiments show that LEAP (1) outperforms behavior cloning and ReAct baselines, (2) enables weak student models (e.g., Llama3-8B) to exceed the performance of strong teacher models (GPT-4o), and (3) allows weak models to self-improve using privileged versions of themselves. We also provide a theoretical analysis showing that LEAP's success hinges on balancing privileged information with the student's realizability, which we empirically validate. Our code is available at https://leap-llm.github.io
The accelerating growth of scientific literature presents significant challenges for researchers aiming to synthesize information effectively and identify novel research directions. Conventional Large Language Models (LLMs), while versatile, often lack the nuanced understanding of domain-specific academic structures required for optimal contextual retrieval. This paper introduces a fine-tuned Large Language Model (LLM)-based research-assistant framework developed to enhance literature analysis for automated proposal generation within academic domains. The system employs a structured document-ingestion pipeline, incorporating Retrieval-Augmented Generation (RAG) with a reranker module to optimize context relevance and reduce hallucinations. Our RAG approach uses a two-step retrieval process, cosine similarity followed by a reranker, to pinpoint the most relevant documents. It integrates multimodal capabilities through vision-language models for diagram interpretation and uses ReAct with structured prompt templates to generate research proposals by analyzing “Limitations” and “Future Work” sections. Our experiments clearly show the value of the reranker: it significantly boosts answer relevancy, with an average score of 0.994 post-reranking, and frequently guides the system to a perfect rank (1.0000). Crucially, our domain-specific fine-tuning of the language models, employing techniques such as LoRA, proved effective in adapting the model to academic discourse, achieving a validation perplexity as low as 1.1716. This signifies enhanced model understanding and specialization, leading to more accurate contextual interpretation, better synthesis of complex ideas, and more coherent proposal generation. The proposed model highlights the effective application of domain-specific fine-tuning to streamline data-driven academic workflows in AI-supported research environments.
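The two-step retrieval described above, a cheap cosine shortlist followed by an expensive reranker, can be sketched as follows (the token-overlap reranker and toy embeddings are stand-ins for the real models):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rerank_stub(query, doc):
    """Stand-in for a cross-encoder reranker: fraction of query tokens in doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def retrieve(query, query_vec, corpus, k=2):
    # Stage 1: cheap cosine shortlist over precomputed embeddings.
    shortlist = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]),
                       reverse=True)[:k]
    # Stage 2: run the expensive reranker only on the shortlist.
    return max(shortlist, key=lambda d: rerank_stub(query, d["text"]))

corpus = [
    {"text": "limitations of transformer models", "vec": [0.9, 0.1]},
    {"text": "future work on retrieval augmented generation", "vec": [0.8, 0.3]},
    {"text": "cooking with cast iron", "vec": [0.0, 1.0]},
]
best = retrieve("future work retrieval", [0.85, 0.2], corpus)
print(best["text"])
```

The shortlist keeps the reranker's quadratic query-document cost off the full corpus, which is why the two-step design scales.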
The growing sophistication of cyber threats and the exponential rise in alert volumes have exposed the limitations of traditional Security Operations Centers (SOCs), leading to analyst fatigue, high turnover, and inefficiencies in incident response. Conventional SOAR platforms struggle to address these issues due to their rigid rule-based logic and insufficient contextual awareness. Although large language model (LLM)-based solutions have shown potential, they often lack consistency in reasoning, effective tool orchestration, factual accuracy, and adaptability to emerging threats. In this work, we present an autonomous SOC agent that integrates the ReAct (Reasoning and Acting) framework with detection engineering principles to overcome these challenges. By embedding structured investigation logic and enriched alert metadata directly into the analysis workflow, our approach delivers domain-specific context to support accurate tool invocation and actionable remediation guidance. This integration fosters transparency and reliability throughout the alert lifecycle. Empirical evaluations demonstrate that our solution significantly enhances alert triage and incident response, offering a scalable path toward more resilient, AI-driven SOC operations.
Accurate prediction of phage-host interactions remains a fundamental challenge that impedes the clinical deployment of phage therapy. Current graph neural network-based methods rely on superficial sequence features, failing to adequately integrate deep biological semantic information. In this work, we present DSHMGAT (Dual-Stream Hierarchical Mixed-Routing Graph Attention Network), a novel framework that incorporates agent-generated semantic embeddings to mitigate this semantic deficiency. By employing a ReAct-driven agent with function-calling capabilities to access domain-specific databases, our approach reduces hallucinations inherent in biological LLM applications. DSHMGAT adopts a dual-stream hierarchical graph neural network that simultaneously captures genomic sequences and biological semantic representations, enabling effective cross-modal information integration. Inspired by mixture-of-experts architectures, we develop a mixed-routing attention mechanism that improves learning flexibility through dynamic weight allocation between routing heads and shared heads. DSHMGAT achieves an AUC of 0.9486 on the benchmark dataset, outperforming established approaches in our comparative analysis. Ablation experiments reveal that the observed improvements can be attributed to the combined effects of multimodal fusion, cross-modal interaction, and the mixed-routing mechanism.
Thinktank is a task execution framework that harnesses the reasoning capabilities of large language models (LLMs). These models demonstrate an ability to provide intermediate steps essential for approaching problem solutions. Thinktank capitalizes on this by employing an LLM, such as GPT-4, to iteratively solve given objectives. Initially, Thinktank processes each objective into actionable tasks that can be executed immediately using the available information. We employ a technique called ReAct prompting (https://arxiv.org/abs/2210.03629) to leverage the LLM’s reasoning abilities, guiding it to select appropriate Agents. Our method deviates from the original paper’s proposal; recognizing that the vast number of available functions could overwhelm the decision-making of statistical language models, we restrict the model’s function choices by automatically clustering Agents into no more than ten capabilities. These capabilities are dynamically recalculated whenever new Agents are added to the system.
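Thinktank's capability clustering is not specified beyond the ten-capability cap; a greedy sketch under that constraint (the agent names, toy embeddings, and radius are assumptions) might look like:

```python
import math

CAP = 10  # never expose more than ten capabilities to the model

def cluster_agents(agents, radius=0.5, cap=CAP):
    """Greedy capability clustering: an agent joins the nearest existing
    capability within `radius`; otherwise it founds a new one, until `cap`
    clusters exist, after which everything joins its nearest cluster."""
    clusters = []  # each: {"centroid": vec, "members": [names]}
    for name, vec in agents:
        nearest = min(clusters,
                      key=lambda c: math.dist(vec, c["centroid"]),
                      default=None)
        if nearest and (math.dist(vec, nearest["centroid"]) <= radius
                        or len(clusters) >= cap):
            nearest["members"].append(name)
        else:
            clusters.append({"centroid": vec, "members": [name]})
    return clusters

agents = [("web_search", [0.0, 1.0]), ("wiki_lookup", [0.1, 0.9]),
          ("python_exec", [1.0, 0.0]), ("shell_exec", [0.9, 0.1])]
caps = cluster_agents(agents)
print([c["members"] for c in caps])  # two capabilities: lookup and execution
```

Re-running the clustering whenever an agent is added, as the abstract describes, just means calling `cluster_agents` again over the updated roster.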
While large language models (LLMs) have shown promise in the table question answering (TQA) task through prompt engineering, they face challenges in industrial applications, including structural heterogeneity, difficulties in target data localization, and bottlenecks in complex reasoning. To address these limitations, this paper presents TableZoomer, a novel LLM-powered, programming-based agent framework. It introduces three key innovations: (1) replacing the original fully verbalized table with structured table schema to bridge the semantic gap and reduce computational complexity; (2) a query-aware table zooming mechanism that dynamically generates sub-table schema through column selection and entity linking, significantly improving target localization efficiency; and (3) a Program-of-Thoughts (PoT) strategy that transforms queries into executable code to mitigate numerical hallucination. Additionally, we integrate the reasoning workflow with the ReAct paradigm to enable iterative reasoning. Extensive experiments demonstrate that our framework maintains the usability advantages while substantially enhancing performance and scalability across tables of varying scales. When implemented with the Qwen3-8B-Instruct LLM, TableZoomer achieves accuracy improvements of 19.34% and 25% over conventional PoT methods on the large-scale DataBench dataset and the small-scale Fact Checking task of TableBench dataset, respectively.
An LLM-Based Agentic Network Traffic Incident-Report Approach Towards Explainable-AI Network Defense
Traditional intrusion detection systems for IoT networks achieve high classification accuracy but lack interpretability and actionable incident-response capabilities, limiting their operational value in security-critical environments. This paper presents a graph-based multi-agent framework that integrates ensemble machine learning with Large Language Model (LLM)-powered incident report generation via Retrieval-Augmented Generation (RAG). The system employs a three-phase architecture: (1) a lightweight Random Forest binary pre-detection stage, achieving 99.49% accuracy with a 6 MB model size for edge deployment; (2) ensemble classification combining a Multi-Layer Perceptron, Random Forest, and XGBoost with soft voting and SHAP-based feature attribution for explainability; and (3) a ReAct-based summary agent that synthesizes classification results with external threat intelligence from Web search and scholarly databases to generate evidence-grounded incident reports. To address the challenge of evaluating non-deterministic LLM outputs, we introduce custom RAG evaluation metrics, faithfulness and groundedness, implemented via the LLM-as-Judge framework. Experimental validation on the ACI IoT Network Dataset 2023 demonstrates ensemble accuracy exceeding 99.8% across 11 attack classes; perfect groundedness scores (1.0), indicating all generated claims derive from the retrieved context; and moderate faithfulness (0.64), reflecting appropriate analytical synthesis. The ensemble approach mitigates individual model weaknesses, improving the UDP Flood F1 score from 48% (MLP alone) to 95% through soft voting. This work bridges the gap between high-accuracy detection and trustworthy, actionable security analysis for automated incident-response systems.
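The soft-voting step that lifts the UDP Flood F1 score works by averaging per-class probabilities across models before taking the argmax; a minimal sketch (the probability rows are illustrative, not from the dataset):

```python
def soft_vote(prob_rows, classes):
    """Average per-class probabilities across models, then pick the argmax."""
    n = len(prob_rows)
    avg = [sum(row[i] for row in prob_rows) / n for i in range(len(classes))]
    return classes[max(range(len(classes)), key=avg.__getitem__)]

classes = ["Benign", "UDP Flood", "Port Scan"]
# The MLP alone is wrong on this flow, but the Random Forest and XGBoost
# are confident enough to outvote it once probabilities are averaged.
mlp = [0.50, 0.30, 0.20]
rf  = [0.10, 0.80, 0.10]
xgb = [0.15, 0.75, 0.10]
print(soft_vote([mlp, rf, xgb], classes))  # → UDP Flood
```

Because soft voting weighs confidence rather than counting hard labels, one uncertain model rarely flips the ensemble, which is the mechanism behind the reported F1 recovery.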
Modern supply chain management systems often suffer from fragmented decision-making where demand forecasting, inventory control, supplier management, and pricing operate as independent processes. This lack of coordination frequently leads to inefficiencies such as stockouts, excess inventory, revenue loss, and poor supplier utilization. To address these challenges, this paper presents StockSage, a multi-agent inventory management system powered by Large Language Models (LLMs). The proposed system employs four specialized agents responsible for forecasting demand, managing inventory levels, selecting optimal suppliers, and recommending pricing strategies. These agents collaborate through a structured two-round coordination protocol that enables cross-functional communication and adaptive decision-making. The system is implemented as a full-stack web application using modern technologies including Next.js, React, TypeScript, Prisma ORM, SQLite, and OpenAI GPT APIs. A Monte Carlo simulation framework is used to evaluate system performance against traditional baseline strategies such as static reorder policies, moving average forecasting, and fixed pricing methods. Experimental results indicate improvements in forecast accuracy, service level, inventory turnover, and revenue optimization. The results demonstrate the potential of coordinated multi-agent LLM systems to provide intelligent, explainable, and scalable decision support for modern inventory and supply chain management.
Recent advances in large language models (LLMs) have enabled impressive progress across diverse tasks, yet interpretability remains a core requirement for deployment in high-stakes domains such as crisis prevention and policy-making. Prior work on event prediction has largely prioritized accuracy, but the reasoning behind model outputs often remains opaque and difficult to audit. In this paper, we propose C3OT, Causality Contextualized Chain-of-Thought, which integrates causal reasoning into an agentic LLM framework using the ReAct paradigm. We design and evaluate multiple prompting strategies, including Causal Chain Learning, Chain-of-Thought, and more nuanced hybrid approaches. Experiments assess both predictive accuracy and interpretability, the latter measured through structured rubrics that capture transparency, causal coherence, and auditability. Results demonstrate that our causal reasoning approach attains competitive predictive performance while producing more transparent and auditable reasoning traces. These findings underscore the value of causal reasoning for enhancing both trustworthiness and robustness in sociopolitical forecasting.
Search engines are vital for online e-commerce but often struggle with long, detailed queries. We introduce Search Swarm, a novel multi-agent system designed to improve search engine navigation on platforms like Amazon by accurately locating relevant products based on user instructions. Search Swarm employs multiple large language model (LLM) agents, each with a specific role: query planner, searcher, critic, and attribute selector. These agents collaborate to generate search queries, evaluate results, and identify the best product options tailored to users' needs. Our framework outperforms existing methods like ReAct and Reflexion in the WebShop environment, achieving a reward score of 62.64, compared to scores of 54.1, 59.8, 61.5, and 58.2 for other approaches. Furthermore, in a comparison with a basic rule-based method on Amazon, Search Swarm achieved a score 38.71 points higher and a 41% greater success rate, demonstrating its superior ability to provide relevant product matches over traditional search engines.
Table Question Answering (TableQA) enables natural language interaction with structured tabular data. However, existing large language model (LLM) approaches face critical limitations: context length constraints that restrict data handling capabilities, hallucination issues that compromise answer reliability, and single-agent architectures that struggle with complex reasoning scenarios involving semantic relationships and multi-hop logic. This paper introduces DataFactory, a multi-agent framework that addresses these limitations through specialized team coordination and automated knowledge transformation. The framework comprises a Data Leader employing the ReAct paradigm for reasoning orchestration, together with dedicated Database and Knowledge Graph teams, enabling the systematic decomposition of complex queries into structured and relational reasoning tasks. We formalize automated data-to-knowledge-graph transformation via the mapping function T: D × S × R → G, and implement natural-language-based consultation that, unlike fixed-workflow multi-agent systems, enables flexible inter-agent deliberation and adaptive planning to improve coordination robustness. We also apply context engineering strategies that integrate historical patterns and domain knowledge to reduce hallucinations and improve query accuracy. Across TabFact, WikiTableQuestions, and FeTaQA, using eight LLMs from five providers, results show consistent gains. Our approach improves accuracy by 20.2% (TabFact) and 23.9% (WikiTQ) over baselines, with significant effects (Cohen's d > 1). Team coordination also outperforms single-team variants (+5.5% TabFact, +14.4% WikiTQ, +17.1% FeTaQA ROUGE-2). The framework offers design guidelines for multi-agent collaboration and a practical platform for enterprise data analysis through integrated structured querying and graph-based knowledge representation.
Large language models (LLMs) augmented with retrieval have shown impressive performance in open-domain question answering, yet struggle significantly with temporal knowledge graph question answering (TKGQA). The core issue lies in structural misalignment: treating structured, temporally sensitive graph queries as plain text often causes LLMs to retrieve or reason with semantically similar but structurally incorrect facts, resulting in critical inaccuracies. To address this, we introduce SAR (Structure-Aligned Reasoning), a novel TKGQA framework that integrates LLM reasoning tightly with the explicit subject–predicate–object–time schema inherent in knowledge graphs. SAR employs an LLM agent to first decompose natural language questions into structured queries, clearly delineating entities, relationships, and temporal constraints. It then conducts schema-consistent, time-aware retrieval from the knowledge graph to acquire candidate quadruples, which guide a subsequent iterative ReAct-style reasoning process by the LLM. A final verification stage ensures that proposed answers strictly adhere to temporal conditions, reinforcing accuracy and temporal coherence. Experiments on two benchmark datasets, MultiTQ and CronQuestions, demonstrate SAR’s effectiveness, achieving the best results. Specifically, with GPT-4.1, SAR achieves 78.2% Hits@1 on MultiTQ, significantly outperforming existing methods, and similarly establishes a new performance record on CronQuestions. Our results underscore the critical importance of structural alignment in temporal reasoning tasks, particularly in handling complex queries involving multiple temporal constraints and multi-hop reasoning.
The behavior of Large Language Models (LLMs) as artificial social agents is largely unexplored, and we still lack extensive evidence of how these agents react to simple social stimuli. Testing the behavior of AI agents in classic Game Theory experiments provides a promising theoretical framework for evaluating the norms and values of these agents in archetypal social situations. In this work, we investigate the cooperative behavior of three LLMs (Llama2, Llama3, and GPT3.5) when playing the Iterated Prisoner's Dilemma against random adversaries displaying various levels of hostility. We introduce a systematic methodology to evaluate an LLM's comprehension of the game rules and its capability to parse historical gameplay logs for decision-making. We conducted simulations of games lasting for 100 rounds and analyzed the LLMs' decisions in terms of dimensions defined in the behavioral economics literature. We find that all models tend not to initiate defection but act cautiously, favoring cooperation over defection only when the opponent's defection rate is low. Overall, LLMs behave at least as cooperatively as the typical human player, although our results indicate some substantial differences among models. In particular, Llama2 and GPT3.5 are more cooperative than humans, and especially forgiving and non-retaliatory for opponent defection rates below 30%. More similar to humans, Llama3 exhibits consistently uncooperative and exploitative behavior unless the opponent always cooperates. Our systematic approach to the study of LLMs in game theoretical scenarios is a step towards using these simulations to inform practices of LLM auditing and alignment.
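The experimental setup, a 100-round Iterated Prisoner's Dilemma against a random adversary with a tunable defection rate, can be sketched as below. The payoff values and the `cautious` policy (a rule-based stand-in for an LLM player, using the 30% threshold the paper observes) are illustrative assumptions:

```python
import random

PAYOFF = {  # standard Prisoner's Dilemma payoffs (row player's score)
    ("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1,
}

def play_ipd(agent_policy, hostility, rounds=100, seed=0):
    """Play an iterated Prisoner's Dilemma against a random adversary that
    defects with probability `hostility`; return the agent's total score."""
    rng = random.Random(seed)
    history, score = [], 0
    for _ in range(rounds):
        opponent = "D" if rng.random() < hostility else "C"
        move = agent_policy(history)
        score += PAYOFF[(move, opponent)]
        history.append((move, opponent))
    return score

def cautious(history):
    """Stand-in for the observed LLM behavior: never initiate defection,
    and defect only once the opponent's defection rate exceeds 30%."""
    if not history:
        return "C"
    defect_rate = sum(o == "D" for _, o in history) / len(history)
    return "D" if defect_rate > 0.3 else "C"
```

Against a fully cooperative opponent (`hostility=0.0`) this policy scores 3 per round; against a fully hostile one it loses only the first round before retaliating.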
This paper presents RTLFixer, a novel framework enabling automatic fixing of syntax errors in Verilog code with Large Language Models (LLMs). Despite LLMs' promising capabilities, our analysis indicates that approximately 55% of errors in LLM-generated Verilog are syntax-related, leading to compilation failures. To tackle this issue, we introduce a novel debugging framework that employs Retrieval-Augmented Generation (RAG) and ReAct prompting, enabling LLMs to act as autonomous agents that interactively debug the code with feedback. This framework demonstrates exceptional proficiency in resolving syntax errors, successfully correcting about 98.5% of compilation errors in our debugging dataset, comprising 212 erroneous implementations derived from the VerilogEval benchmark. Our method leads to 32.3% and 10.1% increases in pass@1 success rates on the VerilogEval-Machine and VerilogEval-Human benchmarks, respectively. The source code and benchmark are available at https://github.com/NVlabs/RTLFixer.
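The compile-retrieve-repair loop that combines RAG with ReAct-style acting can be sketched as below. The injected callables are toy stand-ins (a "compiler" that only checks for a trailing semicolon, a one-entry retrieval database, and a rule-based "LLM"), not RTLFixer's actual components:

```python
def fix_syntax_errors(code, compile_fn, retrieve_fix, llm_repair, max_iters=5):
    """Iteratively compile, retrieve guidance for the reported error, and
    repair, mirroring a RAG + ReAct debugging loop with compiler feedback."""
    for _ in range(max_iters):
        ok, error = compile_fn(code)
        if ok:
            return code, True
        hint = retrieve_fix(error)             # RAG: look up guidance for this error
        code = llm_repair(code, error, hint)   # Act: agent proposes a patched version
    return code, False

# Toy stand-ins for the three injected components.
compile_fn = lambda c: (c.endswith(";"), None if c.endswith(";") else "missing semicolon")
retrieve_fix = lambda err: {"missing semicolon": "append ';'"}.get(err, "")
llm_repair = lambda c, err, hint: c + ";" if "semicolon" in err else c

fixed, ok = fix_syntax_errors("assign y = a & b", compile_fn, retrieve_fix, llm_repair)
```

The key design point is that the loop terminates on the first clean compile, so the agent only spends iterations on code that still fails.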
Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data. How to schedule the domain-specific perceiving models and analyze the collected videos uniformly, efficiently, and especially intelligently to accomplish complicated tasks is challenging. To address the challenge, we build VIoTGPT, a framework based on LLMs to correctly interact with humans, query knowledge videos, and invoke vision models to analyze multimedia data collaboratively. To support VIoTGPT and related future works, we meticulously crafted the VIoT-Tool dataset, including the training dataset and the benchmark involving 11 representative vision models across three categories based on semi-automatic annotations. To guide the LLM to act as an intelligent agent for VIoT, we adopt a ReAct instruction-tuning method based on VIoT-Tool to learn the tool capability. Quantitative and qualitative experiments and analyses demonstrate the effectiveness of VIoTGPT. We believe VIoTGPT contributes to improving human-centered experiences in VIoT applications.
To complete tasks in dynamic environments, robots need to update their plans in a timely manner to react to environment changes. Traditional STRIPS-like or learning-based planners struggle to achieve this due to their high reliance on meticulously predefined planning rules or labeled data. Fortunately, recent works find that Large Language Models (LLMs) can be effectively prompted to solve planning problems. Thus, we investigate the strategies for LLMs to master reactive planning problems without complex definitions and extra training. We propose Text2Reaction, an LLM-based framework enabling robots to continuously reason and update plans according to the latest environment changes. Inspired by humans' step-by-step re-planning process, we present the Re-planning Prompt, which informs LLMs of the basic principles of re-planning and fosters the gradual development of the current plan into a new one in a three-hop reasoning manner: cause analysis, consequence inference, and plan adjustment. In addition, Text2Reaction is designed to first generate an initial plan based on the task description before execution, allowing for subsequent iterative updates of this plan. We demonstrate the superior performance of Text2Reaction over prior works in reacting to various environment changes and completing varied tasks. In addition, we validate the reliability of our re-planning prompt through ablation experiments and its capability when deployed in real-world robots, enabling continuous reasoning in the face of diverse changes until the user instructions are successfully completed.
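The three-hop structure of the Re-planning Prompt can be sketched as a template; the wording below is an illustrative reconstruction of the idea, not the paper's actual prompt text:

```python
REPLAN_TEMPLATE = """You are a robot re-planner.
Task: {task}
Current plan: {plan}
Observed change: {change}

Reason step by step:
1. Cause analysis: why did the change occur, and which plan steps does it affect?
2. Consequence inference: what happens if the current plan is executed unchanged?
3. Plan adjustment: output the updated plan as a numbered list of actions.
"""

def build_replan_prompt(task, plan, change):
    """Fill the three-hop re-planning template for the latest observed change."""
    return REPLAN_TEMPLATE.format(task=task, plan=" -> ".join(plan), change=change)

prompt = build_replan_prompt(
    task="bring the cup to the table",
    plan=["grasp cup", "move to table", "place cup"],
    change="the table is now occupied by a box",
)
```

Because the template takes the current plan as input, the same call can be reissued after every environment change, which is what makes the updates iterative.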
Autonomous LLM-based agents have emerged as a powerful paradigm for complex task execution, yet the field lacks standardized tools for development, deployment, distribution and discovery of agents. We present Cerebrum, an Agent SDK for AIOS that addresses this gap through three key components: (1) a comprehensive SDK featuring a modular four-layer architecture for agent development, encompassing LLM, memory, storage, and tool management; (2) a community-driven Agent Hub for sharing and discovering agents, complete with version control and dependency management; (3) an interactive web interface for testing and evaluating agents. The platform's effectiveness is demonstrated through implementations of various agent architectures, including Chain of Thought (CoT), ReAct, and tool-use agents. Cerebrum advances the field by providing a unified framework that standardizes agent development while maintaining flexibility for researchers and developers to innovate and distribute their agents. The live website is at https://app.aios.foundation, the code is at https://github.com/agiresearch/Cerebrum, and video is at https://app.aios.foundation/video-demo.
Conducting a comprehensive literature review is crucial for advancing circuit design methodologies. However, the rapid influx of state-of-the-art research, inconsistent data representation, and the complexity of optimizing circuit design objectives (e.g. power consumption) make this task significantly challenging. Traditional manual search methods are inefficient, time-consuming, and lack the reasoning capabilities required for synthesizing complex circuits. In this paper, we propose MuaLLM, an open-source multimodal Large Language Model (LLM) agent for circuit design assistance that integrates a hybrid Retrieval-Augmented Generation (RAG) framework with an adaptive vector database of circuit design research papers. Unlike conventional LLMs, the MuaLLM agent employs a Reason + Act (ReAct) workflow for iterative reasoning, goal-setting, and multi-step information retrieval. It functions as a question-answering design assistant, capable of interpreting complex queries and providing reasoned responses grounded in circuit literature. Its multimodal capabilities enable processing of both textual and visual data, facilitating more efficient and comprehensive analysis. The system dynamically adapts using intelligent search tools, automated document retrieval from the internet, and real-time database updates. Unlike conventional approaches constrained by model context limits, MuaLLM decouples retrieval from inference, enabling scalable reasoning over arbitrarily large corpora. At the maximum context length supported by standard LLMs, MuaLLM remains up to 10x less costly and 1.6x faster while maintaining the same accuracy. This allows rapid, no-human-in-the-loop database generation, overcoming the bottleneck of simulation-based dataset creation for circuits. To evaluate MuaLLM, we introduce two custom benchmarks: RAG-250, targeting retrieval and citation performance, and Reasoning-100 (Reas-100), focused on multistep reasoning in circuit design. 
MuaLLM achieves 90.1% recall on RAG-250, highlighting strong multimodal retrieval and citation accuracy. On Reas-100, it reaches 86.8% accuracy, demonstrating robust reasoning capabilities on complex design queries.
We introduce TAPAS (Task-based Adaptation and Planning using AgentS), a multi-agent framework that integrates Large Language Models (LLMs) with symbolic planning to solve complex tasks without the need for manually defined environment models. TAPAS employs specialized LLM-based agents that collaboratively generate and adapt domain models, initial states, and goal specifications as needed using structured tool-calling mechanisms. Through this tool-based interaction, downstream agents can request modifications from upstream agents, enabling adaptation to novel attributes and constraints without manual domain redefinition. A ReAct (Reason+Act)-style execution agent, coupled with natural language plan translation, bridges the gap between dynamically generated plans and real-world robot capabilities. TAPAS demonstrates strong performance in benchmark planning domains and in the VirtualHome simulated real-world environment.
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs solving real-world challenges, which calls for a comprehensive evaluation of tool-use capabilities. While previous works focused on evaluating either stateless web services (RESTful APIs) based on a single-turn user prompt or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox.
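What "stateful tool execution with implicit state dependencies" means in practice can be sketched with a toy environment; the `ToolWorld` class and its Wi-Fi dependency are illustrative assumptions in the spirit of the benchmark, not ToolSandbox's actual tools:

```python
class ToolWorld:
    """Minimal stateful tool environment: `search` implicitly depends on
    Wi-Fi being enabled, so the order of tool calls across turns matters."""
    def __init__(self):
        self.state = {"wifi": False}

    def enable_wifi(self):
        self.state["wifi"] = True
        return "wifi on"

    def search(self, query):
        if not self.state["wifi"]:
            return "error: no network (enable_wifi first)"
        return f"results for {query!r}"

world = ToolWorld()
first = world.search("weather")   # fails: the implicit dependency is not satisfied
world.enable_wifi()
second = world.search("weather")  # succeeds once the dependency is met
```

An agent evaluated in such an environment must discover and satisfy the dependency itself, which single-turn, stateless API benchmarks never exercise.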
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs’ tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.
Large language models (LLMs) remain prone to factual inaccuracies and computational errors, including hallucinations and mistakes in mathematical reasoning. Recent work augmented LLMs with tools to mitigate these shortcomings, but often requires curated gold tool-use demonstrations. In this paper, we investigate whether LLMs can learn to use tools without demonstrations. First, we analyse zero-shot prompting strategies to guide LLMs in tool utilisation. Second, we propose a self-training method to synthesise tool-use traces using the LLM itself. We compare supervised fine-tuning and preference fine-tuning techniques for fine-tuning the model on datasets constructed using existing Question Answering (QA) datasets, i.e., TriviaQA and GSM8K. Experiments show that tool-use enhances performance on a long-tail knowledge task: 3.7% on PopQA, which is used solely for evaluation, but leads to mixed results on other datasets, i.e., TriviaQA, GSM8K, and NQ-Open. Our findings highlight the potential and challenges of integrating external tools into LLMs without demonstrations.
Although the power of LLM tool-use agents has ignited a flurry of recent research in this area, the curation of tool-use training data remains an open problem, especially for online RL training. Existing approaches to synthetic tool-use data generation tend to be non-interactive and/or non-compositional. We introduce RandomWorld, a pipeline for the procedural generation of interactive tools and compositional tool-use data. We show that models tuned via SFT and RL on synthetic RandomWorld data improve on a range of tool-use benchmarks, and set the new SoTA for two metrics on the NESTFUL dataset. Further experiments show that downstream performance scales with the amount of RandomWorld-generated training data, opening up the possibility of further improvement through the use of entirely synthetic data.
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete various tasks in a self-directed fashion. The challenge of tool use demands that LLMs not only understand user queries and generate answers accurately but also excel in task planning, tool invocation, and result summarization. While traditional works focus on training a single LLM with all these capabilities, performance limitations become apparent, particularly with smaller models. To overcome these challenges, we propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer. Each component is implemented by a single LLM that focuses on a specific capability and collaborates with others to accomplish the task. This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability. To effectively train this framework, we introduce a two-stage training paradigm. First, we fine-tune a backbone LLM on the entire dataset without discriminating sub-tasks, providing the model with a comprehensive understanding of the task. Second, the fine-tuned LLM is used to instantiate the planner, caller, and summarizer respectively, which are continually fine-tuned on respective sub-tasks. Evaluation across various tool-use benchmarks illustrates that our proposed multi-LLM framework surpasses the traditional single-LLM approach, highlighting its efficacy and advantages in tool learning.
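The planner/caller/summarizer decomposition can be sketched as a pipeline of three injected components; the lambda stand-ins below are toy substitutes for the three specialised LLMs:

```python
def run_pipeline(query, planner, caller, summarizer, tools):
    """Planner decides which tool to use, the caller produces the concrete
    invocation, and the summarizer turns the raw result into a user answer."""
    tool_name, tool_args = planner(query)
    raw = caller(tools, tool_name, tool_args)
    return summarizer(query, raw)

# Illustrative stand-ins for the three specialised models.
tools = {"add": lambda a, b: a + b}
planner = lambda q: ("add", (2, 3))                 # plan: route to the calculator
caller = lambda ts, name, args: ts[name](*args)     # call: execute the invocation
summarizer = lambda q, r: f"The answer is {r}."     # summarise for the user

answer = run_pipeline("What is 2 + 3?", planner, caller, summarizer, tools)
```

Because each role is a separate component behind a narrow interface, any one of them can be updated, or swapped for a smaller model, without retraining the others, which is the modularity argument the abstract makes.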
To achieve faithful reasoning that aligns with human expectations, large language models (LLMs) need to ground their reasoning in real-world knowledge (e.g., web facts, math and physical rules). Tools help LLMs access this external knowledge, but there remain challenges in fine-tuning LLM agents (e.g., Toolformer) to invoke tools in multi-step reasoning problems, where interconnected tool calls require holistic and efficient tool usage planning. In this work, we propose a new method for LLMs to better leverage tools in multi-step reasoning. Our method, Chain-of-Abstraction (CoA), trains LLMs to first decode reasoning chains with abstract placeholders, and then call domain tools to reify each reasoning chain by filling in specific knowledge. This planning with abstract chains enables LLMs to learn more general reasoning strategies, which are robust to shifts of domain knowledge (e.g., math results) relevant to different reasoning questions. It also allows LLMs to perform decoding and calling of external tools in parallel, which avoids the inference delay caused by waiting for tool responses. In mathematical reasoning and Wiki QA domains, we show that our method consistently outperforms previous chain-of-thought and tool-augmented baselines on both in-distribution and out-of-distribution test sets, with an average ~6% absolute QA accuracy improvement. LLM agents trained with our method also show more efficient tool use, with inference speed being on average ~1.4x faster than baseline tool-augmented LLMs.
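The reification step, filling abstract placeholders in a decoded chain with tool results, can be sketched as below. The `[y1]`-style placeholder syntax and the sandboxed `eval` "calculator" are illustrative assumptions, not CoA's actual format or tool:

```python
import re

def reify_chain(chain, calc):
    """Fill abstract placeholders [y1], [y2], ... in a CoA-style reasoning
    chain by evaluating each definition with a domain tool (here, arithmetic)."""
    values = {}
    for name, expr in re.findall(r"\[(y\d+)\]\s*=\s*([^;]+)", chain):
        # Substitute earlier placeholders, then delegate the math to the tool.
        for k, v in values.items():
            expr = expr.replace(f"[{k}]", str(v))
        values[name] = calc(expr.strip())
    return values

# The abstract chain is decoded first; specific knowledge is filled in afterwards.
chain = "[y1] = 20 + 35; [y2] = [y1] * 2"
calc = lambda e: eval(e, {"__builtins__": {}})  # stand-in calculator tool
result = reify_chain(chain, calc)
```

Because the chain is fully decoded before any tool runs, the model need not stall on tool latency mid-generation, which is the source of the speedup the abstract reports.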
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. However, developing prompting techniques that enable LLM agents to effectively use these tools and knowledge remains a heuristic and labor-intensive task. Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task. During optimization, we design a comparator module to iteratively deliver insightful and comprehensive prompts to the LLM agent by contrastively reasoning between positive and negative examples sampled from training data. We demonstrate AvaTaR on four complex multimodal retrieval datasets featuring textual, visual, and relational information, and three general question-answering (QA) datasets. We find AvaTaR consistently outperforms state-of-the-art approaches across all seven tasks, exhibiting strong generalization ability when applied to novel cases and achieving an average relative improvement of 14% on the Hit@1 metric for the retrieval datasets and 13% for the QA datasets. Code and dataset are available at https://github.com/zou-group/avatar.
Large language models (LLMs) have exhibited an array of reasoning capabilities but face challenges like error propagation and hallucination, particularly in specialised areas like finance, where data is heterogeneous, and precision is paramount. We explore the potential of language model augmentation with external tools to mitigate these limitations and offload certain reasoning steps to external tools that are more suited for the task, instead of solely depending on the LLM’s inherent abilities. More concretely, using financial domain question answering datasets, we apply supervised finetuning on a LLAMA-2 13B CHAT model to act both as a task router and task solver. The task router dynamically directs a question to either be answered internally by the LLM or externally via the right tool from the tool set. Our tool-equipped SFT model, RAVEN, demonstrates an improvement of 35.2% and 5.06% over the base model and SFT-only baselines, respectively, and is highly competitive with strong GPT-3.5 results. To the best of our knowledge, our work is the first that investigates tool augmentation of language models for the finance domain.
Recently, tool use with LLMs has become one of the primary research topics, as it can help LLMs generate truthful and helpful responses. Existing studies on tool use with LLMs primarily focus on enhancing the tool-calling ability of LLMs. In practice, like chat assistants, LLMs are also required to align with human values in the context of tool use. Specifically, LLMs should refuse to act on instructions involving unsafe tool use and on insecure tool responses, to ensure their reliability and harmlessness. At the same time, LLMs should demonstrate autonomy in tool use to reduce the costs associated with tool calling. To tackle this issue, we first introduce the principle that LLMs should follow in tool use scenarios: H2A. The goal of H2A is to align LLMs with **helpfulness**, **harmlessness**, and **autonomy**. In addition, we propose ToolAlign, a dataset comprising instruction-tuning data and preference data to align LLMs with the H2A principle for tool use. Based on ToolAlign, we develop LLMs by supervised fine-tuning and preference learning, and experimental results demonstrate that the LLMs exhibit remarkable tool-calling capabilities, while also refusing to engage with harmful content and displaying a high degree of autonomy in tool utilization. The code and datasets are available at: https://github.com/zhiyuanc2001/ToolAlign.
Integrating tools into Large Language Models (LLMs) has facilitated their widespread application. Despite this, in specialized downstream task contexts, reliance solely on tools is insufficient to fully address the complexities of the real world. This particularly restricts the effective deployment of LLMs in fields such as medicine. In this paper, we focus on the downstream tasks of medical calculators, which use standardized tests to assess an individual's health status. We introduce MeNTi, a universal agent architecture for LLMs. MeNTi integrates a specialized medical toolkit and employs meta-tool and nested calling mechanisms to enhance LLM tool utilization. Specifically, it achieves flexible tool selection and nested tool calling to address practical issues faced in intricate medical scenarios, including calculator selection, slot filling, and unit conversion. To assess the capabilities of LLMs for quantitative assessment throughout the clinical process of calculator scenarios, we introduce CalcQA. This benchmark requires LLMs to use medical calculators to perform calculations and assess patient health status. CalcQA is constructed by professional physicians and includes 100 case-calculator pairs, complemented by a toolkit of 281 medical tools. The experimental results demonstrate significant performance improvements with our framework. This research paves new directions for applying LLMs in demanding scenarios of medicine.
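The three mechanisms the abstract names, calculator selection, slot filling, and nested unit conversion, can be sketched together; the BMI calculator, the schema layout, and the patient record are illustrative assumptions, not MeNTi's toolkit:

```python
def meta_tool(patient, calculators, converters):
    """Meta-tool sketch: select the right calculator for the request, fill its
    slots from the patient record, and convert units via a nested tool call."""
    calc = calculators[patient["request"]]           # calculator selection
    args = {}
    for slot, unit in calc["slots"].items():         # slot filling
        value, have_unit = patient["data"][slot]
        if have_unit != unit:                        # nested unit-conversion call
            value = converters[(have_unit, unit)](value)
        args[slot] = value
    return calc["fn"](**args)

# Toy toolkit with one calculator and one converter.
calculators = {
    "bmi": {
        "slots": {"weight": "kg", "height": "m"},
        "fn": lambda weight, height: round(weight / height**2, 1),
    }
}
converters = {("cm", "m"): lambda v: v / 100}
patient = {"request": "bmi", "data": {"weight": (70, "kg"), "height": (175, "cm")}}
bmi = meta_tool(patient, calculators, converters)
```

The point of the nesting is that the converter is itself a tool call issued from inside the calculator call, so the LLM never has to do the unit arithmetic itself.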
Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains demanding large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a GitHub URL and short task description, ToolMaker autonomously installs dependencies and generates code to perform the task, using a closed-loop self-correction mechanism for debugging. To evaluate our approach, we introduce a benchmark comprising 15 complex computational tasks spanning various domains with over 100 unit tests to assess correctness and robustness. Our method correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows. Our code and benchmark are publicly available at https://github.com/KatherLab/ToolMaker.
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks only focus on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities - workflow, tool use/function call, long-context understanding and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found in https://github.com/pprp/ACBench.
Recent advancements in Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks. However, current research largely relies on manual, task-specific design, limiting their adaptability to novel tasks. In this paper, we introduce a new research problem: Modularized LLM Agent Search (MoLAS). We propose a modular design space that abstracts existing LLM agent designs into four fundamental modules with a uniform IO interface: Planning, Reasoning, Tool Use, and Memory. Building on this design space, we present a novel LLM agent search framework called AgentSquare, which introduces two core mechanisms, i.e., module evolution and recombination, to efficiently search for optimized LLM agents. To further accelerate the process, we design a performance predictor that uses in-context surrogate models to skip unpromising agent designs. Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. Moreover, AgentSquare can generate interpretable design insights, enabling a deeper understanding of agentic architecture and its impact on task performance. We believe that the modular design space and AgentSquare search framework offer a platform for fully exploiting the potential of prior successful designs and consolidating the collective efforts of the research community. Code repo is available at https://github.com/tsinghua-fib-lab/AgentSquare.
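The recombination mechanism over the four-slot design space can be sketched as an exhaustive search; the toy module names and scoring function below are illustrative (the real system uses learned surrogates rather than brute-force enumeration):

```python
import itertools

def search_agents(modules, evaluate):
    """Module recombination sketch: enumerate one choice per slot of the
    modular design space and keep the best-scoring combination."""
    slots = ["planning", "reasoning", "tool_use", "memory"]
    best, best_score = None, float("-inf")
    for combo in itertools.product(*(modules[s] for s in slots)):
        agent = dict(zip(slots, combo))
        score = evaluate(agent)
        if score > best_score:
            best, best_score = agent, score
    return best, best_score

# Toy space: two interchangeable options per slot, scored by a stand-in evaluator
# that happens to prefer the "-B" variants.
modules = {s: [f"{s}-A", f"{s}-B"] for s in ["planning", "reasoning", "tool_use", "memory"]}
evaluate = lambda agent: sum(name.endswith("B") for name in agent.values())
best, score = search_agents(modules, evaluate)
```

The uniform IO interface is what makes this enumeration legal: any module can fill its slot, so the search space is the full Cartesian product of options.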
Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to Tool Overuse, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce SMART (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent's self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce SMART-ER, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop SMARTAgent, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match their 70B counterparts and GPT-4o. Additionally, SMARTAgent generalizes to out-of-distribution test data like GSM8K and MINTQA, maintaining accuracy with just one-fifth the tool calls. These results highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.
Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.
Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs' tool use under imperfect instructions, we meticulously examine real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that due to the next-token prediction training objective, LLMs tend to arbitrarily generate missing arguments, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs' performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that AwN significantly outperforms existing frameworks for tool learning on NoisyToolBench. We will release all related code and datasets to support future research.
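The Ask-when-Needed idea, ask the user for a missing argument rather than hallucinating one, can be sketched as below. The weather-tool schema and the dictionary-backed simulated user are illustrative assumptions, roughly playing the role the paper's user simulator plays in evaluation:

```python
def call_tool_awn(tool_schema, provided_args, ask_user):
    """Ask-when-Needed sketch: before invoking the tool, ask the user for
    any required argument that the instruction did not supply."""
    args = dict(provided_args)
    for name in tool_schema["required"]:
        if name not in args:
            args[name] = ask_user(f"Please provide a value for '{name}'.")
    return tool_schema["fn"](**args)

# Toy tool whose instruction omitted the date argument.
schema = {
    "required": ["city", "date"],
    "fn": lambda city, date: f"forecast for {city} on {date}",
}
# Simulated user who answers clarification questions.
answers = {"Please provide a value for 'date'.": "2025-06-01"}
result = call_tool_awn(schema, {"city": "Paris"}, ask_user=answers.__getitem__)
```

The contrast with next-token-prediction behavior is the branch on `name not in args`: the baseline failure mode is to invent a plausible value there instead of asking.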
Analog layout design heavily involves interactive processes between humans and design tools. Electronic design automation (EDA) tools for this task are usually designed to use scripting commands or visualized buttons for manipulation, especially for interactive automation functionalities, which have a steep learning curve and a cumbersome user experience, posing a notable barrier to designers’ adoption. Aiming to address such a usability issue, this article introduces LayoutCopilot, a pioneering multiagent collaborative framework powered by large language models (LLMs) for interactive analog layout design. LayoutCopilot simplifies human-tool interaction by converting natural language instructions into executable script commands, and it interprets high-level design intents into actionable suggestions, significantly streamlining the design process. Experimental results demonstrate the flexibility, efficiency, and accessibility of LayoutCopilot in handling real-world analog designs.
Recent research has demonstrated that Large Language Models (LLMs) can enhance their capabilities by utilizing external tools. However, three pivotal questions remain unanswered: (1) How effective are current LLMs in utilizing tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What obstacles need to be overcome to leverage tools? To address these questions, we introduce API-Bank, a groundbreaking benchmark, specifically designed for tool-augmented LLMs. For the first question, we develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls to assess the existing LLMs' capabilities in planning, retrieving, and calling APIs. For the second question, we construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. However, there is still significant potential for further improvement. Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26 pts and approaches the effectiveness of GPT-3.5. Through error analysis, we highlight the key challenges for future research in this field to answer the third question.
Significant focus has been placed on integrating large language models (LLMs) with various tools in developing general-purpose agents. This poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to reveal the agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as the query contexts to align with real-world scenarios closely. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which provides future direction for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.
Augmenting large language models (LLMs) with external tools has emerged as a promising approach to extend their utility, enabling them to solve practical tasks. Previous methods manually parse tool documentation and create in-context demonstrations, transforming tools into structured formats for LLMs to use in their step-by-step reasoning. However, this manual process requires domain expertise and struggles to scale to large toolsets. Additionally, these methods rely heavily on ad-hoc inference techniques or special tokens to integrate free-form LLM generation with tool-calling actions, limiting the LLM's flexibility in handling diverse tool specifications and integrating multiple tools. In this work, we propose AutoTools, a framework that enables LLMs to automate the tool-use workflow. Specifically, the LLM automatically transforms tool documentation into callable functions, verifying syntax and runtime correctness. Then, the LLM integrates these functions into executable programs to solve practical tasks, flexibly grounding tool-use actions into its reasoning processes. Extensive experiments on existing and newly collected, more challenging benchmarks illustrate the superiority of our framework. Inspired by these promising results, we further investigate how to improve the expertise of LLMs, especially open-source LLMs with fewer parameters, within AutoTools. Thus, we propose the AutoTools-Learning approach, training the LLMs with three learning tasks on 34k instances of high-quality synthetic data, including documentation understanding, relevance learning, and function programming. Fine-grained results validate the effectiveness of our overall training approach and each individual task. Our methods are an important step towards the use of LLMs for solving real-world tasks with external tools.
Leveraging more test-time computation has proven to be an effective way to boost the reasoning capabilities of large language models (LLMs). Among various methods, the verify-and-improve paradigm stands out for enabling dynamic solution exploration and feedback incorporation. However, existing approaches often suffer from restricted feedback spaces and lack of coordinated training of different parties, leading to suboptimal performance. To address this, we model this multi-turn refinement process as a Markov Decision Process and introduce DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement learning algorithm that trains an actor-critic LLM system to iteratively refine answers via direct preference learning on self-generated data. Theoretically, DPSDP can match the performance of any policy within the training distribution. Empirically, we instantiate DPSDP with various base models and show improvements on both in- and out-of-distribution benchmarks. For example, on benchmark MATH 500, majority voting over five refinement steps increases first-turn accuracy from 58.2% to 63.2% with Ministral-based models. An ablation study further confirms the benefits of multi-agent collaboration and out-of-distribution generalization.
Large Language Models (LLMs) have found increasing application in tasks requiring multi-step reasoning, yet challenges such as hallucinations and inconsistencies in the generated responses persist. This study presents an innovative methodology to enhance the reasoning capabilities of LLMs by brokering and integrating multiple expert LLMs within a reflection layer to provide targeted feedback on the reasoning trajectories of the base LLM. The approach employs a foundational pre-trained LLM as the base model, which is further supported by agents to promote cognitive assistance for specific task types. In instances where conclusions are deemed incorrect or reasoning is interrupted, these instances are forwarded to the expert LLM layer, which includes systems such as Claude-3 Haiku for intricate contexts and MedAlpaca for medical reasoning, to deliver feedback on the base model’s reasoning paths. This feedback forms a ‘reflection pool,’ enabling the base LLM to amend and enhance its reasoning trajectories in subsequent iterations. The experiments conducted across diverse datasets, including HotPotQA, SimpleQA, and PubMedQA, underscore the proposed architecture’s efficacy in augmenting success signals, ROUGE-L scores (indicative of quality and precision), and CTRLEval Consistency Scores (indicative of coherence and consistency). The architecture effectively addresses the issues of hallucinations and inconsistencies that frequently occur in multi-step reasoning. Importantly, the approach exhibits considerable potential in tackling domain-specific tasks, underscoring the importance of achieving correct and reliable conclusions. To facilitate further investigation and validation of our proposed brokered multi-expert reflection framework for non-commercial use, the source code of our system is available at https://github.com/WiZY936/Brokered-Multi-Expert-Reflection
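The brokered-reflection cycle can be sketched as a retry loop: on failure, the trace is routed to a per-domain expert whose critique accumulates in a shared reflection pool consulted on the next attempt. All names below are illustrative stand-ins, not the released system:

```python
def route_expert(task_type):
    # Broker: map a task type to an expert critic (hypothetical mapping).
    experts = {"medical": "medical_expert", "general": "general_expert"}
    return experts.get(task_type, "general_expert")

def expert_feedback(expert, answer):
    # Stand-in for an expert-LLM critique of a failed reasoning path.
    return f"{expert}: revise reasoning behind '{answer}'"

def solve_with_reflection(task, task_type, attempt, max_iters=3):
    # attempt(task, pool) -> (answer, accepted); pool is the reflection pool.
    reflection_pool = []
    answer = None
    for _ in range(max_iters):
        answer, accepted = attempt(task, reflection_pool)
        if accepted:                      # conclusion deemed correct
            break
        expert = route_expert(task_type)  # forward to the expert layer
        reflection_pool.append(expert_feedback(expert, answer))
    return answer, reflection_pool
```

The pool grows only on failures, so later attempts see exactly the critiques their predecessors earned.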
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models are fully open-sourced.
Recent advances in LLM agents have largely built on reasoning backbones like ReAct, which interleave thought and action in complex environments. However, ReAct often produces ungrounded or incoherent reasoning steps, leading to misalignment between the agent's actual state and goal. Our analysis finds that this stems from ReAct's inability to maintain consistent internal beliefs and goal alignment, causing compounding errors and hallucinations. To address this, we introduce ReflAct, a novel backbone that shifts reasoning from merely planning next actions to continuously reflecting on the agent's state relative to its goal. By explicitly grounding decisions in states and enforcing ongoing goal alignment, ReflAct dramatically improves strategic reliability. This design delivers substantial empirical gains: ReflAct surpasses ReAct by 27.7% on average, achieving a 93.3% success rate in ALFWorld. Notably, ReflAct even outperforms ReAct with added enhancement modules (e.g., Reflexion, WKM), showing that strengthening the core reasoning backbone is key to reliable agent performance.
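Our reading of the state-goal reflection loop, reduced to a toy: before every action the agent computes the gap between its believed state and the goal, and the policy is conditioned on that gap rather than on a free-floating plan. All functions are stubs, not the authors' code:

```python
def reflect(state, goal):
    # Reflection step: which subgoals are still unmet relative to the goal?
    return sorted(goal - state)

def choose_action(gap):
    # Stand-in policy: act on the first unmet subgoal, grounded in the gap.
    return f"achieve:{gap[0]}" if gap else "stop"

def reflact_loop(state, goal, transition, max_steps=10):
    trace = []
    for _ in range(max_steps):
        gap = reflect(state, goal)       # grounding: state vs. goal
        action = choose_action(gap)
        trace.append((tuple(gap), action))
        if action == "stop":
            break
        state = transition(state, action)
    return state, trace
```

Because each decision is derived from the current state-goal gap, a wrong belief is corrected on the next reflection instead of compounding across the trajectory.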
No abstract available
Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning
Recent advances in reasoning language models have witnessed a paradigm shift from short to long chain-of-thought (CoT) patterns. Given the substantial computational cost of rollouts in long CoT models, maximizing the utility of fixed training datasets becomes crucial. Our analysis reveals that negative responses contain valuable components such as self-reflection and error-correction steps, yet most existing methods either completely discard negative samples (RFT) or apply equal penalization across all tokens (RL), failing to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset, achieving improved sample efficiency and demonstrating robustness and scalability when extended to multiple iterations.
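The core of negative sample augmentation, stripped to a sketch: replace the single trajectory-level penalty with per-step weights from a correctness judge, so that a sound self-reflection step inside a failed rollout is still reinforced. Here `judge_step` stands in for the LLM+PRM consensus judger; in the actual framework these weights would scale per-token policy-gradient terms:

```python
def uniform_weights(steps, neg_w=-1.0):
    # Baseline RL treatment: penalize every step of a failed rollout equally.
    return [neg_w] * len(steps)

def nsa_weights(steps, judge_step, pos_w=1.0, neg_w=-1.0):
    # Negative-sample augmentation: steps judged correct (e.g., a valid
    # self-reflection) keep a positive weight even inside a failed rollout.
    return [pos_w if judge_step(s) else neg_w for s in steps]
```

The contrast between the two weightings is exactly the "gems from stones" claim: uniform penalization throws away the positive learning signal that the fine-grained judge recovers.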
Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method where verifiers iteratively refine their judgments based on previous assessments. Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to DeepSeek-R1. Our code is available at https://github.com/jcguo123/Temporal-Consistency
Chain-of-thought (CoT) has proven to improve the reasoning capability of large language models (LLMs). However, due to the complexity of multimodal scenarios and the difficulty in collecting high-quality CoT data, CoT reasoning in multimodal LLMs has been largely overlooked. To this end, we propose a simple yet effective self-training framework, R3V, which iteratively enhances the model's Vision-language Reasoning by Reflecting on CoT Rationales. Our framework consists of two interleaved parts: (1) iteratively bootstrapping positive and negative solutions for reasoning datasets, and (2) reflection on rationale for learning from mistakes. Specifically, we introduce the self-refine and self-select losses, enabling the model to refine flawed rationale and derive the correct answer by comparing rationale candidates. Experiments on a wide range of vision-language tasks show that R3V consistently improves multimodal LLM reasoning, achieving a relative improvement of 23 to 60 percent over GPT-distilled baselines. Additionally, our approach supports self-reflection on generated solutions, further boosting performance through test-time computation.
Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection and lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent's (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments in the agent self-improving benchmark, OpenWebVoyager, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.
Urban mobility systems face escalating challenges associated with sustainability, equity, and resilience, further compounded by environmental pressures. Traditional agent-based models (ABMs) often fail to capture cognitively rich, adaptive behaviors, limiting their ability to simulate realistic user responses to disruptions. In this work, we propose a cognitive agent architecture based on Large Language Models (LLMs), featuring multi-horizon memory-driven planning, reflection, and adaptation. Integrated into the SimFleet agent-based simulator with realistic sociodemographic profiles, the agents dynamically generate, adjust, and reflect upon travel plans across a 20-day simulation involving over 320 individuals. Experimental results reveal emergent adaptation patterns under both stable and disrupted transport conditions, and an ablation study under severe service disruption quantifies the contributions of short-term and long-term memory modules to memory-driven reasoning, demonstrating the potential of LLM-driven agents to enhance the realism, flexibility, and interpretability of urban mobility simulations.
Chatbots’ role in fostering self-reflection is now widely recognized, especially in inducing users’ behavior change. While the benefits of 24/7 availability, scalability, and consistent responses have been demonstrated in contexts such as healthcare and tutoring to help one form a new habit, their utilization in coaching, which necessitates deeper introspective dialogue to induce leadership growth, remains unexplored. This paper explores the potential of such a chatbot powered by recent Large Language Models (LLMs) in collaboration with professional coaches in the field of executive coaching. Through a design workshop with them and a two-week user study involving ten coach-client pairs, we explored the feasibility and nuances of integrating chatbots to complement human coaches. Our findings highlight the benefits of chatbots’ ubiquity and reasoning capabilities enabled by LLMs while identifying their limitations and design necessities for effective collaboration between human coaches and chatbots. By doing so, this work contributes to the foundation for augmenting one’s self-reflective process with prevalent conversational agents through the human-in-the-loop approach.
Complex tasks involving tool integration pose significant challenges for Large Language Models (LLMs), leading to the emergence of multi-agent workflows as a promising solution. Reflection has emerged as an effective strategy for correcting erroneous trajectories in agentic workflows. However, existing approaches only exploit such capability in the post-action stage, where the agent observes the execution outcomes. We argue that, like humans, LLMs can also engage in reflection before action execution: the agent can anticipate undesirable outcomes from its own decisions, which not only provides a necessary complementary perspective to evaluate the decision but also prevents the propagation of errors throughout the trajectory. In this paper, we propose MIRROR, a framework that consists of both intra-reflection, which critically assesses intended actions before execution, and inter-reflection, which further adjusts the trajectory based on observations. This design systematically leverages LLM reflection capabilities to eliminate and rectify erroneous actions on a more comprehensive scope. Evaluations on both the StableToolBench and TravelPlanner benchmarks demonstrate MIRROR's superior performance, achieving state-of-the-art results compared to existing approaches.
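The intra-/inter-reflection split might be sketched like this, with stub checks rather than the authors' implementation: intended actions are vetted before execution, and observed outcomes can still trigger a correction afterwards.

```python
def intra_reflect(action, anticipated_bad):
    # Pre-execution reflection: anticipate an undesirable outcome and veto it.
    return action not in anticipated_bad

def inter_reflect(observation):
    # Post-execution reflection on what actually happened.
    return observation == "ok"

def run_with_mirror(plan, execute, anticipated_bad):
    trajectory = []
    for action in plan:
        if not intra_reflect(action, anticipated_bad):
            trajectory.append((action, "vetoed"))   # error never executed
            continue
        obs = execute(action)
        status = "ok" if inter_reflect(obs) else "adjusted"
        trajectory.append((action, status))
    return trajectory
```

The point of the pre-execution gate is visible in the trajectory: a vetoed action never runs, so its side effects never propagate to later steps.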
Phishing attacks in Web3 ecosystems are increasingly sophisticated, exploiting deceptive contract logic, malicious frontend scripts, and token approval patterns. We present DeepTx, a real-time transaction analysis system that detects such threats before user confirmation. DeepTx simulates pending transactions, extracts behavior, context, and UI features, and uses multiple large language models (LLMs) to reason about transaction intent. A consensus mechanism with self-reflection ensures robust and explainable decisions. Evaluated on our phishing dataset, DeepTx achieves high precision and recall (demo video: https://youtu.be/4OfK9KCEXUM).
We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly. We investigate the RPO framework under two scenarios: Reward-RPO, which uses environmental rewards for reflection, and Self-RPO, which conducts self-reflection without external rewards. Additionally, we developed two RPO methods, RPO-Traj and RPO-Batch, to adapt to different settings. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, can effectively learn and apply action principles to enhance performance.
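A toy rendering of one RPO iteration, with hypothetical names throughout (the paper's text-gradient machinery is not reproduced): a reflector critiques the current action principles in light of the episode's reward, and an optimizer folds the critique back into the principle set.

```python
def reflector(principles, reward, threshold=1.0):
    # Critique: if the episode under-performed, propose a new principle.
    if reward >= threshold:
        return []
    return ["state tool arguments explicitly before calling"]

def optimizer(principles, critiques):
    # Update: merge novel critiques into the principle set.
    return principles + [c for c in critiques if c not in principles]

def rpo_step(principles, run_episode):
    reward = run_episode(principles)           # execute with current principles
    critiques = reflector(principles, reward)  # post-hoc reflection
    return optimizer(principles, critiques), reward
```

Iterating `rpo_step` mirrors the paper's loop: principles change only when reflection finds something to criticize, so a satisfactory episode leaves them fixed.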
Efficient news topic classification is a critical challenge in the era of information overload. While Large Language Models (LLMs) offer a promising alternative to traditional methods, they still face issues with reliability and interpretability. This paper introduces AgentPress, a collaborative multi-agent framework designed to enhance LLM performance and robustness for this task. AgentPress decomposes classification into a pipeline of three specialized agents: initial analysis, Retrieval-Augmented Generation (RAG), and reflective correction. We constructed an eight-category news benchmark and conducted comprehensive ablation studies on the Qwen3-4B and Llama-3.1-8B models. Experiments reveal that the RAG agent is the single most critical component, delivering a substantial performance gain by boosting the macro F1-score of Qwen3-4B from 56.23% to 82.83%. Our findings empirically demonstrate that for structured tasks like news classification, optimizing retrieval-augmentation is a more effective and economical strategy than implementing complex reasoning or reflection mechanisms.
Evaluating autonomous driving systems in complex and diverse traffic scenarios through controllable simulation is essential to ensure their safety and reliability. However, existing traffic simulation methods face challenges in their controllability. To address this, we propose a novel diffusion-based and LLM-enhanced traffic simulation framework. Our approach incorporates a high-level understanding module and a low-level refinement module, which systematically examines the hierarchical structure of traffic elements, guides LLMs to thoroughly analyze traffic scenario descriptions step by step, and refines the generation by self-reflection, enhancing their understanding of complex situations. Furthermore, we propose a Frenet-frame-based cost function framework that provides LLMs with geometrically meaningful quantities, improving their grasp of spatial relationships in a scenario and enabling more accurate cost function generation. Experiments on the Waymo Open Motion Dataset (WOMD) demonstrate that our method can handle more intricate descriptions and generate a broader range of scenarios in a controllable manner.
Vision-language models (VLMs) have achieved remarkable results in remote sensing scene interpretation. However, existing models primarily rely on a single-step reasoning paradigm, which suffers from Incomplete Perception and Granularity Limitation when confronting complex tasks requiring comprehensive, multigranularity visual contexts. To overcome these bottlenecks, we propose RS_DeepReason, a training-free, large language model (LLM)-driven Deep Reasoning Framework. This framework leverages the logical planning capabilities of LLMs to restructure the reasoning process into a hierarchical, iterative Reason-Observe-Re-reason cycle. Specifically, we construct a dynamic reasoning tree where the LLM recursively decomposes the problem into fine-grained queries, actively guiding the perception module to mine visual clues. Subsequently, a multistage reflection mechanism consolidates these multilevel contexts to derive the final conclusion. Extensive experiments demonstrate that RS_DeepReason significantly outperforms both single-VLM baselines and collaborative baselines in accuracy and robustness. Furthermore, our framework generates clear, traceable reasoning paths, substantially enhancing interpretability in complex remote sensing scenarios.
Real-Time Bidding (RTB) enables advertisers to place competitive bids on impression opportunities instantaneously, striving for cost-effectiveness in a highly competitive landscape. Although RTB has widely benefited from the utilization of technologies such as deep learning and reinforcement learning, the reliability of related methods often encounters challenges due to the discrepancies between online and offline environments and the rapid fluctuations of online bidding. To handle these challenges, RTBAgent is proposed as the first RTB agent system based on large language models (LLMs), which synchronizes real competitive advertising bidding environments and obtains bidding prices through an integrated decision-making process. Specifically, obtaining reasoning ability through LLMs, RTBAgent is further tailored to be more professional for RTB via involved auxiliary modules, i.e., click-through rate estimation model, expert strategy knowledge, and daily reflection. In addition, we propose a two-step decision-making process and multi-memory retrieval mechanism, which enables RTBAgent to review historical decisions and transaction records and subsequently make decisions more adaptive to market changes in real-time bidding. Empirical testing with real advertising datasets demonstrates that RTBAgent significantly enhances profitability. The RTBAgent code will be publicly accessible at: https://github.com/CaiLeng/RTBAgent.
Large-language-model (LLM)-based AI agents have recently showcased impressive versatility by employing dynamic reasoning, an adaptive, multi-step process that coordinates with external tools. This shift from static, single-turn inference to agentic, multi-turn workflows broadens task generalization and behavioral flexibility, but it also introduces serious concerns about system-level cost, efficiency, and sustainability. This paper presents the first comprehensive system-level analysis of AI agents, quantifying their resource usage, latency behavior, energy consumption, and datacenter-wide power demands across diverse agent designs and test-time scaling strategies. We further characterize how AI agent design choices, such as few-shot prompting, reflection depth, and parallel reasoning, impact accuracy-cost tradeoffs. Our findings reveal that while agents improve accuracy with increased compute, they suffer from rapidly diminishing returns, widening latency variance, and unsustainable infrastructure costs. Through detailed evaluation of representative agents, we highlight the profound computational demands introduced by AI agent workflows, uncovering a looming sustainability crisis. These results call for a paradigm shift in agent design toward compute-efficient reasoning, balancing performance with deployability under real-world constraints.
Grounding the reasoning ability of large language models (LLMs) for embodied tasks is challenging due to the complexity of the physical world. Especially, LLM planning for multi-agent collaboration requires communication of agents or credit assignment as the feedback to re-adjust the proposed plans and achieve effective coordination. However, existing methods that overly rely on physical verification or self-reflection suffer from excessive and inefficient querying of LLMs. In this paper, we propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. Specifically, we perform critic regression to learn a sequential advantage function from LLM-planned data, and then treat the LLM planner as an optimizer to generate actions that maximize the advantage function. It endows the LLM with the foresight to discern whether the action contributes to accomplishing the final task. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents and query rounds of LLMs, demonstrating its high efficiency for grounding LLMs. More results are given at https://embodied-read.github.io
Diplomacy is one of the most sophisticated activities in human society, involving complex interactions among multiple parties that require skills in social reasoning, negotiation, and long-term strategic planning. Previous AI agents have demonstrated their ability to handle multi-step games and large action spaces in multi-agent tasks. However, diplomacy involves a staggering magnitude of decision spaces, especially considering the negotiation stage required. While recent agents based on large language models (LLMs) have shown potential in various applications, they still struggle with extended planning periods in complex multi-agent settings. Leveraging recent technologies for LLM-based agents, we aim to explore AI's potential to create a human-like agent capable of executing comprehensive multi-agent missions by integrating three fundamental capabilities: 1) strategic planning with memory and reflection; 2) goal-oriented negotiation with social reasoning; and 3) augmenting memory through self-play games for self-evolution without a human in the loop.
Large language models (LLMs) have demonstrated remarkable capabilities in natural language tasks, yet their performance in dynamic, real-world financial environments remains underexplored. Existing approaches are limited to historical backtesting, where trading actions cannot influence market prices and agents train only on static data. To address this limitation, we present the Agent Trading Arena, a virtual zero-sum stock market in which LLM-based agents engage in competitive multi-agent trading and directly impact price dynamics. By simulating realistic bid-ask interactions, our platform enables training in scenarios that closely mirror live markets, thereby narrowing the gap between training and evaluation. Experiments reveal that LLMs struggle with numerical reasoning when given plain-text data, often overfitting to local patterns and recent values. In contrast, chart-based visualizations significantly enhance both numerical reasoning and trading performance. Furthermore, incorporating a reflection module yields additional improvements, especially with visual inputs. Evaluations on NASDAQ and CSI datasets demonstrate the superiority of our method, particularly under high volatility. All code and data are available at https://github.com/wekjsdvnm/Agent-Trading-Arena.
Large Language Models (LLMs) provide cognitive capabilities that enable robots to interpret and reason about their workspace, especially when paired with semantically rich representations like semantic maps. However, these models are prone to generating inaccurate or invented responses, known as hallucinations, that can produce an erratic robotic operation. This can be addressed by employing agentic workflows, structured processes that guide and refine the model’s output to improve response quality. This work formally defines and qualitatively analyzes the impact of three agentic workflows (LLM Ensemble, Self-Reflection, and Multi-Agent Reflection) on enhancing the reasoning capabilities of an LLM guiding a robotic system to perform object-centered planning. In this context, the LLM is provided with a pre-built semantic map of the environment and a query, to which it must respond by determining the most relevant objects for the query. This response can be used in a multitude of downstream tasks. Extensive experiments were carried out employing state-of-the-art LLMs and semantic maps generated from the widely-used datasets ScanNet and SceneNN. The results show that agentic workflows significantly enhance object retrieval performance, especially in scenarios requiring complex reasoning, with improvements averaging up to 10% over the baseline.
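Of the three workflows, the LLM Ensemble is the simplest to picture: query several models independently and keep the majority answer together with its agreement ratio. The models below are stubs; the paper's prompts and semantic maps are not reproduced.

```python
from collections import Counter

def ensemble_answer(models, query):
    # Each model independently names the most relevant object for the query;
    # majority voting damps individual hallucinations.
    votes = [model(query) for model in models]
    answer, count = Counter(votes).most_common(1)[0]
    return answer, count / len(votes)   # majority answer + agreement ratio
```

The agreement ratio doubles as a cheap confidence signal: a hallucinated object is unlikely to be invented identically by a majority of independent models.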
Large Language Models (LLMs) often falter at complex planning tasks that require exploration and self-correction, as their linear reasoning process struggles to recover from early mistakes. While search algorithms like Monte Carlo Tree Search (MCTS) can explore alternatives, they are often ineffective when guided by sparse rewards and fail to leverage the rich semantic capabilities of LLMs. We introduce SPIRAL (Symbolic LLM Planning via Grounded and Reflective Search), a novel framework that embeds a cognitive architecture of three specialized LLM agents into an MCTS loop. SPIRAL's key contribution is its integrated planning pipeline, in which a Planner proposes creative next steps, a Simulator grounds the search by predicting realistic outcomes, and a Critic provides dense reward signals through reflection. This synergy transforms MCTS from a brute-force search into a guided, self-correcting reasoning process. On the DailyLifeAPIs and HuggingFace datasets, SPIRAL consistently outperforms the default Chain-of-Thought planning method and substantially surpasses other state-of-the-art agents; for example, it achieves 83.6% overall accuracy on DailyLifeAPIs, an improvement of over 16 percentage points over the next-best search framework, while also demonstrating superior token efficiency. Our work demonstrates that structuring LLM reasoning as a guided, reflective, and grounded search process yields more robust and efficient autonomous planners. The source code, full appendices, and all experimental data are available for reproducibility at the official project repository.
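One expansion step of such a three-agent search loop can be sketched as follows. The `planner`, `simulator`, and `critic` functions are canned placeholders standing in for LLM calls, and the action names are invented; this shows only the Planner → Simulator → Critic dataflow, not SPIRAL's actual implementation.

```python
# Minimal sketch of one MCTS expansion step guided by three stubbed
# LLM "agents" (planner/simulator/critic are placeholders).

def planner(state):            # proposes candidate next steps
    return [state + [a] for a in ("search_api", "call_api", "finish")]

def simulator(plan):           # predicts a textual outcome for a plan
    return "goal reached" if plan[-1] == "finish" else "partial progress"

def critic(outcome):           # dense reward from reflection on the outcome
    return 1.0 if outcome == "goal reached" else 0.3

def expand_and_evaluate(state):
    """Expand a node: plan candidates, simulate each, score with the critic."""
    children = planner(state)
    scored = [(critic(simulator(c)), c) for c in children]
    return max(scored)         # back up the best (reward, plan) pair

reward, best_plan = expand_and_evaluate([])
print(reward, best_plan)       # -> 1.0 ['finish']
```

The critic's dense score is what replaces the sparse terminal reward that ordinarily makes MCTS ineffective on these tasks.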
Modern artificial intelligence (AI) systems critically lack the ability to reflect on their own behavior, reasoning processes, and generated output. For example, despite recent advancements in large language models (LLMs), the appearance of reflection in such systems is a linguistic trick rather than a cognitive competence. This deficiency can cause significant challenges, particularly in contexts where safety and reliability are important, such as healthcare and security applications, and also in social situations where complex contexts drive expected behavior. Building on previous work in computational self-awareness and reflection in adaptive systems, in this paper, we propose a reflective agent architecture that incorporates formal models of social expectations and self-simulation mechanisms with LLMs. This architecture enhances LLM-based systems to reflect on their decisions and outputs, evaluating alignment with expected behaviors. Furthermore, it enables agents to internally simulate potential actions and evaluate their consequences. Using the expectation event calculus (EEC), the system formally represents expectations, events, and derived outcomes, supporting systematic self-evaluation. Concurrently, self-simulation allows the agent to introspectively predict and analyze possible outcomes and refine its decision when necessary. Our results demonstrate enhanced alignment with human expectations, highlighting the architecture's promise of greater social sensitivity in complex scenarios that require robust and trustworthy AI interactions.
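The self-simulation idea — internally predicting each candidate action's outcome and checking it against formal expectations before acting — can be sketched as below. The expectation predicates and the toy world model are invented illustrations, not EEC syntax.

```python
# Toy sketch of self-simulation against formal expectations
# (invented predicates and world model, not the EEC formalism).

expectations = [lambda outcome: outcome["polite"],
                lambda outcome: not outcome["reveals_private_data"]]

def simulate(action: str) -> dict:
    # Placeholder world model: predicts the outcome of each candidate action.
    outcomes = {
        "share_record": {"polite": True, "reveals_private_data": True},
        "ask_consent":  {"polite": True, "reveals_private_data": False},
    }
    return outcomes[action]

def choose(actions):
    """Pick the first action whose simulated outcome meets all expectations."""
    for a in actions:
        if all(e(simulate(a)) for e in expectations):
            return a
    return None

print(choose(["share_record", "ask_consent"]))  # -> 'ask_consent'
```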
Building energy management systems (BEMS) are important for optimizing energy usage. However, existing BEMS often lack natural language explanations, making it difficult for residents to understand and trust system decisions. This paper proposes an LLM-agent-based policy-explanation framework for BEMS (the BEMS Agent), enabling the system to generate comprehensible explanations of environmental parameters and control decisions. To improve reasoning accuracy, we introduce a Reflection-Chain-of-Thought (RCoT) mechanism that enhances the logical consistency of explanations. Furthermore, we develop an automated evaluation method to assess how well users understand the generated explanations. This paper highlights the potential of LLM-driven interpretability in increasing user trust and optimizing energy management efficiency, as evidenced by a 29.4% improvement in scores when using RCoT prompts compared to CoT prompts.
Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low-quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench-V, a new high-quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic "Error → Reflection → Correction" learning paradigm, resulting in our reflection dataset ToolBench-R and addressing the critical weakness in tool reflection. Finally, we obtain Tool-MVR by finetuning open-source LLMs (e.g., Qwen-7B) on both ToolBench-V and ToolBench-R. Our experiments demonstrate that Tool-MVR achieves state-of-the-art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT-4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool-MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM's 9.1%.
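The "Error → Reflection → Correction" paradigm above amounts to a retry loop in which execution feedback drives a corrected tool call. In this hedged sketch, `execute_tool` is a toy registry and `reflect` applies a canned correction where a real system would query the LLM with the error message.

```python
# Hedged sketch of an "Error -> Reflection -> Correction" loop for tool
# calls; `execute_tool` and `reflect` are hypothetical stand-ins.

def execute_tool(name: str, args: dict):
    tools = {"add": lambda a: a["x"] + a["y"]}
    if name not in tools:
        raise KeyError(f"unknown tool {name!r}")
    return tools[name](args)

def reflect(error: str, call):
    # A real system would ask the LLM to diagnose the error; here we
    # apply a canned correction: retry with the known tool name.
    name, args = call
    return ("add", args)

def run_with_reflection(call, max_retries: int = 2):
    for _ in range(max_retries + 1):
        try:
            return execute_tool(*call)
        except Exception as e:
            call = reflect(str(e), call)   # correct the call and retry
    raise RuntimeError("could not correct the tool call")

print(run_with_reflection(("sum", {"x": 2, "y": 3})))  # -> 5
```

The point of training on such traces, rather than on clean demonstrations alone, is that the model learns what to do after an error, not just how to imitate success.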
While Large Language Models (LLMs) show promise for task planning, their efficacy diminishes in complex, long-horizon tasks within dynamic, partially observable environments, primarily due to challenges in long-term reasoning and effective adaptation from experience. A key limitation of current approaches is the insufficient utilization of historical trajectory information. To overcome these challenges in the context of Partially Observable Markov Decision Process (POMDP) planning, this paper introduces DMA-MCTS (Dynamic Memory-Augmented Monte-Carlo Tree Search), a framework that integrates Monte Carlo Tree Search (MCTS) with LLMs, augmented by a novel dynamic memory and reflection system. The core technical contributions include: (1) a dual-layer semantic memory repository enabling efficient context-aware retrieval of past experiences; (2) a memory-enhanced UCT selection strategy biased by historical Q-values to guide search; and (3) a differentiated reflection mechanism employing LLMs to extract generalizable knowledge from both successful and failed trajectories. Comprehensive evaluations conducted on complex object rearrangement tasks within the VirtualHome simulator demonstrate that DMA-MCTS significantly outperforms relevant baselines, including standard LLM-MCTS approaches, in terms of task success rate, generalization capabilities, and planning efficiency. These results underscore the critical importance of integrating structured dynamic memory and systematic reflection mechanisms for developing highly adaptive and effective LLM-based agents capable of tackling long-horizon planning problems.
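A memory-enhanced UCT rule of the kind described in contribution (2) might blend each child's own Q-value with a Q-value retrieved from past trajectories. The blending weight `lam` and the action names below are assumptions for illustration, not the authors' exact formula.

```python
import math

# Sketch of a memory-biased UCT selection rule (an assumption about the
# general idea, not DMA-MCTS's exact formulation).

def uct_score(q, n_child, n_parent, memory_q, lam=0.5, c=1.4):
    """Blend the node's own Q with a retrieved historical Q, plus exploration."""
    blended = (1 - lam) * q + lam * memory_q
    return blended + c * math.sqrt(math.log(n_parent) / n_child)

def select(children, n_parent, memory):
    # children: {action: (q, visit_count)}; memory: {action: historical_q}
    return max(children,
               key=lambda a: uct_score(children[a][0], children[a][1],
                                       n_parent, memory.get(a, 0.0)))

children = {"pick_cup": (0.2, 4), "open_fridge": (0.25, 4)}
memory = {"pick_cup": 0.9}        # past trajectories favored this action
print(select(children, 8, memory))  # -> 'pick_cup'
```

With equal visit counts, the exploration terms cancel and the retrieved historical Q tips the selection, which is exactly the bias such a memory is meant to provide.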
We demonstrate that a group of AI agents can autonomously optimize power commissioning in WDM links. By leveraging modern LLMs' reflection and reasoning capabilities and interacting with a network digital twin, the agents achieve optimal solutions for different criteria such as power and OSNR equalization.
Large language models (LLMs) fine-tuned on multimodal financial data have demonstrated impressive reasoning capabilities in various financial tasks. However, they often struggle with multi-step, goal-oriented scenarios in interactive financial markets, such as trading, where complex agentic approaches are required to improve decision-making. To address this, we propose \textsc{FLAG-Trader}, a unified architecture integrating linguistic processing (via LLMs) with gradient-driven reinforcement learning (RL) policy optimization, in which a partially fine-tuned LLM acts as the policy network, leveraging pre-trained knowledge while adapting to the financial domain through parameter-efficient fine-tuning. Through policy gradient optimization driven by trading rewards, our framework not only enhances LLM performance in trading but also improves results on other financial-domain tasks. We present extensive empirical evidence to validate these enhancements.
In the contemporary landscape of large language model (LLM) development, it is crucial to address the challenges of deploying these models on hardware-constrained consumer electronic devices (CEDs), especially within complex dynamic task scheduling in multi-cloud environments (MCE). We propose a novel methodology leveraging a lightweight LLM to enhance task scheduling decisions in MCE. Our approach involves creating a task scheduling expert database informed by optimization objectives to fine-tune the lightweight LLM. This enables the model to generate a schedulable candidate set of tasks based on the current state of tasks and operational conditions within CEDs across MCE, optimizing scheduling decisions and enhancing overall efficiency. Simulations using both synthetic and real-world datasets demonstrate that our method outperforms three other algorithms in cost, makespan, and energy consumption. In summary, our methodology empowers CEDs to optimize the utilization of multi-cloud resources and harness the capabilities of lightweight LLMs to effectively minimize makespan, operational costs, and energy consumption during the task scheduling process, thereby facilitating efficient task scheduling.
Although Multi-Agent Reinforcement Learning (MARL) is effective for complex multi-robot tasks, it suffers from low sample efficiency and requires iterative manual reward tuning. Large Language Models (LLMs) have shown promise in single-robot settings, but their application in multi-robot systems remains largely unexplored. This letter introduces a novel LLM-Aided MARL (LAMARL) approach, which integrates MARL with LLMs, significantly enhancing sample efficiency without requiring manual design. LAMARL consists of two modules: the first module leverages LLMs to fully automate the generation of prior policy and reward functions. The second module is MARL, which uses the generated functions to guide robot policy training effectively. On a shape assembly benchmark, both simulation and real-world experiments demonstrate the unique advantages of LAMARL. Ablation studies show that the prior policy improves sample efficiency by an average of 185.9% and enhances task completion, while structured prompts based on Chain-of-Thought (CoT) and basic APIs improve LLM output success rates by 28.5%–67.5%.
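The first LAMARL module, automatic generation of reward functions, rests on the fact that an LLM can emit a reward function as executable code. In this sketch the generated source string is canned, standing in for an actual LLM response; the distance-based reward is an invented example.

```python
# Sketch of letting an "LLM" emit a reward function as code
# (the source string here is canned, standing in for a model response).

llm_reward_src = """
def reward(robot_pos, target_pos):
    # negative Manhattan distance: closer to the assembly target is better
    return -abs(robot_pos[0] - target_pos[0]) - abs(robot_pos[1] - target_pos[1])
"""

namespace = {}
exec(llm_reward_src, namespace)          # compile the generated function
reward = namespace["reward"]
print(reward((0, 0), (2, 3)))            # -> -5
```

Once compiled, the function can be called every step of MARL training at no further LLM cost, which is where the sample-efficiency gain comes from.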
Multi-agent systems (MAS) have shown great potential in executing complex tasks, but coordination and safety remain significant challenges. Multi-Agent Reinforcement Learning (MARL) offers a promising framework for agent collaboration, but it faces difficulties in handling complex tasks and designing reward functions. The introduction of Large Language Models (LLMs) has brought stronger reasoning and cognitive abilities to MAS, but existing LLM-based systems struggle to respond quickly and accurately in dynamic environments. To address these challenges, we propose LLM-based Graph Collaboration MARL (LGC-MARL), a framework that efficiently combines LLMs and MARL. This framework decomposes complex tasks into executable subtasks and achieves efficient collaboration among multiple agents through graph-based coordination. Specifically, LGC-MARL consists of two main components: an LLM planner and a graph-based collaboration meta policy. The LLM planner transforms complex task instructions into a series of executable subtasks, evaluates the rationality of these subtasks using a critic model, and generates an action dependency graph. The graph-based collaboration meta policy facilitates communication and collaboration among agents based on the action dependency graph, and adapts to new task environments through meta-learning. Experimental results on the AI2-THOR simulation platform demonstrate the superior performance and scalability of LGC-MARL in completing various complex tasks.
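Executing an action dependency graph of the kind the LLM planner produces reduces to topological scheduling: subtasks whose dependencies are satisfied can run in parallel. The subtask names below are invented; the grouping logic uses Python's standard `graphlib`.

```python
from graphlib import TopologicalSorter

# Illustrative sketch: running a planner's action dependency graph
# in dependency order (subtask names are invented for the example).

def execution_waves(dep_graph):
    """Group subtasks into waves whose members can run in parallel."""
    ts = TopologicalSorter(dep_graph)   # maps node -> set of prerequisites
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = list(ts.get_ready())    # all subtasks whose deps are done
        waves.append(sorted(ready))
        ts.done(*ready)
    return waves

# "find_knife" and "find_apple" have no deps; "slice_apple" needs both.
graph = {"slice_apple": {"find_knife", "find_apple"},
         "find_knife": set(), "find_apple": set()}
print(execution_waves(graph))  # -> [['find_apple', 'find_knife'], ['slice_apple']]
```

Each wave is a set of subtasks the meta policy could assign to different agents simultaneously, which is the coordination structure the dependency graph encodes.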
With the rapid development of the Industrial Internet of Things (IIoT), credible federated learning offers an effective solution for it. In this article, we use credible collaboration between large language models (LLMs) and a reinforcement learning (RL) model to improve the autonomous decision-making efficiency of autonomous underwater vehicles (AUVs), reduce resource and power consumption, and solve robust decision-making problems in open environments. First, considering the complex terrain and hydrodynamic conditions of the ocean, we construct a high-accuracy, high-reliability 3-D ocean simulation environment to model the behavioral constraints of an AUV in the real ocean. Second, we integrate the LLaMA model into the AUV's decision-making process, utilizing its powerful information-processing capability for environmental analysis and action selection to improve the AUV's decision-making generalization in dynamic ocean environments. Finally, we propose a proximal policy advantage estimation (PPAE) method and achieve safe and efficient path planning for the AUV based on the LLM's decision output and dynamic field-environment information. Experimental results show that our method substantially improves the decision accuracy and robustness of the AUV, demonstrating the effectiveness of LLMs for underwater intelligent-agent control.
Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost performance during test time.
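The soft Bellman equation invoked above, in its standard maximum-entropy form (with β as the entropy temperature; OREO's exact training residual may differ from this textbook version), is:

```latex
Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\!\left[ V(s_{t+1}) \right],
\qquad
V(s_t) = \beta \log \sum_{a} \exp\!\big( Q(s_t, a) / \beta \big)
```

The corresponding optimal policy satisfies \(\pi(a \mid s) \propto \exp\big(Q(s, a)/\beta\big)\), so jointly fitting the policy and a value function to these identities assigns per-token credit without requiring paired preference data.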
We introduce a novel reinforcement learning framework of LLM agents named AGILE (AGent that Interacts and Learns from Environments) designed to perform complex conversational tasks with users, leveraging LLMs, memory, tools, and interactions with experts. The agent possesses capabilities beyond conversation, including reflection, tool usage, and expert consultation. We formulate the construction of such an LLM agent as a reinforcement learning (RL) problem, in which the LLM serves as the policy model. We fine-tune the LLM using labeled data of actions and the PPO algorithm. We focus on question answering and release a dataset for agents called ProductQA, comprising challenging questions in online shopping. Our extensive experiments on ProductQA, MedMCQA and HotPotQA show that AGILE agents based on 7B and 13B LLMs trained with PPO can outperform GPT-4 agents. Our ablation study highlights the indispensability of memory, tools, consultation, reflection, and reinforcement learning in achieving the agent's strong performance. Datasets and code are available at https://github.com/bytarnish/AGILE.
We propose herein LLM-Guided Reinforcement Learning (LGRL), a novel framework that leverages large language models (LLMs) to decompose high-level objectives into a sequence of manageable subgoals in interactive environments. Our approach decouples high-level planning from low-level action execution by dynamically generating context-aware subgoals that guide the reinforcement learning (RL) agent. During training, intermediate subgoals—each associated with partial rewards—are generated based on the agent’s current progress, providing fine-grained feedback that facilitates structured exploration and accelerates convergence. At inference, a chain-of-thought strategy is employed, enabling the LLM to adaptively update subgoals in response to evolving environmental states. Although demonstrated on a representative interactive setting, our method is generalizable to a wide range of complex, goal-oriented tasks. Experimental results show that LGRL achieves higher success rates, improved efficiency, and faster convergence compared to baseline approaches.
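The partial-reward mechanism described above is a form of reward shaping over LLM-generated subgoals. In this minimal sketch the subgoal list is a hard-coded placeholder (a real system would have the LLM produce it from the task context), and the per-subgoal bonus is an arbitrary example value.

```python
# Minimal sketch of subgoal-based reward shaping (placeholder subgoals;
# a real system would have the LLM generate them from task context).

def shaped_reward(completed: set, subgoals: list[str],
                  task_done: bool, partial: float = 0.1) -> float:
    """Partial reward per completed subgoal, full reward on task success."""
    r = partial * sum(g in completed for g in subgoals)
    return r + (1.0 if task_done else 0.0)

subgoals = ["find_key", "open_door", "reach_goal"]
print(shaped_reward({"find_key", "open_door"}, subgoals, task_done=False))  # -> 0.2
```

The intermediate bonuses give the RL agent a gradient toward the goal long before the terminal reward is reachable, which is what accelerates convergence in sparse-reward settings.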
Reinforcement learning (RL) often encounters delayed and sparse feedback in real-world applications, sometimes with only episodic rewards. Previous approaches have made some progress in reward redistribution for credit assignment but still face challenges, including training difficulties due to redundancy and ambiguous attributions stemming from overlooking the multifaceted nature of mission performance evaluation. Fortunately, large language models (LLMs) encompass rich decision-making knowledge and provide a plausible tool for reward redistribution. Even so, deploying LLMs in this setting is non-trivial due to the misalignment between linguistic knowledge and the required symbolic form, together with the inherent randomness and hallucinations of LLM inference. To tackle these issues, we introduce LaRe, a novel LLM-empowered symbolic decision-making framework, to improve credit assignment. Key to LaRe is the concept of the latent reward, which works as a multi-dimensional performance evaluation, enabling more interpretable goal attainment from various perspectives and facilitating more effective reward redistribution. We show that semantically grounded code generated by the LLM can bridge linguistic knowledge and symbolic latent rewards, as it is executable on symbolic objects. Meanwhile, we design latent-reward self-verification to increase the stability and reliability of LLM inference. Theoretically, eliminating reward-irrelevant redundancy in the latent reward benefits RL performance through more accurate reward estimation. Extensive experiments show that LaRe (i) achieves superior temporal credit assignment over SOTA methods, (ii) excels in allocating contributions among multiple agents, and (iii) outperforms policies trained with ground-truth rewards on certain tasks.
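The latent-reward idea — the LLM emits executable scoring code for several performance facets, which is self-verified on sample transitions before being trusted — can be sketched as below. The facet functions and state fields are invented placeholders for LLM-generated code.

```python
# Hedged sketch of the latent-reward idea: executable facet scorers
# (stand-ins for LLM-generated code) combined into a multi-dimensional
# reward, with a simple self-verification pass before use.

latent_facets = {
    "progress": lambda s: s["distance_closed"],
    "safety":   lambda s: -s["collisions"],
}

def latent_reward(state: dict, weights: dict) -> float:
    """Weighted combination of the facet scores for one state."""
    return sum(w * latent_facets[k](state) for k, w in weights.items())

def self_verify(sample_states) -> bool:
    """Reject facet code that crashes or returns non-numeric values."""
    try:
        return all(isinstance(f(s), (int, float))
                   for f in latent_facets.values() for s in sample_states)
    except Exception:
        return False

samples = [{"distance_closed": 2.0, "collisions": 1}]
assert self_verify(samples)
print(latent_reward(samples[0], {"progress": 1.0, "safety": 0.5}))  # -> 1.5
```

Because each facet is separately executable, redundant or reward-irrelevant facets can be dropped before redistribution, which is the redundancy-elimination benefit the abstract refers to.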
This synthesis organizes research on AI agents and large models into five core dimensions: first, multi-agent architectures and collaborative communication, examining interaction mechanisms in complex systems; second, enhancement of the underlying logic of reasoning, reflection, and planning capabilities; third, tool use and task execution, focusing on agents' autonomous tool invocation and self-iteration; fourth, broad vertical-industry applications, covering key production domains such as engineering, finance, and scientific research; and finally, safety assurance, reliability evaluation frameworks, and performance benchmarks for agent systems, aiming to advance agents from single-function validation toward trustworthy, controllable industrial-grade deployment.