Latest Research on Agents
Tool Learning: Teaching Agents Stable, Scalable Tool Invocation (Personalization / Collaboration / Automation)
Centered on the core problem of tool learning and usable tool-calling capability: enabling LLMs to learn stable tool invocation from documentation, collaboration, and interaction signals, and further supporting personalization, automated prompting, and multi-step tool use (including learning objectives for tool discovery/selection and function-style executable interfaces).
- Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents(Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Suzan Verberne, Zhaochun Ren, 2024, Proceedings of the ACM on Web Conference 2025)
- Learning to Use Tools via Cooperative and Interactive Agents(Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Pengjie Ren, Suzan Verberne, Zhaochun Ren, 2024, Findings of the Association for Computational Linguistics: EMNLP 2024)
- PEToolLLM: Towards Personalized Tool Learning in Large Language Models(Qiancheng Xu, Yongqi Li, Heming Xia, Fan Liu, Yang Min, Wenjie Li, 2025, Findings of the Association for Computational Linguistics: ACL 2025)
- LLM-Based Agents for Tool Learning: A Survey(Weikai Xu, Chengrui Huang, Shen Gao, Shuo Shang, 2025, Data Science and Engineering)
- StepTool: Enhancing Multi-Step Tool Usage in LLMs via Step-Grained Reinforcement Learning(Yuanqing Yu, Zhefan Wang, Weizhi Ma, Zhicheng Guo, Jingtao Zhan, Shuai Wang, Chuhan Wu, Zhiqiang Guo, Min Zhang, 2024, Proceedings of the 34th ACM International Conference on Information and Knowledge Management)
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning(Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, V. Ioannidis, Karthik Subbian, J. Leskovec, James Zou, 2024, Advances in Neural Information Processing Systems 37)
- EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction(Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, Deqing Yang, 2025, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers))
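As an illustration of the function-calling pattern these papers study, the following minimal Python sketch pairs documented tools with a selection step. The keyword heuristic merely stands in for the learned tool-selection policy, and all tool names here are hypothetical, not taken from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool registry: each tool carries the documentation
# an LLM would read when deciding which tool to call.
@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[..., str]

TOOLS = {
    "weather": Tool("weather", "Return the weather for a city.",
                    lambda city: f"sunny in {city}"),
    "calc": Tool("calc", "Evaluate a simple arithmetic expression.",
                 lambda expr: str(eval(expr, {"__builtins__": {}}))),
}

def select_tool(query: str) -> str:
    """Stand-in for the LLM's tool-selection step: a keyword heuristic
    here; in the surveyed systems this is learned from documentation
    and interaction signals."""
    if any(ch.isdigit() for ch in query):
        return "calc"
    return "weather"

def run(query: str, arg: str) -> str:
    tool = TOOLS[select_tool(query)]
    return tool.fn(arg)

print(run("what is 2+3", "2+3"))        # -> 5
print(run("weather in Paris", "Paris"))  # -> sunny in Paris
```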
Planning and Execution Orchestration: From Plan Generation to Stateful Evaluation (Cutting Cost, Improving Efficiency)
Jointly focused on planning/execution orchestration strategies and stateful interactive evaluation: lowering tool-selection cost through planners (e.g., global DAGs), self-verification and reasoned plans, and statistical structure, and introducing cross-turn memory and stateful evaluation for GUI and interactive environments, so as to improve the quality and efficiency of complex-task execution.
- ToolFiVe: Enhancing Tool-Augmented LLMs via Tool Filtering and Verification(Hailun Lu, Xingming Li, Xuanyu Ji, Zhigang Kan, Qingyong Hu, 2025, ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning(Xiaolong Wei, Yuehu Dong, Xin Wang, Xi Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- AutoTool: Efficient Tool Selection for Large Language Model Agents(Jingyi Jia, Qinbin Li, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- Beyond Static GUI Agent: Evolving LLM-based GUI Testing via Dynamic Memory(Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Yangguang Xue, Boyu Wu, Yuekai Huang, Libin Wu, Qing Wang, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE))
- ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities(Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Nan Feng, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang, 2025, Findings of the Association for Computational Linguistics: NAACL 2025)
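The planner-centric idea (a global plan DAG replacing step-by-step ReAct loops) can be illustrated with Python's standard `graphlib`. The travel-planning step names below are invented for the example; real planners also generate the DAG itself, which is omitted here.

```python
from graphlib import TopologicalSorter

# Hypothetical global plan as a DAG: each step maps to the set of
# steps whose outputs it needs.
plan = {
    "search_flights": set(),
    "search_hotels": set(),
    "compare_costs": {"search_flights", "search_hotels"},
    "book": {"compare_costs"},
}

ts = TopologicalSorter(plan)
ts.prepare()
order = []
while ts.is_active():
    ready = list(ts.get_ready())  # steps whose dependencies are all met
    order.append(sorted(ready))   # each batch could run in parallel
    ts.done(*ready)

print(order)
# -> [['search_flights', 'search_hotels'], ['compare_costs'], ['book']]
```

The independent first batch is where the efficiency gain over sequential ReAct-style execution comes from: both searches can be dispatched at once.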
Modular Systems and Protocols/Stacks: Connecting Tools and Components via Standard Interfaces
Centered on modular systems and protocol/stack standardization: wiring tools and components together through engineered interfaces and protocols, emphasizing the composability of planner/executor, routing, and memory modules, and using protocols such as MCP to cut integration cost and enable scalable deployment.
- Model Context Protocol (MCP): A Lightweight, Modular Framework for Tool-Augmented LLM Agents(Nisharg Nargund, Anil Kumar Swain, Naliniprava Behera, 2025, 2025 13th International Conference on Intelligent Systems and Embedded Design (ISED))
- Composable AI Stack for Intelligent Agents: Modular Orchestration Using Context Routing, Memory, and Tools(Angshuman Rudra, Manan Agrawal, 2025, 2025 International Conference on Computer and Applications (ICCA))
- Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning(Xiaolong Wei, Yuehu Dong, Xin Wang, Xi Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- An Agent Framework for a Modular Serious Game(P. Jepp, M. Fradinho, J. Pereira, 2010, 2010 Second International Conference on Games and Virtual Worlds for Serious Applications)
- Adaptive Modular Agent Architecture for Hybrid Two-Level Reasoning(Dmitry Gnatyshak, S. Álvarez-Napagao, Julian Padget, Ulises Cortés, 2025, Lecture Notes in Computer Science)
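To make the interface/protocol idea concrete, here is a toy sketch (not the actual MCP wire format) in which every component exposes the same describe/invoke surface, so a router can compose components without bespoke glue:

```python
# Toy illustration of protocol-style decoupling: components share one
# uniform interface, and the router only knows that interface.
class Component:
    def describe(self) -> dict:
        raise NotImplementedError
    def invoke(self, request: dict) -> dict:
        raise NotImplementedError

class EchoTool(Component):
    def describe(self):
        return {"name": "echo", "params": ["text"]}
    def invoke(self, request):
        return {"result": request["text"]}

class Router:
    def __init__(self):
        self.components = {}
    def register(self, c: Component):
        # Registration reads the component's self-description, so new
        # components plug in without changing the router.
        self.components[c.describe()["name"]] = c
    def call(self, name: str, request: dict) -> dict:
        return self.components[name].invoke(request)

router = Router()
router.register(EchoTool())
print(router.call("echo", {"text": "hi"}))  # -> {'result': 'hi'}
```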
Memory and Working Memory: Long-Term/Distributed Memory and Easing Memory Constraints in Long-Horizon Reasoning
Focused on memory and working-memory management: mechanisms for managing long-term/distributed memory, optimization of short-term/tool-retrieval memory, and mitigation of working-memory constraints in long-horizon or graph reasoning (improving long-range capability through buffering, indexing, and cooperation with tools).
- Memory Matters: The Need to Improve Long-Term Memory in LLM-Agents(Kostas Hatalis, Despina Christou, Joshua Myers, Steven Jones, Keith A. Lambert, Adam Amos-Binks, Zohreh Dannenhauer, Dustin Dannenhauer, 2024, Proceedings of the AAAI Symposium Series)
- MemIndex: Agentic Event-based Distributed Memory Management for Multi-agent Systems(Alaa Saleh, Sasu Tarkoma, Anders Lindgren, Praveen Kumar Donta, S. Dustdar, Susanna Pirttikangas, Lauri Lovén, 2025, ACM Transactions on Autonomous and Adaptive Systems)
- Beyond Static GUI Agent: Evolving LLM-based GUI Testing via Dynamic Memory(Mengzhuo Chen, Zhe Liu, Chunyang Chen, Junjie Wang, Yangguang Xue, Boyu Wu, Yuekai Huang, Libin Wu, Qing Wang, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE))
- GraphCogent: Mitigating LLMs' Working Memory Constraints via Multi-Agent Collaboration in Complex Graph Understanding(Rongzheng Wang, S. Liang, Qizhi Chen, Yihong Huang, Muquan Li, Yi-Zhuo Ma, Dongyang Zhang, Ke Qin, Man-Fai Leung, 2025, Proceedings of the ACM Web Conference 2026)
- MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Retrieval and Invocation in LLM Agent Multi-turn Conversations(Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke, 2026, Lecture Notes in Computer Science)
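A bounded short-term tool memory of the kind MemTool studies can be sketched as an LRU cache over tool schemas; this is an illustrative simplification, not MemTool's actual algorithm.

```python
from collections import OrderedDict

# Sketch: keep only the k most recently used tools in the prompt
# context, evicting the least recently used when over capacity.
class ToolMemory:
    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self.tools = OrderedDict()  # name -> schema

    def use(self, name: str, schema: str):
        if name in self.tools:
            self.tools.move_to_end(name)  # refresh recency
        self.tools[name] = schema
        if len(self.tools) > self.capacity:
            self.tools.popitem(last=False)  # evict least recently used

    def context(self):
        # Tool names whose schemas would be injected into the prompt.
        return list(self.tools)

mem = ToolMemory(capacity=2)
mem.use("search", "...")
mem.use("calc", "...")
mem.use("search", "...")   # refreshes "search"
mem.use("weather", "...")  # evicts "calc"
print(mem.context())  # -> ['search', 'weather']
```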
Multi-Agent Collaboration and Coordination: Consensus Decisions, ToM Alignment, and Evaluation at Scale
Jointly focused on multi-agent collaboration and coordination (consensus / ToM / cooperative RL / scaling): semantic-level consensus decision-making and conflict resolution, ToM alignment and cooperative reinforcement-learning modeling, coordination benchmarks and evaluation analyses, and dynamic task decomposition with parallel execution in large-scale, multi-domain environments without predefined SOPs.
- Many Minds, One Path: LLM-Augmented Consensus Decision for Distributed Control in Multi-Agent Collaborative Stable Scenarios(Zhuohao Yu, Zhe Li, Tao Ren, ChenXue Wang, Junjie Wang, Qing Wang, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- LLM-Guided Multi-Agent Collaboration for Complex Task Automation(Vishal Bharadwaj Meruga, 2025, 2025 5th International Conference on Emerging Research in Electronics, Computer Science and Technology (ICERECT))
- Evaluating LLM-Based Autonomous Agent Architectures for Task Execution with Social Robots(David Cuevas, Rubén Manrique, 2025, Communications in Computer and Information Science)
- LLM Collaboration with Multi-Agent Reinforcement Learning(Shuo Liu, Zeyu Liang, Xueguang Lyu, Christopher Amato, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- Adaptive Theory of Mind for LLM-based Multi-Agent Coordination(Chunjiang Mu, Yasi Zeng, Qiaosheng Zhang, Kun Shao, Chenhui Chu, Hao Guo, Danyang Jia, Zhen Wang, Shuyue Hu, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models(Saaket Agashe, Yue Fan, Anthony Reyna, Xin Eric Wang, 2023, Findings of the Association for Computational Linguistics: NAACL 2025)
- MegaAgent: A Large-Scale Autonomous LLM-based Multi-Agent System Without Predefined SOPs(Qian Wang, Tianyu Wang, Zhenheng Tang, Qinbin Li, Nuo Chen, Jingsheng Liang, Bingsheng He, 2024, Findings of the Association for Computational Linguistics: ACL 2025)
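As a toy illustration of the aggregation step in consensus decision-making (the consensus work above resolves conflicts at the semantic level rather than by simple voting; this only shows the final tally):

```python
from collections import Counter

# Each agent proposes an action; the group adopts the majority
# proposal, breaking ties by a fixed priority order. All action
# names are hypothetical.
def consensus(proposals: list[str], priority: list[str]) -> str:
    counts = Counter(proposals)
    best = max(counts.values())
    tied = [p for p, c in counts.items() if c == best]
    return min(tied, key=priority.index)  # earlier in priority wins ties

votes = ["move_left", "move_left", "wait"]
print(consensus(votes, ["wait", "move_left"]))  # -> move_left
print(consensus(["wait", "move_left"], ["wait", "move_left"]))  # -> wait
```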
Cognitive and Hybrid Reasoning Architectures: Non-Monotonic Logic / Semantic Mediation and Modular Cognitive Pipelines
Centered on mechanism-level design in cognitive and hybrid reasoning architectures: building replanning capability from non-monotonic logic, domain knowledge, and semantic mediation, and adopting working-memory-inspired modular decomposition (sense/buffer/execute, etc.), with cognitive architectures compensating for the reasoning deficits of pure generation.
- Combining LLM, Non-Monotonic Logical Reasoning, and Human-In-the-loop Feedback in an Assistive AI Agent(Tianyi Fu, Brian Jauw, Mohan Sridharan, 2025, 2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN))
- A Modular Cognitive Architecture for Collective Intelligence Systems(A. Gibson, D. Sokolov, 2025, Lecture Notes in Computer Science)
- GraphCogent: Mitigating LLMs' Working Memory Constraints via Multi-Agent Collaboration in Complex Graph Understanding(Rongzheng Wang, S. Liang, Qizhi Chen, Yihong Huang, Muquan Li, Yi-Zhuo Ma, Dongyang Zhang, Ke Qin, Man-Fai Leung, 2025, Proceedings of the ACM Web Conference 2026)
Overall Agentic AI Frameworks and Surveys: Paradigm Taxonomies, Architectural Components, and Governance in Practice
A top-level perspective of macro-architectural surveys and framework-based governance and paradigm classification: systematically reviewing the conceptual evolution of agentic AI, its architectural components (tools, memory, planning, governance, etc.), and symbolic/neural/hybrid paradigm taxonomies, and offering overall frameworks and future directions for engineering and research.
- A Modular Semantic Kernel Agent for Automated Code Review and Refactoring Feedback(Semih Yazıcı, Seza Dursun, Bahar Önel, Tülin Işıkkent, Sedat Çelik, Erem Karalar, Mert Alacan, 2025, Orclever Proceedings of Research and Development)
- Agentic AI: A Review of Architecture, Governance, and Sustainable Goal-Directed Autonomy(G Rafiee, WL Woo, 2025, Authorea Preprints)
- The Rise of Agentic AI: Synthesis of Current Knowledge and Future Research Agenda(Md. Asadul Islam, Subbulakshmi Somu, F. Aldaihani, 2025, Global Business and Organizational Excellence)
- Designing the Mind: How Agentic Frameworks Are Shaping the Future of AI Behavior(V. Garg, 2025, Journal of Computer Science and Technology Studies)
- AI Agents and Agentic Systems: A Multi-Expert Analysis(Laurie Hughes, Yogesh K. Dwivedi, Tegwen Malik, Mazen Shawosh, M. Albashrawi, Il Jeon, Vincent Dutot, Mandanna Appanderanda, Tom Crick, Rahul De', Mark Fenwick, Senali Madugoda Gunaratnege, Paulius Jurcys, A. Kar, N. Kshetri, Keyao Li, Sashah Mutasa, Spyridon Samothrakis, Michael Wade, Paul Walton, 2025, Journal of Computer Information Systems)
- Agentic AI: a comprehensive survey of architectures, applications, and future directions(Mohamad Abou Ali, F. Dornaika, 2025, Artificial Intelligence Review)
- The Rise of Agentic AI: A Review of Definitions, Frameworks, Architectures, Applications, Evaluation Metrics, and Challenges(Ajay Bandi, Bhavani Kongari, Roshini Naguru, Sahitya Pasnoor, Sri Vidya Vilipala, 2025, Future Internet)
- Agentic AI Systems: What It Is and Isn't(Yogesh K. Dwivedi, M. Y. Helal, I. Elgendy, Rasha Alahmad, Paul Walton, Ayoung Suh, Vinay Singh, Il Jeon, 2025, Global Business and Organizational Excellence)
Agentic AI for Real-World Applications: Domain Deployment, Safety Risks, and Evaluation Benchmarks
Deployment-oriented work on real-world applications and safety/risk evaluation: covering safety-sensitive settings such as healthcare, threats to and safety of autonomous systems such as autonomous driving and UAVs, and reproducible experimentation and benchmarking (e.g., FHIR-AgentEval), while also discussing the evaluation and the ethical/safety dimensions of applying agentic capabilities to real-world robotics.
- FHIR-AgentEval: A Modular Sandbox for Benchmarking Clinical LLM Agents with an Evaluation of Memory-Augmented Configurations.(Youssef Mokssit, Kamalakkannan Ravi, Mengshu Nie, Junyoung Kim, Cong Liu, 2026, Research square)
- The Role of Agentic Artificial Intelligence in Healthcare: A Systematic Review(Bernardo Gabriele Collaço, Syed Ali Haider, Srinivasagam Prabha, Cesar A. Gomez-Cabello, Ariana Genovese, Nadia Wood, Sanjay P. Bagaria, Narayanan Gopala, Cui Tao, Antonio J. Forte, 2025, Research Square)
- LLM and AI Agents for Autonomous Systems: A Survey of Applications, Datasets, and Security Challenges(M. Ferrag, Abderrahmane Lakas, N. Tihanyi, Mérouane Debbah, 2026, IEEE Open Journal of Intelligent Transportation Systems)
- From Prompt to Action: A Comprehensive Review of LLM Autonomous Agents(Zainab Rafique, Muhammad Wasim, Mudassar Hussain, Muzammil Hussain, Muhamad Irfan Memon, 2025, 2025 IEEE International Conference on Wireless for Space and Extreme Environments (WiSEE))
- The role of agentic AI in shaping a smart future: A systematic review(Soodeh Hosseini, Hossein Seilani, 2025, Array)
- Agentic LLM-based robotic systems for real-world applications: a review on their agenticness and ethics(Emmanuel K. Raptis, Athanasios Ch. Kapoutsis, Elias B. Kosmatopoulos, 2025, Frontiers in Robotics and AI)
Agent Capability Training and Self-Improvement: Reflective Learning, Data Methods, and Evaluation Frameworks
Focused on agent capability training and self-improvement through reflective feedback learning: reducing hallucination via data reconstruction and negative samples, reinforcement through verbal feedback without direct weight updates, and system frameworks that automatically and iteratively optimize agent configurations, together with the accompanying definitions, evaluation metrics, and testing methodologies.
- Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models(Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, Feng Zhao, 2024, Findings of the Association for Computational Linguistics ACL 2024)
- Reflexion: language agents with verbal reinforcement learning(Noah Shinn, Federico Cassano, Beck Labash, A. Gopinath, Karthik Narasimhan, Shunyu Yao, 2023, Advances in Neural Information Processing Systems 36)
- A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops(Kamer Ali Yüksel, Thiago Castro Ferreira, Mohamed Al-Badrashiny, Hassan Sawaf, 2025, Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025))
- LLM-Based Multi-agent Systems: Frameworks, Evaluation, Open Challenges, and Research Frontiers(S. Shaikh, 2025, Communications in Computer and Information Science)
- The Rise of Agentic AI: A Review of Definitions, Frameworks, Architectures, Applications, Evaluation Metrics, and Challenges(Ajay Bandi, Bhavani Kongari, Roshini Naguru, Sahitya Pasnoor, Sri Vidya Vilipala, 2025, Future Internet)
- The role of agentic AI in shaping a smart future: A systematic review(Soodeh Hosseini, Hossein Seilani, 2025, Array)
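The Reflexion-style loop, in which verbal feedback replaces weight updates, can be sketched as follows; the `attempt` and `reflect` functions are stand-ins for LLM calls, and the hard-coded hint only simulates a useful self-critique.

```python
# Sketch of verbal reinforcement: on each failure the agent appends a
# self-generated textual reflection to its context, then retries with
# that extra context instead of updating any weights.
def attempt(task: str, reflections: list[str]) -> bool:
    # Stand-in LLM attempt: succeeds once the context holds the hint.
    return "check edge cases" in reflections

def reflect(task: str) -> str:
    # Stand-in LLM self-critique produced after a failed attempt.
    return "check edge cases"

def reflexion_loop(task: str, max_trials: int = 3) -> int:
    reflections: list[str] = []
    for trial in range(1, max_trials + 1):
        if attempt(task, reflections):
            return trial          # number of trials until success
        reflections.append(reflect(task))
    return -1                     # gave up

print(reflexion_loop("fix the bug"))  # -> 2
```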
End-to-End Agent Operability: Closing the Loop of Tool Discovery, Execution, and Memory
Centered on the operability of end-to-end reasoning agents that integrate tool discovery, action execution, and memory management: executing tools and reasoning actions in a closed loop within a single agent workflow, with modularity and memory mechanisms improving overall usability and generalization.
- MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Retrieval and Invocation in LLM Agent Multi-turn Conversations(Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke, 2026, Lecture Notes in Computer Science)
- Adaptive Modular Agent Architecture for Hybrid Two-Level Reasoning(Dmitry Gnatyshak, S. Álvarez-Napagao, Julian Padget, Ulises Cortés, 2025, Lecture Notes in Computer Science)
- DeepAgent: A General Reasoning Agent with Scalable Toolsets(Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, Zhicheng Dou, 2025, Proceedings of the …)
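The discover-execute-memorize closed loop described above can be sketched minimally. The keyword matcher stands in for learned tool discovery over a large toolset, and all tool and function names are hypothetical.

```python
# Hypothetical toolset; in systems like DeepAgent this would be large
# and the discovery step would be learned retrieval, not a keyword test.
TOOLSET = {
    "unit_convert": lambda x: x * 10,   # cm -> mm
    "upper": lambda s: s.upper(),
}

episodic_memory: list[tuple[str, object]] = []

def discover(query: str) -> str:
    # Stand-in for dense retrieval over tool descriptions.
    return "unit_convert" if "convert" in query else "upper"

def step(query: str, arg):
    name = discover(query)
    result = TOOLSET[name](arg)
    episodic_memory.append((name, result))  # close the loop via memory
    return result

print(step("convert 10 cm", 10))  # -> 100
print(episodic_memory[-1][0])     # -> unit_convert
```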
The merged, unified grouping organizes the latest agent research into parallel thematic sections along the main line of capability acquisition (tool learning/training), system implementation (planning and orchestration / modular protocols / end-to-end closed loops), long-term usability (memory and working memory), complex-task capability (multi-agent collaboration / cognitive hybrid reasoning), and top-level governance and engineering deployment (overall framework surveys / real-world safety evaluation). It covers the field's systematic evolution from algorithmic methods and evaluation benchmarks to engineering deployment, avoids lumping mechanisms of different granularity (tool learning, planning and orchestration, memory management, end-to-end closed loops) into one coarse group, and preserves the distinctive focus of each cluster of literature.
50 related references in total
In this paper, we provide a review of the current efforts to develop LLM agents, which are autonomous agents that leverage large language models. We examine the memory management approaches used in these agents. One crucial aspect of these agents is their long-term memory, which is often implemented using vector databases. We describe how vector databases are utilized to store and retrieve information in LLM agents. Moreover, we highlight open problems, such as the separation of different types of memories and the management of memory over the agent's lifetime. Lastly, we propose several topics for future research to address these challenges and further enhance the capabilities of LLM agents, including the use of metadata in procedural and semantic memory and the integration of external knowledge sources with vector databases.
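To make the vector-database retrieval pattern concrete, here is a toy version in which memories are stored as embedding vectors and the most similar one to a query is recalled; the 2-d "embeddings" are hand-made for illustration, not produced by any real embedding model.

```python
import math

# Toy long-term memory store: text memory -> embedding vector.
memory_store = {
    "user prefers metric units": [0.9, 0.1],
    "user lives in Paris": [0.1, 0.9],
}

def cosine(a, b):
    # Cosine similarity, the usual ranking metric in vector databases.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def recall(query_vec):
    # Return the stored memory most similar to the query embedding.
    return max(memory_store, key=lambda m: cosine(memory_store[m], query_vec))

print(recall([0.8, 0.2]))  # -> user prefers metric units
```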
… requiring repeated, independent tool usage. We introduce MemTool, a short-term memory framework enabling LLM agents to dynamically retrieve and manage tools or MCP server …
The development of Large Language Models (LLMs) enables LLM-based GUI testing to interact with graphical user interfaces by understanding GUI screenshots and generating actions, which are widely applied in industry and academia. However, current approaches test each app in isolation, lacking mechanisms for experience accumulation and reuse. This limitation often causes GUI testing approaches to miss deeper exploration and fail to trigger bug-prone functionalities. To address this, we propose MemoDroid, a three-layer memory mechanism that augments LLM-based GUI testing with the ability to evolve through repeated interaction. MemoDroid designs episodic memory to capture functional-level testing traces, reflective memory to summarize issue patterns and redundant behaviors, and strategic memory to synthesize cross-app exploration strategies. These memory layers are dynamically retrieved and injected into LLM prompts at runtime, enabling the agent to reuse successful behaviors, avoid ineffective actions, and prioritize bug-prone paths. We implement MemoDroid as a lightweight plugin, which can be integrated into existing LLM-based GUI testing approaches. We evaluate MemoDroid on real-world apps from 15 diverse app categories. Results show that MemoDroid enhances GUI testing performance across five baselines, with activity and code coverage increasing by 79% - 96% and 81% - 97%, and bug detection improving by 57% - 198%. Ablation studies confirm the contributions of each memory layer. Furthermore, MemoDroid detects 49 new bugs in 200 popular apps, with 35 confirmed fixes and 14 acknowledged by developers, showing its practical value in memory-driven GUI testing.
… This narrative review explores the role of Agentic AI in shaping … the diverse capabilities of Agentic AI (eg, multimodal … The paper examines how Agentic AI enables autonomous …
The rapid adoption of artificial intelligence (AI) is shifting from tools that assist human tasks toward self‐directed, agentic AI systems capable of planning and executing complex goals with minimal oversight. However, a clear understanding of what distinguishes these systems from conventional AI agents and generative AI is lacking, obscuring their unique opportunities and risks. To this end, this article addresses that gap by defining the core concepts, technologies, and management approaches for agentic AI systems, which utilize planning, shared memory, tools, and multi‐agent teamwork to complete complex tasks autonomously. By contrasting this paradigm with its predecessors, the paper synthesizes recent technical surveys, governance proposals, and early industrial deployments to highlight that while agentic AI enables transformative applications like end‐to‐end process automation and adaptive decision support, it also introduces significant challenges, including cascading errors, goal misalignment, and regulatory gaps. Finally, this paper concludes with strategic guidance for organizations and consumers to adopt the capabilities of these systems responsibly, emphasizing the imperative of maintaining transparency, accountability, and human oversight.
Agentic AI represents a transformative shift in artificial intelligence, but its rapid advancement has led to a fragmented understanding, often conflating modern neural systems with outdated symbolic models—a practice known as conceptual retrofitting. This survey cuts through this confusion by introducing a novel dual-paradigm framework that categorizes agentic systems into two distinct lineages: the symbolic/classical (relying on algorithmic planning and persistent state) and the neural/generative (leveraging stochastic generation and prompt-driven orchestration). Through a systematic PRISMA-based review of 90 studies (2018–2025), we provide a comprehensive analysis structured around this framework across three dimensions: (1) the theoretical foundations and architectural principles defining each paradigm; (2) domain-specific implementations in healthcare, finance, and robotics, demonstrating how application constraints dictate paradigm selection; and (3) paradigm-specific ethical and governance challenges, revealing divergent risks and mitigation strategies. Our analysis reveals that the choice of paradigm is strategic: symbolic systems dominate safety-critical domains (e.g., healthcare), while neural systems prevail in adaptive, data-rich environments (e.g., finance). Furthermore, we identify critical research gaps, including a significant deficit in governance models for symbolic systems and a pressing need for hybrid neuro-symbolic architectures. The findings culminate in a strategic roadmap arguing that the future of Agentic AI lies not in the dominance of one paradigm, but in their intentional integration to create systems that are both adaptable and reliable. This work provides the essential conceptual toolkit to guide future research, development, and policy toward robust and trustworthy hybrid intelligent systems.
Agentic frameworks represent a paradigm shift in artificial intelligence, transitioning from reactive systems to autonomous entities capable of perceiving environments, reasoning about complex situations, planning actions, and executing decisions aligned with specific goals. These architectures integrate multiple specialized components—perception modules, world modeling capabilities, goal management systems, planning mechanisms, and action execution frameworks—working in concert to enable proactive behavior in dynamic environments. While offering transformative potential across domains including robotics, healthcare, finance, and human-AI collaboration, agentic systems simultaneously present significant challenges related to safety, value alignment, interpretability, and governance. Addressing these challenges requires multidisciplinary approaches spanning technical innovation, responsible design methodologies, and anticipatory governance frameworks. The evolution of agentic AI represents not merely a technical advancement but a fundamental reconceptualization of human-machine relationships, with profound implications for how intelligent systems will operate and integrate within society.
… Abstract—Agentic artificial intelligence (agentic AI) represents the next stage of intelligent systems, extending generative models with goal orientation, planning, memory, tool use, and …
ABSTRACT The emergence of AI agents and agentic systems represents a significant milestone in artificial intelligence, enabling autonomous systems to operate, learn, and collaborate in complex environments with minimal human intervention. This paper, drawing on multi-expert perspectives, examines the potential of AI agents and agentic systems to reshape industries by decentralizing decision-making, redefining organizational structures, and enhancing cross-functional collaboration. Specific applications include healthcare systems capable of creating adaptive treatment plans, supply chain agents that predict and address disruptions in real-time, and business process automation that reallocates tasks from humans to AI, improving efficiency and innovation. However, the integration of these systems raises critical challenges, including issues of attribution and shared accountability in decision-making, compatibility with legacy systems, and addressing biases in AI-driven processes. The paper concludes that while agentic systems hold immense promise, robust governance frameworks, cross-industry collaboration, and interdisciplinary research into ethical design are essential. Future research should explore adaptive workforce reskilling strategies, transparent accountability mechanisms, and energy-efficient deployment models to ensure ethical and scalable implementation.
Background/Objectives: Agentic AI represents a promising evolution of AI technology applied to healthcare, with systems increasingly capable of operating autonomously to achieve defined clinical goals. However, the literature lacks conceptual clarity between “AI agents” and “agentic AI”, and few studies have rigorously explored their clinical applications. Therefore, this study aims to conduct a novel systematic review addressing this gap by examining agentic AI systems in healthcare settings, characterizing their applications, features, outcomes, and limitations, and clarifying the conceptual distinctions between AI agents and agentic AI systems using predefined and objective criteria. Methods: A comprehensive search was conducted across PubMed, Embase, Cochrane, Scopus, and Google Scholar on April 6th, 2025. Studies were included if they involved AI systems in healthcare settings that demonstrated the following agentic features: autonomous operation, goal-directed behavior, and initiating action. Data on the clinical tasks achieved by the agents, key findings, features, and limitations were collected from the included studies. Screening and extraction followed PRISMA guidelines, with risk of bias assessed using ROBINS-I and Cochrane's Risk of Bias tools. Results: Of 984 retrieved records, seven studies met the inclusion criteria, spanning domains such as emergency medicine, oncology, radiology, and rehabilitation. Multi-agent architecture was frequently used to decompose and coordinate complex workflows. Among the included studies, the AIs showed high accuracy in diagnosing cancer patients, conducting treatment plans, sending alerts, coaching messages, analysing image data, and adapting to challenging experimental scenarios. While demonstrating potential for improved efficiency, task accuracy, and patient engagement, significant limitations were noted: narrow task scope, lack of physical agency, limited clinical validation, and barriers to integration into real-world healthcare systems. Only one system had been deployed in a patient-facing trial setting. Conclusion: The current literature suggests an emerging role and application of Agentic AI, holding promise with the potential to revolutionize diagnostics, triage, treatment planning, and patient management. However, real-world implementation and evaluations in the literature are limited. Future research must address critical validation, regulation, ethics, and clinical integration challenges to realize their full potential. Clear operational definitions and frameworks for evaluating agency are essential to support safe and effective deployment of these systems.
Agentic AI systems are a recently emerged and important approach that goes beyond traditional AI, generative AI, and autonomous systems by focusing on autonomy, adaptability, and goal-driven reasoning. This study provides a clear review of agentic AI systems by bringing together their definitions, frameworks, and architectures, and by comparing them with related areas like generative AI, autonomic computing, and multi-agent systems. To do this, we reviewed 143 primary studies on current LLM-based and non-LLM-driven agentic systems and examined how they support planning, memory, reflection, and goal pursuit. Furthermore, we classified architectural models, input–output mechanisms, and applications based on their task domains where agentic AI is applied, supported using tabular summaries that highlight real-world case studies. Evaluation metrics were classified as qualitative and quantitative measures, along with available testing methods of agentic AI systems to check the system’s performance and reliability. This study also highlights the main challenges and limitations of agentic AI, covering technical, architectural, coordination, ethical, and security issues. We organized the conceptual foundations, available tools, architectures, and evaluation metrics in this research, which defines a structured foundation for understanding and advancing agentic AI. These findings aim to help researchers and developers build better, clearer, and more adaptable systems that support responsible deployment in different domains.
Agentic artificial intelligence (AAI) represents a significant evolution in the field of AI, moving beyond traditional and generative systems toward models characterized by autonomy, adaptivity, proactiveness, and decision agency. Unlike earlier AI paradigms that were reactive or limited to narrow tasks, AAI integrates reasoning, memory, planning, and tool orchestration to pursue complex objectives with minimal human oversight. Using a systematic literature review method, this study synthesizes current knowledge on AAI by examining its conceptual foundations, practical applications, and emerging research directions. Conceptually, AAI is distinguished from automation, generative AI, and multi‐agent systems through its unique capacity to operate as a socio‐technical partner in organizational and societal contexts. In practice, AAI is being applied across sectors such as healthcare, finance, manufacturing, education, and sustainability, enabling organizations to enhance decision support, optimize processes, and improve resilience in global business contexts. However, these advancements present significant challenges, including governance, transparency, accountability, workforce transformation, and integration with legacy systems. On the research front, four major streams dominate current scholarship: human–AI collaboration and co‐agency; balancing AI autonomy with human control; governance and trust; and societal and ethical implications. To unify these insights, this paper develops an antecedent–mechanism–outcome framework linking technological, organizational, and societal enablers to the mechanisms and outcomes of AAI adoption. Building on this synthesis, a future research agenda is proposed that emphasizes conceptual refinement, responsible integration, methodological innovation, and interdisciplinary collaboration. 
Overall, the study contributes to both academic and managerial understanding in the global business context by highlighting AAI as both a driver of business strategy and a potential enabler of organizational excellence and sustainable development.
Large Language Models (LLMs) have quickly pushed the frontiers of autonomous agents with advanced reasoning, natural language interaction, and tool chaining in complex worlds. With LLM-based agents arising in domains like digital assistants, autonomous robots, and mission planning, it is more critical than ever to have a deep understanding of their construction, strengths, and weaknesses—especially for safety-critical and adversarial domains like space systems. This article presents an overview of the most recent developments in autonomous agents built with LLMs. We categorize modern architectures, single-agent and multi-agent architectures, and their most prominent functional modules—perception, reasoning, planning, and action. We present new functionality facilitated by LLMs, including zero-shot generalization, dynamic tool use, and human-AI collaboration, and criticize their drawbacks in real-world use, e.g., hallucination, limited resources, and safety. Besides, we discuss future standards and metrics for LLM agents, including how to measure dependability and robustness in hostile environments. Lastly, we present open research challenges highlighting the necessity of stable, efficient, and robust LLM-based agents deployable in wireless, remote, and hostile environments. This survey aims to offer researchers and practitioners a brief overview of the status quo with LLM-based autonomous agents and inspire future work bridging current gaps between general-purpose language intelligence and domain-specific autonomous systems.
Agentic AI refers to autonomous systems that can perceive their environment, make decisions, and take actions to achieve goals with minimal or no human intervention. Recent advances in Large Language Models (LLMs) have opened new pathways to imbue robots with such “agentic” behaviors by leveraging the LLMs’ vast knowledge and reasoning capabilities for planning and control. This survey provides the first comprehensive exploration of LLM-based robotic systems integration into agentic behaviors that have been validated in real-world applications. We systematically categorized these systems across navigation, manipulation, multi-agent, and general-purpose multi-task robots, reflecting the range of applications explored. We introduce a novel, first-of-its-kind agenticness classification that evaluates existing LLM-driven robotic works based on their degree of autonomy, goal-directed behavior, adaptability, and decision-making. Additionally, central to our contribution is an evaluation framework explicitly addressing ethical, safety, and transparency principles—including bias mitigation, fairness, robustness, safety guardrails, human oversight, explainability, auditability, and regulatory compliance. By jointly mapping the landscape of agentic capabilities and ethical safeguards, we uncover key gaps, tensions, and design trade-offs in current approaches. We believe that this work serves as both a diagnostic and a call to action: as LLM-empowered robots grow more capable, ensuring they remain comprehensible, controllable, and aligned with societal norms is not optional—it is essential.
The rapid integration of Large Language Models (LLMs) into autonomous systems marks a significant transition from modular, rule-based approaches to reasoning-driven, agent-based, and multimodal intelligence. LLM reasoning enables adaptive decision-making, context-aware planning, and human-aligned interaction, while AI agents extend these capabilities into structured autonomy pipelines that coordinate perception, reasoning, and control. These advancements are particularly critical in safety-sensitive domains such as autonomous driving (AD) and unmanned aerial vehicles (UAVs). This survey provides a comprehensive review of LLM reasoning and AI agents across scenario generation, decision-making, multimodal perception, cooperative V2X interactions, and UAV swarm autonomy. We examine the role of simulation platforms and datasets, including CARLA, Apollo ADS, AirSim, nuScenes, DriveLM, and emerging synthetic environments, in supporting reproducible evaluation and benchmarking. In addition, we analyze pressing security and robustness challenges, including adversarial prompt injection, data poisoning, multimodal perturbations, privacy leakage, and vulnerabilities in cooperative agent communication. Finally, we propose future research directions including adversarially robust pipelines, hybrid symbolic LLM planning, secure multimodal fusion, privacy-preserving human alignment, distributed trust mechanisms for swarm autonomy, and optimized Drone-LLM deployment across on-drone, edge, and cloud environments. By unifying applications, datasets, benchmarks, reasoning, agents, and security, this survey establishes a roadmap for developing robust, trustworthy, and secure LLM-enabled autonomous systems.
… MAS systems highly extensible, allowing researchers to rapidly test workflows or simulate multi-role scenarios without extensive labelled data or retraining loops. Prompt templates thus …
Agentic AI systems use specialized agents to handle tasks within complex workflows, enabling automation and efficiency. However, optimizing these systems often requires labor-intensive, manual adjustments to refine roles, tasks, and interactions. This paper introduces a framework for autonomously optimizing Agentic AI solutions across industries, such as NLG-driven enterprise applications. The system employs agents for Refinement, Execution, Evaluation, Modification, and Documentation, leveraging iterative feedback loops powered by an LLM (Llama 3.2-3B). The framework achieves optimal performance without human input by autonomously generating and testing hypotheses to improve system configurations. This approach enhances scalability and adaptability, offering a robust solution for real-world applications in dynamic environments. Case studies across diverse domains illustrate the transformative impact of this framework, showcasing significant improvements in output quality, relevance, and actionability. All data for these case studies, including original and evolved agent codes, along with their outputs, are here: anonymous.4open.science/r/evolver-1D11/
… LLM-based autonomous agent architectures for social robots, with the overarching goal of improving both autonomous … by implementing an LLM-based autonomous agent architecture, …
… in the current research arena pertaining to LLM-based Autonomous Agents lies in the lack of … development process, AI agents create a continuous feedback loop that not only corrects …
LLM-based multi-agent systems (MAS) have shown promise in tackling complex tasks. However, existing solutions often suffer from limited agent coordination and heavy reliance on predefined Standard Operating Procedures (SOPs), which demand extensive human input. To address these limitations, we propose MegaAgent, a large-scale autonomous LLM-based multi-agent system. MegaAgent generates agents based on task complexity and enables dynamic task decomposition, parallel execution, efficient communication, and comprehensive system monitoring of agents. In evaluations, MegaAgent demonstrates exceptional performance, successfully developing a Gobang game within 800 seconds and scaling up to 590 agents in a national policy simulation to generate multi-domain policies. It significantly outperforms existing systems, such as MetaGPT, in both task completion efficiency and scalability. By eliminating the need for predefined SOPs, MegaAgent demonstrates exceptional scalability and autonomy, setting a foundation for advancing true autonomy in MAS. Our code is available at https://github.com/Xtra-Computing/MegaAgent .
Large Language Models (LLMs) are considered state of the art for many tasks in robotics and AI. At the same time, there is increasing evidence of their critical limitations such as generating arbitrary responses in new situations, inability to support rapid incremental updates based on limited examples, and opacity. Toward addressing these limitations, our architecture leverages the complementary strengths of LLMs and knowledge-based reasoning. Specifically, the architecture enables an AI agent assisting a human to use an LLM to provide generic abstract predictions of upcoming tasks. The agent also reasons with domain-specific knowledge, recent history of interactions with the human, and semantic databases to: (a) provide contextual prompts to the LLM; and (b) compute a plan of concrete actions that jointly implements the current task and prepares for the anticipated task, replanning as needed. Furthermore, the agent solicits and uses high-level human feedback based on need and availability to incrementally revise the domain-specific knowledge and interactions with the LLM. We ground and evaluate our architecture’s abilities in the realistic VirtualHome simulation environment, demonstrating a substantial performance improvement compared with just using an LLM or an LLM and logical reasoner. Project website: https://brianej.github.io/igfmrdskaa.github.io/
Large Language Models (LLMs) have demonstrated emergent common-sense reasoning and Theory of Mind (ToM) capabilities, making them promising candidates for developing coordination agents. This study introduces the LLM-Coordination Benchmark, a novel benchmark for analyzing LLMs in the context of Pure Coordination Settings, where agents must cooperate to maximize gains. Our benchmark evaluates LLMs through two distinct tasks. The first is Agentic Coordination, where LLMs act as proactive participants in four pure coordination games. The second is Coordination Question Answering (CoordQA), which tests LLMs on 198 multiple-choice questions across these games to evaluate three key abilities: Environment Comprehension, ToM Reasoning, and Joint Planning. Results from Agentic Coordination experiments reveal that LLM-Agents excel in multi-agent coordination settings where decision-making primarily relies on environmental variables but face challenges in scenarios requiring active consideration of partners' beliefs and intentions. The CoordQA experiments further highlight significant room for improvement in LLMs' Theory of Mind reasoning and joint planning capabilities. Zero-Shot Coordination (ZSC) experiments in the Agentic Coordination setting demonstrate that LLM agents, unlike RL methods, exhibit robustness to unseen partners. These findings indicate the potential of LLMs as Agents in pure coordination setups and underscore areas for improvement. Code Available at https://github.com/eric-ai-lab/llm_coordination.
Theory of Mind (ToM) refers to the ability to reason about others’ mental states, and higher-order ToM involves considering that others also possess their own ToM. Equipping large language model (LLM)-driven agents with ToM has long been considered to improve their coordination in multi-agent collaborative tasks. However, we find that misaligned ToM orders—mismatches in the depth of ToM reasoning between agents—can lead to insufficient or excessive reasoning about others, thereby impairing their coordination. To address this issue, we design an adaptive ToM (A-ToM) agent, which can align in ToM orders with its partner. Based on prior interactions, the agent estimates the partner’s likely ToM order and leverages this estimation to predict the partner’s action, thereby facilitating behavioral coordination. We conduct empirical evaluations on four multi-agent coordination tasks: a repeated matrix game, two grid navigation tasks and an Overcooked task. The results validate our findings on ToM alignment and demonstrate the effectiveness of our A-ToM agent. Furthermore, we discuss the generalizability of our A-ToM to non-LLM-based agents, as well as what would diminish the importance of ToM alignment.
A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using MARL methods for LLM collaboration and highlights the associated challenges.
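The core of GRPO-style fine-tuning referenced above is a group-relative advantage: a group of joint responses is sampled, each is scored, and rewards are normalized against the group statistics so no learned critic is needed. A minimal sketch, with illustrative reward values (not from the paper):

```python
# Group-relative advantage, the building block of GRPO-style methods such as
# MAGRPO: rewards are standardized within a sampled group, so responses that
# beat the group mean get positive advantage and the rest get negative.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mu = mean(rewards)
    sigma = stdev(rewards) or 1.0      # guard against a degenerate group
    return [(r - mu) / sigma for r in rewards]

group_rewards = [0.2, 0.8, 0.5, 0.5]   # e.g. cooperative task scores per sample
adv = group_relative_advantages(group_rewards)
print(adv[1] > 0, adv[0] < 0)  # the best sample is reinforced, the worst penalized
```

The standardized advantages sum to approximately zero within each group, which is what makes a separate value baseline unnecessary.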
The combination of multi-agent systems and Large Language Models (LLMs) has become an exciting new paradigm capable of real-time automation of complex tasks. In this paper, we introduce a framework of LLM-Guided Multi-Agent Collaboration that uses LLMs as high-level reasoning engines to coordinate, optimize, and adjust the interactions between multiple autonomous agents. The proposed framework leverages the natural language understanding and decision-support capabilities of LLMs for task decomposition, intent interpretation, multitask handling, and conflict resolution, with almost no manual engineering. Agents with domain-specific knowledge cooperate under LLM supervision to deliver task execution that is scalable, explainable, and adaptive. We evaluate the framework across the disparate domains of process automation, data-driven decision-making, and cybersecurity incident response. In the experimental results, better efficiency, resource use, and flexibility were observed compared with dedicated multi-agent models. This paper indicates the promise that LLM-based coordination may hold for turning multi-agent collaboration into a more intelligent, context-aware, and robust paradigm for automating complex tasks in the real world.
Distributed multi-agent systems are increasingly deployed in dynamic and high-stakes environments such as power grids, intelligent traffic systems, and collaborative robotics. In these systems, long-term stability, the ability to maintain coherent and safe system behavior over time, is critical but underexplored in existing research. This paper presents LLMASC, a framework designed to enhance long-term stability in multi-agent collaboration by combining semantic reasoning with decentralized control. LLMASC comprises three key components: a Semantic Perception Encoder that transforms heterogeneous agent observations into structured natural language; an LLM-Guided Consensus Decision module that enables strategic alignment through proposal exchange and voting; and a Policy Execution Controller that maps high-level plans to executable actions via reinforcement learning. We evaluate LLMASC across three representative simulation domains (Multi-Walker, Simulation of Urban Mobility and Power Grid Stabilization), spanning both physical and cyber-physical systems. Experiments show that LLMASC consistently outperforms the best baselines, improving stability rates by up to 44% and long-term success by 31%. Further analysis confirms its decision-making efficiency and robustness under varying agent populations and model choices.
Zhiwei Liu, Weiran Yao, Jianguo Zhang, Rithesh Murthy, Liangwei Yang, Zuxin Liu, Tian Lan, Ming Zhu, Juntao Tan, Shirley Kokane, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong. Proceedings of the 28th Conference on Computational Natural Language Learning. 2024.
Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To manage long-horizon interactions, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. The code and demo are available at https://github.com/RUC-NLPIR/DeepAgent.
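The memory-folding mechanism described above can be illustrated with a toy sketch: once the raw interaction history grows past a budget, earlier steps are compressed into episodic, working, and tool stores, keeping only a recent tail verbatim. The folding heuristics below are invented stand-ins for what DeepAgent does with an LLM:

```python
# Toy memory folding: compress a long interaction history into structured
# episodic / working / tool memories plus a short verbatim tail.
# All field names and heuristics are illustrative, not the paper's.

def fold(history: list[dict], max_raw: int = 4) -> dict:
    if len(history) <= max_raw:
        return {"raw": history, "episodic": [], "working": {}, "tool": {}}
    episodic = [f"step {i}: {h['action']}" for i, h in enumerate(history[:-2])]
    tool_mem: dict[str, int] = {}
    for h in history:
        if h.get("tool"):
            tool_mem[h["tool"]] = tool_mem.get(h["tool"], 0) + 1  # usage stats
    return {
        "raw": history[-2:],                  # recent tail kept verbatim
        "episodic": episodic,                 # compressed trace of earlier steps
        "working": {"goal": history[0]["action"]},
        "tool": tool_mem,
    }

history = [
    {"action": "plan trip", "tool": None},
    {"action": "search flights", "tool": "flights_api"},
    {"action": "search flights again", "tool": "flights_api"},
    {"action": "search hotels", "tool": "hotels_api"},
    {"action": "compare prices", "tool": None},
]
mem = fold(history)
print(len(mem["raw"]), mem["tool"]["flights_api"])  # 2 2
```

The point of the structure is that later reasoning conditions on the compact summary rather than the full transcript, bounding context growth over long horizons.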
Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.
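The Reflexion loop above can be sketched in a few lines: attempt, reflect verbally on failure, store the reflection in an episodic buffer, and retry with the buffer in context. The helpers below are hypothetical stand-ins for LLM calls; only the control flow reflects the method:

```python
# Reflexion-style loop: reinforcement via linguistic feedback rather than
# weight updates. `attempt` and `reflect` are toy stand-ins for LLM calls.

def attempt(task: str, memory: list[str]) -> tuple[str, bool]:
    """Stand-in for an LLM attempt; a real agent conditions on `memory`."""
    answer = f"answer to {task} given {len(memory)} reflections"
    success = len(memory) >= 2  # toy success criterion for illustration
    return answer, success

def reflect(task: str, answer: str) -> str:
    """Stand-in for the self-reflection step (free-form language feedback)."""
    return f"On '{task}', '{answer}' failed; try a different decomposition."

def reflexion_loop(task: str, max_trials: int = 4) -> tuple[str, list[str]]:
    episodic_memory: list[str] = []   # persists across trials
    for _ in range(max_trials):
        answer, ok = attempt(task, episodic_memory)
        if ok:
            return answer, episodic_memory
        episodic_memory.append(reflect(task, answer))
    return answer, episodic_memory

answer, memory = reflexion_loop("sort a list")
print(len(memory))  # reflections accumulated before the toy task "succeeds"
```

Note that no parameters change between trials; all improvement comes from the growing reflective text in the buffer, which is what makes the approach cheap compared with fine-tuning.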
Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner's tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.
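Executing a global DAG plan of the kind the Planner emits amounts to running tool calls in topological order, feeding each node its predecessors' outputs. A minimal sketch, with an invented three-tool plan (the tool names and graph are illustrative, not from the paper):

```python
# Executing a DAG tool plan: nodes are tool calls, edges are data
# dependencies; each tool runs once all of its inputs are resolved.
from graphlib import TopologicalSorter

def run_dag(plan: dict[str, set[str]], tools: dict) -> dict[str, str]:
    """`plan` maps node -> set of prerequisite nodes (graphlib convention)."""
    results: dict[str, str] = {}
    for node in TopologicalSorter(plan).static_order():  # dependency order
        deps = {d: results[d] for d in plan.get(node, ())}
        results[node] = tools[node](deps)                # run with inputs
    return results

tools = {
    "search":    lambda deps: "doc-123",
    "fetch":     lambda deps: f"text of {deps['search']}",
    "summarize": lambda deps: f"summary({deps['fetch']})",
}
plan = {"search": set(), "fetch": {"search"}, "summarize": {"fetch"}}
out = run_dag(plan, tools)
print(out["summarize"])  # summary(text of doc-123)
```

Because the whole graph is fixed up front, independent branches could also be dispatched in parallel, which is the efficiency argument for global planning over incremental ReAct-style decisions.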
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. However, developing prompting techniques that enable LLM agents to effectively use these tools and knowledge remains a heuristic and labor-intensive task. Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task. During optimization, we design a comparator module to iteratively deliver insightful and comprehensive prompts to the LLM agent by contrastively reasoning between positive and negative examples sampled from training data. We demonstrate AvaTaR on four complex multimodal retrieval datasets featuring textual, visual, and relational information, and three general question-answering (QA) datasets. We find AvaTaR consistently outperforms state-of-the-art approaches across all seven tasks, exhibiting strong generalization ability when applied to novel cases and achieving an average relative improvement of 14% on the Hit@1 metric for the retrieval datasets and 13% for the QA datasets. Code and dataset are available at https://github.com/zou-group/avatar.
Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks; however, they are still far inferior to API-based models when acting as agents. How to integrate agent ability into general LLMs becomes a crucial and urgent problem. This paper first delivers three key observations: (1) the current agent training corpus is entangled with both formats following and agent reasoning, which significantly shifts from the distribution of its pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches have side-effects when improving agent abilities by introducing hallucinations. Based on the above findings, we propose Agent-FLAN to effectively Fine-tune LANguage models for Agents. Through careful decomposition and redesign of the training corpus, Agent-FLAN enables Llama2-7B to outperform prior best works by 3.5% across various agent evaluation datasets. With comprehensively constructed negative samples, Agent-FLAN greatly alleviates the hallucination issues based on our established evaluation benchmark. Besides, it consistently improves the agent capability of LLMs when scaling model sizes while slightly enhancing the general capability of LLMs. The code and model are available at https://github.com/InternLM/Agent-FLAN.
… enable LLMs to access external knowledge, but there remain challenges for fine-tuned LLM agents (eg, Toolformer [108]) to invoke tools in multi-step reasoning tasks, where inter-…
Augmenting large language models (LLMs) with external tools has emerged as a promising approach to extend their utility, enabling them to solve practical tasks. Previous methods manually parse tool documentation and create in-context demonstrations, transforming tools into structured formats for LLMs to use in their step-by-step reasoning. However, this manual process requires domain expertise and struggles to scale to large toolsets. Additionally, these methods rely heavily on ad-hoc inference techniques or special tokens to integrate free-form LLM generation with tool-calling actions, limiting the LLM's flexibility in handling diverse tool specifications and integrating multiple tools. In this work, we propose AutoTools, a framework that enables LLMs to automate the tool-use workflow. Specifically, the LLM automatically transforms tool documentation into callable functions, verifying syntax and runtime correctness. Then, the LLM integrates these functions into executable programs to solve practical tasks, flexibly grounding tool-use actions into its reasoning processes. Extensive experiments on existing and newly collected, more challenging benchmarks illustrate the superiority of our framework. Inspired by these promising results, we further investigate how to improve the expertise of LLMs, especially open-source LLMs with fewer parameters, within AutoTools. Thus, we propose the AutoTools-Learning approach, training the LLMs with three learning tasks on 34k instances of high-quality synthetic data, including documentation understanding, relevance learning, and function programming. Fine-grained results validate the effectiveness of our overall training approach and each individual task. Our methods are an important step towards the use of LLMs for solving real-world tasks with external tools.
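The doc-to-function step in AutoTools can be illustrated concretely: an LLM emits Python source from a tool's documentation, and the framework verifies it before use. Below, one such emission is hard-coded (the tool, its fields, and the generated wrapper are all invented), and the verification is the interesting part: a syntax check, loading into a fresh namespace, and a runtime smoke test:

```python
# Sketch of the verify-then-load step for an LLM-generated tool wrapper.
# GENERATED stands in for what the LLM would emit from the tool's docs.

GENERATED = '''
def get_weather(city: str, unit: str = "celsius") -> dict:
    """Callable wrapper generated from a (hypothetical) weather API's docs."""
    # A real wrapper would issue an HTTP request; stubbed for the sketch.
    return {"city": city, "temp": 21, "unit": unit}
'''

def verify_and_load(source: str, fn_name: str):
    compile(source, "<generated>", "exec")   # syntax check fails loudly
    namespace: dict = {}
    exec(source, namespace)                  # load into an isolated namespace
    fn = namespace[fn_name]
    fn("TestCity")                           # runtime smoke test
    return fn

get_weather = verify_and_load(GENERATED, "get_weather")
print(get_weather("Leiden")["temp"])  # 21
```

Once verified, such functions can be composed in ordinary executable programs, which is how the framework grounds tool-use actions directly in the model's reasoning rather than in special tokens.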
Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang. Findings of the Association for Computational Linguistics: NAACL 2025. 2025.
Tool learning has emerged as a promising direction for extending Large Language Models' (LLMs) capabilities with external tools. Existing tool learning studies primarily focus on the general-purpose tool-use capability, which addresses explicit requirements in user instructions. However, they overlook the importance of personalized tool-use capability, leading to an inability to meet personalized user needs. To address this limitation, we first formulate the task of personalized tool learning, which integrates the user's interaction history towards personalized tool usage. To fill the gap of missing benchmarks, we construct PEToolBench, featuring diverse user preferences reflected in interaction history under three distinct personalized settings and encompassing a wide range of tool-use scenarios. Moreover, we propose a framework, PEToolLLaMA, to adapt LLMs to the personalized tool learning task, which is trained through supervised fine-tuning and direct preference optimization. Extensive experiments on PEToolBench demonstrate the superiority of PEToolLLaMA over existing LLMs. We release our code and data at https://github.com/travis-xu/PEToolBench.
Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, Deqing Yang. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025.
Tool-augmented Large Language Models (LLMs) provide a robust theoretical foundation for AI agents, with the generation of reasoning plans being a crucial stage. Previous methods for generating reasoning plans primarily rely on In-Context Learning (ICL) or Supervised Fine-Tuning (SFT). However, methods based on ICL struggle with accurately utilizing tools, while those based on SFT face challenges in adapting to new toolsets and tasks. To address these issues, we introduce ToolFiVe, a general, plug-and-play, self-correction-based framework for leveraging specialized tools in compositional reasoning tasks. ToolFiVe redesigns the process of reasoning plan generation and integrates the tool-execution stage. Specifically, in the reasoning plan generation stage, ToolFiVe employs a filtering module to exclude task-irrelevant tools, creating a candidate toolset. ToolFiVe then constructs prompts using the candidate toolset and iteratively generates and refines the reasoning plan. Subsequently, a verification module evaluates the completeness of the reasoning plan, providing feedback for the self-correction loop. Ultimately, the reasoning plan developed during this stage guides the execution of the tools. Extensive experiments demonstrate that ToolFiVe outperforms other state-of-the-art (SOTA) methods, highlighting the significance of reasoning plan generation for general tool-augmented LLMs.
Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Pengjie Ren, Suzan Verberne, Zhaochun Ren. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024.
Large Language Model (LLM) agents have emerged as powerful tools for automating complex tasks by leveraging the reasoning and decision-making abilities of LLMs. However, a major bottleneck in current agent frameworks lies in the high inference cost of tool selection, especially in approaches like ReAct that repeatedly invoke the LLM to determine which tool to use at each step. In this work, we propose AutoTool, a novel graph-based framework that bypasses repeated LLM inference by exploiting a key empirical observation: tool usage inertia—the tendency of tool invocations to follow predictable sequential patterns. AutoTool constructs a directed graph from historical agent trajectories, where nodes represent tools and edges capture transition probabilities, effectively modeling the inertia in tool selection. It further integrates parameter-level information to refine tool input generation. By traversing this structured representation, AutoTool efficiently selects tools and their parameters with minimal reliance on LLM inference. Extensive experiments across diverse agent tasks demonstrate that AutoTool reduces inference costs by up to 30% while maintaining competitive task completion rates, offering a practical and scalable enhancement for inference-heavy frameworks. Our work highlights the promise of integrating statistical structure into LLM agent design for greater efficiency without sacrificing performance.
Despite their powerful text generation capabilities, large language models (LLMs) still struggle to effectively utilize external tools to solve complex tasks, a challenge known as tool learning. Existing methods primarily rely on supervised fine-tuning, treating tool learning as a text generation problem while overlooking the decision-making complexities inherent in multi-step contexts. In this work, we propose modeling tool learning as a dynamic decision-making process and introduce StepTool, a novel step-grained reinforcement learning framework that enhances LLMs' capabilities in multi-step tool use. StepTool comprises two key components: Step-grained Reward Shaping, which assigns rewards to each tool interaction based on its invocation success and contribution to task completion; and Step-grained Optimization, which applies policy gradient methods to optimize the model across multiple decision steps. Extensive experiments across diverse benchmarks show that StepTool consistently outperforms both SFT-based and RL-based baselines in terms of task Pass Rate and Recall of relevant tools. Furthermore, our analysis suggests that StepTool helps models discover new tool-use strategies rather than merely re-weighting prior knowledge. These results highlight the importance of fine-grained decision modeling in tool learning and establish StepTool as a general and robust solution for enhancing multi-step tool use in LLMs. Code and data are available at https://github.com/yuyq18/StepTool.
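Step-grained reward shaping, as described for StepTool, gives each tool call its own reward from invocation success plus a contribution term, instead of one sparse end-of-task signal. A toy sketch; the weights and contribution scores below are illustrative, not the paper's:

```python
# Step-grained rewards: one reward per tool call, combining invocation
# success with a contribution estimate, plus a terminal completion bonus.

def step_rewards(steps: list[dict], task_completed: bool,
                 w_succ: float = 0.5, w_contrib: float = 0.5) -> list[float]:
    rewards = []
    for step in steps:
        r = w_succ * (1.0 if step["invocation_ok"] else -1.0)
        r += w_contrib * step["contribution"]  # in [0, 1], e.g. LLM-judged
        rewards.append(r)
    if rewards:                                # terminal bonus on final step
        rewards[-1] += 1.0 if task_completed else 0.0
    return rewards

trajectory = [
    {"tool": "search_flights", "invocation_ok": True,  "contribution": 0.8},
    {"tool": "book_hotel",     "invocation_ok": False, "contribution": 0.0},
    {"tool": "book_hotel",     "invocation_ok": True,  "contribution": 0.9},
]
print(step_rewards(trajectory, task_completed=True))
```

The failed second call receives a negative reward even though the task ultimately succeeds, which is exactly the credit-assignment signal a trajectory-level reward cannot provide to a policy-gradient optimizer.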
The rise of composable AI agents has transformed automation orchestration across diverse domains. Rather than relying on monolithic pipelines, these agents enable modular workflows using routing, memory, and tool APIs for real-time task execution. This paper introduces a Composable AI Stack demonstrated through marketing automation, with generalizability to finance, healthcare, industrial automation, and intelligent transportation. The architecture employs memory-augmented agents, context-aware routing, and API function-calling to manage insights, personalization, and performance optimization. We present comprehensive ablation studies isolating each module’s contribution, scalability benchmarks for enterprise deployment, and governance considerations for responsible AI. Experimental evaluation confirms significant improvements: CTR +38%, CPA -24%, and ROAS +48% over monolithic alternatives, with zero system failures compared to 9% crash rate in baseline systems.
Interactive applications are latency-sensitive systems that enable dynamic responses to user inputs in domains such as robotics, industrial automation, and autonomous control. These applications require efficient application protocols for communication, with the pub/sub model being one of the most promising approaches. However, existing pub/sub systems are architecturally constrained, particularly by limited memory capacity and inefficiencies in dynamic environments. Addressing these challenges requires effective distributed memory management, yet this aspect has received limited attention in existing research. This paper addresses the gap by proposing MemIndex, an adaptive and autonomous distributed memory-management framework with an intent-indexed bipartite graph architecture. It is designed for LM-based multi-agent pub/sub systems, enabling agents to autonomously negotiate memory operations in real time through dynamic index spaces for efficient reasoning. We evaluate our proposed MemIndex using diverse models against two baselines. Experimental results show MemIndex outperforms both baselines across storage, retrieval, update, and deletion operations, achieving average reductions of about 34% and 56% in elapsed time, 57% and 75% in CPU utilization, and 23% and 76% in memory usage. Scalability tests further demonstrate that MemIndex maintains low end-to-end delay as submissions and agents grow, confirming that its negotiation-driven offloading enables efficient distributed memory management in interactive applications.
However, as large language models (LLMs) continue to act as self-contained agents more than ever before, there is an increasing demand for standard protocols to provide a smooth bridge between these tools and outside bits of equipment. The Model Context Protocol (MCP) is a lightweight, extensible architecture that describes structured communication loops between agents, tools and controllers. We present here an overall conception of MCP’s design philosophy, communication processes, and levels of system modularity. We also examine several public MCP server implementations, analyzing their tool orchestration schemes and ascertain their adaptability in several different environments. Through this theoretical and practical endeavor, MCP is shown to offer a robust foundation for constructing scalable, transparent systems with extensions. Advanced LLM agents We also explore current limitations, such as prompt dependency and lack of standard benchmarks. Moreover, we point out future areas to enhance protocol security and memory management, as well as to build in support for multiple languages.
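The communication loop MCP structures can be illustrated with a toy, in-process sketch. Real MCP uses JSON-RPC 2.0 over a transport; here the "server" is a function and its single `add` tool is invented, but the discover-then-call shape is the protocol's core:

```python
# Toy MCP-style loop: the agent lists the server's tools, then invokes one.
# The in-process server and its tool registry are illustrative stand-ins.
import json

TOOLS = {"add": {"description": "Add two integers",
                 "fn": lambda args: args["a"] + args["b"]}}

def server_handle(raw: str) -> str:
    req = json.loads(raw)
    if req["method"] == "tools/list":
        result = [{"name": n, "description": t["description"]}
                  for n, t in TOOLS.items()]
    elif req["method"] == "tools/call":
        result = TOOLS[req["params"]["name"]]["fn"](req["params"]["arguments"])
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

# Agent side: discover the available tools, then invoke one of them.
listing = json.loads(server_handle(json.dumps(
    {"jsonrpc": "2.0", "id": 1, "method": "tools/list"})))
call = json.loads(server_handle(json.dumps(
    {"jsonrpc": "2.0", "id": 2, "method": "tools/call",
     "params": {"name": "add", "arguments": {"a": 2, "b": 3}}})))
print(listing["result"][0]["name"], call["result"])  # add 5
```

Because discovery is itself a protocol message, an agent can adapt to whatever tools a server exposes at runtime, which is the modularity argument the survey develops.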
… memory modules play a critical role here by projecting agent … cognitive architecture and an agent framework developed as part … regarding third-party tools like Docker and RabbitMQ, are …
… , co-create evolving conceptual frameworks, and maintain … display tools but active semantic mediators that shape agent … memory or semantic reasoning, our Virtual Cognitive Agents (…
… AMs are a tool that guides players and analyses progress. In such a way, … Although BDI agents are very powerful they do not include memory and emotion in the decision making …
In modern software development, maintaining clean, efficient, and reliable code is critical to team productivity and product quality. This paper introduces a modular Large Language Model (LLM)-based agent, designed using Microsoft’s Semantic Kernel framework, for automated code review and refactoring feedback. The agent leverages plugin-based function orchestration, Retrieval-Augmented Generation (RAG), and dynamic prompt engineering to analyze source code across multiple dimensions, including readability, efficiency, security, and adherence to best practices. Integrated into CI/CD pipelines and broader SDLC workflows, the system provides contextual insights, suggests specific improvements, and explains the reasoning for each recommendation. Evaluation results across real-world open-source repositories demonstrate the agent’s effectiveness in reducing human review time while improving refactor quality. The modular design ensures adaptability to various programming languages and enterprise development environments. This research highlights the potential of agentic LLM systems to augment software engineering workflows with intelligent, transparent, and developer-aligned feedback mechanisms. Keywords: Code Review, Semantic Kernel, Plugin Orchestration, Refactoring, Large Language Models, Agentic AI, Retrieval-Augmented Generation, Prompt Engineering
Large language models (LLMs) show promising performance on small-scale graph reasoning tasks but fail when handling real-world graphs with complex queries. This phenomenon arises from LLMs' working memory constraints, which result in their inability to retain long-range graph topology over extended contexts while sustaining coherent multi-step reasoning. Moreover, real-world graphs are often structurally complex, such as Web, Transportation, Social, and Citation networks. To address these limitations, we propose GraphCogent, a collaborative agent framework inspired by the human Working Memory Model that decomposes graph reasoning into specialized cognitive processes: sense, buffer, and execute. The framework consists of three modules: the Sensory Module standardizes diverse graph text representations via subgraph sampling, the Buffer Module integrates and indexes graph data across multiple formats, and the Execution Module combines tool calling and tool creation for efficient reasoning. We also introduce Graph4real, a comprehensive benchmark that contains four domains of real-world graphs (Web, Transportation, Social, and Citation) to evaluate LLMs' graph reasoning capabilities. Graph4real covers 21 different graph reasoning tasks, categorized into three types (Structural Querying, Algorithmic Reasoning, and Predictive Modeling), with graph scales up to 10 times larger than existing benchmarks. Experiments show that Llama3.1-8B based GraphCogent achieves a 50% improvement over massive-scale LLMs like DeepSeek-R1 (671B). Compared to the state-of-the-art code-based baseline, our framework improves accuracy by 20% while reducing token usage by 80% for in-toolset tasks and 30% for out-toolset tasks.
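The sense-buffer-execute decomposition can be illustrated with a toy pipeline: normalize an edge-list text into pairs (sense), index it as an adjacency map (buffer), and answer a structural query with a classical-algorithm "tool" (execute). All function names here are illustrative sketches of the three-module idea, not GraphCogent's actual implementation.

```python
from collections import defaultdict, deque

def sense(graph_text: str) -> list:
    """Sensory step: normalize an edge-list text representation into pairs."""
    return [tuple(line.split()) for line in graph_text.strip().splitlines()]

def buffer_index(edges: list) -> dict:
    """Buffer step: index edges as an undirected adjacency map for reuse."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def execute_shortest_path(adj: dict, src: str, dst: str) -> int:
    """Execute step: answer a structural query by calling a BFS 'tool'."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return -1  # unreachable

adj = buffer_index(sense("a b\nb c\nc d\na d"))
hops = execute_shortest_path(adj, "a", "c")
```

Keeping the indexed adjacency map outside the LLM's context is the point: the model only decides which tool to call on the buffered structure, instead of retaining long-range topology in its working memory.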
Healthcare data exchange increasingly relies on HL7 FHIR, but FHIR's implementation complexity creates barriers for clinical workflows. Large language model (LLM) agents could bridge this gap by translating natural language requests into structured FHIR operations, yet their reliability remains unproven. We present FHIR-AgentEval, an extensible evaluation sandbox comprising 43 modular tasks for benchmarking LLM agents on realistic appointment management and genetic testing workflows. Each task executes against a resettable FHIR server with custom deterministic validation of both agent responses and resulting server state. We run an ablation study of five agent configurations, varying access to an on-demand FHIR R4 specifications server and long-term memory trained with or without specification grounding. Across four experimental settings, memory consistently improves task success and reduces strategic failures such as incorrect tool selection and resource-type confusion. On held-out tasks, the best memory configuration improves success by 9.1% over baseline, offering a potential pathway toward more robust clinical deployment.
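The evaluation loop described above, with a resettable server and deterministic validation of both the agent's answer and the resulting server state, can be sketched as follows. The `FakeFhirServer`, the `Appointment/1` fixture, and the "booked" check are hypothetical stand-ins, not FHIR-AgentEval's actual harness or a real FHIR client.

```python
class FakeFhirServer:
    """Stand-in for a resettable FHIR server used in a task sandbox."""
    def __init__(self):
        self.reset()

    def reset(self):
        # Restore the fixture state every task starts from.
        self.resources = {"Appointment/1": {"status": "proposed"}}

    def update(self, ref: str, patch: dict):
        self.resources[ref].update(patch)

def run_task(server, agent) -> bool:
    """Deterministic validation: check both the agent's textual answer
    and the resulting server state, starting from a clean reset."""
    server.reset()
    answer = agent(server)
    return (answer == "booked"
            and server.resources["Appointment/1"]["status"] == "booked")

def toy_agent(server):
    # A trivially correct agent for this one task.
    server.update("Appointment/1", {"status": "booked"})
    return "booked"

passed = run_task(FakeFhirServer(), toy_agent)
```

Checking server state as well as the response is what catches strategic failures such as an agent that reports success without ever mutating the right resource.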
The merged, unified grouping organizes the latest agent research into parallel thematic sections along one main thread: capability acquisition (tool learning/training), system implementation (planning and orchestration, modular protocols, end-to-end loops), long-term usability (memory and working memory), complex-task capability (multi-agent collaboration, hybrid cognitive reasoning), and governance with engineering deployment (framework surveys, safety evaluation in real-world applications). Overall it covers the systematic evolution from algorithmic methods and evaluation benchmarks to engineering deployment, avoids lumping mechanisms of different granularity (e.g., tool learning, planning and orchestration, memory management, end-to-end loops) into one coarse group, and preserves the distinctive focus of each literature cluster.