agent，ai，llm - Acadwrite

agent，ai，llm

本报告对 Agent、AI 及 LLM 研究进行了深度整合与分类，构建了涵盖底层理论、应用开发与安全治理的综合框架。研究呈现出明显的范式转变：从基础的 Agentic RAG 框架进化至复杂的科学自动化工作流；从单一的任务执行向多智能体协同协作演进；从通用架构设计走向各行业（如生物医药、工业运维、社会科学）的垂直落地。同时，随着智能体应用复杂性的提升，针对隐私安全、行为验证及人机信任评估的研究已成为不可或缺的基石，标志着 AI Agent 正步入高度自主与可控并行发展的工业化应用阶段。

共 239 篇文献，6 个研究方向

代理式检索增强生成 (Agentic RAG) 与知识集成架构

这些文献系统性地探讨了将自主代理逻辑融入 RAG 流程，包括动态检索、推理增强、图知识库结合以及多模态知识处理，旨在解决传统 RAG 静态响应的局限性。相关文献: Yongqi Leng et. al, 2025 等 49 篇文献

通用智能体架构、规划范式与评估方法

该类文献关注智能体的核心基础设计，包括ReAct范式、任务规划、工具调用、轨迹校准及衡量智能体能力的标准化评估基准研究。相关文献: Shunyu Yao et. al, 2022 等 49 篇文献

科学发现、医疗与生物工程中的智能体自动化

这些文献研究智能体在科研自动化、临床决策支持、药物发现、生物信息分析等专业领域中的应用，强调自动化工作流对科学探索的赋能。相关文献: Meng Xiao et. al, 2025 等 55 篇文献

工业运维、基础设施治理与企业级自动化

该组文献集中在智能体于企业 IT、云服务、工业自动化、DevOps 及网络安全等场景的部署，探讨了如何通过策略协同实现工业流程监控与安全治理。相关文献: Vamsi Krishna et. al, 2025 等 32 篇文献

社会认知、人机交互与具身感知智能

这些文献探讨智能体在社会模拟、教育辅导、城市交通、具身智能及人机协作中的应用，重点关注用户体验、意图理解及动态适应性。相关文献: Jiawei Wang et. al, 2024 等 42 篇文献

代理安全、隐私与信任评估

专门针对智能体运行时的安全挑战，包括隐私风险提取、对抗攻击防御、恶意行为监测、信任验证以及在复杂基础设施下的安全性保障。相关文献: Liang-bo Ning et. al, 2024 等 12 篇文献

总计242篇相关文献

基于大模型的AI通话分析智能体研究与实现 - 汉斯出版社

www.hanspub.org-Unknown Authors-Unknown Journal

该智能体依托联通云犀平台与元景大模型，构建了“智能体实时动态调度CoE (Collaboration of Experts)”引擎和AI通话分析智能体。CoE引擎通过任务规划与多模型混合调度机制， ...

安装插件收集

基于Coze平台与DeepSeek大模型的智能辅助教学智能体 - 汉斯出版社

www.hanspub.org-Unknown Authors-Unknown Journal

本系统是基于国产的DeepSeek大模型，结合Coze平台，采用了“AI + 教育专家”双轮驱动的手段构建了一套面向教育的教学辅助智能体。智能体设计了四大核心功能模块：智能备课、 ...

安装插件收集

基于大语言模型的教育智能体个性化学习应用理论研究 - 汉斯出版社

www.hanspub.org-Unknown Authors-Unknown Journal

大模型驱动的教育智能体以多模态信息感知、智能推理决策和动态执行作为其核心技术，形成了“感知–决策–行动”的逻辑闭环。其中，大语言模型是教育智能体推理能力的核心，使其 ...

安装插件收集

基于大语言模型的财务报告指标抽取智能体方法 - 汉斯出版社

www.hanspub.org-Unknown Authors-Unknown Journal

在自建的银行年报问答数据集(BAR-QA)上的实验结果表明，LedgerLens在指标抽取任务中取得了94.1%的F1分数，并在大多数任务上实现了领先表现。研究结果证明，引入基于智能体的 ...

安装插件收集

矿业工程专业专门用途英语大模型智能体“Mining Lingua”的研发及其 ...

www.hanspub.org-Unknown Authors-Unknown Journal

基于大语言模型的智能体(以下简称“大模型智能体”)具有检索增强生成、推理与规划、交互与进化等核心功能[7]，可有效避免通用大模型常见的“幻觉”、短期记忆等问题，通过调用 ...

安装插件收集

构建面向AGI时代的开源IoT AI智能体架构与实践 - 汉斯出版社

www.hanspub.org-Unknown Authors-Unknown Journal

在面向AGI的AIoT系统中，大语言模型(LLM)作为核心的认知与推理引擎，为AI智能体提供语义理解、任务规划、自然交互等高阶能力。考虑到不同应用场景对延迟、算力和网络的要求 ...

安装插件收集

基于大模型技术的个性化外语学习系统

www.hanspub.org-Unknown Authors-Unknown Journal

大模型智能体功能测试用例. 大模型智能体功能是系统的关键模块，主要包括智能体人设及回复逻辑、工作流、插件和知识库检索。下面测试用例覆盖了模块的全部功能，确保智能 ...

安装插件收集

基于大语言模型的银行智能语音助手 - 汉斯出版社

www.hanspub.org-Unknown Authors, 2025-Unknown Journal

前端交互层为客户端应用，后端为以大语言模型为基础构建的银行智能语音智能体，主要包括语音交互引擎、对话引擎(本地化LLM)、银行知识库、银行业务系统。

安装插件收集

多模态大模型驱动的空间智能:技术进展,评估体系与未来挑战

www.hanspub.org-Unknown Authors-Unknown Journal

本文系统梳理了多模态大模型在三维视觉理解,空间感知与推理,具身交互等方面的技术演进路径,重点分析了以视频,深度图,点云等多源异构数据为基础的空间表征方法,并归纳了当前 ...

安装插件收集

AI智能体在国内医学教育领域的研究进展 - 汉斯出版社

www.hanspub.org-Unknown Authors-Unknown Journal

健康教育领域基于大语言模型的多智能体系统通过知识普及智能体、健康计划智能体、沟通协调智能体等模块协同工作，实现患者教育全流程智能管理[21]。老年尿路结石患者 ...

安装插件收集

基于智能体的在地化农产品品牌视觉设计策略研究 - 汉斯出版社

www.hanspub.org-Unknown Authors-Unknown Journal

本研究中的“智能体”指基于多模态大模型的人工智能系统，具备语义理解、视觉生成及用户反馈学习功能。与传统生成式AI不同，智能体可在循环交互中持续调整输出。然而，其在深层 ...

安装插件收集

英语教育智能体：协作学习的设计与赋能

www.hanspub.org-Unknown Authors-Unknown Journal

生成式人工智能(GenAI)与大语言模型(LLMs)推动英语教育从CALL向ICALL深度变革，英语教育智能体(EEA)为缓解协作学习交互浅表化、反馈滞后等痛点提供技术解决方案。

安装插件收集

城市AI智能体与数字孪生指挥技术研究及应用 - 汉斯出版社

www.hanspub.org-Unknown Authors-Unknown Journal

摘要: 针对传统指挥系统在复杂应急场景下面临的信息孤岛化、人工依赖度高、智能决策与物理执行脱节等问题，本文提出并构建了一种“AI智能体具身孪生指挥系统”。

安装插件收集

基于AI Agent与Jobs-to-Be-Done理论框架的研究 - 汉斯出版社

www.hanspub.org-Unknown Authors-Unknown Journal

本文基于AI Agent与Jobs-to-Be-Done理论框架,探讨数字经济时代电子商务用户决策的经济逻辑.研究针对传统用户行为分析方法的局限性,创新性构建融合行为经济学与智能技术 ...

安装插件收集

生成式人工智能数据安全风险及刑法应对 - 汉斯出版社

www.hanspub.org-Unknown Authors-Unknown Journal

... 智能体能够自主决定是否实施某种行为的能力。在判断时可以借鉴人类行为能力的规定，结合人工智能体的技术特点进行综合考量。例如，通过模拟不同的场景和情境，观察人工智能体 ...

安装插件收集

基于精神科护理伦理冲突事件学习微场景专属智能体的开发研究

www.hanspub.org-Unknown Authors-Unknown Journal

摘要: 目的：开发基于智能体技术的“伦理沙盒”模拟训练系统，以解决精神科护理中常见的伦理冲突问题，切实提高护理专业学生的伦理决策能力。方法：系统构建结构化精神科伦理冲突 ...

安装插件收集

Unveiling Privacy Risks in LLM Agent Memory

揭示LLM代理内存中的隐私风险

Bo Wang, Weiyi He, Pengfei He 等, 2025-Annual Meeting of the Association for Computational Linguistics

Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications. They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations, introducing new privacy risks for LLM agents. In this work, we systematically investigate the vulnerability of LLM agents to our proposed Memory EXTRaction Attack (MEXTRA) under a black-box setting. To extract private information from memory, we propose an effective attacking prompt design and an automated prompt generation method based on different levels of knowledge about the LLM agent. Experiments on two representative agents demonstrate the effectiveness of MEXTRA. Moreover, we explore key factors influencing memory leakage from both the agent designer's and the attacker's perspectives. Our findings highlight the urgent need for effective memory safeguards in LLM agent design and deployment.

安装插件收集

被引 52

OrcaLoca: An LLM Agent Framework for Software Issue Localization

OrcaLoca：一种用于软件问题定位的LLM代理框架

Zhongming Yu, Hejia Zhang, Yujie Zhao 等, 2025-International Conference on Machine Learning

Recent developments in Large Language Model (LLM) agents are revolutionizing Autonomous Software Engineering (ASE), enabling automated coding, problem fixes, and feature improvements. However, localization -- precisely identifying software problems by navigating to relevant code sections -- remains a significant challenge. Current approaches often yield suboptimal results due to a lack of effective integration between LLM agents and precise code search mechanisms. This paper introduces OrcaLoca, an LLM agent framework that improves accuracy for software issue localization by integrating priority-based scheduling for LLM-guided action, action decomposition with relevance scoring, and distance-aware context pruning. Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.

安装插件收集

被引 32

OpenFOAMGPT: A retrieval-augmented large language model (LLM) agent for OpenFOAM-based computational fluid dynamics

OpenFOAMGPT：一种基于OpenFOAM的计算流体动力学检索增强大型语言模型（LLM）代理

Sandeep Pandey, Ran Xu, Wenkang Wang 等, 2025-Physics of Fluids2区IF 4.3

This work presents a large language model (LLM)-based agent OpenFOAMGPT tailored for OpenFOAM-centric computational fluid dynamics (CFD) simulations, leveraging two foundation models from OpenAI: the GPT-4o (GPT means Generative Pre-trained Transformer) and a chain-of-thought–enabled o1 preview model. Both agents demonstrate success across multiple tasks. While the price of token with o1 model is six times as that of GPT-4o, it consistently exhibits superior performance in handling complex tasks, from zero-shot/few-shot case setup to boundary condition modifications, zero-shot turbulence model adjustments, and zero-shot code translation. Through an iterative correction loop, the agent efficiently addressed single-phase and multiphase flow, heat transfer, Reynolds-averaged Navier–Stokes modeling, large eddy simulation, and other engineering scenarios, often converging in a limited number of iterations at low token costs. To embed domain-specific knowledge, we employed a retrieval-augmented generation pipeline, demonstrating how preexisting simulation setups can further specialize the agent for subdomains such as energy and aerospace. Despite the great performance of the agent, human oversight remains crucial for ensuring accuracy and adapting to shifting contexts. Fluctuations in model performance over time suggest the need for monitoring in mission-critical applications. Although our demonstrations focus on OpenFOAM, the adaptable nature of this framework opens the door to developing LLM-driven agents into a wide range of solvers and codes. By streamlining CFD simulations, this approach has the potential to accelerate both fundamental research and industrial engineering advancements.

安装插件收集

被引 44

RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing

RepoAudit：一种用于仓库级代码审计的自主LLM智能体

Jinyao Guo, Chengpeng Wang, Xiangzhe Xu 等, 2025-International Conference on Machine Learning

Code auditing is the process of reviewing code with the aim of identifying bugs. Large Language Models (LLMs) have demonstrated promising capabilities for this task without requiring compilation, while also supporting user-friendly customization. However, auditing a code repository with LLMs poses significant challenges: limited context windows and hallucinations can degrade the quality of bug reports, and analyzing large-scale repositories incurs substantial time and token costs, hindering efficiency and scalability. This work introduces an LLM-based agent, RepoAudit, designed to perform autonomous repository-level code auditing. Equipped with agent memory, RepoAudit explores the codebase on demand by analyzing data-flow facts along feasible program paths within individual functions. It further incorporates a validator module to mitigate hallucinations by verifying data-flow facts and checking the satisfiability of path conditions associated with potential bugs, thereby reducing false positives. RepoAudit detects 40 true bugs across 15 real-world benchmark projects with a precision of 78.43%, requiring on average only 0.44 hours and $2.54 per project. Also, it detects 185 new bugs in high-profile projects, among which 174 have been confirmed or fixed. We have open-sourced RepoAudit at https://github.com/PurCL/RepoAudit.

安装插件收集

被引 41

PaSa: An LLM Agent for Comprehensive Academic Paper Search

PaSa：一种基于大型语言模型的全面学术论文搜索代理

Yichen He, Guanhua Huang, Peiyuan Feng 等, 2025-Annual Meeting of the Association for Computational Linguistics

We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholar queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.

安装插件收集

被引 47

UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design

UXAgent：一种基于大型语言模型代理的网页设计可用性测试框架

Yuxuan Lu, Bingsheng Yao, Hansu Gu 等, 2025-Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems

Usability testing is a fundamental yet challenging research method for user experience (UX) researchers to evaluate a web design. Recent advances in Large Language Model-simulated Agent (LLM Agent) research inspired us to design UXAgent to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human-subject study. Our system features an LLM Agent module and a universal browser connector module so that UX researchers can automatically generate thousands of simulated users to test the target website. The system can generate UX study results in qualitative (e.g., interviewing how an agent thinks), quantitative (e.g., # of actions), and video recording formats for UX researchers to analyze. Through a heuristic user evaluation with five UX researchers, participants praised the innovation of our system but also expressed concerns about the future of UX study with LLM Agents1.

安装插件收集

被引 35

STeCa: Step-level Trajectory Calibration for LLM Agent Learning

STeCa：基于步骤级轨迹校准的LLM代理学习

Hanlin Wang, Jian Wang, Chak Tou Leong 等, 2025-Annual Meeting of the Association for Computational Linguistics

Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle to address long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. We finally leverage these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.

安装插件收集

被引 19

Spec2RTL-Agent: Automated Hardware Code Generation from Complex Specifications Using LLM Agent Systems

Spec2RTL-Agent：基于LLM代理系统的复杂规格自动化硬件代码生成

Zhongzhi Yu, Mingjie Liu, Michael Zimmer 等, 2025-2025 IEEE International Conference on LLM-Aided Design (ICLAD)

Despite recent progress in generating hardware register transfer level (RTL) code with large language models (LLMs), existing solutions still suffer from a substantial gap between practical application scenarios and the requirements of real-world RTL code development. Prior approaches either focus on overly simplified hardware descriptions or depend on extensive human guidance to process complex specifications, limiting their scalability and automation potential. In this paper, we address this gap by proposing an LLM agent system, termed Spec2RTL-Agent, designed to directly process complex specification documentation and generate corresponding RTL code implementations, advancing LLM-based RTL code generation toward more realistic application settings. To achieve this goal, Spec2RTL-Agent introduces a novel multi-agent collaboration framework that integrates three key enablers: (1) a reasoning and understanding module that translates specifications into structured, step-by-step implementation plans; (2) a progressive coding and prompt optimization module that iteratively refines the code across multiple representations (pseudocode, Python, and C++) to enhance correctness and synthesisability for RTL conversion; and (3) an adaptive reflection module that identifies and traces the source of errors during generation, ensuring a more robust code generation flow. Instead of directly generating RTL from natural language, our system strategically generates synthesizable C++ code, which is then optimized for high-level synthesis (HLS). This agent-driven refinement ensures greater correctness and compatibility compared to naive direct RTL generation approaches. We evaluate Spec2RTL-Agent on a benchmark of three specification documents, demonstrating its effectiveness in generating accurate RTL code with as much as 75% fewer human interventions compared to existing approaches. These results underscore Spec2RTL-Agent’s role as the first fully automated multi-agent system for RTL generation from unstructured specification documents, reducing the reliance on human effort and expertise in hardware design.

安装插件收集

被引 17

ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

ECom-Bench：大型语言模型代理能否解决现实世界电子商务客户支持问题？

Haoxin Wang, Xianhan Peng, Xucheng Huang 等, 2025-Conference on Empirical Methods in Natural Language Processing

In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agent with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. The code and data have been made publicly available at https://github.com/XiaoduoAILab/ECom-Bench to facilitate further research and development in this domain.

安装插件收集

被引 18

Data Interpreter: An LLM Agent For Data Science

数据解释器：一种用于数据科学的LLM智能体

Sirui Hong, Yizhang Lin, Bangbang Liu 等, 2024-Annual Meeting of the Association for Computational Linguistics

Large Language Model (LLM)-based agents have shown effectiveness across many applications. However, their use in data science scenarios requiring solving long-term interconnected tasks, dynamic data adjustments and domain expertise remains challenging. Previous approaches primarily focus on individual tasks, making it difficult to assess the complete data science workflow. Moreover, they struggle to handle real-time changes in intermediate data and fail to adapt dynamically to evolving task dependencies inherent to data science problems. In this paper, we present Data Interpreter, an LLM-based agent designed to automatically solve various data science problems end-to-end. Our Data Interpreter incorporates two key modules: 1) Hierarchical Graph Modeling, which breaks down complex problems into manageable subproblems, enabling dynamic node generation and graph optimization; and 2) Programmable Node Generation, a technique that refines and verifies each subproblem to iteratively improve code generation results and robustness. Extensive experiments consistently demonstrate the superiority of Data Interpreter. On InfiAgent-DABench, it achieves a 25% performance boost, raising accuracy from 75.9% to 94.9%. For machine learning and open-ended tasks, it improves performance from 88% to 95%, and from 60% to 97%, respectively. Moreover, on the MATH dataset, Data Interpreter achieves remarkable performance with a 26% improvement compared to state-of-the-art baselines. The code is available at https://github.com/geekan/MetaGPT.

安装插件收集

被引 193

SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code

SceneCraft：一种将3D场景合成为Blender代码的LLM代理

Ziniu Hu, Ahmet Iscen, Aashi Jain 等, 2024-International Conference on Machine Learning

This paper introduces SceneCraft, a Large Language Model (LLM) Agent converting text descriptions into Blender-executable Python scripts which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. SceneCraft then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models like GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and guiding a video generative model with generated scenes as intermediary control signal.

安装插件收集

被引 96

Uncertainty Propagation on LLM Agent

大型语言模型代理中的不确定性传播

Qiwei Zhao, Dong Li, Yanchi Liu 等, 2025-Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

No abstract available

安装插件收集

被引 13

Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation

大型语言模型作为城市居民：一种用于个人出行生成的大型语言模型代理框架

Jiawei Wang, Renhe Jiang, Chuang Yang 等, 2024-Neural Information Processing Systems

This paper introduces a novel approach using Large Language Models (LLMs) integrated into an agent framework for flexible and effective personal mobility generation. LLMs overcome the limitations of previous models by effectively processing semantic data and offering versatility in modeling various tasks. Our approach addresses three research questions: aligning LLMs with real-world urban mobility data, developing reliable activity generation strategies, and exploring LLM applications in urban mobility. The key technical contribution is a novel LLM agent framework that accounts for individual activity patterns and motivations, including a self-consistency approach to align LLMs with real-world activity data and a retrieval-augmented strategy for interpretable activity generation. We evaluate our LLM agent framework and compare it with state-of-the-art personal mobility generation approaches, demonstrating the effectiveness of our approach and its potential applications in urban mobility. Overall, this study marks the pioneering work of designing an LLM agent framework for activity generation based on real-world human activity data, offering a promising tool for urban mobility analysis.

安装插件收集

被引 119

AgentSquare: Automatic LLM Agent Search in Modular Design Space

AgentSquare：模块化设计空间中的自动LLM代理搜索

Yu Shang, Yu Li, Keyu Zhao 等, 2024-International Conference on Learning Representations

Recent advancements in Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks. However, current research largely relies on manual, task-specific design, limiting their adaptability to novel tasks. In this paper, we introduce a new research problem: Modularized LLM Agent Search (MoLAS). We propose a modular design space that abstracts existing LLM agent designs into four fundamental modules with uniform IO interface: Planning, Reasoning, Tool Use, and Memory. Building on this design space, we present a novel LLM agent search framework called AgentSquare, which introduces two core mechanisms, i.e., module evolution and recombination, to efficiently search for optimized LLM agents. To further accelerate the process, we design a performance predictor that uses in-context surrogate models to skip unpromising agent designs. Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. Moreover, AgentSquare can generate interpretable design insights, enabling a deeper understanding of agentic architecture and its impact on task performance. We believe that the modular design space and AgentSquare search framework offer a platform for fully exploiting the potential of prior successful designs and consolidating the collective efforts of research community. Code repo is available at https://github.com/tsinghua-fib-lab/AgentSquare.

安装插件收集

被引 71

LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research

LMR-BENCH：评估大型语言模型在重现语言建模研究中的能力

Shuo Yan, Ruochen Li, Ziming Luo 等, 2025-Conference on Empirical Methods in Natural Language Processing

Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state-of-the-art LLMs, evaluating the accuracy of unit tests and performing LLM-based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents'ability to autonomously reproduce scientific research

安装插件收集

被引 11

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

小型LLM是弱工具学习者：一个多LLM智能体

Weizhou Shen, Chenliang Li, Hongzhan Chen 等, 2024-Conference on Empirical Methods in Natural Language Processing

Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete various tasks in a self-directed fashion. The challenge of tool use demands that LLMs not only understand user queries and generate answers accurately but also excel in task planning, tool invocation, and result summarization. While traditional works focus on training a single LLM with all these capabilities, performance limitations become apparent, particularly with smaller models. To overcome these challenges, we propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer. Each component is implemented by a single LLM that focuses on a specific capability and collaborates with others to accomplish the task. This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability. To effectively train this framework, we introduce a two-stage training paradigm. First, we fine-tune a backbone LLM on the entire dataset without discriminating sub-tasks, providing the model with a comprehensive understanding of the task. Second, the fine-tuned LLM is used to instantiate the planner, caller, and summarizer respectively, which are continually fine-tuned on respective sub-tasks. Evaluation across various tool-use benchmarks illustrates that our proposed multi-LLM framework surpasses the traditional single-LLM approach, highlighting its efficacy and advantages in tool learning.

安装插件收集

被引 113

ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary

ChatCite：具有人类工作流程指导的LLM代理用于比较文学摘要

Yutong Li, Lu Chen, Aiwei Liu 等, 2024-International Conference on Computational Linguistics

The literature review is an indispensable step in the research process. It provides the benefit of comprehending the research problem and understanding the current research situation while conducting a comparative analysis of prior works. However, literature summary is challenging and time consuming. The previous LLM-based studies on literature review mainly focused on the complete process, including literature retrieval, screening, and summarization. However, for the summarization step, simple CoT method often lacks the ability to provide extensive comparative summary. In this work, we firstly focus on the independent literature summarization step and introduce ChatCite, an LLM agent with human workflow guidance for comparative literature summary. This agent, by mimicking the human workflow, first extracts key elements from relevant literature and then generates summaries using a Reflective Incremental Mechanism. In order to better evaluate the quality of the generated summaries, we devised a LLM-based automatic evaluation metric, G-Score, in refer to the human evaluation criteria. The ChatCite agent outperformed other models in various dimensions in the experiments. The literature summaries generated by ChatCite can also be directly used for drafting literature reviews.

安装插件收集

被引 43

You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects

任你命名，我来运行：一个执行任意项目测试的LLM代理

Islem Bouzenia, Michael Pradel, 2024-Proceedings of the ACM on Software Engineering

The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it difficult to create a reliable, universal test execution method that works across different projects. This paper presents ExecutionAgent, an automated technique that prepares scripts for building an arbitrary project from source code and running its test cases. Inspired by the way a human developer would address this task, our approach is a large language model (LLM)-based agent that autonomously executes commands and interacts with the host system. The agent uses meta-prompting to gather guidelines on the latest technologies related to the given project, and it iteratively refines its process based on feedback from the previous steps. Our evaluation applies ExecutionAgent to 50 open-source projects that use 14 different programming languages and many different build and testing tools. The approach successfully executes the test suites of 33/50 projects, while matching the test results of ground truth test suite executions with a deviation of only 7.5%. These results improve over the best previously available technique by 6.6x. The costs imposed by the approach are reasonable, with an execution time of 74 minutes and LLM costs of USD 0.16, on average per project. We envision ExecutionAgent to serve as a valuable tool for developers, automated programming tools, and researchers that need to execute tests across a wide variety of projects.

安装插件收集

被引 50

WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment

WorldCoder：一种基于模型的LLM智能体：通过编写代码与环境交互构建世界模型

Hao Tang, Darren Key, Kevin Ellis, 2024-Neural Information Processing Systems

We give a model-based agent that builds a Python program representing its knowledge of the world based on its interactions with the environment. The world model tries to explain its interactions, while also being optimistic about what reward it can achieve. We define this optimism as a logical constraint between a program and a planner. We study our agent on gridworlds, and on task planning, finding our approach is more sample-efficient compared to deep RL, more compute-efficient compared to ReAct-style agents, and that it can transfer its knowledge across environments by editing its code.

安装插件收集

被引 84

WaitGPT: Monitoring and Steering Conversational LLM Agent in Data Analysis with On-the-Fly Code Visualization

WaitGPT：在数据分析中通过实时代码可视化监控和引导对话型大型语言模型代理

Liwenhan Xie, Chengbo Zheng, Haijun Xia 等, 2024-Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology

Large language models (LLMs) support data analysis through conversational user interfaces, as exemplified in OpenAI’s ChatGPT (formally known as Advanced Data Analysis or Code Interpreter). Essentially, LLMs produce code for accomplishing diverse analysis tasks. However, presenting raw code can obscure the logic and hinder user verification. To empower users with enhanced comprehension and augmented control over analysis conducted by LLMs, we propose a novel approach to transform LLM-generated code into an interactive visual representation. In the approach, users are provided with a clear, step-by-step visualization of the LLM-generated code in real time, allowing them to understand, verify, and modify individual data operations in the analysis. Our design decisions are informed by a formative study (N=8) probing into user practice and challenges. We further developed a prototype named WaitGPT and conducted a user study (N=12) to evaluate its usability and effectiveness. The findings from the user study reveal that WaitGPT facilitates monitoring and steering of data analysis performed by LLMs, enabling participants to enhance error detection and increase their overall confidence in the results.

安装插件收集

被引 61

Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement

步步为营！通过迭代步骤级过程优化进行大型语言模型智能体学习

Weimin Xiong, Yifan Song, Xiutian Zhao 等, 2024-Conference on Empirical Methods in Natural Language Processing

Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the **I**terative step-level **P**rocess **R**efinement **(IPR)** framework, which provides detailed step-by-step guidance to enhance agent training. Specifically, we adopt the Monte Carlo method to estimate step-level rewards. During each iteration, the agent explores along the expert trajectory and generates new actions. These actions are then evaluated against the corresponding step of expert trajectory using step-level rewards. Such comparison helps identify discrepancies, yielding contrastive action pairs that serve as training data for the agent. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines. Moreover, our analytical finds highlight the effectiveness of IPR in augmenting action efficiency and its applicability to diverse models.

安装插件收集

被引 73

LLM Agent for Hyper-Parameter Optimization

基于大型语言模型的超参数优化智能体

Wanzhe Wang, Jianqiu Peng, Menghao Hu 等, 2025-2025 IEEE/CIC International Conference on Communications in China (ICCC Workshops)

Hyper-parameters are essential and critical for the performance of communication algorithms. However, current hyper-parameters optimization approaches for Warm-Start Particles Swarm Optimization with Crossover and Mutation (WSPSO-CM) algorithm, designed for radio map-enabled unmanned aerial vehicle (UAV) trajectory and communication, are primarily heuristic-based, exhibiting low levels of automation and improvable performance. In this paper, we design an Large Language Model (LLM) agent for automatic hyper-parameters-tuning, where an iterative framework and Model Context Protocol (MCP) are applied. In particular, the LLM agent is first set up via a profile, which specifies the boundary of hyper-parameters, task objective, terminal condition, conservative or aggressive strategy of optimizing hyper-parameters, and LLM configurations. Then, the LLM agent iteratively invokes WS-PSO-CM algorithm for exploration. Finally, the LLM agent exits the loop based on the terminal condition and returns an optimized set of hyperparameters. Our experiment results show that the minimal sum-rate achieved by hyper-parameters generated via our LLM agent is significantly higher than those by both human heuristics and random generation methods. This indicates that an LLM agent with PSO and WS-PSO-CM algorithm knowledge is useful in seeking high-performance hyper-parameters.

安装插件收集

被引 4

LightVA: Lightweight Visual Analytics With LLM Agent-Based Task Planning and Execution

LightVA：基于LLM代理的轻量级视觉分析任务规划和执行

Yuheng Zhao, Junjie Wang, Linbing Xiang 等, 2024-IEEE Transactions on Visualization and Computer Graphics2区 TopIF 6.5

Visual analytics (VA) requires analysts to iteratively propose analysis tasks based on observations and execute tasks by creating visualizations and interactive exploration to gain insights. This process demands skills in programming, data processing, and visualization tools, highlighting the need for a more intelligent, streamlined VA approach. Large language models (LLMs) have recently been developed as agents to handle various tasks with dynamic planning and tool-using capabilities, offering the potential to enhance the efficiency and versatility of VA. We propose LightVA, a lightweight VA framework that supports task decomposition, data analysis, and interactive exploration through human-agent collaboration. Our method is designed to help users progressively translate high-level analytical goals into low-level tasks, producing visualizations and deriving insights. Specifically, we introduce an LLM agent-based task planning and execution strategy, employing a recursive process involving a planner, executor, and controller. The planner is responsible for recommending and decomposing tasks, the executor handles task execution, including data analysis, visualization generation and multi-view composition, and the controller coordinates the interaction between the planner and executor. Building on the framework, we develop a system with a hybrid user interface that includes a task flow diagram for monitoring and managing the task planning process, a visualization panel for interactive data exploration, and a chat view for guiding the model through natural language instructions. We examine the effectiveness of our method through a usage scenario and an expert study.

安装插件收集

被引 29

Elicitron: An LLM Agent-Based Simulation Framework for Design Requirements Elicitation

Elicitron：一种基于大型语言模型（LLM）的仿真框架，用于设计需求提取

Mohammadmehdi Ataei, Hyunmin Cheong, Daniele Grandi 等, 2024-Journal of Computing and Information Science in Engineering3区IF 3.3

Requirements elicitation, a critical, yet time-consuming and challenging step in product development, often fails to capture the full spectrum of user needs. This may lead to products that fall short of expectations. This paper introduces a novel framework that leverages Large Language Models (LLMs) to automate and enhance the requirements elicitation process. LLMs are used to generate a vast array of simulated users (LLM agents), enabling the exploration of a much broader range of user needs and unforeseen use cases. These agents engage in product experience scenarios, through explaining their actions, observations, and challenges. Subsequent agent interviews and analysis uncover valuable user needs, including latent ones. We validate our framework with three experiments. First, we explore different methodologies for the challenge of diverse agent generation, discussing their advantages and shortcomings. We measure the diversity of identified user needs and demonstrate that context-aware agent generation leads to greater diversity. Second, we show how our framework effectively mimics empathic lead user interviews, identifying a greater number of latent needs than conventional human interviews. Third, we showcase that LLMs can be used to analyze interviews, capture needs and classify them as latent or not. Our work highlights the potential of using LLMs to accelerate early-stage product development, reduce costs, and increase innovation.

安装插件收集

被引 31

CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent

CheatAgent：通过LLM代理攻击LLM赋能推荐系统的攻击方法

Liang-bo Ning, Shijie Wang, Wenqi Fan 等, 2024-Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Recently, Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience and have attracted considerable attention. Despite the impressive progress, the research question regarding the safety vulnerability of LLM-empowered RecSys still remains largely under-investigated. Given the security and privacy concerns, it is more practical to focus on attacking the black-box RecSys, where attackers can only observe the system's inputs and outputs. However, traditional attack approaches employing reinforcement learning (RL) agents are not effective for attacking LLM-empowered RecSys due to the limited capabilities in processing complex textual inputs, planning, and reasoning. On the other hand, LLMs provide unprecedented opportunities to serve as attack agents to attack RecSys because of their impressive capability in simulating human-like decision-making processes. Therefore, in this paper, we propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs, where an LLM-based agent is developed to attack LLM-Empowered RecSys. Specifically, our method first identifies the insertion position for maximum impact with minimal input modification. After that, the LLM agent is designed to generate adversarial perturbations to insert at target positions. To further improve the quality of generated perturbations, we utilize the prompt tuning technique to improve attacking strategies via feedback from the victim RecSys iteratively. Extensive experiments across three real-world datasets demonstrate the effectiveness of our proposed attacking method.

安装插件收集

被引 48

ELLMA-T: an Embodied LLM-agent for Supporting English Language Learning in Social VR

ELLMA-T：一种支持社交虚拟现实（VRChat）中英语语言学习的具身LLM代理

Mengxue Pan, Alexandra Kitson, Hongyu Wan 等, 2024-Proceedings of the 2025 ACM Designing Interactive Systems Conference

Many people struggle with learning a new language when moving to a new country, with traditional tools falling short in providing contextualized learning tailored to each learner’s needs. The recent development of large language models (LLMs) and embodied conversational agents (ECAs) in social virtual reality (VR) provides new opportunities to practice language learning in a contextualized and naturalistic way that takes into account the learner’s language level and needs. To explore this opportunity, we developed ELLMA-T, a design probe that integrates an LLM (GPT-4) with an ECA for English language learning in social VR (VRChat), informed by the situated learning framework. We conducted a feasibility study to explore the potential and challenges of LLM-based ECAs for language learning in social VR. Drawing on qualitative interviews (N=12), we reveal the potential of ELLMA-T to generate realistic, believable, and context-specific role plays for agent-learner interaction in VR, and LLM’s capability to provide initial language assessment and continuous feedback to learners. We provide four design implications for the future development of LLM-based language agents in social VR.

安装插件收集

被引 26

Exploring and Controlling Diversity in LLM-Agent Conversation

探索与控制大型语言模型代理对话中的多样性

Kuanchao Chu, Yi-Pei Chen, Hideki Nakayama, 2024-Conference on Empirical Methods in Natural Language Processing

Controlling diversity in LLM-agent simulations is essential for balancing stability in structured tasks with variability in open-ended interactions. However, we observe that dialogue diversity tends to degrade over long-term simulations. To explore the role of prompt design in this phenomenon, we modularized the utterance generation prompt and found that reducing contextual information leads to more diverse outputs. Based on this insight, we propose Adaptive Prompt Pruning (APP), a novel method that allows users to control diversity via a single parameter, lambda. APP dynamically prunes prompt segments based on attention scores and is compatible with existing diversity control methods. We demonstrate that APP effectively modulates diversity through extensive experiments and propose a method to balance the control trade-offs. Our analysis reveals that all prompt components impose constraints on diversity, with the Memory being the most influential. Additionally, high-attention contents consistently suppress output diversity.

安装插件收集

被引 9

Enhancing Cognitive Digital Twin Interaction using an LLM Agent

利用大型语言模型代理增强认知数字孪生交互

Jan Sturm, Patrik Zajec, Maja Skrjanc 等, 2024-2024 47th MIPRO ICT and Electronics Convention (MIPRO)

This paper introduces a conceptual architecture design aimed at enhancing interactions with cognitive digital twins of countries through an Large Language Model (LLM) agent. By leveraging sophisticated data retrieval and summarization techniques, the architecture integrates data from diverse sources, including environmental sensors, web pages, and human inputs, to create a dynamic and comprehensive digital twin. The LLM agent facilitates intuitive conversational interfaces, allowing users to query and interact with the digital twin in a natural manner. Through advanced natural language processing and prompt engineering, the agent can understand complex queries, retrieve relevant data, and provide transparent and explainable insights. Additionally, the system incorporates a feedback loop for continuous improvement based on user interactions. This approach addresses significant challenges in data acquisition and management, offering a scalable solution for creating accurate and real-time representations of countries. The architecture aims to empower decision-makers with precise, actionable insights for policy-making, urban planning, and resource management, demonstrating a significant step towards realizing the potential of digital twins in understanding and managing complex national systems.

安装插件收集

被引 8

Agent Laboratory: Using LLM Agents as Research Assistants

智能实验室：利用大型语言模型代理作为研究助手

Samuel Schmidgall, Yusheng Su, Ze Wang 等, 2025-Findings of the Association for Computational Linguistics: EMNLP 2025

Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages--literature review, experimentation, and report writing to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluate the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.

安装插件收集

被引 280

Red-Teaming LLM Multi-Agent Systems via Communication Attacks

通过通信攻击对基于大型语言模型的多人代理系统进行红队测试

Pengfei He, Yuping Lin, Shen Dong 等, 2025-Annual Meeting of the Association for Computational Linguistics

Large Language Model-based Multi-Agent Systems (LLM-MAS) have revolutionized complex problem-solving capability by enabling sophisticated agent collaboration through message-based communications. While the communication framework is crucial for agent coordination, it also introduces a critical yet unexplored security vulnerability. In this work, we introduce Agent-in-the-Middle (AiTM), a novel attack that exploits the fundamental communication mechanisms in LLM-MAS by intercepting and manipulating inter-agent messages. Unlike existing attacks that compromise individual agents, AiTM demonstrates how an adversary can compromise entire multi-agent systems by only manipulating the messages passing between agents. To enable the attack under the challenges of limited control and role-restricted communication format, we develop an LLM-powered adversarial agent with a reflection mechanism that generates contextually-aware malicious instructions. Our comprehensive evaluation across various frameworks, communication structures, and real-world applications demonstrates that LLM-MAS is vulnerable to communication-based attacks, highlighting the need for robust security measures in multi-agent systems.

安装插件收集

被引 70

Thematic-LM: A LLM-based Multi-agent System for Large-scale Thematic Analysis

基于大型语言模型的主题分析多智能体系统：Thematic-LM

Tingrui Qiao, Caroline Walker, Chris Cunningham 等, 2025-Proceedings of the ACM on Web Conference 2025

Thematic analysis (TA) is a widely used qualitative method for identifying underlying meanings within unstructured text. However, TA requires manual processes, which become increasingly labour-intensive and time-consuming as datasets grow. While large language models (LLMs) have been introduced to assist with TA on small-scale datasets, three key limitations hinder their effectiveness. First, current approaches often depend on interactions between an LLM agent and a human coder, a process that becomes challenging with larger datasets. Second, with feedback from the human coder, the LLM tends to mirror the human coder, which provides a narrower viewpoint of the data. Third, existing methods follow a sequential process, where codes are generated for individual samples without recalling previous codes and associated data, reducing the ability to analyse data holistically. To address these limitations, we propose Thematic-LM, an LLM-based multi-agent system for large-scale computational thematic analysis. Thematic-LM assigns specialised tasks to each agent, such as coding, aggregating codes, and maintaining and updating the codebook. We assign coder agents different identity perspectives to simulate the subjective nature of TA, fostering a more diverse interpretation of the data. We applied Thematic-LM to the Dreaddit dataset and the Reddit climate change dataset to analyse themes related to social media stress and online opinions on climate change. We evaluate the resulting themes based on trustworthiness principles in qualitative research. Our study reveals insights such as assigning different identities to coder agents promotes divergence in codes and themes.

安装插件收集

被引 31

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

哪些代理导致任务失败及其原因？关于LLM多代理系统自动故障归因的研究

Shaokun Zhang, Ming Yin, Jieyu Zhang 等, 2025-International Conference on Machine Learning

Failure attribution in LLM multi-agent systems-identifying the agent and step responsible for task failures-provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using the Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution

安装插件收集

被引 57

LLM Agents for Smart City Management: Enhancing Decision Support Through Multi-Agent AI Systems

基于LLM的智能城市管理代理：通过多智能体AI系统提升决策支持能力

A. Kalyuzhnaya, Sergey Mityagin, E. Lutsenko 等, 2025-Smart Cities3区IF 5.5

This study investigates the implementation of LLM agents in smart city management, leveraging both the inherent language processing abilities of LLMs and the distributed problem solving capabilities of multi-agent systems for the improvement of urban decision making processes. A multi-agent system architecture combines LLMs with existing urban information systems to process complex queries and generate contextually relevant responses for urban planning and management. The research is focused on three main hypotheses testing: (1) LLM agents’ capability for effective routing and processing diverse urban queries, (2) the effectiveness of Retrieval-Augmented Generation (RAG) technology in improving response accuracy when working with local knowledge and regulations, and (3) the impact of integrating LLM agents with existing urban information systems. Our experimental results, based on a comprehensive validation dataset of 150 question–answer pairs, demonstrate significant improvements in decision support capabilities. The multi-agent system achieved pipeline selection accuracy of 94–99% across different models, while the integration of RAG technology improved response accuracy by 17% for strategic development queries and 55% for service accessibility questions. The combined use of document databases and service APIs resulted in the highest performance metrics (G-Eval scores of 0.68–0.74) compared to standalone LLM responses (0.30–0.38). Using St. Petersburg’s Digital Urban Platform as a testbed, we demonstrate the practical applicability of this approach to create integrated city management systems with support complex urban decision making processes. This research contributes to the growing field of AI-enhanced urban management by providing empirical evidence of LLM agents’ effectiveness in processing heterogeneous urban data and supporting strategic planning decisions. Our findings suggest that LLM-based multi-agent systems can significantly enhance the efficiency and accuracy of urban decision making while maintaining high relevance in responses.

安装插件收集

被引 42

AutoHMA-LLM: Efficient Task Coordination and Execution in Heterogeneous Multi-Agent Systems Using Hybrid Large Language Models

AutoHMA-LLM：基于混合大型语言模型的异构多智能体系统高效任务协调与执行

Tinging Yang, Ping Feng, Qixin Guo 等, 2025-IEEE Transactions on Cognitive Communications and Networking1区 TopIF 7.0

Heterogeneous multi-agent systems (HMAS) comprise various intelligent agents with specialized functions, such as drones, ground robots, and automated devices, working in coordinated settings. This paper presents AutoHMA-LLM, a novel framework that combines Large Language Models (LLMs) with classical control algorithms to address the challenges of task coordination and scheduling in complex, dynamic environments. The framework is designed with a multi-tier architecture, utilizing a cloud-based LLM as the central planner alongside device-specific LLMs and Generative Agents to improve task execution efficiency and accuracy. Specifically targeting dynamic scenarios, the system enhances resource utilization and stabilizes task execution through refined task scheduling and real-time feedback mechanisms. In experiments conducted across logistics, inspection, and search & rescue scenarios, AutoHMA-LLM demonstrated a 5.7% improvement in task completion accuracy, a 46% reduction in communication steps, and a 31% decrease in token usage and API calls compared to baseline methods. These results highlight our framework’s scalability and efficiency, offering substantial support for effective multi-agent collaboration in complex, resource-constrained environments.

安装插件收集

被引 30

Exploring LLM-Based Multi-Agent Situation Awareness for Zero-Trust Space-Air-Ground Integrated Network

基于大型语言模型的多智能体态势感知在零信任空地一体网络中的应用探索

Xinye Cao, Gu Nan, Hongcan Guo 等, 2025-IEEE Journal on Selected Areas in Communications1区 TopIF 17.2

Space-air-ground integrated network (SAGIN), which integrates satellite systems, aerial networks, and terrestrial communications, offers ubiquitous coverage for a multitude of applications. Nevertheless, the highly dynamic and open nature of SAGIN increases the network’s vulnerability. Hence, zero-trust security, operating on the principle of “never trust, always verify”, holds the significant potential of securing SAGIN. However, implementing zero-trust SAGIN in practice presents three primary challenges: 1) understanding massive unstructured threat information across diverse domains, 2) performing adaptive security assessments, and 3) making in-depth security decisions. This motivates us to propose SAG-Attack and LLM-SA to enhance zero-trust SAGIN. SAG-Attack serves as a simulator that aims to mimic various attacks in SAGIN. Our LLM-SA is a novel situation awareness method that explores the multiple agents of large language model (LLM). Specifically, the output logs of SAG-Attack will be fed into LLM-SA, and LLM-SA fuses vast amounts of heterogeneous threat information from various domains, thus tackling the first challenge. Then, our LLM-SA relies on multiple LLM-based agents to perform adaptive security assessments, utilizing the chain-of-thought capabilities of LLMs to automatically generate in-depth defense strategies, thereby addressing the second and third challenges. Experiments on five benchmarks demonstrate the superiority of the proposed SAG-Attack and LLM-SA. Notably, our method based on open-sourced Llama3-8B even outperforms ChatGPT-4 under the same setting, despite involving significantly fewer parameters. To foster further research in this area, we will release our platform to the community, facilitating the advancement of zero-trust SAGIN.

安装插件收集

被引 40

AI-powered consumer segmentation and targeting: A theoretical framework for precision marketing by autonomous (Agentic) AI

基于人工智能的消费者细分与定位：自主（代理）人工智能在精准营销中的理论框架

Arunraju Chinnaraju, 2025-International Journal of Science and Research Archive

Consumer segmentation and targeting are essential for precision marketing, allowing businesses to deliver personalized experiences. The article explores the transformative role of autonomous AI agents in enhancing consumer segmentation and targeting within the data-driven marketing landscape. The proposed framework integrates machine learning (ML), natural language processing (NLP), and predictive analytics to continuously optimize segmentation models, enabling real-time targeting and hyper-personalization without human oversight. Autonomous agents dynamically manage segmentation by leveraging unsupervised learning algorithms, including K-means and DBSCAN, to refine clusters and discover complex micro-segments based on evolving consumer behavior and preferences. The AI agents use reinforcement learning to enhance campaign management through continuous feedback loops. By monitoring real-time performance metrics, such as click-through rates and conversions, they dynamically adjust ad spend, resource allocation, and personalized content delivery across digital channels. Predictive models, including Random Forests and time series analysis, further support real-time consumer behavior forecasting. This automation reduces operational inefficiencies, speeds up decision-making, and ensures marketing strategies remain relevant and adaptive. Ethical considerations, including data privacy and algorithmic fairness, are integral to the framework, promoting responsible AI deployment. Case studies from industries such as e-commerce and streaming illustrate significant improvements in campaign efficiency, customer engagement, and return on investment. Autonomous AI enables scalable, data-driven solutions that give businesses a competitive edge in rapidly changing markets.

安装插件收集

被引 8

Autonomous Agentic AI Architectures for Optimizing Security Operations Centers (SOC) KPIS: Methodology, Impact on Detection, Response, and Recovery

基于自主智能体的安全运营中心（SOC）关键绩效指标（KPI）优化架构：方法论、对检测、响应和恢复的影响

Miroslav Stefanov, K. Stefanov, Laxima Niure Kandel 等, 2025-Land Forces Academy Review

Abstract Security Operations Centers (SOCs) face significant challenges due to the large volume, diversity, and dynamics of incident events. Alarm fatigue, delayed initiation of response, and the high share of false positives or missed threats limit team effectiveness and increase organizational risk. This study presents a methodology for automated management of key performance indicators (KPIs) in an SOC environment through an Agentic AI architecture and machine learning. Within the project, 214 CSV files were processed, comprising over 8.6 million data rows extracted from SIEM, Incident Management, Task Tracking, and CRM systems. Sixteen specific indicators were used, grouped into four categories: detection and filtering (TTD, FNR, FPR), response and resolution (TTR, IRR, SIHR), recovery and operations (MTTR, OE), satisfaction and risk management (CSR, SIER). The system includes ten specialized Agentic AI agents with clearly defined roles ‒ monitoring time parameters, predicting false alarm probabilities, automatically triggering playbooks, calculating operational metrics, and analyzing customer satisfaction. Five machine learning models were trained: two XGBoost classifiers for FPR and FNR, two LightGBM regressors for TTR and MTTR, and a BERT model for textual feedback analysis. The results demonstrate reduced detection and response times, a lower rate of false alarms, and improved operational predictability in calculating KPI values. The methodology shows the applicability of Agentic AI for optimizing SOC processes on real and public data, without the need for manual intervention in most processing phases.

安装插件收集

被引 2

Toward the Autonomous AI Doctor: Quantitative Benchmarking of an Autonomous Agentic AI Versus Board-Certified Clinicians in a Real World Setting

迈向自主人工智能医生：在现实世界环境中对自主智能体AI与执业医师的量化基准比较

Hashim Hayat, Maksim Kudrautsau, E.A. Makarov 等, 2025-medRxiv

Background: Globally we face a projected shortage of 11 million healthcare practitioners by 2030, and administrative burden consumes 50% of clinical time. Artificial intelligence (AI) has the potential to help alleviate these problems. However, no end-to-end autonomous large language model (LLM)-based AI system has been rigorously evaluated in real-world clinical practice. In this study, we evaluated whether a multi-agent LLM-based AI framework can function autonomously as an AI doctor in a virtual urgent care setting. Methods: We retrospectively compared the performance of the multi-agent AI system Doctronic and board-certified clinicians across 500 consecutive urgent-care telehealth encounters. The primary end points: diagnostic concordance, treatment plan consistency, and safety metrics, were assessed by blinded LLM-based adjudication and expert human review. Results: The top diagnosis of Doctronic and clinician matched in 81% of cases, and the treatment plan aligned in 99.2% of cases. No clinical hallucinations occurred (e.g., diagnosis or treatment not supported by clinical findings). In an expert review of discordant cases, AI performance was superior in 36.1%, and human performance was superior in 9.3%; the diagnoses were equivalent in the remaining cases. Conclusions: In this first large-scale validation of an autonomous AI doctor, we demonstrated strong diagnostic and treatment plan concordance with human clinicians. These findings indicate that multi-agent AI systems can achieve comparable clinical decision-making to human providers and offer a potential solution to healthcare workforce shortages. Key words: large language models, artificial intelligence, autonomous AI doctor, diagnostic accuracy

安装插件收集

被引 2

Autonomous Agentic AI for Clinical Workflow Orchestration: Self-Managing Healthcare Operations

自主代理人工智能在临床工作流程编排中的应用：自我管理的医疗运营

Arjun Warrier, Abhilash K S, 2025-2025 6th International Conference on IoT Based Control Networks and Intelligent Systems (ICICNIS)

Healthcare operations are inherently complex, involving dynamic coordination across emergency triage, diagnostics, surgery, and discharge processes. Traditional orchestration methods such as manual scheduling, static bed boards, and siloed communication struggle to manage this complexity, often resulting in delayed interventions, inefficiencies, and suboptimal resource utilization, especially during high-acuity surges or emergencies. Agentic Artificial Intelligence (AI) introduces a transformative paradigm by embedding autonomy, reasoning, and negotiation capabilities into intelligent digital agents that perceive, learn, and act within clinical workflows. Unlike conventional AI systems that rely on predefined rules or static predictions, agentic AI employs multi-agent reinforcement learning (MARL) to enable decentralized decision-making, adaptive resource allocation, and cooperative policy optimization across interconnected hospital systems. This study presents an autonomous agentic AI framework for clinical workflow orchestration, integrating agents for triage, bed management, laboratory, imaging, transport, and discharge operations. Using HL7 Fast Healthcare Interoperability Resources (FHIR) and DICOM standards, agents exchange real-time information while adhering to governance and safety protocols aligned with the NIST AI Risk Management Framework and the EU AI Act. The architecture further incorporates three core design elements: (i) inter-hospital communication for mutual-aid and load sharing, (ii) decentralized ambulance routing that rebalances transport in real time based on dynamic capacity and patient acuity, and (iii) distributed crisis-management protocols for maintaining operational equilibrium during mass-casualty events. Evaluation through digital-twin simulations and shadow-mode deployments demonstrated substantial operational gains, including 60% faster ambulance response, 38% shorter door-to-clinician intervals, and 22% higher operating room throughput. These results confirm that agentic AI transforms reactive, human-initiated workflows into proactive, self-governing systems, enhancing responsiveness, equity, and resilience across healthcare networks.

安装插件收集

Autonomous agentic AI with policy adaptation for physics-informed spectral learning in Structural Health Monitoring

具有策略自适应的物理信息光谱学习在结构健康监测中的自主智能体AI

Anshu Sharma, B. Bhowmik, 2026-Advanced Engineering Informatics1区 TopIF 9.9

No abstract available

安装插件收集

被引 1

AGENTIC AI: A COMPREHENSIVE FRAMEWORK FOR AUTONOMOUS DECISION-MAKING SYSTEMS IN ARTIFICIAL INTELLIGENCE

Panneer Selvam Viswanathan, 2025-INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING AND TECHNOLOGY

No abstract available

安装插件收集

被引 17

Agentic AI for Autonomous Micro-Frontend User Interfaces and Microservices Evolution in Cloud Platforms

自主微前端用户界面和云平台中微服务演化的代理人工智能

Jyoti Kunal, Jyoti Kunal Shah, 2025-Journal of Computer Science and Technology Studies

Cloud-native organizations increasingly rely on microservices for backend modularity and micro-frontends for scalable user interface delivery. Yet, real-world systems still struggle to evolve these layers coherently under high release velocity, shifting product goals, and variable workloads. This paper presents a unified Agentic AI framework that autonomously coordinates the co-evolution of micro-frontend UIs (implemented in ReactJS and Angular) and microservices. The proposed architecture integrates reinforcement learning for continuous control, large language models for code and configuration synthesis, and a policy-governed multi-agent control plane that executes progressive delivery (feature flags, canary, blue-green) via Kubernetes and service meshes. We formalize decisions using Markov Decision Processes, propose drift detection models for UI-API compatibility, and formulate traffic-shifting optimization for safe rollouts. A mini empirical study across e-commerce, SaaS analytics, and multi-cloud migration scenarios demonstrates reductions in adaptation latency, error rates, and manual intervention relative to strong DevOps baselines. We discuss reliability, explainability, and governance challenges, and lay out future research on hybrid RL-LLM agents, knowledge-graph-aware planning, digital twins, and compliance-aware rewards.

安装插件收集

被引 10

Agentic AI: A Paradigm for Autonomous Decision-Making

代理人工智能：自主决策的范式

Meethun Panda, 2025-International Journal of Innovative Research in Science Engineering and Technology

In this paper, we explore the paradigm of Agentic AI, where generative AI systems are not limited to generate responses but become the agent in autonomous and context aware decisions. It discusses defining characteristics, architectural components, and the ethical considerations; real world applications included in autonomous vehicles, collaborative robots and personalized services. Challenges such as scalability and interpretability, as well as future opportunities such as interdisciplinary research and general-purpose adaptability, are also identified in the paper. In emerging arche of Agentic AI, dynamic decision making is addressed by methodological needs in industries.

安装插件收集

被引 3

Agentic AI for Self-Sovereign Identity: A Decentralized Zero Trust Framework for Autonomous Microservices

基于代理人工智能的自主身份：一种用于自主微服务的去中心化零信任框架

Damodhara Reddy Palavali, 2025-The International Journal of Computational Mathematical Ideas

The rapid evolution of Agentic Artificial Intelligence (AI)—autonomous, context-aware agents capable of self-directed decision-making—has introduced unprecedented security challenges for microservices architectures. Traditional session-based authentication, dependent on static tokens and centralized identity providers, is ill-suited for the dynamic, ephemeral, and machine-to-machine (M2M) interactions prevalent in zero trust environments. This paper investigates the convergence of Agentic AI and decentralized identity (DID) frameworks, emphasizing the role of verifiable credentials (VCs), dynamic token issuance, and contextual access control in enabling scalable, trust-minimized (i.e., reducing reliance on centralized authorities) service interactions. We propose a decentralized authentication and authorization framework where DIDs, maintained on blockchain-based registries, replace conventional identity silos, enabling autonomous agents to cryptographically prove trustworthiness without relying on persistent session states. Context-aware policy engines evaluate real-time telemetry such as location, workload, and behavioural patterns to issue short-lived, ephemeral access tokens with adaptive time-to-live (TTL) values. Experimental results from a Kubernetes-based microservices testbed with 50 simulated agents show that the proposed approach reduces authentication latency by 50% (from 180 ms to 90 ms), eliminates token replay vulnerabilities, and increases authentication throughput by 75% (from 800 to 1,400 agents/min) compared to OAuth2/JWT baselines. Furthermore, dynamic policy adaptation ensures immediate revocation of access when agents deviate from expected operational norms, minimizing attack surfaces. This work offers a novel synthesis of AI autonomy and decentralized identity principles, delivering both performance gains and enhanced security in zero trust microservices. The proposed architecture paves the way for resilient, self-governing ecosystems where Agentic AI can operate securely, efficiently, and adaptively in highly dynamic environments.

安装插件收集

被引 3

The Accidental Pump and Dump: When Agentic AI Meets Autonomous Trading

偶然的泵吸与倾倒：当代理人工智能遇见自主交易

David Byrd, 2025-Proceedings of the 6th ACM International Conference on AI in Finance

Through every economic sector, but especially among financial firms and enthusiasts, agentic AI systems are being tool-enabled, giving them control over large language models (LLM), reinforcement learning (RL) models, and more. We present a timely paper to explore the potential consequences with a novel combined system: a deep RL-based autonomous trading agent which also controls an LLM capable of posting to a simulated social media feed observed by other traders. As the agent trades, it also supplies order flow information to the LLM, which produces and posts natural language market analysis at the agent’s direction. We empirically investigate the performance and impact of such an agent using two DeepRL algorithms, finding that it learns to augment profit by manipulating sentiment in a sort of accidental pump and dump scheme. Along the way, we present confidence-building baseline results and specific insights from our investigation, before concluding with a discussion of results, limitations, and suggestions for future work.

安装插件收集

被引 2

A4FN: an Agentic AI Architecture for Autonomous Flying Networks

A4FN：一种用于自主飞行网络的代理人工智能架构

André Coelho, Pedro Ribeiro, Helder Fontes 等, 2025-2025 IEEE 36th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC)

This position paper presents A4FN, an Agentic Artificial Intelligence (AI) architecture for intent-driven automation in Flying Networks (FNs) using Unmanned Aerial Vehicles (UAVs) as access nodes. A4FN leverages Generative AI and Large Language Models (LLMs) to enable real-time, context-aware network control via a distributed agentic system. It comprises two components: the Perception Agent (PA), which semantically interprets multimodal input – including imagery, audio, and telemetry data – from UAV-mounted sensors to derive Service Level Specifications (SLSs); and the Decision-and-Action Agent (DAA), which reconfigures the network based on inferred intents. A4FN embodies key properties of Agentic AI, including autonomy, goal-driven reasoning, and continuous perception-action cycles. Designed for mission-critical, infrastructure-limited scenarios such as disaster response, it supports adaptive reconfiguration, dynamic resource management, and interoperability with emerging wireless technologies. The paper details the A4FN architecture, its core innovations, and open research challenges in multi-agent coordination and Agentic AI integration in next-generation FNs.

安装插件收集

被引 2

Edge Agentic AI Framework for Autonomous Network Optimisation in O-RAN

Abdelaziz Salama, Zeinab Nezami, Mohammed M. H. Qazzaz 等, 2025-2025 IEEE 36th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC)

The deployment of AI agents within legacy Radio Access Network (RAN) infrastructure poses significant safety and reliability challenges for future 6G networks. This paper presents a novel Edge AI framework for autonomous network optimisation in Open RAN environments, addressing these challenges through three core innovations: (1) a persona-based multi-tools architecture enabling distributed, context-aware decision-making; (2) proactive anomaly detection agent powered by traffic predictive tool; and (3) a safety, aligned reward mechanism that balances performance with operational stability.Integrated into the RAN Intelligent Controller (RIC), our framework leverages multimodal data fusion, including network KPIs, a traffic prediction model, and external information sources, to anticipate and respond to dynamic network conditions. Extensive evaluation using realistic 5G scenarios demonstrates that the edge framework achieves zero network outages under high-stress conditions, compared to 8.4% for traditional fixed-power networks and 3.3% for large language model (LLM) agent-based approaches, while maintaining near real-time responsiveness and consistent QoS. These results establish that, when equipped with the right tools and contextual awareness, AI agents can be safely and effectively deployed in critical network infrastructure, laying the framework for intelligent and autonomous 5G and beyond network operations.

安装插件收集

被引 8

Autonomous agents in the cloud: Advancing application management with agentic AI

云中的自主代理：通过代理式AI推进应用管理

Vamsi Krishna, Kumar Karanam, 2025-World Journal of Advanced Research and Reviews

Autonomous agents in cloud computing represent a transformative evolution beyond traditional automation approaches, enabling self-directed management of complex application environments. This article explores the architectural framework, implementation patterns, and operational benefits of Agentic AI in cloud-based application management. Unlike conventional automation systems constrained by static rules and predetermined workflows, autonomous agents leverage advanced machine learning techniques to perceive environmental conditions, learn from interactions, and take independent actions aligned with organizational objectives. The architectural foundation integrates sensing, reasoning, action, and feedback layers to create cognitive systems capable of addressing the inherent complexity of modern distributed applications. Key implementation patterns examined include intelligent auto-remediation, proactive capacity management, autonomous patch management, and continuous compliance enforcement—each demonstrating distinctive operational advantages across diverse industry contexts. Benefits include significant operational efficiency improvements, cost optimization through intelligent resource management, enhanced risk mitigation through proactive security measures, and scalability advantages in multi-cloud environments. The article addresses technical challenges related to decision boundaries and explainability, organizational considerations including skills gaps and operational model transformation, and governance requirements for responsible autonomous operations. Mitigation strategies incorporate phased implementation approaches, comprehensive explainability frameworks, and appropriate human oversight models to ensure effective and responsible deployment.

安装插件收集

The (R)evolution of Scientific Workflows in the Agentic AI Era: Towards Autonomous Science

代理人工智能时代科学工作流程的（R）革命：迈向自主科学

Woong Shin, Renan Souza, Daniel Rosendo 等, 2025-Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

Modern scientific discovery increasingly requires coordinating distributed facilities and heterogeneous resources, forcing researchers to act as manual workflow coordinators rather than scientists. Advances in AI leading to AI agents show exciting new opportunities that can accelerate scientific discovery by providing intelligence as a component in the ecosystem. However, it is unclear how this new capability would materialize and integrate in the real world. To address this, we propose a conceptual framework where workflows evolve along two dimensions which are intelligence (from static to intelligent) and composition (from single to swarm) to chart an evolutionary path from current workflow management systems to fully autonomous, distributed scientific laboratories. With these trajectories in mind, we present an architectural blueprint that can help the community take the next steps towards harnessing the opportunities in autonomous science with the potential for 100x discovery acceleration and transformational scientific workflows.CCS Concepts • Computing methodologies→Distributed computing methodologies; Multi-agent systems; Distributed computing methodologies; Parallel computing methodologies.

安装插件收集

被引 4

Agentic AI Meets Edge Computing in Autonomous UAV Swarms

基于代理人工智能的边缘计算在自主无人机集群中的应用

Thuan Minh Nguyen, V. T. Truong, Long Bao Le, 2026-IEEE Internet of Things Magazine

The integration of agentic AI, powered by large language models (LLMs) with autonomous reasoning, planning, and execution, into unmanned aerial vehicle (UAV) swarms opens new operational possibilities and brings the vision of the Internet of Drones closer to reality. However, infrastructure constraints, dynamic environments, and the computational demands of multi-agent coordination limit real-world deployment in high-risk scenarios such as wildfires and disaster response. This paper investigates the integration of LLM-based agentic AI and edge computing to realize scalable and resilient autonomy in UAV swarms. We first discuss three architectures for supporting UAV swarms - standalone, edge-enabled, and edge-cloud hybrid deployment - each optimized for varying autonomy and connectivity levels. Then, a use case for wildfire search and rescue (SAR) is designed to demonstrate the efficiency of the edge-enabled architecture, enabling high SAR coverage, reduced mission completion times, and a higher level of autonomy compared to traditional approaches. Finally, we highlight open challenges in integrating LLMs and edge computing for mission-critical UAV-swarm applications.

安装插件收集

被引 2

Data Systems as Autonomous Agents: Applying Agentic AI to Big Data Platforms

数据系统作为自主代理：将代理人工智能应用于大数据平台

Narendra Reddy Mudiyala, 2026-Journal of Information Systems Engineering and Management

Enterprise data processing environments face increasing operational complexity that exceeds traditional manual management capabilities. Current Big Data platforms rely on reactive operational models that respond to system issues after they impact performance and user experience. This article introduces autonomous agent architectures that transform data platforms into intelligent systems capable of independent perception, reasoning, and action execution. The proposed framework integrates perception layers for comprehensive system monitoring, decision models that balance multiple competing objectives, and action orchestration mechanisms that implement optimizations automatically. Autonomous capabilities enable continuous performance tuning, intelligent failure recovery, dynamic cost optimization, and automated policy enforcement without human intervention. The architecture maintains scalability and fault tolerance characteristics while adding sophisticated reasoning capabilities that adapt to changing operational conditions. Implementation strategies offer practical deployment methods that reduce disruption while gradually adding autonomous features. The operational transformation enables proactive optimization that predicts and prevents issues before they affect system performance. Human-agent collaboration frameworks define effective interaction models that balance oversight with system autonomy. Risk mitigation strategies ensure safe autonomous operation through bounded decision-making and comprehensive safeguards. Performance evaluation metrics demonstrate significant improvements in operational efficiency, cost reduction, and system reliability through autonomous operation.

安装插件收集

Toward Agentic AI Networking in 6G: A Generative Foundation Model-as-Agent Approach

迈向6G中的代理式AI网络：基于生成基础模型作为代理的方法

Yong Xiao, Guangming Shi, Ping Zhang, 2025-IEEE Communications Magazine2区IF 8.2

The promising potential of AI and network convergence in improving networking performance and enabling new service capabilities has recently attracted significant interest. Existing network AI solutions, while powerful, are mainly built based on the close-loop and passive learning framework, resulting in major limitations in autonomous solution finding and dynamic environmental adaptation. Agentic AI has recently been introduced as a promising solution to address the above limitations and pave the way for true, generally intelligent, and beneficial AI systems. The key idea is to create a networking ecosystem to support a diverse range of autonomous and embodied AI agents in fulfilling their goals. In this article, we focus on the novel challenges and requirements of agentic AI networking. We propose AgentNet, a novel framework for supporting interaction, collaborative learning, and knowledge transfer among AI agents. We introduce a general architectural framework of AgentNet and then propose a generative foundation model (GFM)-based implementation in which multiple GFM-as-agents have been created as an interactive knowledge-base to bootstrap the development of embodied AI agents according to different task requirements and environmental features. We consider two application scenarios, digital-twin-based industrial automation and metaverse-based infotainment system, to describe how to apply AgentNet for supporting efficient task-driven collaboration and interaction among AI agents.

安装插件收集

被引 13

Intent-Based Infrastructure and Service Orchestration Using Agentic-AI

基于代理人工智能的意图驱动基础设施和服务编排

Dimitrios Brodimas, Alexios N. Birbas, Dimitrios Kapolos 等, 2025-IEEE Open Journal of the Communications Society3区IF 6.1

This paper introduces a novel framework that integrates agentic Artificial Intelligence (AI) with Intent-Based Networks (IBN) to enable autonomous management, configuration, and optimization of mobile network services and resources. Leveraging the advanced reasoning and natural language processing capabilities of an Large Language Model (LLM), the proposed architecture translates high-level user intents into precise network actions, facilitating user-friendly and scalable network orchestration. The framework employs a distributed multi-agent system, where specialized agents collaborate to decompose user intents, provide computational infrastructure, and deploy services using industry-standard Infrastructure-as-Code (IaC) tools. By supporting natural language interactions, the system reduces operational complexity and enhances accessibility for users with varying technical expertise. Experimental evaluations demonstrate significant improvements in task completion rates, response accuracy, and operational efficiency compared to traditional manual methods, particularly for complex network management tasks. In essence, this work creates an intelligent network orchestration framework that adapts to user needs by automatically configuring network and computing resources while operating with minimal human intervention.

安装插件收集

被引 13

Agentic AI Framework for End-to-End Medical Data Inference

端到端医学数据推理的代理人工智能框架

Soorya Ram Shimgekar, Shayan Vassef, Abhay Goyal 等, 2025-2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Developing clinical ML systems is costly and labor-intensive due to fragmented preprocessing, privacy constraints, and model-data alignment challenges. We introduce a modular agentic AI framework that automates the end-to-end ML lifecycle, from ingestion and anonymization to preprocessing, model selection, and interpretable inference. Each agent performs a well-defined task, enabling scalable workflows across structured and unstructured data. We evaluate the framework on public datasets from geriatrics, palliative care, and colonoscopy imaging. Data are automatically classified, anonymized via DLP, semantically represented, and mapped to suitable models using embedding- or LLM-based strategies. Preprocessing and inference agents ensure compatibility and produce interpretable outputs (e.g., SHAP, attention maps). By consolidating manual tasks into coordinated autonomous agents, our approach reduces expert intervention, lowers operational costs, and supports scalable clinical ML deployment.

安装插件收集

被引 12

Agentic AI in radiology: emerging potential and unresolved challenges

放射学中的代理人工智能：新兴潜力与未解决的挑战

Nicholas Dietrich, 2025-British Journal of Radiology4区IF 3.4

Abstract This commentary introduces agentic artificial intelligence (AI) as an emerging paradigm in radiology, marking a shift from passive, user-triggered tools to systems capable of autonomous workflow management, task planning, and clinical decision support. Agentic AI models may dynamically prioritize imaging studies, tailor recommendations based on patient history and scan context, and automate administrative follow-up tasks, offering potential gains in efficiency, triage accuracy, and cognitive support. While not yet widely implemented, early pilot studies and proof-of-concept applications highlight promising utility across high-volume and high-acuity settings. Key barriers, including limited clinical validation, evolving regulatory frameworks, and integration challenges, must be addressed to ensure safe, scalable deployment. Agentic AI represents a forward-looking evolution in radiology that warrants careful development and clinician-guided implementation.

安装插件收集

被引 10

Agentic AI: A Quantitative Analysis of Performance and Applications

代理人工智能：性能与应用的定量分析

P. Sawant, 2025-Journal of Advances in Artificial Intelligence

: This study presents a comprehensive quantitative analysis of Agentic AI performance and applications across various industries. Agentic Artificial Intelligent (AI), an emerging field combining advanced AI techniques with enterprise automation, has shown promise in creating autonomous agents capable of complex decision-making and problem-solving. Our research, conducted over a 12-month period, employed a mixed-methods approach, analyzing data from 500 organizations and incorporating insights from 50 industry experts. The study aimed to evaluate the efficiency, accuracy, and impact of Agentic AI systems compared to traditional AI approaches. Results demonstrate that Agentic AI systems significantly outperform traditional AI, with a 34.2% reduction in task completion time, 7.7% increase in accuracy, and 13.6% improvement in resource utilization. Productivity gains varied across industries, with the technology sector showing the highest improvement at 45%. The study also revealed high scalability of Agentic AI solutions across different organizational sizes, although implementation time increased with organization complexity. Key challenges identified include data privacy concerns, integration difficulties with legacy systems, skill gaps, and ethical considerations. Despite these challenges, the study concludes that Agentic AI has significant potential to transform business processes and decision-making across various sectors. Future research directions include enhancing interpretability, optimizing domain-specific applications, and exploring multi-agent collaborations. This research contributes valuable insights into the current state and future prospects of Agentic AI, providing a foundation for further development and implementation strategies in this rapidly evolving field.

安装插件收集

被引 8

Agentic AI Workflows in Cybersecurity: Opportunities, Challenges, and Governance via the MCP Model

网络安全中的代理人工智能工作流程：机遇、挑战与通过MCP模型进行治理

Sri Keerthi Suggu, 2025-Journal of Information Systems Engineering and Management

The rise of Agentic AI—autonomous systems capable of executing tasks with self-directed decision-making—presents transformative potential for cybersecurity operations. However, as these systems begin to operate across threat detection, response orchestration, and policy enforcement, they introduce novel attack surfaces, decision-making opacity, and governance complexity. This paper introduces the Model–Control–Policy (MCP) framework as a structured approach to governing agentic AI workflows in cybersecurity. Through deep technical analysis, case studies including autonomous SOC agents and adaptive threat mitigation bots, and an evaluation of existing controls (e.g., explainability, human-in-the-loop, red-teaming), we explore how governance strategies must evolve to meet this new paradigm. We also propose specific policy recommendations and architectural safeguards to ensure accountability, resilience, and trust in AI-driven cybersecurity systems.

安装插件收集

被引 6

Agentic AI for Intent-Based Industrial Automation

基于代理的人工智能在基于意图的工业自动化中的应用

Marcos Lima Romero, Ricardo Suyama, 2025-2025 16th IEEE International Conference on Industry Applications (INDUSCON)

The recent development of Agentic AI systems, empowered by autonomous large language models (LLMs) agents with planning and tool-usage capabilities, enables new possibilities for the evolution of industrial automation and reduces the complexity introduced by Industry 4.0. This work proposes a conceptual framework that integrates Agentic AI with the intent-based paradigm, originally developed in network research, to simplify human-machine interaction (HMI) and better align automation systems with the human-centric, sustainable, and resilient principles of Industry 5.0. Based on the intent-based processing, the framework allows human operators to express high-level business or operational goals in natural language, which are decomposed into actionable components. These intents are broken into expectations, conditions, targets, context, and information that guide sub-agents equipped with specialized tools to execute domain-specific tasks. A proof of concept was implemented using the CMAPSS dataset and Google Agent Developer Kit (ADK), demonstrating the feasibility of intent decomposition, agent orchestration, and autonomous decision-making in predictive maintenance scenarios. The results confirm the potential of this approach to reduce technical barriers and enable scalable, intent-driven automation, despite data quality and explainability concerns.

安装插件收集

被引 7

Agentic AI for Cultural Heritage: Embedding Risk Memory in Semantic Digital Twins

文化遗产领域的代理人工智能：嵌入语义数字孪生的风险记忆

Georgios Pavlidis, 2025-Computers4区IF 4.2

Cultural heritage preservation increasingly relies on data-driven technologies, yet most existing systems lack the cognitive and temporal depth required to support meaningful, transparent, and policy-informed decision-making. This paper proposes a conceptual framework for memory-enabled, semantically grounded AI agents in the cultural domain, showing how the integration of the ICCROM/CCI ABC method for risk assessment into the Panoptes ontology enables the structured encoding of risk cognition over time. This structured risk memory becomes the foundation for agentic reasoning, supporting prioritization, justification, and long-term preservation planning. It is argued that this approach constitutes a principled step toward the development of Cultural Agentic AI: autonomous systems that remember, reason, and act in alignment with cultural values. Proof-of-concept simulations illustrate how memory-enabled agents can trace evolving risk patterns, trigger policy responses, and evaluate mitigation outcomes through structured, explainable reasoning.

安装插件收集

被引 6

Autonomic Microservice Management via Agentic AI and MAPE-K Integration

基于代理人工智能和MAPE-K集成的自组织微服务管理

Matteo Esposito, Alexander Bakhtin, Noman Ahmad 等, 2025-European Conference on Software Architecture

While microservices are revolutionizing cloud computing by offering unparalleled scalability and independent deployment, their decentralized nature poses significant security and management challenges that can threaten system stability. We propose a framework based on MAPE-K, which leverages agentic AI, for autonomous anomaly detection and remediation to address the daunting task of highly distributed system management. Our framework offers practical, industry-ready solutions for maintaining robust and secure microservices. Practitioners and researchers can customize the framework to enhance system stability, reduce downtime, and monitor broader system quality attributes such as system performance level, resilience, security, and anomaly management, among others.

安装插件收集

被引 4

Enterprise API & Platform Strategy in the era of Agentic AI

在代理式人工智能时代的企业API与平台战略

Ashay Satav, 2025-Journal of Computer Science and Technology Studies

This research paper investigates the critical importance of robust API and platform strategies for enterprises adapting to the proliferation of agentic AI, wherein AI systems autonomously execute tasks with limited human intervention. It addresses the imperative of facilitating seamless communication among AI agents, enterprise data systems, and external applications. The research examines the architectural and performance considerations essential for organizations to maintain competitiveness in this rapidly growing technological landscape of agentic AI projected to expand from $5.1 billion in 2024 to $47.1 billion by 2030. Key elements explored include unified data layer APIs, zero-trust authorization models, event-driven orchestration, and latency-sensitive design. Furthermore, the study considers emerging trends such as AI-powered SDKs, self-optimizing API gateways, autonomous API discovery, and ethical AI governance APIs. The findings emphasize that the adoption of modern API and platform architectures, optimization of performance metrics, and adherence to regulatory mandates are paramount for organizations to fully capitalize on the transformative potential of agentic AI. It is posited that enterprises embracing this paradigm shift will achieve a demonstrable competitive advantage, fostering innovation and operational excellence in the AI-driven future.

安装插件收集

被引 4

Efficient Tool Use with Chain-of-Abstraction Reasoning

基于抽象链推理的效率工具使用

Silin Gao, Jane Dwivedi-Yu, Ping Yu 等, 2024-International Conference on Computational Linguistics

To achieve faithful reasoning that aligns with human expectations, large language models (LLMs) need to ground their reasoning to real-world knowledge (e.g., web facts, math and physical rules). Tools help LLMs access this external knowledge, but there remains challenges for fine-tuning LLM agents (e.g., Toolformer) to invoke tools in multi-step reasoning problems, where inter-connected tool calls require holistic and efficient tool usage planning. In this work, we propose a new method for LLMs to better leverage tools in multi-step reasoning. Our method, Chain-of-Abstraction (CoA), trains LLMs to first decode reasoning chains with abstract placeholders, and then call domain tools to reify each reasoning chain by filling in specific knowledge. This planning with abstract chains enables LLMs to learn more general reasoning strategies, which are robust to shifts of domain knowledge (e.g., math results) relevant to different reasoning questions. It also allows LLMs to perform decoding and calling of external tools in parallel, which avoids the inference delay caused by waiting for tool responses. In mathematical reasoning and Wiki QA domains, we show that our method consistently outperforms previous chain-of-thought and tool-augmented baselines on both in-distribution and out-of-distribution test sets, with an average ~6% absolute QA accuracy improvement. LLM agents trained with our method also show more efficient tool use, with inference speed being on average ~1.4x faster than baseline tool-augmented LLMs.

安装插件收集

被引 43

AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning

AvaTaR：通过对比推理优化LLM代理以使用工具

Shirley Wu, Shiyu Zhao, Qian Huang 等, 2024-Advances in Neural Information Processing Systems 37

Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. However, developing prompting techniques that enable LLM agents to effectively use these tools and knowledge remains a heuristic and labor-intensive task. Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task. During optimization, we design a comparator module to iteratively deliver insightful and comprehensive prompts to the LLM agent by contrastively reasoning between positive and negative examples sampled from training data. We demonstrate AvaTaR on four complex multimodal retrieval datasets featuring textual, visual, and relational information, and three general question-answering (QA) datasets. We find AvaTaR consistently outperforms state-of-the-art approaches across all seven tasks, exhibiting strong generalization ability when applied to novel cases and achieving an average relative improvement of 14% on the Hit@1 metric for the retrieval datasets and 13% for the QA datasets. Code and dataset are available at https://github.com/zou-group/avatar.

安装插件收集

被引 64

ACEBench: A Comprehensive Evaluation of LLM Tool Usage

Chen Chen, Xinlong Hao, Weiwen Liu 等, 2025-Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs’ tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.

安装插件收集

被引 7

Toolken+: Improving LLM Tool Usage with Reranking and a Reject Option

Toolken+：通过重排序和拒绝选项改进LLM工具使用

K. Yakovlev, Sergey Nikolenko, A. Bout, 2024-Conference on Empirical Methods in Natural Language Processing

The recently proposed ToolkenGPT tool learning paradigm demonstrates promising performance but suffers from two major issues: first, it cannot benefit from tool documentation, and second, it often makes mistakes in whether to use a tool at all. We introduce Toolken+ that mitigates the first problem by reranking top $k$ tools selected by ToolkenGPT and the second problem with a special"Reject"option such that the model will generate a vocabulary token if"Reject"is ranked first. We demonstrate the effectiveness of Toolken+ on multistep numerical reasoning and tool selection tasks.

安装插件收集

被引 4

TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning

TableMind：一种用于工具增强表格推理的自主程序化代理

Chuang Jiang, Mingyue Cheng, Xiaoyu Tao 等, 2025-Web Search and Data Mining

Table reasoning requires models to jointly perform comprehensive semantic understanding and precise numerical operations. Although recent large language model (LLM)-based methods have achieved promising results, most of them still rely on a single-turn reasoning paradigm that processes flattened tables in a single forward pass. This paradigm suffers from inherent limitations, including context overflow on large tables, weak sensitivity to continuous numerical values, and the absence of explicit tool-use and reflection. In this paper, we propose TableMind, a tuning-based autonomous programmatic table agent that simulates the human-like cognitive schema of multi-turn interaction within a lightweight LLM. Instead of adopting a training-free workflow design, TableMind learns to internalize planning, action, and reflection through a principled two-stage training strategy. To bootstrap structured table reasoning capabilities, we construct and filter high-quality reasoning data for the supervised fine-tuning (SFT) stage. To enable precise code generation, we introduce a designed multi-perspective reward scheme and a novel optimization objective in the reinforcement learning (RL) stage. Extensive experiments on diverse benchmarks demonstrate that TableMind consistently outperforms previous baselines, validating the effectiveness of training autonomous agents to improve overall performance.

安装插件收集

被引 13

Thinking Isn't an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations

思维并非幻觉：通过工具增强克服推理模型局限性

Zhao Song, Song Yue, Jiahao Zhang, 2025-Robotics4区IF 3.3

Large Reasoning Models (LRMs) have become a central focus in today’s large language model (LLM) research, where models are designed to output a step-by-step thinking process before arriving at a final answer to handle complex reasoning tasks. Despite their promise, recent empirical studies (e.g., [Shojaee et al., 2025] from Apple) suggest that this thinking process may not actually enhance reasoning ability, where LLMs without explicit reasoning actually outperform LRMs on tasks with low or high complexity. In this work, we revisit these findings and investigate whether the limitations of LRMs persist when tool augmentations are introduced. We incorporate two types of tools, Python interpreters and scratchpads, and evaluate three representative LLMs and their LRM counterparts on Apple’s benchmark reasoning puzzles. Our results show that, with proper tool use, LRMs consistently outperform their non-reasoning counterparts across all levels of task complexity. These findings challenge the recent narrative that reasoning is an illusion and highlight the potential of tool-augmented LRMs for solving complex problems. Our source code is available at https: //github.com/magiclinux/thinking_is_not_an_illusion.

安装插件收集

被引 7

MapAgent: A Hierarchical Agent for Geospatial Reasoning with Dynamic Map Tool Integration

MapAgent：一种具有动态地图工具集成的层次化地理空间推理智能体

Md Hasebul Hasan, Mahir Labib Dihan, Mohammed Eunus Ali 等, 2025-Conference of the European Chapter of the Association for Computational Linguistics

Agentic AI has significantly extended the capabilities of large language models (LLMs) by enabling complex reasoning and tool use. However, most existing frameworks are tailored to domains such as mathematics, coding, or web automation, and fall short on geospatial tasks that require spatial reasoning, multi-hop planning, and real-time map interaction. To address these challenges, we introduce MapAgent, a hierarchical multi-agent plug-and-play framework with customized toolsets and agentic scaffolds for map-integrated geospatial reasoning. Unlike existing flat agent-based approaches that treat tools uniformly-often overwhelming the LLM when handling similar but subtly different geospatial APIs-MapAgent decouples planning from execution. A high-level planner decomposes complex queries into subgoals, which are routed to specialized modules. For tool-heavy modules-such as map-based services-we then design a dedicated map-tool agent that efficiently orchestrates related APIs adaptively in parallel to effectively fetch geospatial data relevant for the query, while simpler modules (e.g., solution generation or answer extraction) operate without additional agent overhead. This hierarchical design reduces cognitive load, improves tool selection accuracy, and enables precise coordination across similar APIs. We evaluate MapAgent on four diverse geospatial benchmarks-MapEval-Textual, MapEval-API, MapEval-Visual, and MapQA-and demonstrate substantial gains over state-of-the-art tool-augmented and agentic baselines. We open-source our framwork at https://github.com/Hasebul/MapAgent.

安装插件收集

被引 1

Robots can feel: LLM-based Framework for Robot Ethical Reasoning

机器人也能感知：基于LLM的机器人伦理推理框架

Artem Lykov, Miguel Altamirano Cabrera, Koffivi Fidele Gbagbe 等, 2024-2024 2nd International Conference on Foundation and Large Language Models (FLLM)

This paper presents the development of a novel ethical reasoning framework for robots. "Robots can feel" is the first system for robots that utilizes a combination of logic and human-like emotion simulation to make decisions in morally complex situations akin to humans. The key feature of the approach is the management of the Emotion Weight Coefficient, a customizable parameter to assign the role of emotions in robot decision-making. The system aims to serve as a tool that can equip robots of any form and purpose with ethical behavior close to human standards. Besides the platform, the system is independent of the choice of the base model. During the evaluation, the system was tested on 8 top-up-to-date LLMs (Large Language Models). This list included both commercial and open-source models developed by various companies and countries. The research demonstrated that, regardless of the model choice, the Emotions Weight Coefficient influences the robot’s decision similarly. According to ANOVA analysis, the use of different Emotion Weight Coefficients influenced the final decision in a range of situations, such as in a request for a dietary violation (F (4, 35) = 11.2, p = 0.0001) and in an animal compassion situation (F (4, 35) = 8.5441, p = 0.0001). A demonstration code repository is provided at: https://github.com/TemaLykov/robots_can_feel

安装插件收集

被引 4

Multi-tool Integration Application for Math Reasoning Using Large Language Model

基于大型语言模型的多工具集成应用在数学推理中的应用

Zhihua Duan, Jialin Wang, 2024-2024 IEEE 10th International Conference on Edge Computing and Scalable Cloud (EdgeCom)

Mathematical reasoning is an important research direction in the field of artificial intelligence. This article proposes a novel multi tool application framework for mathematical reasoning, aiming to achieve more comprehensive and accurate mathematical reasoning by utilizing the collaborative effect of large language models (LLMs) and multiple external tools. Firstly, use a Math Tool to perform basic mathematical calculations during the inference process through interaction with LLM. Secondly, Code Tool can generate code fragments that comply with syntax rules and execute them, providing support for complex mathematical problems. Then, through the iterative reasoning of the CoT Tool, the logical coherence and accuracy of mathematical reasoning are enhanced. Ultimately, by using self consistency tools to select the final answer based on different parameters, the consistency and reliability of reasoning are improved. Through the synergistic effect of these tools, the framework has achieved significant performance improvement in mathematical reasoning tasks. We conducted experiments on the NumGLUE Task 4 test set, which includes 220 mathematical reasoning fill in the blank questions. The experimental results showed that, based on Math Tool, Code Tool, and CoT Tool, in Task 4 task,our method achieved an accuracy of 89.09,compared with the GPT3+FewShot baseline, Few Shot+ERNIE-4.0+self consistency improved by 49.09%, and compared with fine-tuning the Fine tuning baseline, Few Shot+ERNIE-4.0+self consistency improved by 52.29%

安装插件收集

被引 1

SMART: Self-Aware Agent for Tool Overuse Mitigation

SMART：用于工具过度使用缓解的自我意识智能体

Cheng Qian, Emre Can Acikgoz, Hongru Wang 等, 2025-Annual Meeting of the Association for Computational Linguistics

Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to Tool Overuse, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce SMART (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent's self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce SMART-ER, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop SMARTAgent, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match its 70B counterpart and GPT-4o. Additionally, SMARTAgent generalizes to out-of-distribution test data like GSM8K and MINTQA, maintaining accuracy with just one-fifth the tool calls. These highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.

安装插件收集

被引 37

ToolQA: A Dataset for LLM Question Answering with External Tools

ToolQA：用于具有外部工具的LLM问答的数据集

Yuchen Zhuang, Yue Yu, Kuan Wang 等, 2023-Neural Information Processing Systems

Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.

安装插件收集

被引 360

EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records

EHRAgent：代码赋能大型语言模型在电子健康记录上的少样本复杂表格推理

Wenqi Shi, Ran Xu, Yuchen Zhuang 等, 2024-Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Clinicians often rely on data engineers to retrieve complex patient information from electronic health record (EHR) systems, a process that is both inefficient and time-consuming. We propose EHRAgent, a large language model (LLM) agent empowered with accumulative domain knowledge and robust coding capability. EHRAgent enables autonomous code generation and execution to facilitate clinicians in directly interacting with EHRs using natural language. Specifically, we formulate a multi-tabular reasoning task based on EHRs as a tool-use planning process, efficiently decomposing a complex task into a sequence of manageable actions with external toolsets. We first inject relevant medical information to enable EHRAgent to effectively reason about the given query, identifying and extracting the required records from the appropriate tables. By integrating interactive coding and execution feedback, EHRAgent then effectively learns from error messages and iteratively improves its originally generated code. Experiments on three real-world EHR datasets show that EHRAgent outperforms the strongest baseline by up to 29.6% in success rate, verifying its strong capacity to tackle complex clinical tasks with minimal demonstrations.

安装插件收集

被引 83

TelAgentBench: A Multi-faceted Benchmark for Evaluating LLM-based Agents in Telecommunications

TelAgentBench：评估基于大型语言模型（LLM）的电信领域代理的多维度基准

Sunwoo Lee, Daseong Jang, Dhammiko Arya 等, 2025-Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

As Large Language Models (LLMs) evolve into powerful agentic systems, the telecommunications industry’s expansion into AI services necessitates industry-grounded benchmarks to evaluate their underexplored domain-specific capabilities. To address the gap left by generic benchmarks that fail to assess realistic, non-English performance, we present TelAgent-Bench, a Korean benchmark for the telecommunications domain evaluating five core agen-tic capabilities: Reasoning, Planning, Action (tool-use), Retrieval-Augmented Generation, and Instruction Following. Evaluations reveal significant performance disparities between models that employ explicit reasoning and those that do not, providing actionable insights for deploying agentic LLMs in real-world telecommunications tasks.

安装插件收集

被引 2

Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search

木星：通过笔记本和推理时价值引导搜索增强大型语言模型的数据分析能力

Shuocheng Li, Yihao Liu, Siling Du 等, 2025-AAAI Conference on Artificial Intelligence

Large language models (LLMs) have shown great promise in automating data science workflows. However, existing models still struggle with multi-step reasoning and tool use, limiting their effectiveness on complex data analysis tasks. To address this limitation, we propose a scalable pipeline that extracts high-quality, tool-based data analysis tasks and their executable multi-step solutions from real-world Jupyter notebooks and associated data files. Using this pipeline, we introduce NbQA, a large-scale dataset of standardized task–solution pairs that reflect authentic tool-use patterns in practical data science scenarios. To further enhance the multi-step reasoning capabilities, we present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning. During inference, Jupiter combines the value model and node visit counts to efficiently collect executable multi-step plans with minimal search steps. Experimental results show that Qwen2.5-7B and 14B-Instruct models on NbQA solve 77.82% and 86.38% of tasks on InfiAgent-DABench, respectively—matching or surpassing GPT-4o and advanced agent frameworks. Further evaluations demonstrate improved generalization and stronger tool-use reasoning across diverse multi-step reasoning tasks.

安装插件收集

被引 1

Measuring an LLM's Proficiency at using APIs: A Query Generation Strategy

评估大型语言模型使用API能力的方法：一种查询生成策略

Ying Sheng, Sudeep Gandhe, Bhargav Kanagal 等, 2024-Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Connecting Large Language Models (LLMs) with the ability to leverage APIs (Web Search, Charting, Calculators, Calendar, Flight Search, Hotel Search, Data Lookup, etc. ) is likely to allow us to solve a variety of new hard problems. Several research efforts have made this observation and suggested recipes for LLMs to emit API calls, and proposed mechanisms by which they can generate additional text conditioned on the output for the API call. However, in practice, the focus has been on relatively simple slot-filling tasks that make an API call rather unlocking novel capabilities by combining different tools, reasoning over the response from a tool, making multiple invocations, or complex planning. In this paper, we pose the following question: what does it mean to say that an LLM is proficient at using a set of APIs? We answer this question in the context of structured APIs by defining seven capabilities for API-use. We provide an approach for generating synthetic tasks that exercise each of these capabilities given only the description of an API. We argue that this provides practitioners with a principled way to construct a dataset to evaluate an LLM's ability to use a given set of APIs. Through human evaluations, we show that our approach produces high-quality tasks for each of the seven capabilities. We also describe how we used this approach to on-board new API and create principled evaluation sets for multiple LLM-based products.

安装插件收集

被引 9

Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents

野外的工具学习：赋予语言模型作为自动工具代理的能力

Zhengliang Shi, Shen Gao, Xiuyi Chen 等, 2024-Proceedings of the ACM on Web Conference 2025

Augmenting large language models (LLMs) with external tools has emerged as a promising approach to extend their utility, enabling them to solve practical tasks. Previous methods manually parse tool documentation and create in-context demonstrations, transforming tools into structured formats for LLMs to use in their step-by-step reasoning. However, this manual process requires domain expertise and struggles to scale to large toolsets. Additionally, these methods rely heavily on ad-hoc inference techniques or special tokens to integrate free-form LLM generation with tool-calling actions, limiting the LLM's flexibility in handling diverse tool specifications and integrating multiple tools. In this work, we propose AutoTools, a framework that enables LLMs to automate the tool-use workflow. Specifically, the LLM automatically transforms tool documentation into callable functions, verifying syntax and runtime correctness. Then, the LLM integrates these functions into executable programs to solve practical tasks, flexibly grounding tool-use actions into its reasoning processes. Extensive experiments on existing and newly collected, more challenging benchmarks illustrate the superiority of our framework. Inspired by these promising results, we further investigate how to improve the expertise of LLMs, especially open-source LLMs with fewer parameters, within AutoTools. Thus, we propose the AutoTools-Learning approach, training the LLMs with three learning tasks on 34k instances of high-quality synthetic data, including documentation understanding, relevance learning, and function programming. Fine-grained results validate the effectiveness of our overall training approach and each individual task. Our methods are an important step towards the use of LLMs for solving real-world tasks with external tools.

安装插件收集

被引 54

LLM-Collab: a framework for enhancing task planning via chain-of-thought and multi-agent collaboration

LLM-Collab：一种通过思维链和多智能体协作增强任务规划的框架

Hongyu Cao, Rong Ma, Yanlong Zhai 等, 2024-Applied Computing and Intelligence

Large language models have shown strong capabilities in performing natural language planning tasks, largely due to the chain-of-thought method, which enhances their ability to solve complex tasks through explicit intermediate inference. However, they face challenges in acquiring new knowledge, executing calculations, and interacting with the environment. Although previous work has enabled large language models to use external tools to improve reasoning and environmental interaction, there was no scalable or cohesive structure for these technologies. In this paper, we present LLM-Collab, where Collab represents the cooperative interaction between two AI agents, and the large language model plays a key role in the creation of AI agents. For this method, we took large language models as the reasoning core for AI agents and designed two AI agents to cooperate on the planning tasks: One as an analyst for tool selection and phase validation, and the other as an executor of specific tasks. Our method provided a comprehensive list of external tools to facilitate the invocation and integration of agents, ensuring a seamless collaboration process. This paradigm established a unified framework for autonomous task-solving based on massive language models by demonstrating how language communication and tool selection enable multi-agent collaboration.

安装插件收集

被引 7

An Agentic Framework for Social Event Forecasting: Approaches Using Causality Contextualized Chain of Thought

基于代理框架的社会事件预测：因果情境化思维链方法

A. Thakur, Aditya Sampath, Siddharth Krishnan, 2025-2025 IEEE International Conference on Data Mining Workshops (ICDMW)

Recent advances in large language models (LLMs) have enabled impressive progress across diverse tasks, yet interpretability remains a core requirement for deployment in high-stakes domains such as crisis prevention and policy-making. Prior work on event prediction has largely prioritized accuracy, but the reasoning behind model outputs often remains opaque and difficult to audit. In this paper, we propose C3OT, Causality Contextualized Chain-of-Thought, which integrates causal reasoning into an agentic LLM framework using the ReAct paradigm. We design and evaluate multiple prompting strategies-including Causal Chain Learning, Chain-of-Thought, and more nuanced hybrid approaches. Experiments assess both predictive accuracy and interpretability, the latter measured through structured rubrics that capture transparency, causal coherence, and auditability. Results demonstrate that our causal reasoning approach attains competitive predictive performance while producing more transparent and auditable reasoning traces. These findings underscore the value of causal reasoning for enhancing both trustworthiness and robustness in sociopolitical forecasting.

安装插件收集

ReAct-Diffuse: An Integrated Agentic and Generative Diffusion Framework for Autonomous Multi-Step Task Reasoning and Execution

ReAct-Diffuse：一种用于自主多步任务推理和执行的集成代理和生成扩散框架

Madhur Thapliyal, Geetika Sharma, Raman Sharma, 2026-2026 8th International Conference on Intelligent Sustainable Systems (ICISS)

Generative AI systems and rational/active agents continue to struggle with long-horizon multi-step tasks, due to reasoning drift, unstable planning, and use of tools that are unreliable. ReAct-based agents currently are interpretable, but not robust to execution; diffusion-based planners generate smooth motion plans without a clear semantic grounding or tool-awareness. To overcome these shortcomings, this paper offers ReAct-Diffuse, a hybrid agentic-generative model that combines the structured ReAct reasoning with the diffusion-based plan refinement facilitating consistent and dependable autonomous task execution. The architecture consists of a twostage pipeline: A ReAct reasoning component first generates an explicit trace of reasoning and draft action plans; and a temporal-diffusion-refinement mechanism in the second stage denoises these interim plans while optimizing them for coherence, feasibility, and tool-use precision. The resulting location leading curves are implemented using an agentic control loop with feedback-based re-planning and safety constraints. We evaluate the proposed method on standard multi-step reasoning and tool-use benchmarks, e.g ALF World and BabyAI-MiniGrid with the evaluation metrics of plan coherence, execution success rate (ESR) and tool-use accuracy. In experiments3,17-19, the results indicate that ReAct-Diffuse is able to generate plans of 91.3% plan coherence rate, with a 88.7% execution success and with a 92.5% tool-use accuracy all outperforms state-of-the-art agentic systems including ReAct-GPT-4, Auto-GPT, Voyager and diffusion-only planners. These results demonstrate that complementing our explicit agentic reasoning with diffusion-based refinement considerably improves long-horizon autonomy, execution stability, and decision reliability in dynamic environments.

安装插件收集

Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling

基于代理的人工智能家庭能源管理系统：一种用于住宅负荷调度的大型语言模型框架

Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl, 2025-Results in Engineering2区IF 7.9

The electricity sector transition requires substantial increases in residential demand response capacity, yet Home Energy Management Systems (HEMS) adoption remains limited by user interaction barriers requiring translation of everyday preferences into technical parameters. While large language models have been applied to energy systems as code generators and parameter extractors, no existing implementation deploys LLMs as autonomous coordinators managing the complete workflow from natural language input to multi-appliance scheduling. This paper presents an agentic AI HEMS where LLMs autonomously coordinate multi-appliance scheduling from natural language requests to device control, achieving optimal scheduling without example demonstrations. A hierarchical architecture combining one orchestrator with three specialist agents uses the ReAct pattern for iterative reasoning, enabling dynamic coordination without hardcoded workflows while integrating Google Calendar for context-aware deadline extraction. Evaluation across three open-source models using real Austrian day-ahead electricity prices reveals substantial capability differences. Llama-3.3-70B successfully coordinates all appliances across all scenarios to match cost-optimal benchmarks computed via mixed-integer linear programming, while other models achieve perfect single-appliance performance but struggle to coordinate all appliances simultaneously. Progressive prompt engineering experiments demonstrate that analytical query handling without explicit guidance remains unreliable despite models'general reasoning capabilities. We open-source the complete system including orchestration logic, agent prompts, tools, and web interfaces to enable reproducibility, extension, and future research.

安装插件收集

被引 3

Reactive to Agentic:Next Gen AI Agent Transition Framework

从反应型到代理型：下一代人工智能代理过渡框架

Rinki Singh, Tammana Sachdeva, Sabita 等, 2025-2025 3rd International Conference on Advances in Computation, Communication and Information Technology (ICAICCIT)

The development of the artificial intelligence (AI) system has advanced past simple reactivity-supported and rule-based agents to high-level learning mechanisms. The current shift that is emerging is associated with agentic AI systems, that are autonomous, goal-directed, and reason based and adaptive decision-making. The paper introduces a formal transition model that is used in developing agentic agents of the next generation. We describe the key characteristics, architectural requirements, technologies which enable this transformation as well aspects of ethics behind this transformation. In order to determine the validity of the framework, we provide a case study in the field of autonomous flying rebooking, showcasing how the use of multi-agent orchestration and goal planning can limit the user intervention and make the system more adaptive and better functioning. The effectiveness of our approach is confirmed by a comparative assessment of its work with well-known agentic systems (AutoGPT, ReAct, and CAMEL) in such important parameters as task success, autonomy, and coordination. The present work is contributing the potential to develop ethical and practical grounds of improving intelligent, scalable, and human-aligned AI systems.

安装插件收集

Task-Oriented Communications for Agentic IoT: An LLM-Driven QoS/Security Policy Generation via Dynamic Model Context Protocol

面向智能物联网的面向任务通信：通过动态模型上下文协议的LLM驱动的QoS/安全策略生成

Shuaishuai Guo, Jiabing Zhu, Jia Ye 等, 2025-2025 Seventeenth International Conference on Wireless Communications and Signal Processing (WCSP)

The emergence of Agentic IoT, where autonomous intelligent agents such as mobile robots, UAVs, and industrial actuators independently execute complex missions, demands communication and security configurations that can adapt to both fast mission-driven changes and slower environment-driven performance drifts. Existing control paradigms are inadequate. Specifically, static policies cannot react to real-time variations, while task-aware adaptive policies largely overlook environmental dynamics, leaving systems vulnerable to network degradation and latency spikes. To address these limitations, we propose the Dynamic Model Context Protocol (dMCP), a cognitive control framework that bridges high-level mission intents with lowlevel system configurations via the standardized MCP interface. dMCP employs a Large Language Model to reason over real-time mission and environment contexts, generating executable policy vectors. An event-driven trigger mechanism re-evaluates policies upon abrupt mission changes or significant environmental drifts, ensuring timely adaptation without overreacting to transient fluctuations. Simulation results demonstrate that dMCP achieves higher reliability, reduced tail latency, and improved Service Level Objective compliance compared with both static and taskaware adaptive baselines, making it a viable control paradigm for highly dynamic Agentic IoT deployments.

安装插件收集

Agentic RAG for Command Generation in Automated Penetration Testing

基于代理RAG的自动化渗透测试命令生成

Zhengkun Chen, Chuanjun Yi, Pan Jia, 2025-2025 International Conference on Signal Processing, Computer Networks and Communications (SPCNC)

Large language models (LLMs) have shown great potential in automated penetration testing (PT), but they need to rely on external knowledge to obtain precise output in complex command generation tasks. Although the traditional retrievalaugmented generation (RAG) can supplement external evidence to alleviate the hallucination problem, its static, single-round retrieval mechanism struggles to meet the dynamic knowledge needs of PT: the initial retrieval results may not cover the knowledge gaps generated during the reasoning process. To address it, this paper proposes a Reason-in-Document framework based on Agentic RAG, which transforms a single query into multi-step dynamic retrieval reasoning. Following the ReAct paradigm, the agent identifies knowledge gaps in the reasoning process, constructs targeted queries, and retrieves knowledge using the designated tools. By this mechanism, the system can gradually evolve the initial single query into a semantically progressive sub-query, enabling layer-by-layer knowledge refinement from conceptual-level understanding to implementation-level details.In this paper, a prototype system is implemented on the basis of PentestGPT, a structured knowledge base is constructed, and a variety of retrieval tool sets are designed. In the experiments of 22 real vulnerability scenarios, the Agentic RAG framework improves the accuracy of penetration command generation by 36% compared with the traditional RAG, which significantly enhances the knowledge utilization rate and the ability to generate complex commands. It provides a more efficient and accurate command generation mechanism for intelligent penetration testing.

安装插件收集

Automating agentic collaborative ontology engineering with role-playing simulation of LLM-powered agents and RAG technology

基于角色扮演模拟的LLM代理和RAG技术的自动化代理协作本体工程

Andreas Soularidis, Dimitrios Doumanas, Konstantinos Kotis 等, 2025-The Knowledge Engineering Review3区IF 2.0

Abstract Motivated by the astonishing capabilities of large language models (LLMs) in text-generation, reasoning, and simulation of complex human behaviors, in this paper, we propose a novel multi-component LLM-based framework, namely LLM4ACOE, that fully automates the collaborative ontology engineering (COE) process using role-playing simulation of LLM agents and retrieval augmented generation (RAG) technology. The proposed solution enhances the LLM-powered role-playing simulation with RAG ‘feeding’ the LLM with three different types of external knowledge. This knowledge corresponds to the knowledge required by each of the COE roles (agents), using a component-based framework, as follows: (a) domain-specific data-centric documents, (b) OWL documentation, and (c) ReAct guidelines. The aforementioned components are evaluated in combination, with the aim of investigating their impact on the quality of generated ontologies. The aim of this work is twofold, (a) to identify the capacity of LLM-based agents to generate acceptable (by human-experts) ontologies through agentic collaborative ontology engineering (ACOE) role-playing simulation, at specific levels of acceptance (accuracy, validity, and expressiveness of ontologies) without human intervention and (b) to investigate whether and/or to what extent the selected RAG components affect the quality of the generated ontologies. The evaluation of this novel approach is performed using ChatGPT-o in the domain of search and rescue (SAR) missions. To assess the generated ontologies, quantitative and qualitative measures are employed, focusing on coverage, expressiveness, structure, and human involvement.

安装插件收集

Agentic AI systems in the age of generative models: architectures, cloud scalability, and real-world applications

生成模型时代下的代理人工智能系统：架构、云可扩展性和实际应用

Lingareddy Alva, Bishwajeet Pandey, 2026-Artificial Intelligence Review1区 TopIF 13.9

No abstract available

安装插件收集

被引 11

ReAct Modular Agent: Orchestrating Tool-Use and Retrieval for Financial Workflows

ReAct模块化智能体：为金融工作流程编排工具使用和检索

Armando Hernandez, Victor Sabbia, Santiago Pérez, 2025-Proceedings of International Conference on Intelligent Systems and New Applications

The financial advisory profession demands extreme precision and speed in decision-making, compounded by the complexity of modern capital markets software. This often leads to high training overhead and reduces the time financial advisors can dedicate to client relations. This paper introduces an Agentic AI Co-Pilot designed as a significant architectural advancement beyond traditional Retrieval-Augmented Generation (RAG) systems. The core framework leverages a specialized Enterprise AI Flow to orchestrate a modular, decoupled agent architecture. The system's central component, the Reasoning and Action Agent (RAA), which implements the ReAct (Reasoning and Acting) paradigm that executes a fusion of explicit reasoning and external tool-use. This modularity allows the agent to: (1) interpret complex natural language queries, (2) articulate an internal step-by-step plan via Chain-of-Thought (CoT), and (3) autonomously execute a sequence of decoupled, modular API tools to perform high-stakes operations. This architectural separation ensures the seamless and incremental expansion of capabilities (e.g., integrating a risk-check API or a financial market forecasting module) without the need for retraining the core reasoning model. By providing both traceability and automated execution across complex workflows, the solution aims to substantially improve operational efficiency, enhance compliance through traceable decisions, and elevate the user experience in the highly regulated financial ecosystem.

安装插件收集

Med.ai ASK: an agentic system for biomedical question answering

Med.ai ASK：一种用于生物医学问答的代理系统

Nhung T. H. Nguyen, D. Lituiev, Zhimin Liu 等, 2026-Journal of the American Medical Informatics Association2区IF 4.6

Intelligent agent-driven research co-pilots, leveraging advances in generative AI, are transforming how scientists access biomedical knowledge. This paper presents Med.ai ASK, an agentic question-answering system designed to address biomedical inquiries through dynamic retrieval augmentation and tool-driven reasoning. We aim to develop a system capable of parsing the nuance in biomedical scientists’ research questions to provide reliable, grounded responses that are more accurate than other generative AI solutions. We adopt the ReAct framework’s tool-calling architecture and leverage atomic reasoning from Self-Discover to build Med.ai ASK. It selectively queries multiple biomedical knowledge bases and employs map-reduce tools for vector database retrieval, alongside external API and NER tool integration. We ingested 44 million biomedical documents from diverse sources. The agent is evaluated on a range of biomedical question-answering datasets. Human evaluation on an internal dataset shows strong performance and stability. Ratings from a large language model are aligned with human assessments, supporting its use in further experiments. Automatic evaluations indicate superior performance in long-form answers regarding accuracy, faithfulness, factuality, and reduced hallucinations. For short-form and multiple-choice answers, performance is competitive with state-of-the-art systems. The agent’s detailed answers are more interpretable than other systems attributed to its agentic design. The agent effectively selects tools based on question type and is deployed in a production-level chat platform with over 1600 users and 25 000 answered questions. Med.ai ASK dynamically orchestrates biomedical information retrieval tools to deliver robust interpretative, accurate, and factual answers, which is crucial in the biomedical domain.

安装插件收集

BPMN DMN decision table generation based on agentic AI for critical applications

基于代理人工智能的BPMN DMN决策表生成，应用于关键应用

Sourour Meddeb, Selma Batti, Habib Fathallah, 2026-Business Process Management Journal3区IF 5.8

This study investigates the feasibility of developing an automated solution that can generate dynamic decision tables from business process model and notation (BPMN)models using agentic artificial intelligence (AI). This work purpose is to reduce human error, inconsistencies and cognitive biases that can be introduced by traditional decision management in BPMN environments, which is often achieved by manually creating decision model and notation (DMN) decision tables (Richter et al., 2025). A novel AI-based solution is developed to generate dynamic decision tables from BPMN models. The proposed system integrates large language models within an agentic AI framework that autonomously analyses BPMN processes, identifies decision points and produces optimized DMN tables. The system employs agents for BPMN analysis, decision extraction, rule generation and validation, coordinated through a ReAct (Reasoning + Acting) engine with retrieval-augmented generation (RAG) capabilities (Zhang et al., 2025; Braunschweiler et al., 2025). Experimental evaluation of critical applications demonstrated that the system enhances decision-making by suggesting decision tables with values that humans might not intuitively identify. The system optimizes processes by transforming ambiguous paths into precise decisions. The framework is particularly effective in identifying non-obvious decision criteria and threshold parameters, resulting in significant process automation improvements. This approach establishes the foundation for intelligent, adaptive decision support systems within mission-critical environments and autonomous decision modeling that can dynamically adapt to evolving business requirements. This innovative approach represents the implementation of agentic AI specifically designed for automated DMN decision table generation from BPMN models, addressing a gap in the literature.

安装插件收集

From Retrieval to Cognitive Orchestration: Standardizing Context Management in Agentic AI Systems

从检索到认知编排：标准化智能体AI系统中的上下文管理

Bhaskara Reddy Udaru, 2026-International Journal of Computational and Experimental Science and Engineering

The proliferation of large language model-based agentic systems necessitates rigorous systems engineering approaches to context management. Contemporary frameworks, including Retrieval-Augmented Generation (RAG), ReAct, AutoGPT, and LangGraph, demonstrate autonomous capabilities but lack formal system specifications for context lifecycle, provenance tracking, and governance enforcement. This paper presents a systems engineering framework formalizing cognitive orchestration as a layered architecture with explicit invariants, interface contracts, and verification protocols. We introduce formal system models defining context as C = (K, M, P, T, V) with mathematical invariants ensuring consistency, completeness, and auditability. Our framework integrates Model Context Protocol (MCP) interfaces, establishing standardized contracts for agent coordination, memory management, and policy enforcement. Comparative analysis reveals systematic limitations in existing frameworks: RAG lacks multi-step context propagation (hallucination amplification 3.2×), ReAct exhibits unbounded memory growth (O(n²) with interaction length), AutoGPT suffers governance gaps (31% compliance violations), and LangGraph provides insufficient provenance tracking (34% audit coverage). Empirical validation through enterprise deployment, Annual Report Financial Analysis system processing 500+ documents across 15 regulatory frameworks, demonstrates quantifiable improvements: 94% reduction in compliance violations, 89% decrease in error propagation, 98% provenance completeness, and 3.1× mean time between failures compared to baseline architectures. System verification confirms invariant preservation across 10,000+ agent interactions with zero safety violations. This work establishes cognitive orchestration as essential infrastructure for production-grade agentic systems, providing formal foundations, architectural blueprints, and verification methodologies applicable across enterprise automation, financial analysis, regulatory compliance, and safety-critical domains.

安装插件收集

Deployed AI Agents for Industrial Asset Management: CodeReAct Framework for Event Analysis and Work Order Automation

Nianjun Zhou, Dhaval Patel, A. Bhattacharyya, 2026-Proceedings of the AAAI Conference on Artificial Intelligence

Maintenance of mission-critical industrial assets is frequently hindered by fragmented data, inconsistent record-keeping, and limited access to analytical expertise, resulting in reactive rather than predictive practices. We present \textit{CodeReAct}, an AI-powered agentic framework deployed in large-scale facilities to automate event analysis and work order (WO) management.CodeReAct extends the ReAct paradigm by embedding executable Python code within the Thought--Action--Observation (TAO) loop, enabling natural language interaction, grounding heterogeneous alerts and work orders into structured Business Objects (BOs), and dynamically invoking analytic functions for forecasting, anomaly correlation, and maintenance recommendations. This architecture reduces manual data science intervention, improves adaptability, and supports reuse across asset types. Deployed in a mission-critical data center and productionized in Maximo, CodeReAct manages pumps, chillers, AHUs, compressors, cooling towers, and other mechanical and electrical systems. Evaluation with 36 representative maintenance utterances showed that outer-loop reflection and adaptive temperature improved task completion by up to 20%, while ablation studies confirmed the importance of reasoning in addition to code execution. Business validation revealed seasonal failure patterns, bundling opportunities, and predictive accuracy trends. In production, site engineers reported 25--40% faster diagnostics, fewer unplanned downtime events, and reduced reliance on specialized analysts. Lessons learned highlight the importance of structured BOs for grounding analytics, runtime safeguards to mitigate hallucinations, and adaptive model control for consistent execution. These results demonstrate how deployed agentic AI can deliver measurable business value in predictive and strategic maintenance planning.

安装插件收集

An LLM-Powered Agent for Real-Time Analysis of the Vietnamese IT Job Market

基于LLM的越南IT就业市场实时分析代理

Minh-Thuan Nguyen, T. Vo-Thanh, Thai-Duy Dinh 等, 2025-2025 19th International Conference on Advanced Computing and Analytics (ACOMPA)

Individuals entering Vietnam’s dynamic Information Technology (IT) job market face a critical gap in reliable career guidance. Existing market reports are often outdated, while the manual analysis of thousands of job postings is impractical for most. To address this challenge, we present the AI Job Market Consultant, a novel conversational agent that delivers deep, data-driven insights directly from the labor market in realtime. The foundation of our system is a custom-built dataset created via an automated pipeline that crawls job portals using Playwright and leverages the Large Language Model (LLM) to intelligently structure unstructured posting data. The core of our system is a tool-augmented AI agent, based on the ReAct agentic framework, which enables the ability of autonomously reasoning, planning, and executing actions through a specialized toolbox for SQL queries, semantic search, and data visualization. Our prototype successfully collected and analyzed 3,745 job postings, demonstrating its ability to answer complex, multistep queries, generate on-demand visualizations, and provide personalized career advice grounded in real-world data. This work introduces a new paradigm for labor market analysis, showcasing how specialized agentic AI systems can democratize access to timely, trustworthy career intelligence for the next generation of professionals.

安装插件收集

AI-Based Application for Task Management and Scheduling Student Activity

基于人工智能的任务管理和学生活动排程应用

Bintang Nuralamsyah, Umi Laili Yuhana, Anny Yuniarti 等, 2025-2025 15th International Conference on Information &amp; Communication Technology and System (ICTS)

University students face significant challenges in managing academic demands, which often lead to procrastination, stress, and diminished mental well-being. To address this, we developed a proactive AI-based assistant designed to support student productivity and health. The application leverages Large Language Models (LLMs) and an agentic AI framework based on the ReAct pattern to offer personalized task prioritization, dynamic scheduling, and cognitive load reduction. It uniquely integrates academic, emotional, and biological factors, such as circadian rhythms, to provide holistic, context-aware support. A usability study involving 30 participants showed favorable outcomes, achieving a System Usability Scale (SUS) score of 73.67, a Task Success Rate (TSR) of 83% for the AI scheduling task, average Single Ease Question (SEQ) score of 5.61 on a 7-point scale (where higher is better) indicating that the system is good enough to use. Qualitative feedback highlighted user satisfaction with the system's stability and AI-driven scheduling capabilities. This research presents a novel, adaptive platform that shifts from reactive, siloed educational tools to an anticipatory support system. The findings validate the potential of agentic AI to enhance academic performance, offering a scalable model for future student support in higher education.

安装插件收集

Structured Elicitation Primitives for Reliable Multi-Agent Delegation and Recursive Planning

可靠多智能体委托与递归规划的有序提取原语

S. Karthik, Kota, 2025-British Journal of Multidisciplinary Studies

The transition of Large Language Models (LLMs) from passive information retrieval interfaces to agentic systems capable of multi-step execution represents a significant paradigm shift in artificial intelligence. However, the reliability of these agents is frequently compromised by stochastic drift, hallucination, and the inability to maintain coherent context over extended planning horizons. This paper proposes a theoretical framework for Reliable Agent Delegation (RAD), focusing on structured elicitation techniques that constrain the probabilistic output of foundation models into deterministic workflows. We analyze role assignment mechanisms, meta-reasoning prompts, and self-corrective failure recovery loops. Drawing upon existing literature in Chain-of-Thought (CoT) reasoning, ReAct frameworks, and formal verification, we posit that imposing rigid syntactic and semantic constraints on elicitation allows for verifiable delegation between orchestrator and worker agents. We discuss the security implications of such architectures, specifically regarding indirect prompt injection and cascading logic failures, and outline a methodology for constructing robust, self-healing agentic systems.

安装插件收集

AI Augmented CI/CD Pipelines: From Code Commit to Production with Autonomous Decisions

人工智能增强的CI/CD管道：从代码提交到生产的自主决策

Mohammad Baqar, Saba Naqvi, Rajat Khanda, 2025-2025 3rd International Conference on Foundation and Large Language Models (FLLM)

Modern software delivery has accelerated from quarterly releases to multiple deployments per day. While CI/CD tooling has matured, human decision points interpreting flaky tests, choosing rollback strategies, tuning feature flags, and deciding when to promote a canary remain major sources of latency and operational toil. We propose AI-Augmented CI/CD Pipelines, where large language models (LLMs) and autonomous agents act as policy-bounded co-pilots and progressively as decision makers. We contribute: (1) a reference architecture for embedding agentic decision points into CI/CD, (2) a decision taxonomy and policy-as-code guardrail pattern, (3) a trust-tier framework for staged autonomy, (4) an evaluation methodology using DevOps Research and Assessment ( DORA) metrics and AI-specific indicators, and (5) a detailed industrial-style case study migrating a React 19 microservice to an AI-augmented pipeline. We discuss ethics, verification, auditability, and threats to validity, and chart a roadmap for verifiable autonomy in production delivery systems.

安装插件收集

Invited: Polymath: Self-Improving Hierarchical Workflow for Multi-Domain Problem Solving

特邀：Polymath：用于多领域问题解决的自我改进分层工作流

Chia-Tung Ho, Jingyang Gong, Haoyu Yang 等, 2026-Proceedings of the 2026 International Symposium on Physical Design

Large language models (LLMs) excel at solving complex tasks by executing agentic workflows composed of detailed instructions and structured operations. However, building agents for diverse applications by manually embedding foundation models into agentic systems such as Chain-of-Thought, Self-Reflection, and ReACT through text interfaces limits scalability and efficiency. Recently, researchers have explored automating workflow generation using code-based representations, but most methods depend on labeled data, limiting their applicability to real-world, dynamic hardware design problems. We introduce Polymath, a self-improving agent with a dynamic hierarchical workflow that combines task flow graphs with code-represented workflows to address these challenges. Polymath employs an experience-driven optimization framework that integrates multi-level graph optimization using surrogate scores from historical evaluations with a self-reflection-guided evolutionary algorithm for workflow refinement, enabling unsupervised self-improvement without labeled data. Experiments show that Polymath outperforms a leading commercial agentic system by 16.23% pass@1 and 11.47% pass@3 on hardware benchmarks, and achieves an average 8.1% improvement over state-of-the-art baselines on coding, math, and multi-turn QA tasks.

安装插件收集

A multi-agentic framework for real-time, autonomous freeform metasurface design

实时、自主式自由曲面超表面设计的多代理框架

Robert Lupoiu, Yixuan Shao, Tianxiang Dai 等, 2025-Science Advances1区 TopIF 12.5

Innovation in nanophotonics currently relies on human experts who synergize specialized knowledge in photonics and coding with simulation and optimization algorithms, entailing design cycles that are time-consuming, computationally demanding, and frequently suboptimal. We introduce MetaChat, a multi-agentic design framework that can translate semantically described photonic design goals into high-performance, freeform device layouts in an automated, nearly real-time manner. Multistep reasoning is enabled by our Agentic Iterative Monologue paradigm, which coherently interfaces agents with code-based tools, other specialized agents, and human designers. Design acceleration is facilitated by Feature-wise Linear Modulation–conditioned Maxwell surrogate solvers that support the generalized evaluation of metasurface structures. We use freeform dielectric metasurfaces as a model system and demonstrate with MetaChat the design of multiobjective, multiwavelength metasurfaces orders of magnitude faster than conventional methods. These concepts present a scientific computing blueprint for using specialist design agents, surrogate solvers, and human interactions to drive multiphysics innovation and discovery.

安装插件收集

被引 14

Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools

代理推理：一种增强大型语言模型推理的简化框架

Junde Wu, Jiayuan Zhu, Yuyuan Liu 等, 2025-Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Agentic Reasoning dynamically leverages web search, code execution, and structured memory to address complex problems requiring deep research. A key innovation in our framework is the Mind-Map agent, which constructs a structured knowledge graph to store reasoning context and track logical relationships, ensuring coherence in long reasoning chains with extensive tool usage. Additionally, we conduct a comprehensive exploration of the Web-Search agent, leading to a highly effective search mechanism that surpasses all prior approaches. When deployed on DeepSeek-R1, our method achieves a new state-of-the-art (SOTA) among public models and delivers performance comparable to OpenAI Deep Research, the leading proprietary model in this domain. Extensive ablation studies validate the optimal selection of agentic tools and confirm the effectiveness of our Mind-Map and Web-Search agents in enhancing LLM reasoning. The code is at: https://github.com/theworldofagents/Agentic-Reasoning

安装插件收集

被引 74

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Agent S：一个像人类一样使用计算机的开放代理框架

Saaket Agashe, Jiuzhou Han, Shuyu Gan 等, 2024-International Conference on Learning Representations

We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% on success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on a newly-released WindowsAgentArena benchmark. Code available at https://github.com/simular-ai/Agent-S.

安装插件收集

被引 129

Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

多模态深度研究员：基于代理框架从零开始生成文本-图表交织报告

Zhaorui Yang, Bo Pan, Han Wang 等, 2025-AAAI Conference on Artificial Intelligence

Visualizations play a crucial part in effective communication of concepts and information. Recent advances in reasoning and retrieval augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite its progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning and (4) multimodal report generation. For the evaluation of the generated reports, we develop MultimodalReportBench which contains 100 diverse topics as inputs, and a set of dedicated metrics for report and chart evaluation. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.

安装插件收集

被引 9

A Unified Agentic Framework for Evaluating Conditional Image Generation

统一代理框架评估条件图像生成

Jifang Wang, Xue Yang, Longyue Wang 等, 2025-Annual Meeting of the Association for Computational Linguistics

Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.

安装插件收集

被引 7

The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework

福尔摩斯的眼睛：通过视觉-语言模型代理框架揭示用户隐私属性分析

Feiran Liu, Yuzhe Zhang, Xinyi Huang 等, 2025-Proceedings of the 33rd ACM International Conference on Multimedia

Our research reveals a new privacy risk associated with the vision language model (VLM) agentic framework: the ability to infer sensitive attributes (e.g., age and health information) and even abstract ones (e.g., personality and social traits) from a set of personal images, which we term ''image private attribute profiling.'' This threat is particularly severe given that modern apps can easily access users' photo albums, and inference from image sets enables models to exploit inter-image relations for more sophisticated profiling. However, two main challenges hinder our understanding of how well VLMs can profile an individual from a few personal photos: (1) the lack of benchmark datasets with multi-image annotations for private attributes, and (2) the limited ability of current multimodal large language models (MLLMs) to infer abstract attributes from large image collections. In this work, we construct PAPI, the largest dataset for studying private attribute profiling in personal images, comprising 2,510 images from 251 individuals with 3,012 annotated privacy attributes. We also propose HolmesEye, a hybrid agentic framework that combines VLMs and LLMs to enhance privacy inference. HolmesEye uses VLMs to extract both intra-image and inter-image information and LLMs to guide the inference process as well as consolidate the results through forensic analysis, overcoming existing limitations in long-context visual reasoning. Experiments reveal that HolmesEye achieves a 10.8% improvement in average accuracy over state-of-the-art baselines and surpasses human-level performance by 15.0% in predicting abstract attributes. This work highlights the urgency of addressing privacy risks in image-based profiling and offers both a new dataset and an advanced framework to guide future research in this area.

安装插件收集

被引 8

An Agentic Framework for Autonomous Metamaterial Modeling and Inverse Design

自主式光子超材料建模与逆向设计的代理框架

Darui Lu, J. Malof, Willie J. Padilla, 2025-ACS Photonics1区 TopIF 6.7

Recent significant advances in integrating multiple Large Language Model (LLM) systems have enabled Agentic Frameworks capable of performing complex tasks autonomously, including novel scientific research. We develop and demonstrate such a framework specifically for the inverse design of photonic metamaterials. When queried with a desired optical spectrum, the Agent autonomously proposes and develops a forward deep learning model, accesses external tools via APIs for tasks like simulation and optimization, utilizes memory, and generates a final design via a deep inverse method. The framework's effectiveness is demonstrated in its ability to automate, reason, plan, and adapt. Notably, the Agentic Framework possesses internal reflection and decision flexibility, permitting highly varied and potentially novel outputs.

安装插件收集

被引 7

DocAgent: An Agentic Framework for Multi-Modal Long-Context Document Understanding

DocAgent：一种用于多模态长文本理解的代理框架

Li Sun, Liu He, S. Jia 等, 2025-Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

,

安装插件收集

被引 10

An LLM-based Agentic Framework for Accessible NetworkControl

基于大型语言模型的可访问网络控制代理框架

Samuel Lin, Jiawei Zhou, Minlan Yu, 2025-ACM SIGMETRICS Performance Evaluation Review

Traditional approaches to network management have been accessible only to a handful of highly-trained network operators with significant expert knowledge. This creates barriers for lay users to easily manage their networks without resorting to experts. With recent development of powerful large language models (LLMs) for language comprehension, we design a system to make network management accessible to a broader audience of non-experts by allowing users to converse with networks in natural language. To e!ectively leverage advancements in LLMs, we propose an agentic framework that uses an intermediate representation to streamline configuration across diverse vendor equipment, retrieves the network state from memory in real-time, and provides an interface for external feedback. We also conduct pilot studies to collect real user data of natural language utterances for network control, and present a visualization interface to facilitate dialogue-driven user interaction and enable large-scale data collection for future development. Preliminary experiments validate the e!ectiveness of our proposed system components with LLM integration on both synthetic and real user utterances. Through our data collection and visualization e!orts, we pave the way for more e!ective use of LLMs and democratize network control for everyday users.1

安装插件收集

被引 5

Autonomous Industrial Control using an Agentic Framework with Large Language Models

基于大型语言模型代理框架的自主工业控制

Javal Vyas, Mehmet Mercangöz, 2024-IFAC-PapersOnLine

As chemical plants evolve towards full autonomy, the need for effective fault handling and control in dynamic, unpredictable environments becomes increasingly critical. This paper proposes an innovative approach to industrial automation, introducing validation and reprompting architectures utilizing large language model (LLM)-based autonomous control agents. The proposed agentic system, comprising of operator, validator, and reprompter agents, enables autonomous management of control tasks, adapting to unforeseen disturbances without human intervention. By utilizing validation and reprompting architectures, the framework allows agents to recover from errors and continuously improve decision-making in real-time industrial scenarios. We hypothesize that this mechanism will enhance performance and reliability across a variety of LLMs, offering a path toward fully autonomous systems capable of handling unexpected challenges, paving the way for robust, adaptive control in complex industrial environments. To demonstrate the concept's effectiveness, we created a simple case study involving a temperature control experiment embedded on a microcontroller device, validating the proposed approach.

安装插件收集

被引 21

Agentic Surgical AI: Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion in a Vision-Language-Action Framework

代理式手术人工智能：基于视觉-语言-动作框架的手术风格指纹识别与隐私风险量化

Huixin Zhan, Jason H. Moore, 2025-No journal

Surgeons exhibit distinct operating styles shaped by training, experience, and motor behavior-yet most surgical AI systems overlook this personalization signal. We propose a novel agentic modeling approach for surgeon-specific behavior prediction in robotic surgery, combining a discrete diffusion framework with a vision-language-action (VLA) pipeline. Gesture prediction is framed as a structured sequence denoising task, conditioned on multimodal inputs including surgical video, intent language, and personalized embeddings of surgeon identity and skill. These embeddings are encoded through natural language prompts using third-party language models, allowing the model to retain individual behavioral style without exposing explicit identity. We evaluate our method on the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture sequences while learning meaningful motion fingerprints unique to each surgeon. To quantify the privacy implications of personalization, we perform membership inference attacks and find that more expressive embeddings improve task performance but simultaneously increase susceptibility to identity leakage. These findings demonstrate that while personalized embeddings improve performance, they also increase vulnerability to identity leakage, revealing the importance of balancing personalization with privacy risk in surgical modeling. Code is available at: https://github.com/huixin-zhan-ai/Surgeon_style_fingerprinting.

安装插件收集

Agentic AI in Higher Education: A Low-Code Framework for Administrative Automation and Strategic Oversight

高等教育中的代理人工智能：一种低代码框架用于行政自动化和战略监督

Hossam Daoud, A. Ragab, Mohamed A. Ragheb 等, 2025-المجلة العربية للإدارة

A modularized, low-code automation system that follows the premise of the Agentic Artificial Intelligence (AI) internally known as Laila AI has been proposed in this paper which was completely designed, implemented and tested at the Graduate School of Business (GSB), Arab Academy for Science, Technology and Maritime Transport. Laila AI is self-generating, customizing, and real-time academic administrative workflow system with a dynamic interpretation of institutional input systems (LMS /SIS) and stakeholder data. In the system, the layer of the human-in-the-loop control is also implemented to guarantee its transparency, ethical regulatory use, and local adjustability. It was a mixed-methods case study that employed the use of system performance logs and structural surveys (n = 375) in addition to interviews with various stakeholders. The results indicate a high level of efficiency in operation: a more than 50% decrease in the time of task accomplishment, and automation of up to 70% of the assessment processes. More than 80% of the time, academic leadership reacted to strategic alerts within 48 hours. Qualitative information resonated with perceived gain in fairness, explainability, and trust among the stakeholders. Override and justification features provided active involvement of human reviewers, which supported the ethical dimension of governance. The above-stated findings assert that Laila AI encompasses a dualistic model of governance putting together independent decision reasoning and in-built moral control. Being a transparent, ethically controlled distributed digital form of administration, it presents a model that can be transferred to resource-restricted establishment of higher learning in disparate working environments.

安装插件收集

LIFE-CRAFT: A Multi-agentic Conversational RAG Framework for Lifestyle Medicine Coaching with Context Traceability and Case-Based Evidence Synthesis

LIFE-CRAFT：一种具有上下文追踪和基于案例的证据综合的多代理对话RAG框架，用于生活方式医学辅导

Hania Aslam, Gousia K. Malak, Max Renault 等, 2025-Lecture Notes in Computer Science

No abstract available

安装插件收集

An Agentic AI-based Multi-Agent Framework for Recommender Systems

基于代理人工智能的多智能体推荐系统框架

I. Portugal, Paulo S. C. Alencar, Donald D. Cowan, 2024-2024 IEEE International Conference on Big Data (BigData)

Agentic AI describes the use of LLMs in novel AI agents that can answer questions or collaborate to achieve goals. These LLM agents can be used to build a novel generation of recommender systems. However, little is known about the LLM agents or their relationships needed to provide recommendations. Once identified, a framework can be constructed. Moreover, evaluating this framework is still not well understood. In this paper, we propose an agentic AI-based, multi-agent framework for recommender systems. We first identify LLM agents proposed in the literature, followed by the identification of their relationships and we propose a framework to represent them. Next, we evaluate this framework with respect to the LLM agents and functionalities of a recommender system based on published studies. This study is a stepping stone in a novel paradigm shift in the construction of recommender systems.

安装插件收集

被引 11

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Xiaoxi Li, Guanting Dong, Jiajie Jin 等, 2025-Conference on Empirical Methods in Natural Language Processing

Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce \textbf{Search-o1}, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks, paving the way for more reliable and versatile intelligent systems. The code is available at \url{https://github.com/sunnynexus/Search-o1}.

安装插件收集

被引 380

Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning

超越ReAct：面向复杂工具辅助的LLM推理的以规划器为中心的框架

Xiaolong Wei, Yuehu Dong, Xingliang Wang 等, 2025-AAAI Conference on Artificial Intelligence

Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner's tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.

安装插件收集

被引 6

A general AI agent framework for smart buildings based on large language models and ReAct strategy

基于大型语言模型和ReAct策略的智能建筑通用AI代理框架

Xia Yan, Xincong Yang, Nan Jin 等, 2025-Smart Construction

: Smart buildings represent a significant trend in the future of the construction industry. The performance of human-computer interaction plays a vital role in achieving this from a human perspective. However, existing human-computer interaction algorithms are often limited to simple commands and fail to meet the complex and diverse needs of users. To address this issue, this paper introduces large language models (LLMs) and AI agents into smart buildings, proposing a general AI agent framework based on the ReAct strategy. The LLM serves as the system’s brain, responsible for reasoning and action planning, while tool calling mechanism puts the LLM’s plans into practice. Through this framework, developers can rely on prompt engineering alone to enable the LLM to interpret user intent accurately, perform appropriate actions

安装插件收集

被引 6

Agentic Retrieval-Augmented Generation: Advancing AI-Driven Information Retrieval and Processing

代理检索增强生成：推动人工智能驱动的信息检索与处理技术进步

Abhai Pratap Singh, Adit Jamdar, Prerna Kaul, 2025-International Journal of Computer Trends and Technology

No abstract available

安装插件收集

被引 17

DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision

DecEx-RAG：通过过程监督的决策和执行优化提升检索增强生成中的代理检索

Yongqi Leng, Yikun Lei, Xikai Liu 等, 2025-Conference on Empirical Methods in Natural Language Processing

Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing capability for complex tasks through dynamic retrieval and adaptive workflows. Recent advances (e.g., Search-R1) have shown that outcome-supervised reinforcement learning demonstrate strong performance. However, this approach still suffers from inefficient exploration, sparse reward signals, and ambiguous global reward feedback. To address these challenges, we propose DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating decision-making and execution, while introducing an efficient pruning strategy to optimize data expansion. Through comprehensive process-level policy optimization, DecEx-RAG significantly enhances the autonomous task decomposition, dynamic retrieval, and high-quality answer generation capabilities of large language models (LLMs). Experiments show that DecEx-RAG achieves an average absolute performance improvement of $6.2\%$ across six datasets, significantly outperforming existing baselines. Moreover, the pruning strategy improves data construction efficiency by nearly $6 \times$, providing an efficient solution for process-supervised RAG training. The code is available at https://github.com/sdsxdxl/DecEx-RAG.

安装插件收集

被引 5

KA-RAG: Integrating Knowledge Graphs and Agentic Retrieval-Augmented Generation for an Intelligent Educational Question-Answering Model

KA-RAG：融合知识图谱和代理检索增强生成以构建智能教育问答模型

Fangqun Gao, Shun-Yi Xu, Weiyang Hao 等, 2025-Applied Sciences4区IF 2.5

Generative artificial intelligence (AI) and large language models (LLMs) are reshaping the landscape of intelligent educational systems; however, existing solutions often suffer from unstructured resource organization, limited interpretability, and suboptimal retrieval precision. To address these challenges, this study introduces KA-RAG, a course-oriented question answering (QA) framework that integrates a structured Knowledge Graph (KG) with an Agentic Retrieval-Augmented Generation (Agentic-RAG) workflow. The system incorporates a responsive interface, a unified agent controller (ToolPlanner), a course knowledge graph, and a vector-based retrieval subsystem. By combining symbolic graph reasoning with dense semantic retrieval, the proposed dual-retrieval strategy supports interpretable, context-aware responses to course-related queries. Experiments conducted on a graduate-level Pattern Recognition course demonstrate that KA-RAG achieves a retrieval accuracy of 91.4%, semantic consistency of 87.6%, and an average response latency of 2.8 s. User surveys further reveal significant improvements in learning efficiency and satisfaction. The results validate the feasibility of integrating KG and Agentic-RAG techniques for knowledge-grounded educational applications, offering a practical pathway toward intelligent knowledge organization and interactive learning support.

安装插件收集

被引 4

Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning

基于强化学习的病理视觉语言模型的多模态检索增强生成：Patho-AgenticRAG

Wenchuan Zhang, Jingru Guo, Heng Zhang 等, 2025-AAAI Conference on Artificial Intelligence

Although Vision Language Models (VLMs) have shown generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text–image search, enabling retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering.

安装插件收集

被引 7

Performance Enhancement of Agentic Retrieval Augmented Generation Using Relevance Generative Answering

基于相关性生成回答的代理检索增强生成性能提升

Sanjay Kukreja, Tarun Kumar, Vishal Bharate 等, 2025-2025 5th International Conference on Artificial Intelligence and Education (ICAIE)

The aim of this research paper is to present a novel approach of using Relevance Generative Answering (RGA) in the trending field of Agentic Retrieval Augmented Generation (RAG). The paradigm shift in the RAG system by the introduction of Agentic RAG has opened a new research paradigm. The major issue of hallucination is overcome with the use of a traditional RAG system with some limitations like accuracy and relevance, lack of reasoning, the lost in the middle problem, etc. The Agentic RAG system attempts to address a few of these limitations. However, interpreting results based on the user's intent remains a significant area of research. This research aimed to understand user intent by introducing relevance detection block in the proposed architecture. Different performance metrics like precision, recall, F1 score, relevance, latency are used to validate the proposed approach. The results presented in this research reveal that the performance of the proposed system is much more relevant compared to agentic RAG system. For context and intent specific applications proposed framework suits well.

安装插件收集

被引 1

Pioneering agentic retrieval-augmented generation in software quality: a novel framework for code smell detection via dynamic retrieval

开创性的代理检索增强生成在软件质量中的应用——一种基于动态检索的代码异味检测新框架

Bushra Aljohani, Abdulmajeed Aljuhani, 2026-PeerJ Computer Science4区IF 2.5

Code smells—subtle indicators of poor design choices—pose significant challenges to software maintainability and readability, particularly in dynamic languages such as Python. Traditional detection methods, including rule-based heuristics and static machine learning classifiers, often suffer from limited adaptability, poor contextual awareness, and lack of explainability. These limitations hinder their effectiveness in evolving codebases and real-world development environments. This study introduces a novel Agentic retrieval-augmented generation (Agentic RAG) framework for code smell detection, marking the first application of agentic reasoning in this domain. By embedding autonomous agents into the retrieval and reasoning pipeline, the proposed system dynamically routes queries, selects optimal retrieval strategies, and synthesizes context-aware explanations using large language models (LLMs). Unlike static classifiers, the proposed framework leverages hybrid retrieval (sparse + dense) and structured prompting to detect and explain Long Method and Large Class smells with high interpretability. Experimental results demonstrate that Agentic RAG—particularly when paired with DeepSeek and chain-of-thought prompting—achieves superior performance, with 89.5% accuracy, a macro F1-score of 78.3%, and a weighted F1 of 88.7%. To assess generalization, Experiment 2 extended the framework to 21 distinct code smell types across multiple programming languages, achieving 94.85% accuracy, a macro F1-score of 90.24%, and a weighted F1-score of 94.93% through stratified five-fold cross-validation, thereby confirming the model’s robustness and scalability. Beyond academic benchmarks, this work lays the foundation for real-world integration into developer platforms, enabling real-time code review, contextual feedback, and actionable refactoring suggestions. By bridging LLMs with dynamic retrieval and agentic reasoning, this framework advances the frontier of intelligent software quality assurance.

安装插件收集

Answering real-world clinical questions using large language model, retrieval-augmented generation, and agentic systems

利用大型语言模型、检索增强生成和智能系统回答现实世界临床问题

Y. Low, M. Jackson, Rebecca J. Hyde 等, 2025-DIGITAL HEALTH3区IF 3.3

Objective The practice of evidence-based medicine can be challenging when relevant data are lacking or difficult to contextualize for a specific patient. Large language models (LLMs) could potentially address both challenges by summarizing published literature or generating new studies using real-world data. Materials and Methods We submitted 50 clinical questions to five LLM-based systems: OpenEvidence, which uses an LLM for retrieval-augmented generation (RAG); ChatRWD, which uses an LLM as an interface to a data extraction and analysis pipeline; and three general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini 1.5 Pro). Nine independent physicians evaluated the answers for relevance, quality of supporting evidence, and actionability (i.e., sufficient to justify or change clinical practice). Results General-purpose LLMs rarely produced relevant, evidence-based answers (2–10% of questions). In contrast, RAG-based and agentic LLM systems, respectively, produced relevant, evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. OpenEvidence produced actionable results for 48% of questions with existing evidence, compared to 37% for ChatRWD and <5% for the general-purpose LLMs. ChatRWD provided actionable results for 52% of questions that lacked existing literature compared to <10% for other LLMs. Discussion Special-purpose LLM systems greatly outperformed general-purpose LLMs in producing answers to clinical questions. Retrieval-augmented generation-based LLM (OpenEvidence) performed well when existing data were available, while only the agentic ChatRWD was able to provide actionable answers when preexisting studies were lacking. Conclusion Synergistic systems combining RAG-based evidence summarization and agentic generation of novel evidence could improve the availability of pertinent evidence for patient care.

安装插件收集

被引 25

RAG-Critic: Leveraging Automated Critic-Guided Agentic Workflow for Retrieval Augmented Generation

RAG-Critic：利用自动化的评论家引导的代理工作流程进行检索增强生成

Guanting Dong, Jiajie Jin, Xiaoxi Li 等, 2025-Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval-augmented generation (RAG) has emerged as a pivotal technology in natural language processing, owing to its efficacy in generating factual content. However, its informative inputs and complex paradigms often lead to a greater variety of errors. Consequently, achieving automated on-policy assessment and error-oriented correction remains an unresolved issue. In this paper, we propose RAG-Critic, a novel framework that leverages a critic-guided agentic workflow to improve RAG capabilities autonomously. Specifically, we initially design a data-driven error mining pipeline to establish a hierarchical RAG error system. Based on this system, we progressively align an error-critic model using a coarse-to-fine training objective, which automatically provides fine-grained error feedback. Finally, we design a critic-guided agentic RAG workflow that cus-tomizes executor-based solution flows based on the error-critic model’s feedback, facilitating an error-driven self-correction process. Experimental results across seven RAG-related datasets confirm the effectiveness of RAG-Critic, while qualitative analysis offers practical insights for achieving reliable RAG systems. Our dataset and code are available at https: //github.com/RUC-NLPIR/RAG-Critic .

安装插件收集

被引 10

Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization

连接法律知识与人工智能：基于向量存储、知识图谱和分层非负矩阵分解的检索增强生成

Ryan Barron, M. Eren, Olga M. Serafimova 等, 2025-Proceedings of the Twentieth International Conference on Artificial Intelligence and Law

Agentic Generative AI, powered by Large Language Models (LLMs) and enhanced with Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), represents a transformative technology applicable across specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. This technology excels at inferring relationships within vast unstructured or semi-structured datasets. The legal domain we focus on here comprises inherently complex data characterized by extensive, interrelated, and semi-structured knowledge systems with complex relations. It comprises constitutions, statutes, regulations, and case law. Extracting insights and navigating the intricate networks of legal documents and their relations is crucial for effective legal research and decision-making. Here, we introduce a generative AI system, a jurisdiction-specific legal information retrieval that integrates RAG, VS, and KG, constructed via Hierarchical Non-Negative Matrix Factorization (HNMFk), to enhance information retrieval and AI reasoning and minimize hallucinations. In the legal system, these technologies empower AI agents to identify and analyze complex connections among cases, statutes, and legal precedents, uncovering hidden relationships and predicting legal trends—challenging tasks essential for ensuring justice and improving operational efficiency. Our system employs web scraping techniques to systematically collect legal texts, such as statutes, constitutional provisions, and case law, from publicly accessible platforms like Justia. It bridges the gap between traditional keyword-based searches and contextual understanding by leveraging advanced semantic representations, hierarchical relationships, and latent topic discovery. This approach is demonstrated in legal document clustering, summarization, and cross-referencing tasks. The framework marks a significant step toward augmenting legal research with scalable, interpretable, and accurate retrieval methods for semi-structured data, advancing the intersection of computational law and artificial intelligence.

安装插件收集

被引 14

Retrieval-Augmented Generation to Generate Knowledge Assets and Creation of Action Drivers

检索增强生成在知识资产创造和行动驱动创建中的应用

A. James, Marcello Trovati, Simon Bolton, 2025-Applied Sciences4区IF 2.5

This article explores the application of Retrieval-Augmented Generation (RAG) to enhance the creation of knowledge assets and develop actionable insights from complex datasets. It begins by contextualising the limitations of large language models (LLMs), notably their knowledge cut-offs and hallucination tendencies, and it will present RAG as a promising solution that integrates external knowledge retrieval to improve factual accuracy and relevance. This study reviews current RAG architectures, including naïve and advanced models, emphasising techniques such as optimised indexing, query refinement, metadata utilisation, and the incorporation of autonomous AI agents in agentic RAG systems. Methodologies for effective data preprocessing, semantic-aware chunking, and retrieval strategies—such as multihop retrieval and reranking—are also discussed to address challenges such as irrelevant retrieval and semantic fragmentation. This work further examines embedding models, notably the use of state-of-the-art vector representations, to facilitate precise similarity searches within knowledge bases. A case study demonstrates the deployment of an RAG pipeline for analysing multisheet datasets, highlighting challenges in data structuring, prompt engineering, and ensuring output consistency.

安装插件收集

被引 10

CogPlanner: Unveiling the Potential of Agentic Multimodal Retrieval Augmented Generation with Planning

CogPlanner：揭示具有规划能力的代理式多模态检索增强生成潜力

Xiaohan Yu, Zhihan Yang, Chong Chen, 2025-Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region

Multimodal Retrieval Augmented Generation (MRAG) systems have shown promise in enhancing the generation capabilities of multimodal large language models (MLLMs). However, existing MRAG frameworks primarily adhere to rigid, single-step retrieval strategies that fail to address real-world challenges of information acquisition and query reformulation. In this work, we introduce the task of Multimodal Retrieval Augmented Generation Planning (MRAG Planning) that aims at effective information seeking and integration while minimizing computational overhead. Specifically, we propose CogPlanner, an agentic plug-and-play framework inspired by human cognitive processes, which iteratively determines query reformulation and retrieval strategies to generate accurate and contextually relevant responses. CogPlanner supports parallel and sequential modeling paradigms. Furthermore, we introduce CogBench, a new benchmark designed to rigorously evaluate the MRAG Planning task and facilitate lightweight CogPlanner integration with resource-efficient MLLMs, such as Qwen2-VL-7B-Cog. Experimental results demonstrate that CogPlanner significantly outperforms existing MRAG baselines, offering improvements in both accuracy and efficiency with minimal additional computational costs.

安装插件收集

被引 3

Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation

金融科技领域的检索增强生成（RAG）：代理式设计和评估

Thomas Cook, Richard Osuagwu, Liman Tsatiashvili 等, 2025-2025 3rd International Conference on Foundation and Large Language Models (FLLM)

Retrieval-Augmented Generation (RAG) systems often face limitations in specialized domains such as fintech, where domain-specific ontologies, dense terminology, and acronyms complicate effective retrieval and synthesis. This paper introduces an agentic RAG architecture designed to address these challenges through a modular pipeline of specialised agents. The proposed system supports intelligent query reformulation, iterative sub-query decomposition guided by keyphrase extraction, contextual acronym resolution, and cross-encoder-based context re-ranking. We evaluate our approach against a standard RAG baseline using a curated dataset of 85 question–answer–reference triples derived from an enterprise fintech knowledge base. Experimental results demonstrate that the agentic RAG system outperforms the baseline in retrieval precision and relevance, albeit with increased latency. These findings suggest that structured, multi-agent methodologies offer a promising direction for enhancing retrieval robustness in complex, domain-specific settings.

安装插件收集

被引 1

Agentic AI with retrieval-augmented generation for automated compliance assistance in finance

基于检索增强生成的代理人工智能在金融合规自动化辅助中的应用

Varun Pandey, 2025-International Journal of Science and Research Archive

Maintaining compliance with complex Know Your Customer (KYC) and Anti-Money Laundering (AML) regulations is a resource-intensive challenge for financial institutions. This paper presents an agentic AI approach that leverages Retrieval-Augmented Generation (RAG) to automate and enhance compliance research and decision-making. We define the inefficiencies in current U.S. KYC/AML compliance workflows – including lengthy onboarding times and costly manual processes – as motivation for a more dynamic solution. We then introduce an autonomous agent framework, implemented with LangChain, that integrates a RAG pipeline to perform contextual reasoning over regulatory knowledge bases. The technical architecture is detailed with an emphasis on the agent’s planning and tool use capabilities, and the RAG components for knowledge base construction (using U.S. regulations such as FinCEN guidance, Code of Federal Regulations (CFR) provisions, and OFAC sanctions data), transformer-based embedding and indexing, vector retrieval, and LLM-driven answer generation. We demonstrate how this agent can handle compliance queries (e.g., customer due diligence requirements and detection of transaction structuring) in a simulated proof-of-concept. We discuss key advantages of this approach over traditional rule-based or static NLP systems – notably greater adaptability to changing regulations, improved traceability via source citations, and higher precision in complex scenario handling. Finally, we address ethical considerations (hallucination risk, ensuring regulatory accuracy, and model governance) and explore practical applications such as automated audit support, compliance report drafting, and future directions including real-time monitoring and multimodal compliance agents.

安装插件收集

被引 1

AirRAG: Autonomous Strategic Planning and Reasoning Steer Retrieval Augmented Generation

AirRAG：自主战略规划与推理引导检索增强生成

Wenfeng Feng, Chuzhan Hao, Yuewei Zhang 等, 2025-Findings of the Association for Computational Linguistics: EMNLP 2025

Leveraging the autonomous decision-making capabilities of large language models (LLMs) has demonstrated superior performance in reasoning tasks. However, despite the success of iterative or agentic retrieval-augmented generation (RAG) techniques, these methods are often constrained to a single solution space when confronted with complex problems. In this paper, we propose a novel thinking pattern in RAG that integrates autonomous strategic planning with efficient reasoning actions, significantly activating intrinsic reasoning capabilities and expanding the solution space of specific tasks via Monte Carlo Tree Search (MCTS), which we refer to as AirRAG. Specifically, our approach designs five fundamental reasoning actions, which are expanded to a broad tree-based reasoning space using MCTS. The approach also incorporates self-consistency verification to explore potential reasoning paths and inference scaling law. Additionally, computationally optimal strategies are employed to allocate more inference resources to key actions, thereby enhancing overall performance. Experimental results demonstrate the effectiveness of AirRAG, showing significant performance gains on complex question-answering datasets. Furthermore, AirRAG is flexible and lightweight, making it easy to integrate with other advanced technologies and models.

安装插件收集

被引 7

HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases

HybGRAG：基于文本和关系知识库的混合检索增强生成

Meng-Chieh Lee, Qi Zhu, C. Mavromatis 等, 2024-Annual Meeting of the Association for Computational Linguistics

Given a semi-structured knowledge base (SKB), where text documents are interconnected by relations, how can we effectively retrieve relevant information to answer user questions? Retrieval-Augmented Generation (RAG) retrieves documents to assist large language models (LLMs) in question answering; while Graph RAG (GRAG) uses structured knowledge bases as its knowledge source. However, many questions require both textual and relational information from SKB - referred to as"hybrid"questions - which complicates the retrieval process and underscores the need for a hybrid retrieval method that leverages both information. In this paper, through our empirical analysis, we identify key insights that show why existing methods may struggle with hybrid question answering (HQA) over SKB. Based on these insights, we propose HybGRAG for HQA consisting of a retriever bank and a critic module, with the following advantages: (1) Agentic, it automatically refines the output by incorporating feedback from the critic module, (2) Adaptive, it solves hybrid questions requiring both textual and relational information with the retriever bank, (3) Interpretable, it justifies decision making with intuitive refinement path, and (4) Effective, it surpasses all baselines on HQA benchmarks. In experiments on the STaRK benchmark, HybGRAG achieves significant performance gains, with an average relative improvement in Hit@1 of 51%.

安装插件收集

被引 24

Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025

语言模型能否自我批判？在BioASQ 2025中探究检索增强生成中的自我反馈

Samy Ateia, U. Kruschwitz, 2025-Conference and Labs of the Evaluation Forum

Agentic Retrieval Augmented Generation (RAG) and'deep research'systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and nonreasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.

安装插件收集

被引 3

Synthetic Data Generation Using CTGAN with Agentic Workflows and Retrieval-Augmented Generation

基于代理工作流和检索增强生成的CTGAN合成数据生成

S. K C, Maria George Anthraper, Kusuma Sanjaykumar 等, 2025-International Conference on AI Research

Real-world data in domains such as finance and fraud detection can be rare, imbalanced, or inaccessible, necessitating synthetic data as a crucial alternative. Gathering and leveraging real-world data in such domains is subject to important challenges such as privacy issues, legality, high cost of annotation, and restricted access due to proprietary ownership. Synthetic data generation in this context offers a meaningful alternative to real data gathering, reducing both privacy and computational costs while allowing for the construction of flexible, scalable datasets. This paper presents a new paradigm for tabular data synthesis through CTGAN (Conditional Tabular GAN) with integration into agentic workflows and retrieval-augmented generation (RAG). The proposed system herein accepts partial data samples and column constraints as inputs from a user-friendly chatbot interface and augment the dataset intelligently through an AI-agent-based generation pipeline. These AI agents aid in the automation of preprocessing, column semantics interpretation, and the enforcement of user-specified constraints specified in natural language, minimizing manual intervention by a considerable margin. The framework further includes ChromaDB to enable semantic retrieval of past relevant datasets. With this semantic memory, the model can improve generation quality, apply schema-level consistency, and update even synthesis of new datasets based on column names or metadata alone. It allows for context-aware, structurally sound, and domain-conformant data generation—without the need to access sensitive or full datasets. The current research utilizes statistical measures like mean, variance, and the Kolmogorov–Smirnov (KS) test to confirm the fidelity of data produced. The approach maintains a mean difference of just 0.16% and a KS statistic of 0.0020, which reflects outstanding statistical consistency with original distributions of data. Preliminary results show significant enhancements in data realism, diversity, and variability without sacrificing domain coherence. The system introduced is particularly well-adapted to financial datasets, such as applications in credit card fraud detection, and offers a scalable, privacy-aware method of synthetic data generation in sensitive or data-scarce environments.

安装插件收集

Agentic AI Quiz-Based Learning System: Enhancing MCQ Generation via Long-Context Cached Retrieval-Augmented Generation

基于代理的人工智能测验学习系统：通过长上下文缓存检索增强生成来提升选择题生成

Devananda Sreekanth, Sreekanth Gopi, N. Dehbozorgi, 2025-2025 IEEE Frontiers in Education Conference (FIE)

This research-to-practice full paper investigates how Large Language Models (LLMs) with long-context window and enhanced retrieval efficiency can generate context-specific quizzes to address high attrition in engineering education. We aim to enable LLMs to process large, multidisciplinary Artificial Intelligence (AI) datasets, covering topics like Machine Learning, Generative AI, and Neural Networks, from foundational to advanced concepts. A systematic literature review identified gaps in Retrieval-Augmented Generation (RAG) systems, which often retrieve irrelevant chunks due to context limitations, leading to inaccurate or hallucinated responses [1], [2]. Traditional quiz generation lacks modular design, limiting scalability and interpretability. To address this, we developed an agentic long-context RAG architecture using Gemini 1.5's one-million-token window, integrating retrieval, reasoning, and evaluation in a unified pipeline. Our methodology employed a modular Agentic AI system. A Parsing Agent extracts text from academic sources, followed by a Chunking & Storage Agent segmenting content with character overlaps. An Embedding & Indexing Agent generates and indexes vector embeddings, verified by a Verification Agent for topical alignment. For quiz generation, a Retriever Agent uses cosine similarity and multilingual re-ranking, a Selector Agent filters meaningful chunks, a Response Agent leverages cached ground-truth MCQs with an LLM, and an Evaluator Agent assesses outputs. Experiments on a 150-question benchmark showed accuracy improvements: 78.00 % (raw), 84.00 (chunks), $\mathbf{8 9. 3 3 \%}$ (chunks+cache), and $\mathbf{9 3. 3 3 \%}$ ($\mathbf{1 M}$ context+cache) for Gemini, with GPT-4o and Claude Sonnet 3.7 revealing complementary strengths in precision and confidence. Future work includes deploying an interactive quiz application and expanding domain-specific datasets across engineering fields.

安装插件收集

LLM-POWERED RAN AUTOMATION USING RETRIEVAL-AUGMENTED GENERATION (RAG)

基于检索增强生成（RAG）的LLM驱动的无线接入网络（RAN）自动化

Venkat Chintha, 2025-International Journal of Applied Mathematics

The increasing complexity of Radio Access Network (RAN) environments, especially 5G and future 6G infrastructures, has prompted the development of smarter and more flexible network automation infrastructures. As a more advanced form of context-driven decision-making and process automation in wireless networks, Large Language Models (LLMs) have recently been refined using Retrieval-Augmented Generation (RAG). This paper reviews current developments in applying RAG-augmented LLMs to RAN automation, including spectrum and power allocation, fault detection in distributed RANs, and secure 5G/6G multi-agent automation. It also presents comparative studies with more conventional approaches, such as Deep Reinforcement Learning (DRL), and discusses multi-agent systems, graph-based retrieval mechanisms, and agentic AI systems. The review highlights potential limitations, including safety concerns, data management challenges, and scalability issues, as well as future research and implementation directions. The discussion demonstrates the disruptive potential of RAG-enhanced LLMs in reshaping automation and intelligence in next-generation wireless networks.

安装插件收集

Empowering Large Language Model Reasoning : Hybridizing Layered Retrieval Augmented Generation and Knowledge Graph Synthesis

赋能大型语言模型推理：融合分层检索增强生成与知识图谱合成

Vedanth Aggarwal, 2024-International Journal of High School Research

: Retrieval Augmented Generation has improved LLM question answering significantly. However, this mechanism still produces hallucinations and structural incoherence in knowledge-intensive tasks. Additionally, many existing techniques neither holistically leverage multiple properties of text nor integrate diverse prompting and agenting frameworks. To address these limitations, this paper proposes a novel methodology that extracts and utilizes unstructured and structured properties of text to construct layered RAG pipelines designed to enhance complex LLM reasoning. Our approach synthesizes three distinct RAG methodologies, each specialized in various aspects: textual entity knowledge graph extraction (Textual Entity RAG); community summary and entity generation (Microsoft GraphRAG), and structural link navigation (MetaWiki RAG). By cumulatively layering these techniques along with advanced prompting and agentic evaluation, we aim to capture a more comprehensive context, enabling the model to generate well-structured responses that reflect all relevant attributes of the text. The proposed framework not only enhances existing RAG mechanisms but also demonstrates the effective integration of knowledge graphs. Additionally, it showcases the application of this framework to advanced answer generation using Wikipedia, with extensions to similar knowledge networks. This novel approach offers a robust solution for social recommender systems and other practical applications, delivering holistic outcomes by synthesizing diverse RAG techniques.

安装插件收集

被引 2

Precedent-Aware Multi-Agent Retrieval-Augmented Generation in Case Law Analysis

案例法分析中的先例感知多智能体检索增强生成

Shatrunjay Kumar Singh, 2026-International Journal of Innovative Science and Research Technology

Retrieval-Augmented Generation (RAG) systems promise practical legal assistance by grounding Large Language Models (LLMs) in external authority. However, standard RAG optimizes semantic similarity and often fails to respect common-law constraints such as jurisdictional bindingness, court hierarchy, temporal validity, and negative treatment. We propose Precedent- Aware Multi-Agent RAG (PA-MA-RAG), an agentic architecture that decomposes legal research and writing into specialized agents for issue framing, authority planning, retrieval, precedent ranking, conflict resolution, drafting, and citation verification. Our method introduces an authority- constrained re-ranking objective that prioritizes controlling precedents while penalizing overruled or otherwise negatively treated cases. The verifier agent enforces evidence-grounded generation by requiring each legal proposition to be supported by retrieved holdings and quotations. We describe an evaluation protocol for both precedent retrieval and citation-grounded legal analysis generation, including authority correctness, supported-claim rate, and robustness to conflicting precedent.

安装插件收集

Traditional RAG vs. Agentic RAG: A Comparative Study of Retrieval-Augmented Systems

传统RAG与代理RAG的比较研究：检索增强系统的对比分析

Fnu Neha, Deepshikha Bhati, 2025-2025 IEEE International Conference on Future Machine Learning and Data Science (FMLDS)

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieval to improve factual reliability. Traditional RAG employs a fixed, single-pass retrieval process, limiting its ability to handle multi-step reasoning, adaptive queries, and heterogeneous data sources. Agentic RAG extends this framework with autonomous agents that plan, iterate retrieval, integrate tools, and reason over intermediate results. This paper presents a comprehensive comparison of Traditional and Agentic RAG in terms of architecture, capabilities, evaluation metrics, and operational challenges. In addition to synthesizing representative systems, we provide a side-by-side analysis of comparative limitations, failure modes, and corresponding mitigations, mapping domain-specific applications across established and emerging fields. We also outline governance recommendations and propose future research directions, including graph-augmented, multimodal, human-in-the-loop, and domain-specialized Agentic RAG frameworks with standardized model cards. These insights offer both a technical and practical foundation for designing more adaptive, trustworthy, and context-aware retrieval-augmented systems.

安装插件收集

被引 2

Multi-Modal Retrieval Augmented Visual Understanding and Generation

多模态检索增强的视觉理解和生成

Zhucun Xue, 2025-Proceedings of the 33rd ACM International Conference on Multimedia

This paper presents a doctoral research focusing on integrating Retrieval-Augmented Generation (RAG) into video-related multimodal tasks. Existing RAG studies predominantly target text, images, or tabular data, overlooking the unique value of video as a knowledge carrier. We address this gap by: 1) proposing AdaVideoRAG, a framework that adaptively allocates retrieval strategies based on query complexity for long-video understanding; 2) developing REViG (RAG-Enhanced Video Generation) to optimize prompt engineering via retrieved knowledge for controllable video synthesis; 3) constructing the UltraVideo dataset (UHD-4K/8K resolution, 100+ themes, 10 structured captions per video) and HiVU/HiVG benchmarks to evaluate RAG-driven video tasks. Experiments validate the effectiveness of our methods, and we outline future plans to unify video understanding and generation through Agentic RAG for AGI-oriented research.

安装插件收集

Agentic memory-augmented retrieval and evidence grounding for medical question-answering tasks

基于代理的记忆增强检索与证据归因在医学问答任务中的应用

S. Jia, S. Bit, V. H. Jasodanand 等, 2025-medRxiv

Objective: To evaluate if a tool-using agent-based system utilizing large language models (LLMs) for medical question-answering (QA) tasks outperforms standalone LLMs. Methods: We developed a unified, open-source LLM-based agentic system that integrates document retrieval, re-ranking, evidence grounding, and diagnosis generation to support dynamic, multi-step medical reasoning. Our system features a lightweight retrieval-augmented generation pipeline coupled with a cache-and-prune memory bank, enabling efficient long-context inference beyond standard LLM limits. The system autonomously invokes specialized tools, eliminating the need for manual prompt engineering or brittle multi-stage templates. We compared the agentic system against standalone LLMs on various medical QA benchmarks. Results: Evaluated on five well-known medical QA benchmarks, our system outperforms or closely matches state-of-the-art proprietary and open-source medical LLMs in multiple-choice and open-ended formats. Specifically, our system achieved accuracies of 82.98% on USMLE Step 1 and 86.24% on USMLE Step 2, surpassing GPT-4's 80.67% and 81.67%, respectively, while closely matching on USMLE Step 3 (88.52% vs. 89.78%). Conclusion: Our findings highlight the value of combining tool-augmented and evidence-grounded reasoning strategies to build reliable and scalable medical AI systems.

安装插件收集

被引 1

ChatEED: An agentic retrieval assistant for accelerator operators

ChatEED：针对加速器操作员的代理式检索助手

A. Reed, Claudio Bisegni, S. Shrestha 等, 2025-Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

Scientists and operators at SLAC National Accelerator Laboratory rely on electronic logbooks (ELOGs) to record and share critical information surrounding accelerator operations. However, since creating log entries is time-consuming and complex, they are often brief, incomplete, filled with jargon, and inconsistently structured. With thousands of records spanning decades, this makes it difficult for operators to search for and interpret information. Through interviews with operators, we identified two critical gaps: the lack of automated shift summarization and the difficulty of real-time ELOG information retrieval. Therefore, we introduce ChatEED, a novel agentic retrieval-augmented generation (RAG) system that is operator-centric and addresses these two needs while also prioritizing security, modularity, efficiency, and transparency. In this paper, we analyze the operator needs and workflow that guide the system design, detail the system architecture and deployment, and outline future directions for expansion and evaluation. This ongoing work demonstrates the potential for AI systems to improve continuity, communication, and efficiency in high-performance science facilities.CCS Concepts• Information systems → Question answering; Summarization; Information extraction.

安装插件收集

被引 1

RALLM-POI: Retrieval-Augmented LLM for Zero-shot Next POI Recommendation with Geographical Reranking

RALLM-POI：基于检索增强的LLM实现零样本下一地点推荐与地理重排序

Kunrong Li, Kwan Hui Lim, 2025-Pacific Rim International Conference on Artificial Intelligence

Next point-of-interest (POI) recommendation predicts a user's next destination from historical movements. Traditional models require intensive training, while LLMs offer flexible and generalizable zero-shot solutions but often generate generic or geographically irrelevant results due to missing trajectory and spatial context. To address these issues, we propose RALLM-POI, a framework that couples LLMs with retrieval-augmented generation and self-rectification. We first propose a Historical Trajectory Retriever (HTR) that retrieves relevant past trajectories to serve as contextual references, which are then reranked by a Geographical Distance Reranker (GDR) for prioritizing spatially relevant trajectories. Lastly, an Agentic LLM Rectifier (ALR) is designed to refine outputs through self-reflection. Without additional training, RALLM-POI achieves substantial accuracy gains across three real-world Foursquare datasets, outperforming both conventional and LLM-based baselines. Code is released at https://github.com/LKRcrocodile/RALLM-POI.

安装插件收集

被引 2

Agentic RAG with Human-in-the-Retrieval

具有检索中人类角色的代理式RAG

Xiwei Xu, Dawen Zhang, Qing Liu 等, 2025-2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C)

Retrieval-Augmented Generation (RAG) has emerged as a promising solution to address key challenges faced by GenAI, such as hallucination, outdated or non-removable parametric knowledge, and non-traceable reasoning processes. Existing RAG frameworks introduce dynamism into RAG process through adaptive, recursive and interactive usage of retriever and generator. More recently, agentic RAG adds another layer of intelligence to RAG by leveraging GenAI agents to further enhance dynamism by autonomously planning the retrieval process as a complex orchestration workflow with various external tools. However, current RAG architectures often overlook the significant role that domain experts can play in the retrieval process, alongside passive knowledge bases. This paper introduces a new paradigm for agentic RAG systems, capable of integrating external passive knowledge bases as well as active domain experts. This integration further enhances the versatility and factual accuracy of RAG systems. The paper discusses the key components of this new paradigm and examines the associated design challenges.

安装插件收集

被引 6

Open-source modular AI coupled with agentic AI for comprehensive breast cancer note generation and guideline-directed treatment comparison.

开源模块化人工智能与代理人工智能结合，用于全面乳腺癌笔记生成和指南指导治疗比较

Ahmed Sandhu, Elizabeth Jaewon Kim, Daniela Urueta Portillo 等, 2025-Journal of Clinical Oncology1区 TopIF 41.9

e13685 Background: Generative Artificial Intelligence (GenAI) has demonstrated promise as a clinical decision support tool. Previous studies utilized closed-source large language models (LLMs) such as GPT-4o (via the chatbot ChatGPT) to evaluate GenAI's role in healthcare. However, these LLMs may change, causing challenges with reliability and reproducibility. Hallucinations are especially concerning in healthcare, so methods such as grounding and retrieval augmented generation (RAG) are important tools that may reduce or eliminate hallucinations. Methods: The goal of this study was to enhance GenAI with agentic AI and vector-based RAG, using only open-source tools and LLMs to produce reliable breast cancer summaries and treatment evaluations. A container with Neo4j vector database, LangChain, Docling, and Jupyter was created to review HL7 patient charts containing mCODE data. Ollama was used to pull the LLMs llama3.2, gemma2:2b, qwen2.5, and phi3:mini. Synthetic Breast Cancer Dataset collected from The mCODE Project was collected, and a custom HL7-mCODE module was made to make patient data LLM-ingestible. The workflow was as follows: a modular (i.e., swappable) LLM with RAG would iterate over patient notes to extract all information related to cancer in their chart. A subsequent LLM (i.e., agentic AI) would compare the first AI's extraction with an mCODE summary to evaluate if there were any errors, remove them, and return a corrected cancer history. After this comparison was complete, another AI agent would evaluate for missing oncologic information (such as HER status) and return a list of known and unknown information for breast cancer. For the last step, NCCN Breast Cancer guidelines (Version 6.2024 11-11-2024) were converted to LLM-ingestible text via IBM's docling and placed in a vector database. The last AI agent would compare the patient's cancer details and treatment to compare with the guidelines. Results: 724 patient charts were generated with various modular AIs. No hallucinations were observed in the outputted data (i.e., no fabricated diagnoses, cancer details, treatments, etc.), and no incorrect interpretations were found. Most outputs correctly stated they could not assess NCCN guidelines due to insufficient information in the patient chart; charts with sufficient information to follow a specific guideline returned correct comparisons. In one case, Microsoft's phi3:mini was able to discern that while the guidelines were not followed, the provided guidelines are newer than the date the synthetic patient received treatment. Conclusions: Agentic AI as a utility for grounding, summarizing, and quality assurance demonstrates promise as an augmentation for GenAI to produce effective CDS tools for breast cancer history collection, evaluation, and treatment. Further studies with knowledge graphs may further improve their utility.

安装插件收集

被引 1

Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG

基于多智能体RAG的概念瓶颈方法实现可解释放射学报告生成

Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir 等, 2024-Lecture Notes in Computer Science

Deep learning has advanced medical image classification, but interpretability challenges hinder its clinical adoption. This study enhances interpretability in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs) and a multi-agent Retrieval-Augmented Generation (RAG) system for report generation. By modeling relationships between visual features and clinical concepts, we create interpretable concept vectors that guide a multi-agent RAG system to generate radiology reports, enhancing clinical relevance, explainability, and transparency. Evaluation of the generated reports using an LLM-as-a-judge confirmed the interpretability and clinical utility of our model's outputs. On the COVID-QU dataset, our model achieved 81% classification accuracy and demonstrated robust report generation performance, with five key metrics ranging between 84% and 90%. This interpretable multi-agent framework bridges the gap between high-performance AI and the explainability required for reliable AI-driven CXR analysis in clinical settings. Our code is available at https://github.com/tifat58/IRR-with-CBM-RAG.git.

安装插件收集

被引 17

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct：在语言模型中协同推理与行动

arxiv.org-Shunyu Yao, Jeffrey Zhao, Dian Yu 等, 2022-ArXiv Preprint

While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: https://react-lm.github.io

安装插件收集

Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

基于知识驱动的代理科学语料库蒸馏框架，用于生物医学大型语言模型训练

arxiv.org-Meng Xiao, Xunxin Cai, Qingqing Long 等, 2025-ArXiv Preprint

Corpus distillation for biomedical large language models (LLMs) seeks to address the pressing challenge of insufficient quantity and quality in open-source annotated scientific corpora, which remains a bottleneck for effective LLM training in biomedical research. This paper proposes a knowledge-driven, agentic framework for scientific corpus distillation, tailored explicitly for LLM training in the biomedical domain, addressing the challenge posed by the complex hierarchy of biomedical knowledge. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. This agentic framework collectively generates and refines domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.

安装插件收集

Agentic Observability: Automated Alert Triage for Adobe E-Commerce

代理可观察性：Adobe电子商务的自动警报分类

arxiv.org-Aprameya Bharadwaj, Kyle Tu, 2026-ArXiv Preprint

Modern enterprise systems exhibit complex interdependencies that make observability and incident response increasingly challenging. Manual alert triage, which typically involves log inspection, API verification, and cross-referencing operational knowledge bases, remains a major bottleneck in reducing mean recovery time (MTTR). This paper presents an agentic observability framework deployed within Adobe's e-commerce infrastructure that autonomously performs alert triage using a ReAct paradigm. Upon alert detection, the agent dynamically identifies the affected service, retrieves and analyzes correlated logs across distributed systems, and plans context-dependent actions such as handbook consultation, runbook execution, or retrieval-augmented analysis of recently deployed code. Empirical results from production deployment indicate a 90% reduction in mean time to insight compared to manual triage, while maintaining comparable diagnostic accuracy. Our results show that agentic AI enables an order-of-magnitude reduction in triage latency and a step-change in resolution accuracy, marking a pivotal shift toward autonomous observability in enterprise operations.

安装插件收集

Reason-Plan-ReAct: A Reasoner-Planner Supervising a ReAct Executor for Complex Enterprise Tasks

Reason-Plan-ReAct：一个监督ReAct执行器的推理规划器，用于复杂企业任务

arxiv.org-Gianni Molinari, Fabio Ciravegna, 2025-ArXiv Preprint

Despite recent advances, autonomous agents often struggle to solve complex tasks in enterprise domains that require coordinating multiple tools and processing diverse data sources. This struggle is driven by two main limitations. First, single-agent architectures enforce a monolithic plan-execute loop, which directly causes trajectory instability. Second, the requirement to use local open-weight models for data privacy introduces smaller context windows leading to the rapid consumption of context from large tool outputs. To solve this problem we introduce RP-ReAct (Reasoner Planner-ReAct), a novel multi-agent approach that fundamentally decouples strategic planning from low-level execution to achieve superior reliability and efficiency. RP-ReAct consists of a Reasoner Planner Agent (RPA), responsible for planning each sub-step, continuously analysing the execution results using the strong reasoning capabilities of a Large Reasoning Model, and one or multiple Proxy-Execution Agent (PEA) that translates sub-steps into concrete tool interactions using a ReAct approach. Crucially, we incorporate a context-saving strategy within the PEA to mitigate context window overflow by managing large tool outputs via external storage and on-demand access. We evaluate RP-ReAct, on the challenging, multi-domain ToolQA benchmark using a diverse set of six open-weight reasoning models. Our empirical results show that RP-ReAct achieves superior performance and improved generalization ability over state-of-the-art baselines when addressing diverse complex tasks across the evaluated domains. Furthermore we establish the enhanced robustness and stability of our approach across different model scales, paving the way for effective and deployable agentic solutions for enterprises.

安装插件收集

On the Brittle Foundations of ReAct Prompting for Agentic Large Language Models

关于ReAct提示法在代理大型语言模型中脆弱基础的探讨

arxiv.org-Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati, 2024-ArXiv Preprint

The reasoning abilities of Large Language Models (LLMs) remain a topic of debate. Some methods such as ReAct-based prompting, have gained popularity for claiming to enhance sequential decision-making abilities of agentic LLMs. However, it is unclear what is the source of improvement in LLM reasoning with ReAct based prompting. In this paper we examine these claims of ReAct based prompting in improving agentic LLMs for sequential decision-making. By introducing systematic variations to the input prompt we perform a sensitivity analysis along the claims of ReAct and find that the performance is minimally influenced by the "interleaving reasoning trace with action execution" or the content of the generated reasoning traces in ReAct, contrary to original claims and common usage. Instead, the performance of LLMs is driven by the similarity between input example tasks and queries, implicitly forcing the prompt designer to provide instance-specific examples which significantly increases the cognitive burden on the human. Our investigation shows that the perceived reasoning abilities of LLMs stem from the exemplar-query similarity and approximate retrieval rather than any inherent reasoning abilities.

安装插件收集

SBOMs into Agentic AIBOMs: Schema Extensions, Agentic Orchestration, and Reproducibility Evaluation

从SBOM到代理AIBOM：模式扩展、代理编排和可重现性评估

arxiv.org-Petar Radanliev, Carsten Maple, Omar Santos 等, 2026-ArXiv Preprint

Software supply-chain security requires provenance mechanisms that support reproducibility and vulnerability assessment under dynamic execution conditions. Conventional Software Bills of Materials (SBOMs) provide static dependency inventories but cannot capture runtime behaviour, environment drift, or exploitability context. This paper introduces agentic Artificial Intelligence Bills of Materials (AIBOMs), extending SBOMs into active provenance artefacts through autonomous, policy-constrained reasoning. We present an agentic AIBOM framework based on a multi-agent architecture comprising (i) a baseline environment reconstruction agent (MCP), (ii) a runtime dependency and drift-monitoring agent (A2A), and (iii) a policy-aware vulnerability and VEX reasoning agent (AGNTCY). These agents generate contextual exploitability assertions by combining runtime execution evidence, dependency usage, and environmental mitigations with ISO/IEC 20153:2025 Common Security Advisory Framework (CSAF) v2.0 semantics. Exploitability is expressed via structured VEX assertions rather than enforcement actions. The framework introduces minimal, standards-aligned schema extensions to CycloneDX and SPDX, capturing execution context, dependency evolution, and agent decision provenance while preserving interoperability. Evaluation across heterogeneous analytical workloads demonstrates improved runtime dependency capture, reproducibility fidelity, and stability of vulnerability interpretation compared with established provenance systems, with low computational overhead. Ablation studies confirm that each agent contributes distinct capabilities unavailable through deterministic automation.

安装插件收集

AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems

AEMA：可信且可控的智能体LLM系统的可验证评估框架

arxiv.org-YenTing Lee, Keerthi Koneru, Zahra Moslemi 等, 2026-ArXiv Preprint

Evaluating large language model (LLM)-based multi-agent systems remains a critical challenge, as these systems must exhibit reliable coordination, transparent decision-making, and verifiable performance across evolving tasks. Existing evaluation approaches often limit themselves to single-response scoring or narrow benchmarks, which lack stability, extensibility, and automation when deployed in enterprise settings at multi-agent scale. We present AEMA (Adaptive Evaluation Multi-Agent), a process-aware and auditable framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under human oversight. Compared to a single LLM-as-a-Judge, AEMA achieves greater stability, human alignment, and traceable records that support accountable automation. Our results on enterprise-style agent workflows simulated using realistic business scenarios demonstrate that AEMA provides a transparent and reproducible pathway toward responsible evaluation of LLM-based multi-agent systems. Keywords Agentic AI, Multi-Agent Systems, Trustworthy AI, Verifiable Evaluation, Human Oversight

安装插件收集

RA-Gen: A Controllable Code Generation Framework Using ReAct for Multi-Agent Task Execution

RA-Gen：一种基于ReAct的多智能体任务执行的可控代码生成框架

arxiv.org-Aofan Liu, Haoxuan Li, Bin Wang 等, 2025-ArXiv Preprint

Code generation models based on large language models (LLMs) have gained wide adoption, but challenges remain in ensuring safety, accuracy, and controllability, especially for complex tasks. Existing methods often lack dynamic integration of external tools, transparent reasoning, and user control over safety. To address these issues, we propose a controllable code generation framework utilizing the ReAct paradigm for multi-agent task execution. This framework is a multi-agent system designed to enable efficient, precise, and interpretable code generation through dynamic interactions between LLMs and external resources. The framework adopts a collaborative architecture comprising four specialized agents: a Planner for task decomposition, a Searcher that leverages the ReAct framework for reasoning and tool integration, a CodeGen agent for accurate code generation, and an Extractor for structured data retrieval. The ReAct-based Searcher alternates between generating reasoning traces and executing actions, facilitating seamless integration of internal knowledge with external tools (such as search engines) to enhance accuracy and user control. Experimental results show the framework's effectiveness across multiple languages, achieving a 94.8% security rate on the SVEN dataset with CodeQL, outperforming existing approaches. Its transparent reasoning process fosters user trust and improves controllability.

安装插件收集

Multi-View Encoders for Performance Prediction in LLM-Based Agentic Workflows

基于LLM的代理工作流程中的多视角编码器用于性能预测

arxiv.org-Patara Trirat, Wonyong Jeong, Sung Ju Hwang, 2025-ArXiv Preprint

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This paper proposes Agentic Predictor, a lightweight predictor for efficient agentic workflow evaluation. Agentic Predictor is equipped with a multi-view workflow encoding technique that leverages multi-view representation learning of agentic systems by incorporating code architecture, textual prompts, and interaction graph features. To achieve high predictive accuracy while significantly reducing the number of required workflow evaluations for training a predictor, Agentic Predictor employs cross-domain unsupervised pretraining. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations for a given task, significantly reducing the need for expensive trial-and-error evaluations. Experiments on a carefully curated benchmark spanning three domains show that our predictor outperforms several strong graph-based baselines in both predictive accuracy and workflow utility, highlighting the potential of performance predictors in streamlining the design of LLM-based agentic workflows.

安装插件收集

Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments

arxiv.org-Siwei Wu, Yizhi Li, Yuyang Song 等, 2026-ArXiv Preprint

Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: \textbf{\emph{Executability}}, since each instance requires a suitable and often distinct Docker environment; and \textbf{\emph{Verifiability}}, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose \textbf{TerminalTraj}, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20\% on TB~1.0 and 10\% on TB~2.0 over their respective backbones. Notably, \textbf{TerminalTraj-32B} achieves strong performance among models with fewer than 100B parameters, reaching 35.30\% on TB~1.0 and 22.00\% on TB~2.0, and demonstrates improved test-time scaling behavior. All code and data are available at https://github.com/Wusiwei0410/TerminalTraj.

安装插件收集

Autono: A ReAct-Based Highly Robust Autonomous Agent Framework

基于ReAct范式的自诺：一个高度鲁棒的自主代理框架

arxiv.org-Zihao Wu, 2025-ArXiv Preprint

This paper proposes a highly robust autonomous agent framework based on the ReAct paradigm, designed to solve complex tasks through adaptive decision making and multi-agent collaboration. Unlike traditional frameworks that rely on fixed workflows generated by LLM-based planners, this framework dynamically generates next actions during agent execution based on prior trajectories, thereby enhancing its robustness. To address potential termination issues caused by adaptive execution paths, I propose a timely abandonment strategy incorporating a probabilistic penalty mechanism. For multi-agent collaboration, I introduce a memory transfer mechanism that enables shared and dynamically updated memory among agents. The framework's innovative timely abandonment strategy dynamically adjusts the probability of task abandonment via probabilistic penalties, allowing developers to balance conservative and exploratory tendencies in agent execution strategies by tuning hyperparameters. This significantly improves adaptability and task execution efficiency in complex environments. Additionally, agents can be extended through external tool integration, supported by modular design and MCP protocol compatibility, which enables flexible action space expansion. Through explicit division of labor, the multi-agent collaboration mechanism enables agents to focus on specific task components, thereby significantly improving execution efficiency and quality.

安装插件收集

POLARIS: Typed Planning and Governed Execution for Agentic AI in Back-Office Automation

POLARIS：面向后台自动化中代理型AI的带类型规划与受控执行

arxiv.org-Zahra Moslemi, Keerthi Koneru, Yen-Ting Lee 等, 2026-ArXiv Preprint

Enterprise back office workflows require agentic systems that are auditable, policy-aligned, and operationally predictable, capabilities that generic multi-agent setups often fail to deliver. We present POLARIS (Policy-Aware LLM Agentic Reasoning for Integrated Systems), a governed orchestration framework that treats automation as typed plan synthesis and validated execution over LLM agents. A planner proposes structurally diverse, type checked directed acyclic graphs (DAGs), a rubric guided reasoning module selects a single compliant plan, and execution is guarded by validator gated checks, a bounded repair loop, and compiled policy guardrails that block or route side effects before they occur. Applied to document centric finance tasks, POLARIS produces decision grade artifacts and full execution traces while reducing human intervention. Empirically, POLARIS achieves a micro F1 of 0.81 on the SROIE dataset and, on a controlled synthetic suite, achieves 0.95 to 1.00 precision for anomaly routing with preserved audit trails. These evaluations constitute an initial benchmark for governed Agentic AI. POLARIS provides a methodological and benchmark reference for policy-aligned Agentic AI. Keywords Agentic AI, Enterprise Automation, Back-Office Tasks, Benchmarks, Governance, Typed Planning, Evaluation

安装插件收集

Agentic Scene Policies: Unifying Space, Semantics, and Affordances for Robot Action

代理场景策略：统一空间、语义和功能以实现机器人动作

arxiv.org-Sacha Morin, Kumaraditya Gupta, Mahtab Sandhu 等, 2025-ArXiv Preprint

Executing open-ended natural language queries is a core problem in robotics. While recent advances in imitation learning and vision-language-actions models (VLAs) have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. An alternative is to design an explicit scene representation as a queryable interface between the robot and the world, using query results to guide downstream motion planning. In this work, we present Agentic Scene Policies (ASP), an agentic framework that leverages the advanced semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy. ASP can execute open-vocabulary queries in a zero-shot manner by explicitly reasoning about object affordances in the case of more complex skills. Through extensive experiments, we compare ASP with VLAs on tabletop manipulation problems and showcase how ASP can tackle room-level queries through affordance-guided navigation, and a scaled-up scene representation. (Project page: https://montrealrobotics.ca/agentic-scene-policies.github.io/)

安装插件收集

Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance

在代理湖house中的可信AI：从并发到治理

arxiv.org-Jacopo Tagliabue, Federico Bianchi, Ciro Greco, 2025-ArXiv Preprint

Even as AI capabilities improve, most enterprises do not consider agents trustworthy enough to work on production data. In this paper, we argue that the path to trustworthy agentic workflows begins with solving the infrastructure problem first: traditional lakehouses are not suited for agent access patterns, but if we design one around transactions, governance follows. In particular, we draw an operational analogy to MVCC in databases and show why a direct transplant fails in a decoupled, multi-language setting. We then propose an agent-first design, Bauplan, that reimplements data and compute isolation in the lakehouse. We conclude by sharing a reference implementation of a self-healing pipeline in Bauplan, which seamlessly couples agent reasoning with all the desired guarantees for correctness and trust.

安装插件收集

Agint: Agentic Graph Compilation for Software Engineering Agents

Agint：面向软件工程代理的代理图编译器

arxiv.org-Abhi Chivukula, Jay Somasundaram, Vijay Somasundaram, 2025-ArXiv Preprint

LLM-based coding agents are increasingly common but still face challenges in context management, latency, reliability, reproducibility, and scalability. We present Agint, an agentic graph compiler, interpreter, and runtime that incrementally and hierarchically converts natural-language instructions into typed, effect-aware code DAGs. Agint introduces explicit type floors (text to data to spec to code) grounded in semantic graph transformations and a hybrid LLM and function-based JIT runtime. This enables dynamic graph refinement, reproducible and optimizable execution, speculative evaluation, and interoperability with existing developer tools. Agint's typed graph bindings improve reliability and allow concurrent composition of concurrent codebases by construction, supporting accelerated development with smaller and faster models, lower latency, efficient context utilization, and higher throughput. Hierarchical compilation allows scalable graph edits, while the graph structure supports reproducibility and efficient parallel generation. Agint provides a composable unix-style toolchain: dagify (DAG compiler), dagent (hybrid JIT runtime), schemagin (schema generator), and datagin (data transformer) for realtime, low-latency code and dataflow creation. Human developers and coding agents refine graphs through the Agint CLI, while non-technical users use Agint Flow GUI for visual editing, conversational refinement, and debugging to promote prototype agentic workflows to production code. This continuous co-creation model allows teams to prototype quickly, refine seamlessly, and deploy reliably, bridging natural language, compiler methods, and developer tooling to enable a new generation of composable, team-centric coding agents at scale.

安装插件收集

Architecting Agentic Communities using Design Patterns

基于设计模式的代理社区架构设计

arxiv.org-Zoran Milosevic, Fethi Rabhi, 2026-ArXiv Preprint

The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production-grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task-specific automation), Agentic AI (adaptive goal-seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities - coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans - most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems.

安装插件收集

Real-Time Trust Verification for Safe Agentic Actions using TrustBench

基于TrustBench的实时可信验证，确保安全代理行为

arxiv.org-Tavishi Sharma, Vinayak Sharma, Pragya Sharma, 2026-ArXiv Preprint

As large language models evolve from conversational assistants to autonomous agents, ensuring trustworthiness requires a fundamental shift from post-hoc evaluation to real-time action verification. Current frameworks like AgentBench evaluate task completion, while TrustLLM and HELM assess output quality after generation. However, none of these prevent harmful actions during agent execution. We present TrustBench, a dual-mode framework that (1) benchmarks trust across multiple dimensions using both traditional metrics and LLM-as-a-Judge evaluations, and (2) provides a toolkit agents invoke before taking actions to verify safety and reliability. Unlike existing approaches, TrustBench intervenes at the critical decision point: after an agent formulates an action but before execution. Domain-specific plugins encode specialized safety requirements for healthcare, finance, and technical domains. Across multiple agentic tasks, TrustBench reduced harmful actions by 87%. Domain-specific plugins outperformed generic verification, achieving 35% greater harm reduction. With sub-200ms latency, TrustBench enables practical real-time trust verification for autonomous agents.

安装插件收集

AgentGuard: Runtime Verification of AI Agents

AgentGuard：AI代理的运行时验证

arxiv.org-Roham Koohestani, 2025-ArXiv Preprint

The rapid evolution to autonomous, agentic AI systems introduces significant risks due to their inherent unpredictability and emergent behaviors; this also renders traditional verification methods inadequate and necessitates a shift towards probabilistic guarantees where the question is no longer if a system will fail, but the probability of its failure within given constraints. This paper presents AgentGuard, a framework for runtime verification of Agentic AI systems that provides continuous, quantitative assurance through a new paradigm called Dynamic Probabilistic Assurance. AgentGuard operates as an inspection layer that observes an agent's raw I/O and abstracts it into formal events corresponding to transitions in a state model. It then uses online learning to dynamically build and update a Markov Decision Process (MDP) that formally models the agent's emergent behavior. Using probabilistic model checking, the framework then verifies quantitative properties in real-time.

安装插件收集

Architectures for Building Agentic AI

构建具有代理能力的AI架构

arxiv.org-Sławomir Nowaczyk, 2025-ArXiv Preprint

This chapter argues that the reliability of agentic and generative AI is chiefly an architectural property. We define agentic systems as goal-directed, tool-using decision makers operating in closed loops, and show how reliability emerges from principled componentisation (goal manager, planner, tool-router, executor, memory, verifiers, safety monitor, telemetry), disciplined interfaces (schema-constrained, validated, least-privilege tool calls), and explicit control and assurance loops. Building on classical foundations, we propose a practical taxonomy-tool-using agents, memory-augmented agents, planning and self-improvement agents, multi-agent systems, and embodied or web agents - and analyse how each pattern reshapes the reliability envelope and failure modes. We distil design guidance on typed schemas, idempotency, permissioning, transactional semantics, memory provenance and hygiene, runtime governance (budgets, termination conditions), and simulate-before-actuate safeguards.

安装插件收集

A cybersecurity AI agent selection and decision support framework

网络安全人工智能代理选择与决策支持框架

arxiv.org-Masike Malatji, 2025-ArXiv Preprint

This paper presents a novel, structured decision support framework that systematically aligns diverse artificial intelligence (AI) agent architectures, reactive, cognitive, hybrid, and learning, with the comprehensive National Institute of Standards and Technology (NIST) Cybersecurity Framework (CSF) 2.0. By integrating agent theory with industry guidelines, this framework provides a transparent and stepwise methodology for selecting and deploying AI solutions to address contemporary cyber threats. Employing a granular decomposition of NIST CSF 2.0 functions into specific tasks, the study links essential AI agent properties such as autonomy, adaptive learning, and real-time responsiveness to each subcategory's security requirements. In addition, it outlines graduated levels of autonomy (assisted, augmented, and fully autonomous) to accommodate organisations at varying stages of cybersecurity maturity. This holistic approach transcends isolated AI applications, providing a unified detection, incident response, and governance strategy. Through conceptual validation, the framework demonstrates how tailored AI agent deployments can align with real-world constraints and risk profiles, enhancing situational awareness, accelerating response times, and fortifying long-term resilience via adaptive risk management. Ultimately, this research bridges the gap between theoretical AI constructs and operational cybersecurity demands, establishing a foundation for robust, empirically validated multi-agent systems that adhere to industry standards.

安装插件收集

AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

AutoTool：基于代理推理的动态工具选择与集成框架

arxiv.org-Jiaru Zou, Ling Yang, Yunzhe Qi 等, 2025-ArXiv Preprint

Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.

安装插件收集

Towards Trustworthy Multi-Turn LLM Agents via Behavioral Guidance

通过行为引导实现可信赖的多轮LLM代理

arxiv.org-Gonca Gürsun, 2025-ArXiv Preprint

Large Language Models demonstrate strong reasoning and generation abilities, yet their behavior in multi-turn tasks often lacks reliability and verifiability. We present a task completion framework that enables LLM-based agents to act under explicit behavioral guidance in environments described by reinforcement learning formalisms with defined observation, action, and reward signals. The framework integrates three components: a lightweight task profiler that selects reasoning and generation strategies, a reasoning module that learns verifiable observation - action mappings, and a generation module that enforces constraint-compliant outputs through validation or deterministic synthesis. We show that as the agent interacts with the environment, these components co-evolve, yielding trustworthy behavior.

安装插件收集

An Explainable Agentic AI Framework for Uncertainty-Aware and Abstention-Enabled Acute Ischemic Stroke Imaging Decisions

一种可解释的代理人工智能框架，用于急性缺血性卒中影像的基于不确定性和弃权功能的决策

arxiv.org-Md Rashadul Islam, 2026-ArXiv Preprint

Artificial intelligence models have shown strong potential in acute ischemic stroke imaging, particularly for lesion detection and segmentation using computed tomography and magnetic resonance imaging. However, most existing approaches operate as black box predictors, producing deterministic outputs without explicit uncertainty awareness or structured mechanisms to abstain under ambiguous conditions. This limitation raises serious safety and trust concerns in high risk emergency radiology settings. In this paper, we propose an explainable agentic AI framework for uncertainty aware and abstention enabled decision support in acute ischemic stroke imaging. The framework follows a modular agentic pipeline in which a perception agent performs lesion aware image analysis, an uncertainty estimation agent computes slice level predictive reliability, and a decision agent determines whether to issue a prediction or abstain based on predefined uncertainty thresholds. Unlike prior stroke imaging systems that primarily focus on improving segmentation or classification accuracy, the proposed framework explicitly prioritizes clinical safety, transparency, and clinician aligned decision behavior. Qualitative and case based analyses across representative stroke imaging scenarios demonstrate that uncertainty driven abstention naturally emerges in diagnostically ambiguous regions and low information slices. The framework further integrates visual explanation mechanisms to support both predictive and abstention decisions, addressing a key limitation of existing uncertainty aware medical imaging systems. Rather than introducing a new performance benchmark, this work presents agentic control, uncertainty awareness, and selective abstention as essential design principles for developing safe and trustworthy medical imaging AI systems.

安装插件收集

FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation

FAIR-RAG：忠实自适应迭代细化用于检索增强生成

arxiv.org-Mohammad Aghajani Asl, Majid Asgari-Bidhendi, Behrooz Minaei-Bidgoli, 2025-ArXiv Preprint

While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. At its core is an Iterative Refinement Cycle governed by a module we term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating mechanism: it deconstructs the initial query into a checklist of required findings and audits the aggregated evidence to identify confirmed facts and, critically, explicit informational gaps. These gaps provide a precise signal to an Adaptive Query Refinement agent, which generates new, targeted sub-queries to retrieve missing information. This cycle repeats until the evidence is verified as sufficient, ensuring a comprehensive context for a final, strictly faithful generation. We conducted experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified experimental setup, FAIR-RAG significantly outperforms strong baselines. On HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3 points over the strongest iterative baseline -- establishing a new state-of-the-art for this class of methods on these benchmarks. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems for complex, knowledge-intensive tasks.

安装插件收集

A Collaborative Multi-Agent Approach to Retrieval-Augmented Generation Across Diverse Data

arxiv.org-Aniruddha Salve, Saba Attar, Mahesh Deshmukh 等, 2024-ArXiv Preprint

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external, domain-specific data into the generative process. While LLMs are highly capable, they often rely on static, pre-trained datasets, limiting their ability to integrate dynamic or private data. Traditional RAG systems typically use a single-agent architecture to handle query generation, data retrieval, and response synthesis. However, this approach becomes inefficient when dealing with diverse data sources, such as relational databases, document stores, and graph databases, often leading to performance bottlenecks and reduced accuracy. This paper proposes a multi-agent RAG system to address these limitations. Specialized agents, each optimized for a specific data source, handle query generation for relational, NoSQL, and document-based systems. These agents collaborate within a modular framework, with query execution delegated to an environment designed for compatibility across various database types. This distributed approach enhances query efficiency, reduces token overhead, and improves response accuracy by ensuring that each agent focuses on its specialized task. The proposed system is scalable and adaptable, making it ideal for generative AI workflows that require integration with diverse, dynamic, or private data sources. By leveraging specialized agents and a modular execution environment, the system provides an efficient and robust solution for handling complex, heterogeneous data environments in generative AI applications.

安装插件收集

LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval

LLandMark：一种基于地标感知的多模态交互式视频检索的多智能体框架

arxiv.org-Minh-Chi Phung, Thien-Bao Le, Cam-Tu Tran-Thi 等, 2026-ArXiv Preprint

The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.

安装插件收集

CARROT: A Learned Cost-Constrained Retrieval Optimization System for RAG

CARROT：一种基于学习成本约束的RAG检索优化系统

arxiv.org-Ziting Wang, Haitao Yuan, Wei Dong 等, 2024-ArXiv Preprint

Large Language Models (LLMs) have demonstrated impressive ability in generation and reasoning tasks but struggle with handling up-to-date knowledge, leading to inaccuracies or hallucinations. Retrieval-Augmented Generation (RAG) mitigates this by retrieving and incorporating external knowledge into input prompts. In particular, due to LLMs' context window limitations and long-context hallucinations, only the most relevant "chunks" are retrieved. However, current RAG systems face three key challenges: (1) chunks are often retrieved independently without considering their relationships, such as redundancy and ordering; (2) the utility of chunks is non-monotonic, as adding more chunks can degrade quality; and (3) retrieval strategies fail to adapt to the unique characteristics of different queries. To overcome these challenges, we design a cost-constrained retrieval optimization framework for RAG. We adopt a Monte Carlo Tree Search (MCTS) based strategy to find the optimal chunk combination order, which considers the chunks' correlations. In addition, to address the non-monotonicity of chunk utility, instead of treating budget exhaustion as the termination condition, we design a utility computation strategy to identify the optimal chunk combination without necessarily exhausting the budget. Furthermore, we propose a configuration agent that predicts optimal configurations for each query domain, improving our framework's adaptability and efficiency. Experimental results demonstrate up to a 30% improvement over baseline models, highlighting the framework's effectiveness, scalability, and suitability. Our source code has been released at https://github.com/wang0702/CARROT.

安装插件收集

IGMiRAG: Intuition-Guided Retrieval-Augmented Generation with Adaptive Mining of In-Depth Memory

IGMiRAG：基于直觉引导的检索增强生成与自适应深度记忆挖掘

arxiv.org-Xingliang Hou, Yuyan Liu, Qi Sun 等, 2026-ArXiv Preprint

Retrieval-augmented generation (RAG) equips large language models (LLMs) with reliable knowledge memory. To strengthen cross-text associations, recent research integrates graphs and hypergraphs into RAG to capture pairwise and multi-entity relations as structured links. However, their misaligned memory organization necessitates costly, disjointed retrieval. To address these limitations, we propose IGMiRAG, a framework inspired by human intuition-guided reasoning. It constructs a hierarchical heterogeneous hypergraph to align multi-granular knowledge, incorporating deductive pathways to simulate realistic memory structures. During querying, IGMiRAG distills intuitive strategies via a question parser to control mining depth and memory window, and activates instantaneous memories as anchors using dual-focus retrieval. Mirroring human intuition, the framework guides retrieval resource allocation dynamically. Furthermore, we design a bidirectional diffusion algorithm that navigates deductive paths to mine in-depth memories, emulating human reasoning processes. Extensive evaluations indicate IGMiRAG outperforms the state-of-the-art baseline by 4.8% EM and 5.0% F1 overall, with token costs adapting to task complexity (average 6.3k+, minimum 3.0k+). This work presents a cost-effective RAG paradigm that improves both efficiency and effectiveness.

安装插件收集

ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation

ARAG：基于代理检索增强的个性化推荐生成

arxiv.org-Reza Yousefi Maragheh, Pratheek Vadla, Priyank Gupta 等, 2025-ArXiv Preprint

Retrieval-Augmented Generation (RAG) has shown promise in enhancing recommendation systems by incorporating external context into large language model prompts. However, existing RAG-based approaches often rely on static retrieval heuristics and fail to capture nuanced user preferences in dynamic recommendation scenarios. In this work, we introduce ARAG, an Agentic Retrieval-Augmented Generation framework for Personalized Recommendation, which integrates a multi-agent collaboration mechanism into the RAG pipeline. To better understand the long-term and session behavior of the user, ARAG leverages four specialized LLM-based agents: a User Understanding Agent that summarizes user preferences from long-term and session contexts, a Natural Language Inference (NLI) Agent that evaluates semantic alignment between candidate items retrieved by RAG and inferred intent, a context summary agent that summarizes the findings of NLI agent, and an Item Ranker Agent that generates a ranked list of recommendations based on contextual fit. We evaluate ARAG accross three datasets. Experimental results demonstrate that ARAG significantly outperforms standard RAG and recency-based baselines, achieving up to 42.1% improvement in NDCG@5 and 35.5% in Hit@5. We also, conduct an ablation study to analyse the effect by different components of ARAG. Our findings highlight the effectiveness of integrating agentic reasoning into retrieval-augmented recommendation and provide new directions for LLM-based personalization.

安装插件收集

DeepCodeSeek: Real-Time API Retrieval for Context-Aware Code Generation

DeepCodeSeek：面向上下文感知代码生成的实时API检索技术

arxiv.org-Esakkivel Esakkiraja, Denis Akhiyarov, Aditya Shanmugham 等, 2025-ArXiv Preprint

Current search techniques are limited to standard RAG query-document applications. In this paper, we propose a novel technique to expand the code and index for predicting the required APIs, directly enabling high-quality, end-to-end code generation for auto-completion and agentic AI applications. We address the problem of API leaks in current code-to-code benchmark datasets by introducing a new dataset built from real-world ServiceNow Script Includes that capture the challenge of unclear API usage intent in the code. Our evaluation metrics show that this method achieves 87.86% top-40 retrieval accuracy, allowing the critical context with APIs needed for successful downstream code generation. To enable real-time predictions, we develop a comprehensive post-training pipeline that optimizes a compact 0.6B reranker through synthetic dataset generation, supervised fine-tuning, and reinforcement learning. This approach enables our compact reranker to outperform a much larger 8B model while maintaining 2.5x reduced latency, effectively addressing the nuances of enterprise-specific code without the computational overhead of larger models.

安装插件收集

TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework

TeaRAG：一种基于代理的检索增强生成框架，具有高效的标记使用

arxiv.org-Chao Zhang, Yuhao Wang, Derong Xu 等, 2025-ArXiv Preprint

Retrieval-Augmented Generation (RAG) utilizes external knowledge to augment Large Language Models' (LLMs) reliability. For flexibility, agentic RAG employs autonomous, multi-round retrieval and reasoning to resolve queries. Although recent agentic RAG has improved via reinforcement learning, they often incur substantial token overhead from search and reasoning processes. This trade-off prioritizes accuracy over efficiency. To address this issue, this work proposes TeaRAG, a token-efficient agentic RAG framework capable of compressing both retrieval content and reasoning steps. 1) First, the retrieved content is compressed by augmenting chunk-based semantic retrieval with a graph retrieval using concise triplets. A knowledge association graph is then built from semantic similarity and co-occurrence. Finally, Personalized PageRank is leveraged to highlight key knowledge within this graph, reducing the number of tokens per retrieval. 2) Besides, to reduce reasoning steps, Iterative Process-aware Direct Preference Optimization (IP-DPO) is proposed. Specifically, our reward function evaluates the knowledge sufficiency by a knowledge matching mechanism, while penalizing excessive reasoning steps. This design can produce high-quality preference-pair datasets, supporting iterative DPO to improve reasoning conciseness. Across six datasets, TeaRAG improves the average Exact Match by 4% and 2% while reducing output tokens by 61% and 59% on Llama3-8B-Instruct and Qwen2.5-14B-Instruct, respectively. Code is available at https://github.com/Applied-Machine-Learning-Lab/TeaRAG.

安装插件收集

Biomni: A General-Purpose Biomedical AI Agent.

Biomni：一种通用的生物医学AI代理

pubmed.ncbi.nlm.nih.gov-Kexin Huang, Serena Zhang, Hanchen Wang 等, 2025-bioRxiv : the preprint server for biology

Biomedical research underpins progress in our understanding of human health and disease, drug discovery, and clinical care. However, with the growth of complex lab experiments, large datasets, many analytical tools, and expansive literature, biomedical research is increasingly constrained by repetitive and fragmented workflows that slow discovery and limit innovation, underscoring the need for a fundamentally new way to scale scientific expertise. Here, we introduce Biomni, a general-purpose biomedical AI agent designed to autonomously execute a wide spectrum of research tasks across diverse biomedical subfields. To systematically map the biomedical action space, Biomni first employs an action discovery agent to create the first unified agentic environment - mining essential tools, databases, and protocols from tens of thousands of publications across 25 biomedical domains. Built on this foundation, Biomni features a generalist agentic architecture that integrates large language model (LLM) reasoning with retrieval-augmented planning and code-based execution, enabling it to dynamically compose and carry out complex biomedical workflows - entirely without relying on predefined templates or rigid task flows. Systematic benchmarking demonstrates that Biomni achieves strong generalization across heterogeneous biomedical tasks - including causal gene prioritization, drug repurposing, rare disease diagnosis, microbiome analysis, and molecular cloning - without any task-specific prompt tuning. Real-world case studies further showcase Biomni's ability to interpret complex, multi-modal biomedical datasets and autonomously generate experimentally testable protocols. Biomni envisions a future where virtual AI biologists operate alongside and augment human scientists to dramatically enhance research productivity, clinical insight, and healthcare. Biomni is ready to use at https://biomni.stanford.edu, and we invite scientists to explore its capabilities, stress-test its limits, and co-create the next era of biomedical discoveries.

安装插件收集

An AI Agent for Fully Automated Multi-Omic Analyses.

面向全自动多组学分析的AI代理

pubmed.ncbi.nlm.nih.gov-Juexiao Zhou, Bin Zhang, Guowei Li 等, 2024-Advanced science (Weinheim, Baden-Wurttemberg, Germany)1区 TopIF 14.1

With the fast-growing and evolving omics data, the demand for streamlined and adaptable tools to handle bioinformatics analysis continues to grow. In response to this need, Automated Bioinformatics Analysis (AutoBA) is introduced, an autonomous AI agent designed explicitly for fully automated multi-omic analyses based on large language models (LLMs). AutoBA simplifies the analytical process by requiring minimal user input while delivering detailed step-by-step plans for various bioinformatics tasks. AutoBA's unique capacity to self-design analysis processes based on input data variations further underscores its versatility. Compared with online bioinformatic services, AutoBA offers multiple LLM backends, with options for both online and local usage, prioritizing data security and user privacy. In comparison to ChatGPT and open-source LLMs, an automated code repair (ACR) mechanism in AutoBA is designed to improve its stability in automated end-to-end bioinformatics analysis tasks. Moreover, different from the predefined pipeline, AutoBA has adaptability in sync with emerging bioinformatics tools. Overall, AutoBA represents an advanced and convenient tool, offering robustness and adaptability for conventional multi-omic analyses.

安装插件收集

LLM Agent Based Protein Function Prediction.

基于大型语言模型的蛋白质功能预测代理系统

pubmed.ncbi.nlm.nih.gov-Fernando Zhapa-Camacho, Olga Mashkova, Robert Hoehndorf 等, 2026-Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

Protein function prediction remains a fundamental challenge in computational biology. Here, we present a Large Language Model (LLM) agent-based system that improves protein function prediction performance using knowledge-augmented reasoning and multi-source evidence synthesis.Our approach integrates computational predictions with structured protein metadata, scientific literature, and ontological knowledge through a multi-stage reasoning process. An LLM agent equipped with specialized tools progressively refines functional predictions by querying constraints, cross-referencing evidence, and ensuring biological plausibility. Furthermore, the system provides detailed explanations for each prediction update, documenting the reasoning process and evidence sources.We evaluate our approach against established baseline methods across three Gene Ontology sub-ontologies using four complementary metrics, achieving superior performance in threshold-dependent measures, attaining the lowest Smin scores across all ontologies and the best Fmax for Molecular Function and Cellular Component ontologies. We make our code publicly available at https://github.com/bio-ontology-research-group/go-agent.

安装插件收集

A multimodal LLM-agent framework for personalized clinical decision-making in hepatocellular carcinoma.

基于多模态LLM代理的个性化肝细胞癌临床决策框架

pubmed.ncbi.nlm.nih.gov-Liyang Wang, Fa Tian, Chengquan Li 等, 2025-Patterns (New York, N.Y.)2区IF 7.4

Hepatocellular carcinoma (HCC) treatment is challenging due to tumor heterogeneity and patient variability. Current guidelines often overlook individual factors, limiting treatment precision. We developed an integrated framework combining radiomics, deep learning, and large language model (LLM)-based decision agents to generate personalized HCC treatment recommendations. A modified GhostNet incorporating dilated convolutions, channel and spatial attention mechanism (CBAM), and residual channel attention (RCA) modules was trained on MRI to predict pathological markers such as microvascular invasion (MVI), capsule presence, and tumor differentiation. A fusion model integrating radiomics and deep learning enhanced prediction accuracy. Six AI agents processed structured multimodal data and generated individualized treatment strategies, which were evaluated by hepatobiliary surgeons. The fusion model significantly improved prediction accuracy, with MVI and capsule presence reaching 0.8902 and 0.8765, respectively. DeepSeek-R1 achieved the highest clinical relevance score, followed by GPT-4 and Med-PaLM 2. This framework demonstrates the feasibility of AI-assisted, patient-specific HCC decision-making, offering a promising direction for precision oncology.

安装插件收集

The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies.

人工智能代理虚拟实验室设计新型SARS-CoV-2纳米抗体

pubmed.ncbi.nlm.nih.gov-Kyle Swanson, Wesley Wu, Nash L Bulaong 等, 2025-Nature1区 TopIF 48.5

Science frequently benefits from teams of interdisciplinary researchers

安装插件收集

MRAgent: an LLM-based automated agent for causal knowledge discovery in disease via Mendelian randomization.

MRAgent：基于大型语言模型的疾病因果知识发现自动化代理

pubmed.ncbi.nlm.nih.gov-Wei Xu, Gang Luo, Weiyu Meng 等, 2025-Briefings in bioinformatics2区IF 7.7

Understanding causality in medical research is essential for developing effective interventions and diagnostic tools. Mendelian Randomization (MR) is a pivotal method for inferring causality through genetic data. However, MR analysis often requires pre-identification of exposure-outcome pairs from clinical experience or literature, which can be challenging to obtain. This poses difficulties for clinicians investigating causal factors of specific diseases. To address this, we introduce MRAgent, an innovative automated agent leveraging Large Language Models (LLMs) to enhance causal knowledge discovery in disease research. MRAgent autonomously scans scientific literature, discovers potential exposure-outcome pairs, and performs MR causal inference using extensive Genome-Wide Association Study data. We conducted both automated and human evaluations to compare different LLMs in operating MRAgent and provided a proof-of-concept case to demonstrate the complete workflow. MRAgent's capability to conduct large-scale causal analyses represents a significant advancement, equipping researchers and clinicians with a robust tool for exploring and validating causal relationships in complex diseases. Our code is public at https://github.com/xuwei1997/MRAgent.

安装插件收集

LLM-based pedagogical agent for ICU simulation instructor training: A quasi-experimental study.

基于大型语言模型的ICU模拟教学员培训教学法代理：一项准实验研究

pubmed.ncbi.nlm.nih.gov-Jingbang Liu, Ting Chen, Shan Li 等, 2026-Nurse education today2区IF 4.2

Intensive Care Unit (ICU) nursing is demanding, requiring advanced clinical decision-making and emergency management skills. Simulation-based instruction is central to ICU nursing education but remains constrained by the cost and time required for scenario authoring, limited faculty capacity for feedback, and slow content updates. Large language models (LLMs)-based pedagogical agents may augment instructor training by supporting rapid scenario generation, formative guidance, and on-demand assistance. However, evidence from real-world ICU instructor training is limited, and the balance between perceived benefits, usability, and objective educational outcomes is unclear. To evaluate the feasibility and learner-perceived impact of integrating an LLM-based pedagogical agent into ICU simulation instructor training. An exploratory quasi-experimental study was conducted with 40 ICU nurses from a tertiary hospital in February 2025. Participants were randomly assigned to an experimental group (n = 20) using the LLM-based AI teaching agent for simulation training, and a comparison group (n = 20) using traditional blended learning. The training effectiveness was assessed using the Chinese version of the Jeffries Simulation Design Scale (SDS), the System Usability Scale (SUS), the Adult Online Learning Self-Efficacy Scale, and a teaching satisfaction questionnaire. Data were analyzed using Wilcoxon rank-sum tests and t-tests. The experimental group outperformed the comparison group in multiple areas. Specifically, in the SDS, the experimental group scored higher in case authenticity (5.00 vs. 4.00, p < 0.001), scenario complexity (5.00 vs. 4.00, p < 0.001), feedback mechanisms (5.00 vs. 4.00, p < 0.001), interactivity (5.00 vs. 4.00, p < 0.001), and teaching objectives (5.00 vs. 4.25, p < 0.001). The experimental group also showed higher self-efficacy in learning ability (16.0 vs. 13.0, p < 0.001) and learning technology (18.0 vs. 16.0, p = 0.045). Satisfaction was high in both groups and demonstrated a pronounced ceiling effect. Embedding an LLM-based pedagogical agent into ICU simulation instructor training was feasible and associated with more favorable learner-perceived simulation design quality and online learning self-efficacy, while usability did not differ from traditional blended learning. Findings are preliminary and hypothesis-generating; future multi-centre, adequately powered randomized controlled trials are warranted to determine efficacy and isolate the LLM component's independent contribution.

安装插件收集

CRISPR-GPT for agentic automation of gene-editing experiments.

CRISPR-GPT：用于基因编辑实验的智能自动化

pubmed.ncbi.nlm.nih.gov-Yuanhao Qu, Kaixuan Huang, Ming Yin 等, 2026-Nature biomedical engineering1区 TopIF 26.6

Performing effective gene-editing experiments requires a deep understanding of both the CRISPR technology and the biological system involved. Meanwhile, despite their versatility and promise, large language models (LLMs) often lack domain-specific knowledge and struggle to accurately solve biological design problems. We present CRISPR-GPT, an LLM agent system to automate and enhance CRISPR-based gene-editing design and data analysis. CRISPR-GPT leverages the reasoning capabilities of LLMs for complex task decomposition, decision-making and interactive human-artificial intelligence (AI) collaboration. This system incorporates domain expertise, retrieval techniques, external tools and a specialized LLM fine tuned with open-forum discussions among scientists. CRISPR-GPT assists users in selecting CRISPR systems, experiment planning, designing guide RNAs, choosing delivery methods, drafting protocols, designing assays and analysing data. We showcase the potential of CRISPR-GPT by knocking out four genes with CRISPR-Cas12a in a human lung adenocarcinoma cell line and epigenetically activating two genes using CRISPR-dCas9 in a human melanoma cell line. CRISPR-GPT enables fully AI-guided gene-editing experiment design and analysis across different modalities, validating its effectiveness as an AI co-pilot in genome engineering.

安装插件收集

LLM-based multi-agent system for neuro-ophthalmic diagnosis and personalized treatment planning.

基于大型语言模型的多智能体系统用于神经眼科诊断和个性化治疗方案规划

pubmed.ncbi.nlm.nih.gov-Wenmiao Wang, 2025-Frontiers in neuroscience3区IF 3.2

Ophthalmic findings can non-invasively reflect nervous-system status. We present an LLM-based multi-agent framework that preserves diagnostic uncertainty to support neuro-ophthalmic screening and referral. Heterogeneous inputs (clinical text/PDFs and optional fundus/OCT images) are normalized by an Information Collection Agent. A Diagnosis Agent ensembles multiple LLMs and, when available, a CNN image branch; outputs are aggregated with an uncertainty-aware fusion. Across a curated ophthalmic corpus, the multi-agent framework improves robustness over single-model baselines and produces multi-candidate distributions suitable for downstream triage and monitoring. Uncertainty-aware, multi-candidate predictions align with clinical decision-making under ambiguity and suggest future work on calibration and knowledge-layer fusion.

安装插件收集

Auto-scaling LLM-based multi-agent systems through dynamic integration of agents.

基于大型语言模型的多元智能体系统通过智能体动态集成实现自动扩展

pubmed.ncbi.nlm.nih.gov-Ravindu Perera, Anuradha Basnayake, Manjusri Wickramasinghe, 2025-Frontiers in artificial intelligence4区IF 4.7

Large Language Model-based Multi-Agent Systems (LLM-based MASs) represent a groundbreaking paradigm where diverse LLM-based agents collaborate, leveraging their unique capabilities to achieve shared objectives. Although LLM-based MASs outperform individual agents, their current architectures are limited by predefined, fixed, and static agent designs, restricting adaptability and scalability in dynamic environments. To address these limitations, this study proposes two novel approaches: Initial Automatic Agent Generation (IAAG) and Dynamic Real-Time Agent Generation (DRTAG). These approaches enable the automatic creation and seamless integration of new agents into MASs, driven by evolving conversational and task-specific contexts, thereby reducing the need for human intervention. Our method leverages advanced prompt engineering techniques such as persona pattern prompting, chain prompting, and few-shot prompting to generate new agents through existing LLM agents. Additionally, several evaluation metrics were adapted to score and rank LLM-generated texts. Experimental results demonstrate that the DRTAG approach significantly improves system adaptability and task performance compared to static MAS architectures. The IAAG framework also enhances initial system flexibility, supporting the creation of contextually relevant agents. These findings highlight the potential of dynamic LLM-based MASs to overcome the limitations of static architectures to address complex real-world challenges, paving the way for innovative applications across diverse domains.

安装插件收集

Knowledge-Infused LLM-Powered Conversational Health Agent: A Case Study for Diabetes Patients.

基于知识注入的大型语言模型驱动的对话式健康助手：针对糖尿病患者的案例研究

pubmed.ncbi.nlm.nih.gov-Mahyar Abbasian, Zhongqi Yang, Elahe Khatibi 等, 2024-Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference

Effective diabetes management is crucial for maintaining health in diabetic patients. Large Language Models (LLMs) have opened new avenues for diabetes management, facilitating their efficacy. However, current LLM-based approaches are limited by their dependence on general sources and lack of integration with domain-specific knowledge, leading to inaccurate responses. In this paper, we propose a knowledge-infused LLM-powered conversational health agent (CHA) for diabetic patients. We customize and leverage the open-source openCHA framework, enhancing our CHA with external knowledge and analytical capabilities. This integration involves two key components: 1) incorporating the American Diabetes Association dietary guidelines and the Nutritionix information and 2) deploying analytical tools that enable nutritional intake calculation and comparison with the guidelines. We compare the proposed CHA with GPT4. Our evaluation includes 100 diabetes-related questions on daily meal choices and assessing the potential risks associated with the suggested diet. Our findings show that the proposed agent demonstrates superior performance in generating responses to manage essential nutrients.

安装插件收集

Task Offloading with LLM-Enhanced Multi-Agent Reinforcement Learning in UAV-Assisted Edge Computing.

基于LLM增强的多智能体强化学习在无人机辅助边缘计算中的任务卸载

pubmed.ncbi.nlm.nih.gov-Feifan Zhu, Fei Huang, Yantao Yu 等, 2024-Sensors (Basel, Switzerland)3区IF 3.5

Unmanned aerial vehicles (UAVs) furnished with computational servers enable user equipment (UE) to offload complex computational tasks, thereby addressing the limitations of edge computing in remote or resource-constrained environments. The application of value decomposition algorithms for UAV trajectory planning has drawn considerable research attention. However, existing value decomposition algorithms commonly encounter obstacles in effectively associating local observations with the global state of UAV clusters, which hinders their task-solving capabilities and gives rise to reduced task completion rates and prolonged convergence times. To address these challenges, this paper introduces an innovative multi-agent deep learning framework that conceptualizes multi-UAV trajectory optimization as a decentralized partially observable Markov decision process (Dec-POMDP). This framework integrates the QTRAN algorithm with a large language model (LLM) for efficient region decomposition and employs graph convolutional networks (GCNs) combined with self-attention mechanisms to adeptly manage inter-subregion relationships. The simulation results demonstrate that the proposed method significantly outperforms existing deep reinforcement learning methods, with improvements in convergence speed and task completion rate exceeding 10%. Overall, this framework significantly advances UAV trajectory optimization and enhances the performance of multi-agent systems within UAV-assisted edge computing environments.

安装插件收集

Personalizing prostate cancer education for patients using an EHR-Integrated LLM agent.

利用电子健康记录集成的大型语言模型（LLM）代理个性化前列腺癌患者教育

pubmed.ncbi.nlm.nih.gov-Yuexing Hao, Jason Holmes, Mark R Waddle 等, 2025-NPJ digital medicine1区 TopIF 15.1

Cancer patients often lack timely education and personalized support due to clinician workload. This quality improvement study develops and evaluates a Large Language Model (LLM) agent, MedEduChat, which is integrated with the clinic's electronic health records (EHR) and designed to enhance prostate cancer patient education. Fifteen non-metastatic prostate cancer patients and three clinicians recruited from the Mayo Clinic interacted with the agent between May 2024 and April 2025. Findings showed that MedEduChat has a high usability score (UMUX = 83.7/100) and improves patients' health confidence (Health Confidence Score rose from 9.9 to 13.9). Clinicians evaluated the patient-chat interaction history and rated MedEduChat as highly correct (2.9/3), complete (2.7/3), and safe (2.7/3), with moderate personalization (2.3/3). This study highlights the potential of LLM agents to improve patient engagement and health education.

安装插件收集

AgentLens: Visual Analysis for Agent Behaviors in LLM-Based Autonomous Systems.

AgentLens：基于LLM的自主系统中代理行为的可视化分析

pubmed.ncbi.nlm.nih.gov-Jiaying Lu, Bo Pan, Jieyi Chen 等, 2025-IEEE transactions on visualization and computer graphics2区 TopIF 6.5

Recently, Large Language Model based Autonomous System (LLMAS) has gained great popularity for its potential to simulate complicated behaviors of human societies. One of its main challenges is to present and analyze the dynamic events evolution of LLMAS. In this work, we present a visualization approach to explore the detailed statuses and agents' behavior within LLMAS. Our approach outlines a general pipeline that organizes raw execution events from LLMAS into a structured behavior model. We leverage a behavior summarization algorithm to create a hierarchical summary of these behaviors, arranged according to their sequence over time. Additionally, we design a cause trace method to mine the causal relationship between agent behaviors. We then develop AgentLens, a visual analysis system that leverages a hierarchical temporal visualization for illustrating the evolution of LLMAS, and supports users to interactively investigate details and causes of agents' behaviors. Two usage scenarios and a user study demonstrate the effectiveness and usability of our AgentLens.

安装插件收集

An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation.

基于大型语言模型的生理数据分析代理：基于PPG的心率估计案例研究

pubmed.ncbi.nlm.nih.gov-Mohammad Feli, Iman Azimi, Pasi Liljeberg 等, 2025-Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference

Large language models (LLMs) are revolutionizing healthcare by improving diagnosis, patient care, and decision support through interactive communication. More recently, they have been applied to analyzing physiological time-series like wearable data for health insight extraction. Existing methods embed raw numerical sequences directly into prompts, which exceeds token limits and increases computational costs. Additionally, some studies integrated features extracted from time-series in textual prompts or applied multimodal approaches. However, these methods often produce generic and unreliable outputs due to LLMs' limited analytical rigor and inefficiency in interpreting continuous waveforms. In this paper, we develop an LLM-powered agent for physiological time-series analysis aimed to bridge the gap in integrating LLMs with well-established analytical tools. Built on the OpenCHA, an open-source LLM-powered framework, our agent powered by OpenAI's GPT-3.5-turbo model features an orchestrator that integrates user interaction, data sources, and analytical tools to generate accurate health insights. To evaluate its effectiveness, we implement a case study on heart rate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of PPG and Electrocardiogram (ECG) recordings in a remote health monitoring study. The agent's performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o, with ECG serving as the gold standard for HR estimation. Results demonstrate that our agent significantly outperforms benchmark models by achieving lower error rates and more reliable HR estimations. The agent implementation is publicly available on GitHub

安装插件收集

MedAgentBench v2: Improving Medical LLM Agent Design.

MedAgentBench v2：提升医疗LLM代理设计

pubmed.ncbi.nlm.nih.gov-Eric Chen, Sam Postelnik, Kameron Black 等, 2026-Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

MedAgentBench is the first benchmark for evaluating LLM agents on clinical tasks in a FHIR-compliant EHR. In this paper, we present significant prompt engineering and tool design improvements over the original agent implementation and introduce a memory component that enables the agent to learn from prior failures. We added new tools for the agent to properly format its output for tasks, interact with an EHR without constructing explicit HTTP requests, which were prone to syntax errors, and make math calculations. We also wrote a new system prompt that asked the agent to outline its plan before making any tool calls and think step by step using chain of thought reasoning, and provided few shot examples of good vs. bad outputs. Using GPT-4.1 as the base model, our agent achieved a success rate of 91.0% without memory and 98.0% with memory. A surprising consequence is that the agent performed better on a different task that had no associated memory entry, possibly demonstrating that LLMs can adapt to the style of tasks presented by users. To contribute to the benchmark and evaluate the generalization of our agent, we developed 300 new multi-step clinically-driven tasks in collaboration with a physician. Lastly, we show the current limitations of these benchmarks and highlight the necessary next steps and challenges for the responsible deployment of AI agents in real-world healthcare settings. We hope that this paper leads to further development of EHR agents and benchmarks.

安装插件收集

Cognitive Agents in Urban Mobility: Integrating LLM Reasoning into Multi-Agent Simulations.

城市交通中的认知代理：将大型语言模型推理集成到多代理仿真中

pubmed.ncbi.nlm.nih.gov-Christian Calderón, Pasqual Martí, Jaume Jordán 等, 2025-Sensors (Basel, Switzerland)3区IF 3.5

Urban mobility systems face escalating challenges associated with sustainability, equity, and resilience, further compounded by environmental pressures. Traditional agent-based models (ABMs) often fail to capture cognitively rich, adaptive behaviors, limiting their ability to simulate realistic user responses to disruptions. In this work, we propose a cognitive agent architecture based on Large Language Models (LLMs), featuring multi-horizon memory-driven planning, reflection, and adaptation. Integrated into the SimFleet agent-based simulator with realistic sociodemographic profiles, the agents dynamically generate, adjust, and reflect upon travel plans across a 20-day simulation involving over 320 individuals. Experimental results reveal emergent adaptation patterns under both stable and disrupted transport conditions, and an ablation study under severe service disruption quantifies the contributions of short-term and long-term memory modules to memory-driven reasoning, demonstrating the potential of LLM-driven agents to enhance the realism, flexibility, and interpretability of urban mobility simulations.

安装插件收集

DanceAgent: Dance Movement Refinement With LLM Agent.

DanceAgent：基于LLM代理的舞蹈动作精炼

pubmed.ncbi.nlm.nih.gov-Cheng Shang, Xingyu Chen, Liang An 等, 2026-IEEE transactions on visualization and computer graphics2区 TopIF 6.5

Recent research on motion generation and text-to-motion synthesis focus on coarse-grained motion descriptions, neglecting fine-grained motion details and motion quality refinement. Additionally, current text-to-motion models, such as MotionGPT, lack multi-turn interaction capabilities, relying on single-turn and single-modality transformations, which limit their ability to integrate information from different modalities across interaction stages. These gaps leave critical questions, such as "How well is the motion performed" and "How can it be refined?" largely unaddressed. To address these issues, first, we introduce two fine-grained dance datasets-one focusing on jazz dance and the other on folk dance, which we have independently collected. Second, considering that dance motions are inherently complex and consist of long sequential actions, we introduce both global and local optimization during the motion encoding phase and employ Hidden Markov Model (HMM) temporal modeling to capture differential features between correct and incorrect movements, thereby optimizing the training process. Finally, we propose a multi-turn historical dialogue framework that enables three stages generation-motion assess, text instructions, and motion refinement-for input videos. This framework assists dance beginners by providing feedback on their movements, offering textual instructions, and delivering motion-based refinement. Experimental results on the jazz dance and folk dance datasets demonstrate that our method surpasses existing approaches in both quantitative and qualitative metrics, establishing a new benchmark for motion-text generation in the field of dance training.

安装插件收集

Spike sorting AI agent.

尖峰排序人工智能代理

pubmed.ncbi.nlm.nih.gov-Zuwan Lin, Arnau Marin-Llobet, Jongmin Baek 等, 2025-bioRxiv : the preprint server for biology

Spike sorting is a fundamental process for decoding neural activity, involving preprocessing, spike detection, feature extraction, clustering, and validation. However, conventional spike sorting methods are highly fragmented, labor-intensive, and heavily reliant on expert manual curation, limiting their scalability and reproducibility. This challenge has become more pressing with advances in neural recording technology, such as high-density Neuropixels for large-scale neural recording or flexible electrodes for long-term stable recording over months to years. The volume and complexity of these datasets make manual curation infeasible, requiring an automated and scalable solution. Here, we introduce SpikeAgent, a multimodal large language model (LLM)-based AI agent that automates and standardizes the entire spike sorting pipeline. Unlike traditional approaches, SpikeAgent integrates multiple LLM backends, coding functions, and established algorithms, autonomously performing spike sorting with reasoning-based decision-making and real-time interaction with intermediate results. It generates interpretable reports, providing transparent justifications for each sorting decision, enhancing transparency and reliability. We benchmarked SpikeAgent against human experts across various neural recording technology, demonstrating its versatility and ability to achieve curation consistency that are equal to, or even higher than human experts. It also drastically reduces the expertise barrier and accelerates the curation and validation time by orders of magnitude. Moreover, it enables automated interpretability of the neural spiking data, which cannot be achieved by any conventional methods. SpikeAgent presents a paradigm shift in processing signals for neuroscience and brain-computer interfaces, while laying the ground for AI agent-augmented science across various domains.

安装插件收集

Accelerating earth science discovery via multi-agent LLM systems.

通过多智能体大型语言模型系统加速地球科学发现

pubmed.ncbi.nlm.nih.gov-Dmitrii Pantiukhin, Boris Shapkin, Ivan Kuznetsov 等, 2025-Frontiers in artificial intelligence4区IF 4.7

This Perspective explores the transformative potential of multi-agent systems (MAS) powered by Large Language Models (LLMs) in the geosciences. Users of geoscientific data repositories face challenges due to the complexity and diversity of data formats, inconsistent metadata practices, and a considerable number of unprocessed datasets. MAS possesses transformative potential for improving scientists' interaction with geoscientific data by enabling intelligent data processing, natural language interfaces, and collaborative problem-solving capabilities. We illustrate this approach with "PANGAEA GPT," a specialized MAS pipeline integrated with the diverse PANGAEA database for Earth & Environmental Science, demonstrating how MAS-driven workflows can effectively manage complex datasets and accelerate scientific discovery. We discuss how MAS can address current data challenges in geosciences, highlight advancements in other scientific fields, and propose future directions for integrating MAS into geoscientific data processing pipelines. In this Perspective, we show how MAS can fundamentally improve data accessibility, promote cross-disciplinary collaboration, and accelerate geoscientific discoveries.

安装插件收集

Talk2Biomodels: AI agent-based open-source LLM initiative for kinetic biological models.

Talk2Biomodels：基于AI代理的开源LLM项目，用于动力学生物模型。

pubmed.ncbi.nlm.nih.gov-Lilija Wehling, Gurdeep Singh, Ahmad Wisnu Mulyadi 等, 2025-BMC bioinformatics4区IF 3.3

Quantitative kinetic models of biological regulatory processes play an important role in understanding disease mechanisms. However, their simulation and analysis require specialized domain expertise. In this study, we present Talk2Biomodels (T2B), an open-source, user-friendly, large language model-based agentic AI platform designed to facilitate access to computational models of biological systems and promote the FAIRification (Findability, Accessibility, Interoperability, and Reusability) principles in systems biology. T2B allows users to interact with and analyse mathematical models of biological systems through conversations in natural language, thereby lowering the barrier to entry for model interpretation and hypothesis-driven exploration. The platform natively supports models encoded in the Systems Biology Markup Language, a widely adopted standard in the computational biology community. T2B is integrated with the BioModels database ( https://www.ebi.ac.uk/biomodels/ ), enabling retrieval, simulation, and analysis of curated systems biology models. We illustrate the platform's capabilities through use cases in precision medicine, infectious disease epidemiology, and the study of emergent network-level properties in cellular systems - demonstrating how both computational experts and domain scientists without formal modelling training can derive actionable insights from complex biological models. Talk2Biomodels is available at https://github.com/VirtualPatientEngine/AIAgents4Pharma . Detailed documentation and use cases are available at https://virtualpatientengine.github.io/AIAgents4Pharma/talk2biomodels/intro/ . In summary, T2B lowers the barrier for non-experts to engage with and extract insights from computational models of biological systems, while simultaneously providing experts with a streamlined interface for analysing models and overall contributes to the FAIRification of models.

安装插件收集

ProactiveVA: Proactive Visual Analytics with LLM-Based UI Agent.

基于LLM的UI代理的主动视觉分析：ProactiveVA

pubmed.ncbi.nlm.nih.gov-Yuheng Zhao, Xueli Shu, Liwen Fan 等, 2026-IEEE transactions on visualization and computer graphics2区 TopIF 6.5

Visual analytics (VA) is typically applied to complex data, thus requiring complex tools. While visual analytics empowers analysts in data analysis, analysts may get lost in the complexity occasionally. This highlights the need for intelligent assistance mechanisms. However, even the latest LLM-assisted VA systems only provide help when explicitly requested by the user, making them insufficiently intelligent to offer suggestions when analysts need them the most. We propose a ProactiveVA framework in which LLM-powered UI agent monitors user interactions and delivers context-aware assistance proactively. To design effective proactive assistance, we first conducted a formative study analyzing help-seeking behaviors in user interaction logs, identifying when users need proactive help, what assistance they require, and how the agent should intervene. Based on this analysis, we distilled key design requirements in terms of intent recognition, solution generation, interpretability and controllability. Guided by these requirements, we develop a three-stage UI agent pipeline including perception, reasoning, and acting. The agent autonomously perceives users' needs from VA interaction logs, providing tailored suggestions and intuitive guidance through interactive exploration of the system. We implemented the framework in two representative types of VA systems, demonstrating its generalizability, and evaluated the effectiveness through an algorithm evaluation, case and expert study and a user study. We also discuss current design trade-offs of proactive VA and areas for further exploration.

安装插件收集

From prompt engineering to agent engineering: expanding the AI toolbox with autonomous agentic AI collaborators for biomedical discovery.

从提示工程到代理工程：利用自主代理型人工智能协作者扩展生物医学发现的AI工具箱

pubmed.ncbi.nlm.nih.gov-Jason H Moore, Nicholas P Tatonetti, 2025-BioData mining3区IF 6.1

No abstract

安装插件收集

ProteinMCP: An agentic AI framework for autonomous protein engineering.

ProteinMCP：一种用于自主蛋白质工程的代理人工智能框架

pubmed.ncbi.nlm.nih.gov-Xiaopeng Xu, Chenjie Feng, Chao Zha 等, 2026-Protein science : a publication of the Protein Society3区IF 5.2

Computational protein design is often constrained by slow, complex, inaccessible, and highly sophisticated and expert-dependent workflows that hinder its transferrability and generalization power for broader applications. We present ProteinMCP, an agentic AI framework designed to accelerate and democratize protein engineering. ProteinMCP automates end-to-end scientific tasks, delivering dramatic gains in efficiency; for instance, a comprehensive protein fitness modeling workflow was completed in just 11 min. This performance is achieved by an AI agent that intelligently orchestrates a unified ecosystem of 38 specialized tools, made accessible through a model-context-protocol (MCP). A cornerstone of the framework is an automated pipeline that converts existing software into MCP-compliant servers, ensuring the platform is both powerful and perpetually extensible. We further demonstrate its capabilities through the successful autonomous design and selection of high-affinity de novo binders and therapeutic nanobodies. By removing technical barriers, ProteinMCP has the potential to shorten the design-build-test cycle and make advanced computational protein design accessible to the broader scientific community.

安装插件收集

Agentic AI-driven autonomous decision support system for smart agriculture.

基于代理的人工智能驱动的智能农业自主决策支持系统

pubmed.ncbi.nlm.nih.gov-N L Padma Swati, S Vanshita Gupta, Navya Sri Duddela 等, 2026-Scientific reports3区IF 3.9

Crop productivity is heavily impacted by inefficient fertilizer usage, improper fertilizer handling, and inappropriately chosen crops. To address these issues, this research work proposes an AI-powered Smart Agriculture Prediction System utilizing intelligent agents for carrying out soil classification, estimating soil parameters, crop suggestion, and fertilizer suggestion. The soil classifier module is trained with 1,563 images of black soils, red soils, clay soils, and alluvial soils using MobileNet-V2, ResNet, and Custom CNN, with Custom CNN resulting in a higher accuracy of 92.88%, which performed better in classifying soils based on textures. A soil parameter estimation agent utilizes regression models for estimating pH and NPK content of soils using images. For crop suggestion, a crop dataset with 2,200 samples with parameters such as N, P, K, T, H, pH, and rainfall is used, in which Random Forest model performed better with an accuracy of 92.4% when compared with CNN and DNN models. For fertilizer suggestion, XGBoost performed better with an accuracy of 94.7% in estimating fertilizers such as Urea, DAP, NPK, Potash, and Compost. Real-time climatic parameters are obtained using API in order to make dynamic updates for climatic parameters. Real-time weather data obtained through APIs enables dynamic updates of climatic parameters, while Explainable AI techniques such as SHAP and LIME enhance model transparency and user trust. Additionally, the system incorporates an interactive agent-based framework that processes user inputs, including location, soil images, and nutrient levels, to generate adaptive outputs such as weather alerts, yield potential, and personalized recommendations. The experimental results demonstrate that the proposed system effectively integrates deep learning, ensemble learning, and explainability to deliver a scalable, efficient, and sustainable decision-support solution for precision agriculture, promoting optimized resource utilization and environmental stewardship.

安装插件收集

Agentic Lab: An Agentic-physical AI system for cell and organoid experimentation and manufacturing.

代理实验室：一种用于细胞和类器官实验与制造的代理-物理人工智能系统

pubmed.ncbi.nlm.nih.gov-Wenbo Wang, Simran Swain, Jaeyong Lee 等, 2025-bioRxiv : the preprint server for biology

Reproducibility in biological research and manufacturing remains constrained by the complexity of multi-step protocols, fragmented data-analysis pipelines, and the intrinsic variability of experimental execution. Here, we present Agentic Lab, an agentic-physical AI platform that unifies large language model and vision language model (LLMs/VLMs)-driven reasoning with real-world laboratory operations. Agentic Lab uses multi-agent orchestration architecture, comprising of specialized subagents for knowledge retrieval, protocol design, multimodal data analysis, and training-free segmentation and representation learning for intrinsically explainable single-cell and organoid phenotyping. These agents operate under the orchestration of a virtual principal investigator MolAgent that is linked to an augmented reality (AR)-based physical AI interface, which can bridge digital reasoning with human physical execution. Agentic Lab perceives real-world experimental activities, provides context-aware instructions, identifying procedural errors in real time for humans to correct, and continuously evolves with its long-term memory database expanding through the accumulation of experimental data logs from human scientists. This interaction allows scientists and AI agents to collaborate and co-evolve dynamically, closing the loop between planning, action, and analysis in the traditional cell and organoid research lifecycle. We demonstrate Agentic Lab in organoid differentiation from human pluripotent stem cells, where it autonomously generates protocols, monitors culture procedures, and identifies subtle morphological heterogeneity linked to growth conditions. The system interprets these phenotypes, grounds them in literature, and proposes targeted instructions for improving differentiation efficiency. By combining multi-agent reasoning with physical laboratory awareness, Agentic Lab transforms experimentation and biomanufacturing from a static workflow into an adaptive, feedback-driven, bidirectional process that integrates agentic AI into the research lifecycle. This framework establishes a foundation for intelligent laboratories that integrate design, execution, and interpretation within a unified agentic-physical system.

安装插件收集

An autonomous AI agent for universal behavior analysis.

一种用于通用行为分析的自主人工智能代理

pubmed.ncbi.nlm.nih.gov-Almir Aljović, Zuwan Lin, Wenbo Wang 等, 2025-bioRxiv : the preprint server for biology

Behavior analysis across species represents a fundamental challenge in neuroscience, psychology, and ethology, typically requiring extensive expert knowledge and labor-intensive processes that limit research scalability and accessibility. We introduce BehaveAgent, an autonomous multimodal AI agent designed to automate behavior analysis from video input without retraining or manual intervention. Unlike conventional methods that require manual behavior annotation, video segmentation, task-specific model training, BehaveAgent leverages the reasoning capabilities of multimodal large language models (LLM) to generalize across novel behavioral domains without need for additional training. It integrates LLMs, vision-language models (VLMs), and large-scale visual grounding modules, orchestrated through a multimodal context memory and goal-directed attention mechanism, to enable robust zero-shot visual reasoning across species and experimental paradigms, including plants, insects, rodents, primates, and humans. Upon receiving a video input, BehaveAgent autonomously identifies the correct analysis strategy and performs end-to-end behavior analysis and interpretation without human supervision. Leveraging vision-language representations, it performs general-purpose tracking, pose estimation and segmentation. We demonstrate BehaveAgent's universal applicability to autonomously (1) identify the behavioral paradigm and develop an action plan specialized for the identified paradigm, (2) identify relevant subjects and objects, (3) track those features, (4) identify behavioral sequences with explicit reasoning, (5) generate and execute code for targeted analysis and (6) generate comprehensive research reports that integrate behavioral findings with relevant scientific literature. Through interpretable agentic reasoning, BehaveAgent makes its internal decision-making process transparent, clarifying why particular features are tracked or behaviors inferred. By reducing the time and expertise required for behavior analysis, BehaveAgent introduces a scalable, generalizable, and explainable paradigm for advancing biological and behavioral research.

安装插件收集

Spatial transcriptomics AI agent charts hPSC-pancreas maturation

空间转录组学AI代理绘制hPSC-胰腺成熟图谱

pubmed.ncbi.nlm.nih.gov-Zuwan Lin, Wenbo Wang, Arnau Marin-Llobet 等, 2025-bioRxiv : the preprint server for biology

Spatial transcriptomics has revolutionized our understanding of tissue organization by simultaneously capturing gene expression and spatial localization within intact tissues. However, analyzing these increasingly complex datasets requires specialized expertise across computational biology, statistics, and biological context. To address this challenge, we introduce the Spatial Transcriptomics AI Agent (STAgent), an autonomous multimodal agentic AI that integrates multimodal large language models (LLMs) with specialized computational tools to transform weeks-long analysis tasks into minutes of automated processing. Unlike conventional machine learning approaches that are limited to narrow, predefined tasks, STAgent leverages the emergent capabilities of multimodal LLMs - such as flexible reasoning, contextual understanding, and cross-modal integration - which allow it to adapt to novel data, execute multi-step analyses, and generate biologically meaningful insights with minimal human input. STAgent enables autonomous deep research through integrated capabilities, including dynamic code generation for complex analytical workflows, visual reasoning for interpreting spatial patterns, real-time retrieval of relevant peer-reviewd scientific literature, and synthesis of comprehensive, actionable reports. We applied STAgent to investigate the

安装插件收集

Detection and diagnosis of diabetic retinopathy in retinal fundus images using agentic AI approaches.

基于代理人工智能方法的视网膜眼底图像糖尿病视网膜病变检测与诊断

pubmed.ncbi.nlm.nih.gov-R Sathya, A Valaramathi, 2025-Scientific reports3区IF 3.9

In today's world, Diabetic Retinopathy (DR) remains a leading cause of vision loss globally, necessitating early detection and accurate diagnosis for timely intervention. Traditional machine learning and deep learning-based approaches, while effective, often suffer from issues such as limited interpretability, static decision-making, and inadequate generalization across diverse patient data. This research introduces an Agentic-AI Driven Framework for Diabetic Retinopathy Analysis (AADR-AI), which leverages intelligent agent-based learning mechanisms to enhance decision-making autonomy, dynamic adaptability, and contextual understanding of retinal fundus images. The novelty lies in incorporating agentic intelligence principles, autonomy, reactivity, and proactivity into DR detection systems, allowing real-time analysis and adaptive feature learning based on patient-specific variations. The proposed AADR-AI framework integrates a multi-agent ensemble of convolutional and transformer-based networks, coordinated through a decision fusion layer for robust classification. Key contributions include improved classification accuracy (up to 96.7%), enhanced model efficiency with reduced computational overhead, and real-time adaptability to varying image qualities and disease progression stages. Extensive experimentation on benchmark datasets demonstrates superior performance compared to existing state-of-the-art methods. This work highlights the transformative potential of agentic AI in medical imaging, paving the way for more autonomous and interpretable clinical decision-support systems.

安装插件收集

Beyond Single Systems: How Multi-Agent AI Is Reshaping Ethics in Radiology.

超越单一系统：多智能体人工智能如何重塑放射学的伦理学

pubmed.ncbi.nlm.nih.gov-Sara Salehi, Yashbir Singh, Parnian Habibi 等, 2025-Bioengineering (Basel, Switzerland)3区IF 3.7

Radiology is undergoing a paradigm shift from traditional single-function AI systems to sophisticated multi-agent networks capable of autonomous reasoning, coordinated decision-making, and adaptive workflow management. These agentic AI systems move beyond simple pattern recognition to encompass complex radiological workflows including image analysis, report generation, clinical communication, and care coordination. While multi-agent radiological AI promises enhanced diagnostic accuracy, improved workflow efficiency, and reduced physician burden, it simultaneously amplifies the long-standing "black box" problem. Traditional explainable AI methods, which are adequate for understanding isolated diagnostic predictions, fail when applied to multi-step reasoning processes involving multiple specialized agents coordinating across imaging interpretation, clinical correlation, and treatment planning. This paper examines how agentic AI systems in radiology create "compound opacity" layers of inscrutability from agent interactions and distributed decision-making processes. We analyze the autonomy-transparency paradox specific to radiological practice, where increasing AI capability directly conflicts with interpretability requirements essential for clinical trust and regulatory oversight. Through examination of emerging multi-agent radiological workflows, we propose frameworks for responsible implementation that preserve both diagnostic innovation and the fundamental principles of medical transparency and accountability.

安装插件收集

MolAgent: Biomolecular Property Estimation in the Agentic Era.

MolAgent：代理时代中的生物分子性质估算

pubmed.ncbi.nlm.nih.gov-Jose Carlos Gómez-Tamayo, Joris Tavernier, Roy Aerts 等, 2025-Journal of chemical information and modeling2区 TopIF 5.3

The advent of agentic AI systems is leading to significant transformations across scientific and technological domains. Advances in large language models (LLMs), reasoning capabilities, and integration with external tools have ushered in a new era where agentic AI systems can autonomously perform computational tasks that were traditionally carried out by humans. Computer-aided drug design (CADD)─a multifaceted process encompassing complex, interdependent tasks─stands to benefit profoundly from these advancements. However, one of the key challenges in enabling agentic systems to autonomously take over tasks in CADD is constructing models for property estimation that match the quality and reliability of those developed by human experts. As this is not currently straightforward, this capability represents a major bottleneck for fully realizing the potential of autonomous pipelines in drug discovery. We present here MolAgent, a system-agnostic agentic AI framework designed for high-fidelity modeling of molecular properties in early-stage drug discovery. MolAgent autonomously implements expert-level pipelines for both classification and regression, empowering agentic systems to efficiently construct and deploy models. With integrated automated feature engineering, robust model selection, advanced ensemble methodologies, and comprehensive validation frameworks, MolAgent ensures optimal accuracy and model robustness. The platform seamlessly accepts 2D and 3D structural data for ligands and receptors and harmonizes traditional molecular descriptors with advanced deep learning features extracted from pretrained 2D and 3D encoders. Ultimately, the platform's fully automated, end-to-end workflow is designed for seamless agentic execution. Adherence to the Model Context Protocol (MCP) guarantees interoperability with diverse agentic AI infrastructures, ensuring flexible integration into complex, future discovery pipelines.

安装插件收集

Explainable Agentic AI for Big Data-Driven Evaluation and Visual Analytics of Digital Literacy in Higher Vocational Teacher Education.

可解释的代理人工智能在高职教师教育中大数据驱动的数字素养评价与可视化分析

pubmed.ncbi.nlm.nih.gov-Wen Shao, 2026-Big data4区IF 2.0

Large-scale, diverse data produced by higher vocational teacher colleges' digital transformation challenges traditional methods for evaluating digital literacy. The reliability of current analytics and black-box artificial intelligence (AI) models for educational decision-making is limited by their frequent lack of autonomy and transparency. In order to assess digital literacy at higher vocational teacher colleges using big data and visual analytics, this study suggests an Explainable Agentic AI framework. In order to facilitate adaptive data exploration, competency evaluation, and insight generation across multimodal educational data, such as learning behavior logs, assessment records, and digital engagement indicators, the framework combines autonomous agentic intelligence with explainable AI (XAI). While XAI methods offer clear explanations of literacy aspects, decision rationale, and uncertainty, agentic components dynamically handle data processing, feature reasoning, and model selection. Effective human-AI collaboration is made possible by an interactive visual analytics layer that allows for layered investigation of learner patterns, temporal dynamics, and cohort heterogeneity. When compared with traditional machine learning techniques, experimental results on large-scale datasets from higher vocational teacher colleges show better assessment accuracy, robustness, and interpretability. This work demonstrates the promise of agentic AI for explainable big data exploration and promotes reliable instructional intelligence by combining agentic autonomy, explainability, and visual analytics within a scalable big data paradigm.

安装插件收集

From visual question answering to intelligent AI agents in ophthalmology.

从视觉问答到眼科领域的智能AI代理

pubmed.ncbi.nlm.nih.gov-Xiaolan Chen, Ruoyu Chen, Pusheng Xu 等, 2025-The British journal of ophthalmology2区IF 3.5

Ophthalmic practice involves the integration of diverse clinical data and interactive decision-making, posing challenges for traditional artificial intelligence (AI) systems. Visual question answering (VQA) addresses this by combining computer vision and natural language processing to interpret medical images through user-driven queries. Evolving from VQA, multimodal AI agents enable continuous dialogue, tool use and context-aware clinical decision support. This review explores recent developments in ophthalmic conversational AI, spanning theoretical advances and practical implementations. We highlight the transformative role of large language models (LLMs) in improving reasoning, adaptability and task execution. However, key obstacles remain, including limited multimodal datasets, absence of standardised evaluation protocols, and challenges in clinical integration. We outline these limitations and propose future research directions to support the development of robust, LLM-driven AI systems. Realising their full potential will depend on close collaboration between AI researchers and the ophthalmic community.

安装插件收集

CACTUS: Chemistry Agent Connecting Tool Usage to Science.

CACTUS：连接工具使用与科学的化学代理

pubmed.ncbi.nlm.nih.gov-Andrew D McNaughton, Gautham Krishna Sankar Ramalaxmi, Agustin Kruel 等, 2024-ACS omega3区IF 4.3

Large language models (LLMs) have shown remarkable potential in various domains but often lack the ability to access and reason over domain-specific knowledge and tools. In this article, we introduce Chemistry Agent Connecting Tool-Usage to Science (CACTUS), an LLM-based agent that integrates existing cheminformatics tools to enable accurate and advanced reasoning and problem-solving in chemistry and molecular discovery. We evaluate the performance of CACTUS using a diverse set of open-source LLMs, including Gemma-7b, Falcon-7b, MPT-7b, Llama3-8b, and Mistral-7b, on a benchmark of thousands of chemistry questions. Our results demonstrate that CACTUS significantly outperforms baseline LLMs, with the Gemma-7b, Mistral-7b, and Llama3-8b models achieving the highest accuracy regardless of the prompting strategy used. Moreover, we explore the impact of domain-specific prompting and hardware configurations on model performance, highlighting the importance of prompt engineering and the potential for deploying smaller models on consumer-grade hardware without a significant loss in accuracy. By combining the cognitive capabilities of open-source LLMs with widely used domain-specific tools provided by RDKit, CACTUS can assist researchers in tasks such as molecular property prediction, similarity searching, and drug-likeness assessment.

安装插件收集

Hierarchical agent reflection for aligning LLM reasoning with clinical diagnostic processes.

分层代理反思以实现大型语言模型推理与临床诊断过程的对齐

pubmed.ncbi.nlm.nih.gov-Xinda Wang, Xiaotong Li, Deng Zhao 等, 2026-Health information science and systems2区IF 3.4

Medical diagnosis is a complex, iterative process that relies heavily on clinicians' reasoning and judgment. Traditional models, while able to provide consistent diagnostic results, fail to replicate the reasoning process of clinicians, making their outputs difficult to understand and justify. In this paper, we address this limitation by first generating clinical notes that capture the clinician's diagnostic reasoning. These notes are then used to train a large language model, allowing it to mimic the step-by-step reasoning employed by clinicians during diagnosis. Our method introduces a hierarchical agent reflection mechanism to generate clinical notes, which deconstructs the diagnostic process into key stages, each handled by specialized agents. This structured approach not only improves the accuracy and reliability of the generated clinical notes but also ensures that the model's reasoning aligns with human clinical practice. Experimental results show that models trained on this data outperform both general-purpose large language models and domain-specific medical models in diagnostic tasks. The proposed method enhances diagnostic transparency and interpretability, offering a valuable tool for AI-assisted clinical decision-making.

安装插件收集

Integrating visual large language model and reasoning chain for driver behavior analysis and risk assessment.

融合视觉大型语言模型与推理链的驾驶员行为分析与风险评估

pubmed.ncbi.nlm.nih.gov-Kunpeng Zhang, Shipu Wang, Ning Jia 等, 2024-Accident; analysis and prevention1区 TopIF 6.2

Driver behavior is a critical factor in driving safety, making the development of sophisticated distraction classification methods essential. Our study presents a Distracted Driving Classification (DDC) approach utilizing a visual Large Language Model (LLM), named the Distracted Driving Language Model (DDLM). The DDLM introduces whole-body human pose estimation to isolate and analyze key postural features-head, right hand, and left hand-for precise behavior classification and better interpretability. Recognizing the inherent limitations of LLMs, particularly their lack of logical reasoning abilities, we have integrated a reasoning chain framework within the DDLM, allowing it to generate clear, reasoned explanations for its assessments. Tailored specifically with relevant data, the DDLM demonstrates enhanced performance, providing detailed, context-aware evaluations of driver behaviors and corresponding risk levels. Notably outperforming standard models in both zero-shot and few-shot learning scenarios, as evidenced by tests on the 100-Driver dataset, the DDLM stands out as an advanced tool that promises significant contributions to driving safety by accurately detecting and analyzing driving distractions.

安装插件收集

A study on classification based concurrent API calls and optimal model combination for tool augmented LLMs for AI agent.

基于并发API调用和工具增强LLM模型组合优化的AI代理分类研究

pubmed.ncbi.nlm.nih.gov-HeounMo Go, SangHyun Park, 2025-Scientific reports3区IF 3.9

AI Agents have evolved to not only recommend content but also facilitate information retrieval and task processing. Developing AI Agents using general-purpose LLM models necessitates integration with external tools, leading to tool-augmented LLM studies. Despite the availability of multiple tools for the same purpose, existing research has not fully leveraged this diversity. This study categorizes external tools by type and proposes a method to simultaneously call tools of the same type. This allows for the utilization of diverse external tools in LLM inference, thereby achieving a higher accuracy compared to when only a single tool for one task is used. Experimental results show an accuracy improvement of 4.4-9.3% over existing studies. Furthermore, when utilizing tool-augmented LLM, a multi-step reasoning approach that divides the process into stages such as planning and tool invocation is widely employed. With the rapid advancement of LLMs, enhanced models continue to emerge. Considering the trade-offs between performance and cost in models, it is crucial to find an optimal combination of models in each stage of tool augmented LLM. In this study, we propose a novel method for efficiently utilizing both enhanced LLM models and existing models, which reduces response errors by up to 9%.

安装插件收集

Empowering AI data scientists using a multi-agent LLM framework with self-evolving capabilities for autonomous, tool-aware biomedical data analyses.

利用具有自我进化能力的多智能体LLM框架赋能AI生物医学数据科学家进行自主、工具感知的生物医学数据分析

pubmed.ncbi.nlm.nih.gov-Dechao Bu, Jingbo Sun, Kun Li 等, 2026-Nature biomedical engineering1区 TopIF 26.6

Artificial intelligence agents are emerging as powerful applications of large language models (LLMs), automating complex tasks and enabling scientific data exploration. However, their use in biomedical data analysis remains limited by the difficulty of handling specialized tools and multistep reasoning. Here we introduce BioMedAgent, a self-evolving LLM multi-agent framework, which learns to use diverse bioinformatics tools and chain them into executable workflows through interactive exploration and memory retrieval algorithms. It allows biomedical users to initiate tasks using natural language, without requiring computational expertise. Evaluated on our newly released BioMed-AQA benchmark comprising 327 biomedical data tasks, BioMedAgent achieved a 77% success rate, surpassing other LLM agents, and generalized robustly to the external BixBench dataset. Beyond benchmarks, it autonomously performs cross-omics analysis, machine-learning modelling and pathology image segmentation, highlighting its potential to advance biomedical research and extend to other scientific domains requiring complex tool integration and multistep reasoning.

安装插件收集

Use of Retrieval-Augmented Large Language Model for COVID-19 Fact-Checking: Development and Usability Study.

利用检索增强的大型语言模型进行COVID-19事实核查：开发与可用性研究

pubmed.ncbi.nlm.nih.gov-Hai Li, Jingyi Huang, Mengmeng Ji 等, 2025-Journal of medical Internet research2区 TopIF 6.0

The COVID-19 pandemic has been accompanied by an "infodemic," where the rapid spread of misinformation has exacerbated public health challenges. Traditional fact-checking methods, though effective, are time-consuming and resource-intensive, limiting their ability to combat misinformation at scale. Large language models (LLMs) such as GPT-4 offer a more scalable solution, but their susceptibility to generating hallucinations-plausible yet incorrect information-compromises their reliability. This study aims to enhance the accuracy and reliability of COVID-19 fact-checking by integrating a retrieval-augmented generation (RAG) system with LLMs, specifically addressing the limitations of hallucination and context inaccuracy inherent in stand-alone LLMs. We constructed a context dataset comprising approximately 130,000 peer-reviewed papers related to COVID-19 from PubMed and Scopus. This dataset was integrated with GPT-4 to develop multiple RAG-enhanced models: the naïve RAG, Lord of the Retrievers (LOTR)-RAG, corrective RAG (CRAG), and self-RAG (SRAG). The RAG systems were designed to retrieve relevant external information, which was then embedded and indexed in a vector store for similarity searches. One real-world dataset and one synthesized dataset, each containing 500 claims, were used to evaluate the performance of these models. Each model's accuracy, F The baseline GPT-4 model achieved an accuracy of 0.856 on the real-world dataset. The naïve RAG model improved this to 0.946, while the LOTR-RAG model further increased accuracy to 0.951. The CRAG and SRAG models outperformed all others, achieving accuracies of 0.972 and 0.973, respectively. The baseline GPT-4 model reached an accuracy of 0.960 on the synthesized dataset. The naïve RAG model increased this to 0.972, and the LOTR-RAG, CRAG, and SRAG models achieved an accuracy of 0.978. These findings demonstrate that the RAG-enhanced models consistently maintained high accuracy levels, closely mirroring ground-truth labels and significantly reducing hallucinations. The CRAG and SRAG models also provided more detailed and contextually accurate explanations, further establishing the superiority of agentic RAG frameworks in delivering reliable and precise fact-checking outputs across diverse datasets. The integration of RAG systems with LLMs substantially improves the accuracy and contextual relevance of automated fact-checking. By reducing hallucinations and enhancing transparency by citing retrieved sources, this method holds significant promise for rapid, reliable information verification to combat misinformation during public health crises.

安装插件收集

Large language model agents can use tools to perform clinical calculations.

大型语言模型代理可以利用工具进行临床计算

pubmed.ncbi.nlm.nih.gov-Alex J Goodell, Simon N Chu, Dara Rouholiman 等, 2025-NPJ digital medicine1区 TopIF 15.1

Large language models (LLMs) can answer expert-level questions in medicine but are prone to hallucinations and arithmetic errors. Early evidence suggests LLMs cannot reliably perform clinical calculations, limiting their potential integration into clinical workflows. We evaluated ChatGPT's performance across 48 medical calculation tasks, finding incorrect responses in one-third of trials (n = 212). We then assessed three forms of agentic augmentation: retrieval-augmented generation, a code interpreter tool, and a set of task-specific calculation tools (OpenMedCalc) across 10,000 trials. Models with access to task-specific tools showed the greatest improvement, with LLaMa and GPT-based models demonstrating a 5.5-fold (88% vs 16%) and 13-fold (64% vs 4.8%) reduction in incorrect responses, respectively, compared to the unimproved models. Our findings suggest that integration of machine-readable, task-specific tools may help overcome LLMs' limitations in medical calculations.

安装插件收集

From prompt to platform: an agentic AI workflow for healthcare simulation scenario design.

从提示到平台：一种用于医疗仿真场景设计的代理式AI工作流程

pubmed.ncbi.nlm.nih.gov-Federico Lorenzo Barra, Giovanna Rodella, Alessandro Costa 等, 2025-Advances in simulation (London, England)3区IF 4.7

Healthcare simulation scenario design remains a resource-intensive process, demanding significant time and expertise from educators. This article presents an innovative AI-driven agentic workflow for healthcare simulation scenario development, bridging technical capability with pedagogical effectiveness. The system evolved from an initial ChatGPT-based prototype to a sophisticated platform implementation utilizing multiple specialized AI agents. Each agent addresses specific sub-tasks, including objective formulation, patient narrative generation, diagnostic data creation, and debriefing point development. The workflow employs advanced AI methodologies including decomposition, prompt chaining, parallelization, retrieval-augmented generation, and iterative refinement, all orchestrated through a user-friendly conversational interface. Critical to implementation was the demonstration that healthcare professionals with modest technical skills could develop these complex workflows without specialized AI expertise. The system ensures consistent adherence to established simulation guidelines, including INACSL Standards of Best Practice and ASPiH Standards Framework, while significantly reducing scenario development time by approximately 70-80%. Designed for broad applicability across diverse clinical settings and learner levels, the workflow incorporates multilingual capabilities for global application. Potential pitfalls include the necessity for rigorous review of AI-generated content and awareness of bias in model outputs. Key lessons learned emphasize interdisciplinary collaboration, systematic prompt refinement, essential human oversight, and the democratization of AI tools in healthcare education. This innovation demonstrates how sophisticated agentic AI implementations can transform healthcare simulation through enhanced efficiency, consistency, and accessibility without sacrificing pedagogical integrity.

安装插件收集

Agentic AI and Large Language Models in Radiology: Opportunities and Hallucination Challenges.

代理人工智能与大型语言模型在放射学中的应用：机遇与幻觉挑战

pubmed.ncbi.nlm.nih.gov-Sara Salehi, Yashbir Singh, Kelly K Horst 等, 2025-Bioengineering (Basel, Switzerland)3区IF 3.7

The field of radiology is experiencing rapid adoption of large language models (LLMs), yet their tendency to generate hallucinations (plausible but incorrect information) remains a significant barrier to trust. This comprehensive review evaluates emerging agentic artificial intelligence (AI) approaches, including multi-agent role-based systems, retrieval-augmented generation (RAG), and uncertainty quantification, to assess their potential for reducing hallucinations in radiology workflows. Evidence from 2024 to 2025 demonstrates that agentic AI can improve diagnostic accuracy and reduce error rates, though these methods remain computationally demanding and lack comprehensive clinical validation. Multi-agent frameworks enable cross-validation through role-based specialization and systematic workflow orchestration, while RAG strategies enhance accuracy by grounding responses in verified medical literature. Within multi-agent systems, uncertainty quantification enables agents to communicate confidence levels to one another, allowing them to appropriately weigh each other's contributions during collaborative analysis. While multi-agent frameworks and RAG strategies show significant promise, practical deployment will require careful integration with human oversight, robust evaluation metrics tailored to medical imaging tasks, and regulatory adaptation to ensure safe clinical use in diverse patient populations and imaging modalities.

安装插件收集

AgCV: An Agentic framework for automating computer vision application.

AgCV：一种用于自动化计算机视觉应用的代理框架

pubmed.ncbi.nlm.nih.gov-Arav Saxena, Archana Y Chaudhari, Anilkumar Gupta, 2025-MethodsX4区IF 1.9

This paper proposed the Agentic Computer Vision (AgCV) framework designed to automate complex computer vision (CV) tasks through autonomous agents that communicate through Graphical User Interface (GUI). The AgCV framework leverages LangGraph, natural language processing, deep learning, and data science to build adaptive, user-driven CV pipelines. In the proposed AgCV each Agent works on a particular task ranging from object identification and classification to image segmentation. By incorporating Retrieval-Augmented Generation (RAG) and LangGraph, the AgCV enable fully automated pipelining through user interactions. The proposed Framework strategy reduces the need for technical expertise, allowing end-users to generate and configure CV operations using intuitive language commands. AgCV promotes accessibility, scalability, and flexibility of CV applications in different domains. The AgCV not only simplifies user interaction but also ensures that the system aligns with user expectations and needs.•The proposed system allows users to create and configure CV operations using simple natural language, making it accessible even to those with limited technical expertise.•The AgCV framework supports a wide range of CV tasks and can be easily adapted to different user needs and applications.

安装插件收集

Automatic biomarker discovery and enrichment with BRAD.

基于BRAD的自动生物标志物发现与富集

pubmed.ncbi.nlm.nih.gov-Joshua Pickard, Ram Prakash, Marc Andrew Choi 等, 2025-Bioinformatics (Oxford, England)3区IF 5.4

Integrating Large Language Models (LLMs) with research tools presents technical and reproducibility challenges for biomedical research. While commercial artificial intelligence (AI) systems are easy to adopt, they obscure data provenance, lack transparency, and can generates false information, making them unfit for many research problems. To address these challenges, we developed the Bioinformatics Retrieval Augmented Digital (BRAD) agent software system. Here, we introduce BRAD, an agentic system that integrates LLMs with external tools and data to streamline research workflows. BRAD's modular agents retrieve information from literature, custom software, and online databases while maintaining transparent protocols to increase the reliability of AI generated results. We apply BRAD to a biomarker discovery pipeline, automating both execution and the generation of enrichment reports. This workflow contextualizes user data within the literature, enabling a level of interpretation and automation that surpasses conventional research tools. Beyond the workflow we highlight here, BRAD is a flexible system that has been deployed in other applications including a chatbot, video RAG, and analysis of single cell data. The source code for BRAD is available at https://github.com/Jpickard1/BRAD; Information for pip installation, tutorials, documentation, and further information can be found at: ReadTheDocs.

安装插件收集

From reviews to real-time: dynamic evidence in dentistry.

从综述到实时：牙科领域的动态证据

pubmed.ncbi.nlm.nih.gov-A V Gavrilova, C Galli, 2026-Evidence-based dentistryIF 2.3

The exponential growth of biomedical literature-over a million new PubMed entries each year-has outpaced traditional evidence-synthesis methods. Systematic reviews, long the cornerstone of evidence-based dentistry, are resource-intensive and often outdated within a few years, widening the gap between current research and clinical practice. We outline Retrieval-Augmented Generation (RAG) as a methodology for dynamic evidence reviews. RAG strengthens Large Language Models (LLMs) by combining their generative capacity with real-time retrieval from a continuously updated, curated knowledge base. This design grounds every answer in verifiable sources and mitigates the factual errors and hallucinations seen in standalone LLMs. RAG enables on-demand dynamic synthesis of the latest evidence, allowing clinicians and researchers to ask complex, natural-language questions and receive concise, fully cited answers. For dental clinicians, this approach enables rapid, citation-linked answers to practice-relevant questions-such as material selection, healing outcomes, or procedural comparisons-without relying on outdated narrative summaries. We describe three complementary integration pathways-RAG on pre-retrieved article pools, public living review portals, and machine-actionable journal publications-each with distinct requirements and benefits. Looking forward, emerging agentic AI systems, capable of planning multi-step searches and iterative updates, may further enhance these capabilities. Although this framework is conceptually grounded and supported by emerging methodological evidence, prospective empirical validation, benchmarking against existing review approaches, and real-world deployment studies will be required to fully assess its performance, reliability, and impact on clinical decision-making. RAG offers a scalable, transparent alternative to static systematic reviews and can shorten the research-to-practice timeline. By automating retrieval and initial synthesis while keeping human critical appraisal and ethical judgment central, it points toward an era of augmented rather than automated intelligence in evidence-based dentistry.

安装插件收集

A multi-agent approach to neurological clinical reasoning.

基于多智能体的神经临床推理方法

pubmed.ncbi.nlm.nih.gov-Moran Sorka, Alon Gorenshtein, Dvir Aran 等, 2025-PLOS digital healthIF 7.7

Large language models (LLMs) have demonstrated impressive capabilities in medical domains, yet their ability to handle the specialized reasoning patterns required in clinical neurology warrants systematic evaluation. Neurological assessment presents distinctive challenges that combine anatomical localization, temporal pattern recognition, and nuanced symptom interpretation-cognitive processes that are specifically tested in board certification examinations. We developed a comprehensive benchmark comprising 305 questions from Israeli Board Certification Exams in Neurology and classified each along three dimensions of complexity: factual knowledge depth, clinical concept integration, and reasoning complexity. We evaluated ten LLMs of varying architectures and specializations using this benchmark, testing base models, retrieval-augmented generation (RAG) enhancement, and a novel multi-agent system. Our analysis revealed significant performance variation across models and methodologies. The OpenAI-o1 model achieved the highest base performance (90.9% accuracy), while specialized medical models performed surprisingly poorly (52.9% for Meditron-70B). RAG enhancement provided variable benefits across models; substantial improvements for mid-tier models like GPT-4o (80.5% to 87.3%) and smaller models, but limited effectiveness on the highest complexity questions regardless of model size. In contrast, our multi-agent framework-which decomposes neurological reasoning into specialized cognitive functions including question analysis, knowledge retrieval, answer synthesis, and validation-achieved dramatic improvements, especially for mid-range models. The LLaMA 3.3-70B-based agentic system reached 89.2% accuracy compared to 69.5% for its base model, with particularly substantial gains on level 3 complexity questions across all dimensions. External validation on MedQA revealed dataset-specific RAG effects: while RAG improved board certification performance, it showed minimal benefit on MedQA questions (LLaMA 3.3-70B: + 1.4% vs + 3.9% on board exams), reflecting alignment between our specialized neurology textbook and board examination content rather than the broader medical knowledge required for MedQA. Most notably, the multi-agent approach transformed inconsistent subspecialty performance into remarkably uniform excellence, effectively addressing the neurological reasoning challenges that persisted even with RAG enhancement. We further validated our approach using an independent dataset comprising 155 neurological cases extracted from MedQA. The results confirm that structured multi-agent approaches designed to emulate specialized cognitive processes significantly enhance complex medical reasoning offering promising directions for AI assistance in challenging clinical contexts.

安装插件收集

A Concept for Bio-Agentic Visual Communication: Bridging Swarm Intelligence with Biological Analogues.

生物驱动的视觉通信概念：将群体智能与生物类比相结合

pubmed.ncbi.nlm.nih.gov-Bryan Starbuck, Hanlong Li, Bryan Cochran 等, 2025-Biomimetics (Basel, Switzerland)3区IF 3.9

Biological swarms communicate through decentralized, adaptive behaviors shaped by local interactions, selective attention, and symbolic signaling. These principles of animal communication enable robust coordination without centralized control or persistent connectivity. This work presents a proof of concept that identifies, evaluates, and translates biological communication strategies into a generative visual language for unmanned aerial vehicle (UAV) swarm agents operating in radio-frequency (RF)-denied environments. Drawing from natural exemplars such as bee waggle dancing, white-tailed deer flagging, and peacock feather displays, we construct a configuration space that encodes visual messages through trajectories and LED patterns. A large language model (LLM), preconditioned using retrieval-augmented generation (RAG), serves as a generative translation layer that interprets perception data and produces symbolic UAV responses. Five test cases evaluate the system's ability to preserve and adapt signal meaning through within-modality fidelity (maintaining symbolic structure in the same modality) and cross-modal translation (transferring meaning across motion and light). Covariance and eigenvalue-decomposition analysis demonstrate that this bio-agentic approach supports clear, expressive, and decentralized communication, with motion-based signaling achieving near-perfect clarity and expressiveness (0.992, 1.000), while LED-only and multi-signal cases showed partial success, maintaining high expressiveness (~1.000) but with much lower clarity (≤0.298).

安装插件收集

A self-correcting Agentic Graph RAG for clinical decision support in hepatology.

一种用于肝脏病学临床决策支持的自我校正代理图RAG

pubmed.ncbi.nlm.nih.gov-Yalan Hu, Wenjie Xuan, Qingqing Zhou 等, 2025-Frontiers in medicine3区IF 3.0

Clinical decision-making in hepatology is currently challenged by the rapid expansion of medical knowledge and the limitations of Large Language Models (LLMs), specifically their unreliability and tendency to hallucinate. Furthermore, standard Retrieval-Augmented Generation (RAG) paradigms often fail to effectively leverage complex medical knowledge structures. To address these issues, we propose an Agentic Graph RAG framework built upon a clinically-verified hepatology knowledge graph. Our approach utilizes a state-driven agentic system employing a self-correcting "retrieve-evaluate-refine" loop. Within this workflow, agents dynamically generate, semantically validate, assess, and iteratively optimize graph search strategies to construct a comprehensive and accurate context, which is then used by an LLM to generate reliable responses. The framework was evaluated on a custom dataset of clinical questions. It significantly outperformed baseline models (including GPT-4, standard RAG, and Graph RAG) across all evaluation metrics. Specifically, our model achieved superior scores in faithfulness (0.94), context recall (0.92), and answer relevancy (0.91). This agentic approach effectively mitigates LLM hallucinations and provides accurate, interpretable answers. These findings demonstrate the framework's potential as a robust, next-generation intelligent clinical decision support tool for hepatology.

安装插件收集

From data silos to insights: the PRINCE multi-agent knowledge engine for preclinical drug development.

从数据孤岛到洞察：基于大型语言模型的PRINCE多智能体知识引擎在临床前药物开发中的应用

pubmed.ncbi.nlm.nih.gov-Carlos Henrique Vieira-Vieira, Sarang Sanjay Kulkarni, Adam Zalewski 等, 2025-Frontiers in artificial intelligence4区IF 4.7

The pharmaceutical industry faces pressure to improve the drug development process while reducing costs in an evolving regulatory landscape. This paper presents the Preclinical Information Center (PRINCE), a cloud-hosted data integration platform developed by Bayer AG in collaboration with Thoughtworks. PRINCE integrates decades of structured and unstructured safety study reports, leveraging a multi-agent architecture based on Large Language Models (LLMs) and advanced data retrieval methodologies, such as Retrieval-Augmented Generation and Text-to-SQL. In this paper, we describe the three-step evolution of PRINCE from a data search tool based on keyword matching to a resourceful research assistant capable of answering complex questions and drafting regulatory-critical documents. We highlight the iterative development process, guided by user feedback, that ensures alignment with evolving research needs and maximizes utility. Finally, we discuss the importance of building trust-based solutions and how transparency and explainability have been integrated into PRINCE. In particular, the integration of a human-in-the-loop approach enhances the accuracy and retains human accountability. We believe that the development and deployment of the PRINCE chatbot demonstrate the transformative potential of AI in the pharmaceutical industry, significantly improving data accessibility and research efficiency, while prioritizing data governance and compliance.

安装插件收集

Context-Aware Multi-Agent Architecture for Wildfire Insights.

基于上下文感知的多智能体架构用于森林火灾洞察

pubmed.ncbi.nlm.nih.gov-Ashen Sandeep, Sithum Jayarathna, Sunera Sandaruwan 等, 2026-Sensors (Basel, Switzerland)3区IF 3.5

Wildfires are environmental hazards with severe ecological, social, and economic impacts. Wildfires devastate ecosystems, communities, and economies worldwide, with rising frequency and intensity driven by climate change, human activity, and environmental shifts. Analyzing wildfire insights such as detection, predictive patterns, and risk assessment enables proactive response and long-term prevention. However, most of the existing approaches have been focused on isolated processing of data, making it challenging to orchestrate cross-modal reasoning and transparency. This study proposed a novel orchestrator-based multi-agent system (MAS), with the aim of transforming multimodal environmental data into actionable intelligence for decision making. We designed a framework to utilize Large Multimodal Models (LMMs) augmented by structured prompt engineering and specialized Retrieval-Augmented Generation (RAG) pipelines to enable transparent and context-aware reasoning, providing a cutting-edge Visual Question Answering (VQA) system. It ingests diverse inputs like satellite imagery, sensor readings, weather data, and ground footage and then answers user queries. Validated by several public datasets, the system achieved a precision of 0.797 and an F1-score of 0.736. Thus, powered by Agentic AI, the proposed, human-centric solution for wildfire management, empowers firefighters, governments, and researchers to mitigate threats effectively.

安装插件收集

Research on risk decision-making generation method for water conservancy project based on multimodal knowledge graph and large language model.

基于多模态知识图谱和大型语言模型的水利工程项目风险决策生成方法研究

pubmed.ncbi.nlm.nih.gov-Libo Yang, Yuan Li, Junhua Tan 等, 2025-PloS one3区IF 2.6

Traditional knowledge graphs of water conservancy project risks have supported risk decision-making. However, they are constrained by limited data modalities and low accuracy in information extraction. A multimodal water conservancy project risk knowledge graph is proposed in this study, along with a synergistic strategy involving multimodal large language models Risk decision-making generation is facilitated through a multi-agent agentic retrieval-augmented generation framework. To enhance visual recognition, a DenseNet-based image classification model is improved by incorporating single-head self-attention and coordinate attention mechanisms. For textual data, risk entities such as locations, components, and events are extracted using a BERT-BiLSTM-CRF architecture. These extracted entities serve as the foundation for constructing the multimodal knowledge graph. To support generation, a multi-agent agentic retrieval-augmented generation mechanism is introduced. This mechanism enhances the reliability and interpretability of risk decision-making outputs. In experiments, the enhanced DenseNet model outperforms the original baseline in both precision and recall for image recognition tasks. In risk decision-making tasks, the proposed approach-combining a multimodal knowledge graph with a multi-agent agentic retrieval-augmented generation method-achieves strong performance on BERTScore and ROUGE-L metrics. This work presents a novel perspective for leveraging multimodal knowledge graphs in water conservancy project risk management.

安装插件收集

Development and evaluation of an agentic LLM based RAG framework for evidence-based patient education.

基于代理的LLM驱动的RAG框架的开发与评估，用于生成基于证据的阿拉伯语患者教育材料

pubmed.ncbi.nlm.nih.gov-AlHasan AlSammarraie, Ali Al-Saifi, Hassan Kamhia 等, 2025-BMJ health & care informatics3区IF 4.4

To develop and evaluate an agentic retrieval augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs) and assess the LLMs capabilities as validation agents tasked with blocking harmful content. We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via two-stage evaluation (automated LLM, then expert review) using 5 metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately using a harmful/safe PEM dataset, measuring blocking accuracy. ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top 9 ranks. Expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed highest-performing. VA accuracy correlated strongly with model size; only models ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA. Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits impacted large-context models. The validation task highlighted model size as critical for reliable performance. ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models like AceGPT-v2-32B. Larger models appear necessary for reliable harmful content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers.

安装插件收集

Agentic Search Engine for Real-Time Internet of Things Data.

实时物联网数据代理搜索引擎

pubmed.ncbi.nlm.nih.gov-Abdelrahman Elewah, Khalid Elgazzar, Said Elnaffar, 2025-Sensors (Basel, Switzerland)3区IF 3.5

The Internet of Things (IoT) has enabled a vast network of devices to communicate over the Internet. However, the fragmentation of IoT systems continues to hinder seamless data sharing and coordinated management across platforms.However, there is currently no actual search engine for IoT data. Existing IoT search engines are considered device discovery tools, providing only metadata about devices rather than enabling access to IoT application data. While efforts such as IoTCrawler have striven to support IoT application data, they have largely failed due to the fragmentation of IoT systems and the heterogeneity of IoT data.To address this, we recently introduced SensorsConnect-a unified framework designed to facilitate interoperable content and sensor data sharing among collaborative IoT systems, inspired by how the World Wide Web (WWW) enabled shared and accessible information spaces for humans. This paper presents the IoT Agentic Search Engine (IoTASE), a real-time semantic search engine tailored specifically for IoT environments. IoTASE leverages LLMs and Retrieval-Augmented Generation (RAG) techniques to address the challenges of navigating and searching vast, heterogeneous streams of real-time IoT data. This approach enables the system to process complex natural language queries and return accurate, contextually relevant results in real time. To evaluate its effectiveness, we implemented a hypothetical deployment in the Toronto region, simulating a realistic urban environment using a dataset composed of 500 services and over 37,000 IoT-like data entries. Our evaluation shows that IoT-ASE achieved 92% accuracy in retrieving intent-aligned services and consistently generated concise, relevant, and preference-aware responses, outperforming generalized outputs produced by systems such as Gemini. These results underscore the potential of IoT-ASE to make real-time IoT data both accessible and actionable, supporting intelligent decision-making across diverse application domains.

安装插件收集

agent，ai，llm

本报告对 Agent、AI 及 LLM 研究进行了深度整合与分类，构建了涵盖底层理论、应用开发与安全治理的综合框架。研究呈现出明显的范式转变：从基础的 Agentic RAG 框架进化至复杂的科学自动化工作流；从单一的任务执行向多智能体协同协作演进；从通用架构设计走向各行业（如生物医药、工业运维、社会科学）的垂直落地。同时，随着智能体应用复杂性的提升，针对隐私安全、行为验证及人机信任评估的研究已成为不可或缺的基石，标志着 AI Agent 正步入高度自主与可控并行发展的工业化应用阶段。

共 239 篇文献，6 个研究方向

代理式检索增强生成 (Agentic RAG) 与知识集成架构

这些文献系统性地探讨了将自主代理逻辑融入 RAG 流程，包括动态检索、推理增强、图知识库结合以及多模态知识处理，旨在解决传统 RAG 静态响应的局限性。相关文献: Yongqi Leng et. al, 2025 等 49 篇文献

通用智能体架构、规划范式与评估方法

该类文献关注智能体的核心基础设计，包括ReAct范式、任务规划、工具调用、轨迹校准及衡量智能体能力的标准化评估基准研究。相关文献: Shunyu Yao et. al, 2022 等 49 篇文献

科学发现、医疗与生物工程中的智能体自动化

这些文献研究智能体在科研自动化、临床决策支持、药物发现、生物信息分析等专业领域中的应用，强调自动化工作流对科学探索的赋能。相关文献: Meng Xiao et. al, 2025 等 55 篇文献

工业运维、基础设施治理与企业级自动化

该组文献集中在智能体于企业 IT、云服务、工业自动化、DevOps 及网络安全等场景的部署，探讨了如何通过策略协同实现工业流程监控与安全治理。相关文献: Vamsi Krishna et. al, 2025 等 32 篇文献

社会认知、人机交互与具身感知智能

这些文献探讨智能体在社会模拟、教育辅导、城市交通、具身智能及人机协作中的应用，重点关注用户体验、意图理解及动态适应性。相关文献: Jiawei Wang et. al, 2024 等 42 篇文献

代理安全、隐私与信任评估

专门针对智能体运行时的安全挑战，包括隐私风险提取、对抗攻击防御、恶意行为监测、信任验证以及在复杂基础设施下的安全性保障。相关文献: Liang-bo Ning et. al, 2024 等 12 篇文献