agent,ai,llm
代理式检索增强生成 (Agentic RAG) 与知识集成架构
这些文献系统性地探讨了将自主代理逻辑融入 RAG 流程,包括动态检索、推理增强、图知识库结合以及多模态知识处理,旨在解决传统 RAG 静态响应的局限性。
- DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision(Yongqi Leng, Yikun Lei, Xikai Liu, Meizhi Zhong, Bojian Xiong, Yurong Zhang, Yan Gao, Yi Wu, Yao Hu, Deyi Xiong, 2025, Conference on Empirical Methods in Natural Language Processing)
- Empowering Large Language Model Reasoning : Hybridizing Layered Retrieval Augmented Generation and Knowledge Graph Synthesis(Vedanth Aggarwal, 2024, International Journal of High School Research)
- A self-correcting Agentic Graph RAG for clinical decision support in hepatology.(Yalan Hu, Wenjie Xuan, Qingqing Zhou, Zhi Li, Ya Li, Jili Hu, Fang Fang, 2025, Frontiers in medicine)
- ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation(Reza Yousefi Maragheh, Pratheek Vadla, Priyank Gupta, Kai Zhao, Aysenur Inan, Kehui Yao, Jianpeng Xu, Praveen Kanumala, Jason Cho, Sushant Kumar, 2025, ArXiv Preprint)
- Open-source modular AI coupled with agentic AI for comprehensive breast cancer note generation and guideline-directed treatment comparison.(Ahmed Sandhu, Elizabeth Jaewon Kim, Daniela Urueta Portillo, Becky Powers, Ronald Rodriguez, 2025, Journal of Clinical Oncology)
- A Unified Agentic Framework for Evaluating Conditional Image Generation(Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang, 2025, Annual Meeting of the Association for Computational Linguistics)
- Agentic RAG for Command Generation in Automated Penetration Testing(Zhengkun Chen, Chuanjun Yi, Pan Jia, 2025, 2025 International Conference on Signal Processing, Computer Networks and Communications (SPCNC))
- FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation(Mohammad Aghajani Asl, Majid Asgari-Bidhendi, Behrooz Minaei-Bidgoli, 2025, ArXiv Preprint)
- Retrieval-Augmented Generation to Generate Knowledge Assets and Creation of Action Drivers(A. James, Marcello Trovati, Simon Bolton, 2025, Applied Sciences)
- Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning(Wenchuan Zhang, Jingru Guo, Heng Zhang, Penghao Zhang, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu, 2025, AAAI Conference on Artificial Intelligence)
- Traditional RAG vs. Agentic RAG: A Comparative Study of Retrieval-Augmented Systems(Fnu Neha, Deepshikha Bhati, 2025, 2025 IEEE International Conference on Future Machine Learning and Data Science (FMLDS))
- Search-o1: Agentic Search-Enhanced Large Reasoning Models(Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou, 2025, Conference on Empirical Methods in Natural Language Processing)
- A Collaborative Multi-Agent Approach to Retrieval-Augmented Generation Across Diverse Data(Aniruddha Salve, Saba Attar, Mahesh Deshmukh, Sayali Shivpuje, Arnab Mitra Utsab, 2024, ArXiv Preprint)
- MapAgent: A Hierarchical Agent for Geospatial Reasoning with Dynamic Map Tool Integration(Md Hasebul Hasan, Mahir Labib Dihan, Mohammed Eunus Ali, Md. Rizwan Parvez, 2025, Conference of the European Chapter of the Association for Computational Linguistics)
- IGMiRAG: Intuition-Guided Retrieval-Augmented Generation with Adaptive Mining of In-Depth Memory(Xingliang Hou, Yuyan Liu, Qi Sun, haoxiu wang, Hao Hu, Shaoyi Du, Zhiqiang Tian, 2026, ArXiv Preprint)
- You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects(Islem Bouzenia, Michael Pradel, 2024, Proceedings of the ACM on Software Engineering)
- Data Interpreter: An LLM Agent For Data Science(Sirui Hong, Yizhang Lin, Bangbang Liu, Binhao Wu, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Lingyao Zhang, Mingchen Zhuge, Taicheng Guo, Tuo Zhou, Wei Tao, Wenyi Wang, Xiangru Tang, Xiang Lu, Xinbing Liang, Yaying Fei, Yuheng Cheng, Zhibin Gou, Zongze Xu, Chenglin Wu, Li Zhang, Min Yang, Xiawu Zheng, 2024, Annual Meeting of the Association for Computational Linguistics)
- DanceAgent: Dance Movement Refinement With LLM Agent.(Cheng Shang, Xingyu Chen, Liang An, Jiajun Zhang, Yuxiang Zhang, Yebin Liu, Xubo Yang, 2026, IEEE transactions on visualization and computer graphics)
- Agentic memory-augmented retrieval and evidence grounding for medical question-answering tasks(S. Jia, S. Bit, V. H. Jasodanand, Yi Liu, V. Kolachalama, 2025, medRxiv)
- HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases(Meng-Chieh Lee, Qi Zhu, C. Mavromatis, Zhen Han, Soji Adeshina, V. Ioannidis, H. Rangwala, Christos Faloutsos, 2024, Annual Meeting of the Association for Computational Linguistics)
- Pioneering agentic retrieval-augmented generation in software quality: a novel framework for code smell detection via dynamic retrieval(Bushra Aljohani, Abdulmajeed Aljuhani, 2026, PeerJ Computer Science)
- KA-RAG: Integrating Knowledge Graphs and Agentic Retrieval-Augmented Generation for an Intelligent Educational Question-Answering Model(Fangqun Gao, Shun-Yi Xu, Weiyang Hao, Tao Lu, 2025, Applied Sciences)
- Agentic Search Engine for Real-Time Internet of Things Data.(Abdelrahman Elewah, Khalid Elgazzar, Said Elnaffar, 2025, Sensors (Basel, Switzerland))
- Agentic RAG with Human-in-the-Retrieval(Xiwei Xu, Dawen Zhang, Qing Liu, Qinghua Lu, Liming Zhu, 2025, 2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C))
- Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework(Zhaorui Yang, Bo Pan, Han Wang, Yiyao Wang, Xingyu Liu, Minfeng Zhu, Bo Zhang, Wei Chen, 2025, AAAI Conference on Artificial Intelligence)
- Agentic Scene Policies: Unifying Space, Semantics, and Affordances for Robot Action(Sacha Morin, Kumaraditya Gupta, Mahtab Sandhu, Charlie Gauthier, Francesco Argenziano, Kirsty Ellis, Liam Paull, 2025, ArXiv Preprint)
- CogPlanner: Unveiling the Potential of Agentic Multimodal Retrieval Augmented Generation with Planning(Xiaohan Yu, Zhihan Yang, Chong Chen, 2025, Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region)
- AgCV: An Agentic framework for automating computer vision application.(Arav Saxena, Archana Y Chaudhari, Anilkumar Gupta, 2025, MethodsX)
- RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing(Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, Xiangyu Zhang, 2025, International Conference on Machine Learning)
- Spec2RTL-Agent: Automated Hardware Code Generation from Complex Specifications Using LLM Agent Systems(Zhongzhi Yu, Mingjie Liu, Michael Zimmer, Y. Lin, Yong Liu, Haoxing Ren, 2025, 2025 IEEE International Conference on LLM-Aided Design (ICLAD))
- DeepCodeSeek: Real-Time API Retrieval for Context-Aware Code Generation(Esakkivel Esakkiraja, Denis Akhiyarov, Aditya Shanmugham, Chitra Ganapathy, 2025, ArXiv Preprint)
- CARROT: A Learned Cost-Constrained Retrieval Optimization System for RAG(Ziting Wang, Haitao Yuan, Wei Dong, Gao Cong, Feifei Li, 2024, ArXiv Preprint)
- WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment(Hao Tang, Darren Key, Kevin Ellis, 2024, Neural Information Processing Systems)
- Large language model agents can use tools to perform clinical calculations.(Alex J Goodell, Simon N Chu, Dara Rouholiman, Larry F Chu, 2025, NPJ digital medicine)
- WaitGPT: Monitoring and Steering Conversational LLM Agent in Data Analysis with On-the-Fly Code Visualization(Liwenhan Xie, Chengbo Zheng, Haijun Xia, Huamin Qu, Zhu-Tian Chen, 2024, Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology)
- Enhancing Cognitive Digital Twin Interaction using an LLM Agent(Jan Sturm, Patrik Zajec, Maja Skrjanc, Dunja Mladenić, M. Grobelnik, 2024, 2024 47th MIPRO ICT and Electronics Convention (MIPRO))
- A Concept for Bio-Agentic Visual Communication: Bridging Swarm Intelligence with Biological Analogues.(Bryan Starbuck, Hanlong Li, Bryan Cochran, Marc Weissburg, Bert Bras, 2025, Biomimetics (Basel, Switzerland))
- Development and evaluation of an agentic LLM based RAG framework for evidence-based patient education.(AlHasan AlSammarraie, Ali Al-Saifi, Hassan Kamhia, Mohamed Aboagla, Mowafa Househ, 2025, BMJ health & care informatics)
- Precedent-Aware Multi-Agent Retrieval-Augmented Generation in Case Law Analysis(Shatrunjay Kumar Singh, 2026, International Journal of Innovative Science and Research Technology)
- Multi-Modal Retrieval Augmented Visual Understanding and Generation(Zhucun Xue, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- AirRAG: Autonomous Strategic Planning and Reasoning Steer Retrieval Augmented Generation(Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Guochao Jiang, Jingyi Song, Hao Wang, 2025, Findings of the Association for Computational Linguistics: EMNLP 2025)
- MRAgent: an LLM-based automated agent for causal knowledge discovery in disease via Mendelian randomization.(Wei Xu, Gang Luo, Weiyu Meng, Xiaobing Zhai, Keli Zheng, Ji Wu, Yanrong Li, Abao Xing, Junrong Li, Zhifan Li, Ke Zheng, Kefeng Li, 2025, Briefings in bioinformatics)
- TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework(Chao Zhang, Yuhao Wang, Derong Xu, Haoxin Zhang, Yuanjie Lyu, Yuhao Chen, Shuochen Liu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, Enhong Chen, 2025, ArXiv Preprint)
- Performance Enhancement of Agentic Retrieval Augmented Generation Using Relevance Generative Answering(Sanjay Kukreja, Tarun Kumar, Vishal Bharate, Sweta Gadwe, Abhijit Dasgupta, Debashis Guha, 2025, 2025 5th International Conference on Artificial Intelligence and Education (ICAIE))
- RAG-Critic: Leveraging Automated Critic-Guided Agentic Workflow for Retrieval Augmented Generation(Guanting Dong, Jiajie Jin, Xiaoxi Li, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen, 2025, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation(Thomas Cook, Richard Osuagwu, Liman Tsatiashvili, Vrynsia Vrynsia, Koustav Ghosal, Maraim Masoud, Riccardo Mattivi, 2025, 2025 3rd International Conference on Foundation and Large Language Models (FLLM))
- Agentic Retrieval-Augmented Generation: Advancing AI-Driven Information Retrieval and Processing(Abhai Pratap Singh, Adit Jamdar, Prerna Kaul, 2025, International Journal of Computer Trends and Technology)
- Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization(Ryan Barron, M. Eren, Olga M. Serafimova, Cynthia Matuszek, Boian Alexandrov, 2025, Proceedings of the Twentieth International Conference on Artificial Intelligence and Law)
- Agentic AI with retrieval-augmented generation for automated compliance assistance in finance(Varun Pandey, 2025, International Journal of Science and Research Archive)
通用智能体架构、规划范式与评估方法
该类文献关注智能体的核心基础设计,包括ReAct范式、任务规划、工具调用、轨迹校准及衡量智能体能力的标准化评估基准研究。
- ReAct: Synergizing Reasoning and Acting in Language Models(Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, 2022, ArXiv Preprint)
- From prompt engineering to agent engineering: expanding the AI toolbox with autonomous agentic AI collaborators for biomedical discovery.(Jason H Moore, Nicholas P Tatonetti, 2025, BioData mining)
- ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary(Yutong Li, Lu Chen, Aiwei Liu, Kai Yu, Lijie Wen, 2024, International Conference on Computational Linguistics)
- STeCa: Step-level Trajectory Calibration for LLM Agent Learning(Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li, 2025, Annual Meeting of the Association for Computational Linguistics)
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning(Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, V. Ioannidis, Karthik Subbian, J. Leskovec, James Zou, 2024, Advances in Neural Information Processing Systems 37)
- Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025(Samy Ateia, U. Kruschwitz, 2025, Conference and Labs of the Evaluation Forum)
- Exploring and Controlling Diversity in LLM-Agent Conversation(Kuanchao Chu, Yi-Pei Chen, Hideki Nakayama, 2024, Conference on Empirical Methods in Natural Language Processing)
- AgentLens: Visual Analysis for Agent Behaviors in LLM-Based Autonomous Systems.(Jiaying Lu, Bo Pan, Jieyi Chen, Yingchaojie Feng, Jingyuan Hu, Yuchen Peng, Wei Chen, 2025, IEEE transactions on visualization and computer graphics)
- Architectures for Building Agentic AI(Sławomir Nowaczyk, 2025, ArXiv Preprint)
- PaSa: An LLM Agent for Comprehensive Academic Paper Search(Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, W. E, 2025, Annual Meeting of the Association for Computational Linguistics)
- AGENTIC AI: A COMPREHENSIVE FRAMEWORK FOR AUTONOMOUS DECISION-MAKING SYSTEMS IN ARTIFICIAL INTELLIGENCE(Panneer Selvam Viswanathan, 2025, INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING AND TECHNOLOGY)
- Thinking Isn't an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations(Zhao Song, Song Yue, Jiahao Zhang, 2025, Robotics)
- RA-Gen: A Controllable Code Generation Framework Using ReAct for Multi-Agent Task Execution(Aofan Liu, Haoxuan Li, Bin Wang, Ao Yang, Hui Li, 2025, ArXiv Preprint)
- LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval(Minh-Chi Phung, Thien-Bao Le, Cam-Tu Tran-Thi, Thu-Dieu Nguyen-Thi, Vu-Hung Dao, 2026, ArXiv Preprint)
- Use of Retrieval-Augmented Large Language Model for COVID-19 Fact-Checking: Development and Usability Study.(Hai Li, Jingyi Huang, Mengmeng Ji, Yuyi Yang, Ruopeng An, 2025, Journal of medical Internet research)
- Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search(Shuocheng Li, Yihao Liu, Siling Du, Wenxuan Zeng, Zhe Xu, Mengyu Zhou, Yeye He, Haoyu Dong, Shi Han, Dongmei Zhang, 2025, AAAI Conference on Artificial Intelligence)
- LIFE-CRAFT: A Multi-agentic Conversational RAG Framework for Lifestyle Medicine Coaching with Context Traceability and Case-Based Evidence Synthesis(Hania Aslam, Gousia K. Malak, Max Renault, R. Thomas, 2025, Lecture Notes in Computer Science)
- ACEBench: A Comprehensive Evaluation of LLM Tool Usage(Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Yue-Run Huang, Xiangcheng Liu, Xinzhi Wang, Wulong Liu, 2025, Findings of the Association for Computational Linguistics: EMNLP 2025)
- Agint: Agentic Graph Compilation for Software Engineering Agents(Abhi Chivukula, Jay Somasundaram, Vijay Somasundaram, 2025, ArXiv Preprint)
- AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems(YenTing Lee, Keerthi Koneru, Zahra Moslemi, Sheethal Kumar, Ramesh Radhakrishnan, 2026, ArXiv Preprint)
- On the Brittle Foundations of ReAct Prompting for Agentic Large Language Models(Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati, 2024, ArXiv Preprint)
- Autono: A ReAct-Based Highly Robust Autonomous Agent Framework(Zihao Wu, 2025, ArXiv Preprint)
- A general AI agent framework for smart buildings based on large language models and ReAct strategy(Xia Yan, Xincong Yang, Nan Jin, Yu Chen, Jiaqi Li, 2025, Smart Construction)
- AgentGuard: Runtime Verification of AI Agents(Roham Koohestani, 2025, ArXiv Preprint)
- LLM-Collab: a framework for enhancing task planning via chain-of-thought and multi-agent collaboration(Hongyu Cao, Rong Ma, Yanlong Zhai, Jun Shen, 2024, Applied Computing and Intelligence)
- Reason-Plan-ReAct: A Reasoner-Planner Supervising a ReAct Executor for Complex Enterprise Tasks(Gianni Molinari, Fabio Ciravegna, 2025, ArXiv Preprint)
- Reactive to Agentic:Next Gen AI Agent Transition Framework(Rinki Singh, Tammana Sachdeva, Sabita, Shikha Arora, Deepak Garg, 2025, 2025 3rd International Conference on Advances in Computation, Communication and Information Technology (ICAICCIT))
- Agentic AI: A Quantitative Analysis of Performance and Applications(P. Sawant, 2025, Journal of Advances in Artificial Intelligence)
- Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning(Xiaolong Wei, Yuehu Dong, Xingliang Wang, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin, 2025, AAAI Conference on Artificial Intelligence)
- AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning(Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai, Ke Shen, Jingrui He, Mengdi Wang, 2025, ArXiv Preprint)
- Measuring an LLM's Proficiency at using APIs: A Query Generation Strategy(Ying Sheng, Sudeep Gandhe, Bhargav Kanagal, Nick Edmonds, Zachary Fisher, Sandeep Tata, Aarush Selvan, 2024, Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
- A study on classification based concurrent API calls and optimal model combination for tool augmented LLMs for AI agent.(HeounMo Go, SangHyun Park, 2025, Scientific reports)
- From Retrieval to Cognitive Orchestration: Standardizing Context Management in Agentic AI Systems(Bhaskara Reddy Udaru, 2026, International Journal of Computational and Experimental Science and Engineering)
- Autonomous agentic AI with policy adaptation for physics-informed spectral learning in Structural Health Monitoring(Anshu Sharma, B. Bhowmik, 2026, Advanced Engineering Informatics)
- Agentic AI systems in the age of generative models: architectures, cloud scalability, and real-world applications(Lingareddy Alva, Bishwajeet Pandey, 2026, Artificial Intelligence Review)
- Efficient Tool Use with Chain-of-Abstraction Reasoning(Silin Gao, Jane Dwivedi-Yu, Ping Yu, Xiaoqing Tan, Ramakanth Pasunuru, Olga Golovneva, Koustuv Sinha, Asli Celikyilmaz, Antoine Bosselut, Tianlu Wang, 2024, International Conference on Computational Linguistics)
- SMART: Self-Aware Agent for Tool Overuse Mitigation(Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tur, Gokhan Tur, Heng Ji, 2025, Annual Meeting of the Association for Computational Linguistics)
- AgentSquare: Automatic LLM Agent Search in Modular Design Space(Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, Yong Li, 2024, International Conference on Learning Representations)
- Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement(Weimin Xiong, Yifan Song, Xiutian Zhao, Wenhao Wu, Xun Wang, Ke Wang, Cheng Li, Wei Peng, Sujian Li, 2024, Conference on Empirical Methods in Natural Language Processing)
- Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents(Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Suzan Verberne, Zhaochun Ren, 2024, Proceedings of the ACM on Web Conference 2025)
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent(Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, Fei Huang, 2024, Conference on Empirical Methods in Natural Language Processing)
- Architecting Agentic Communities using Design Patterns(Zoran Milosevic, Fethi Rabhi, 2026, ArXiv Preprint)
- Auto-scaling LLM-based multi-agent systems through dynamic integration of agents.(Ravindu Perera, Anuradha Basnayake, Manjusri Wickramasinghe, 2025, Frontiers in artificial intelligence)
- Agentic AI: A Paradigm for Autonomous Decision-Making(Meethun Panda, 2025, International Journal of Innovative Research in Science Engineering and Technology)
- Toolken+: Improving LLM Tool Usage with Reranking and a Reject Option(K. Yakovlev, Sergey Nikolenko, A. Bout, 2024, Conference on Empirical Methods in Natural Language Processing)
- ToolQA: A Dataset for LLM Question Answering with External Tools(Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, Chao Zhang, 2023, Neural Information Processing Systems)
- Multi-View Encoders for Performance Prediction in LLM-Based Agentic Workflows(Patara Trirat, Wonyong Jeong, Sung Ju Hwang, 2025, ArXiv Preprint)
- Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools(Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, Yueming Jin, 2025, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments(Siwei Wu, Yizhi Li, Yuyang Song, Wei Zhang, Yang Wang, Riza Batista-Navarro, Xian Yang, Mingjie Tang, Bryan Dai, Jian Yang, Chenghua Lin, 2026, ArXiv Preprint)
科学发现、医疗与生物工程中的智能体自动化
这些文献研究智能体在科研自动化、临床决策支持、药物发现、生物信息分析等专业领域中的应用,强调自动化工作流对科学探索的赋能。
- Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training(Meng Xiao, Xunxin Cai, Qingqing Long, Chengrui Wang, Yuanchun Zhou, Hengshu Zhu, 2025, ArXiv Preprint)
- Automating agentic collaborative ontology engineering with role-playing simulation of LLM-powered agents and RAG technology(Andreas Soularidis, Dimitrios Doumanas, Konstantinos Kotis, G. Vouros, 2025, The Knowledge Engineering Review)
- POLARIS: Typed Planning and Governed Execution for Agentic AI in Back-Office Automation(Zahra Moslemi, Keerthi Koneru, Yen-Ting Lee, Sheethal Kumar, Ramesh Radhakrishnan, 2026, ArXiv Preprint)
- AutoHMA-LLM: Efficient Task Coordination and Execution in Heterogeneous Multi-Agent Systems Using Hybrid Large Language Models(Tinging Yang, Ping Feng, Qixin Guo, Jindi Zhang, Xiufeng Zhang, Jiahong Ning, Xinghan Wang, Zhongyang Mao, 2025, IEEE Transactions on Cognitive Communications and Networking)
- BPMN DMN decision table generation based on agentic AI for critical applications(Sourour Meddeb, Selma Batti, Habib Fathallah, 2026, Business Process Management Journal)
- Invited: Polymath: Self-Improving Hierarchical Workflow for Multi-Domain Problem Solving(Chia-Tung Ho, Jingyang Gong, Haoyu Yang, Abhishek B. Akkur, Haoxing Ren, 2026, Proceedings of the 2026 International Symposium on Physical Design)
- Accelerating earth science discovery via multi-agent LLM systems.(Dmitrii Pantiukhin, Boris Shapkin, Ivan Kuznetsov, Antonia Anna Jost, Nikolay Koldunov, 2025, Frontiers in artificial intelligence)
- CRISPR-GPT for agentic automation of gene-editing experiments.(Yuanhao Qu, Kaixuan Huang, Ming Yin, Kanghong Zhan, Dyllan Liu, Di Yin, Henry C Cousins, William A Johnson, Xiaotong Wang, Mihir Shah, Russ B Altman, Denny Zhou, Mengdi Wang, Le Cong, 2026, Nature biomedical engineering)
- The (R)evolution of Scientific Workflows in the Agentic AI Era: Towards Autonomous Science(Woong Shin, Renan Souza, Daniel Rosendo, F. Suter, Feiyi Wang, Prasanna Balaprakash, R. Ferreira da Silva, 2025, Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis)
- From prompt to platform: an agentic AI workflow for healthcare simulation scenario design.(Federico Lorenzo Barra, Giovanna Rodella, Alessandro Costa, Antonio Scalogna, Luca Carenzo, Alice Monzani, Francesco Della Corte, 2025, Advances in simulation (London, England))
- 基于大模型的AI通话分析智能体研究与实现 - 汉斯出版社(Unknown Authors, Unknown Journal)
- 城市AI智能体与数字孪生指挥技术研究及应用 - 汉斯出版社(Unknown Authors, Unknown Journal)
- Personalizing prostate cancer education for patients using an EHR-Integrated LLM agent.(Yuexing Hao, Jason Holmes, Mark R Waddle, Brian J Davis, Nathan Y Yu, Kristin S Vickers, Heather Preston, Drew Margolin, Corinna E Löckenhoff, Aditya Vashistha, Saleh Kalantari, Marzyeh Ghassemi, Wei Liu, 2025, NPJ digital medicine)
- EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records(Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce C. Ho, Carl Yang, M. D. Wang, 2024, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing)
- Detection and diagnosis of diabetic retinopathy in retinal fundus images using agentic AI approaches.(R Sathya, A Valaramathi, 2025, Scientific reports)
- An Explainable Agentic AI Framework for Uncertainty-Aware and Abstention-Enabled Acute Ischemic Stroke Imaging Decisions(Md Rashadul Islam, 2026, ArXiv Preprint)
- Autonomous Agentic AI for Clinical Workflow Orchestration: Self-Managing Healthcare Operations(Arjun Warrier, Abhilash K S, 2025, 2025 6th International Conference on IoT Based Control Networks and Intelligent Systems (ICICNIS))
- Agentic AI Framework for End-to-End Medical Data Inference(Soorya Ram Shimgekar, Shayan Vassef, Abhay Goyal, Navin Kumar, Koustuv Saha, 2025, 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM))
- Agentic Surgical AI: Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion in a Vision-Language-Action Framework(Huixin Zhan, Jason H. Moore, 2025, No journal)
- LLM-based multi-agent system for neuro-ophthalmic diagnosis and personalized treatment planning.(Wenmiao Wang, 2025, Frontiers in neuroscience)
- Agentic AI for Cultural Heritage: Embedding Risk Memory in Semantic Digital Twins(Georgios Pavlidis, 2025, Computers)
- Agentic AI Quiz-Based Learning System: Enhancing MCQ Generation via Long-Context Cached Retrieval-Augmented Generation(Devananda Sreekanth, Sreekanth Gopi, N. Dehbozorgi, 2025, 2025 IEEE Frontiers in Education Conference (FIE))
- Agentic AI in radiology: emerging potential and unresolved challenges(Nicholas Dietrich, 2025, British Journal of Radiology)
- The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies.(Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, James Zou, 2025, Nature)
- Agent Laboratory: Using LLM Agents as Research Assistants(Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, E. Barsoum, 2025, Findings of the Association for Computational Linguistics: EMNLP 2025)
- Agentic Lab: An Agentic-physical AI system for cell and organoid experimentation and manufacturing.(Wenbo Wang, Simran Swain, Jaeyong Lee, Zuwan Lin, Bradley Canales, Almir Aljović, Yaxuan Liu, Qiang Li, Arnau Marin-Llobet, Mai Liu, Zihan Gao, Ren Liu, Juan R Alvarez-Dominguez, Jia Liu, 2025, bioRxiv : the preprint server for biology)
- A multi-agentic framework for real-time, autonomous freeform metasurface design(Robert Lupoiu, Yixuan Shao, Tianxiang Dai, Chenkai Mao, K. Edee, Jonathan A. Fan, 2025, Science Advances)
- Biomni: A General-Purpose Biomedical AI Agent.(Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, Di Yin, Shruti Marwaha, Jennefer N Carter, Xin Zhou, Matthew Wheeler, Jonathan A Bernstein, Mengdi Wang, Peng He, Jingtian Zhou, Michael Snyder, Le Cong, Aviv Regev, Jure Leskovec, 2025, bioRxiv : the preprint server for biology)
- An autonomous AI agent for universal behavior analysis.(Almir Aljović, Zuwan Lin, Wenbo Wang, Xinhe Zhang, Arnau Marin-Llobet, Ningyue Liang, Bradley Canales, Jaeyong Lee, Jongmin Baek, Ren Liu, Catherine Li, Na Li, Jia Liu, 2025, bioRxiv : the preprint server for biology)
- An Agentic Framework for Autonomous Metamaterial Modeling and Inverse Design(Darui Lu, J. Malof, Willie J. Padilla, 2025, ACS Photonics)
- ProteinMCP: An agentic AI framework for autonomous protein engineering.(Xiaopeng Xu, Chenjie Feng, Chao Zha, Wenjia He, Maolin He, Bin Xiao, Xin Gao, 2026, Protein science : a publication of the Protein Society)
- Empowering AI data scientists using a multi-agent LLM framework with self-evolving capabilities for autonomous, tool-aware biomedical data analyses.(Dechao Bu, Jingbo Sun, Kun Li, Zihao He, Wei Huang, Jinlin Hu, Shanshan Zhang, Shuangshuang Lei, Peipei Huo, Zhihao Wang, Sheng Wang, Tao Wang, Kai Gao, Yang Wu, Lianhe Zhao, Kai Wang, Gen Li, Huan Song, Yang Jin, Kang Zhang, Runsheng Chen, Yi Zhao, 2026, Nature biomedical engineering)
- An AI Agent for Fully Automated Multi-Omic Analyses.(Juexiao Zhou, Bin Zhang, Guowei Li, Xiuying Chen, Haoyang Li, Xiaopeng Xu, Siyuan Chen, Wenjia He, Chencheng Xu, Liwei Liu, Xin Gao, 2024, Advanced science (Weinheim, Baden-Wurttemberg, Germany))
- A multi-agent approach to neurological clinical reasoning.(Moran Sorka, Alon Gorenshtein, Dvir Aran, Shahar Shelly, 2025, PLOS digital health)
- Toward the Autonomous AI Doctor: Quantitative Benchmarking of an Autonomous Agentic AI Versus Board-Certified Clinicians in a Real World Setting(Hashim Hayat, Maksim Kudrautsau, E.A. Makarov, V. Melnichenko, Tim Tsykunou, Piotr Varaksin, Matt Pavelle, A. Oskowitz, 2025, medRxiv)
- ChatEED: An agentic retrieval assistant for accelerator operators(A. Reed, Claudio Bisegni, S. Shrestha, Michelle Huang, Daniel Ratner, 2025, Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis)
- LLM Agent Based Protein Function Prediction.(Fernando Zhapa-Camacho, Olga Mashkova, Robert Hoehndorf, Maxat Kulmanov, 2026, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing)
- From data silos to insights: the PRINCE multi-agent knowledge engine for preclinical drug development.(Carlos Henrique Vieira-Vieira, Sarang Sanjay Kulkarni, Adam Zalewski, Jobst Löffler, Jonas Münch, Annika Kreuchwig, 2025, Frontiers in artificial intelligence)
- Spike sorting AI agent.(Zuwan Lin, Arnau Marin-Llobet, Jongmin Baek, Yichun He, Jaeyong Lee, Wenbo Wang, Xinhe Zhang, Ariel J Lee, Ningyue Liang, Jin Du, Jie Ding, Na Li, Jia Liu, 2025, bioRxiv : the preprint server for biology)
- MolAgent: Biomolecular Property Estimation in the Agentic Era.(Jose Carlos Gómez-Tamayo, Joris Tavernier, Roy Aerts, Natalia Dyubankova, Dries Van Rompaey, Sairam Menon, Marvin Steijaert, Jörg Kurt Wegner, Hugo Ceulemans, Gary Tresadern, Hans De Winter, Mazen Ahmad, 2025, Journal of chemical information and modeling)
- A multimodal LLM-agent framework for personalized clinical decision-making in hepatocellular carcinoma.(Liyang Wang, Fa Tian, Chengquan Li, Jitao Wang, Jiahong Dong, Jiabin Cai, Shizhong Yang, Xiaobin Feng, 2025, Patterns (New York, N.Y.))
- Automatic biomarker discovery and enrichment with BRAD.(Joshua Pickard, Ram Prakash, Marc Andrew Choi, Natalie Oliven, Cooper Stansbury, Jillian Cwycyshyn, Nicholas Galioto, Alex Gorodetsky, Alvaro Velasquez, Indika Rajapakse, 2025, Bioinformatics (Oxford, England))
- CACTUS: Chemistry Agent Connecting Tool Usage to Science.(Andrew D McNaughton, Gautham Krishna Sankar Ramalaxmi, Agustin Kruel, Carter R Knutson, Rohith A Varikoti, Neeraj Kumar, 2024, ACS omega)
- LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research(Shuo Yan, Ruochen Li, Ziming Luo, Zimu Wang, Daoyang Li, Liqiang Jing, Kaiyu He, Peilin Wu, George Michalopoulos, Yue Zhang, Ziyang Zhang, Mian Zhang, Zhiyu Chen, Xinya Du, 2025, Conference on Empirical Methods in Natural Language Processing)
- LLM-based pedagogical agent for ICU simulation instructor training: A quasi-experimental study.(Jingbang Liu, Ting Chen, Shan Li, Yeru Xia, Hong Zhu, Ruijuan Wu, Qinli Cao, Xiaoyan Gong, Lili Wu, 2026, Nurse education today)
- Context-Aware Multi-Agent Architecture for Wildfire Insights.(Ashen Sandeep, Sithum Jayarathna, Sunera Sandaruwan, Venura Samarappuli, Dulani Meedeniya, Charith Perera, 2026, Sensors (Basel, Switzerland))
- Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG(Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag, 2024, Lecture Notes in Computer Science)
- Knowledge-Infused LLM-Powered Conversational Health Agent: A Case Study for Diabetes Patients.(Mahyar Abbasian, Zhongqi Yang, Elahe Khatibi, Pengfei Zhang, Nitish Nagesh, Iman Azimi, Ramesh Jain, Amir M Rahmani, 2024, Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference)
- Spatial transcriptomics AI agent charts hPSC-pancreas maturation(Zuwan Lin, Wenbo Wang, Arnau Marin-Llobet, Qiang Li, Samuel D Pollock, Xin Sui, Almir Aljovic, Jaeyong Lee, Jongmin Baek, Ningyue Liang, Xinhe Zhang, Connie Kangni Wang, Jiahao Huang, Mai Liu, Zihan Gao, Hao Sheng, Jin Du, Stephen J Lee, Brandon Wang, Yichun He, Jie Ding, Xiao Wang, Juan R Alvarez-Dominguez, Jia Liu, 2025, bioRxiv : the preprint server for biology)
- An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation.(Mohammad Feli, Iman Azimi, Pasi Liljeberg, Amir M Rahmani, 2025, Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference)
- Med.ai ASK: an agentic system for biomedical question answering(Nhung T. H. Nguyen, D. Lituiev, Zhimin Liu, A. Kashyap, Garrett Jenkinson, Kevin Kuhl, Christopher Corrado, Naisargi Manishkumar Patel, Kirti Snigdha, Sirwe Saeedi, David Smith, Nicholas Baro, T. Schultz, 2026, Journal of the American Medical Informatics Association)
- MedAgentBench v2: Improving Medical LLM Agent Design.(Eric Chen, Sam Postelnik, Kameron Black, Yixing Jiang, Jonathan H Chen, 2026, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing)
- Answering real-world clinical questions using large language model, retrieval-augmented generation, and agentic systems(Y. Low, M. Jackson, Rebecca J. Hyde, Robert E. Brown, Neil M. Sanghavi, Julian D Baldwin, C. Pike, Jananee Muralidharan, Gavin Hui, Natasha Alexander, H. Hassan, Rahul Nene, Morgan Pike, Courtney J. Pokrzywa, Shivam Vedak, A. Yan, Dong-han Yao, A. Zipursky, Christina Dinh, Philip Ballentine, D. Derieg, Vladimir Polony, Rehan N. Chawdry, Jordan Davies, Brigham B. Hyde, Nigam H. Shah, S. Gombar, 2025, DIGITAL HEALTH)
- Talk2Biomodels: AI agent-based open-source LLM initiative for kinetic biological models.(Lilija Wehling, Gurdeep Singh, Ahmad Wisnu Mulyadi, Rakesh Hadne Sreenath, Henning Hermjakob, Tung V N Nguyen, Thomas Rückle, Mohammed H Mosa, Henrik Cordes, Tommaso Andreani, Thomas Klabunde, Rahuman S Malik Sheriff, Douglas McCloskey, 2025, BMC bioinformatics)
- Hierarchical agent reflection for aligning LLM reasoning with clinical diagnostic processes.(Xinda Wang, Xiaotong Li, Deng Zhao, Kehua Feng, Lei Liang, Zhiqiang Zhang, Keyan Ding, Huajun Chen, Bo Wan, Qiang Zhang, 2026, Health information science and systems)
工业运维、基础设施治理与企业级自动化
该组文献集中在智能体于企业 IT、云服务、工业自动化、DevOps 及网络安全等场景的部署,探讨了如何通过策略协同实现工业流程监控与安全治理。
- Autonomous agents in the cloud: Advancing application management with agentic AI(Vamsi Krishna, Kumar Karanam, 2025, World Journal of Advanced Research and Reviews)
- 基于Coze平台与DeepSeek大模型的智能辅助教学智能体 - 汉斯出版社(Unknown Authors, Unknown Journal)
- Deployed AI Agents for Industrial Asset Management: CodeReAct Framework for Event Analysis and Work Order Automation(Nianjun Zhou, Dhaval Patel, A. Bhattacharyya, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- From visual question answering to intelligent AI agents in ophthalmology.(Xiaolan Chen, Ruoyu Chen, Pusheng Xu, Xiaojie Wan, Weiyi Zhang, Bingjie Yan, Xianwen Shang, Mingguang He, Danli Shi, 2025, The British journal of ophthalmology)
- Explainable Agentic AI for Big Data-Driven Evaluation and Visual Analytics of Digital Literacy in Higher Vocational Teacher Education.(Wen Shao, 2026, Big data)
- ReAct Modular Agent: Orchestrating Tool-Use and Retrieval for Financial Workflows(Armando Hernandez, Victor Sabbia, Santiago Pérez, 2025, Proceedings of International Conference on Intelligent Systems and New Applications)
- Beyond Single Systems: How Multi-Agent AI Is Reshaping Ethics in Radiology.(Sara Salehi, Yashbir Singh, Parnian Habibi, Bradley J Erickson, 2025, Bioengineering (Basel, Switzerland))
- Agentic Observability: Automated Alert Triage for Adobe E-Commerce(Aprameya Bharadwaj, Kyle Tu, 2026, ArXiv Preprint)
- Agentic AI and Large Language Models in Radiology: Opportunities and Hallucination Challenges.(Sara Salehi, Yashbir Singh, Kelly K Horst, Quincy A Hathaway, Bradley J Erickson, 2025, Bioengineering (Basel, Switzerland))
- OpenFOAMGPT: A retrieval-augmented large language model (LLM) agent for OpenFOAM-based computational fluid dynamics(Sandeep Pandey, Ran Xu, Wenkang Wang, Xu Chu, 2025, Physics of Fluids)
- Synthetic Data Generation Using CTGAN with Agentic Workflows and Retrieval-Augmented Generation(S. K C, Maria George Anthraper, Kusuma Sanjaykumar, S. Kumari, U. D., 2025, International Conference on AI Research)
- TelAgentBench: A Multi-faceted Benchmark for Evaluating LLM-based Agents in Telecommunications(Sunwoo Lee, Daseong Jang, Dhammiko Arya, Gyoung-eun Han, Injee Song, Saerom Kim, Sang-Ju Kim, Seojin Lee, Seokyoung Hong, Sereimony Sek, Seung-Mo Cho, Sohee Park, Sungbin Yoon, Wonbeom Jang, Eric Davis, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track)
- LLM-POWERED RAN AUTOMATION USING RETRIEVAL-AUGMENTED GENERATION (RAG)(Venkat Chintha, 2025, International Journal of Applied Mathematics)
- 基于大语言模型的财务报告指标抽取智能体方法 - 汉斯出版社(Unknown Authors, Unknown Journal)
- 基于大语言模型的银行智能语音助手 - 汉斯出版社(Unknown Authors, 2025, Unknown Journal)
- 基于精神科护理伦理冲突事件学习微场景专属智能体的开发研究(Unknown Authors, Unknown Journal)
- AI智能体在国内医学教育领域的研究进展 - 汉斯出版社(Unknown Authors, Unknown Journal)
- Agentic AI Workflows in Cybersecurity: Opportunities, Challenges, and Governance via the MCP Model(Sri Keerthi Suggu, 2025, Journal of Information Systems Engineering and Management)
- AI Augmented CI/CD Pipelines: From Code Commit to Production with Autonomous Decisions(Mohammad Baqar, Saba Naqvi, Rajat Khanda, 2025, 2025 3rd International Conference on Foundation and Large Language Models (FLLM))
- OrcaLoca: An LLM Agent Framework for Software Issue Localization(Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, Jishen Zhao, 2025, International Conference on Machine Learning)
- Agentic AI for Autonomous Micro-Frontend User Interfaces and Microservices Evolution in Cloud Platforms(Jyoti Kunal, Jyoti Kunal Shah, 2025, Journal of Computer Science and Technology Studies)
- Agentic AI for Intent-Based Industrial Automation(Marcos Lima Romero, Ricardo Suyama, 2025, 2025 16th IEEE International Conference on Industry Applications (INDUSCON))
- Exploring LLM-Based Multi-Agent Situation Awareness for Zero-Trust Space-Air-Ground Integrated Network(Xinye Cao, Gu Nan, Hongcan Guo, Hanqing Mu, Longxiang Wang, Yihan Lin, Qinchuan Zhou, Jiayi Li, Baohua Qin, Qimei Cui, Xiaofeng Tao, He Fang, Haitao Du, Tony Q. S. Quek, 2025, IEEE Journal on Selected Areas in Communications)
- Task Offloading with LLM-Enhanced Multi-Agent Reinforcement Learning in UAV-Assisted Edge Computing.(Feifan Zhu, Fei Huang, Yantao Yu, Guojin Liu, Tiancong Huang, 2024, Sensors (Basel, Switzerland))
- Autonomic Microservice Management via Agentic AI and MAPE-K Integration(Matteo Esposito, Alexander Bakhtin, Noman Ahmad, Mikel Robredo, Ruoyu Su, Valentina Lenarduzzi, Davide Taibi, 2025, European Conference on Software Architecture)
- A cybersecurity AI agent selection and decision support framework(Masike Malatji, 2025, ArXiv Preprint)
- Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance(Jacopo Tagliabue, Federico Bianchi, Ciro Greco, 2025, ArXiv Preprint)
- Autonomous Industrial Control using an Agentic Framework with Large Language Models(Javal Vyas, Mehmet Mercangöz, 2024, IFAC-PapersOnLine)
- SBOMs into Agentic AIBOMs: Schema Extensions, Agentic Orchestration, and Reproducibility Evaluation(Petar Radanliev, Carsten Maple, Omar Santos, Kayvan Atefi, 2026, ArXiv Preprint)
- Task-Oriented Communications for Agentic IoT: An LLM-Driven QoS/Security Policy Generation via Dynamic Model Context Protocol(Shuaishuai Guo, Jiabing Zhu, Jia Ye, Anbang Zhang, Geyong Min, 2025, 2025 Seventeenth International Conference on Wireless Communications and Signal Processing (WCSP))
- Data Systems as Autonomous Agents: Applying Agentic AI to Big Data Platforms(Narendra Reddy Mudiyala, 2026, Journal of Information Systems Engineering and Management)
- Enterprise API & Platform Strategy in the era of Agentic AI(Ashay Satav, 2025, Journal of Computer Science and Technology Studies)
社会认知、人机交互与具身感知智能
这些文献探讨智能体在社会模拟、教育辅导、城市交通、具身智能及人机协作中的应用,重点关注用户体验、意图理解及动态适应性。
- Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation(Jiawei Wang, Renhe Jiang, Chuang Yang, Zengqing Wu, Makoto Onizuka, Ryosuke Shibasaki, Chuan Xiao, 2024, Neural Information Processing Systems)
- ReAct-Diffuse: An Integrated Agentic and Generative Diffusion Framework for Autonomous Multi-Step Task Reasoning and Execution(Madhur Thapliyal, Geetika Sharma, Raman Sharma, 2026, 2026 8th International Conference on Intelligent Sustainable Systems (ICISS))
- ELLMA-T: an Embodied LLM-agent for Supporting English Language Learning in Social VR(Mengxue Pan, Alexandra Kitson, Hongyu Wan, Mirjana Prpa, 2024, Proceedings of the 2025 ACM Designing Interactive Systems Conference)
- 多模态大模型驱动的空间智能:技术进展,评估体系与未来挑战(Unknown Authors, Unknown Journal)
- A4FN: an Agentic AI Architecture for Autonomous Flying Networks(André Coelho, Pedro Ribeiro, Helder Fontes, Rui Campos, 2025, 2025 IEEE 36th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC))
- LLM Agents for Smart City Management: Enhancing Decision Support Through Multi-Agent AI Systems(A. Kalyuzhnaya, Sergey Mityagin, E. Lutsenko, Andrey Getmanov, Yaroslav Aksenkin, Kamil Fatkhiev, Kirill Fedorin, Nikolay O. Nikitin, Natalia Chichkova, V. Vorona, A. Boukhanovsky, 2025, Smart Cities)
- Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling(Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl, 2025, Results in Engineering)
- Edge Agentic AI Framework for Autonomous Network Optimisation in O-RAN(Abdelaziz Salama, Zeinab Nezami, Mohammed M. H. Qazzaz, Maryam Hafeez, S. A. R. Zaidi, 2025, 2025 IEEE 36th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC))
- Intent-Based Infrastructure and Service Orchestration Using Agentic-AI(Dimitrios Brodimas, Alexios N. Birbas, Dimitrios Kapolos, S. Denazis, 2025, IEEE Open Journal of the Communications Society)
- An LLM-based Agentic Framework for Accessible NetworkControl(Samuel Lin, Jiawei Zhou, Minlan Yu, 2025, ACM SIGMETRICS Performance Evaluation Review)
- Integrating visual large language model and reasoning chain for driver behavior analysis and risk assessment.(Kunpeng Zhang, Shipu Wang, Ning Jia, Liang Zhao, Chunyang Han, Li Li, 2024, Accident; analysis and prevention)
- From reviews to real-time: dynamic evidence in dentistry.(A V Gavrilova, C Galli, 2026, Evidence-based dentistry)
- Research on risk decision-making generation method for water conservancy project based on multimodal knowledge graph and large language model.(Libo Yang, Yuan Li, Junhua Tan, Libo Mao, 2025, PloS one)
- An Agentic Framework for Social Event Forecasting: Approaches Using Causality Contextualized Chain of Thought(A. Thakur, Aditya Sampath, Siddharth Krishnan, 2025, 2025 IEEE International Conference on Data Mining Workshops (ICDMW))
- 矿业工程专业专门用途英语大模型智能体“Mining Lingua”的研发及其 ...(Unknown Authors, Unknown Journal)
- UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design(Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Laurence Li, Jiri Gesi, Qi He, T. Li, Dakuo Wang, 2025, Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems)
- An Agentic AI-based Multi-Agent Framework for Recommender Systems(I. Portugal, Paulo S. C. Alencar, Donald D. Cowan, 2024, 2024 IEEE International Conference on Big Data (BigData))
- The Accidental Pump and Dump: When Agentic AI Meets Autonomous Trading(David Byrd, 2025, Proceedings of the 6th ACM International Conference on AI in Finance)
- DocAgent: An Agentic Framework for Multi-Modal Long-Context Document Understanding(Li Sun, Liu He, S. Jia, Yangfan He, Chenyu You, 2025, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing)
- LightVA: Lightweight Visual Analytics With LLM Agent-Based Task Planning and Execution(Yuheng Zhao, Junjie Wang, Linbing Xiang, Xiaowen Zhang, Zifei Guo, C. Turkay, Yu Zhang, Siming Chen, 2024, IEEE Transactions on Visualization and Computer Graphics)
- AI-powered consumer segmentation and targeting: A theoretical framework for precision marketing by autonomous (Agentic) AI(Arunraju Chinnaraju, 2025, International Journal of Science and Research Archive)
- TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning(Chuang Jiang, Mingyue Cheng, Xiaoyu Tao, Qingyang Mao, Ouyang Jie, Qi Liu, 2025, Web Search and Data Mining)
- SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code(Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, Alireza Fathi, 2024, International Conference on Machine Learning)
- Robots can feel: LLM-based Framework for Robot Ethical Reasoning(Artem Lykov, Miguel Altamirano Cabrera, Koffivi Fidele Gbagbe, Dzmitry Tsetserukou, 2024, 2024 2nd International Conference on Foundation and Large Language Models (FLLM))
- 英语教育智能体:协作学习的设计与赋能(Unknown Authors, Unknown Journal)
- An LLM-Powered Agent for Real-Time Analysis of the Vietnamese IT Job Market(Minh-Thuan Nguyen, T. Vo-Thanh, Thai-Duy Dinh, Xuan-Quang Phan, Tan-Ha Mai, L. Lê, 2025, 2025 19th International Conference on Advanced Computing and Analytics (ACOMPA))
- Thematic-LM: A LLM-based Multi-agent System for Large-scale Thematic Analysis(Tingrui Qiao, Caroline Walker, Chris Cunningham, Yun Sing Koh, 2025, Proceedings of the ACM on Web Conference 2025)
- AI-Based Application for Task Management and Scheduling Student Activity(Bintang Nuralamsyah, Umi Laili Yuhana, Anny Yuniarti, Muhammad Rifqi Ma’ruf, Firania Putri Harsanti, Faiz Kautsar, Pelangi Masita Wati, 2025, 2025 15th International Conference on Information & Communication Technology and System (ICTS))
- 基于大语言模型的教育智能体个性化学习应用理论研究 - 汉斯出版社(Unknown Authors, Unknown Journal)
- 基于智能体的在地化农产品品牌视觉设计策略研究 - 汉斯出版社(Unknown Authors, Unknown Journal)
- Agentic AI in Higher Education: A Low-Code Framework for Administrative Automation and Strategic Oversight(Hossam Daoud, A. Ragab, Mohamed A. Ragheb, Passent Tantawi, 2025, المجلة العربية للإدارة)
- 生成式人工智能数据安全风险及刑法应对 - 汉斯出版社(Unknown Authors, Unknown Journal)
- ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?(Haoxin Wang, Xianhan Peng, Xucheng Huang, Yizhe Huang, Ming Gong, Chenghan Yang, Yang Liu, Ling Jiang, 2025, Conference on Empirical Methods in Natural Language Processing)
- 基于AI Agent与Jobs-to-Be-Done理论框架的研究 - 汉斯出版社(Unknown Authors, Unknown Journal)
- Elicitron: An LLM Agent-Based Simulation Framework for Design Requirements Elicitation(Mohammadmehdi Ataei, Hyunmin Cheong, Daniele Grandi, Ye Wang, Nigel Morris, Alexander Tessier, 2024, Journal of Computing and Information Science in Engineering)
- 构建面向AGI时代的开源IoT AI智能体架构与实践 - 汉斯出版社(Unknown Authors, Unknown Journal)
- ProactiveVA: Proactive Visual Analytics with LLM-Based UI Agent.(Yuheng Zhao, Xueli Shu, Liwen Fan, Lin Gao, Yu Zhang, Siming Chen, 2026, IEEE transactions on visualization and computer graphics)
- Cognitive Agents in Urban Mobility: Integrating LLM Reasoning into Multi-Agent Simulations.(Christian Calderón, Pasqual Martí, Jaume Jordán, Javier Palanca, Vicente Julian, 2025, Sensors (Basel, Switzerland))
- Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems(Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, Qingyun Wu, 2025, International Conference on Machine Learning)
- LLM Agent for Hyper-Parameter Optimization(Wanzhe Wang, Jianqiu Peng, Menghao Hu, Wei-chao Zhong, Tong Zhang, Shuai Wang, Yixin Zhang, Mingjie Shao, Wanli Ni, 2025, 2025 IEEE/CIC International Conference on Communications in China (ICCC Workshops))
- Multi-tool Integration Application for Math Reasoning Using Large Language Model(Zhihua Duan, Jialin Wang, 2024, 2024 IEEE 10th International Conference on Edge Computing and Scalable Cloud (EdgeCom))
- RALLM-POI: Retrieval-Augmented LLM for Zero-shot Next POI Recommendation with Geographical Reranking(Kunrong Li, Kwan Hui Lim, 2025, Pacific Rim International Conference on Artificial Intelligence)
代理安全、隐私与信任评估
专门针对智能体运行时的安全挑战,包括隐私风险提取、对抗攻击防御、恶意行为监测、信任验证以及在复杂基础设施下的安全性保障。
- CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent(Liang-bo Ning, Shijie Wang, Wenqi Fan, Qing Li, Xin Xu, Hao Chen, Feiran Huang, 2024, Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
- Structured Elicitation Primitives for Reliable Multi-Agent Delegation and Recursive Planning(S. Karthik, Kota, 2025, British Journal of Multidisciplinary Studies)
- The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework(Feiran Liu, Yuzhe Zhang, Xinyi Huang, Yinan Peng, Xinfeng Li, Lixu Wang, Yutong Shen, Ranjie Duan, Simeng Qin, Xiaojun Jia, Qingsong Wen, Wei Dong, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Towards Trustworthy Multi-Turn LLM Agents via Behavioral Guidance(Gonca Gürsun, 2025, ArXiv Preprint)
- Real-Time Trust Verification for Safe Agentic Actions using TrustBench(Tavishi Sharma, Vinayak Sharma, Pragya Sharma, 2026, ArXiv Preprint)
- Uncertainty Propagation on LLM Agent(Qiwei Zhao, Dong Li, Yanchi Liu, Wei Cheng, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Huaxiu Yao, Chen Zhao, Haifeng Chen, Xujiang Zhao, 2025, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
- Agentic AI Meets Edge Computing in Autonomous UAV Swarms(Thuan Minh Nguyen, V. T. Truong, Long Bao Le, 2026, IEEE Internet of Things Magazine)
- Unveiling Privacy Risks in LLM Agent Memory(Bo Wang, Weiyi He, Pengfei He, Shenglai Zeng, Zhen Xiang, Yue Xing, Jiliang Tang, 2025, Annual Meeting of the Association for Computational Linguistics)
- Red-Teaming LLM Multi-Agent Systems via Communication Attacks(Pengfei He, Yuping Lin, Shen Dong, Han Xu, Yue Xing, Hui Liu, 2025, Annual Meeting of the Association for Computational Linguistics)
- Agent S: An Open Agentic Framework that Uses Computers Like a Human(Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang, 2024, International Conference on Learning Representations)
- Agentic AI for Self-Sovereign Identity: A Decentralized Zero Trust Framework for Autonomous Microservices(Damodhara Reddy Palavali, 2025, The International Journal of Computational Mathematical Ideas)
- Toward Agentic AI Networking in 6G: A Generative Foundation Model-as-Agent Approach(Yong Xiao, Guangming Shi, Ping Zhang, 2025, IEEE Communications Magazine)
本报告对 Agent、AI 及 LLM 研究进行了深度整合与分类,构建了涵盖底层理论、应用开发与安全治理的综合框架。研究呈现出明显的范式转变:从基础的 Agentic RAG 框架进化至复杂的科学自动化工作流;从单一的任务执行向多智能体协同协作演进;从通用架构设计走向各行业(如生物医药、工业运维、社会科学)的垂直落地。同时,随着智能体应用复杂性的提升,针对隐私安全、行为验证及人机信任评估的研究已成为不可或缺的基石,标志着 AI Agent 正步入高度自主与可控并行发展的工业化应用阶段。
总计242篇相关文献
该智能体依托联通云犀平台与元景大模型,构建了“智能体实时动态调度CoE (Collaboration of Experts)”引擎和AI通话分析智能体。CoE引擎通过任务规划与多模型混合调度机制, ...
本系统是基于国产的DeepSeek大模型,结合Coze平台,采用了“AI + 教育专家”双轮驱动的手段构建了一套面向教育的教学辅助智能体。智能体设计了四大核心功能模块:智能备课、 ...
大模型驱动的教育智能体以多模态信息感知、智能推理决策和动态执行作为其核心技术,形成了“感知–决策–行动”的逻辑闭环。其中,大语言模型是教育智能体推理能力的核心,使其 ...
在自建的银行年报问答数据集(BAR-QA)上的实验结果表明,LedgerLens在指标抽取任务中取得了94.1%的F1分数,并在大多数任务上实现了领先表现。研究结果证明,引入基于智能体的 ...
基于大语言模型的智能体(以下简称“大模型智能体”)具有检索增强生成、推理与规划、交互与进化等核心功能[7],可有效避免通用大模型常见的“幻觉”、短期记忆等问题,通过调用 ...
在面向AGI的AIoT系统中,大语言模型(LLM)作为核心的认知与推理引擎,为AI智能体提供语义理解、任务规划、自然交互等高阶能力。考虑到不同应用场景对延迟、算力和网络的要求 ...
大模型智能体功能测试用例. 大模型智能体功能是系统的关键模块,主要包括智能体人设及回复逻辑、工作流、插件和知识库检索。下面测试用例覆盖了模块的全部功能,确保智能 ...
前端交互层为客户端应用,后端为以大语言模型为基础构建的银行智能语音智能体,主要包括语音交互引擎、对话引擎(本地化LLM)、银行知识库、银行业务系统。
本文系统梳理了多模态大模型在三维视觉理解,空间感知与推理,具身交互等方面的技术演进路径,重点分析了以视频,深度图,点云等多源异构数据为基础的空间表征方法,并归纳了当前 ...
健康教育领域基于大语言模型的多智能体系统通过知识普及智能体、健康计划智能体、沟通协调智能体等模块协同工作,实现患者教育全流程智能管理[21]。老年尿路结石患者 ...
本研究中的“智能体”指基于多模态大模型的人工智能系统,具备语义理解、视觉生成及用户反馈学习功能。与传统生成式AI不同,智能体可在循环交互中持续调整输出。然而,其在深层 ...
生成式人工智能(GenAI)与大语言模型(LLMs)推动英语教育从CALL向ICALL深度变革,英语教育智能体(EEA)为缓解协作学习交互浅表化、反馈滞后等痛点提供技术解决方案。
摘要: 针对传统指挥系统在复杂应急场景下面临的信息孤岛化、人工依赖度高、智能决策与物理执行脱节等问题,本文提出并构建了一种“AI智能体具身孪生指挥系统”。
本文基于AI Agent与Jobs-to-Be-Done理论框架,探讨数字经济时代电子商务用户决策的经济逻辑.研究针对传统用户行为分析方法的局限性,创新性构建融合行为经济学与智能技术 ...
... 智能体能够自主决定是否实施某种行为的能力。在判断时可以借鉴人类行为能力的规定,结合人工智能体的技术特点进行综合考量。例如,通过模拟不同的场景和情境,观察人工智能体 ...
摘要: 目的:开发基于智能体技术的“伦理沙盒”模拟训练系统,以解决精神科护理中常见的伦理冲突问题,切实提高护理专业学生的伦理决策能力。方法:系统构建结构化精神科伦理冲突 ...
Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications. They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations, introducing new privacy risks for LLM agents. In this work, we systematically investigate the vulnerability of LLM agents to our proposed Memory EXTRaction Attack (MEXTRA) under a black-box setting. To extract private information from memory, we propose an effective attacking prompt design and an automated prompt generation method based on different levels of knowledge about the LLM agent. Experiments on two representative agents demonstrate the effectiveness of MEXTRA. Moreover, we explore key factors influencing memory leakage from both the agent designer's and the attacker's perspectives. Our findings highlight the urgent need for effective memory safeguards in LLM agent design and deployment.
Recent developments in Large Language Model (LLM) agents are revolutionizing Autonomous Software Engineering (ASE), enabling automated coding, problem fixes, and feature improvements. However, localization -- precisely identifying software problems by navigating to relevant code sections -- remains a significant challenge. Current approaches often yield suboptimal results due to a lack of effective integration between LLM agents and precise code search mechanisms. This paper introduces OrcaLoca, an LLM agent framework that improves accuracy for software issue localization by integrating priority-based scheduling for LLM-guided action, action decomposition with relevance scoring, and distance-aware context pruning. Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.
This work presents a large language model (LLM)-based agent OpenFOAMGPT tailored for OpenFOAM-centric computational fluid dynamics (CFD) simulations, leveraging two foundation models from OpenAI: the GPT-4o (GPT means Generative Pre-trained Transformer) and a chain-of-thought–enabled o1 preview model. Both agents demonstrate success across multiple tasks. While the price of token with o1 model is six times as that of GPT-4o, it consistently exhibits superior performance in handling complex tasks, from zero-shot/few-shot case setup to boundary condition modifications, zero-shot turbulence model adjustments, and zero-shot code translation. Through an iterative correction loop, the agent efficiently addressed single-phase and multiphase flow, heat transfer, Reynolds-averaged Navier–Stokes modeling, large eddy simulation, and other engineering scenarios, often converging in a limited number of iterations at low token costs. To embed domain-specific knowledge, we employed a retrieval-augmented generation pipeline, demonstrating how preexisting simulation setups can further specialize the agent for subdomains such as energy and aerospace. Despite the great performance of the agent, human oversight remains crucial for ensuring accuracy and adapting to shifting contexts. Fluctuations in model performance over time suggest the need for monitoring in mission-critical applications. Although our demonstrations focus on OpenFOAM, the adaptable nature of this framework opens the door to developing LLM-driven agents into a wide range of solvers and codes. By streamlining CFD simulations, this approach has the potential to accelerate both fundamental research and industrial engineering advancements.
Code auditing is the process of reviewing code with the aim of identifying bugs. Large Language Models (LLMs) have demonstrated promising capabilities for this task without requiring compilation, while also supporting user-friendly customization. However, auditing a code repository with LLMs poses significant challenges: limited context windows and hallucinations can degrade the quality of bug reports, and analyzing large-scale repositories incurs substantial time and token costs, hindering efficiency and scalability. This work introduces an LLM-based agent, RepoAudit, designed to perform autonomous repository-level code auditing. Equipped with agent memory, RepoAudit explores the codebase on demand by analyzing data-flow facts along feasible program paths within individual functions. It further incorporates a validator module to mitigate hallucinations by verifying data-flow facts and checking the satisfiability of path conditions associated with potential bugs, thereby reducing false positives. RepoAudit detects 40 true bugs across 15 real-world benchmark projects with a precision of 78.43%, requiring on average only 0.44 hours and $2.54 per project. Also, it detects 185 new bugs in high-profile projects, among which 174 have been confirmed or fixed. We have open-sourced RepoAudit at https://github.com/PurCL/RepoAudit.
We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholar queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.
Usability testing is a fundamental yet challenging research method for user experience (UX) researchers to evaluate a web design. Recent advances in Large Language Model-simulated Agent (LLM Agent) research inspired us to design UXAgent to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human-subject study. Our system features an LLM Agent module and a universal browser connector module so that UX researchers can automatically generate thousands of simulated users to test the target website. The system can generate UX study results in qualitative (e.g., interviewing how an agent thinks), quantitative (e.g., # of actions), and video recording formats for UX researchers to analyze. Through a heuristic user evaluation with five UX researchers, participants praised the innovation of our system but also expressed concerns about the future of UX study with LLM Agents1.
Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle to address long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. We finally leverage these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.
Despite recent progress in generating hardware register transfer level (RTL) code with large language models (LLMs), existing solutions still suffer from a substantial gap between practical application scenarios and the requirements of real-world RTL code development. Prior approaches either focus on overly simplified hardware descriptions or depend on extensive human guidance to process complex specifications, limiting their scalability and automation potential. In this paper, we address this gap by proposing an LLM agent system, termed Spec2RTL-Agent, designed to directly process complex specification documentation and generate corresponding RTL code implementations, advancing LLM-based RTL code generation toward more realistic application settings. To achieve this goal, Spec2RTL-Agent introduces a novel multi-agent collaboration framework that integrates three key enablers: (1) a reasoning and understanding module that translates specifications into structured, step-by-step implementation plans; (2) a progressive coding and prompt optimization module that iteratively refines the code across multiple representations (pseudocode, Python, and C++) to enhance correctness and synthesisability for RTL conversion; and (3) an adaptive reflection module that identifies and traces the source of errors during generation, ensuring a more robust code generation flow. Instead of directly generating RTL from natural language, our system strategically generates synthesizable C++ code, which is then optimized for high-level synthesis (HLS). This agent-driven refinement ensures greater correctness and compatibility compared to naive direct RTL generation approaches. We evaluate Spec2RTL-Agent on a benchmark of three specification documents, demonstrating its effectiveness in generating accurate RTL code with as much as 75% fewer human interventions compared to existing approaches. These results underscore Spec2RTL-Agent’s role as the first fully automated multi-agent system for RTL generation from unstructured specification documents, reducing the reliance on human effort and expertise in hardware design.
In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agent with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. The code and data have been made publicly available at https://github.com/XiaoduoAILab/ECom-Bench to facilitate further research and development in this domain.
Large Language Model (LLM)-based agents have shown effectiveness across many applications. However, their use in data science scenarios requiring solving long-term interconnected tasks, dynamic data adjustments and domain expertise remains challenging. Previous approaches primarily focus on individual tasks, making it difficult to assess the complete data science workflow. Moreover, they struggle to handle real-time changes in intermediate data and fail to adapt dynamically to evolving task dependencies inherent to data science problems. In this paper, we present Data Interpreter, an LLM-based agent designed to automatically solve various data science problems end-to-end. Our Data Interpreter incorporates two key modules: 1) Hierarchical Graph Modeling, which breaks down complex problems into manageable subproblems, enabling dynamic node generation and graph optimization; and 2) Programmable Node Generation, a technique that refines and verifies each subproblem to iteratively improve code generation results and robustness. Extensive experiments consistently demonstrate the superiority of Data Interpreter. On InfiAgent-DABench, it achieves a 25% performance boost, raising accuracy from 75.9% to 94.9%. For machine learning and open-ended tasks, it improves performance from 88% to 95%, and from 60% to 97%, respectively. Moreover, on the MATH dataset, Data Interpreter achieves remarkable performance with a 26% improvement compared to state-of-the-art baselines. The code is available at https://github.com/geekan/MetaGPT.
This paper introduces SceneCraft, a Large Language Model (LLM) Agent converting text descriptions into Blender-executable Python scripts which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. SceneCraft then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models like GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and guiding a video generative model with generated scenes as intermediary control signal.
No abstract available
This paper introduces a novel approach using Large Language Models (LLMs) integrated into an agent framework for flexible and effective personal mobility generation. LLMs overcome the limitations of previous models by effectively processing semantic data and offering versatility in modeling various tasks. Our approach addresses three research questions: aligning LLMs with real-world urban mobility data, developing reliable activity generation strategies, and exploring LLM applications in urban mobility. The key technical contribution is a novel LLM agent framework that accounts for individual activity patterns and motivations, including a self-consistency approach to align LLMs with real-world activity data and a retrieval-augmented strategy for interpretable activity generation. We evaluate our LLM agent framework and compare it with state-of-the-art personal mobility generation approaches, demonstrating the effectiveness of our approach and its potential applications in urban mobility. Overall, this study marks the pioneering work of designing an LLM agent framework for activity generation based on real-world human activity data, offering a promising tool for urban mobility analysis.
Recent advancements in Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks. However, current research largely relies on manual, task-specific design, limiting their adaptability to novel tasks. In this paper, we introduce a new research problem: Modularized LLM Agent Search (MoLAS). We propose a modular design space that abstracts existing LLM agent designs into four fundamental modules with uniform IO interface: Planning, Reasoning, Tool Use, and Memory. Building on this design space, we present a novel LLM agent search framework called AgentSquare, which introduces two core mechanisms, i.e., module evolution and recombination, to efficiently search for optimized LLM agents. To further accelerate the process, we design a performance predictor that uses in-context surrogate models to skip unpromising agent designs. Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. Moreover, AgentSquare can generate interpretable design insights, enabling a deeper understanding of agentic architecture and its impact on task performance. We believe that the modular design space and AgentSquare search framework offer a platform for fully exploiting the potential of prior successful designs and consolidating the collective efforts of research community. Code repo is available at https://github.com/tsinghua-fib-lab/AgentSquare.
Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state-of-the-art LLMs, evaluating the accuracy of unit tests and performing LLM-based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents'ability to autonomously reproduce scientific research
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete various tasks in a self-directed fashion. The challenge of tool use demands that LLMs not only understand user queries and generate answers accurately but also excel in task planning, tool invocation, and result summarization. While traditional works focus on training a single LLM with all these capabilities, performance limitations become apparent, particularly with smaller models. To overcome these challenges, we propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer. Each component is implemented by a single LLM that focuses on a specific capability and collaborates with others to accomplish the task. This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability. To effectively train this framework, we introduce a two-stage training paradigm. First, we fine-tune a backbone LLM on the entire dataset without discriminating sub-tasks, providing the model with a comprehensive understanding of the task. Second, the fine-tuned LLM is used to instantiate the planner, caller, and summarizer respectively, which are continually fine-tuned on respective sub-tasks. Evaluation across various tool-use benchmarks illustrates that our proposed multi-LLM framework surpasses the traditional single-LLM approach, highlighting its efficacy and advantages in tool learning.
The literature review is an indispensable step in the research process. It provides the benefit of comprehending the research problem and understanding the current research situation while conducting a comparative analysis of prior works. However, literature summary is challenging and time consuming. The previous LLM-based studies on literature review mainly focused on the complete process, including literature retrieval, screening, and summarization. However, for the summarization step, simple CoT method often lacks the ability to provide extensive comparative summary. In this work, we firstly focus on the independent literature summarization step and introduce ChatCite, an LLM agent with human workflow guidance for comparative literature summary. This agent, by mimicking the human workflow, first extracts key elements from relevant literature and then generates summaries using a Reflective Incremental Mechanism. In order to better evaluate the quality of the generated summaries, we devised a LLM-based automatic evaluation metric, G-Score, in refer to the human evaluation criteria. The ChatCite agent outperformed other models in various dimensions in the experiments. The literature summaries generated by ChatCite can also be directly used for drafting literature reviews.
The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it difficult to create a reliable, universal test execution method that works across different projects. This paper presents ExecutionAgent, an automated technique that prepares scripts for building an arbitrary project from source code and running its test cases. Inspired by the way a human developer would address this task, our approach is a large language model (LLM)-based agent that autonomously executes commands and interacts with the host system. The agent uses meta-prompting to gather guidelines on the latest technologies related to the given project, and it iteratively refines its process based on feedback from the previous steps. Our evaluation applies ExecutionAgent to 50 open-source projects that use 14 different programming languages and many different build and testing tools. The approach successfully executes the test suites of 33/50 projects, while matching the test results of ground truth test suite executions with a deviation of only 7.5%. These results improve over the best previously available technique by 6.6x. The costs imposed by the approach are reasonable, with an execution time of 74 minutes and LLM costs of USD 0.16, on average per project. We envision ExecutionAgent to serve as a valuable tool for developers, automated programming tools, and researchers that need to execute tests across a wide variety of projects.
We give a model-based agent that builds a Python program representing its knowledge of the world based on its interactions with the environment. The world model tries to explain its interactions, while also being optimistic about what reward it can achieve. We define this optimism as a logical constraint between a program and a planner. We study our agent on gridworlds, and on task planning, finding our approach is more sample-efficient compared to deep RL, more compute-efficient compared to ReAct-style agents, and that it can transfer its knowledge across environments by editing its code.
Large language models (LLMs) support data analysis through conversational user interfaces, as exemplified in OpenAI’s ChatGPT (formally known as Advanced Data Analysis or Code Interpreter). Essentially, LLMs produce code for accomplishing diverse analysis tasks. However, presenting raw code can obscure the logic and hinder user verification. To empower users with enhanced comprehension and augmented control over analysis conducted by LLMs, we propose a novel approach to transform LLM-generated code into an interactive visual representation. In the approach, users are provided with a clear, step-by-step visualization of the LLM-generated code in real time, allowing them to understand, verify, and modify individual data operations in the analysis. Our design decisions are informed by a formative study (N=8) probing into user practice and challenges. We further developed a prototype named WaitGPT and conducted a user study (N=12) to evaluate its usability and effectiveness. The findings from the user study reveal that WaitGPT facilitates monitoring and steering of data analysis performed by LLMs, enabling participants to enhance error detection and increase their overall confidence in the results.
Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the **I**terative step-level **P**rocess **R**efinement **(IPR)** framework, which provides detailed step-by-step guidance to enhance agent training. Specifically, we adopt the Monte Carlo method to estimate step-level rewards. During each iteration, the agent explores along the expert trajectory and generates new actions. These actions are then evaluated against the corresponding step of expert trajectory using step-level rewards. Such comparison helps identify discrepancies, yielding contrastive action pairs that serve as training data for the agent. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines. Moreover, our analytical finds highlight the effectiveness of IPR in augmenting action efficiency and its applicability to diverse models.
Hyper-parameters are essential and critical for the performance of communication algorithms. However, current hyper-parameters optimization approaches for Warm-Start Particles Swarm Optimization with Crossover and Mutation (WSPSO-CM) algorithm, designed for radio map-enabled unmanned aerial vehicle (UAV) trajectory and communication, are primarily heuristic-based, exhibiting low levels of automation and improvable performance. In this paper, we design an Large Language Model (LLM) agent for automatic hyper-parameters-tuning, where an iterative framework and Model Context Protocol (MCP) are applied. In particular, the LLM agent is first set up via a profile, which specifies the boundary of hyper-parameters, task objective, terminal condition, conservative or aggressive strategy of optimizing hyper-parameters, and LLM configurations. Then, the LLM agent iteratively invokes WS-PSO-CM algorithm for exploration. Finally, the LLM agent exits the loop based on the terminal condition and returns an optimized set of hyperparameters. Our experiment results show that the minimal sum-rate achieved by hyper-parameters generated via our LLM agent is significantly higher than those by both human heuristics and random generation methods. This indicates that an LLM agent with PSO and WS-PSO-CM algorithm knowledge is useful in seeking high-performance hyper-parameters.
Visual analytics (VA) requires analysts to iteratively propose analysis tasks based on observations and execute tasks by creating visualizations and interactive exploration to gain insights. This process demands skills in programming, data processing, and visualization tools, highlighting the need for a more intelligent, streamlined VA approach. Large language models (LLMs) have recently been developed as agents to handle various tasks with dynamic planning and tool-using capabilities, offering the potential to enhance the efficiency and versatility of VA. We propose LightVA, a lightweight VA framework that supports task decomposition, data analysis, and interactive exploration through human-agent collaboration. Our method is designed to help users progressively translate high-level analytical goals into low-level tasks, producing visualizations and deriving insights. Specifically, we introduce an LLM agent-based task planning and execution strategy, employing a recursive process involving a planner, executor, and controller. The planner is responsible for recommending and decomposing tasks, the executor handles task execution, including data analysis, visualization generation and multi-view composition, and the controller coordinates the interaction between the planner and executor. Building on the framework, we develop a system with a hybrid user interface that includes a task flow diagram for monitoring and managing the task planning process, a visualization panel for interactive data exploration, and a chat view for guiding the model through natural language instructions. We examine the effectiveness of our method through a usage scenario and an expert study.
Requirements elicitation, a critical, yet time-consuming and challenging step in product development, often fails to capture the full spectrum of user needs. This may lead to products that fall short of expectations. This paper introduces a novel framework that leverages Large Language Models (LLMs) to automate and enhance the requirements elicitation process. LLMs are used to generate a vast array of simulated users (LLM agents), enabling the exploration of a much broader range of user needs and unforeseen use cases. These agents engage in product experience scenarios, through explaining their actions, observations, and challenges. Subsequent agent interviews and analysis uncover valuable user needs, including latent ones. We validate our framework with three experiments. First, we explore different methodologies for the challenge of diverse agent generation, discussing their advantages and shortcomings. We measure the diversity of identified user needs and demonstrate that context-aware agent generation leads to greater diversity. Second, we show how our framework effectively mimics empathic lead user interviews, identifying a greater number of latent needs than conventional human interviews. Third, we showcase that LLMs can be used to analyze interviews, capture needs and classify them as latent or not. Our work highlights the potential of using LLMs to accelerate early-stage product development, reduce costs, and increase innovation.
Recently, Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience and have attracted considerable attention. Despite the impressive progress, the research question regarding the safety vulnerability of LLM-empowered RecSys still remains largely under-investigated. Given the security and privacy concerns, it is more practical to focus on attacking the black-box RecSys, where attackers can only observe the system's inputs and outputs. However, traditional attack approaches employing reinforcement learning (RL) agents are not effective for attacking LLM-empowered RecSys due to the limited capabilities in processing complex textual inputs, planning, and reasoning. On the other hand, LLMs provide unprecedented opportunities to serve as attack agents to attack RecSys because of their impressive capability in simulating human-like decision-making processes. Therefore, in this paper, we propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs, where an LLM-based agent is developed to attack LLM-Empowered RecSys. Specifically, our method first identifies the insertion position for maximum impact with minimal input modification. After that, the LLM agent is designed to generate adversarial perturbations to insert at target positions. To further improve the quality of generated perturbations, we utilize the prompt tuning technique to improve attacking strategies via feedback from the victim RecSys iteratively. Extensive experiments across three real-world datasets demonstrate the effectiveness of our proposed attacking method.
Many people struggle with learning a new language when moving to a new country, with traditional tools falling short in providing contextualized learning tailored to each learner’s needs. The recent development of large language models (LLMs) and embodied conversational agents (ECAs) in social virtual reality (VR) provides new opportunities to practice language learning in a contextualized and naturalistic way that takes into account the learner’s language level and needs. To explore this opportunity, we developed ELLMA-T, a design probe that integrates an LLM (GPT-4) with an ECA for English language learning in social VR (VRChat), informed by the situated learning framework. We conducted a feasibility study to explore the potential and challenges of LLM-based ECAs for language learning in social VR. Drawing on qualitative interviews (N=12), we reveal the potential of ELLMA-T to generate realistic, believable, and context-specific role plays for agent-learner interaction in VR, and LLM’s capability to provide initial language assessment and continuous feedback to learners. We provide four design implications for the future development of LLM-based language agents in social VR.
Controlling diversity in LLM-agent simulations is essential for balancing stability in structured tasks with variability in open-ended interactions. However, we observe that dialogue diversity tends to degrade over long-term simulations. To explore the role of prompt design in this phenomenon, we modularized the utterance generation prompt and found that reducing contextual information leads to more diverse outputs. Based on this insight, we propose Adaptive Prompt Pruning (APP), a novel method that allows users to control diversity via a single parameter, lambda. APP dynamically prunes prompt segments based on attention scores and is compatible with existing diversity control methods. We demonstrate that APP effectively modulates diversity through extensive experiments and propose a method to balance the control trade-offs. Our analysis reveals that all prompt components impose constraints on diversity, with the Memory being the most influential. Additionally, high-attention contents consistently suppress output diversity.
This paper introduces a conceptual architecture design aimed at enhancing interactions with cognitive digital twins of countries through an Large Language Model (LLM) agent. By leveraging sophisticated data retrieval and summarization techniques, the architecture integrates data from diverse sources, including environmental sensors, web pages, and human inputs, to create a dynamic and comprehensive digital twin. The LLM agent facilitates intuitive conversational interfaces, allowing users to query and interact with the digital twin in a natural manner. Through advanced natural language processing and prompt engineering, the agent can understand complex queries, retrieve relevant data, and provide transparent and explainable insights. Additionally, the system incorporates a feedback loop for continuous improvement based on user interactions. This approach addresses significant challenges in data acquisition and management, offering a scalable solution for creating accurate and real-time representations of countries. The architecture aims to empower decision-makers with precise, actionable insights for policy-making, urban planning, and resource management, demonstrating a significant step towards realizing the potential of digital twins in understanding and managing complex national systems.
Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages--literature review, experimentation, and report writing to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluate the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.
Large Language Model-based Multi-Agent Systems (LLM-MAS) have revolutionized complex problem-solving capability by enabling sophisticated agent collaboration through message-based communications. While the communication framework is crucial for agent coordination, it also introduces a critical yet unexplored security vulnerability. In this work, we introduce Agent-in-the-Middle (AiTM), a novel attack that exploits the fundamental communication mechanisms in LLM-MAS by intercepting and manipulating inter-agent messages. Unlike existing attacks that compromise individual agents, AiTM demonstrates how an adversary can compromise entire multi-agent systems by only manipulating the messages passing between agents. To enable the attack under the challenges of limited control and role-restricted communication format, we develop an LLM-powered adversarial agent with a reflection mechanism that generates contextually-aware malicious instructions. Our comprehensive evaluation across various frameworks, communication structures, and real-world applications demonstrates that LLM-MAS is vulnerable to communication-based attacks, highlighting the need for robust security measures in multi-agent systems.
Thematic analysis (TA) is a widely used qualitative method for identifying underlying meanings within unstructured text. However, TA requires manual processes, which become increasingly labour-intensive and time-consuming as datasets grow. While large language models (LLMs) have been introduced to assist with TA on small-scale datasets, three key limitations hinder their effectiveness. First, current approaches often depend on interactions between an LLM agent and a human coder, a process that becomes challenging with larger datasets. Second, with feedback from the human coder, the LLM tends to mirror the human coder, which provides a narrower viewpoint of the data. Third, existing methods follow a sequential process, where codes are generated for individual samples without recalling previous codes and associated data, reducing the ability to analyse data holistically. To address these limitations, we propose Thematic-LM, an LLM-based multi-agent system for large-scale computational thematic analysis. Thematic-LM assigns specialised tasks to each agent, such as coding, aggregating codes, and maintaining and updating the codebook. We assign coder agents different identity perspectives to simulate the subjective nature of TA, fostering a more diverse interpretation of the data. We applied Thematic-LM to the Dreaddit dataset and the Reddit climate change dataset to analyse themes related to social media stress and online opinions on climate change. We evaluate the resulting themes based on trustworthiness principles in qualitative research. Our study reveals insights such as assigning different identities to coder agents promotes divergence in codes and themes.
Failure attribution in LLM multi-agent systems-identifying the agent and step responsible for task failures-provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using the Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
This study investigates the implementation of LLM agents in smart city management, leveraging both the inherent language processing abilities of LLMs and the distributed problem solving capabilities of multi-agent systems for the improvement of urban decision making processes. A multi-agent system architecture combines LLMs with existing urban information systems to process complex queries and generate contextually relevant responses for urban planning and management. The research is focused on three main hypotheses testing: (1) LLM agents’ capability for effective routing and processing diverse urban queries, (2) the effectiveness of Retrieval-Augmented Generation (RAG) technology in improving response accuracy when working with local knowledge and regulations, and (3) the impact of integrating LLM agents with existing urban information systems. Our experimental results, based on a comprehensive validation dataset of 150 question–answer pairs, demonstrate significant improvements in decision support capabilities. The multi-agent system achieved pipeline selection accuracy of 94–99% across different models, while the integration of RAG technology improved response accuracy by 17% for strategic development queries and 55% for service accessibility questions. The combined use of document databases and service APIs resulted in the highest performance metrics (G-Eval scores of 0.68–0.74) compared to standalone LLM responses (0.30–0.38). Using St. Petersburg’s Digital Urban Platform as a testbed, we demonstrate the practical applicability of this approach to create integrated city management systems with support complex urban decision making processes. This research contributes to the growing field of AI-enhanced urban management by providing empirical evidence of LLM agents’ effectiveness in processing heterogeneous urban data and supporting strategic planning decisions. Our findings suggest that LLM-based multi-agent systems can significantly enhance the efficiency and accuracy of urban decision making while maintaining high relevance in responses.
Heterogeneous multi-agent systems (HMAS) comprise various intelligent agents with specialized functions, such as drones, ground robots, and automated devices, working in coordinated settings. This paper presents AutoHMA-LLM, a novel framework that combines Large Language Models (LLMs) with classical control algorithms to address the challenges of task coordination and scheduling in complex, dynamic environments. The framework is designed with a multi-tier architecture, utilizing a cloud-based LLM as the central planner alongside device-specific LLMs and Generative Agents to improve task execution efficiency and accuracy. Specifically targeting dynamic scenarios, the system enhances resource utilization and stabilizes task execution through refined task scheduling and real-time feedback mechanisms. In experiments conducted across logistics, inspection, and search & rescue scenarios, AutoHMA-LLM demonstrated a 5.7% improvement in task completion accuracy, a 46% reduction in communication steps, and a 31% decrease in token usage and API calls compared to baseline methods. These results highlight our framework’s scalability and efficiency, offering substantial support for effective multi-agent collaboration in complex, resource-constrained environments.
Space-air-ground integrated network (SAGIN), which integrates satellite systems, aerial networks, and terrestrial communications, offers ubiquitous coverage for a multitude of applications. Nevertheless, the highly dynamic and open nature of SAGIN increases the network’s vulnerability. Hence, zero-trust security, operating on the principle of “never trust, always verify”, holds the significant potential of securing SAGIN. However, implementing zero-trust SAGIN in practice presents three primary challenges: 1) understanding massive unstructured threat information across diverse domains, 2) performing adaptive security assessments, and 3) making in-depth security decisions. This motivates us to propose SAG-Attack and LLM-SA to enhance zero-trust SAGIN. SAG-Attack serves as a simulator that aims to mimic various attacks in SAGIN. Our LLM-SA is a novel situation awareness method that explores the multiple agents of large language model (LLM). Specifically, the output logs of SAG-Attack will be fed into LLM-SA, and LLM-SA fuses vast amounts of heterogeneous threat information from various domains, thus tackling the first challenge. Then, our LLM-SA relies on multiple LLM-based agents to perform adaptive security assessments, utilizing the chain-of-thought capabilities of LLMs to automatically generate in-depth defense strategies, thereby addressing the second and third challenges. Experiments on five benchmarks demonstrate the superiority of the proposed SAG-Attack and LLM-SA. Notably, our method based on open-sourced Llama3-8B even outperforms ChatGPT-4 under the same setting, despite involving significantly fewer parameters. To foster further research in this area, we will release our platform to the community, facilitating the advancement of zero-trust SAGIN.
Consumer segmentation and targeting are essential for precision marketing, allowing businesses to deliver personalized experiences. The article explores the transformative role of autonomous AI agents in enhancing consumer segmentation and targeting within the data-driven marketing landscape. The proposed framework integrates machine learning (ML), natural language processing (NLP), and predictive analytics to continuously optimize segmentation models, enabling real-time targeting and hyper-personalization without human oversight. Autonomous agents dynamically manage segmentation by leveraging unsupervised learning algorithms, including K-means and DBSCAN, to refine clusters and discover complex micro-segments based on evolving consumer behavior and preferences. The AI agents use reinforcement learning to enhance campaign management through continuous feedback loops. By monitoring real-time performance metrics, such as click-through rates and conversions, they dynamically adjust ad spend, resource allocation, and personalized content delivery across digital channels. Predictive models, including Random Forests and time series analysis, further support real-time consumer behavior forecasting. This automation reduces operational inefficiencies, speeds up decision-making, and ensures marketing strategies remain relevant and adaptive. Ethical considerations, including data privacy and algorithmic fairness, are integral to the framework, promoting responsible AI deployment. Case studies from industries such as e-commerce and streaming illustrate significant improvements in campaign efficiency, customer engagement, and return on investment. Autonomous AI enables scalable, data-driven solutions that give businesses a competitive edge in rapidly changing markets.
Abstract Security Operations Centers (SOCs) face significant challenges due to the large volume, diversity, and dynamics of incident events. Alarm fatigue, delayed initiation of response, and the high share of false positives or missed threats limit team effectiveness and increase organizational risk. This study presents a methodology for automated management of key performance indicators (KPIs) in an SOC environment through an Agentic AI architecture and machine learning. Within the project, 214 CSV files were processed, comprising over 8.6 million data rows extracted from SIEM, Incident Management, Task Tracking, and CRM systems. Sixteen specific indicators were used, grouped into four categories: detection and filtering (TTD, FNR, FPR), response and resolution (TTR, IRR, SIHR), recovery and operations (MTTR, OE), satisfaction and risk management (CSR, SIER). The system includes ten specialized Agentic AI agents with clearly defined roles ‒ monitoring time parameters, predicting false alarm probabilities, automatically triggering playbooks, calculating operational metrics, and analyzing customer satisfaction. Five machine learning models were trained: two XGBoost classifiers for FPR and FNR, two LightGBM regressors for TTR and MTTR, and a BERT model for textual feedback analysis. The results demonstrate reduced detection and response times, a lower rate of false alarms, and improved operational predictability in calculating KPI values. The methodology shows the applicability of Agentic AI for optimizing SOC processes on real and public data, without the need for manual intervention in most processing phases.
Background: Globally we face a projected shortage of 11 million healthcare practitioners by 2030, and administrative burden consumes 50% of clinical time. Artificial intelligence (AI) has the potential to help alleviate these problems. However, no end-to-end autonomous large language model (LLM)-based AI system has been rigorously evaluated in real-world clinical practice. In this study, we evaluated whether a multi-agent LLM-based AI framework can function autonomously as an AI doctor in a virtual urgent care setting. Methods: We retrospectively compared the performance of the multi-agent AI system Doctronic and board-certified clinicians across 500 consecutive urgent-care telehealth encounters. The primary end points: diagnostic concordance, treatment plan consistency, and safety metrics, were assessed by blinded LLM-based adjudication and expert human review. Results: The top diagnosis of Doctronic and clinician matched in 81% of cases, and the treatment plan aligned in 99.2% of cases. No clinical hallucinations occurred (e.g., diagnosis or treatment not supported by clinical findings). In an expert review of discordant cases, AI performance was superior in 36.1%, and human performance was superior in 9.3%; the diagnoses were equivalent in the remaining cases. Conclusions: In this first large-scale validation of an autonomous AI doctor, we demonstrated strong diagnostic and treatment plan concordance with human clinicians. These findings indicate that multi-agent AI systems can achieve comparable clinical decision-making to human providers and offer a potential solution to healthcare workforce shortages. Key words: large language models, artificial intelligence, autonomous AI doctor, diagnostic accuracy
Healthcare operations are inherently complex, involving dynamic coordination across emergency triage, diagnostics, surgery, and discharge processes. Traditional orchestration methods such as manual scheduling, static bed boards, and siloed communication struggle to manage this complexity, often resulting in delayed interventions, inefficiencies, and suboptimal resource utilization, especially during high-acuity surges or emergencies. Agentic Artificial Intelligence (AI) introduces a transformative paradigm by embedding autonomy, reasoning, and negotiation capabilities into intelligent digital agents that perceive, learn, and act within clinical workflows. Unlike conventional AI systems that rely on predefined rules or static predictions, agentic AI employs multi-agent reinforcement learning (MARL) to enable decentralized decision-making, adaptive resource allocation, and cooperative policy optimization across interconnected hospital systems. This study presents an autonomous agentic AI framework for clinical workflow orchestration, integrating agents for triage, bed management, laboratory, imaging, transport, and discharge operations. Using HL7 Fast Healthcare Interoperability Resources (FHIR) and DICOM standards, agents exchange real-time information while adhering to governance and safety protocols aligned with the NIST AI Risk Management Framework and the EU AI Act. The architecture further incorporates three core design elements: (i) inter-hospital communication for mutual-aid and load sharing, (ii) decentralized ambulance routing that rebalances transport in real time based on dynamic capacity and patient acuity, and (iii) distributed crisis-management protocols for maintaining operational equilibrium during mass-casualty events. Evaluation through digital-twin simulations and shadow-mode deployments demonstrated substantial operational gains, including 60% faster ambulance response, 38% shorter door-to-clinician intervals, and 22% higher operating room throughput. These results confirm that agentic AI transforms reactive, human-initiated workflows into proactive, self-governing systems, enhancing responsiveness, equity, and resilience across healthcare networks.
No abstract available
No abstract available
Cloud-native organizations increasingly rely on microservices for backend modularity and micro-frontends for scalable user interface delivery. Yet, real-world systems still struggle to evolve these layers coherently under high release velocity, shifting product goals, and variable workloads. This paper presents a unified Agentic AI framework that autonomously coordinates the co-evolution of micro-frontend UIs (implemented in ReactJS and Angular) and microservices. The proposed architecture integrates reinforcement learning for continuous control, large language models for code and configuration synthesis, and a policy-governed multi-agent control plane that executes progressive delivery (feature flags, canary, blue-green) via Kubernetes and service meshes. We formalize decisions using Markov Decision Processes, propose drift detection models for UI-API compatibility, and formulate traffic-shifting optimization for safe rollouts. A mini empirical study across e-commerce, SaaS analytics, and multi-cloud migration scenarios demonstrates reductions in adaptation latency, error rates, and manual intervention relative to strong DevOps baselines. We discuss reliability, explainability, and governance challenges, and lay out future research on hybrid RL-LLM agents, knowledge-graph-aware planning, digital twins, and compliance-aware rewards.
In this paper, we explore the paradigm of Agentic AI, where generative AI systems are not limited to generate responses but become the agent in autonomous and context aware decisions. It discusses defining characteristics, architectural components, and the ethical considerations; real world applications included in autonomous vehicles, collaborative robots and personalized services. Challenges such as scalability and interpretability, as well as future opportunities such as interdisciplinary research and general-purpose adaptability, are also identified in the paper. In emerging arche of Agentic AI, dynamic decision making is addressed by methodological needs in industries.
The rapid evolution of Agentic Artificial Intelligence (AI)—autonomous, context-aware agents capable of self-directed decision-making—has introduced unprecedented security challenges for microservices architectures. Traditional session-based authentication, dependent on static tokens and centralized identity providers, is ill-suited for the dynamic, ephemeral, and machine-to-machine (M2M) interactions prevalent in zero trust environments. This paper investigates the convergence of Agentic AI and decentralized identity (DID) frameworks, emphasizing the role of verifiable credentials (VCs), dynamic token issuance, and contextual access control in enabling scalable, trust-minimized (i.e., reducing reliance on centralized authorities) service interactions. We propose a decentralized authentication and authorization framework where DIDs, maintained on blockchain-based registries, replace conventional identity silos, enabling autonomous agents to cryptographically prove trustworthiness without relying on persistent session states. Context-aware policy engines evaluate real-time telemetry such as location, workload, and behavioural patterns to issue short-lived, ephemeral access tokens with adaptive time-to-live (TTL) values. Experimental results from a Kubernetes-based microservices testbed with 50 simulated agents show that the proposed approach reduces authentication latency by 50% (from 180 ms to 90 ms), eliminates token replay vulnerabilities, and increases authentication throughput by 75% (from 800 to 1,400 agents/min) compared to OAuth2/JWT baselines. Furthermore, dynamic policy adaptation ensures immediate revocation of access when agents deviate from expected operational norms, minimizing attack surfaces. This work offers a novel synthesis of AI autonomy and decentralized identity principles, delivering both performance gains and enhanced security in zero trust microservices. The proposed architecture paves the way for resilient, self-governing ecosystems where Agentic AI can operate securely, efficiently, and adaptively in highly dynamic environments.
Through every economic sector, but especially among financial firms and enthusiasts, agentic AI systems are being tool-enabled, giving them control over large language models (LLM), reinforcement learning (RL) models, and more. We present a timely paper to explore the potential consequences with a novel combined system: a deep RL-based autonomous trading agent which also controls an LLM capable of posting to a simulated social media feed observed by other traders. As the agent trades, it also supplies order flow information to the LLM, which produces and posts natural language market analysis at the agent’s direction. We empirically investigate the performance and impact of such an agent using two DeepRL algorithms, finding that it learns to augment profit by manipulating sentiment in a sort of accidental pump and dump scheme. Along the way, we present confidence-building baseline results and specific insights from our investigation, before concluding with a discussion of results, limitations, and suggestions for future work.
This position paper presents A4FN, an Agentic Artificial Intelligence (AI) architecture for intent-driven automation in Flying Networks (FNs) using Unmanned Aerial Vehicles (UAVs) as access nodes. A4FN leverages Generative AI and Large Language Models (LLMs) to enable real-time, context-aware network control via a distributed agentic system. It comprises two components: the Perception Agent (PA), which semantically interprets multimodal input – including imagery, audio, and telemetry data – from UAV-mounted sensors to derive Service Level Specifications (SLSs); and the Decision-and-Action Agent (DAA), which reconfigures the network based on inferred intents. A4FN embodies key properties of Agentic AI, including autonomy, goal-driven reasoning, and continuous perception-action cycles. Designed for mission-critical, infrastructure-limited scenarios such as disaster response, it supports adaptive reconfiguration, dynamic resource management, and interoperability with emerging wireless technologies. The paper details the A4FN architecture, its core innovations, and open research challenges in multi-agent coordination and Agentic AI integration in next-generation FNs.
The deployment of AI agents within legacy Radio Access Network (RAN) infrastructure poses significant safety and reliability challenges for future 6G networks. This paper presents a novel Edge AI framework for autonomous network optimisation in Open RAN environments, addressing these challenges through three core innovations: (1) a persona-based multi-tools architecture enabling distributed, context-aware decision-making; (2) proactive anomaly detection agent powered by traffic predictive tool; and (3) a safety, aligned reward mechanism that balances performance with operational stability.Integrated into the RAN Intelligent Controller (RIC), our framework leverages multimodal data fusion, including network KPIs, a traffic prediction model, and external information sources, to anticipate and respond to dynamic network conditions. Extensive evaluation using realistic 5G scenarios demonstrates that the edge framework achieves zero network outages under high-stress conditions, compared to 8.4% for traditional fixed-power networks and 3.3% for large language model (LLM) agent-based approaches, while maintaining near real-time responsiveness and consistent QoS. These results establish that, when equipped with the right tools and contextual awareness, AI agents can be safely and effectively deployed in critical network infrastructure, laying the framework for intelligent and autonomous 5G and beyond network operations.
Autonomous agents in cloud computing represent a transformative evolution beyond traditional automation approaches, enabling self-directed management of complex application environments. This article explores the architectural framework, implementation patterns, and operational benefits of Agentic AI in cloud-based application management. Unlike conventional automation systems constrained by static rules and predetermined workflows, autonomous agents leverage advanced machine learning techniques to perceive environmental conditions, learn from interactions, and take independent actions aligned with organizational objectives. The architectural foundation integrates sensing, reasoning, action, and feedback layers to create cognitive systems capable of addressing the inherent complexity of modern distributed applications. Key implementation patterns examined include intelligent auto-remediation, proactive capacity management, autonomous patch management, and continuous compliance enforcement—each demonstrating distinctive operational advantages across diverse industry contexts. Benefits include significant operational efficiency improvements, cost optimization through intelligent resource management, enhanced risk mitigation through proactive security measures, and scalability advantages in multi-cloud environments. The article addresses technical challenges related to decision boundaries and explainability, organizational considerations including skills gaps and operational model transformation, and governance requirements for responsible autonomous operations. Mitigation strategies incorporate phased implementation approaches, comprehensive explainability frameworks, and appropriate human oversight models to ensure effective and responsible deployment.
Modern scientific discovery increasingly requires coordinating distributed facilities and heterogeneous resources, forcing researchers to act as manual workflow coordinators rather than scientists. Advances in AI leading to AI agents show exciting new opportunities that can accelerate scientific discovery by providing intelligence as a component in the ecosystem. However, it is unclear how this new capability would materialize and integrate in the real world. To address this, we propose a conceptual framework where workflows evolve along two dimensions which are intelligence (from static to intelligent) and composition (from single to swarm) to chart an evolutionary path from current workflow management systems to fully autonomous, distributed scientific laboratories. With these trajectories in mind, we present an architectural blueprint that can help the community take the next steps towards harnessing the opportunities in autonomous science with the potential for 100x discovery acceleration and transformational scientific workflows.CCS Concepts • Computing methodologies→Distributed computing methodologies; Multi-agent systems; Distributed computing methodologies; Parallel computing methodologies.
The integration of agentic AI, powered by large language models (LLMs) with autonomous reasoning, planning, and execution, into unmanned aerial vehicle (UAV) swarms opens new operational possibilities and brings the vision of the Internet of Drones closer to reality. However, infrastructure constraints, dynamic environments, and the computational demands of multi-agent coordination limit real-world deployment in high-risk scenarios such as wildfires and disaster response. This paper investigates the integration of LLM-based agentic AI and edge computing to realize scalable and resilient autonomy in UAV swarms. We first discuss three architectures for supporting UAV swarms - standalone, edge-enabled, and edge-cloud hybrid deployment - each optimized for varying autonomy and connectivity levels. Then, a use case for wildfire search and rescue (SAR) is designed to demonstrate the efficiency of the edge-enabled architecture, enabling high SAR coverage, reduced mission completion times, and a higher level of autonomy compared to traditional approaches. Finally, we highlight open challenges in integrating LLMs and edge computing for mission-critical UAV-swarm applications.
Enterprise data processing environments face increasing operational complexity that exceeds traditional manual management capabilities. Current Big Data platforms rely on reactive operational models that respond to system issues after they impact performance and user experience. This article introduces autonomous agent architectures that transform data platforms into intelligent systems capable of independent perception, reasoning, and action execution. The proposed framework integrates perception layers for comprehensive system monitoring, decision models that balance multiple competing objectives, and action orchestration mechanisms that implement optimizations automatically. Autonomous capabilities enable continuous performance tuning, intelligent failure recovery, dynamic cost optimization, and automated policy enforcement without human intervention. The architecture maintains scalability and fault tolerance characteristics while adding sophisticated reasoning capabilities that adapt to changing operational conditions. Implementation strategies offer practical deployment methods that reduce disruption while gradually adding autonomous features. The operational transformation enables proactive optimization that predicts and prevents issues before they affect system performance. Human-agent collaboration frameworks define effective interaction models that balance oversight with system autonomy. Risk mitigation strategies ensure safe autonomous operation through bounded decision-making and comprehensive safeguards. Performance evaluation metrics demonstrate significant improvements in operational efficiency, cost reduction, and system reliability through autonomous operation.
The promising potential of AI and network convergence in improving networking performance and enabling new service capabilities has recently attracted significant interest. Existing network AI solutions, while powerful, are mainly built based on the close-loop and passive learning framework, resulting in major limitations in autonomous solution finding and dynamic environmental adaptation. Agentic AI has recently been introduced as a promising solution to address the above limitations and pave the way for true, generally intelligent, and beneficial AI systems. The key idea is to create a networking ecosystem to support a diverse range of autonomous and embodied AI agents in fulfilling their goals. In this article, we focus on the novel challenges and requirements of agentic AI networking. We propose AgentNet, a novel framework for supporting interaction, collaborative learning, and knowledge transfer among AI agents. We introduce a general architectural framework of AgentNet and then propose a generative foundation model (GFM)-based implementation in which multiple GFM-as-agents have been created as an interactive knowledge-base to bootstrap the development of embodied AI agents according to different task requirements and environmental features. We consider two application scenarios, digital-twin-based industrial automation and metaverse-based infotainment system, to describe how to apply AgentNet for supporting efficient task-driven collaboration and interaction among AI agents.
This paper introduces a novel framework that integrates agentic Artificial Intelligence (AI) with Intent-Based Networks (IBN) to enable autonomous management, configuration, and optimization of mobile network services and resources. Leveraging the advanced reasoning and natural language processing capabilities of an Large Language Model (LLM), the proposed architecture translates high-level user intents into precise network actions, facilitating user-friendly and scalable network orchestration. The framework employs a distributed multi-agent system, where specialized agents collaborate to decompose user intents, provide computational infrastructure, and deploy services using industry-standard Infrastructure-as-Code (IaC) tools. By supporting natural language interactions, the system reduces operational complexity and enhances accessibility for users with varying technical expertise. Experimental evaluations demonstrate significant improvements in task completion rates, response accuracy, and operational efficiency compared to traditional manual methods, particularly for complex network management tasks. In essence, this work creates an intelligent network orchestration framework that adapts to user needs by automatically configuring network and computing resources while operating with minimal human intervention.
Developing clinical ML systems is costly and labor-intensive due to fragmented preprocessing, privacy constraints, and model-data alignment challenges. We introduce a modular agentic AI framework that automates the end-to-end ML lifecycle, from ingestion and anonymization to preprocessing, model selection, and interpretable inference. Each agent performs a well-defined task, enabling scalable workflows across structured and unstructured data. We evaluate the framework on public datasets from geriatrics, palliative care, and colonoscopy imaging. Data are automatically classified, anonymized via DLP, semantically represented, and mapped to suitable models using embedding- or LLM-based strategies. Preprocessing and inference agents ensure compatibility and produce interpretable outputs (e.g., SHAP, attention maps). By consolidating manual tasks into coordinated autonomous agents, our approach reduces expert intervention, lowers operational costs, and supports scalable clinical ML deployment.
Abstract This commentary introduces agentic artificial intelligence (AI) as an emerging paradigm in radiology, marking a shift from passive, user-triggered tools to systems capable of autonomous workflow management, task planning, and clinical decision support. Agentic AI models may dynamically prioritize imaging studies, tailor recommendations based on patient history and scan context, and automate administrative follow-up tasks, offering potential gains in efficiency, triage accuracy, and cognitive support. While not yet widely implemented, early pilot studies and proof-of-concept applications highlight promising utility across high-volume and high-acuity settings. Key barriers, including limited clinical validation, evolving regulatory frameworks, and integration challenges, must be addressed to ensure safe, scalable deployment. Agentic AI represents a forward-looking evolution in radiology that warrants careful development and clinician-guided implementation.
: This study presents a comprehensive quantitative analysis of Agentic AI performance and applications across various industries. Agentic Artificial Intelligent (AI), an emerging field combining advanced AI techniques with enterprise automation, has shown promise in creating autonomous agents capable of complex decision-making and problem-solving. Our research, conducted over a 12-month period, employed a mixed-methods approach, analyzing data from 500 organizations and incorporating insights from 50 industry experts. The study aimed to evaluate the efficiency, accuracy, and impact of Agentic AI systems compared to traditional AI approaches. Results demonstrate that Agentic AI systems significantly outperform traditional AI, with a 34.2% reduction in task completion time, 7.7% increase in accuracy, and 13.6% improvement in resource utilization. Productivity gains varied across industries, with the technology sector showing the highest improvement at 45%. The study also revealed high scalability of Agentic AI solutions across different organizational sizes, although implementation time increased with organization complexity. Key challenges identified include data privacy concerns, integration difficulties with legacy systems, skill gaps, and ethical considerations. Despite these challenges, the study concludes that Agentic AI has significant potential to transform business processes and decision-making across various sectors. Future research directions include enhancing interpretability, optimizing domain-specific applications, and exploring multi-agent collaborations. This research contributes valuable insights into the current state and future prospects of Agentic AI, providing a foundation for further development and implementation strategies in this rapidly evolving field.
The rise of Agentic AI—autonomous systems capable of executing tasks with self-directed decision-making—presents transformative potential for cybersecurity operations. However, as these systems begin to operate across threat detection, response orchestration, and policy enforcement, they introduce novel attack surfaces, decision-making opacity, and governance complexity. This paper introduces the Model–Control–Policy (MCP) framework as a structured approach to governing agentic AI workflows in cybersecurity. Through deep technical analysis, case studies including autonomous SOC agents and adaptive threat mitigation bots, and an evaluation of existing controls (e.g., explainability, human-in-the-loop, red-teaming), we explore how governance strategies must evolve to meet this new paradigm. We also propose specific policy recommendations and architectural safeguards to ensure accountability, resilience, and trust in AI-driven cybersecurity systems.
The recent development of Agentic AI systems, empowered by autonomous large language models (LLMs) agents with planning and tool-usage capabilities, enables new possibilities for the evolution of industrial automation and reduces the complexity introduced by Industry 4.0. This work proposes a conceptual framework that integrates Agentic AI with the intent-based paradigm, originally developed in network research, to simplify human-machine interaction (HMI) and better align automation systems with the human-centric, sustainable, and resilient principles of Industry 5.0. Based on the intent-based processing, the framework allows human operators to express high-level business or operational goals in natural language, which are decomposed into actionable components. These intents are broken into expectations, conditions, targets, context, and information that guide sub-agents equipped with specialized tools to execute domain-specific tasks. A proof of concept was implemented using the CMAPSS dataset and Google Agent Developer Kit (ADK), demonstrating the feasibility of intent decomposition, agent orchestration, and autonomous decision-making in predictive maintenance scenarios. The results confirm the potential of this approach to reduce technical barriers and enable scalable, intent-driven automation, despite data quality and explainability concerns.
Cultural heritage preservation increasingly relies on data-driven technologies, yet most existing systems lack the cognitive and temporal depth required to support meaningful, transparent, and policy-informed decision-making. This paper proposes a conceptual framework for memory-enabled, semantically grounded AI agents in the cultural domain, showing how the integration of the ICCROM/CCI ABC method for risk assessment into the Panoptes ontology enables the structured encoding of risk cognition over time. This structured risk memory becomes the foundation for agentic reasoning, supporting prioritization, justification, and long-term preservation planning. It is argued that this approach constitutes a principled step toward the development of Cultural Agentic AI: autonomous systems that remember, reason, and act in alignment with cultural values. Proof-of-concept simulations illustrate how memory-enabled agents can trace evolving risk patterns, trigger policy responses, and evaluate mitigation outcomes through structured, explainable reasoning.
While microservices are revolutionizing cloud computing by offering unparalleled scalability and independent deployment, their decentralized nature poses significant security and management challenges that can threaten system stability. We propose a framework based on MAPE-K, which leverages agentic AI, for autonomous anomaly detection and remediation to address the daunting task of highly distributed system management. Our framework offers practical, industry-ready solutions for maintaining robust and secure microservices. Practitioners and researchers can customize the framework to enhance system stability, reduce downtime, and monitor broader system quality attributes such as system performance level, resilience, security, and anomaly management, among others.
This research paper investigates the critical importance of robust API and platform strategies for enterprises adapting to the proliferation of agentic AI, wherein AI systems autonomously execute tasks with limited human intervention. It addresses the imperative of facilitating seamless communication among AI agents, enterprise data systems, and external applications. The research examines the architectural and performance considerations essential for organizations to maintain competitiveness in this rapidly growing technological landscape of agentic AI projected to expand from $5.1 billion in 2024 to $47.1 billion by 2030. Key elements explored include unified data layer APIs, zero-trust authorization models, event-driven orchestration, and latency-sensitive design. Furthermore, the study considers emerging trends such as AI-powered SDKs, self-optimizing API gateways, autonomous API discovery, and ethical AI governance APIs. The findings emphasize that the adoption of modern API and platform architectures, optimization of performance metrics, and adherence to regulatory mandates are paramount for organizations to fully capitalize on the transformative potential of agentic AI. It is posited that enterprises embracing this paradigm shift will achieve a demonstrable competitive advantage, fostering innovation and operational excellence in the AI-driven future.
To achieve faithful reasoning that aligns with human expectations, large language models (LLMs) need to ground their reasoning to real-world knowledge (e.g., web facts, math and physical rules). Tools help LLMs access this external knowledge, but there remains challenges for fine-tuning LLM agents (e.g., Toolformer) to invoke tools in multi-step reasoning problems, where inter-connected tool calls require holistic and efficient tool usage planning. In this work, we propose a new method for LLMs to better leverage tools in multi-step reasoning. Our method, Chain-of-Abstraction (CoA), trains LLMs to first decode reasoning chains with abstract placeholders, and then call domain tools to reify each reasoning chain by filling in specific knowledge. This planning with abstract chains enables LLMs to learn more general reasoning strategies, which are robust to shifts of domain knowledge (e.g., math results) relevant to different reasoning questions. It also allows LLMs to perform decoding and calling of external tools in parallel, which avoids the inference delay caused by waiting for tool responses. In mathematical reasoning and Wiki QA domains, we show that our method consistently outperforms previous chain-of-thought and tool-augmented baselines on both in-distribution and out-of-distribution test sets, with an average ~6% absolute QA accuracy improvement. LLM agents trained with our method also show more efficient tool use, with inference speed being on average ~1.4x faster than baseline tool-augmented LLMs.
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. However, developing prompting techniques that enable LLM agents to effectively use these tools and knowledge remains a heuristic and labor-intensive task. Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task. During optimization, we design a comparator module to iteratively deliver insightful and comprehensive prompts to the LLM agent by contrastively reasoning between positive and negative examples sampled from training data. We demonstrate AvaTaR on four complex multimodal retrieval datasets featuring textual, visual, and relational information, and three general question-answering (QA) datasets. We find AvaTaR consistently outperforms state-of-the-art approaches across all seven tasks, exhibiting strong generalization ability when applied to novel cases and achieving an average relative improvement of 14% on the Hit@1 metric for the retrieval datasets and 13% for the QA datasets. Code and dataset are available at https://github.com/zou-group/avatar.
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs’ tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.
The recently proposed ToolkenGPT tool learning paradigm demonstrates promising performance but suffers from two major issues: first, it cannot benefit from tool documentation, and second, it often makes mistakes in whether to use a tool at all. We introduce Toolken+ that mitigates the first problem by reranking top $k$ tools selected by ToolkenGPT and the second problem with a special"Reject"option such that the model will generate a vocabulary token if"Reject"is ranked first. We demonstrate the effectiveness of Toolken+ on multistep numerical reasoning and tool selection tasks.
Table reasoning requires models to jointly perform comprehensive semantic understanding and precise numerical operations. Although recent large language model (LLM)-based methods have achieved promising results, most of them still rely on a single-turn reasoning paradigm that processes flattened tables in a single forward pass. This paradigm suffers from inherent limitations, including context overflow on large tables, weak sensitivity to continuous numerical values, and the absence of explicit tool-use and reflection. In this paper, we propose TableMind, a tuning-based autonomous programmatic table agent that simulates the human-like cognitive schema of multi-turn interaction within a lightweight LLM. Instead of adopting a training-free workflow design, TableMind learns to internalize planning, action, and reflection through a principled two-stage training strategy. To bootstrap structured table reasoning capabilities, we construct and filter high-quality reasoning data for the supervised fine-tuning (SFT) stage. To enable precise code generation, we introduce a designed multi-perspective reward scheme and a novel optimization objective in the reinforcement learning (RL) stage. Extensive experiments on diverse benchmarks demonstrate that TableMind consistently outperforms previous baselines, validating the effectiveness of training autonomous agents to improve overall performance.
Large Reasoning Models (LRMs) have become a central focus in today’s large language model (LLM) research, where models are designed to output a step-by-step thinking process before arriving at a final answer to handle complex reasoning tasks. Despite their promise, recent empirical studies (e.g., [Shojaee et al., 2025] from Apple) suggest that this thinking process may not actually enhance reasoning ability, where LLMs without explicit reasoning actually outperform LRMs on tasks with low or high complexity. In this work, we revisit these findings and investigate whether the limitations of LRMs persist when tool augmentations are introduced. We incorporate two types of tools, Python interpreters and scratchpads, and evaluate three representative LLMs and their LRM counterparts on Apple’s benchmark reasoning puzzles. Our results show that, with proper tool use, LRMs consistently outperform their non-reasoning counterparts across all levels of task complexity. These findings challenge the recent narrative that reasoning is an illusion and highlight the potential of tool-augmented LRMs for solving complex problems. Our source code is available at https: //github.com/magiclinux/thinking_is_not_an_illusion.
Agentic AI has significantly extended the capabilities of large language models (LLMs) by enabling complex reasoning and tool use. However, most existing frameworks are tailored to domains such as mathematics, coding, or web automation, and fall short on geospatial tasks that require spatial reasoning, multi-hop planning, and real-time map interaction. To address these challenges, we introduce MapAgent, a hierarchical multi-agent plug-and-play framework with customized toolsets and agentic scaffolds for map-integrated geospatial reasoning. Unlike existing flat agent-based approaches that treat tools uniformly-often overwhelming the LLM when handling similar but subtly different geospatial APIs-MapAgent decouples planning from execution. A high-level planner decomposes complex queries into subgoals, which are routed to specialized modules. For tool-heavy modules-such as map-based services-we then design a dedicated map-tool agent that efficiently orchestrates related APIs adaptively in parallel to effectively fetch geospatial data relevant for the query, while simpler modules (e.g., solution generation or answer extraction) operate without additional agent overhead. This hierarchical design reduces cognitive load, improves tool selection accuracy, and enables precise coordination across similar APIs. We evaluate MapAgent on four diverse geospatial benchmarks-MapEval-Textual, MapEval-API, MapEval-Visual, and MapQA-and demonstrate substantial gains over state-of-the-art tool-augmented and agentic baselines. We open-source our framwork at https://github.com/Hasebul/MapAgent.
This paper presents the development of a novel ethical reasoning framework for robots. "Robots can feel" is the first system for robots that utilizes a combination of logic and human-like emotion simulation to make decisions in morally complex situations akin to humans. The key feature of the approach is the management of the Emotion Weight Coefficient, a customizable parameter to assign the role of emotions in robot decision-making. The system aims to serve as a tool that can equip robots of any form and purpose with ethical behavior close to human standards. Besides the platform, the system is independent of the choice of the base model. During the evaluation, the system was tested on 8 top-up-to-date LLMs (Large Language Models). This list included both commercial and open-source models developed by various companies and countries. The research demonstrated that, regardless of the model choice, the Emotions Weight Coefficient influences the robot’s decision similarly. According to ANOVA analysis, the use of different Emotion Weight Coefficients influenced the final decision in a range of situations, such as in a request for a dietary violation (F (4, 35) = 11.2, p = 0.0001) and in an animal compassion situation (F (4, 35) = 8.5441, p = 0.0001). A demonstration code repository is provided at: https://github.com/TemaLykov/robots_can_feel
Mathematical reasoning is an important research direction in the field of artificial intelligence. This article proposes a novel multi tool application framework for mathematical reasoning, aiming to achieve more comprehensive and accurate mathematical reasoning by utilizing the collaborative effect of large language models (LLMs) and multiple external tools. Firstly, use a Math Tool to perform basic mathematical calculations during the inference process through interaction with LLM. Secondly, Code Tool can generate code fragments that comply with syntax rules and execute them, providing support for complex mathematical problems. Then, through the iterative reasoning of the CoT Tool, the logical coherence and accuracy of mathematical reasoning are enhanced. Ultimately, by using self consistency tools to select the final answer based on different parameters, the consistency and reliability of reasoning are improved. Through the synergistic effect of these tools, the framework has achieved significant performance improvement in mathematical reasoning tasks. We conducted experiments on the NumGLUE Task 4 test set, which includes 220 mathematical reasoning fill in the blank questions. The experimental results showed that, based on Math Tool, Code Tool, and CoT Tool, in Task 4 task,our method achieved an accuracy of 89.09,compared with the GPT3+FewShot baseline, Few Shot+ERNIE-4.0+self consistency improved by 49.09%, and compared with fine-tuning the Fine tuning baseline, Few Shot+ERNIE-4.0+self consistency improved by 52.29%
Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to Tool Overuse, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce SMART (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent's self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce SMART-ER, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop SMARTAgent, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match its 70B counterpart and GPT-4o. Additionally, SMARTAgent generalizes to out-of-distribution test data like GSM8K and MINTQA, maintaining accuracy with just one-fifth the tool calls. These highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.
Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.
Clinicians often rely on data engineers to retrieve complex patient information from electronic health record (EHR) systems, a process that is both inefficient and time-consuming. We propose EHRAgent, a large language model (LLM) agent empowered with accumulative domain knowledge and robust coding capability. EHRAgent enables autonomous code generation and execution to facilitate clinicians in directly interacting with EHRs using natural language. Specifically, we formulate a multi-tabular reasoning task based on EHRs as a tool-use planning process, efficiently decomposing a complex task into a sequence of manageable actions with external toolsets. We first inject relevant medical information to enable EHRAgent to effectively reason about the given query, identifying and extracting the required records from the appropriate tables. By integrating interactive coding and execution feedback, EHRAgent then effectively learns from error messages and iteratively improves its originally generated code. Experiments on three real-world EHR datasets show that EHRAgent outperforms the strongest baseline by up to 29.6% in success rate, verifying its strong capacity to tackle complex clinical tasks with minimal demonstrations.
As Large Language Models (LLMs) evolve into powerful agentic systems, the telecommunications industry’s expansion into AI services necessitates industry-grounded benchmarks to evaluate their underexplored domain-specific capabilities. To address the gap left by generic benchmarks that fail to assess realistic, non-English performance, we present TelAgent-Bench, a Korean benchmark for the telecommunications domain evaluating five core agen-tic capabilities: Reasoning, Planning, Action (tool-use), Retrieval-Augmented Generation, and Instruction Following. Evaluations reveal significant performance disparities between models that employ explicit reasoning and those that do not, providing actionable insights for deploying agentic LLMs in real-world telecommunications tasks.
Large language models (LLMs) have shown great promise in automating data science workflows. However, existing models still struggle with multi-step reasoning and tool use, limiting their effectiveness on complex data analysis tasks. To address this limitation, we propose a scalable pipeline that extracts high-quality, tool-based data analysis tasks and their executable multi-step solutions from real-world Jupyter notebooks and associated data files. Using this pipeline, we introduce NbQA, a large-scale dataset of standardized task–solution pairs that reflect authentic tool-use patterns in practical data science scenarios. To further enhance the multi-step reasoning capabilities, we present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning. During inference, Jupiter combines the value model and node visit counts to efficiently collect executable multi-step plans with minimal search steps. Experimental results show that Qwen2.5-7B and 14B-Instruct models on NbQA solve 77.82% and 86.38% of tasks on InfiAgent-DABench, respectively—matching or surpassing GPT-4o and advanced agent frameworks. Further evaluations demonstrate improved generalization and stronger tool-use reasoning across diverse multi-step reasoning tasks.
Connecting Large Language Models (LLMs) with the ability to leverage APIs (Web Search, Charting, Calculators, Calendar, Flight Search, Hotel Search, Data Lookup, etc. ) is likely to allow us to solve a variety of new hard problems. Several research efforts have made this observation and suggested recipes for LLMs to emit API calls, and proposed mechanisms by which they can generate additional text conditioned on the output for the API call. However, in practice, the focus has been on relatively simple slot-filling tasks that make an API call rather unlocking novel capabilities by combining different tools, reasoning over the response from a tool, making multiple invocations, or complex planning. In this paper, we pose the following question: what does it mean to say that an LLM is proficient at using a set of APIs? We answer this question in the context of structured APIs by defining seven capabilities for API-use. We provide an approach for generating synthetic tasks that exercise each of these capabilities given only the description of an API. We argue that this provides practitioners with a principled way to construct a dataset to evaluate an LLM's ability to use a given set of APIs. Through human evaluations, we show that our approach produces high-quality tasks for each of the seven capabilities. We also describe how we used this approach to on-board new API and create principled evaluation sets for multiple LLM-based products.
Augmenting large language models (LLMs) with external tools has emerged as a promising approach to extend their utility, enabling them to solve practical tasks. Previous methods manually parse tool documentation and create in-context demonstrations, transforming tools into structured formats for LLMs to use in their step-by-step reasoning. However, this manual process requires domain expertise and struggles to scale to large toolsets. Additionally, these methods rely heavily on ad-hoc inference techniques or special tokens to integrate free-form LLM generation with tool-calling actions, limiting the LLM's flexibility in handling diverse tool specifications and integrating multiple tools. In this work, we propose AutoTools, a framework that enables LLMs to automate the tool-use workflow. Specifically, the LLM automatically transforms tool documentation into callable functions, verifying syntax and runtime correctness. Then, the LLM integrates these functions into executable programs to solve practical tasks, flexibly grounding tool-use actions into its reasoning processes. Extensive experiments on existing and newly collected, more challenging benchmarks illustrate the superiority of our framework. Inspired by these promising results, we further investigate how to improve the expertise of LLMs, especially open-source LLMs with fewer parameters, within AutoTools. Thus, we propose the AutoTools-Learning approach, training the LLMs with three learning tasks on 34k instances of high-quality synthetic data, including documentation understanding, relevance learning, and function programming. Fine-grained results validate the effectiveness of our overall training approach and each individual task. Our methods are an important step towards the use of LLMs for solving real-world tasks with external tools.
Large language models have shown strong capabilities in performing natural language planning tasks, largely due to the chain-of-thought method, which enhances their ability to solve complex tasks through explicit intermediate inference. However, they face challenges in acquiring new knowledge, executing calculations, and interacting with the environment. Although previous work has enabled large language models to use external tools to improve reasoning and environmental interaction, there was no scalable or cohesive structure for these technologies. In this paper, we present LLM-Collab, where Collab represents the cooperative interaction between two AI agents, and the large language model plays a key role in the creation of AI agents. For this method, we took large language models as the reasoning core for AI agents and designed two AI agents to cooperate on the planning tasks: One as an analyst for tool selection and phase validation, and the other as an executor of specific tasks. Our method provided a comprehensive list of external tools to facilitate the invocation and integration of agents, ensuring a seamless collaboration process. This paradigm established a unified framework for autonomous task-solving based on massive language models by demonstrating how language communication and tool selection enable multi-agent collaboration.
Recent advances in large language models (LLMs) have enabled impressive progress across diverse tasks, yet interpretability remains a core requirement for deployment in high-stakes domains such as crisis prevention and policy-making. Prior work on event prediction has largely prioritized accuracy, but the reasoning behind model outputs often remains opaque and difficult to audit. In this paper, we propose C3OT, Causality Contextualized Chain-of-Thought, which integrates causal reasoning into an agentic LLM framework using the ReAct paradigm. We design and evaluate multiple prompting strategies-including Causal Chain Learning, Chain-of-Thought, and more nuanced hybrid approaches. Experiments assess both predictive accuracy and interpretability, the latter measured through structured rubrics that capture transparency, causal coherence, and auditability. Results demonstrate that our causal reasoning approach attains competitive predictive performance while producing more transparent and auditable reasoning traces. These findings underscore the value of causal reasoning for enhancing both trustworthiness and robustness in sociopolitical forecasting.
Generative AI systems and rational/active agents continue to struggle with long-horizon multi-step tasks, due to reasoning drift, unstable planning, and use of tools that are unreliable. ReAct-based agents currently are interpretable, but not robust to execution; diffusion-based planners generate smooth motion plans without a clear semantic grounding or tool-awareness. To overcome these shortcomings, this paper offers ReAct-Diffuse, a hybrid agentic-generative model that combines the structured ReAct reasoning with the diffusion-based plan refinement facilitating consistent and dependable autonomous task execution. The architecture consists of a twostage pipeline: A ReAct reasoning component first generates an explicit trace of reasoning and draft action plans; and a temporal-diffusion-refinement mechanism in the second stage denoises these interim plans while optimizing them for coherence, feasibility, and tool-use precision. The resulting location leading curves are implemented using an agentic control loop with feedback-based re-planning and safety constraints. We evaluate the proposed method on standard multi-step reasoning and tool-use benchmarks, e.g ALF World and BabyAI-MiniGrid with the evaluation metrics of plan coherence, execution success rate (ESR) and tool-use accuracy. In experiments3,17-19, the results indicate that ReAct-Diffuse is able to generate plans of 91.3% plan coherence rate, with a 88.7% execution success and with a 92.5% tool-use accuracy all outperforms state-of-the-art agentic systems including ReAct-GPT-4, Auto-GPT, Voyager and diffusion-only planners. These results demonstrate that complementing our explicit agentic reasoning with diffusion-based refinement considerably improves long-horizon autonomy, execution stability, and decision reliability in dynamic environments.
The electricity sector transition requires substantial increases in residential demand response capacity, yet Home Energy Management Systems (HEMS) adoption remains limited by user interaction barriers requiring translation of everyday preferences into technical parameters. While large language models have been applied to energy systems as code generators and parameter extractors, no existing implementation deploys LLMs as autonomous coordinators managing the complete workflow from natural language input to multi-appliance scheduling. This paper presents an agentic AI HEMS where LLMs autonomously coordinate multi-appliance scheduling from natural language requests to device control, achieving optimal scheduling without example demonstrations. A hierarchical architecture combining one orchestrator with three specialist agents uses the ReAct pattern for iterative reasoning, enabling dynamic coordination without hardcoded workflows while integrating Google Calendar for context-aware deadline extraction. Evaluation across three open-source models using real Austrian day-ahead electricity prices reveals substantial capability differences. Llama-3.3-70B successfully coordinates all appliances across all scenarios to match cost-optimal benchmarks computed via mixed-integer linear programming, while other models achieve perfect single-appliance performance but struggle to coordinate all appliances simultaneously. Progressive prompt engineering experiments demonstrate that analytical query handling without explicit guidance remains unreliable despite models'general reasoning capabilities. We open-source the complete system including orchestration logic, agent prompts, tools, and web interfaces to enable reproducibility, extension, and future research.
The development of the artificial intelligence (AI) system has advanced past simple reactivity-supported and rule-based agents to high-level learning mechanisms. The current shift that is emerging is associated with agentic AI systems, that are autonomous, goal-directed, and reason based and adaptive decision-making. The paper introduces a formal transition model that is used in developing agentic agents of the next generation. We describe the key characteristics, architectural requirements, technologies which enable this transformation as well aspects of ethics behind this transformation. In order to determine the validity of the framework, we provide a case study in the field of autonomous flying rebooking, showcasing how the use of multi-agent orchestration and goal planning can limit the user intervention and make the system more adaptive and better functioning. The effectiveness of our approach is confirmed by a comparative assessment of its work with well-known agentic systems (AutoGPT, ReAct, and CAMEL) in such important parameters as task success, autonomy, and coordination. The present work is contributing the potential to develop ethical and practical grounds of improving intelligent, scalable, and human-aligned AI systems.
The emergence of Agentic IoT, where autonomous intelligent agents such as mobile robots, UAVs, and industrial actuators independently execute complex missions, demands communication and security configurations that can adapt to both fast mission-driven changes and slower environment-driven performance drifts. Existing control paradigms are inadequate. Specifically, static policies cannot react to real-time variations, while task-aware adaptive policies largely overlook environmental dynamics, leaving systems vulnerable to network degradation and latency spikes. To address these limitations, we propose the Dynamic Model Context Protocol (dMCP), a cognitive control framework that bridges high-level mission intents with lowlevel system configurations via the standardized MCP interface. dMCP employs a Large Language Model to reason over real-time mission and environment contexts, generating executable policy vectors. An event-driven trigger mechanism re-evaluates policies upon abrupt mission changes or significant environmental drifts, ensuring timely adaptation without overreacting to transient fluctuations. Simulation results demonstrate that dMCP achieves higher reliability, reduced tail latency, and improved Service Level Objective compliance compared with both static and taskaware adaptive baselines, making it a viable control paradigm for highly dynamic Agentic IoT deployments.
Large language models (LLMs) have shown great potential in automated penetration testing (PT), but they need to rely on external knowledge to obtain precise output in complex command generation tasks. Although the traditional retrievalaugmented generation (RAG) can supplement external evidence to alleviate the hallucination problem, its static, single-round retrieval mechanism struggles to meet the dynamic knowledge needs of PT: the initial retrieval results may not cover the knowledge gaps generated during the reasoning process. To address it, this paper proposes a Reason-in-Document framework based on Agentic RAG, which transforms a single query into multi-step dynamic retrieval reasoning. Following the ReAct paradigm, the agent identifies knowledge gaps in the reasoning process, constructs targeted queries, and retrieves knowledge using the designated tools. By this mechanism, the system can gradually evolve the initial single query into a semantically progressive sub-query, enabling layer-by-layer knowledge refinement from conceptual-level understanding to implementation-level details.In this paper, a prototype system is implemented on the basis of PentestGPT, a structured knowledge base is constructed, and a variety of retrieval tool sets are designed. In the experiments of 22 real vulnerability scenarios, the Agentic RAG framework improves the accuracy of penetration command generation by 36% compared with the traditional RAG, which significantly enhances the knowledge utilization rate and the ability to generate complex commands. It provides a more efficient and accurate command generation mechanism for intelligent penetration testing.
Abstract Motivated by the astonishing capabilities of large language models (LLMs) in text-generation, reasoning, and simulation of complex human behaviors, in this paper, we propose a novel multi-component LLM-based framework, namely LLM4ACOE, that fully automates the collaborative ontology engineering (COE) process using role-playing simulation of LLM agents and retrieval augmented generation (RAG) technology. The proposed solution enhances the LLM-powered role-playing simulation with RAG ‘feeding’ the LLM with three different types of external knowledge. This knowledge corresponds to the knowledge required by each of the COE roles (agents), using a component-based framework, as follows: (a) domain-specific data-centric documents, (b) OWL documentation, and (c) ReAct guidelines. The aforementioned components are evaluated in combination, with the aim of investigating their impact on the quality of generated ontologies. The aim of this work is twofold, (a) to identify the capacity of LLM-based agents to generate acceptable (by human-experts) ontologies through agentic collaborative ontology engineering (ACOE) role-playing simulation, at specific levels of acceptance (accuracy, validity, and expressiveness of ontologies) without human intervention and (b) to investigate whether and/or to what extent the selected RAG components affect the quality of the generated ontologies. The evaluation of this novel approach is performed using ChatGPT-o in the domain of search and rescue (SAR) missions. To assess the generated ontologies, quantitative and qualitative measures are employed, focusing on coverage, expressiveness, structure, and human involvement.
No abstract available
The financial advisory profession demands extreme precision and speed in decision-making, compounded by the complexity of modern capital markets software. This often leads to high training overhead and reduces the time financial advisors can dedicate to client relations. This paper introduces an Agentic AI Co-Pilot designed as a significant architectural advancement beyond traditional Retrieval-Augmented Generation (RAG) systems. The core framework leverages a specialized Enterprise AI Flow to orchestrate a modular, decoupled agent architecture. The system's central component, the Reasoning and Action Agent (RAA), which implements the ReAct (Reasoning and Acting) paradigm that executes a fusion of explicit reasoning and external tool-use. This modularity allows the agent to: (1) interpret complex natural language queries, (2) articulate an internal step-by-step plan via Chain-of-Thought (CoT), and (3) autonomously execute a sequence of decoupled, modular API tools to perform high-stakes operations. This architectural separation ensures the seamless and incremental expansion of capabilities (e.g., integrating a risk-check API or a financial market forecasting module) without the need for retraining the core reasoning model. By providing both traceability and automated execution across complex workflows, the solution aims to substantially improve operational efficiency, enhance compliance through traceable decisions, and elevate the user experience in the highly regulated financial ecosystem.
Intelligent agent-driven research co-pilots, leveraging advances in generative AI, are transforming how scientists access biomedical knowledge. This paper presents Med.ai ASK, an agentic question-answering system designed to address biomedical inquiries through dynamic retrieval augmentation and tool-driven reasoning. We aim to develop a system capable of parsing the nuance in biomedical scientists’ research questions to provide reliable, grounded responses that are more accurate than other generative AI solutions. We adopt the ReAct framework’s tool-calling architecture and leverage atomic reasoning from Self-Discover to build Med.ai ASK. It selectively queries multiple biomedical knowledge bases and employs map-reduce tools for vector database retrieval, alongside external API and NER tool integration. We ingested 44 million biomedical documents from diverse sources. The agent is evaluated on a range of biomedical question-answering datasets. Human evaluation on an internal dataset shows strong performance and stability. Ratings from a large language model are aligned with human assessments, supporting its use in further experiments. Automatic evaluations indicate superior performance in long-form answers regarding accuracy, faithfulness, factuality, and reduced hallucinations. For short-form and multiple-choice answers, performance is competitive with state-of-the-art systems. The agent’s detailed answers are more interpretable than other systems attributed to its agentic design. The agent effectively selects tools based on question type and is deployed in a production-level chat platform with over 1600 users and 25 000 answered questions. Med.ai ASK dynamically orchestrates biomedical information retrieval tools to deliver robust interpretative, accurate, and factual answers, which is crucial in the biomedical domain.
This study investigates the feasibility of developing an automated solution that can generate dynamic decision tables from business process model and notation (BPMN)models using agentic artificial intelligence (AI). This work purpose is to reduce human error, inconsistencies and cognitive biases that can be introduced by traditional decision management in BPMN environments, which is often achieved by manually creating decision model and notation (DMN) decision tables (Richter et al., 2025). A novel AI-based solution is developed to generate dynamic decision tables from BPMN models. The proposed system integrates large language models within an agentic AI framework that autonomously analyses BPMN processes, identifies decision points and produces optimized DMN tables. The system employs agents for BPMN analysis, decision extraction, rule generation and validation, coordinated through a ReAct (Reasoning + Acting) engine with retrieval-augmented generation (RAG) capabilities (Zhang et al., 2025; Braunschweiler et al., 2025). Experimental evaluation of critical applications demonstrated that the system enhances decision-making by suggesting decision tables with values that humans might not intuitively identify. The system optimizes processes by transforming ambiguous paths into precise decisions. The framework is particularly effective in identifying non-obvious decision criteria and threshold parameters, resulting in significant process automation improvements. This approach establishes the foundation for intelligent, adaptive decision support systems within mission-critical environments and autonomous decision modeling that can dynamically adapt to evolving business requirements. This innovative approach represents the implementation of agentic AI specifically designed for automated DMN decision table generation from BPMN models, addressing a gap in the literature.
The proliferation of large language model-based agentic systems necessitates rigorous systems engineering approaches to context management. Contemporary frameworks, including Retrieval-Augmented Generation (RAG), ReAct, AutoGPT, and LangGraph, demonstrate autonomous capabilities but lack formal system specifications for context lifecycle, provenance tracking, and governance enforcement. This paper presents a systems engineering framework formalizing cognitive orchestration as a layered architecture with explicit invariants, interface contracts, and verification protocols. We introduce formal system models defining context as C = (K, M, P, T, V) with mathematical invariants ensuring consistency, completeness, and auditability. Our framework integrates Model Context Protocol (MCP) interfaces, establishing standardized contracts for agent coordination, memory management, and policy enforcement. Comparative analysis reveals systematic limitations in existing frameworks: RAG lacks multi-step context propagation (hallucination amplification 3.2×), ReAct exhibits unbounded memory growth (O(n²) with interaction length), AutoGPT suffers governance gaps (31% compliance violations), and LangGraph provides insufficient provenance tracking (34% audit coverage). Empirical validation through enterprise deployment, Annual Report Financial Analysis system processing 500+ documents across 15 regulatory frameworks, demonstrates quantifiable improvements: 94% reduction in compliance violations, 89% decrease in error propagation, 98% provenance completeness, and 3.1× mean time between failures compared to baseline architectures. System verification confirms invariant preservation across 10,000+ agent interactions with zero safety violations. This work establishes cognitive orchestration as essential infrastructure for production-grade agentic systems, providing formal foundations, architectural blueprints, and verification methodologies applicable across enterprise automation, financial analysis, regulatory compliance, and safety-critical domains.
Maintenance of mission-critical industrial assets is frequently hindered by fragmented data, inconsistent record-keeping, and limited access to analytical expertise, resulting in reactive rather than predictive practices. We present \textit{CodeReAct}, an AI-powered agentic framework deployed in large-scale facilities to automate event analysis and work order (WO) management.CodeReAct extends the ReAct paradigm by embedding executable Python code within the Thought--Action--Observation (TAO) loop, enabling natural language interaction, grounding heterogeneous alerts and work orders into structured Business Objects (BOs), and dynamically invoking analytic functions for forecasting, anomaly correlation, and maintenance recommendations. This architecture reduces manual data science intervention, improves adaptability, and supports reuse across asset types. Deployed in a mission-critical data center and productionized in Maximo, CodeReAct manages pumps, chillers, AHUs, compressors, cooling towers, and other mechanical and electrical systems. Evaluation with 36 representative maintenance utterances showed that outer-loop reflection and adaptive temperature improved task completion by up to 20%, while ablation studies confirmed the importance of reasoning in addition to code execution. Business validation revealed seasonal failure patterns, bundling opportunities, and predictive accuracy trends. In production, site engineers reported 25--40% faster diagnostics, fewer unplanned downtime events, and reduced reliance on specialized analysts. Lessons learned highlight the importance of structured BOs for grounding analytics, runtime safeguards to mitigate hallucinations, and adaptive model control for consistent execution. These results demonstrate how deployed agentic AI can deliver measurable business value in predictive and strategic maintenance planning.
Individuals entering Vietnam’s dynamic Information Technology (IT) job market face a critical gap in reliable career guidance. Existing market reports are often outdated, while the manual analysis of thousands of job postings is impractical for most. To address this challenge, we present the AI Job Market Consultant, a novel conversational agent that delivers deep, data-driven insights directly from the labor market in realtime. The foundation of our system is a custom-built dataset created via an automated pipeline that crawls job portals using Playwright and leverages the Large Language Model (LLM) to intelligently structure unstructured posting data. The core of our system is a tool-augmented AI agent, based on the ReAct agentic framework, which enables the ability of autonomously reasoning, planning, and executing actions through a specialized toolbox for SQL queries, semantic search, and data visualization. Our prototype successfully collected and analyzed 3,745 job postings, demonstrating its ability to answer complex, multistep queries, generate on-demand visualizations, and provide personalized career advice grounded in real-world data. This work introduces a new paradigm for labor market analysis, showcasing how specialized agentic AI systems can democratize access to timely, trustworthy career intelligence for the next generation of professionals.
University students face significant challenges in managing academic demands, which often lead to procrastination, stress, and diminished mental well-being. To address this, we developed a proactive AI-based assistant designed to support student productivity and health. The application leverages Large Language Models (LLMs) and an agentic AI framework based on the ReAct pattern to offer personalized task prioritization, dynamic scheduling, and cognitive load reduction. It uniquely integrates academic, emotional, and biological factors, such as circadian rhythms, to provide holistic, context-aware support. A usability study involving 30 participants showed favorable outcomes, achieving a System Usability Scale (SUS) score of 73.67, a Task Success Rate (TSR) of 83% for the AI scheduling task, average Single Ease Question (SEQ) score of 5.61 on a 7-point scale (where higher is better) indicating that the system is good enough to use. Qualitative feedback highlighted user satisfaction with the system's stability and AI-driven scheduling capabilities. This research presents a novel, adaptive platform that shifts from reactive, siloed educational tools to an anticipatory support system. The findings validate the potential of agentic AI to enhance academic performance, offering a scalable model for future student support in higher education.
The transition of Large Language Models (LLMs) from passive information retrieval interfaces to agentic systems capable of multi-step execution represents a significant paradigm shift in artificial intelligence. However, the reliability of these agents is frequently compromised by stochastic drift, hallucination, and the inability to maintain coherent context over extended planning horizons. This paper proposes a theoretical framework for Reliable Agent Delegation (RAD), focusing on structured elicitation techniques that constrain the probabilistic output of foundation models into deterministic workflows. We analyze role assignment mechanisms, meta-reasoning prompts, and self-corrective failure recovery loops. Drawing upon existing literature in Chain-of-Thought (CoT) reasoning, ReAct frameworks, and formal verification, we posit that imposing rigid syntactic and semantic constraints on elicitation allows for verifiable delegation between orchestrator and worker agents. We discuss the security implications of such architectures, specifically regarding indirect prompt injection and cascading logic failures, and outline a methodology for constructing robust, self-healing agentic systems.
Modern software delivery has accelerated from quarterly releases to multiple deployments per day. While CI/CD tooling has matured, human decision points interpreting flaky tests, choosing rollback strategies, tuning feature flags, and deciding when to promote a canary remain major sources of latency and operational toil. We propose AI-Augmented CI/CD Pipelines, where large language models (LLMs) and autonomous agents act as policy-bounded co-pilots and progressively as decision makers. We contribute: (1) a reference architecture for embedding agentic decision points into CI/CD, (2) a decision taxonomy and policy-as-code guardrail pattern, (3) a trust-tier framework for staged autonomy, (4) an evaluation methodology using DevOps Research and Assessment ( DORA) metrics and AI-specific indicators, and (5) a detailed industrial-style case study migrating a React 19 microservice to an AI-augmented pipeline. We discuss ethics, verification, auditability, and threats to validity, and chart a roadmap for verifiable autonomy in production delivery systems.
Large language models (LLMs) excel at solving complex tasks by executing agentic workflows composed of detailed instructions and structured operations. However, building agents for diverse applications by manually embedding foundation models into agentic systems such as Chain-of-Thought, Self-Reflection, and ReACT through text interfaces limits scalability and efficiency. Recently, researchers have explored automating workflow generation using code-based representations, but most methods depend on labeled data, limiting their applicability to real-world, dynamic hardware design problems. We introduce Polymath, a self-improving agent with a dynamic hierarchical workflow that combines task flow graphs with code-represented workflows to address these challenges. Polymath employs an experience-driven optimization framework that integrates multi-level graph optimization using surrogate scores from historical evaluations with a self-reflection-guided evolutionary algorithm for workflow refinement, enabling unsupervised self-improvement without labeled data. Experiments show that Polymath outperforms a leading commercial agentic system by 16.23% pass@1 and 11.47% pass@3 on hardware benchmarks, and achieves an average 8.1% improvement over state-of-the-art baselines on coding, math, and multi-turn QA tasks.
Innovation in nanophotonics currently relies on human experts who synergize specialized knowledge in photonics and coding with simulation and optimization algorithms, entailing design cycles that are time-consuming, computationally demanding, and frequently suboptimal. We introduce MetaChat, a multi-agentic design framework that can translate semantically described photonic design goals into high-performance, freeform device layouts in an automated, nearly real-time manner. Multistep reasoning is enabled by our Agentic Iterative Monologue paradigm, which coherently interfaces agents with code-based tools, other specialized agents, and human designers. Design acceleration is facilitated by Feature-wise Linear Modulation–conditioned Maxwell surrogate solvers that support the generalized evaluation of metasurface structures. We use freeform dielectric metasurfaces as a model system and demonstrate with MetaChat the design of multiobjective, multiwavelength metasurfaces orders of magnitude faster than conventional methods. These concepts present a scientific computing blueprint for using specialist design agents, surrogate solvers, and human interactions to drive multiphysics innovation and discovery.
We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Agentic Reasoning dynamically leverages web search, code execution, and structured memory to address complex problems requiring deep research. A key innovation in our framework is the Mind-Map agent, which constructs a structured knowledge graph to store reasoning context and track logical relationships, ensuring coherence in long reasoning chains with extensive tool usage. Additionally, we conduct a comprehensive exploration of the Web-Search agent, leading to a highly effective search mechanism that surpasses all prior approaches. When deployed on DeepSeek-R1, our method achieves a new state-of-the-art (SOTA) among public models and delivers performance comparable to OpenAI Deep Research, the leading proprietary model in this domain. Extensive ablation studies validate the optimal selection of agentic tools and confirm the effectiveness of our Mind-Map and Web-Search agents in enhancing LLM reasoning. The code is at: https://github.com/theworldofagents/Agentic-Reasoning
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% on success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on a newly-released WindowsAgentArena benchmark. Code available at https://github.com/simular-ai/Agent-S.
Visualizations play a crucial part in effective communication of concepts and information. Recent advances in reasoning and retrieval augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite its progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning and (4) multimodal report generation. For the evaluation of the generated reports, we develop MultimodalReportBench which contains 100 diverse topics as inputs, and a set of dedicated metrics for report and chart evaluation. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.
Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.
Our research reveals a new privacy risk associated with the vision language model (VLM) agentic framework: the ability to infer sensitive attributes (e.g., age and health information) and even abstract ones (e.g., personality and social traits) from a set of personal images, which we term ''image private attribute profiling.'' This threat is particularly severe given that modern apps can easily access users' photo albums, and inference from image sets enables models to exploit inter-image relations for more sophisticated profiling. However, two main challenges hinder our understanding of how well VLMs can profile an individual from a few personal photos: (1) the lack of benchmark datasets with multi-image annotations for private attributes, and (2) the limited ability of current multimodal large language models (MLLMs) to infer abstract attributes from large image collections. In this work, we construct PAPI, the largest dataset for studying private attribute profiling in personal images, comprising 2,510 images from 251 individuals with 3,012 annotated privacy attributes. We also propose HolmesEye, a hybrid agentic framework that combines VLMs and LLMs to enhance privacy inference. HolmesEye uses VLMs to extract both intra-image and inter-image information and LLMs to guide the inference process as well as consolidate the results through forensic analysis, overcoming existing limitations in long-context visual reasoning. Experiments reveal that HolmesEye achieves a 10.8% improvement in average accuracy over state-of-the-art baselines and surpasses human-level performance by 15.0% in predicting abstract attributes. This work highlights the urgency of addressing privacy risks in image-based profiling and offers both a new dataset and an advanced framework to guide future research in this area.
Recent significant advances in integrating multiple Large Language Model (LLM) systems have enabled Agentic Frameworks capable of performing complex tasks autonomously, including novel scientific research. We develop and demonstrate such a framework specifically for the inverse design of photonic metamaterials. When queried with a desired optical spectrum, the Agent autonomously proposes and develops a forward deep learning model, accesses external tools via APIs for tasks like simulation and optimization, utilizes memory, and generates a final design via a deep inverse method. The framework's effectiveness is demonstrated in its ability to automate, reason, plan, and adapt. Notably, the Agentic Framework possesses internal reflection and decision flexibility, permitting highly varied and potentially novel outputs.
,
Traditional approaches to network management have been accessible only to a handful of highly-trained network operators with significant expert knowledge. This creates barriers for lay users to easily manage their networks without resorting to experts. With recent development of powerful large language models (LLMs) for language comprehension, we design a system to make network management accessible to a broader audience of non-experts by allowing users to converse with networks in natural language. To e!ectively leverage advancements in LLMs, we propose an agentic framework that uses an intermediate representation to streamline configuration across diverse vendor equipment, retrieves the network state from memory in real-time, and provides an interface for external feedback. We also conduct pilot studies to collect real user data of natural language utterances for network control, and present a visualization interface to facilitate dialogue-driven user interaction and enable large-scale data collection for future development. Preliminary experiments validate the e!ectiveness of our proposed system components with LLM integration on both synthetic and real user utterances. Through our data collection and visualization e!orts, we pave the way for more e!ective use of LLMs and democratize network control for everyday users.1
As chemical plants evolve towards full autonomy, the need for effective fault handling and control in dynamic, unpredictable environments becomes increasingly critical. This paper proposes an innovative approach to industrial automation, introducing validation and reprompting architectures utilizing large language model (LLM)-based autonomous control agents. The proposed agentic system, comprising of operator, validator, and reprompter agents, enables autonomous management of control tasks, adapting to unforeseen disturbances without human intervention. By utilizing validation and reprompting architectures, the framework allows agents to recover from errors and continuously improve decision-making in real-time industrial scenarios. We hypothesize that this mechanism will enhance performance and reliability across a variety of LLMs, offering a path toward fully autonomous systems capable of handling unexpected challenges, paving the way for robust, adaptive control in complex industrial environments. To demonstrate the concept's effectiveness, we created a simple case study involving a temperature control experiment embedded on a microcontroller device, validating the proposed approach.
Surgeons exhibit distinct operating styles shaped by training, experience, and motor behavior-yet most surgical AI systems overlook this personalization signal. We propose a novel agentic modeling approach for surgeon-specific behavior prediction in robotic surgery, combining a discrete diffusion framework with a vision-language-action (VLA) pipeline. Gesture prediction is framed as a structured sequence denoising task, conditioned on multimodal inputs including surgical video, intent language, and personalized embeddings of surgeon identity and skill. These embeddings are encoded through natural language prompts using third-party language models, allowing the model to retain individual behavioral style without exposing explicit identity. We evaluate our method on the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture sequences while learning meaningful motion fingerprints unique to each surgeon. To quantify the privacy implications of personalization, we perform membership inference attacks and find that more expressive embeddings improve task performance but simultaneously increase susceptibility to identity leakage. These findings demonstrate that while personalized embeddings improve performance, they also increase vulnerability to identity leakage, revealing the importance of balancing personalization with privacy risk in surgical modeling. Code is available at: https://github.com/huixin-zhan-ai/Surgeon_style_fingerprinting.
A modularized, low-code automation system that follows the premise of the Agentic Artificial Intelligence (AI) internally known as Laila AI has been proposed in this paper which was completely designed, implemented and tested at the Graduate School of Business (GSB), Arab Academy for Science, Technology and Maritime Transport. Laila AI is self-generating, customizing, and real-time academic administrative workflow system with a dynamic interpretation of institutional input systems (LMS /SIS) and stakeholder data. In the system, the layer of the human-in-the-loop control is also implemented to guarantee its transparency, ethical regulatory use, and local adjustability. It was a mixed-methods case study that employed the use of system performance logs and structural surveys (n = 375) in addition to interviews with various stakeholders. The results indicate a high level of efficiency in operation: a more than 50% decrease in the time of task accomplishment, and automation of up to 70% of the assessment processes. More than 80% of the time, academic leadership reacted to strategic alerts within 48 hours. Qualitative information resonated with perceived gain in fairness, explainability, and trust among the stakeholders. Override and justification features provided active involvement of human reviewers, which supported the ethical dimension of governance. The above-stated findings assert that Laila AI encompasses a dualistic model of governance putting together independent decision reasoning and in-built moral control. Being a transparent, ethically controlled distributed digital form of administration, it presents a model that can be transferred to resource-restricted establishment of higher learning in disparate working environments.
No abstract available
Agentic AI describes the use of LLMs in novel AI agents that can answer questions or collaborate to achieve goals. These LLM agents can be used to build a novel generation of recommender systems. However, little is known about the LLM agents or their relationships needed to provide recommendations. Once identified, a framework can be constructed. Moreover, evaluating this framework is still not well understood. In this paper, we propose an agentic AI-based, multi-agent framework for recommender systems. We first identify LLM agents proposed in the literature, followed by the identification of their relationships and we propose a framework to represent them. Next, we evaluate this framework with respect to the LLM agents and functionalities of a recommender system based on published studies. This study is a stepping stone in a novel paradigm shift in the construction of recommender systems.
Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce \textbf{Search-o1}, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks, paving the way for more reliable and versatile intelligent systems. The code is available at \url{https://github.com/sunnynexus/Search-o1}.
Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner's tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.
: Smart buildings represent a significant trend in the future of the construction industry. The performance of human-computer interaction plays a vital role in achieving this from a human perspective. However, existing human-computer interaction algorithms are often limited to simple commands and fail to meet the complex and diverse needs of users. To address this issue, this paper introduces large language models (LLMs) and AI agents into smart buildings, proposing a general AI agent framework based on the ReAct strategy. The LLM serves as the system’s brain, responsible for reasoning and action planning, while tool calling mechanism puts the LLM’s plans into practice. Through this framework, developers can rely on prompt engineering alone to enable the LLM to interpret user intent accurately, perform appropriate actions
No abstract available
Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing capability for complex tasks through dynamic retrieval and adaptive workflows. Recent advances (e.g., Search-R1) have shown that outcome-supervised reinforcement learning demonstrate strong performance. However, this approach still suffers from inefficient exploration, sparse reward signals, and ambiguous global reward feedback. To address these challenges, we propose DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating decision-making and execution, while introducing an efficient pruning strategy to optimize data expansion. Through comprehensive process-level policy optimization, DecEx-RAG significantly enhances the autonomous task decomposition, dynamic retrieval, and high-quality answer generation capabilities of large language models (LLMs). Experiments show that DecEx-RAG achieves an average absolute performance improvement of $6.2\%$ across six datasets, significantly outperforming existing baselines. Moreover, the pruning strategy improves data construction efficiency by nearly $6 \times$, providing an efficient solution for process-supervised RAG training. The code is available at https://github.com/sdsxdxl/DecEx-RAG.
Generative artificial intelligence (AI) and large language models (LLMs) are reshaping the landscape of intelligent educational systems; however, existing solutions often suffer from unstructured resource organization, limited interpretability, and suboptimal retrieval precision. To address these challenges, this study introduces KA-RAG, a course-oriented question answering (QA) framework that integrates a structured Knowledge Graph (KG) with an Agentic Retrieval-Augmented Generation (Agentic-RAG) workflow. The system incorporates a responsive interface, a unified agent controller (ToolPlanner), a course knowledge graph, and a vector-based retrieval subsystem. By combining symbolic graph reasoning with dense semantic retrieval, the proposed dual-retrieval strategy supports interpretable, context-aware responses to course-related queries. Experiments conducted on a graduate-level Pattern Recognition course demonstrate that KA-RAG achieves a retrieval accuracy of 91.4%, semantic consistency of 87.6%, and an average response latency of 2.8 s. User surveys further reveal significant improvements in learning efficiency and satisfaction. The results validate the feasibility of integrating KG and Agentic-RAG techniques for knowledge-grounded educational applications, offering a practical pathway toward intelligent knowledge organization and interactive learning support.
Although Vision Language Models (VLMs) have shown generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text–image search, enabling retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering.
The aim of this research paper is to present a novel approach of using Relevance Generative Answering (RGA) in the trending field of Agentic Retrieval Augmented Generation (RAG). The paradigm shift in the RAG system by the introduction of Agentic RAG has opened a new research paradigm. The major issue of hallucination is overcome with the use of a traditional RAG system with some limitations like accuracy and relevance, lack of reasoning, the lost in the middle problem, etc. The Agentic RAG system attempts to address a few of these limitations. However, interpreting results based on the user's intent remains a significant area of research. This research aimed to understand user intent by introducing relevance detection block in the proposed architecture. Different performance metrics like precision, recall, F1 score, relevance, latency are used to validate the proposed approach. The results presented in this research reveal that the performance of the proposed system is much more relevant compared to agentic RAG system. For context and intent specific applications proposed framework suits well.
Code smells—subtle indicators of poor design choices—pose significant challenges to software maintainability and readability, particularly in dynamic languages such as Python. Traditional detection methods, including rule-based heuristics and static machine learning classifiers, often suffer from limited adaptability, poor contextual awareness, and lack of explainability. These limitations hinder their effectiveness in evolving codebases and real-world development environments. This study introduces a novel Agentic retrieval-augmented generation (Agentic RAG) framework for code smell detection, marking the first application of agentic reasoning in this domain. By embedding autonomous agents into the retrieval and reasoning pipeline, the proposed system dynamically routes queries, selects optimal retrieval strategies, and synthesizes context-aware explanations using large language models (LLMs). Unlike static classifiers, the proposed framework leverages hybrid retrieval (sparse + dense) and structured prompting to detect and explain Long Method and Large Class smells with high interpretability. Experimental results demonstrate that Agentic RAG—particularly when paired with DeepSeek and chain-of-thought prompting—achieves superior performance, with 89.5% accuracy, a macro F1-score of 78.3%, and a weighted F1 of 88.7%. To assess generalization, Experiment 2 extended the framework to 21 distinct code smell types across multiple programming languages, achieving 94.85% accuracy, a macro F1-score of 90.24%, and a weighted F1-score of 94.93% through stratified five-fold cross-validation, thereby confirming the model’s robustness and scalability. Beyond academic benchmarks, this work lays the foundation for real-world integration into developer platforms, enabling real-time code review, contextual feedback, and actionable refactoring suggestions. By bridging LLMs with dynamic retrieval and agentic reasoning, this framework advances the frontier of intelligent software quality assurance.
Objective The practice of evidence-based medicine can be challenging when relevant data are lacking or difficult to contextualize for a specific patient. Large language models (LLMs) could potentially address both challenges by summarizing published literature or generating new studies using real-world data. Materials and Methods We submitted 50 clinical questions to five LLM-based systems: OpenEvidence, which uses an LLM for retrieval-augmented generation (RAG); ChatRWD, which uses an LLM as an interface to a data extraction and analysis pipeline; and three general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini 1.5 Pro). Nine independent physicians evaluated the answers for relevance, quality of supporting evidence, and actionability (i.e., sufficient to justify or change clinical practice). Results General-purpose LLMs rarely produced relevant, evidence-based answers (2–10% of questions). In contrast, RAG-based and agentic LLM systems, respectively, produced relevant, evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. OpenEvidence produced actionable results for 48% of questions with existing evidence, compared to 37% for ChatRWD and <5% for the general-purpose LLMs. ChatRWD provided actionable results for 52% of questions that lacked existing literature compared to <10% for other LLMs. Discussion Special-purpose LLM systems greatly outperformed general-purpose LLMs in producing answers to clinical questions. Retrieval-augmented generation-based LLM (OpenEvidence) performed well when existing data were available, while only the agentic ChatRWD was able to provide actionable answers when preexisting studies were lacking. Conclusion Synergistic systems combining RAG-based evidence summarization and agentic generation of novel evidence could improve the availability of pertinent evidence for patient care.
Retrieval-augmented generation (RAG) has emerged as a pivotal technology in natural language processing, owing to its efficacy in generating factual content. However, its informative inputs and complex paradigms often lead to a greater variety of errors. Consequently, achieving automated on-policy assessment and error-oriented correction remains an unresolved issue. In this paper, we propose RAG-Critic, a novel framework that leverages a critic-guided agentic workflow to improve RAG capabilities autonomously. Specifically, we initially design a data-driven error mining pipeline to establish a hierarchical RAG error system. Based on this system, we progressively align an error-critic model using a coarse-to-fine training objective, which automatically provides fine-grained error feedback. Finally, we design a critic-guided agentic RAG workflow that cus-tomizes executor-based solution flows based on the error-critic model’s feedback, facilitating an error-driven self-correction process. Experimental results across seven RAG-related datasets confirm the effectiveness of RAG-Critic, while qualitative analysis offers practical insights for achieving reliable RAG systems. Our dataset and code are available at https: //github.com/RUC-NLPIR/RAG-Critic .
Agentic Generative AI, powered by Large Language Models (LLMs) and enhanced with Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), represents a transformative technology applicable across specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. This technology excels at inferring relationships within vast unstructured or semi-structured datasets. The legal domain we focus on here comprises inherently complex data characterized by extensive, interrelated, and semi-structured knowledge systems with complex relations. It comprises constitutions, statutes, regulations, and case law. Extracting insights and navigating the intricate networks of legal documents and their relations is crucial for effective legal research and decision-making. Here, we introduce a generative AI system, a jurisdiction-specific legal information retrieval that integrates RAG, VS, and KG, constructed via Hierarchical Non-Negative Matrix Factorization (HNMFk), to enhance information retrieval and AI reasoning and minimize hallucinations. In the legal system, these technologies empower AI agents to identify and analyze complex connections among cases, statutes, and legal precedents, uncovering hidden relationships and predicting legal trends—challenging tasks essential for ensuring justice and improving operational efficiency. Our system employs web scraping techniques to systematically collect legal texts, such as statutes, constitutional provisions, and case law, from publicly accessible platforms like Justia. It bridges the gap between traditional keyword-based searches and contextual understanding by leveraging advanced semantic representations, hierarchical relationships, and latent topic discovery. This approach is demonstrated in legal document clustering, summarization, and cross-referencing tasks. The framework marks a significant step toward augmenting legal research with scalable, interpretable, and accurate retrieval methods for semi-structured data, advancing the intersection of computational law and artificial intelligence.
This article explores the application of Retrieval-Augmented Generation (RAG) to enhance the creation of knowledge assets and develop actionable insights from complex datasets. It begins by contextualising the limitations of large language models (LLMs), notably their knowledge cut-offs and hallucination tendencies, and it will present RAG as a promising solution that integrates external knowledge retrieval to improve factual accuracy and relevance. This study reviews current RAG architectures, including naïve and advanced models, emphasising techniques such as optimised indexing, query refinement, metadata utilisation, and the incorporation of autonomous AI agents in agentic RAG systems. Methodologies for effective data preprocessing, semantic-aware chunking, and retrieval strategies—such as multihop retrieval and reranking—are also discussed to address challenges such as irrelevant retrieval and semantic fragmentation. This work further examines embedding models, notably the use of state-of-the-art vector representations, to facilitate precise similarity searches within knowledge bases. A case study demonstrates the deployment of an RAG pipeline for analysing multisheet datasets, highlighting challenges in data structuring, prompt engineering, and ensuring output consistency.
Multimodal Retrieval Augmented Generation (MRAG) systems have shown promise in enhancing the generation capabilities of multimodal large language models (MLLMs). However, existing MRAG frameworks primarily adhere to rigid, single-step retrieval strategies that fail to address real-world challenges of information acquisition and query reformulation. In this work, we introduce the task of Multimodal Retrieval Augmented Generation Planning (MRAG Planning) that aims at effective information seeking and integration while minimizing computational overhead. Specifically, we propose CogPlanner, an agentic plug-and-play framework inspired by human cognitive processes, which iteratively determines query reformulation and retrieval strategies to generate accurate and contextually relevant responses. CogPlanner supports parallel and sequential modeling paradigms. Furthermore, we introduce CogBench, a new benchmark designed to rigorously evaluate the MRAG Planning task and facilitate lightweight CogPlanner integration with resource-efficient MLLMs, such as Qwen2-VL-7B-Cog. Experimental results demonstrate that CogPlanner significantly outperforms existing MRAG baselines, offering improvements in both accuracy and efficiency with minimal additional computational costs.
Retrieval-Augmented Generation (RAG) systems often face limitations in specialized domains such as fintech, where domain-specific ontologies, dense terminology, and acronyms complicate effective retrieval and synthesis. This paper introduces an agentic RAG architecture designed to address these challenges through a modular pipeline of specialised agents. The proposed system supports intelligent query reformulation, iterative sub-query decomposition guided by keyphrase extraction, contextual acronym resolution, and cross-encoder-based context re-ranking. We evaluate our approach against a standard RAG baseline using a curated dataset of 85 question–answer–reference triples derived from an enterprise fintech knowledge base. Experimental results demonstrate that the agentic RAG system outperforms the baseline in retrieval precision and relevance, albeit with increased latency. These findings suggest that structured, multi-agent methodologies offer a promising direction for enhancing retrieval robustness in complex, domain-specific settings.
Maintaining compliance with complex Know Your Customer (KYC) and Anti-Money Laundering (AML) regulations is a resource-intensive challenge for financial institutions. This paper presents an agentic AI approach that leverages Retrieval-Augmented Generation (RAG) to automate and enhance compliance research and decision-making. We define the inefficiencies in current U.S. KYC/AML compliance workflows – including lengthy onboarding times and costly manual processes – as motivation for a more dynamic solution. We then introduce an autonomous agent framework, implemented with LangChain, that integrates a RAG pipeline to perform contextual reasoning over regulatory knowledge bases. The technical architecture is detailed with an emphasis on the agent’s planning and tool use capabilities, and the RAG components for knowledge base construction (using U.S. regulations such as FinCEN guidance, Code of Federal Regulations (CFR) provisions, and OFAC sanctions data), transformer-based embedding and indexing, vector retrieval, and LLM-driven answer generation. We demonstrate how this agent can handle compliance queries (e.g., customer due diligence requirements and detection of transaction structuring) in a simulated proof-of-concept. We discuss key advantages of this approach over traditional rule-based or static NLP systems – notably greater adaptability to changing regulations, improved traceability via source citations, and higher precision in complex scenario handling. Finally, we address ethical considerations (hallucination risk, ensuring regulatory accuracy, and model governance) and explore practical applications such as automated audit support, compliance report drafting, and future directions including real-time monitoring and multimodal compliance agents.
Leveraging the autonomous decision-making capabilities of large language models (LLMs) has demonstrated superior performance in reasoning tasks. However, despite the success of iterative or agentic retrieval-augmented generation (RAG) techniques, these methods are often constrained to a single solution space when confronted with complex problems. In this paper, we propose a novel thinking pattern in RAG that integrates autonomous strategic planning with efficient reasoning actions, significantly activating intrinsic reasoning capabilities and expanding the solution space of specific tasks via Monte Carlo Tree Search (MCTS), which we refer to as AirRAG. Specifically, our approach designs five fundamental reasoning actions, which are expanded to a broad tree-based reasoning space using MCTS. The approach also incorporates self-consistency verification to explore potential reasoning paths and inference scaling law. Additionally, computationally optimal strategies are employed to allocate more inference resources to key actions, thereby enhancing overall performance. Experimental results demonstrate the effectiveness of AirRAG, showing significant performance gains on complex question-answering datasets. Furthermore, AirRAG is flexible and lightweight, making it easy to integrate with other advanced technologies and models.
Given a semi-structured knowledge base (SKB), where text documents are interconnected by relations, how can we effectively retrieve relevant information to answer user questions? Retrieval-Augmented Generation (RAG) retrieves documents to assist large language models (LLMs) in question answering; while Graph RAG (GRAG) uses structured knowledge bases as its knowledge source. However, many questions require both textual and relational information from SKB - referred to as"hybrid"questions - which complicates the retrieval process and underscores the need for a hybrid retrieval method that leverages both information. In this paper, through our empirical analysis, we identify key insights that show why existing methods may struggle with hybrid question answering (HQA) over SKB. Based on these insights, we propose HybGRAG for HQA consisting of a retriever bank and a critic module, with the following advantages: (1) Agentic, it automatically refines the output by incorporating feedback from the critic module, (2) Adaptive, it solves hybrid questions requiring both textual and relational information with the retriever bank, (3) Interpretable, it justifies decision making with intuitive refinement path, and (4) Effective, it surpasses all baselines on HQA benchmarks. In experiments on the STaRK benchmark, HybGRAG achieves significant performance gains, with an average relative improvement in Hit@1 of 51%.
Agentic Retrieval Augmented Generation (RAG) and'deep research'systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and nonreasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.
Real-world data in domains such as finance and fraud detection can be rare, imbalanced, or inaccessible, necessitating synthetic data as a crucial alternative. Gathering and leveraging real-world data in such domains is subject to important challenges such as privacy issues, legality, high cost of annotation, and restricted access due to proprietary ownership. Synthetic data generation in this context offers a meaningful alternative to real data gathering, reducing both privacy and computational costs while allowing for the construction of flexible, scalable datasets. This paper presents a new paradigm for tabular data synthesis through CTGAN (Conditional Tabular GAN) with integration into agentic workflows and retrieval-augmented generation (RAG). The proposed system herein accepts partial data samples and column constraints as inputs from a user-friendly chatbot interface and augment the dataset intelligently through an AI-agent-based generation pipeline. These AI agents aid in the automation of preprocessing, column semantics interpretation, and the enforcement of user-specified constraints specified in natural language, minimizing manual intervention by a considerable margin. The framework further includes ChromaDB to enable semantic retrieval of past relevant datasets. With this semantic memory, the model can improve generation quality, apply schema-level consistency, and update even synthesis of new datasets based on column names or metadata alone. It allows for context-aware, structurally sound, and domain-conformant data generation—without the need to access sensitive or full datasets. The current research utilizes statistical measures like mean, variance, and the Kolmogorov–Smirnov (KS) test to confirm the fidelity of data produced. The approach maintains a mean difference of just 0.16% and a KS statistic of 0.0020, which reflects outstanding statistical consistency with original distributions of data. Preliminary results show significant enhancements in data realism, diversity, and variability without sacrificing domain coherence. The system introduced is particularly well-adapted to financial datasets, such as applications in credit card fraud detection, and offers a scalable, privacy-aware method of synthetic data generation in sensitive or data-scarce environments.
This research-to-practice full paper investigates how Large Language Models (LLMs) with long-context window and enhanced retrieval efficiency can generate context-specific quizzes to address high attrition in engineering education. We aim to enable LLMs to process large, multidisciplinary Artificial Intelligence (AI) datasets, covering topics like Machine Learning, Generative AI, and Neural Networks, from foundational to advanced concepts. A systematic literature review identified gaps in Retrieval-Augmented Generation (RAG) systems, which often retrieve irrelevant chunks due to context limitations, leading to inaccurate or hallucinated responses [1], [2]. Traditional quiz generation lacks modular design, limiting scalability and interpretability. To address this, we developed an agentic long-context RAG architecture using Gemini 1.5's one-million-token window, integrating retrieval, reasoning, and evaluation in a unified pipeline. Our methodology employed a modular Agentic AI system. A Parsing Agent extracts text from academic sources, followed by a Chunking & Storage Agent segmenting content with character overlaps. An Embedding & Indexing Agent generates and indexes vector embeddings, verified by a Verification Agent for topical alignment. For quiz generation, a Retriever Agent uses cosine similarity and multilingual re-ranking, a Selector Agent filters meaningful chunks, a Response Agent leverages cached ground-truth MCQs with an LLM, and an Evaluator Agent assesses outputs. Experiments on a 150-question benchmark showed accuracy improvements: 78.00 % (raw), 84.00 (chunks), $\mathbf{8 9. 3 3 \%}$ (chunks+cache), and $\mathbf{9 3. 3 3 \%}$ ($\mathbf{1 M}$ context+cache) for Gemini, with GPT-4o and Claude Sonnet 3.7 revealing complementary strengths in precision and confidence. Future work includes deploying an interactive quiz application and expanding domain-specific datasets across engineering fields.
The increasing complexity of Radio Access Network (RAN) environments, especially 5G and future 6G infrastructures, has prompted the development of smarter and more flexible network automation infrastructures. As a more advanced form of context-driven decision-making and process automation in wireless networks, Large Language Models (LLMs) have recently been refined using Retrieval-Augmented Generation (RAG). This paper reviews current developments in applying RAG-augmented LLMs to RAN automation, including spectrum and power allocation, fault detection in distributed RANs, and secure 5G/6G multi-agent automation. It also presents comparative studies with more conventional approaches, such as Deep Reinforcement Learning (DRL), and discusses multi-agent systems, graph-based retrieval mechanisms, and agentic AI systems. The review highlights potential limitations, including safety concerns, data management challenges, and scalability issues, as well as future research and implementation directions. The discussion demonstrates the disruptive potential of RAG-enhanced LLMs in reshaping automation and intelligence in next-generation wireless networks.
: Retrieval Augmented Generation has improved LLM question answering significantly. However, this mechanism still produces hallucinations and structural incoherence in knowledge-intensive tasks. Additionally, many existing techniques neither holistically leverage multiple properties of text nor integrate diverse prompting and agenting frameworks. To address these limitations, this paper proposes a novel methodology that extracts and utilizes unstructured and structured properties of text to construct layered RAG pipelines designed to enhance complex LLM reasoning. Our approach synthesizes three distinct RAG methodologies, each specialized in various aspects: textual entity knowledge graph extraction (Textual Entity RAG); community summary and entity generation (Microsoft GraphRAG), and structural link navigation (MetaWiki RAG). By cumulatively layering these techniques along with advanced prompting and agentic evaluation, we aim to capture a more comprehensive context, enabling the model to generate well-structured responses that reflect all relevant attributes of the text. The proposed framework not only enhances existing RAG mechanisms but also demonstrates the effective integration of knowledge graphs. Additionally, it showcases the application of this framework to advanced answer generation using Wikipedia, with extensions to similar knowledge networks. This novel approach offers a robust solution for social recommender systems and other practical applications, delivering holistic outcomes by synthesizing diverse RAG techniques.
Retrieval-Augmented Generation (RAG) systems promise practical legal assistance by grounding Large Language Models (LLMs) in external authority. However, standard RAG optimizes semantic similarity and often fails to respect common-law constraints such as jurisdictional bindingness, court hierarchy, temporal validity, and negative treatment. We propose Precedent- Aware Multi-Agent RAG (PA-MA-RAG), an agentic architecture that decomposes legal research and writing into specialized agents for issue framing, authority planning, retrieval, precedent ranking, conflict resolution, drafting, and citation verification. Our method introduces an authority- constrained re-ranking objective that prioritizes controlling precedents while penalizing overruled or otherwise negatively treated cases. The verifier agent enforces evidence-grounded generation by requiring each legal proposition to be supported by retrieved holdings and quotations. We describe an evaluation protocol for both precedent retrieval and citation-grounded legal analysis generation, including authority correctness, supported-claim rate, and robustness to conflicting precedent.
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieval to improve factual reliability. Traditional RAG employs a fixed, single-pass retrieval process, limiting its ability to handle multi-step reasoning, adaptive queries, and heterogeneous data sources. Agentic RAG extends this framework with autonomous agents that plan, iterate retrieval, integrate tools, and reason over intermediate results. This paper presents a comprehensive comparison of Traditional and Agentic RAG in terms of architecture, capabilities, evaluation metrics, and operational challenges. In addition to synthesizing representative systems, we provide a side-by-side analysis of comparative limitations, failure modes, and corresponding mitigations, mapping domain-specific applications across established and emerging fields. We also outline governance recommendations and propose future research directions, including graph-augmented, multimodal, human-in-the-loop, and domain-specialized Agentic RAG frameworks with standardized model cards. These insights offer both a technical and practical foundation for designing more adaptive, trustworthy, and context-aware retrieval-augmented systems.
This paper presents a doctoral research focusing on integrating Retrieval-Augmented Generation (RAG) into video-related multimodal tasks. Existing RAG studies predominantly target text, images, or tabular data, overlooking the unique value of video as a knowledge carrier. We address this gap by: 1) proposing AdaVideoRAG, a framework that adaptively allocates retrieval strategies based on query complexity for long-video understanding; 2) developing REViG (RAG-Enhanced Video Generation) to optimize prompt engineering via retrieved knowledge for controllable video synthesis; 3) constructing the UltraVideo dataset (UHD-4K/8K resolution, 100+ themes, 10 structured captions per video) and HiVU/HiVG benchmarks to evaluate RAG-driven video tasks. Experiments validate the effectiveness of our methods, and we outline future plans to unify video understanding and generation through Agentic RAG for AGI-oriented research.
Objective: To evaluate if a tool-using agent-based system utilizing large language models (LLMs) for medical question-answering (QA) tasks outperforms standalone LLMs. Methods: We developed a unified, open-source LLM-based agentic system that integrates document retrieval, re-ranking, evidence grounding, and diagnosis generation to support dynamic, multi-step medical reasoning. Our system features a lightweight retrieval-augmented generation pipeline coupled with a cache-and-prune memory bank, enabling efficient long-context inference beyond standard LLM limits. The system autonomously invokes specialized tools, eliminating the need for manual prompt engineering or brittle multi-stage templates. We compared the agentic system against standalone LLMs on various medical QA benchmarks. Results: Evaluated on five well-known medical QA benchmarks, our system outperforms or closely matches state-of-the-art proprietary and open-source medical LLMs in multiple-choice and open-ended formats. Specifically, our system achieved accuracies of 82.98% on USMLE Step 1 and 86.24% on USMLE Step 2, surpassing GPT-4's 80.67% and 81.67%, respectively, while closely matching on USMLE Step 3 (88.52% vs. 89.78%). Conclusion: Our findings highlight the value of combining tool-augmented and evidence-grounded reasoning strategies to build reliable and scalable medical AI systems.
Scientists and operators at SLAC National Accelerator Laboratory rely on electronic logbooks (ELOGs) to record and share critical information surrounding accelerator operations. However, since creating log entries is time-consuming and complex, they are often brief, incomplete, filled with jargon, and inconsistently structured. With thousands of records spanning decades, this makes it difficult for operators to search for and interpret information. Through interviews with operators, we identified two critical gaps: the lack of automated shift summarization and the difficulty of real-time ELOG information retrieval. Therefore, we introduce ChatEED, a novel agentic retrieval-augmented generation (RAG) system that is operator-centric and addresses these two needs while also prioritizing security, modularity, efficiency, and transparency. In this paper, we analyze the operator needs and workflow that guide the system design, detail the system architecture and deployment, and outline future directions for expansion and evaluation. This ongoing work demonstrates the potential for AI systems to improve continuity, communication, and efficiency in high-performance science facilities.CCS Concepts• Information systems → Question answering; Summarization; Information extraction.
RALLM-POI: Retrieval-Augmented LLM for Zero-shot Next POI Recommendation with Geographical Reranking
Next point-of-interest (POI) recommendation predicts a user's next destination from historical movements. Traditional models require intensive training, while LLMs offer flexible and generalizable zero-shot solutions but often generate generic or geographically irrelevant results due to missing trajectory and spatial context. To address these issues, we propose RALLM-POI, a framework that couples LLMs with retrieval-augmented generation and self-rectification. We first propose a Historical Trajectory Retriever (HTR) that retrieves relevant past trajectories to serve as contextual references, which are then reranked by a Geographical Distance Reranker (GDR) for prioritizing spatially relevant trajectories. Lastly, an Agentic LLM Rectifier (ALR) is designed to refine outputs through self-reflection. Without additional training, RALLM-POI achieves substantial accuracy gains across three real-world Foursquare datasets, outperforming both conventional and LLM-based baselines. Code is released at https://github.com/LKRcrocodile/RALLM-POI.
Retrieval-Augmented Generation (RAG) has emerged as a promising solution to address key challenges faced by GenAI, such as hallucination, outdated or non-removable parametric knowledge, and non-traceable reasoning processes. Existing RAG frameworks introduce dynamism into RAG process through adaptive, recursive and interactive usage of retriever and generator. More recently, agentic RAG adds another layer of intelligence to RAG by leveraging GenAI agents to further enhance dynamism by autonomously planning the retrieval process as a complex orchestration workflow with various external tools. However, current RAG architectures often overlook the significant role that domain experts can play in the retrieval process, alongside passive knowledge bases. This paper introduces a new paradigm for agentic RAG systems, capable of integrating external passive knowledge bases as well as active domain experts. This integration further enhances the versatility and factual accuracy of RAG systems. The paper discusses the key components of this new paradigm and examines the associated design challenges.
e13685 Background: Generative Artificial Intelligence (GenAI) has demonstrated promise as a clinical decision support tool. Previous studies utilized closed-source large language models (LLMs) such as GPT-4o (via the chatbot ChatGPT) to evaluate GenAI's role in healthcare. However, these LLMs may change, causing challenges with reliability and reproducibility. Hallucinations are especially concerning in healthcare, so methods such as grounding and retrieval augmented generation (RAG) are important tools that may reduce or eliminate hallucinations. Methods: The goal of this study was to enhance GenAI with agentic AI and vector-based RAG, using only open-source tools and LLMs to produce reliable breast cancer summaries and treatment evaluations. A container with Neo4j vector database, LangChain, Docling, and Jupyter was created to review HL7 patient charts containing mCODE data. Ollama was used to pull the LLMs llama3.2, gemma2:2b, qwen2.5, and phi3:mini. Synthetic Breast Cancer Dataset collected from The mCODE Project was collected, and a custom HL7-mCODE module was made to make patient data LLM-ingestible. The workflow was as follows: a modular (i.e., swappable) LLM with RAG would iterate over patient notes to extract all information related to cancer in their chart. A subsequent LLM (i.e., agentic AI) would compare the first AI's extraction with an mCODE summary to evaluate if there were any errors, remove them, and return a corrected cancer history. After this comparison was complete, another AI agent would evaluate for missing oncologic information (such as HER status) and return a list of known and unknown information for breast cancer. For the last step, NCCN Breast Cancer guidelines (Version 6.2024 11-11-2024) were converted to LLM-ingestible text via IBM's docling and placed in a vector database. The last AI agent would compare the patient's cancer details and treatment to compare with the guidelines. Results: 724 patient charts were generated with various modular AIs. No hallucinations were observed in the outputted data (i.e., no fabricated diagnoses, cancer details, treatments, etc.), and no incorrect interpretations were found. Most outputs correctly stated they could not assess NCCN guidelines due to insufficient information in the patient chart; charts with sufficient information to follow a specific guideline returned correct comparisons. In one case, Microsoft's phi3:mini was able to discern that while the guidelines were not followed, the provided guidelines are newer than the date the synthetic patient received treatment. Conclusions: Agentic AI as a utility for grounding, summarizing, and quality assurance demonstrates promise as an augmentation for GenAI to produce effective CDS tools for breast cancer history collection, evaluation, and treatment. Further studies with knowledge graphs may further improve their utility.
Deep learning has advanced medical image classification, but interpretability challenges hinder its clinical adoption. This study enhances interpretability in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs) and a multi-agent Retrieval-Augmented Generation (RAG) system for report generation. By modeling relationships between visual features and clinical concepts, we create interpretable concept vectors that guide a multi-agent RAG system to generate radiology reports, enhancing clinical relevance, explainability, and transparency. Evaluation of the generated reports using an LLM-as-a-judge confirmed the interpretability and clinical utility of our model's outputs. On the COVID-QU dataset, our model achieved 81% classification accuracy and demonstrated robust report generation performance, with five key metrics ranging between 84% and 90%. This interpretable multi-agent framework bridges the gap between high-performance AI and the explainability required for reliable AI-driven CXR analysis in clinical settings. Our code is available at https://github.com/tifat58/IRR-with-CBM-RAG.git.
While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: https://react-lm.github.io
Corpus distillation for biomedical large language models (LLMs) seeks to address the pressing challenge of insufficient quantity and quality in open-source annotated scientific corpora, which remains a bottleneck for effective LLM training in biomedical research. This paper proposes a knowledge-driven, agentic framework for scientific corpus distillation, tailored explicitly for LLM training in the biomedical domain, addressing the challenge posed by the complex hierarchy of biomedical knowledge. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. This agentic framework collectively generates and refines domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.
Modern enterprise systems exhibit complex interdependencies that make observability and incident response increasingly challenging. Manual alert triage, which typically involves log inspection, API verification, and cross-referencing operational knowledge bases, remains a major bottleneck in reducing mean recovery time (MTTR). This paper presents an agentic observability framework deployed within Adobe's e-commerce infrastructure that autonomously performs alert triage using a ReAct paradigm. Upon alert detection, the agent dynamically identifies the affected service, retrieves and analyzes correlated logs across distributed systems, and plans context-dependent actions such as handbook consultation, runbook execution, or retrieval-augmented analysis of recently deployed code. Empirical results from production deployment indicate a 90% reduction in mean time to insight compared to manual triage, while maintaining comparable diagnostic accuracy. Our results show that agentic AI enables an order-of-magnitude reduction in triage latency and a step-change in resolution accuracy, marking a pivotal shift toward autonomous observability in enterprise operations.
Despite recent advances, autonomous agents often struggle to solve complex tasks in enterprise domains that require coordinating multiple tools and processing diverse data sources. This struggle is driven by two main limitations. First, single-agent architectures enforce a monolithic plan-execute loop, which directly causes trajectory instability. Second, the requirement to use local open-weight models for data privacy introduces smaller context windows leading to the rapid consumption of context from large tool outputs. To solve this problem we introduce RP-ReAct (Reasoner Planner-ReAct), a novel multi-agent approach that fundamentally decouples strategic planning from low-level execution to achieve superior reliability and efficiency. RP-ReAct consists of a Reasoner Planner Agent (RPA), responsible for planning each sub-step, continuously analysing the execution results using the strong reasoning capabilities of a Large Reasoning Model, and one or multiple Proxy-Execution Agent (PEA) that translates sub-steps into concrete tool interactions using a ReAct approach. Crucially, we incorporate a context-saving strategy within the PEA to mitigate context window overflow by managing large tool outputs via external storage and on-demand access. We evaluate RP-ReAct, on the challenging, multi-domain ToolQA benchmark using a diverse set of six open-weight reasoning models. Our empirical results show that RP-ReAct achieves superior performance and improved generalization ability over state-of-the-art baselines when addressing diverse complex tasks across the evaluated domains. Furthermore we establish the enhanced robustness and stability of our approach across different model scales, paving the way for effective and deployable agentic solutions for enterprises.
The reasoning abilities of Large Language Models (LLMs) remain a topic of debate. Some methods such as ReAct-based prompting, have gained popularity for claiming to enhance sequential decision-making abilities of agentic LLMs. However, it is unclear what is the source of improvement in LLM reasoning with ReAct based prompting. In this paper we examine these claims of ReAct based prompting in improving agentic LLMs for sequential decision-making. By introducing systematic variations to the input prompt we perform a sensitivity analysis along the claims of ReAct and find that the performance is minimally influenced by the "interleaving reasoning trace with action execution" or the content of the generated reasoning traces in ReAct, contrary to original claims and common usage. Instead, the performance of LLMs is driven by the similarity between input example tasks and queries, implicitly forcing the prompt designer to provide instance-specific examples which significantly increases the cognitive burden on the human. Our investigation shows that the perceived reasoning abilities of LLMs stem from the exemplar-query similarity and approximate retrieval rather than any inherent reasoning abilities.
Software supply-chain security requires provenance mechanisms that support reproducibility and vulnerability assessment under dynamic execution conditions. Conventional Software Bills of Materials (SBOMs) provide static dependency inventories but cannot capture runtime behaviour, environment drift, or exploitability context. This paper introduces agentic Artificial Intelligence Bills of Materials (AIBOMs), extending SBOMs into active provenance artefacts through autonomous, policy-constrained reasoning. We present an agentic AIBOM framework based on a multi-agent architecture comprising (i) a baseline environment reconstruction agent (MCP), (ii) a runtime dependency and drift-monitoring agent (A2A), and (iii) a policy-aware vulnerability and VEX reasoning agent (AGNTCY). These agents generate contextual exploitability assertions by combining runtime execution evidence, dependency usage, and environmental mitigations with ISO/IEC 20153:2025 Common Security Advisory Framework (CSAF) v2.0 semantics. Exploitability is expressed via structured VEX assertions rather than enforcement actions. The framework introduces minimal, standards-aligned schema extensions to CycloneDX and SPDX, capturing execution context, dependency evolution, and agent decision provenance while preserving interoperability. Evaluation across heterogeneous analytical workloads demonstrates improved runtime dependency capture, reproducibility fidelity, and stability of vulnerability interpretation compared with established provenance systems, with low computational overhead. Ablation studies confirm that each agent contributes distinct capabilities unavailable through deterministic automation.
Evaluating large language model (LLM)-based multi-agent systems remains a critical challenge, as these systems must exhibit reliable coordination, transparent decision-making, and verifiable performance across evolving tasks. Existing evaluation approaches often limit themselves to single-response scoring or narrow benchmarks, which lack stability, extensibility, and automation when deployed in enterprise settings at multi-agent scale. We present AEMA (Adaptive Evaluation Multi-Agent), a process-aware and auditable framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under human oversight. Compared to a single LLM-as-a-Judge, AEMA achieves greater stability, human alignment, and traceable records that support accountable automation. Our results on enterprise-style agent workflows simulated using realistic business scenarios demonstrate that AEMA provides a transparent and reproducible pathway toward responsible evaluation of LLM-based multi-agent systems. Keywords Agentic AI, Multi-Agent Systems, Trustworthy AI, Verifiable Evaluation, Human Oversight
Code generation models based on large language models (LLMs) have gained wide adoption, but challenges remain in ensuring safety, accuracy, and controllability, especially for complex tasks. Existing methods often lack dynamic integration of external tools, transparent reasoning, and user control over safety. To address these issues, we propose a controllable code generation framework utilizing the ReAct paradigm for multi-agent task execution. This framework is a multi-agent system designed to enable efficient, precise, and interpretable code generation through dynamic interactions between LLMs and external resources. The framework adopts a collaborative architecture comprising four specialized agents: a Planner for task decomposition, a Searcher that leverages the ReAct framework for reasoning and tool integration, a CodeGen agent for accurate code generation, and an Extractor for structured data retrieval. The ReAct-based Searcher alternates between generating reasoning traces and executing actions, facilitating seamless integration of internal knowledge with external tools (such as search engines) to enhance accuracy and user control. Experimental results show the framework's effectiveness across multiple languages, achieving a 94.8% security rate on the SVEN dataset with CodeQL, outperforming existing approaches. Its transparent reasoning process fosters user trust and improves controllability.
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This paper proposes Agentic Predictor, a lightweight predictor for efficient agentic workflow evaluation. Agentic Predictor is equipped with a multi-view workflow encoding technique that leverages multi-view representation learning of agentic systems by incorporating code architecture, textual prompts, and interaction graph features. To achieve high predictive accuracy while significantly reducing the number of required workflow evaluations for training a predictor, Agentic Predictor employs cross-domain unsupervised pretraining. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations for a given task, significantly reducing the need for expensive trial-and-error evaluations. Experiments on a carefully curated benchmark spanning three domains show that our predictor outperforms several strong graph-based baselines in both predictive accuracy and workflow utility, highlighting the potential of performance predictors in streamlining the design of LLM-based agentic workflows.
Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: \textbf{\emph{Executability}}, since each instance requires a suitable and often distinct Docker environment; and \textbf{\emph{Verifiability}}, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose \textbf{TerminalTraj}, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20\% on TB~1.0 and 10\% on TB~2.0 over their respective backbones. Notably, \textbf{TerminalTraj-32B} achieves strong performance among models with fewer than 100B parameters, reaching 35.30\% on TB~1.0 and 22.00\% on TB~2.0, and demonstrates improved test-time scaling behavior. All code and data are available at https://github.com/Wusiwei0410/TerminalTraj.
This paper proposes a highly robust autonomous agent framework based on the ReAct paradigm, designed to solve complex tasks through adaptive decision making and multi-agent collaboration. Unlike traditional frameworks that rely on fixed workflows generated by LLM-based planners, this framework dynamically generates next actions during agent execution based on prior trajectories, thereby enhancing its robustness. To address potential termination issues caused by adaptive execution paths, I propose a timely abandonment strategy incorporating a probabilistic penalty mechanism. For multi-agent collaboration, I introduce a memory transfer mechanism that enables shared and dynamically updated memory among agents. The framework's innovative timely abandonment strategy dynamically adjusts the probability of task abandonment via probabilistic penalties, allowing developers to balance conservative and exploratory tendencies in agent execution strategies by tuning hyperparameters. This significantly improves adaptability and task execution efficiency in complex environments. Additionally, agents can be extended through external tool integration, supported by modular design and MCP protocol compatibility, which enables flexible action space expansion. Through explicit division of labor, the multi-agent collaboration mechanism enables agents to focus on specific task components, thereby significantly improving execution efficiency and quality.
Enterprise back office workflows require agentic systems that are auditable, policy-aligned, and operationally predictable, capabilities that generic multi-agent setups often fail to deliver. We present POLARIS (Policy-Aware LLM Agentic Reasoning for Integrated Systems), a governed orchestration framework that treats automation as typed plan synthesis and validated execution over LLM agents. A planner proposes structurally diverse, type checked directed acyclic graphs (DAGs), a rubric guided reasoning module selects a single compliant plan, and execution is guarded by validator gated checks, a bounded repair loop, and compiled policy guardrails that block or route side effects before they occur. Applied to document centric finance tasks, POLARIS produces decision grade artifacts and full execution traces while reducing human intervention. Empirically, POLARIS achieves a micro F1 of 0.81 on the SROIE dataset and, on a controlled synthetic suite, achieves 0.95 to 1.00 precision for anomaly routing with preserved audit trails. These evaluations constitute an initial benchmark for governed Agentic AI. POLARIS provides a methodological and benchmark reference for policy-aligned Agentic AI. Keywords Agentic AI, Enterprise Automation, Back-Office Tasks, Benchmarks, Governance, Typed Planning, Evaluation
Executing open-ended natural language queries is a core problem in robotics. While recent advances in imitation learning and vision-language-actions models (VLAs) have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. An alternative is to design an explicit scene representation as a queryable interface between the robot and the world, using query results to guide downstream motion planning. In this work, we present Agentic Scene Policies (ASP), an agentic framework that leverages the advanced semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy. ASP can execute open-vocabulary queries in a zero-shot manner by explicitly reasoning about object affordances in the case of more complex skills. Through extensive experiments, we compare ASP with VLAs on tabletop manipulation problems and showcase how ASP can tackle room-level queries through affordance-guided navigation, and a scaled-up scene representation. (Project page: https://montrealrobotics.ca/agentic-scene-policies.github.io/)
Even as AI capabilities improve, most enterprises do not consider agents trustworthy enough to work on production data. In this paper, we argue that the path to trustworthy agentic workflows begins with solving the infrastructure problem first: traditional lakehouses are not suited for agent access patterns, but if we design one around transactions, governance follows. In particular, we draw an operational analogy to MVCC in databases and show why a direct transplant fails in a decoupled, multi-language setting. We then propose an agent-first design, Bauplan, that reimplements data and compute isolation in the lakehouse. We conclude by sharing a reference implementation of a self-healing pipeline in Bauplan, which seamlessly couples agent reasoning with all the desired guarantees for correctness and trust.
LLM-based coding agents are increasingly common but still face challenges in context management, latency, reliability, reproducibility, and scalability. We present Agint, an agentic graph compiler, interpreter, and runtime that incrementally and hierarchically converts natural-language instructions into typed, effect-aware code DAGs. Agint introduces explicit type floors (text to data to spec to code) grounded in semantic graph transformations and a hybrid LLM and function-based JIT runtime. This enables dynamic graph refinement, reproducible and optimizable execution, speculative evaluation, and interoperability with existing developer tools. Agint's typed graph bindings improve reliability and allow concurrent composition of concurrent codebases by construction, supporting accelerated development with smaller and faster models, lower latency, efficient context utilization, and higher throughput. Hierarchical compilation allows scalable graph edits, while the graph structure supports reproducibility and efficient parallel generation. Agint provides a composable unix-style toolchain: dagify (DAG compiler), dagent (hybrid JIT runtime), schemagin (schema generator), and datagin (data transformer) for realtime, low-latency code and dataflow creation. Human developers and coding agents refine graphs through the Agint CLI, while non-technical users use Agint Flow GUI for visual editing, conversational refinement, and debugging to promote prototype agentic workflows to production code. This continuous co-creation model allows teams to prototype quickly, refine seamlessly, and deploy reliably, bridging natural language, compiler methods, and developer tooling to enable a new generation of composable, team-centric coding agents at scale.
The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production-grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task-specific automation), Agentic AI (adaptive goal-seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities - coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans - most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems.
As large language models evolve from conversational assistants to autonomous agents, ensuring trustworthiness requires a fundamental shift from post-hoc evaluation to real-time action verification. Current frameworks like AgentBench evaluate task completion, while TrustLLM and HELM assess output quality after generation. However, none of these prevent harmful actions during agent execution. We present TrustBench, a dual-mode framework that (1) benchmarks trust across multiple dimensions using both traditional metrics and LLM-as-a-Judge evaluations, and (2) provides a toolkit agents invoke before taking actions to verify safety and reliability. Unlike existing approaches, TrustBench intervenes at the critical decision point: after an agent formulates an action but before execution. Domain-specific plugins encode specialized safety requirements for healthcare, finance, and technical domains. Across multiple agentic tasks, TrustBench reduced harmful actions by 87%. Domain-specific plugins outperformed generic verification, achieving 35% greater harm reduction. With sub-200ms latency, TrustBench enables practical real-time trust verification for autonomous agents.
The rapid evolution to autonomous, agentic AI systems introduces significant risks due to their inherent unpredictability and emergent behaviors; this also renders traditional verification methods inadequate and necessitates a shift towards probabilistic guarantees where the question is no longer if a system will fail, but the probability of its failure within given constraints. This paper presents AgentGuard, a framework for runtime verification of Agentic AI systems that provides continuous, quantitative assurance through a new paradigm called Dynamic Probabilistic Assurance. AgentGuard operates as an inspection layer that observes an agent's raw I/O and abstracts it into formal events corresponding to transitions in a state model. It then uses online learning to dynamically build and update a Markov Decision Process (MDP) that formally models the agent's emergent behavior. Using probabilistic model checking, the framework then verifies quantitative properties in real-time.
This chapter argues that the reliability of agentic and generative AI is chiefly an architectural property. We define agentic systems as goal-directed, tool-using decision makers operating in closed loops, and show how reliability emerges from principled componentisation (goal manager, planner, tool-router, executor, memory, verifiers, safety monitor, telemetry), disciplined interfaces (schema-constrained, validated, least-privilege tool calls), and explicit control and assurance loops. Building on classical foundations, we propose a practical taxonomy-tool-using agents, memory-augmented agents, planning and self-improvement agents, multi-agent systems, and embodied or web agents - and analyse how each pattern reshapes the reliability envelope and failure modes. We distil design guidance on typed schemas, idempotency, permissioning, transactional semantics, memory provenance and hygiene, runtime governance (budgets, termination conditions), and simulate-before-actuate safeguards.
This paper presents a novel, structured decision support framework that systematically aligns diverse artificial intelligence (AI) agent architectures, reactive, cognitive, hybrid, and learning, with the comprehensive National Institute of Standards and Technology (NIST) Cybersecurity Framework (CSF) 2.0. By integrating agent theory with industry guidelines, this framework provides a transparent and stepwise methodology for selecting and deploying AI solutions to address contemporary cyber threats. Employing a granular decomposition of NIST CSF 2.0 functions into specific tasks, the study links essential AI agent properties such as autonomy, adaptive learning, and real-time responsiveness to each subcategory's security requirements. In addition, it outlines graduated levels of autonomy (assisted, augmented, and fully autonomous) to accommodate organisations at varying stages of cybersecurity maturity. This holistic approach transcends isolated AI applications, providing a unified detection, incident response, and governance strategy. Through conceptual validation, the framework demonstrates how tailored AI agent deployments can align with real-world constraints and risk profiles, enhancing situational awareness, accelerating response times, and fortifying long-term resilience via adaptive risk management. Ultimately, this research bridges the gap between theoretical AI constructs and operational cybersecurity demands, establishing a foundation for robust, empirically validated multi-agent systems that adhere to industry standards.
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
Large Language Models demonstrate strong reasoning and generation abilities, yet their behavior in multi-turn tasks often lacks reliability and verifiability. We present a task completion framework that enables LLM-based agents to act under explicit behavioral guidance in environments described by reinforcement learning formalisms with defined observation, action, and reward signals. The framework integrates three components: a lightweight task profiler that selects reasoning and generation strategies, a reasoning module that learns verifiable observation - action mappings, and a generation module that enforces constraint-compliant outputs through validation or deterministic synthesis. We show that as the agent interacts with the environment, these components co-evolve, yielding trustworthy behavior.
Artificial intelligence models have shown strong potential in acute ischemic stroke imaging, particularly for lesion detection and segmentation using computed tomography and magnetic resonance imaging. However, most existing approaches operate as black box predictors, producing deterministic outputs without explicit uncertainty awareness or structured mechanisms to abstain under ambiguous conditions. This limitation raises serious safety and trust concerns in high risk emergency radiology settings. In this paper, we propose an explainable agentic AI framework for uncertainty aware and abstention enabled decision support in acute ischemic stroke imaging. The framework follows a modular agentic pipeline in which a perception agent performs lesion aware image analysis, an uncertainty estimation agent computes slice level predictive reliability, and a decision agent determines whether to issue a prediction or abstain based on predefined uncertainty thresholds. Unlike prior stroke imaging systems that primarily focus on improving segmentation or classification accuracy, the proposed framework explicitly prioritizes clinical safety, transparency, and clinician aligned decision behavior. Qualitative and case based analyses across representative stroke imaging scenarios demonstrate that uncertainty driven abstention naturally emerges in diagnostically ambiguous regions and low information slices. The framework further integrates visual explanation mechanisms to support both predictive and abstention decisions, addressing a key limitation of existing uncertainty aware medical imaging systems. Rather than introducing a new performance benchmark, this work presents agentic control, uncertainty awareness, and selective abstention as essential design principles for developing safe and trustworthy medical imaging AI systems.
While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. At its core is an Iterative Refinement Cycle governed by a module we term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating mechanism: it deconstructs the initial query into a checklist of required findings and audits the aggregated evidence to identify confirmed facts and, critically, explicit informational gaps. These gaps provide a precise signal to an Adaptive Query Refinement agent, which generates new, targeted sub-queries to retrieve missing information. This cycle repeats until the evidence is verified as sufficient, ensuring a comprehensive context for a final, strictly faithful generation. We conducted experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified experimental setup, FAIR-RAG significantly outperforms strong baselines. On HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3 points over the strongest iterative baseline -- establishing a new state-of-the-art for this class of methods on these benchmarks. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems for complex, knowledge-intensive tasks.
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external, domain-specific data into the generative process. While LLMs are highly capable, they often rely on static, pre-trained datasets, limiting their ability to integrate dynamic or private data. Traditional RAG systems typically use a single-agent architecture to handle query generation, data retrieval, and response synthesis. However, this approach becomes inefficient when dealing with diverse data sources, such as relational databases, document stores, and graph databases, often leading to performance bottlenecks and reduced accuracy. This paper proposes a multi-agent RAG system to address these limitations. Specialized agents, each optimized for a specific data source, handle query generation for relational, NoSQL, and document-based systems. These agents collaborate within a modular framework, with query execution delegated to an environment designed for compatibility across various database types. This distributed approach enhances query efficiency, reduces token overhead, and improves response accuracy by ensuring that each agent focuses on its specialized task. The proposed system is scalable and adaptable, making it ideal for generative AI workflows that require integration with diverse, dynamic, or private data sources. By leveraging specialized agents and a modular execution environment, the system provides an efficient and robust solution for handling complex, heterogeneous data environments in generative AI applications.
The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.
Large Language Models (LLMs) have demonstrated impressive ability in generation and reasoning tasks but struggle with handling up-to-date knowledge, leading to inaccuracies or hallucinations. Retrieval-Augmented Generation (RAG) mitigates this by retrieving and incorporating external knowledge into input prompts. In particular, due to LLMs' context window limitations and long-context hallucinations, only the most relevant "chunks" are retrieved. However, current RAG systems face three key challenges: (1) chunks are often retrieved independently without considering their relationships, such as redundancy and ordering; (2) the utility of chunks is non-monotonic, as adding more chunks can degrade quality; and (3) retrieval strategies fail to adapt to the unique characteristics of different queries. To overcome these challenges, we design a cost-constrained retrieval optimization framework for RAG. We adopt a Monte Carlo Tree Search (MCTS) based strategy to find the optimal chunk combination order, which considers the chunks' correlations. In addition, to address the non-monotonicity of chunk utility, instead of treating budget exhaustion as the termination condition, we design a utility computation strategy to identify the optimal chunk combination without necessarily exhausting the budget. Furthermore, we propose a configuration agent that predicts optimal configurations for each query domain, improving our framework's adaptability and efficiency. Experimental results demonstrate up to a 30% improvement over baseline models, highlighting the framework's effectiveness, scalability, and suitability. Our source code has been released at https://github.com/wang0702/CARROT.
Retrieval-augmented generation (RAG) equips large language models (LLMs) with reliable knowledge memory. To strengthen cross-text associations, recent research integrates graphs and hypergraphs into RAG to capture pairwise and multi-entity relations as structured links. However, their misaligned memory organization necessitates costly, disjointed retrieval. To address these limitations, we propose IGMiRAG, a framework inspired by human intuition-guided reasoning. It constructs a hierarchical heterogeneous hypergraph to align multi-granular knowledge, incorporating deductive pathways to simulate realistic memory structures. During querying, IGMiRAG distills intuitive strategies via a question parser to control mining depth and memory window, and activates instantaneous memories as anchors using dual-focus retrieval. Mirroring human intuition, the framework guides retrieval resource allocation dynamically. Furthermore, we design a bidirectional diffusion algorithm that navigates deductive paths to mine in-depth memories, emulating human reasoning processes. Extensive evaluations indicate IGMiRAG outperforms the state-of-the-art baseline by 4.8% EM and 5.0% F1 overall, with token costs adapting to task complexity (average 6.3k+, minimum 3.0k+). This work presents a cost-effective RAG paradigm that improves both efficiency and effectiveness.
Retrieval-Augmented Generation (RAG) has shown promise in enhancing recommendation systems by incorporating external context into large language model prompts. However, existing RAG-based approaches often rely on static retrieval heuristics and fail to capture nuanced user preferences in dynamic recommendation scenarios. In this work, we introduce ARAG, an Agentic Retrieval-Augmented Generation framework for Personalized Recommendation, which integrates a multi-agent collaboration mechanism into the RAG pipeline. To better understand the long-term and session behavior of the user, ARAG leverages four specialized LLM-based agents: a User Understanding Agent that summarizes user preferences from long-term and session contexts, a Natural Language Inference (NLI) Agent that evaluates semantic alignment between candidate items retrieved by RAG and inferred intent, a context summary agent that summarizes the findings of NLI agent, and an Item Ranker Agent that generates a ranked list of recommendations based on contextual fit. We evaluate ARAG accross three datasets. Experimental results demonstrate that ARAG significantly outperforms standard RAG and recency-based baselines, achieving up to 42.1% improvement in NDCG@5 and 35.5% in Hit@5. We also, conduct an ablation study to analyse the effect by different components of ARAG. Our findings highlight the effectiveness of integrating agentic reasoning into retrieval-augmented recommendation and provide new directions for LLM-based personalization.
Current search techniques are limited to standard RAG query-document applications. In this paper, we propose a novel technique to expand the code and index for predicting the required APIs, directly enabling high-quality, end-to-end code generation for auto-completion and agentic AI applications. We address the problem of API leaks in current code-to-code benchmark datasets by introducing a new dataset built from real-world ServiceNow Script Includes that capture the challenge of unclear API usage intent in the code. Our evaluation metrics show that this method achieves 87.86% top-40 retrieval accuracy, allowing the critical context with APIs needed for successful downstream code generation. To enable real-time predictions, we develop a comprehensive post-training pipeline that optimizes a compact 0.6B reranker through synthetic dataset generation, supervised fine-tuning, and reinforcement learning. This approach enables our compact reranker to outperform a much larger 8B model while maintaining 2.5x reduced latency, effectively addressing the nuances of enterprise-specific code without the computational overhead of larger models.
Retrieval-Augmented Generation (RAG) utilizes external knowledge to augment Large Language Models' (LLMs) reliability. For flexibility, agentic RAG employs autonomous, multi-round retrieval and reasoning to resolve queries. Although recent agentic RAG has improved via reinforcement learning, they often incur substantial token overhead from search and reasoning processes. This trade-off prioritizes accuracy over efficiency. To address this issue, this work proposes TeaRAG, a token-efficient agentic RAG framework capable of compressing both retrieval content and reasoning steps. 1) First, the retrieved content is compressed by augmenting chunk-based semantic retrieval with a graph retrieval using concise triplets. A knowledge association graph is then built from semantic similarity and co-occurrence. Finally, Personalized PageRank is leveraged to highlight key knowledge within this graph, reducing the number of tokens per retrieval. 2) Besides, to reduce reasoning steps, Iterative Process-aware Direct Preference Optimization (IP-DPO) is proposed. Specifically, our reward function evaluates the knowledge sufficiency by a knowledge matching mechanism, while penalizing excessive reasoning steps. This design can produce high-quality preference-pair datasets, supporting iterative DPO to improve reasoning conciseness. Across six datasets, TeaRAG improves the average Exact Match by 4% and 2% while reducing output tokens by 61% and 59% on Llama3-8B-Instruct and Qwen2.5-14B-Instruct, respectively. Code is available at https://github.com/Applied-Machine-Learning-Lab/TeaRAG.
Biomedical research underpins progress in our understanding of human health and disease, drug discovery, and clinical care. However, with the growth of complex lab experiments, large datasets, many analytical tools, and expansive literature, biomedical research is increasingly constrained by repetitive and fragmented workflows that slow discovery and limit innovation, underscoring the need for a fundamentally new way to scale scientific expertise. Here, we introduce Biomni, a general-purpose biomedical AI agent designed to autonomously execute a wide spectrum of research tasks across diverse biomedical subfields. To systematically map the biomedical action space, Biomni first employs an action discovery agent to create the first unified agentic environment - mining essential tools, databases, and protocols from tens of thousands of publications across 25 biomedical domains. Built on this foundation, Biomni features a generalist agentic architecture that integrates large language model (LLM) reasoning with retrieval-augmented planning and code-based execution, enabling it to dynamically compose and carry out complex biomedical workflows - entirely without relying on predefined templates or rigid task flows. Systematic benchmarking demonstrates that Biomni achieves strong generalization across heterogeneous biomedical tasks - including causal gene prioritization, drug repurposing, rare disease diagnosis, microbiome analysis, and molecular cloning - without any task-specific prompt tuning. Real-world case studies further showcase Biomni's ability to interpret complex, multi-modal biomedical datasets and autonomously generate experimentally testable protocols. Biomni envisions a future where virtual AI biologists operate alongside and augment human scientists to dramatically enhance research productivity, clinical insight, and healthcare. Biomni is ready to use at https://biomni.stanford.edu, and we invite scientists to explore its capabilities, stress-test its limits, and co-create the next era of biomedical discoveries.
With the fast-growing and evolving omics data, the demand for streamlined and adaptable tools to handle bioinformatics analysis continues to grow. In response to this need, Automated Bioinformatics Analysis (AutoBA) is introduced, an autonomous AI agent designed explicitly for fully automated multi-omic analyses based on large language models (LLMs). AutoBA simplifies the analytical process by requiring minimal user input while delivering detailed step-by-step plans for various bioinformatics tasks. AutoBA's unique capacity to self-design analysis processes based on input data variations further underscores its versatility. Compared with online bioinformatic services, AutoBA offers multiple LLM backends, with options for both online and local usage, prioritizing data security and user privacy. In comparison to ChatGPT and open-source LLMs, an automated code repair (ACR) mechanism in AutoBA is designed to improve its stability in automated end-to-end bioinformatics analysis tasks. Moreover, different from the predefined pipeline, AutoBA has adaptability in sync with emerging bioinformatics tools. Overall, AutoBA represents an advanced and convenient tool, offering robustness and adaptability for conventional multi-omic analyses.
Protein function prediction remains a fundamental challenge in computational biology. Here, we present a Large Language Model (LLM) agent-based system that improves protein function prediction performance using knowledge-augmented reasoning and multi-source evidence synthesis.Our approach integrates computational predictions with structured protein metadata, scientific literature, and ontological knowledge through a multi-stage reasoning process. An LLM agent equipped with specialized tools progressively refines functional predictions by querying constraints, cross-referencing evidence, and ensuring biological plausibility. Furthermore, the system provides detailed explanations for each prediction update, documenting the reasoning process and evidence sources.We evaluate our approach against established baseline methods across three Gene Ontology sub-ontologies using four complementary metrics, achieving superior performance in threshold-dependent measures, attaining the lowest Smin scores across all ontologies and the best Fmax for Molecular Function and Cellular Component ontologies. We make our code publicly available at https://github.com/bio-ontology-research-group/go-agent.
Hepatocellular carcinoma (HCC) treatment is challenging due to tumor heterogeneity and patient variability. Current guidelines often overlook individual factors, limiting treatment precision. We developed an integrated framework combining radiomics, deep learning, and large language model (LLM)-based decision agents to generate personalized HCC treatment recommendations. A modified GhostNet incorporating dilated convolutions, channel and spatial attention mechanism (CBAM), and residual channel attention (RCA) modules was trained on MRI to predict pathological markers such as microvascular invasion (MVI), capsule presence, and tumor differentiation. A fusion model integrating radiomics and deep learning enhanced prediction accuracy. Six AI agents processed structured multimodal data and generated individualized treatment strategies, which were evaluated by hepatobiliary surgeons. The fusion model significantly improved prediction accuracy, with MVI and capsule presence reaching 0.8902 and 0.8765, respectively. DeepSeek-R1 achieved the highest clinical relevance score, followed by GPT-4 and Med-PaLM 2. This framework demonstrates the feasibility of AI-assisted, patient-specific HCC decision-making, offering a promising direction for precision oncology.
Science frequently benefits from teams of interdisciplinary researchers
Understanding causality in medical research is essential for developing effective interventions and diagnostic tools. Mendelian Randomization (MR) is a pivotal method for inferring causality through genetic data. However, MR analysis often requires pre-identification of exposure-outcome pairs from clinical experience or literature, which can be challenging to obtain. This poses difficulties for clinicians investigating causal factors of specific diseases. To address this, we introduce MRAgent, an innovative automated agent leveraging Large Language Models (LLMs) to enhance causal knowledge discovery in disease research. MRAgent autonomously scans scientific literature, discovers potential exposure-outcome pairs, and performs MR causal inference using extensive Genome-Wide Association Study data. We conducted both automated and human evaluations to compare different LLMs in operating MRAgent and provided a proof-of-concept case to demonstrate the complete workflow. MRAgent's capability to conduct large-scale causal analyses represents a significant advancement, equipping researchers and clinicians with a robust tool for exploring and validating causal relationships in complex diseases. Our code is public at https://github.com/xuwei1997/MRAgent.
Intensive Care Unit (ICU) nursing is demanding, requiring advanced clinical decision-making and emergency management skills. Simulation-based instruction is central to ICU nursing education but remains constrained by the cost and time required for scenario authoring, limited faculty capacity for feedback, and slow content updates. Large language models (LLMs)-based pedagogical agents may augment instructor training by supporting rapid scenario generation, formative guidance, and on-demand assistance. However, evidence from real-world ICU instructor training is limited, and the balance between perceived benefits, usability, and objective educational outcomes is unclear. To evaluate the feasibility and learner-perceived impact of integrating an LLM-based pedagogical agent into ICU simulation instructor training. An exploratory quasi-experimental study was conducted with 40 ICU nurses from a tertiary hospital in February 2025. Participants were randomly assigned to an experimental group (n = 20) using the LLM-based AI teaching agent for simulation training, and a comparison group (n = 20) using traditional blended learning. The training effectiveness was assessed using the Chinese version of the Jeffries Simulation Design Scale (SDS), the System Usability Scale (SUS), the Adult Online Learning Self-Efficacy Scale, and a teaching satisfaction questionnaire. Data were analyzed using Wilcoxon rank-sum tests and t-tests. The experimental group outperformed the comparison group in multiple areas. Specifically, in the SDS, the experimental group scored higher in case authenticity (5.00 vs. 4.00, p < 0.001), scenario complexity (5.00 vs. 4.00, p < 0.001), feedback mechanisms (5.00 vs. 4.00, p < 0.001), interactivity (5.00 vs. 4.00, p < 0.001), and teaching objectives (5.00 vs. 4.25, p < 0.001). The experimental group also showed higher self-efficacy in learning ability (16.0 vs. 13.0, p < 0.001) and learning technology (18.0 vs. 16.0, p = 0.045). Satisfaction was high in both groups and demonstrated a pronounced ceiling effect. Embedding an LLM-based pedagogical agent into ICU simulation instructor training was feasible and associated with more favorable learner-perceived simulation design quality and online learning self-efficacy, while usability did not differ from traditional blended learning. Findings are preliminary and hypothesis-generating; future multi-centre, adequately powered randomized controlled trials are warranted to determine efficacy and isolate the LLM component's independent contribution.
Performing effective gene-editing experiments requires a deep understanding of both the CRISPR technology and the biological system involved. Meanwhile, despite their versatility and promise, large language models (LLMs) often lack domain-specific knowledge and struggle to accurately solve biological design problems. We present CRISPR-GPT, an LLM agent system to automate and enhance CRISPR-based gene-editing design and data analysis. CRISPR-GPT leverages the reasoning capabilities of LLMs for complex task decomposition, decision-making and interactive human-artificial intelligence (AI) collaboration. This system incorporates domain expertise, retrieval techniques, external tools and a specialized LLM fine tuned with open-forum discussions among scientists. CRISPR-GPT assists users in selecting CRISPR systems, experiment planning, designing guide RNAs, choosing delivery methods, drafting protocols, designing assays and analysing data. We showcase the potential of CRISPR-GPT by knocking out four genes with CRISPR-Cas12a in a human lung adenocarcinoma cell line and epigenetically activating two genes using CRISPR-dCas9 in a human melanoma cell line. CRISPR-GPT enables fully AI-guided gene-editing experiment design and analysis across different modalities, validating its effectiveness as an AI co-pilot in genome engineering.
Ophthalmic findings can non-invasively reflect nervous-system status. We present an LLM-based multi-agent framework that preserves diagnostic uncertainty to support neuro-ophthalmic screening and referral. Heterogeneous inputs (clinical text/PDFs and optional fundus/OCT images) are normalized by an Information Collection Agent. A Diagnosis Agent ensembles multiple LLMs and, when available, a CNN image branch; outputs are aggregated with an uncertainty-aware fusion. Across a curated ophthalmic corpus, the multi-agent framework improves robustness over single-model baselines and produces multi-candidate distributions suitable for downstream triage and monitoring. Uncertainty-aware, multi-candidate predictions align with clinical decision-making under ambiguity and suggest future work on calibration and knowledge-layer fusion.
Large Language Model-based Multi-Agent Systems (LLM-based MASs) represent a groundbreaking paradigm where diverse LLM-based agents collaborate, leveraging their unique capabilities to achieve shared objectives. Although LLM-based MASs outperform individual agents, their current architectures are limited by predefined, fixed, and static agent designs, restricting adaptability and scalability in dynamic environments. To address these limitations, this study proposes two novel approaches: Initial Automatic Agent Generation (IAAG) and Dynamic Real-Time Agent Generation (DRTAG). These approaches enable the automatic creation and seamless integration of new agents into MASs, driven by evolving conversational and task-specific contexts, thereby reducing the need for human intervention. Our method leverages advanced prompt engineering techniques such as persona pattern prompting, chain prompting, and few-shot prompting to generate new agents through existing LLM agents. Additionally, several evaluation metrics were adapted to score and rank LLM-generated texts. Experimental results demonstrate that the DRTAG approach significantly improves system adaptability and task performance compared to static MAS architectures. The IAAG framework also enhances initial system flexibility, supporting the creation of contextually relevant agents. These findings highlight the potential of dynamic LLM-based MASs to overcome the limitations of static architectures to address complex real-world challenges, paving the way for innovative applications across diverse domains.
Effective diabetes management is crucial for maintaining health in diabetic patients. Large Language Models (LLMs) have opened new avenues for diabetes management, facilitating their efficacy. However, current LLM-based approaches are limited by their dependence on general sources and lack of integration with domain-specific knowledge, leading to inaccurate responses. In this paper, we propose a knowledge-infused LLM-powered conversational health agent (CHA) for diabetic patients. We customize and leverage the open-source openCHA framework, enhancing our CHA with external knowledge and analytical capabilities. This integration involves two key components: 1) incorporating the American Diabetes Association dietary guidelines and the Nutritionix information and 2) deploying analytical tools that enable nutritional intake calculation and comparison with the guidelines. We compare the proposed CHA with GPT4. Our evaluation includes 100 diabetes-related questions on daily meal choices and assessing the potential risks associated with the suggested diet. Our findings show that the proposed agent demonstrates superior performance in generating responses to manage essential nutrients.
Task Offloading with LLM-Enhanced Multi-Agent Reinforcement Learning in UAV-Assisted Edge Computing.
Unmanned aerial vehicles (UAVs) furnished with computational servers enable user equipment (UE) to offload complex computational tasks, thereby addressing the limitations of edge computing in remote or resource-constrained environments. The application of value decomposition algorithms for UAV trajectory planning has drawn considerable research attention. However, existing value decomposition algorithms commonly encounter obstacles in effectively associating local observations with the global state of UAV clusters, which hinders their task-solving capabilities and gives rise to reduced task completion rates and prolonged convergence times. To address these challenges, this paper introduces an innovative multi-agent deep learning framework that conceptualizes multi-UAV trajectory optimization as a decentralized partially observable Markov decision process (Dec-POMDP). This framework integrates the QTRAN algorithm with a large language model (LLM) for efficient region decomposition and employs graph convolutional networks (GCNs) combined with self-attention mechanisms to adeptly manage inter-subregion relationships. The simulation results demonstrate that the proposed method significantly outperforms existing deep reinforcement learning methods, with improvements in convergence speed and task completion rate exceeding 10%. Overall, this framework significantly advances UAV trajectory optimization and enhances the performance of multi-agent systems within UAV-assisted edge computing environments.
Cancer patients often lack timely education and personalized support due to clinician workload. This quality improvement study develops and evaluates a Large Language Model (LLM) agent, MedEduChat, which is integrated with the clinic's electronic health records (EHR) and designed to enhance prostate cancer patient education. Fifteen non-metastatic prostate cancer patients and three clinicians recruited from the Mayo Clinic interacted with the agent between May 2024 and April 2025. Findings showed that MedEduChat has a high usability score (UMUX = 83.7/100) and improves patients' health confidence (Health Confidence Score rose from 9.9 to 13.9). Clinicians evaluated the patient-chat interaction history and rated MedEduChat as highly correct (2.9/3), complete (2.7/3), and safe (2.7/3), with moderate personalization (2.3/3). This study highlights the potential of LLM agents to improve patient engagement and health education.
Recently, Large Language Model based Autonomous System (LLMAS) has gained great popularity for its potential to simulate complicated behaviors of human societies. One of its main challenges is to present and analyze the dynamic events evolution of LLMAS. In this work, we present a visualization approach to explore the detailed statuses and agents' behavior within LLMAS. Our approach outlines a general pipeline that organizes raw execution events from LLMAS into a structured behavior model. We leverage a behavior summarization algorithm to create a hierarchical summary of these behaviors, arranged according to their sequence over time. Additionally, we design a cause trace method to mine the causal relationship between agent behaviors. We then develop AgentLens, a visual analysis system that leverages a hierarchical temporal visualization for illustrating the evolution of LLMAS, and supports users to interactively investigate details and causes of agents' behaviors. Two usage scenarios and a user study demonstrate the effectiveness and usability of our AgentLens.
Large language models (LLMs) are revolutionizing healthcare by improving diagnosis, patient care, and decision support through interactive communication. More recently, they have been applied to analyzing physiological time-series like wearable data for health insight extraction. Existing methods embed raw numerical sequences directly into prompts, which exceeds token limits and increases computational costs. Additionally, some studies integrated features extracted from time-series in textual prompts or applied multimodal approaches. However, these methods often produce generic and unreliable outputs due to LLMs' limited analytical rigor and inefficiency in interpreting continuous waveforms. In this paper, we develop an LLM-powered agent for physiological time-series analysis aimed to bridge the gap in integrating LLMs with well-established analytical tools. Built on the OpenCHA, an open-source LLM-powered framework, our agent powered by OpenAI's GPT-3.5-turbo model features an orchestrator that integrates user interaction, data sources, and analytical tools to generate accurate health insights. To evaluate its effectiveness, we implement a case study on heart rate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of PPG and Electrocardiogram (ECG) recordings in a remote health monitoring study. The agent's performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o, with ECG serving as the gold standard for HR estimation. Results demonstrate that our agent significantly outperforms benchmark models by achieving lower error rates and more reliable HR estimations. The agent implementation is publicly available on GitHub
MedAgentBench is the first benchmark for evaluating LLM agents on clinical tasks in a FHIR-compliant EHR. In this paper, we present significant prompt engineering and tool design improvements over the original agent implementation and introduce a memory component that enables the agent to learn from prior failures. We added new tools for the agent to properly format its output for tasks, interact with an EHR without constructing explicit HTTP requests, which were prone to syntax errors, and make math calculations. We also wrote a new system prompt that asked the agent to outline its plan before making any tool calls and think step by step using chain of thought reasoning, and provided few shot examples of good vs. bad outputs. Using GPT-4.1 as the base model, our agent achieved a success rate of 91.0% without memory and 98.0% with memory. A surprising consequence is that the agent performed better on a different task that had no associated memory entry, possibly demonstrating that LLMs can adapt to the style of tasks presented by users. To contribute to the benchmark and evaluate the generalization of our agent, we developed 300 new multi-step clinically-driven tasks in collaboration with a physician. Lastly, we show the current limitations of these benchmarks and highlight the necessary next steps and challenges for the responsible deployment of AI agents in real-world healthcare settings. We hope that this paper leads to further development of EHR agents and benchmarks.
Urban mobility systems face escalating challenges associated with sustainability, equity, and resilience, further compounded by environmental pressures. Traditional agent-based models (ABMs) often fail to capture cognitively rich, adaptive behaviors, limiting their ability to simulate realistic user responses to disruptions. In this work, we propose a cognitive agent architecture based on Large Language Models (LLMs), featuring multi-horizon memory-driven planning, reflection, and adaptation. Integrated into the SimFleet agent-based simulator with realistic sociodemographic profiles, the agents dynamically generate, adjust, and reflect upon travel plans across a 20-day simulation involving over 320 individuals. Experimental results reveal emergent adaptation patterns under both stable and disrupted transport conditions, and an ablation study under severe service disruption quantifies the contributions of short-term and long-term memory modules to memory-driven reasoning, demonstrating the potential of LLM-driven agents to enhance the realism, flexibility, and interpretability of urban mobility simulations.
Recent research on motion generation and text-to-motion synthesis focus on coarse-grained motion descriptions, neglecting fine-grained motion details and motion quality refinement. Additionally, current text-to-motion models, such as MotionGPT, lack multi-turn interaction capabilities, relying on single-turn and single-modality transformations, which limit their ability to integrate information from different modalities across interaction stages. These gaps leave critical questions, such as "How well is the motion performed" and "How can it be refined?" largely unaddressed. To address these issues, first, we introduce two fine-grained dance datasets-one focusing on jazz dance and the other on folk dance, which we have independently collected. Second, considering that dance motions are inherently complex and consist of long sequential actions, we introduce both global and local optimization during the motion encoding phase and employ Hidden Markov Model (HMM) temporal modeling to capture differential features between correct and incorrect movements, thereby optimizing the training process. Finally, we propose a multi-turn historical dialogue framework that enables three stages generation-motion assess, text instructions, and motion refinement-for input videos. This framework assists dance beginners by providing feedback on their movements, offering textual instructions, and delivering motion-based refinement. Experimental results on the jazz dance and folk dance datasets demonstrate that our method surpasses existing approaches in both quantitative and qualitative metrics, establishing a new benchmark for motion-text generation in the field of dance training.
Spike sorting is a fundamental process for decoding neural activity, involving preprocessing, spike detection, feature extraction, clustering, and validation. However, conventional spike sorting methods are highly fragmented, labor-intensive, and heavily reliant on expert manual curation, limiting their scalability and reproducibility. This challenge has become more pressing with advances in neural recording technology, such as high-density Neuropixels for large-scale neural recording or flexible electrodes for long-term stable recording over months to years. The volume and complexity of these datasets make manual curation infeasible, requiring an automated and scalable solution. Here, we introduce SpikeAgent, a multimodal large language model (LLM)-based AI agent that automates and standardizes the entire spike sorting pipeline. Unlike traditional approaches, SpikeAgent integrates multiple LLM backends, coding functions, and established algorithms, autonomously performing spike sorting with reasoning-based decision-making and real-time interaction with intermediate results. It generates interpretable reports, providing transparent justifications for each sorting decision, enhancing transparency and reliability. We benchmarked SpikeAgent against human experts across various neural recording technology, demonstrating its versatility and ability to achieve curation consistency that are equal to, or even higher than human experts. It also drastically reduces the expertise barrier and accelerates the curation and validation time by orders of magnitude. Moreover, it enables automated interpretability of the neural spiking data, which cannot be achieved by any conventional methods. SpikeAgent presents a paradigm shift in processing signals for neuroscience and brain-computer interfaces, while laying the ground for AI agent-augmented science across various domains.
This Perspective explores the transformative potential of multi-agent systems (MAS) powered by Large Language Models (LLMs) in the geosciences. Users of geoscientific data repositories face challenges due to the complexity and diversity of data formats, inconsistent metadata practices, and a considerable number of unprocessed datasets. MAS possesses transformative potential for improving scientists' interaction with geoscientific data by enabling intelligent data processing, natural language interfaces, and collaborative problem-solving capabilities. We illustrate this approach with "PANGAEA GPT," a specialized MAS pipeline integrated with the diverse PANGAEA database for Earth & Environmental Science, demonstrating how MAS-driven workflows can effectively manage complex datasets and accelerate scientific discovery. We discuss how MAS can address current data challenges in geosciences, highlight advancements in other scientific fields, and propose future directions for integrating MAS into geoscientific data processing pipelines. In this Perspective, we show how MAS can fundamentally improve data accessibility, promote cross-disciplinary collaboration, and accelerate geoscientific discoveries.
Quantitative kinetic models of biological regulatory processes play an important role in understanding disease mechanisms. However, their simulation and analysis require specialized domain expertise. In this study, we present Talk2Biomodels (T2B), an open-source, user-friendly, large language model-based agentic AI platform designed to facilitate access to computational models of biological systems and promote the FAIRification (Findability, Accessibility, Interoperability, and Reusability) principles in systems biology. T2B allows users to interact with and analyse mathematical models of biological systems through conversations in natural language, thereby lowering the barrier to entry for model interpretation and hypothesis-driven exploration. The platform natively supports models encoded in the Systems Biology Markup Language, a widely adopted standard in the computational biology community. T2B is integrated with the BioModels database ( https://www.ebi.ac.uk/biomodels/ ), enabling retrieval, simulation, and analysis of curated systems biology models. We illustrate the platform's capabilities through use cases in precision medicine, infectious disease epidemiology, and the study of emergent network-level properties in cellular systems - demonstrating how both computational experts and domain scientists without formal modelling training can derive actionable insights from complex biological models. Talk2Biomodels is available at https://github.com/VirtualPatientEngine/AIAgents4Pharma . Detailed documentation and use cases are available at https://virtualpatientengine.github.io/AIAgents4Pharma/talk2biomodels/intro/ . In summary, T2B lowers the barrier for non-experts to engage with and extract insights from computational models of biological systems, while simultaneously providing experts with a streamlined interface for analysing models and overall contributes to the FAIRification of models.
Visual analytics (VA) is typically applied to complex data, thus requiring complex tools. While visual analytics empowers analysts in data analysis, analysts may get lost in the complexity occasionally. This highlights the need for intelligent assistance mechanisms. However, even the latest LLM-assisted VA systems only provide help when explicitly requested by the user, making them insufficiently intelligent to offer suggestions when analysts need them the most. We propose a ProactiveVA framework in which LLM-powered UI agent monitors user interactions and delivers context-aware assistance proactively. To design effective proactive assistance, we first conducted a formative study analyzing help-seeking behaviors in user interaction logs, identifying when users need proactive help, what assistance they require, and how the agent should intervene. Based on this analysis, we distilled key design requirements in terms of intent recognition, solution generation, interpretability and controllability. Guided by these requirements, we develop a three-stage UI agent pipeline including perception, reasoning, and acting. The agent autonomously perceives users' needs from VA interaction logs, providing tailored suggestions and intuitive guidance through interactive exploration of the system. We implemented the framework in two representative types of VA systems, demonstrating its generalizability, and evaluated the effectiveness through an algorithm evaluation, case and expert study and a user study. We also discuss current design trade-offs of proactive VA and areas for further exploration.
No abstract
Computational protein design is often constrained by slow, complex, inaccessible, and highly sophisticated and expert-dependent workflows that hinder its transferrability and generalization power for broader applications. We present ProteinMCP, an agentic AI framework designed to accelerate and democratize protein engineering. ProteinMCP automates end-to-end scientific tasks, delivering dramatic gains in efficiency; for instance, a comprehensive protein fitness modeling workflow was completed in just 11 min. This performance is achieved by an AI agent that intelligently orchestrates a unified ecosystem of 38 specialized tools, made accessible through a model-context-protocol (MCP). A cornerstone of the framework is an automated pipeline that converts existing software into MCP-compliant servers, ensuring the platform is both powerful and perpetually extensible. We further demonstrate its capabilities through the successful autonomous design and selection of high-affinity de novo binders and therapeutic nanobodies. By removing technical barriers, ProteinMCP has the potential to shorten the design-build-test cycle and make advanced computational protein design accessible to the broader scientific community.
Crop productivity is heavily impacted by inefficient fertilizer usage, improper fertilizer handling, and inappropriately chosen crops. To address these issues, this research work proposes an AI-powered Smart Agriculture Prediction System utilizing intelligent agents for carrying out soil classification, estimating soil parameters, crop suggestion, and fertilizer suggestion. The soil classifier module is trained with 1,563 images of black soils, red soils, clay soils, and alluvial soils using MobileNet-V2, ResNet, and Custom CNN, with Custom CNN resulting in a higher accuracy of 92.88%, which performed better in classifying soils based on textures. A soil parameter estimation agent utilizes regression models for estimating pH and NPK content of soils using images. For crop suggestion, a crop dataset with 2,200 samples with parameters such as N, P, K, T, H, pH, and rainfall is used, in which Random Forest model performed better with an accuracy of 92.4% when compared with CNN and DNN models. For fertilizer suggestion, XGBoost performed better with an accuracy of 94.7% in estimating fertilizers such as Urea, DAP, NPK, Potash, and Compost. Real-time climatic parameters are obtained using API in order to make dynamic updates for climatic parameters. Real-time weather data obtained through APIs enables dynamic updates of climatic parameters, while Explainable AI techniques such as SHAP and LIME enhance model transparency and user trust. Additionally, the system incorporates an interactive agent-based framework that processes user inputs, including location, soil images, and nutrient levels, to generate adaptive outputs such as weather alerts, yield potential, and personalized recommendations. The experimental results demonstrate that the proposed system effectively integrates deep learning, ensemble learning, and explainability to deliver a scalable, efficient, and sustainable decision-support solution for precision agriculture, promoting optimized resource utilization and environmental stewardship.
Reproducibility in biological research and manufacturing remains constrained by the complexity of multi-step protocols, fragmented data-analysis pipelines, and the intrinsic variability of experimental execution. Here, we present Agentic Lab, an agentic-physical AI platform that unifies large language model and vision language model (LLMs/VLMs)-driven reasoning with real-world laboratory operations. Agentic Lab uses multi-agent orchestration architecture, comprising of specialized subagents for knowledge retrieval, protocol design, multimodal data analysis, and training-free segmentation and representation learning for intrinsically explainable single-cell and organoid phenotyping. These agents operate under the orchestration of a virtual principal investigator MolAgent that is linked to an augmented reality (AR)-based physical AI interface, which can bridge digital reasoning with human physical execution. Agentic Lab perceives real-world experimental activities, provides context-aware instructions, identifying procedural errors in real time for humans to correct, and continuously evolves with its long-term memory database expanding through the accumulation of experimental data logs from human scientists. This interaction allows scientists and AI agents to collaborate and co-evolve dynamically, closing the loop between planning, action, and analysis in the traditional cell and organoid research lifecycle. We demonstrate Agentic Lab in organoid differentiation from human pluripotent stem cells, where it autonomously generates protocols, monitors culture procedures, and identifies subtle morphological heterogeneity linked to growth conditions. The system interprets these phenotypes, grounds them in literature, and proposes targeted instructions for improving differentiation efficiency. By combining multi-agent reasoning with physical laboratory awareness, Agentic Lab transforms experimentation and biomanufacturing from a static workflow into an adaptive, feedback-driven, bidirectional process that integrates agentic AI into the research lifecycle. This framework establishes a foundation for intelligent laboratories that integrate design, execution, and interpretation within a unified agentic-physical system.
Behavior analysis across species represents a fundamental challenge in neuroscience, psychology, and ethology, typically requiring extensive expert knowledge and labor-intensive processes that limit research scalability and accessibility. We introduce BehaveAgent, an autonomous multimodal AI agent designed to automate behavior analysis from video input without retraining or manual intervention. Unlike conventional methods that require manual behavior annotation, video segmentation, task-specific model training, BehaveAgent leverages the reasoning capabilities of multimodal large language models (LLM) to generalize across novel behavioral domains without need for additional training. It integrates LLMs, vision-language models (VLMs), and large-scale visual grounding modules, orchestrated through a multimodal context memory and goal-directed attention mechanism, to enable robust zero-shot visual reasoning across species and experimental paradigms, including plants, insects, rodents, primates, and humans. Upon receiving a video input, BehaveAgent autonomously identifies the correct analysis strategy and performs end-to-end behavior analysis and interpretation without human supervision. Leveraging vision-language representations, it performs general-purpose tracking, pose estimation and segmentation. We demonstrate BehaveAgent's universal applicability to autonomously (1) identify the behavioral paradigm and develop an action plan specialized for the identified paradigm, (2) identify relevant subjects and objects, (3) track those features, (4) identify behavioral sequences with explicit reasoning, (5) generate and execute code for targeted analysis and (6) generate comprehensive research reports that integrate behavioral findings with relevant scientific literature. Through interpretable agentic reasoning, BehaveAgent makes its internal decision-making process transparent, clarifying why particular features are tracked or behaviors inferred. By reducing the time and expertise required for behavior analysis, BehaveAgent introduces a scalable, generalizable, and explainable paradigm for advancing biological and behavioral research.
Spatial transcriptomics has revolutionized our understanding of tissue organization by simultaneously capturing gene expression and spatial localization within intact tissues. However, analyzing these increasingly complex datasets requires specialized expertise across computational biology, statistics, and biological context. To address this challenge, we introduce the Spatial Transcriptomics AI Agent (STAgent), an autonomous multimodal agentic AI that integrates multimodal large language models (LLMs) with specialized computational tools to transform weeks-long analysis tasks into minutes of automated processing. Unlike conventional machine learning approaches that are limited to narrow, predefined tasks, STAgent leverages the emergent capabilities of multimodal LLMs - such as flexible reasoning, contextual understanding, and cross-modal integration - which allow it to adapt to novel data, execute multi-step analyses, and generate biologically meaningful insights with minimal human input. STAgent enables autonomous deep research through integrated capabilities, including dynamic code generation for complex analytical workflows, visual reasoning for interpreting spatial patterns, real-time retrieval of relevant peer-reviewd scientific literature, and synthesis of comprehensive, actionable reports. We applied STAgent to investigate the
In today's world, Diabetic Retinopathy (DR) remains a leading cause of vision loss globally, necessitating early detection and accurate diagnosis for timely intervention. Traditional machine learning and deep learning-based approaches, while effective, often suffer from issues such as limited interpretability, static decision-making, and inadequate generalization across diverse patient data. This research introduces an Agentic-AI Driven Framework for Diabetic Retinopathy Analysis (AADR-AI), which leverages intelligent agent-based learning mechanisms to enhance decision-making autonomy, dynamic adaptability, and contextual understanding of retinal fundus images. The novelty lies in incorporating agentic intelligence principles, autonomy, reactivity, and proactivity into DR detection systems, allowing real-time analysis and adaptive feature learning based on patient-specific variations. The proposed AADR-AI framework integrates a multi-agent ensemble of convolutional and transformer-based networks, coordinated through a decision fusion layer for robust classification. Key contributions include improved classification accuracy (up to 96.7%), enhanced model efficiency with reduced computational overhead, and real-time adaptability to varying image qualities and disease progression stages. Extensive experimentation on benchmark datasets demonstrates superior performance compared to existing state-of-the-art methods. This work highlights the transformative potential of agentic AI in medical imaging, paving the way for more autonomous and interpretable clinical decision-support systems.
Radiology is undergoing a paradigm shift from traditional single-function AI systems to sophisticated multi-agent networks capable of autonomous reasoning, coordinated decision-making, and adaptive workflow management. These agentic AI systems move beyond simple pattern recognition to encompass complex radiological workflows including image analysis, report generation, clinical communication, and care coordination. While multi-agent radiological AI promises enhanced diagnostic accuracy, improved workflow efficiency, and reduced physician burden, it simultaneously amplifies the long-standing "black box" problem. Traditional explainable AI methods, which are adequate for understanding isolated diagnostic predictions, fail when applied to multi-step reasoning processes involving multiple specialized agents coordinating across imaging interpretation, clinical correlation, and treatment planning. This paper examines how agentic AI systems in radiology create "compound opacity" layers of inscrutability from agent interactions and distributed decision-making processes. We analyze the autonomy-transparency paradox specific to radiological practice, where increasing AI capability directly conflicts with interpretability requirements essential for clinical trust and regulatory oversight. Through examination of emerging multi-agent radiological workflows, we propose frameworks for responsible implementation that preserve both diagnostic innovation and the fundamental principles of medical transparency and accountability.
The advent of agentic AI systems is leading to significant transformations across scientific and technological domains. Advances in large language models (LLMs), reasoning capabilities, and integration with external tools have ushered in a new era where agentic AI systems can autonomously perform computational tasks that were traditionally carried out by humans. Computer-aided drug design (CADD)─a multifaceted process encompassing complex, interdependent tasks─stands to benefit profoundly from these advancements. However, one of the key challenges in enabling agentic systems to autonomously take over tasks in CADD is constructing models for property estimation that match the quality and reliability of those developed by human experts. As this is not currently straightforward, this capability represents a major bottleneck for fully realizing the potential of autonomous pipelines in drug discovery. We present here MolAgent, a system-agnostic agentic AI framework designed for high-fidelity modeling of molecular properties in early-stage drug discovery. MolAgent autonomously implements expert-level pipelines for both classification and regression, empowering agentic systems to efficiently construct and deploy models. With integrated automated feature engineering, robust model selection, advanced ensemble methodologies, and comprehensive validation frameworks, MolAgent ensures optimal accuracy and model robustness. The platform seamlessly accepts 2D and 3D structural data for ligands and receptors and harmonizes traditional molecular descriptors with advanced deep learning features extracted from pretrained 2D and 3D encoders. Ultimately, the platform's fully automated, end-to-end workflow is designed for seamless agentic execution. Adherence to the Model Context Protocol (MCP) guarantees interoperability with diverse agentic AI infrastructures, ensuring flexible integration into complex, future discovery pipelines.
Large-scale, diverse data produced by higher vocational teacher colleges' digital transformation challenges traditional methods for evaluating digital literacy. The reliability of current analytics and black-box artificial intelligence (AI) models for educational decision-making is limited by their frequent lack of autonomy and transparency. In order to assess digital literacy at higher vocational teacher colleges using big data and visual analytics, this study suggests an Explainable Agentic AI framework. In order to facilitate adaptive data exploration, competency evaluation, and insight generation across multimodal educational data, such as learning behavior logs, assessment records, and digital engagement indicators, the framework combines autonomous agentic intelligence with explainable AI (XAI). While XAI methods offer clear explanations of literacy aspects, decision rationale, and uncertainty, agentic components dynamically handle data processing, feature reasoning, and model selection. Effective human-AI collaboration is made possible by an interactive visual analytics layer that allows for layered investigation of learner patterns, temporal dynamics, and cohort heterogeneity. When compared with traditional machine learning techniques, experimental results on large-scale datasets from higher vocational teacher colleges show better assessment accuracy, robustness, and interpretability. This work demonstrates the promise of agentic AI for explainable big data exploration and promotes reliable instructional intelligence by combining agentic autonomy, explainability, and visual analytics within a scalable big data paradigm.
Ophthalmic practice involves the integration of diverse clinical data and interactive decision-making, posing challenges for traditional artificial intelligence (AI) systems. Visual question answering (VQA) addresses this by combining computer vision and natural language processing to interpret medical images through user-driven queries. Evolving from VQA, multimodal AI agents enable continuous dialogue, tool use and context-aware clinical decision support. This review explores recent developments in ophthalmic conversational AI, spanning theoretical advances and practical implementations. We highlight the transformative role of large language models (LLMs) in improving reasoning, adaptability and task execution. However, key obstacles remain, including limited multimodal datasets, absence of standardised evaluation protocols, and challenges in clinical integration. We outline these limitations and propose future research directions to support the development of robust, LLM-driven AI systems. Realising their full potential will depend on close collaboration between AI researchers and the ophthalmic community.
Large language models (LLMs) have shown remarkable potential in various domains but often lack the ability to access and reason over domain-specific knowledge and tools. In this article, we introduce Chemistry Agent Connecting Tool-Usage to Science (CACTUS), an LLM-based agent that integrates existing cheminformatics tools to enable accurate and advanced reasoning and problem-solving in chemistry and molecular discovery. We evaluate the performance of CACTUS using a diverse set of open-source LLMs, including Gemma-7b, Falcon-7b, MPT-7b, Llama3-8b, and Mistral-7b, on a benchmark of thousands of chemistry questions. Our results demonstrate that CACTUS significantly outperforms baseline LLMs, with the Gemma-7b, Mistral-7b, and Llama3-8b models achieving the highest accuracy regardless of the prompting strategy used. Moreover, we explore the impact of domain-specific prompting and hardware configurations on model performance, highlighting the importance of prompt engineering and the potential for deploying smaller models on consumer-grade hardware without a significant loss in accuracy. By combining the cognitive capabilities of open-source LLMs with widely used domain-specific tools provided by RDKit, CACTUS can assist researchers in tasks such as molecular property prediction, similarity searching, and drug-likeness assessment.
Medical diagnosis is a complex, iterative process that relies heavily on clinicians' reasoning and judgment. Traditional models, while able to provide consistent diagnostic results, fail to replicate the reasoning process of clinicians, making their outputs difficult to understand and justify. In this paper, we address this limitation by first generating clinical notes that capture the clinician's diagnostic reasoning. These notes are then used to train a large language model, allowing it to mimic the step-by-step reasoning employed by clinicians during diagnosis. Our method introduces a hierarchical agent reflection mechanism to generate clinical notes, which deconstructs the diagnostic process into key stages, each handled by specialized agents. This structured approach not only improves the accuracy and reliability of the generated clinical notes but also ensures that the model's reasoning aligns with human clinical practice. Experimental results show that models trained on this data outperform both general-purpose large language models and domain-specific medical models in diagnostic tasks. The proposed method enhances diagnostic transparency and interpretability, offering a valuable tool for AI-assisted clinical decision-making.
Driver behavior is a critical factor in driving safety, making the development of sophisticated distraction classification methods essential. Our study presents a Distracted Driving Classification (DDC) approach utilizing a visual Large Language Model (LLM), named the Distracted Driving Language Model (DDLM). The DDLM introduces whole-body human pose estimation to isolate and analyze key postural features-head, right hand, and left hand-for precise behavior classification and better interpretability. Recognizing the inherent limitations of LLMs, particularly their lack of logical reasoning abilities, we have integrated a reasoning chain framework within the DDLM, allowing it to generate clear, reasoned explanations for its assessments. Tailored specifically with relevant data, the DDLM demonstrates enhanced performance, providing detailed, context-aware evaluations of driver behaviors and corresponding risk levels. Notably outperforming standard models in both zero-shot and few-shot learning scenarios, as evidenced by tests on the 100-Driver dataset, the DDLM stands out as an advanced tool that promises significant contributions to driving safety by accurately detecting and analyzing driving distractions.
AI Agents have evolved to not only recommend content but also facilitate information retrieval and task processing. Developing AI Agents using general-purpose LLM models necessitates integration with external tools, leading to tool-augmented LLM studies. Despite the availability of multiple tools for the same purpose, existing research has not fully leveraged this diversity. This study categorizes external tools by type and proposes a method to simultaneously call tools of the same type. This allows for the utilization of diverse external tools in LLM inference, thereby achieving a higher accuracy compared to when only a single tool for one task is used. Experimental results show an accuracy improvement of 4.4-9.3% over existing studies. Furthermore, when utilizing tool-augmented LLM, a multi-step reasoning approach that divides the process into stages such as planning and tool invocation is widely employed. With the rapid advancement of LLMs, enhanced models continue to emerge. Considering the trade-offs between performance and cost in models, it is crucial to find an optimal combination of models in each stage of tool augmented LLM. In this study, we propose a novel method for efficiently utilizing both enhanced LLM models and existing models, which reduces response errors by up to 9%.
Artificial intelligence agents are emerging as powerful applications of large language models (LLMs), automating complex tasks and enabling scientific data exploration. However, their use in biomedical data analysis remains limited by the difficulty of handling specialized tools and multistep reasoning. Here we introduce BioMedAgent, a self-evolving LLM multi-agent framework, which learns to use diverse bioinformatics tools and chain them into executable workflows through interactive exploration and memory retrieval algorithms. It allows biomedical users to initiate tasks using natural language, without requiring computational expertise. Evaluated on our newly released BioMed-AQA benchmark comprising 327 biomedical data tasks, BioMedAgent achieved a 77% success rate, surpassing other LLM agents, and generalized robustly to the external BixBench dataset. Beyond benchmarks, it autonomously performs cross-omics analysis, machine-learning modelling and pathology image segmentation, highlighting its potential to advance biomedical research and extend to other scientific domains requiring complex tool integration and multistep reasoning.
The COVID-19 pandemic has been accompanied by an "infodemic," where the rapid spread of misinformation has exacerbated public health challenges. Traditional fact-checking methods, though effective, are time-consuming and resource-intensive, limiting their ability to combat misinformation at scale. Large language models (LLMs) such as GPT-4 offer a more scalable solution, but their susceptibility to generating hallucinations-plausible yet incorrect information-compromises their reliability. This study aims to enhance the accuracy and reliability of COVID-19 fact-checking by integrating a retrieval-augmented generation (RAG) system with LLMs, specifically addressing the limitations of hallucination and context inaccuracy inherent in stand-alone LLMs. We constructed a context dataset comprising approximately 130,000 peer-reviewed papers related to COVID-19 from PubMed and Scopus. This dataset was integrated with GPT-4 to develop multiple RAG-enhanced models: the naïve RAG, Lord of the Retrievers (LOTR)-RAG, corrective RAG (CRAG), and self-RAG (SRAG). The RAG systems were designed to retrieve relevant external information, which was then embedded and indexed in a vector store for similarity searches. One real-world dataset and one synthesized dataset, each containing 500 claims, were used to evaluate the performance of these models. Each model's accuracy, F The baseline GPT-4 model achieved an accuracy of 0.856 on the real-world dataset. The naïve RAG model improved this to 0.946, while the LOTR-RAG model further increased accuracy to 0.951. The CRAG and SRAG models outperformed all others, achieving accuracies of 0.972 and 0.973, respectively. The baseline GPT-4 model reached an accuracy of 0.960 on the synthesized dataset. The naïve RAG model increased this to 0.972, and the LOTR-RAG, CRAG, and SRAG models achieved an accuracy of 0.978. These findings demonstrate that the RAG-enhanced models consistently maintained high accuracy levels, closely mirroring ground-truth labels and significantly reducing hallucinations. The CRAG and SRAG models also provided more detailed and contextually accurate explanations, further establishing the superiority of agentic RAG frameworks in delivering reliable and precise fact-checking outputs across diverse datasets. The integration of RAG systems with LLMs substantially improves the accuracy and contextual relevance of automated fact-checking. By reducing hallucinations and enhancing transparency by citing retrieved sources, this method holds significant promise for rapid, reliable information verification to combat misinformation during public health crises.
Large language models (LLMs) can answer expert-level questions in medicine but are prone to hallucinations and arithmetic errors. Early evidence suggests LLMs cannot reliably perform clinical calculations, limiting their potential integration into clinical workflows. We evaluated ChatGPT's performance across 48 medical calculation tasks, finding incorrect responses in one-third of trials (n = 212). We then assessed three forms of agentic augmentation: retrieval-augmented generation, a code interpreter tool, and a set of task-specific calculation tools (OpenMedCalc) across 10,000 trials. Models with access to task-specific tools showed the greatest improvement, with LLaMa and GPT-based models demonstrating a 5.5-fold (88% vs 16%) and 13-fold (64% vs 4.8%) reduction in incorrect responses, respectively, compared to the unimproved models. Our findings suggest that integration of machine-readable, task-specific tools may help overcome LLMs' limitations in medical calculations.
Healthcare simulation scenario design remains a resource-intensive process, demanding significant time and expertise from educators. This article presents an innovative AI-driven agentic workflow for healthcare simulation scenario development, bridging technical capability with pedagogical effectiveness. The system evolved from an initial ChatGPT-based prototype to a sophisticated platform implementation utilizing multiple specialized AI agents. Each agent addresses specific sub-tasks, including objective formulation, patient narrative generation, diagnostic data creation, and debriefing point development. The workflow employs advanced AI methodologies including decomposition, prompt chaining, parallelization, retrieval-augmented generation, and iterative refinement, all orchestrated through a user-friendly conversational interface. Critical to implementation was the demonstration that healthcare professionals with modest technical skills could develop these complex workflows without specialized AI expertise. The system ensures consistent adherence to established simulation guidelines, including INACSL Standards of Best Practice and ASPiH Standards Framework, while significantly reducing scenario development time by approximately 70-80%. Designed for broad applicability across diverse clinical settings and learner levels, the workflow incorporates multilingual capabilities for global application. Potential pitfalls include the necessity for rigorous review of AI-generated content and awareness of bias in model outputs. Key lessons learned emphasize interdisciplinary collaboration, systematic prompt refinement, essential human oversight, and the democratization of AI tools in healthcare education. This innovation demonstrates how sophisticated agentic AI implementations can transform healthcare simulation through enhanced efficiency, consistency, and accessibility without sacrificing pedagogical integrity.
The field of radiology is experiencing rapid adoption of large language models (LLMs), yet their tendency to generate hallucinations (plausible but incorrect information) remains a significant barrier to trust. This comprehensive review evaluates emerging agentic artificial intelligence (AI) approaches, including multi-agent role-based systems, retrieval-augmented generation (RAG), and uncertainty quantification, to assess their potential for reducing hallucinations in radiology workflows. Evidence from 2024 to 2025 demonstrates that agentic AI can improve diagnostic accuracy and reduce error rates, though these methods remain computationally demanding and lack comprehensive clinical validation. Multi-agent frameworks enable cross-validation through role-based specialization and systematic workflow orchestration, while RAG strategies enhance accuracy by grounding responses in verified medical literature. Within multi-agent systems, uncertainty quantification enables agents to communicate confidence levels to one another, allowing them to appropriately weigh each other's contributions during collaborative analysis. While multi-agent frameworks and RAG strategies show significant promise, practical deployment will require careful integration with human oversight, robust evaluation metrics tailored to medical imaging tasks, and regulatory adaptation to ensure safe clinical use in diverse patient populations and imaging modalities.
This paper proposed the Agentic Computer Vision (AgCV) framework designed to automate complex computer vision (CV) tasks through autonomous agents that communicate through Graphical User Interface (GUI). The AgCV framework leverages LangGraph, natural language processing, deep learning, and data science to build adaptive, user-driven CV pipelines. In the proposed AgCV each Agent works on a particular task ranging from object identification and classification to image segmentation. By incorporating Retrieval-Augmented Generation (RAG) and LangGraph, the AgCV enable fully automated pipelining through user interactions. The proposed Framework strategy reduces the need for technical expertise, allowing end-users to generate and configure CV operations using intuitive language commands. AgCV promotes accessibility, scalability, and flexibility of CV applications in different domains. The AgCV not only simplifies user interaction but also ensures that the system aligns with user expectations and needs.•The proposed system allows users to create and configure CV operations using simple natural language, making it accessible even to those with limited technical expertise.•The AgCV framework supports a wide range of CV tasks and can be easily adapted to different user needs and applications.
Integrating Large Language Models (LLMs) with research tools presents technical and reproducibility challenges for biomedical research. While commercial artificial intelligence (AI) systems are easy to adopt, they obscure data provenance, lack transparency, and can generates false information, making them unfit for many research problems. To address these challenges, we developed the Bioinformatics Retrieval Augmented Digital (BRAD) agent software system. Here, we introduce BRAD, an agentic system that integrates LLMs with external tools and data to streamline research workflows. BRAD's modular agents retrieve information from literature, custom software, and online databases while maintaining transparent protocols to increase the reliability of AI generated results. We apply BRAD to a biomarker discovery pipeline, automating both execution and the generation of enrichment reports. This workflow contextualizes user data within the literature, enabling a level of interpretation and automation that surpasses conventional research tools. Beyond the workflow we highlight here, BRAD is a flexible system that has been deployed in other applications including a chatbot, video RAG, and analysis of single cell data. The source code for BRAD is available at https://github.com/Jpickard1/BRAD; Information for pip installation, tutorials, documentation, and further information can be found at: ReadTheDocs.
The exponential growth of biomedical literature-over a million new PubMed entries each year-has outpaced traditional evidence-synthesis methods. Systematic reviews, long the cornerstone of evidence-based dentistry, are resource-intensive and often outdated within a few years, widening the gap between current research and clinical practice. We outline Retrieval-Augmented Generation (RAG) as a methodology for dynamic evidence reviews. RAG strengthens Large Language Models (LLMs) by combining their generative capacity with real-time retrieval from a continuously updated, curated knowledge base. This design grounds every answer in verifiable sources and mitigates the factual errors and hallucinations seen in standalone LLMs. RAG enables on-demand dynamic synthesis of the latest evidence, allowing clinicians and researchers to ask complex, natural-language questions and receive concise, fully cited answers. For dental clinicians, this approach enables rapid, citation-linked answers to practice-relevant questions-such as material selection, healing outcomes, or procedural comparisons-without relying on outdated narrative summaries. We describe three complementary integration pathways-RAG on pre-retrieved article pools, public living review portals, and machine-actionable journal publications-each with distinct requirements and benefits. Looking forward, emerging agentic AI systems, capable of planning multi-step searches and iterative updates, may further enhance these capabilities. Although this framework is conceptually grounded and supported by emerging methodological evidence, prospective empirical validation, benchmarking against existing review approaches, and real-world deployment studies will be required to fully assess its performance, reliability, and impact on clinical decision-making. RAG offers a scalable, transparent alternative to static systematic reviews and can shorten the research-to-practice timeline. By automating retrieval and initial synthesis while keeping human critical appraisal and ethical judgment central, it points toward an era of augmented rather than automated intelligence in evidence-based dentistry.
Large language models (LLMs) have demonstrated impressive capabilities in medical domains, yet their ability to handle the specialized reasoning patterns required in clinical neurology warrants systematic evaluation. Neurological assessment presents distinctive challenges that combine anatomical localization, temporal pattern recognition, and nuanced symptom interpretation-cognitive processes that are specifically tested in board certification examinations. We developed a comprehensive benchmark comprising 305 questions from Israeli Board Certification Exams in Neurology and classified each along three dimensions of complexity: factual knowledge depth, clinical concept integration, and reasoning complexity. We evaluated ten LLMs of varying architectures and specializations using this benchmark, testing base models, retrieval-augmented generation (RAG) enhancement, and a novel multi-agent system. Our analysis revealed significant performance variation across models and methodologies. The OpenAI-o1 model achieved the highest base performance (90.9% accuracy), while specialized medical models performed surprisingly poorly (52.9% for Meditron-70B). RAG enhancement provided variable benefits across models; substantial improvements for mid-tier models like GPT-4o (80.5% to 87.3%) and smaller models, but limited effectiveness on the highest complexity questions regardless of model size. In contrast, our multi-agent framework-which decomposes neurological reasoning into specialized cognitive functions including question analysis, knowledge retrieval, answer synthesis, and validation-achieved dramatic improvements, especially for mid-range models. The LLaMA 3.3-70B-based agentic system reached 89.2% accuracy compared to 69.5% for its base model, with particularly substantial gains on level 3 complexity questions across all dimensions. External validation on MedQA revealed dataset-specific RAG effects: while RAG improved board certification performance, it showed minimal benefit on MedQA questions (LLaMA 3.3-70B: + 1.4% vs + 3.9% on board exams), reflecting alignment between our specialized neurology textbook and board examination content rather than the broader medical knowledge required for MedQA. Most notably, the multi-agent approach transformed inconsistent subspecialty performance into remarkably uniform excellence, effectively addressing the neurological reasoning challenges that persisted even with RAG enhancement. We further validated our approach using an independent dataset comprising 155 neurological cases extracted from MedQA. The results confirm that structured multi-agent approaches designed to emulate specialized cognitive processes significantly enhance complex medical reasoning offering promising directions for AI assistance in challenging clinical contexts.
Biological swarms communicate through decentralized, adaptive behaviors shaped by local interactions, selective attention, and symbolic signaling. These principles of animal communication enable robust coordination without centralized control or persistent connectivity. This work presents a proof of concept that identifies, evaluates, and translates biological communication strategies into a generative visual language for unmanned aerial vehicle (UAV) swarm agents operating in radio-frequency (RF)-denied environments. Drawing from natural exemplars such as bee waggle dancing, white-tailed deer flagging, and peacock feather displays, we construct a configuration space that encodes visual messages through trajectories and LED patterns. A large language model (LLM), preconditioned using retrieval-augmented generation (RAG), serves as a generative translation layer that interprets perception data and produces symbolic UAV responses. Five test cases evaluate the system's ability to preserve and adapt signal meaning through within-modality fidelity (maintaining symbolic structure in the same modality) and cross-modal translation (transferring meaning across motion and light). Covariance and eigenvalue-decomposition analysis demonstrate that this bio-agentic approach supports clear, expressive, and decentralized communication, with motion-based signaling achieving near-perfect clarity and expressiveness (0.992, 1.000), while LED-only and multi-signal cases showed partial success, maintaining high expressiveness (~1.000) but with much lower clarity (≤0.298).
Clinical decision-making in hepatology is currently challenged by the rapid expansion of medical knowledge and the limitations of Large Language Models (LLMs), specifically their unreliability and tendency to hallucinate. Furthermore, standard Retrieval-Augmented Generation (RAG) paradigms often fail to effectively leverage complex medical knowledge structures. To address these issues, we propose an Agentic Graph RAG framework built upon a clinically-verified hepatology knowledge graph. Our approach utilizes a state-driven agentic system employing a self-correcting "retrieve-evaluate-refine" loop. Within this workflow, agents dynamically generate, semantically validate, assess, and iteratively optimize graph search strategies to construct a comprehensive and accurate context, which is then used by an LLM to generate reliable responses. The framework was evaluated on a custom dataset of clinical questions. It significantly outperformed baseline models (including GPT-4, standard RAG, and Graph RAG) across all evaluation metrics. Specifically, our model achieved superior scores in faithfulness (0.94), context recall (0.92), and answer relevancy (0.91). This agentic approach effectively mitigates LLM hallucinations and provides accurate, interpretable answers. These findings demonstrate the framework's potential as a robust, next-generation intelligent clinical decision support tool for hepatology.
The pharmaceutical industry faces pressure to improve the drug development process while reducing costs in an evolving regulatory landscape. This paper presents the Preclinical Information Center (PRINCE), a cloud-hosted data integration platform developed by Bayer AG in collaboration with Thoughtworks. PRINCE integrates decades of structured and unstructured safety study reports, leveraging a multi-agent architecture based on Large Language Models (LLMs) and advanced data retrieval methodologies, such as Retrieval-Augmented Generation and Text-to-SQL. In this paper, we describe the three-step evolution of PRINCE from a data search tool based on keyword matching to a resourceful research assistant capable of answering complex questions and drafting regulatory-critical documents. We highlight the iterative development process, guided by user feedback, that ensures alignment with evolving research needs and maximizes utility. Finally, we discuss the importance of building trust-based solutions and how transparency and explainability have been integrated into PRINCE. In particular, the integration of a human-in-the-loop approach enhances the accuracy and retains human accountability. We believe that the development and deployment of the PRINCE chatbot demonstrate the transformative potential of AI in the pharmaceutical industry, significantly improving data accessibility and research efficiency, while prioritizing data governance and compliance.
Wildfires are environmental hazards with severe ecological, social, and economic impacts. Wildfires devastate ecosystems, communities, and economies worldwide, with rising frequency and intensity driven by climate change, human activity, and environmental shifts. Analyzing wildfire insights such as detection, predictive patterns, and risk assessment enables proactive response and long-term prevention. However, most of the existing approaches have been focused on isolated processing of data, making it challenging to orchestrate cross-modal reasoning and transparency. This study proposed a novel orchestrator-based multi-agent system (MAS), with the aim of transforming multimodal environmental data into actionable intelligence for decision making. We designed a framework to utilize Large Multimodal Models (LMMs) augmented by structured prompt engineering and specialized Retrieval-Augmented Generation (RAG) pipelines to enable transparent and context-aware reasoning, providing a cutting-edge Visual Question Answering (VQA) system. It ingests diverse inputs like satellite imagery, sensor readings, weather data, and ground footage and then answers user queries. Validated by several public datasets, the system achieved a precision of 0.797 and an F1-score of 0.736. Thus, powered by Agentic AI, the proposed, human-centric solution for wildfire management, empowers firefighters, governments, and researchers to mitigate threats effectively.
Traditional knowledge graphs of water conservancy project risks have supported risk decision-making. However, they are constrained by limited data modalities and low accuracy in information extraction. A multimodal water conservancy project risk knowledge graph is proposed in this study, along with a synergistic strategy involving multimodal large language models Risk decision-making generation is facilitated through a multi-agent agentic retrieval-augmented generation framework. To enhance visual recognition, a DenseNet-based image classification model is improved by incorporating single-head self-attention and coordinate attention mechanisms. For textual data, risk entities such as locations, components, and events are extracted using a BERT-BiLSTM-CRF architecture. These extracted entities serve as the foundation for constructing the multimodal knowledge graph. To support generation, a multi-agent agentic retrieval-augmented generation mechanism is introduced. This mechanism enhances the reliability and interpretability of risk decision-making outputs. In experiments, the enhanced DenseNet model outperforms the original baseline in both precision and recall for image recognition tasks. In risk decision-making tasks, the proposed approach-combining a multimodal knowledge graph with a multi-agent agentic retrieval-augmented generation method-achieves strong performance on BERTScore and ROUGE-L metrics. This work presents a novel perspective for leveraging multimodal knowledge graphs in water conservancy project risk management.
To develop and evaluate an agentic retrieval augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs) and assess the LLMs capabilities as validation agents tasked with blocking harmful content. We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via two-stage evaluation (automated LLM, then expert review) using 5 metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately using a harmful/safe PEM dataset, measuring blocking accuracy. ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top 9 ranks. Expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed highest-performing. VA accuracy correlated strongly with model size; only models ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA. Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits impacted large-context models. The validation task highlighted model size as critical for reliable performance. ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models like AceGPT-v2-32B. Larger models appear necessary for reliable harmful content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers.
The Internet of Things (IoT) has enabled a vast network of devices to communicate over the Internet. However, the fragmentation of IoT systems continues to hinder seamless data sharing and coordinated management across platforms.However, there is currently no actual search engine for IoT data. Existing IoT search engines are considered device discovery tools, providing only metadata about devices rather than enabling access to IoT application data. While efforts such as IoTCrawler have striven to support IoT application data, they have largely failed due to the fragmentation of IoT systems and the heterogeneity of IoT data.To address this, we recently introduced SensorsConnect-a unified framework designed to facilitate interoperable content and sensor data sharing among collaborative IoT systems, inspired by how the World Wide Web (WWW) enabled shared and accessible information spaces for humans. This paper presents the IoT Agentic Search Engine (IoTASE), a real-time semantic search engine tailored specifically for IoT environments. IoTASE leverages LLMs and Retrieval-Augmented Generation (RAG) techniques to address the challenges of navigating and searching vast, heterogeneous streams of real-time IoT data. This approach enables the system to process complex natural language queries and return accurate, contextually relevant results in real time. To evaluate its effectiveness, we implemented a hypothetical deployment in the Toronto region, simulating a realistic urban environment using a dataset composed of 500 services and over 37,000 IoT-like data entries. Our evaluation shows that IoT-ASE achieved 92% accuracy in retrieving intent-aligned services and consistently generated concise, relevant, and preference-aware responses, outperforming generalized outputs produced by systems such as Gemini. These results underscore the potential of IoT-ASE to make real-time IoT data both accessible and actionable, supporting intelligent decision-making across diverse application domains.
本报告对 Agent、AI 及 LLM 研究进行了深度整合与分类,构建了涵盖底层理论、应用开发与安全治理的综合框架。研究呈现出明显的范式转变:从基础的 Agentic RAG 框架进化至复杂的科学自动化工作流;从单一的任务执行向多智能体协同协作演进;从通用架构设计走向各行业(如生物医药、工业运维、社会科学)的垂直落地。同时,随着智能体应用复杂性的提升,针对隐私安全、行为验证及人机信任评估的研究已成为不可或缺的基石,标志着 AI Agent 正步入高度自主与可控并行发展的工业化应用阶段。