Agent Development and Evolution
Agent Cognitive Architectures and General Theoretical Models
Focuses on the underlying design and cognitive evolution of agents, covering the perception-brain-action model, the Model Context Protocol (MCP), world models, intent-driven theory, and the shift from traditional architectures toward generative and neuro-symbolic architectures.
- Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents(V. Arunkumar, Gangadharan G.R., R. Buyya, 2026, ArXiv)
- Architecting AI Agents: A Comprehensive Technical Guide(Mamatha Adinarayana Swamy, 2025, European Modern Studies Journal)
- From the logic of coordination to goal-directed reasoning: the agentic turn in artificial intelligence(Tsehaye Haidemariam, 2026, Frontiers in Artificial Intelligence)
- From Subsumption to Semantic Mediation: A Generative Orchestration Architecture for Autonomous Systems(A. Kojukhov, Ilya Levin, Arkady Bovshover, 2025, Algorithms)
- Unified Multimodal Cognitive Architecture (UMCA): an Integrated Framework for Perception, Reasoning, and Action for High-Stakes Environments(Dr. Jaimin Jani, 2025, International Journal of Environmental Sciences)
- Modeling Interactions Between Autonomous Agents in a Multi-Agent Self-Awareness Architecture(Abrham Shiferaw Alemaw, Giulia Slavic, Pamela Zontone, L. Marcenaro, David Martín Gómez, Carlo S. Regazzoni, 2025, IEEE Transactions on Multimedia)
- Architecture and Workload as Primary Sources of Error in RAG and Agentic AI Systems: Summary of Two Years of ☸️SAIMSARA Development(Saimsara, 2025, Web3 Journal: ML in Health Science)
- LLM-Driven Agentic AI Approach to Enhanced O-RAN Resilience in Next-Generation Networks(Xingqi Wu, Yuhui Wang, Junaid Farooq, Juntao Chen, 2025, IEEE INFOCOM 2025 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS))
- MCP-Enabled LLM Agent for Contextual Reasoning in a Multilingual Tourism Voicebot(S. Anand, A. Sai, 2025, 2025 IEEE 4th International Conference for Advancement in Technology (ICONAT))
- A Generative Neuro-Cognitive Architecture Using Quantum Algorithms for the Autonomous Behavior of a Smart Agent in a Simulation Environment(Evren Daglarli, 2025, Computers, Materials & Continua)
- Agentic AI Systems: Architecture, Capabilities, and Implications for Autonomous Decision-Making(Gauri Sale, Dr. Sonal Ayare, 2025, International Journal of Technology & Emerging Research)
- World-in-World: World Models in a Closed-Loop World(Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunouglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, P. Liang, Daniel Khashabi, Cheng Peng, Ramalingam Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen, 2025, ArXiv)
- Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration(James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu, Tristan Naumann, Muhao Chen, H. Poon, 2025, ArXiv)
- Autonomous Task Completion Based on Goal-directed Answer Set Programming(Alexis Tudor, 2025, Electronic Proceedings in Theoretical Computer Science)
- Augmenting Von Neumann's Architecture for an Intelligent Future(Rajpreet Singh, Vidhi Kothari, 2025, ArXiv)
- AI Agents: Evolution, Architecture, and Real-World Applications(Naveen Krishnan, 2025, ArXiv)
- Agentic AI Systems: Architecture and Evaluation Using a Frictionless Parking Scenario(Alaa Khamis, 2025, IEEE Access)
- An Enterprise Agentic Architecture Framework for Agentic AI Governance and Scalable Autonomy(Padmanabham Venkiteela, 2026, Scientific Journal of Computer Science)
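
The perception-brain-action model this section surveys can be sketched as a minimal sense-think-act loop. This is an illustrative sketch only; the class and method names are our own and do not come from any cited paper:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Minimal perception-brain-action loop (illustrative, hypothetical names)."""
    memory: list = field(default_factory=list)

    def perceive(self, observation: str) -> str:
        # Perception: record raw input and normalize it into an internal percept.
        self.memory.append(observation)
        return observation.lower()

    def think(self, percept: str) -> str:
        # "Brain": choose an action from the percept (a stand-in for reasoning).
        if "obstacle" in percept:
            return "avoid"
        return "advance"

    def act(self, action: str) -> str:
        # Action: emit the chosen command back to the environment.
        return f"executing:{action}"

agent = Agent()
print(agent.act(agent.think(agent.perceive("Obstacle ahead"))))  # executing:avoid
```

In the surveyed architectures the `think` step is typically an LLM call or a world-model rollout rather than a rule, but the closed perceive-think-act cycle is the shared skeleton.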
Multi-Agent System (MAS) Collaboration, Communication, and Organizational Topology
Examines the dynamics of collaboration among multiple agents, including task decomposition, consensus mechanisms, role assignment, communication-overhead optimization, delay-tolerance strategies, and topology learning based on graph neural networks.
- Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems(Jiaxi Yang, Mengqi Zhang, Yiqiao Jin, Hao Chen, Qingsong Wen, Lu Lin, Yi He, Weijie Xu, James Evans, Jindong Wang, 2025, ArXiv)
- Preventing Rogue Agents Improves Multi-Agent Collaboration(Ohav Barbi, Ori Yoran, Mor Geva, 2025, ArXiv)
- Multi-Agent Collaboration via Evolving Orchestration(Yufan Dang, Cheng Qian, Xu Luo, Jingru Fan, Zihao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun, 2025, ArXiv)
- Advancing Complex Task Management Through Multi-Agent Systems: Evolution and Transformation of Distributed AI Platforms(Mohan Singh, 2025, Journal of Computer Science and Technology Studies)
- MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation(Mingcheng Li, Xiaolu Hou, Ziyang Liu, Dingkang Yang, Ziyun Qian, Jiawei Chen, Jinjie Wei, Yue Jiang, Qingyao Xu, Lihua Zhang, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- MAGNNET: Multi-Agent Graph Neural Network-Based Efficient Task Allocation for Autonomous Vehicles with Deep Reinforcement Learning(Lavanya Ratnabala, A. Fedoseev, Robinroy Peter, Dzmitry Tsetserukou, 2025, 2025 IEEE Intelligent Vehicles Symposium (IV))
- Long-Video Audio Synthesis with Multi-Agent Collaboration(Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, Ying-Cong Chen, 2025, ArXiv)
- Collaborative Tree Search for Enhancing Embodied Multi-Agent Collaboration(Lizheng Zu, Lin Lin, Song Fu, Na Zhao, Pan Zhou, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction(Song Wang, Zhen Tan, Zihan Chen, Shuang Zhou, Tianlong Chen, Jundong Li, 2025, ArXiv)
- AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration(Zhexuan Wang, Yutong Wang, Xuebo Liu, Liang Ding, Miao Zhang, Jie Liu, Min Zhang, 2025, No journal)
- Semantic Information Extraction and Multi-Agent Communication Optimization Based on Generative Pre-Trained Transformer(Li Zhou, Xinfeng Deng, Zhe Wang, Xiaoying Zhang, Yanjie Dong, Xiping Hu, Zhaolong Ning, Jibo Wei, 2025, IEEE Transactions on Cognitive Communications and Networking)
- CoDe: Communication Delay-Tolerant Multi-Agent Collaboration via Dual Alignment of Intent and Timeliness(Shoucheng Song, Youfang Lin, Sheng Han, Chang Yao, Hao Wu, Shuo Wang, Kai Lv, 2025, No journal)
- HAWK: A Hierarchical Workflow Framework for Multi-Agent Collaboration(Yuyang Cheng, Yumiao Xu, Chaojia Yu, Yong Zhao, 2025, ArXiv)
- MARS: toward more efficient multi-agent collaboration for LLM reasoning(Xiao Wang, Jia Wang, Yijie Wang, P. Dang, Sha Cao, Chi Zhang, 2025, ArXiv)
- Explain-Analyze-Generate: A Sequential Multi-Agent Collaboration Method for Complex Reasoning(W. Gu, Jiale Han, Haowen Wang, Xiang Li, Bo Cheng, 2025, No journal)
- LLM-Based Multi-Agent Systems are Scalable Graph Generative Models(Jiarui Ji, Runlin Lei, Jialing Bi, Zhewei Wei, Xu Chen, Yankai Lin, Xuchen Pan, Yaliang Li, Bolin Ding, 2025, No journal)
- DRF: LLM-AGENT Dynamic Reputation Filtering Framework(Yuwei Lou, Hao Hu, Shaocong Ma, Zongfei Zhang, Liang Wang, Jidong Ge, Xianping Tao, 2025, ArXiv)
- AutoHMA-LLM: Efficient Task Coordination and Execution in Heterogeneous Multi-Agent Systems Using Hybrid Large Language Models(Tinging Yang, Ping Feng, Qixin Guo, Jindi Zhang, Xiufeng Zhang, Jiahong Ning, Xinghan Wang, Zhongyang Mao, 2025, IEEE Transactions on Cognitive Communications and Networking)
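
The topology and communication-overhead questions above can be made concrete with a toy model: agents as graph nodes, edges as message channels, and a counter for the overhead that topology-learning methods try to minimize. All names here are invented for illustration:

```python
from collections import defaultdict

class Swarm:
    """Toy multi-agent communication topology (illustrative sketch)."""
    def __init__(self):
        self.edges = defaultdict(list)   # sender -> list of receivers
        self.inbox = defaultdict(list)   # receiver -> delivered messages
        self.messages_sent = 0           # crude communication-overhead counter

    def connect(self, sender: str, receiver: str) -> None:
        self.edges[sender].append(receiver)

    def broadcast(self, sender: str, msg: str) -> None:
        # Every edge traversal counts toward the overhead budget.
        for receiver in self.edges[sender]:
            self.inbox[receiver].append((sender, msg))
            self.messages_sent += 1

# A star topology: one orchestrator fanning a decomposed task out to workers.
swarm = Swarm()
for worker in ("coder", "tester", "reviewer"):
    swarm.connect("orchestrator", worker)
swarm.broadcast("orchestrator", "decompose: build feature X")
print(swarm.messages_sent)  # 3
```

Methods like AgentDropout or AnyMAC can be read as learning which of these edges (or agents) to prune or reorder so that `messages_sent` shrinks without hurting task performance.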
Self-Evolution, Self-Learning, and Continuous Improvement Mechanisms
Focuses on agents' capacity for autonomous evolution: sustained capability growth through reinforcement learning, self-reflection, cross-task experience pools, hallucination mitigation, and trial-and-error improvement loops.
- From Agentification to Self-Evolving Agentic AI for Wireless Networks: Concepts, Approaches, and Future Research Directions(Changyuan Zhao, Ruichen Zhang, Jiacheng Wang, Dusit Niyato, G. Sun, Xianbin Wang, Shiwen Mao, Abbas Jamalipour, 2025, ArXiv)
- SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience(Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiao-wen Dong, Tong Wu, Dahua Lin, Jiaqi Wang, 2025, ArXiv)
- HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research(Yinghao Zhu, Yifan Qi, Zixiang Wang, Lei Gu, Dehao Sui, Haoran Hu, Xichen Zhang, Ziyi He, Liantao Ma, Lequan Yu, 2025, ArXiv)
- Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration(Yilong Li, Cheng Qian, Yu Xia, Ruijie Shi, Yufan Dang, Zihao Xie, Ziming You, Weize Chen, Cheng Yang, Weichuan Liu, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun, 2025, ArXiv)
- Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies(Jing Wang, Weiting Peng, Jing Tang, Zeyu Gong, Xihuai Wang, Bo Tao, Li Cheng, 2025, ArXiv)
- REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation(Puzhen Yuan, Angyuan Ma, Yunchao Yao, Huaxiu Yao, Masayoshi Tomizuka, Mingyu Ding, 2025, ArXiv)
- MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration(David Wan, J. Chen, Elias Stengel-Eskin, Mohit Bansal, 2025, ArXiv)
- RoboPhD: Self-Improving Text-to-SQL Through Autonomous Agent Evolution(Andrew Borthwick, Stephen Ash, 2026, ArXiv)
- InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration(Zhongyu Yang, Yingfang Yuan, Xuanming Jiang, Baoyi An, Wei Pang, 2025, ArXiv)
- Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks(Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Mingshi Yan, Ji Zhang, Fei Huang, Heng Ji, 2025, ArXiv)
- Self-Supervised Goal-Reaching Results in Multi-Agent Cooperation and Exploration(Chirayu Nimonkar, Shlok Shah, Catherine Ji, Benjamin Eysenbach, 2025, ArXiv)
- Environment Agnostic Goal-Conditioning, A Study of Reward-Free Autonomous Learning(Hampus Åström, E. A. Topp, J. Malec, 2025, ArXiv)
- AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation(Ming Wang, Peidong Wang, L. Wu, Xiaocui Yang, Daling Wang, Shi Feng, Yuxin Chen, Bixuan Wang, Yifei Zhang, 2025, ArXiv)
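
The trial-and-error improvement loop with a cross-task experience pool can be sketched as follows. The scoring rule and update step are invented stand-ins, not any cited paper's algorithm:

```python
def solve(task: int, strategy: int) -> int:
    # Stand-in for task execution: a stronger strategy scores higher,
    # capped by the task's difficulty.
    return min(strategy, task)

def self_evolve(tasks, initial_strategy=0):
    """Trial -> reflect -> improve loop over a sequence of tasks (sketch)."""
    strategy = initial_strategy
    experience_pool = []                 # accumulated (task, score) experience
    for task in tasks:
        score = solve(task, strategy)    # trial
        experience_pool.append((task, score))
        if score < task:                 # reflection: detect underperformance
            strategy += 1                # improvement: update for next task
    return strategy, experience_pool

strategy, pool = self_evolve([1, 2, 3, 4])
print(strategy)  # 4
```

In systems like SEAgent or Mobile-Agent-E, the "strategy" is a prompt, tool set, or policy, and reflection is an LLM critique of the trajectory, but the loop structure is the same.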
Software Engineering Automation and Intelligent Operations (AIOps)
Studies the deep integration of agents across the software development lifecycle (SDLC), including automated testing, code generation and repair, requirements engineering, agile management, and autonomous operations in cloud-native environments such as Kubernetes.
- Agentic AI for Software: thoughts from Software Engineering community(Abhik Roychoudhury, 2025, ArXiv)
- Do Autonomous Agents Contribute Test Code? A Study of Tests in Agentic Pull Requests(S. Haque, Sarvesh Ingale, Christoph Csallner, 2026, ArXiv)
- Think Like an Engineer: A Neuro-Symbolic Collaboration Agent for Generative Software Requirements Elicitation and Self-Review(Sai Zhang, Zhenchang Xing, Jieshan Chen, Dehai Zhao, Zizhong Zhu, Xiaowang Zhang, Zhiyong Feng, Xiaohong Li, 2025, ArXiv)
- Mediating between Human Programmers and Integrated Development Environments using LLM-based Agents(Ziyou Li, 2025, Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering)
- An Agent-Oriented Model-Driven Development Process for Cyber-Physical Systems(Claudio Navarro, Lorenzo Devia, J. E. L. Gayo, Carlos Cares, 2025, No journal)
- Software Architecture in the Age of Agentic AI(Karthik Vaidhyanathan, Henry Muccini, 2025, No journal)
- Agile Software Management with Cognitive Multi-Agent Systems(Konrad Cinkusz, Jarosław A. Chudziak, 2025, No journal)
- Unified Software Engineering agent as AI Software Engineer(Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, Abhik Roychoudhury, 2025, ArXiv)
- Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency(Nazmus Ashrafi, Salah Bouktif, Mohammed Mediani, 2025, ArXiv)
- Autonomous QA Agent: A Retrieval-Augmented Framework for Reliable Selenium Script Generation(Dudekula Kasim Vali, 2025, ArXiv)
- AI builds, We Analyze: An Empirical Study of AI-Generated Build Code Quality(Anwar Ghammam, Mohamed Almukhtar, 2026, ArXiv)
- AI-Driven Product Development: Cognitive Software Delivery at Enterprise Scale(Jothimani kanthan Ganapathi, 2025, Journal of Computer Science and Technology Studies)
- Using Agenticai With Kubernetes For Faster Development, Deployment and Delivery in Production Environments.(Hardeep S. Tiwana, 2025, The American Journal of Engineering and Technology)
- AutoDevFlow: A Multi-Agent System for Automated Development Lifecycle Management(Yu-Hxiang Chen, Wei-Hsiang Sung, Chia-Yu Lin, 2025, 2025 IEEE International Conference on Consumer Electronics - Taiwan (ICCE-Taiwan))
- Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine(Wenyi Wang, Piotr Piekos, Nanbo Li, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, Jurgen Schmidhuber, 2025, ArXiv)
- SECURITY AS CODE USING AGENTIC AI: EFFICIENCY IN ENSURING SOFTWARE DEVELOPMENT LIFECYCLE SECURITY(O.P. Vakhula, P. Vorobets, 2025, Computer systems and network)
- LLM - Driven Autonomous Cloud Automation Agent(M. G S, Vibha Joshi, Pranitha T Manur, 2025, 2025 9th International Conference on Computational System and Information Technology for Sustainable Solutions (CSITSS))
- Autonomic Microservice Management via Agentic AI and MAPE-K Integration(Matteo Esposito, Alexander Bakhtin, Noman Ahmad, Mikel Robredo, Ruoyu Su, Valentina Lenarduzzi, Davide Taibi, 2025, ArXiv)
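
The MAPE-K pattern named in the last entry (Monitor-Analyze-Plan-Execute over shared Knowledge) is the classic autonomic-computing loop that agentic AIOps systems build on. A minimal sketch, with thresholds and action names invented for illustration:

```python
def mape_k_step(metrics: dict, knowledge: dict) -> str:
    """One pass of a MAPE-K autonomic loop (illustrative sketch)."""
    # Monitor: ingest current telemetry into the shared knowledge base.
    knowledge["last_metrics"] = metrics
    # Analyze: compare against the known healthy envelope.
    overloaded = metrics["cpu"] > knowledge["cpu_limit"]
    # Plan: choose an adaptation action.
    action = "scale_out" if overloaded else "no_op"
    # Execute: apply it (here we only record the decision).
    knowledge.setdefault("actions", []).append(action)
    return action

knowledge = {"cpu_limit": 0.8}
print(mape_k_step({"cpu": 0.95}, knowledge))  # scale_out
print(mape_k_step({"cpu": 0.40}, knowledge))  # no_op
```

Agentic variants replace the fixed Analyze/Plan rules with LLM reasoning over the knowledge base while keeping the same loop boundary.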
Security Defense, Compliance Governance, and Trustworthy Agent Systems
Addresses the new security risks agents introduce (such as memory poisoning, privacy leakage, and emergent bias) and proposes threat models, blockchain-based auditing, red-team evaluation, and enterprise-grade governance frameworks.
- Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents(Vineeth Sai Narajala, Om Narayan, 2025, 2025 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI))
- A Governance Framework For Agentic AI: Mitigating Systemic Risks In LLM-Powered Multi-Agent Architectures(Annapurneswar Putrevu, 2025, Journal of International Crisis and Risk Communication Research)
- Securing Agentic AI: Threat Modeling and Risk Analysis for Network Monitoring Agentic AI System(P. Zambare, Venkata Nikhil Thanikella, Ying Liu, 2025, ArXiv)
- Cognitive Trust Architecture for Mitigating Agentic AI Threats: Adaptive Reasoning and Resilient Cyber Defense(Kumrashan Indranil Iyer, 2025, Journal of Information Systems Engineering and Management)
- RedTeamLLM: an Agentic AI framework for offensive security(Brian Challita, Pierre Parrend, 2025, ArXiv)
- iThelma: An Autonomous LLM Agent for Cyber Threat Hunting via Playbook-Driven Intelligence(Nick Cheng, Raymund Lin, Dean Xie, Han Lin, Sally Chen, 2025, 2025 IEEE Conference on Communications and Network Security (CNS))
- DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses(Han Luo, Guy Laban, 2025, ArXiv)
- A Multi-agent Security Framework for AI-Assisted Software Development(D. Mukherjee, 2026, International Journal of AI, BigData, Computational and Management Studies)
- The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration(Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal, 2025, ArXiv)
- SAGA: A Security Architecture for Governing AI Agentic Systems(Georgios Syros, Anshuman Suri, Cristina Nita-Rotaru, Alina Oprea, 2025, ArXiv)
- SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks(Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, Lingming Zhang, 2025, ArXiv)
- CurriculumPT: LLM-Based Multi-Agent Autonomous Penetration Testing with Curriculum-Guided Task Scheduling(Xingyu Wu, Yunzhe Tian, Yuanwan Chen, Ping Ye, Xiaoshu Cui, Jingqi Jia, Shouyang Li, Jiqiang Liu, Wenjia Niu, 2025, Applied Sciences)
- SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection(Yang Feng, Xudong Pan, 2025, ArXiv)
- Agent-vs-Agent Cyber Warfare: Autonomous AI Systems Defending Against AI-Enabled APTs(Dr. Salman Arafath Mohammed, 2025, International Journal on Advanced Computer Engineering and Communication Technology)
- Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems(Jin L. C. Guo, Yingying Xu, 2025, ArXiv)
- An LLM Agent-based Framework for Whaling Countermeasures(Daisuke Miyamoto, Takuji Iimura, Narushige Michishita, 2026, ArXiv)
- A Blockchain-Monitored Agentic AI Architecture for Trusted Perception-Reasoning-Action Pipelines(Salman Jan, Hassan Ali Razzaqi, Ali Akarma, M. R. Belgaum, 2025, ArXiv)
- CAPRI: A Context-Aware Privacy Framework for Multi-Agent Generative AI Applications(J. Park, Vijay K. Madisetti, 2025, IEEE Access)
- A Framework for Autonomous, Cross-Cloud Threat Mitigation Using Multi-Agent Reinforcement Learning(Akshay Mittal, 2025, International Journal of Global Innovations and Solutions)
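
Several entries above (e.g., the reputation-filtering and credit-based threat-detection papers) gate inter-agent messages on a running trust score. A toy version, with all scoring rules invented for illustration:

```python
class CreditGate:
    """Toy credit-based message gate for a multi-agent system (sketch)."""
    def __init__(self, threshold: float = 0.5):
        self.credit = {}           # agent -> running credit score in [0, 1]
        self.threshold = threshold

    def report(self, agent: str, consistent: bool) -> None:
        # Reward consistent behavior slowly; punish anomalies sharply,
        # so a compromised agent loses trust faster than it regains it.
        c = self.credit.get(agent, 1.0)
        self.credit[agent] = min(1.0, c + 0.1) if consistent else c * 0.5

    def admit(self, agent: str) -> bool:
        # Messages from low-credit agents are quarantined for review.
        return self.credit.get(agent, 1.0) >= self.threshold

gate = CreditGate()
gate.report("alice", True)
gate.report("mallory", False)
gate.report("mallory", False)
print(gate.admit("alice"), gate.admit("mallory"))  # True False
```

The asymmetric update (slow gain, fast loss) is a common design choice in such schemes: it bounds the damage a rogue agent can do before being cut off.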
Embodied Intelligence, Robotic Perception, and Autonomous Control
Explores agents' perception-action closed loops in physical or simulated environments, spanning vision-language-action (VLA) models, heterogeneous robot collaboration, active inference, obstacle-avoidance navigation, and action planning for complex tasks.
- Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation(Badi Li, Renjie Lu, Yu Zhou, Jingke Meng, Wei-Shi Zheng, 2025, ArXiv)
- Bio-Inspired Topological Autonomous Navigation with Active Inference in Robotics(Daria de Tinguy, Tim Verbelen, Emilio Gamba, Bart Dhoedt, 2025, 2025 25th International Conference on Control, Automation and Systems (ICCAS))
- Active Inference for an Intelligent Agent in Autonomous Reconnaissance Missions(Johan Schubert, Farzad Kamrani, T. Gustavi, 2025, ArXiv)
- Multi-Agent Collaborative Closed-Loop Planning - Evaluation - Assessment Framework: Optimization and Interpretability Evaluation(Jiajing Li, Li Yang, Zihan Wei, Hang Zhou, Kequan Fang, 2025, 2025 4th International Conference on Intelligent Mechanical and Human-Computer Interaction Technology (IHCIT))
- Generalizable Reinforcement Learning with Biologically Inspired Hyperdimensional Occupancy Grid Maps for Exploration and Goal-Directed Path Planning(Shay Snyder, Ryan Shea, Andrew Capodieci, David J. Gorsich, Maryam Parsa, 2025, ArXiv)
- Synergistic Hierarchical AI Framework for USV Navigation: Closing the Loop Between Swin-Transformer Perception, T-ASTAR Planning, and Energy-Aware TD3 Control(Haonan Ye, Hongjun Tian, Qingyun Wu, Yihong Xue, Jiayu Xiao, Guijie Liu, Yang Xiong, 2025, Sensors (Basel, Switzerland))
- Perception to Action with Vision-Language-Action Models for Fast and Reliable Decision Making in Dynamic Environments(Yeryeong Cho, Jaeyoung Choe, Joongheon Kim, 2025, 2025 16th International Conference on Information and Communication Technology Convergence (ICTC))
- ORION: Option-Regularized Deep Reinforcement Learning for Cooperative Multi-Agent Online Navigation(Shizhe Zhang, Jingsong Liang, Zhitao Zhou, Shuhan Ye, Yizhuo Wang, Derek Ming Siang Tan, Jimmy Chiun, Yuhong Cao, G. Sartoretti, 2026, ArXiv)
- Multi-Agent Reinforcement Learning for Control of Heterogeneous AMRs in Smart Factories(Jongin Lee, Soyi Jung, 2025, 2025 16th International Conference on Information and Communication Technology Convergence (ICTC))
- Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop(Justin Kerr, Kush Hari, Ethan Weber, Chung Min Kim, Brent Yi, Tyler Bonnen, Ken Goldberg, Angjoo Kanazawa, 2025, ArXiv)
- LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating(Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen, Chenrui Tie, Jiajun Deng, Lin Shao, 2025, ArXiv)
- MARL-Based AUV Formation for Underwater Intelligent Autonomous Transport Systems Supported by 6G Network(Jingyi He, Meng Xi, Jiabao Wen, Shuai Xiao, Jiachen Yang, 2025, IEEE Transactions on Intelligent Transportation Systems)
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge(Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin, 2025, ArXiv)
- RoboOS: A Hierarchical Embodied Framework for Cross-Embodiment and Multi-Agent Collaboration(Huajie Tan, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Yaoxu Lyu, Mingyu Cao, Zhongyuan Wang, Shanghang Zhang, 2025, ArXiv)
- BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments(Yibo Qiu, Zan Huang, Zhiyu Wang, Han Liu, Yilin Qiao, Yifeng Hu, Shu'ang Sun, Hangke Peng, R. X. Xu, Mingzhai Sun, 2025, ArXiv)
- CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments(Mingcong Lei, Ge Wang, Yiming Zhao, Zhixin Mai, Qingrui Zhao, Yao Guo, Zhen Li, Shuguang Cui, Yatong Han, Jinke Ren, 2025, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
Autonomous Driving and Intelligent Traffic Scheduling
Focuses on applying agents to autonomous-driving decision-making, multi-agent games, trajectory planning, urban traffic route optimization, and closed-loop evaluation of driving scenarios.
- A Preference-Based Multi-Agent Federated Reinforcement Learning Algorithm Framework for Trustworthy Interactive Urban Autonomous Driving(Sikai Lu, Yingfeng Cai, Ze Liu, Yubo Lian, Long Chen, Hai Wang, 2025, IEEE Transactions on Intelligent Transportation Systems)
- Intelligent path planning technique for autonomous vehicles using improved harmony search optimized fuzzy control(V. Satya, Rahul Kosuru, Ashwin Kavasseri Venkitaraman, 2025, World Journal of Advanced Research and Reviews)
- ALGPT: Multi-Agent Cooperative Framework for Open-Vocabulary Multi-Modal Auto-Annotating in Autonomous Driving(Yijie Zhou, Xianhui Cheng, Qiming Zhang, Lei Wang, Wenchao Ding, Xiangyang Xue, Chunbo Luo, Jian Pu, 2025, IEEE Transactions on Intelligent Vehicles)
- Adaptive Navigation System for an Autonomous Vehicle in a Goal-Oriented Environment(Over Mejia, Ronald Ceballos, Rhonald Torres, J.P. Hoyos, 2025, IEEE Latin America Transactions)
- DriveEval-Agent: A Closed-Loop Framework Combining Zero-Shot Benchmarking and Full Fine-Tuning for Multimodal Autonomous Driving(Jingwen Zhao, Junwei Hu, yichu Liu, 2025, Proceedings of the 7th ACM International Conference on Multimedia in Asia)
- Intelligent scheduling control for coordinated passing of multiple autonomous trucks at intersections in open-pit coal mines via enhanced multi-agent reinforcement learning(Boyu Luan, Yufeng Xiao, Wei Zhou, Hairong Zhou, Xiangcheng Lu, Zhihui Han, 2025, J. Comput. Des. Eng.)
- Intelligent Routing in Connected Autonomous Vehicles using the ADAPT-S Learning Architecture(T. Vaishnavi, M. Y. Al-Safarini, N. Nalini, T. Stalin, J. Krishnaraj, M. Dinesh, 2025, 2025 6th International Conference on Electronics and Sustainable Communication Systems (ICESC))
- Autonomous Intersection Management via Prior-Enhanced Multi-Agent Constrained Decision Transformer(Rui Zhao, Yuze Fan, Yun Li, Kui Wang, Chengyuan Zheng, Fei Gao, Zhenhai Gao, 2025, IEEE Transactions on Intelligent Transportation Systems)
- OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model(Xingcheng Zhou, Xu Han, Feng Yang, Yunpu Ma, Alois Knoll, 2025, ArXiv)
- nuPlan-R: A Closed-Loop Planning Benchmark for Autonomous Driving via Reactive Multi-Agent Simulation(Mingxing Peng, Ruoyu Yao, Xusen Guo, Jun Ma, 2025, ArXiv)
- CHARMS: A Cognitive Hierarchical Agent for Reasoning and Motion Stylization in Autonomous Driving(Jingyi Wang, Duanfeng Chu, Zejian Deng, Liping Lu, Jinxiang Wang, Chen Sun, 2025, ArXiv)
- Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach(Junchao Fan, Xiaolin Chang, 2025, ArXiv)
Healthcare and Biomedical Intelligence Applications
Studies agent applications in medical diagnosis, radiology report generation, EHR analysis, and medical software development, along with dedicated benchmarks for evaluating medical agents.
- MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks(Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, Lequan Yu, 2025, ArXiv)
- MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning(Peng Xia, Jinglu Wang, Yi Peng, Kaide Zeng, Xiangnan Wu, Xiangru Tang, Hongtu Zhu, Yun Li, Shujie Liu, Yan Lu, Huaxiu Yao, 2025, ArXiv)
- Mediator-Guided Multi-Agent Collaboration among Open-Source Models for Medical Decision-Making(Kaitao Chen, Mianxin Liu, Daoming Zong, Chaoyue Ding, Shaohao Rui, Yankai Jiang, Mu Zhou, Xiaosong Wang, 2025, ArXiv)
- Medical AI Consensus: A Multi-Agent Framework for Radiology Report Generation and Evaluation(A. T. Elboardy, Ghada Khoriba, Essam A. Rashed, 2025, ArXiv)
- NimbleLabs: Accelerating Healthcare AI Development Through Agentic AI(Soorya Ram Shimgekar, Abhay Goyal, Shayan Vassef, Koustuv Saha, C. Poellabauer, Xavier Vautier, Pi Zonooz, Navin Kumar, 2025, 2025 IEEE International Conference on Big Data (BigData))
- Streamlining medical software development with CARE lifecycle and CARE agent: an AI-driven technology readiness level assessment tool(Steven N. Hart, P. Day, Christopher A. Garcia, 2025, BMC Medical Informatics and Decision Making)
Scientific Discovery, Industrial Automation, and Energy Management
Demonstrates how agents accelerate scientific research (e.g., materials genomics and EDA optimization), wind-energy optimization, building energy management systems (BEMS), and human-robot collaboration in industrial assembly.
- PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration(Yingming Pu, Tao Lin, Hongyu Chen, 2025, ArXiv)
- AGAPI-Agents: An Open-Access Agentic AI Platform for Accelerated Materials Design on AtomGPT.org(Jaehyung Lee, J. Ely, Ke Zhang, A. Ajith, Charles Rhys Campbell, Kamal Choudhary, 2025, ArXiv)
- Multi-AI Agent Framework Reveals the "Oxide Gatekeeper" in Aluminum Nanoparticle Oxidation(Yiming Lu, Tingyu Lu, Di Zhang, Lili Ye, Hao Li, 2025, ArXiv)
- Accelerating Scientific Discovery with Autonomous Goal-evolving Agents(Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, J. Rittig, Kunyang Sun, Yikun Zhang, Zhangde Song, Bo Zhou, Cassandra Masschelein, Yingze Wang, Haorui Wang, Haojun Jia, Chao Zhang, Hongyu Zhao, Martin Ester, T. Head-Gordon, Carla P. Gomes, Huan Sun, Chenru Duan, Philippe Schwaller, Wengong Jin, 2025, ArXiv)
- Agentic AI in Wind Energy Systems: Multi-Agent Architectures for Optimization and Resilience(Agus Hasan, Ege Kandemir, Dmitrij Mordasov, Dong Trong Nguyen, 2026, IEEE Access)
- Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling(Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl, 2025, ArXiv)
- Hybrid Intelligence for Collaborative Assembly: Integrating Human and Robotic Agents via Shared Perception-Cognition-Action Cycles(Sebastian Feldmann, Steffen Schwarzer, M. Merkel, H. Saadati, Lorena Lenz, Jakob Gros, Cedric Kiener, Anusha Arulalan, Matthias Wiedenmann, Doris Aschenbrenner, 2025, 2025 International Conference on Sustainable Technologies for Humanity and Smart World (HSWTech))
- Autonomous Waste Classification Using Multi-Agent Systems and Blockchain: A Low-Cost Intelligent Approach(Sergio García González, David Cruz García, R. Perez, Arturo Álvarez Sánchez, G. V. González, 2025, Sensors (Basel, Switzerland))
Enterprise Management, Social Simulation, and Multimodal Interactive Services
Covers agent applications in CRM, financial analysis, educational tutoring, GUI automation, social-behavior simulation (e.g., elections and disaster evacuation), and multimodal content generation.
- Autonomous CRM agents: Architecting intelligent assistants for scalable, human-like customer engagement(Vikas Reddy Penubelli, 2025, Global Journal of Engineering and Technology Advances)
- Business Architecture Copilot: Agentic AI-enabled Business Capability Modelling(Eesha Oaj, Asif Gill, M. Bandara, Terry Roach, 2025, No journal)
- LLM and Agent-Driven Data Analysis: A Systematic Approach for Enterprise Applications and System-level Deployment(Xi Wang, Xianyao Ling, Kun Li, Gang Yin, Liang Zhang, Jiang Wu, Annie Wang, Weizheng Wang, 2025, ArXiv)
- The Evolution of Financial Analysis: From Manual Methods to AI and AI Agents(Z. Yordanova, Y. Hristozov, 2025, ECONOMICS)
- Evolution of AI in Education: Agentic Workflows(Firuz Kamalov, D. S. Calonge, Linda Smail, Dilshod Azizov, D. Thadani, Theresa Kwong, Amara Atif, 2025, ArXiv)
- FlockVote: LLM-Empowered Agent-Based Modeling for Simulating U.S. Presidential Elections(Lingfeng Zhou, Yi Xu, Zhenyu Wang, Dequan Wang, 2025, ArXiv)
- Validating Generative Agent-Based Models of Social Norm Enforcement: From Replication to Novel Predictions(L. Cross, Nick Haber, Daniel L. K. Yamins, 2025, ArXiv)
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning(Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Bo Li, Chen Dun, Chong Liu, Fuxing Leng, Han-rui Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Pei-Xiong Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiao-Jun Qin, Yi Lin, Youbing Wu, Zehui Chen, Zihao Wang, B. Zhong, Xinchun Zhang, Xujing Li, YuanFang Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Hao Zhou, Jinlin Pang, Li Han, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Taoran Lu, Woyu Lin, Xiao Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zheng-Wang Jiang, Zili Li, Zi-Hao Zhao, Chenxi Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Jun-Yan Du, Kai Cai, Kuanye Li, Lichen Yuan, Mei Han, Minchao Wang, Shuyu Guo, Tianhao Cheng, Xiaobo Ma, Xiao Xiao, Xiaolong Huang, Xinjie Chen, Yi-Zhen Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhen Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Haolin Chen, Haolin Chen, Jian Chen, Qinghao Zhao, Guang Shi, 2025, ArXiv)
- GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent(Bin Xie, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Jie Liu, Min Zhang, Liqiang Nie, 2025, No journal)
- MIRA: Multimodal Iterative Reasoning Agent for Image Editing(Ziyun Zeng, Hang Hua, Jiebo Luo, 2025, ArXiv)
- Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement(Chao Hao, Shuai Wang, Kaiwen Zhou, 2025, ArXiv)
- Agentic Retoucher for Text-To-Image Generation(Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai, 2026, ArXiv)
Network Infrastructure, Edge Computing, and Communication Optimization
Focuses on autonomous management and optimization by agents in 6G networks, Open RAN, edge resource scheduling, semantic communication, and data-pipeline orchestration.
- AgentRAN: An Agentic AI Architecture for Autonomous Control of Open 6G Networks(Maxime Elkael, Salvatore D’oro, Leonardo Bonati, Michele Polese, Yunseong Lee, Koichiro Furueda, Tommaso Melodia, 2025, ArXiv)
- LLM Enabled Multi-Agent System for 6G Networks: Framework and Method of Dual-Loop Edge-Terminal Collaboration(Zheyan Qu, Wenbo Wang, Zitong Yu, Boquan Sun, Yang Li, Xing Zhang, 2025, ArXiv)
- AI agent for autonomous optical networks: architectures, technologies, and prospects [Invited Tutorial](Yihao Zhang, Qizhi Qiu, Xiaomin Liu, Xiao Yu, Dianxuan Fu, Xingyu Liu, Zihang Wang, Hao Lin, Yuli Chen, L. Yi, Weisheng Hu, Q. Zhuge, 2026, Journal of Optical Communications and Networking)
- Intelligent Edge Resource Provisioning for Scalable Digital Twins of Autonomous Vehicles(M. S. Shahriar, S. Subramaniam, M. Matsuura, Hiroshi Hasegawa, Shih-Chun Lin, 2025, ArXiv)
- Multi-Agent Collaboration for Vehicular Task Offloading Using Federated Deep Reinforcement Learning(Xing Chen, Bohuai Xiao, Xinyu Lin, Zheyi Chen, Geyong Min, 2025, IEEE Transactions on Mobile Computing)
- Semantic-Driven AI Agent Communications: Challenges and Solutions(Kaiwen Yu, Mengying Sun, Zhijin Qin, Xiaodong Xu, Ping Yang, Yue Xiao, Gang Wu, 2025, ArXiv)
- Agentic AI-Enhanced Semantic Communications: Foundations, Architecture, and Applications(Haixiao Gao, Mengying Sun, Ruichen Zhang, Yanhan Wang, Xiaodong Xu, Nan Ma, Dusit Niyato, Ping Zhang, 2025, ArXiv)
- AI-Driven Decentralized Network Management: Leveraging Multi-Agent Large Language Models for Scalable Optimization(Hoon Lee, Mintae Kim, Seunghwan Baek, Wentao Zhou, M. Debbah, Inkyu Lee, 2025, IEEE Communications Magazine)
- Multi-Agent Orchestration for Autonomous Data Pipelines: A Systems Architecture for Self-Healing, Context-Aware, and Resilient Data Processing(Sonika Darshan, 2026, International Journal of Emerging Research in Engineering and Technology)
Agent Evolution Theory, ROI, and Evaluation Frameworks
Takes a macro view of the agent evolution trajectory, including agency-theory reframing, task-complexity analysis, agentic-ROI evaluation, and the proposed shift from LLMs toward Artificial Discursive Agents (ADA).
- On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems(Bohan Tang, Huidong Liang, Keyue Jiang, Xiaowen Dong, 2025, ArXiv)
- The Real Barrier to LLM Agent Usability is Agentic ROI(Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, Weinan Zhang, 2025, ArXiv)
- An organizational theory for multi-agent interactions integrating human agents, LLMs, and specialized AI(Uwe M. Borghoff, Paolo Bottoni, R. Pareschi, 2025, Discover Computing)
- When AI Becomes an Agent of the Firm: Examining the Evolution of AI in Organizations Through an Agency Theory Lens(Beth K. Humberd, Scott F. Latham, 2025, Journal of Management Studies)
- Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?(Amar Lakel, 2025, ArXiv)
- NetIQ: A Tri-Modular Dynamic Evaluation Framework for Intelligent Agent Networks(Haimo Xiao, Jie Zhan, 2025, 2025 2nd International Conference on Electronic Engineering and Information Systems (EEISS))
- Toward the Internet of Agentic AI: Protocols, Architecture, and Challenges(Yuzheng Ren, Jiahui Yang, Haijun Zhang, F. R. Yu, Yifu Xiao, Xueyan Cao, Chunlei Sun, Ying He, 2026, IEEE Communications Magazine)
- The Agent Perspective In LLM-Based Strategic Information Retrieval Ecosystems(Tommy Mordo, 2025, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)
This report comprehensively surveys the evolution of agents from single models into "autonomous entities" with a high degree of autonomy, collaborative capability, and environmental adaptability. The research landscape spans foundational innovations in cognitive architecture (e.g., world models and the Model Context Protocol, MCP) and the optimization of multi-agent collaboration protocols; it extends from automation of the digital world (software engineering, network operations) to deep integration with the physical world (embodied intelligence, autonomous driving). As agents move into enterprise applications, security governance, trustworthiness evaluation, and return-on-investment (ROI) analysis have become the core research frontier, together outlining the trajectory of agents as the central engine of interaction between the digital and physical worlds.
A total of 263 related publications.
Multi-AI agent integration in cloud computing represents a transformative advancement in distributed artificial intelligence, where interconnected intelligent agents collaborate to solve complex problems. This technological evolution enables sophisticated task distribution, parallel processing, and dynamic resource allocation through coordinated agent networks. The architecture supports both autonomous operation and collaborative decision-making, implementing advanced protocols for inter-agent communication and system optimization. These systems demonstrate remarkable capabilities across various industries, from financial services to healthcare and manufacturing, revolutionizing traditional approaches to data processing, decision support, and process automation. The integration of cognitive architectures and security frameworks further enhances system capabilities, enabling human-like reasoning patterns and robust protection mechanisms while maintaining operational efficiency.
Large language model (LLM)-based agents are increasingly pivotal in simulating and understanding complex human systems and interactions. We propose the AI-Agent School (AAS) system, built around a self-evolving mechanism that leverages agents to simulate complex educational dynamics. Addressing the fragmentation of teaching-process modeling and the limitations of agent performance in simulating diverse educational participants, AAS constructs the Zero-Exp strategy and employs a continuous "experience-reflection-optimization" cycle, grounded in a dual memory base comprising experience and knowledge bases and incorporating short-term and long-term memory components. Through this mechanism, agents autonomously evolve via situated interactions within diverse simulated school scenarios. This evolution enables agents to more accurately model the nuanced, multi-faceted teacher-student engagements and underlying learning processes found in physical schools. Experiments confirm that AAS can effectively simulate intricate educational dynamics and foster advanced agent cognitive abilities, providing a foundational stepping stone from the "Era of Experience" to the "Era of Simulation" by generating high-fidelity behavioral and interaction data.
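The dual memory base described in the AAS abstract can be sketched as a minimal data structure. The class and method names below are illustrative assumptions for exposition, not the paper's actual implementation:

```python
from collections import deque

class DualMemory:
    """Illustrative sketch of a dual memory base with short- and long-term stores."""
    def __init__(self, short_term_capacity=5):
        # Short-term memory: a bounded window of recent interactions.
        self.short_term = deque(maxlen=short_term_capacity)
        # Long-term stores: distilled experience and curated knowledge.
        self.experience_base = []
        self.knowledge_base = []

    def observe(self, interaction):
        """Record a raw interaction in short-term memory."""
        self.short_term.append(interaction)

    def reflect(self):
        """Distill the short-term window into a long-term experience entry."""
        if not self.short_term:
            return None
        summary = {"episodes": list(self.short_term)}
        self.experience_base.append(summary)
        self.short_term.clear()
        return summary

mem = DualMemory(short_term_capacity=3)
for step in ["greet class", "pose question", "give feedback"]:
    mem.observe(step)
summary = mem.reflect()
```

In this sketch, `reflect()` plays the role of the "experience-reflection-optimization" step: raw episodes are consolidated into the experience base and the short-term window is cleared for the next scenario.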
The rapid proliferation of scientific knowledge presents a grand challenge: transforming this vast repository of information into an active engine for discovery, especially in high-stakes domains like healthcare. Current AI agents, however, are constrained by static, predefined strategies, limiting their ability to navigate the complex, evolving ecosystem of scientific research. This paper introduces HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its high-level problem-solving policies by distilling procedural successes and failures into a durable, structured knowledge base, enabling it to learn not just how to use tools, but how to strategize. To anchor our research and provide a community resource, we introduce EHRFlowBench, a new benchmark featuring complex health data analysis tasks systematically derived from peer-reviewed scientific literature. Our experiments demonstrate that HealthFlow's self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work offers a new paradigm for intelligent systems that can learn to operationalize the procedural knowledge embedded in scientific content, marking a critical step toward more autonomous and effective AI for healthcare scientific discovery.
Mobile metaverses are envisioned as a transformative digital ecosystem that delivers immersive, intelligent, and ubiquitous services through mobile devices. Driven by Large Language Models (LLMs) and Vision-Language Models (VLMs), Artificial Intelligence (AI) agents hold the potential to empower the creation, maintenance, and evolution of mobile metaverses, enabling seamless human-machine interaction and dynamic service adaptation. Currently, AI agents are primarily built upon cloud-based LLMs and VLMs. However, several challenges hinder their efficient deployment, including high service latency and a risk of sensitive data leakage during perception and processing. In this paper, we develop an edge-cloud collaboration-based federated AI agent construction framework in mobile metaverses. Specifically, Edge Servers (ESs), as agent infrastructures, first create agent modules in a distributed manner. The cloud server then integrates these modules into AI agents and deploys them at the edge, thereby enabling low-latency AI agent services for users. Considering that ESs may exhibit dynamic levels of willingness to participate in federated AI agent construction, we design a two-period dynamic contract model to continuously incentivize ESs to participate in agent module creation, effectively addressing the dynamic information asymmetry between the cloud server and ESs. Furthermore, we propose an Enhanced Diffusion Model-based Soft Actor-Critic (EDMSAC) algorithm to effectively generate optimal dynamic contracts. In the algorithm, we apply dynamic structured pruning to DM-based actor networks to enhance denoising efficiency and policy learning performance. Simulation results demonstrate that the EDMSAC algorithm outperforms the DMSAC algorithm by up to 23% in optimal dynamic contract generation.
With the rapid growth of intelligent services, communication targets are shifting from humans to artificial intelligence (AI) agents, which require new paradigms to enable real-time perception, decision-making, and collaboration. Semantic communication, which conveys task-relevant meaning rather than raw data, offers a promising solution. However, its practical deployment remains constrained by dynamic environments and limited resources. To address these issues, this article proposes a semantic-driven AI agent communication framework and develops three enabling techniques. First, semantic adaptation transmission applies fine-tuning with real or generative samples to efficiently adapt models to varying environments. Second, semantic lightweight transmission incorporates pruning, quantization, and perception-aware sampling to reduce model complexity and alleviate computational burden on edge agents. Third, semantic self-evolution control employs distributed hierarchical decision-making to optimize multi-dimensional resources, enabling robust multi-agent collaboration in dynamic environments. Simulation results show that the proposed solutions achieve faster convergence and stronger robustness, while the proposed distributed hierarchical optimization method significantly outperforms conventional decision-making schemes, highlighting its potential for AI agent communication networks.
Current personalized recommender systems predominantly rely on static offline data for algorithm design and evaluation, significantly limiting their ability to capture long-term user preference evolution and social influence dynamics in real-world scenarios. To address this fundamental challenge, we propose a high-fidelity social simulation platform integrating human-like cognitive agents and dynamic social interactions to realistically simulate user behavior evolution under recommendation interventions. Specifically, the system comprises a population of Sim-User Agents, each equipped with a five-layer cognitive architecture that encapsulates key psychological mechanisms, including episodic memory, affective state transitions, adaptive preference learning, and dynamic trust-risk assessments. In particular, we innovatively introduce the Intimacy-Curiosity-Reciprocity-Risk (ICR2) motivational engine grounded in psychological and sociological theories, enabling more realistic user decision-making processes. Furthermore, we construct a multilayer heterogeneous social graph (GGBond Graph) supporting dynamic relational evolution, effectively modeling users' evolving social ties and trust dynamics based on interest similarity, personality alignment, and structural homophily. During system operation, agents autonomously respond to recommendations generated by typical recommender algorithms (e.g., Matrix Factorization, MultVAE, LightGCN), deciding whether to consume, rate, and share content while dynamically updating their internal states and social connections, thereby forming a stable, multi-round feedback loop. This innovative design transcends the limitations of traditional static datasets, providing a controlled, observable environment for evaluating long-term recommender effects.
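As a minimal sketch, an ICR2-style motivational engine might combine its four drives into a single acceptance score. The weights, threshold, and function names below are hypothetical illustrations, not taken from the paper:

```python
def icr2_score(intimacy, curiosity, reciprocity, risk,
               weights=(0.3, 0.3, 0.2, 0.2)):
    """Combine the four ICR2 drives into one motivation score in [0, 1].

    Intimacy, curiosity, and reciprocity act as positive drives; risk acts
    as a negative drive, so it enters with a minus sign. Inputs in [0, 1].
    """
    w_i, w_c, w_r, w_k = weights
    raw = w_i * intimacy + w_c * curiosity + w_r * reciprocity - w_k * risk
    return max(0.0, min(1.0, raw))

def accepts_recommendation(agent_state, threshold=0.35):
    """Hypothetical decision rule: consume the item if motivation clears the bar."""
    return icr2_score(**agent_state) >= threshold

state = {"intimacy": 0.8, "curiosity": 0.6, "reciprocity": 0.5, "risk": 0.2}
decision = accepts_recommendation(state)
```

A real Sim-User Agent would update these drives after each recommendation round (e.g., raising intimacy after positive social exchanges), closing the multi-round feedback loop the abstract describes.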
Sintering processes play a critical role in materials manufacturing; however, their optimization remains highly dependent on empirical knowledge, fragmented datasets, and costly experimental trials. Existing modeling and machine learning approaches often lack a unified structure for representing complex relationships among processing parameters, microstructural evolution, and final material properties. This perspective article argues that knowledge graphs can serve as a missing semantic layer for organizing sintering-related data, enabling structured representation of process–property relationships across heterogeneous databases. Furthermore, the integration of autonomous AI agents equipped with memory-augmented learning models is proposed as a promising direction for continuously constructing, updating, and reasoning over such knowledge graphs. By combining structured knowledge representation with adaptive learning and agent-based optimization, this framework has the potential to transform sintering research into a self-improving, data-driven ecosystem. This perspective highlights future research directions toward intelligent, explainable, and autonomous sintering systems for advanced materials engineering.
With the rapid construction of the new power system, the heterogeneity of market participants, dynamic policies, and uncertainties of renewable energy pose unprecedented challenges to simulation models. Traditional approaches (mathematical optimization, game theory, and rule-based ABM) suffer from rigid decision-making, inadequate policy interpretation, and weak multi-modal data fusion. This study proposes an LLM-driven multi-agent simulation framework for electricity markets, featuring a three-layer agent architecture ("Memory-Reasoning-Action"): Retrieval-Augmented Generation (RAG) integrates historical data and real-time context to enable dynamic strategy evolution; a hierarchical market clearing mechanism (SCUC + SCED) supports multi-time-scale market operations; multi-modal data preprocessing and model calibration ensure consistency with physical constraints. Experiments using 2022–2024 data from a provincial electricity market demonstrate superior performance in high-renewable-penetration, extreme-weather, and policy-shock scenarios: electricity price prediction error (RMSE) reaches $5.2/MWh (72% lower than traditional models); the constraint violation rate drops to 3.7% (a 62% reduction); policy response latency is shortened to within 1 hour. This framework breaks through the decision-making and interpretability bottlenecks of traditional models, providing an intelligent simulation paradigm for the market-oriented operation of the new power system. Future work will extend to cross-regional markets, electricity-carbon-green certificate synergy, and real-time decision support.
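The three-layer "Memory-Reasoning-Action" architecture can be sketched abstractly as a loop over those layers. The retrieval and bidding logic below are placeholder assumptions standing in for the RAG and LLM components:

```python
class MarketAgent:
    """Illustrative Memory-Reasoning-Action loop for one market participant."""
    def __init__(self, history):
        self.memory = list(history)  # memory layer: past clearing prices

    def retrieve(self, k=3):
        # Memory layer: stand-in for RAG retrieval of the k most recent records.
        return self.memory[-k:]

    def reason(self, context):
        # Reasoning layer: stand-in for an LLM; bid near the recent mean price.
        return sum(context) / len(context)

    def act(self, observed_price):
        # Action layer: submit a bid, then write the new observation to memory.
        bid = self.reason(self.retrieve())
        self.memory.append(observed_price)
        return bid

agent = MarketAgent(history=[50.0, 60.0, 70.0])
bid = agent.act(observed_price=65.0)
```

The design point is the feedback path: each action's outcome is written back into memory, so the next retrieval, and hence the next strategy, reflects it; this is what enables the "dynamic strategy evolution" the abstract claims.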
With the rapid advancement of AI technologies, AI agents have undergone remarkable evolution. Their application in decomposing complex problems and enhancing automated problem-solving capabilities has demonstrated growing potential across multiple domains. As a critical component of the educational system, graduate education stands to benefit significantly from the integration of AI agents, which effectively bridge large language models (LLMs) with pedagogical processes. This integration enables competency development to be innovatively augmented at varying granularities. Centered on a competency-driven framework, this paper explores the implementation modalities of AI agents in graduate education, analyzes their roles in curriculum design and mentoring methodologies, and discusses associated risks and challenges throughout the cultivation process.
Aluminum nanoparticles (ANPs) are among the most energy-dense solid fuels, yet the atomic mechanisms governing their transition from passivated particles to explosive reactants remain elusive. This stems from a fundamental computational bottleneck: ab initio methods offer quantum accuracy but are restricted to small spatiotemporal scales (<500 atoms, picoseconds), while empirical force fields lack the reactive fidelity required for complex combustion environments. Herein, we bridge this gap by employing a "human-in-the-loop" closed-loop framework where self-auditing AI Agents validate the evolution of a machine learning potential (MLP). By acting as scientific sentinels that visualize hidden model artifacts for human decision-making, this collaborative cycle ensures quantum mechanical accuracy while exhibiting near-linear scalability to million-atom systems and accessing nanosecond timescales (energy RMSE: 1.2 meV/atom, force RMSE: 0.126 eV/Angstrom). Strikingly, our simulations reveal a temperature-regulated dual-mode oxidation mechanism: at moderate temperatures, the oxide shell acts as a dynamic "gatekeeper," regulating oxidation through a "breathing mode" of transient nanochannels; above a critical threshold, a "rupture mode" unleashes catastrophic shell failure and explosive combustion. Importantly, we resolve a decades-old controversy by demonstrating that aluminum cation outward diffusion, rather than oxygen transport, dominates mass transfer across all temperature regimes, with diffusion coefficients consistently exceeding those of oxygen by 2-3 orders of magnitude. These discoveries establish a unified atomic-scale framework for energetic nanomaterial design, enabling the precision engineering of ignition sensitivity and energy release rates through intelligent computational design.
Our work begins with the premise that the integration of artificial intelligence (AI) into firm decision making parallels the emergence of the professional manager, which prompted the birth of agency theory. We examine the evolution of AI through an agency theory lens, considering how the nature of firm control and decision rights change as AI evolves. While AI will initially mimic human routines, we theorize a point at which the AI system will achieve a level of autonomy and self-determination to be considered an agent of the firm. How, then, can we align an agent with the fate of the firm, when that agent is no longer a human? To address this, we integrate agency theory with a model of AI evolution, demonstrating that conflict between an AI agent and the firm will require a reconsideration of agency mechanisms if agent-principal alignment is to be achieved. We theorize specific forms of monitoring and incentive alignment that serve to align an AI agent with the firm's interests, thus extending agency mechanisms in the context of AI. Our theoretical exercise offers important implications for scholarship at the intersection of AI and organizational theory, as well as considerations for practice and policy.
Multi-agent systems (MAS) represent a transformative advancement in artificial intelligence, fundamentally changing how complex tasks are managed across distributed environments. These systems demonstrate exceptional capabilities in coordinating autonomous agents for sophisticated problem-solving across various domains. From enterprise operations to smart infrastructure management, MAS implementations have revolutionized traditional approaches through enhanced coordination mechanisms and adaptive learning capabilities. The integration of machine learning and advanced communication protocols has enabled unprecedented levels of system flexibility and resilience. In industrial applications, these systems have transformed manufacturing processes, supply chain operations, and resource management through intelligent automation and real-time optimization. Looking forward, emerging trends in self-organizing systems, ethical decision frameworks, and collective learning mechanisms suggest even greater potential for advancement. The continuous evolution of MAS technology promises to further enhance distributed intelligence capabilities while addressing critical challenges in security, scalability, and system adaptation.
This paper examines the evolution, architecture, and practical applications of AI agents from their early, rule-based incarnations to modern sophisticated systems that integrate large language models with dedicated modules for perception, planning, and tool use. Emphasizing both theoretical foundations and real-world deployments, the paper reviews key agent paradigms, discusses limitations of current evaluation benchmarks, and proposes a holistic evaluation framework that balances task effectiveness, efficiency, robustness, and safety. Applications across enterprise, personal assistance, and specialized domains are analyzed, with insights into future research directions for more resilient and adaptive AI agent systems.
The growing demand for high-bandwidth, zero-trouble services is imposing unprecedented challenges on optical communication networks. Traditional human-centric network management approaches are increasingly inadequate for addressing the complexity, scalability, and reliability requirements of modern optical networks. This tutorial provides a comprehensive overview of the evolution toward autonomous optical networks (AONs), where large language model (LLM)-based artificial intelligence (AI) agents are utilized. We systematically introduce the fundamental concepts and architectural frameworks for AI agent-enabled AONs. Key agentic technologies are examined, including domain adaptation strategies for LLMs, advanced prompting techniques, and the construction of agentic AI systems. Furthermore, we analyze the toolsets that support the operational effectiveness of AI agents in AONs. The monitoring and analytics toolset provides accurate awareness of the network state and predicts future changes. The digital twin (DT) construction toolset enables high-fidelity modeling of optical networks. The intelligent management and control toolset is employed for service provisioning, failure management, and continuous network optimization. By integrating these agentic technologies and toolsets, AI agents can deliver end-to-end autonomous network lifecycle management. Key challenges remain in areas such as reliability, proper utilization of the LLM reasoning capabilities, and cost-effectiveness.
This paper addresses the growing sustainability fatigue in advanced economies. By analyzing Amazon’s artificial intelligence (AI) agent strategy as a model for “Rational Sustainability”, the study identifies a self-propagating growth trajectory that reconciles economic rationality with value creation. It provides a theoretical and empirical framework to overcome technological saturation and strategic homogenization in the generative AI era. To ensure methodological transparency, the analysis was conducted through two distinct stages: (i) Techno-econometric analysis (macro-level): Using an empirical dataset of 160 countries (40 advanced, 70 emerging, and 50 developing) from 2014 to 2024, the study utilized regression models to quantify the correlations and elasticities between three key proxies: GDP per capita (Y); the Human Capital Index (HCI), representing Institutional Capacity Building (ICB); and the E-Government Development Index (EGI), representing Endogenous Institutional Evolution (EIE). (ii) Hybrid AI analysis (case study): Utilizing process-tracing research, the paper examines Amazon’s R&D structure and AI agent strategy. This qualitative and structural analysis identifies how Amazon co-evolves EIE and ICB to conceptualize tacit knowledge and operationalize it into a competitive advantage. The findings reveal a marked disruption of the co-evolutionary mechanism in advanced economies, where the elasticity of EGI to GDP has declined since 2019, leading to a withdrawal state. In contrast, Amazon’s model demonstrates that the co-evolution of EIE and ICB creates a self-propagating growth engine. This research concludes that “Rational Sustainability”—grounded in evidence, economic rationality, and clear trade-offs—offers a viable pathway for revitalizing sustainability strategies in mature digital economies.
We present RoboPhD, a system where AI agents autonomously conduct research to improve Text-to-SQL performance. RoboPhD implements a closed-loop evolution cycle with two coordinated components: a SQL Generation agent composed of a database analysis script and SQL generation instructions, and an Evolution agent that designs new versions based on performance feedback. Central to the framework is an ELO-based selection mechanism enabling survival-of-the-fittest dynamics while handling non-transitivity in performance. Starting from a naive 70-line baseline, RoboPhD evolves agents through iterative cross-pollination, discovering effective techniques without any external guidance on the Text-to-SQL domain. Our best agent, evolved to 1500 lines over 18 iterations, autonomously discovered strategies such as size-adaptive database analysis that adjusts depth based on schema complexity and SQL generation patterns for column selection, evidence interpretation, and aggregation. Evolution provides the largest gains on cheaper models: while we improve by 2.3 points over a strong Claude Opus 4.5 naive baseline, we show an improvement of 8.9 points over the weaker Claude Haiku model. This enables "skip a tier" deployment: evolved Haiku exceeds naive Sonnet accuracy, and evolved Sonnet exceeds naive Opus, both at lower cost. The full system achieves 73.67% accuracy on the BIRD test set, demonstrating that AI can autonomously build a strong agentic system with only a trivial human-provided starting point.
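The ELO-based selection mechanism presumably follows the standard ELO update; a minimal sketch, assuming the conventional 400-point scale and a K-factor of 32 (the paper's actual parameters are not stated in the abstract):

```python
def elo_expected(rating_a, rating_b):
    """Expected score of agent A against agent B under the standard ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings after one head-to-head comparison.

    score_a is 1.0 if agent A's SQL output wins, 0.5 for a tie, 0.0 for a loss.
    The update is zero-sum: whatever A gains, B loses.
    """
    exp_a = elo_expected(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two evolved agent variants start at a default 1500 rating; A wins one round.
a, b = elo_update(1500.0, 1500.0, score_a=1.0)
```

Because each update only depends on pairwise outcomes, ratings remain meaningful even when agent A beats B, B beats C, and C beats A, which is the non-transitivity the abstract highlights.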
The primary goal of this study is to analyze agentic workflows in education according to the proposed four major technological paradigms: reflection, planning, tool use, and multi-agent collaboration. We critically examine the role of AI agents in education through these key design paradigms, exploring their advantages, applications, and challenges. Second, to illustrate the practical potential of agentic systems, we present a proof-of-concept application: a multi-agent framework for automated essay scoring. Preliminary results suggest this agentic approach may offer improved consistency compared to stand-alone LLMs. Our findings highlight the transformative potential of AI agents in educational settings while underscoring the need for further research into their interpretability and trustworthiness.
The growth of Large Language Model (LLM) technology has raised expectations for automated coding. However, software engineering is more than coding and is concerned with activities including maintenance and evolution of a project. In this context, the concept of LLM agents has gained traction, which utilize LLMs as reasoning engines to invoke external tools autonomously. But is an LLM agent the same as an AI software engineer? In this paper, we seek to understand this question by developing a Unified Software Engineering agent or USEagent. Unlike existing work which builds specialized agents for specific software tasks such as testing, debugging, and repair, our goal is to build a unified agent which can orchestrate and handle multiple capabilities. This gives the agent the promise of handling complex scenarios in software development such as fixing an incomplete patch, adding new features, or taking over code written by others. We envision USEagent as the first draft of a future AI Software Engineer which can be a team member in future software development teams involving both AI and humans. To evaluate the efficacy of USEagent, we build a Unified Software Engineering bench (USEbench) comprising of myriad tasks such as coding, testing, and patching. USEbench is a judicious mixture of tasks from existing benchmarks such as SWE-bench, SWT-bench, and REPOCOD. In an evaluation on USEbench consisting of 1,271 repository-level software engineering tasks, USEagent shows improved efficacy compared to existing general agents such as OpenHands CodeActAgent. There exist gaps in the capabilities of USEagent for certain coding tasks, which provides hints on further developing the AI Software Engineer of the future.
Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers' mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose AnnaAgent, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator's configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found on https://github.com/sci-m-wang/AnnaAgent.
Cloud-native organizations increasingly rely on microservices for backend modularity and micro-frontends for scalable user interface delivery. Yet, real-world systems still struggle to evolve these layers coherently under high release velocity, shifting product goals, and variable workloads. This paper presents a unified Agentic AI framework that autonomously coordinates the co-evolution of micro-frontend UIs (implemented in ReactJS and Angular) and microservices. The proposed architecture integrates reinforcement learning for continuous control, large language models for code and configuration synthesis, and a policy-governed multi-agent control plane that executes progressive delivery (feature flags, canary, blue-green) via Kubernetes and service meshes. We formalize decisions using Markov Decision Processes, propose drift detection models for UI-API compatibility, and formulate traffic-shifting optimization for safe rollouts. A mini empirical study across e-commerce, SaaS analytics, and multi-cloud migration scenarios demonstrates reductions in adaptation latency, error rates, and manual intervention relative to strong DevOps baselines. We discuss reliability, explainability, and governance challenges, and lay out future research on hybrid RL-LLM agents, knowledge-graph-aware planning, digital twins, and compliance-aware rewards.
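The traffic-shifting optimization for safe rollouts can be illustrated with a simple progressive-canary rule. The step schedule and error threshold below are hypothetical defaults, not the paper's formulation:

```python
def canary_rollout(error_rates, steps=(0.05, 0.25, 0.5, 1.0), max_error=0.02):
    """Progressively shift traffic to the canary, rolling back on high errors.

    error_rates[i] is the observed canary error rate at promotion step i.
    Returns the final canary traffic weight (0.0 means rolled back).
    """
    weight = 0.0
    for step, err in zip(steps, error_rates):
        if err > max_error:
            return 0.0  # rollback: route all traffic to the stable version
        weight = step   # promotion: shift more traffic to the canary
    return weight

healthy = canary_rollout([0.01, 0.015, 0.01, 0.012])  # promoted to full traffic
faulty = canary_rollout([0.01, 0.05, 0.01, 0.01])     # rolled back at step 2
```

In the paper's framing, an RL policy would choose the step schedule and threshold rather than fixing them; a service mesh (e.g., via weighted routing rules) would then enact the chosen traffic split.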
No abstract available
Purpose: This study examines the transformation of financial decision-making through the adoption of artificial intelligence, focusing on the shift from conventional AI systems to AI agents and agentic AI. It differentiates between automated analytical tools and autonomous, goal-oriented systems that increasingly assume decision-making authority within financial operations. Design/Methodology/Approach: Employing a qualitative multi-method approach—comprising semi-structured expert interviews, industry report synthesis, in-depth case studies, and a comparative performance evaluation—this research investigates AI agent implementation across SMEs, pharmaceutical analytics, and ERP-integrated corporate finance. Theoretically, it extends foundational models including the Efficient Market Hypothesis (EMH), Behavioral Finance, and the Adaptive Markets Hypothesis (AMH) by embedding the dynamic, learning-driven nature of AI agents into financial decision logic. Findings: The results indicate that AI agents introduce novel forms of informational asymmetry, enhance bias mitigation through adaptive modeling, and give rise to emergent decision structures via multi-agent interactions. These dynamics challenge core assumptions of market rationality and static efficiency. Practically, the study offers a structured framework for AI agent integration, emphasizing explainability, hybrid human-AI governance, and risk-specific safeguards to navigate ethical and regulatory constraints. The proposed conceptual taxonomy and cross-industry implementation roadmap reposition agentic AI as a strategic transformation—reshaping how financial institutions process data, execute judgments, and regulate algorithmic autonomy.
The rapid evolution of artificial intelligence (AI) has introduced AI agents as a disruptive paradigm across various industries, yet their application in machine translation (MT) remains underexplored. This paper describes and analyses the potential of single- and multi-agent systems for MT, reflecting on how they could enhance multilingual digital communication. While single-agent systems are well-suited for simpler translation tasks, multi-agent systems, which involve multiple specialized AI agents collaborating in a structured manner, may offer a promising solution for complex scenarios requiring high accuracy, domain-specific knowledge, and contextual awareness. To demonstrate the feasibility of multi-agent workflows in MT, we are conducting a pilot study in legal MT. The study employs a multi-agent system involving four specialized AI agents for (i) translation, (ii) adequacy review, (iii) fluency review, and (iv) final editing. Our findings suggest that multi-agent systems may have the potential to significantly improve domain-adaptability and contextual awareness, with superior translation quality to traditional MT or single-agent systems. This paper also sets the stage for future research into multi-agent applications in MT, integration into professional translation workflows, and shares a demo of the system analyzed in the paper.
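The four-agent legal-MT workflow can be sketched as a sequential pipeline. The agent functions here are trivial stand-ins for the specialized LLM calls the paper describes, with hypothetical names:

```python
def translate(src):
    """(i) Translation agent (stand-in for an LLM call)."""
    return {"text": f"[translated] {src}", "notes": []}

def review_adequacy(draft):
    """(ii) Adequacy review agent: checks meaning transfer (stand-in)."""
    draft["notes"].append("adequacy: ok")
    return draft

def review_fluency(draft):
    """(iii) Fluency review agent: checks target-language quality (stand-in)."""
    draft["notes"].append("fluency: ok")
    return draft

def final_edit(draft):
    """(iv) Final editing agent: produces the deliverable translation."""
    return draft["text"]

def mt_pipeline(src):
    """Run the four specialized agents in sequence on one source segment."""
    draft = translate(src)
    for agent in (review_adequacy, review_fluency):
        draft = agent(draft)
    return final_edit(draft)

out = mt_pipeline("Der Vertrag ist nichtig.")
```

The structural point is that each reviewer annotates a shared draft object rather than rewriting it wholesale, so the final editor sees both the translation and the accumulated review notes.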
The remarkable reasoning abilities of large language models (LLMs) have opened new research opportunities in wireless networks. As demonstrated in [1], pretrained LLMs have been proven to handle various network optimization tasks universally without prior knowledge of systems, such as mathematical models, channel propagation, and scenario-specific fine-tuning processes. This knowledge-free ability promotes LLMs as powerful optimization agents that autonomously determine network management strategies. Such an LLM optimizer technology is still in its early stages and requires significant evolution for real-world implementation. In particular, existing works need centralized operations, which lack the flexibility with distributed devices for wireless networks. To address this challenge, this article presents a multi-agent LLM optimizer (MALO) framework where individual LLM agents make their own decisions for different wireless nodes in a decentralized manner. The effectiveness of the MALO approach is verified in decentralized wireless resource allocation problems. Numerical results confirm that the proposed decentralized MALO framework outperforms existing centralized LLM optimizer methods and achieves performance comparable to traditional optimization algorithms.
Agentic AI is emerging, capable of executing tasks through natural language, such as Copilot for coding or Amazon Rufus for shopping. Evaluating these systems is challenging, as their rapid evolution outpaces traditional human evaluation. Researchers have proposed LLM Agents to simulate participants as digital twins, but it remains unclear to what extent a digital twin can represent a specific customer in multi-turn interaction with an agentic AI system. In this paper, we recruited 40 human participants to shop with Amazon Rufus, collected their personas, interaction traces, and UX feedback, and then created digital twins to repeat the task. Pairwise comparison of human and digital-twin traces shows that while agents often explored more diverse choices, their action patterns aligned with humans and yielded similar design feedback. This study is the first to quantify how closely LLM agents can mirror human multi-turn interaction with an agentic AI system, highlighting their potential for scalable evaluation.
AI innovation is increasingly focusing on the development and utilization of AI agents: systems that gather information, make decisions and take actions with some degree of autonomy. Agents represent a natural evolution from AI that generates outputs following user commands, such as ChatGPT, to AI that plays a much more active role by completing tasks. For example, in the future, a consumer AI agent might book and pay for a holiday or decide which groceries to order. Tech luminaries believe agents will make AI truly transformative, and tech firms and investors have been investing in AI agents as they look to monetize AI. Given the emerging focus on agents, it will not be long before competition agencies take a closer look to assess any concerns against benefits. This article previews some of the likely debates to come. Many possible issues that could be raised appear analogous to those in other digital markets, such as self-preferencing, pre-installation and defaults, market tipping, and ecosystem effects. But agencies may also raise new or potentially more challenging issues specific to AI agents, ranging from access to hardware and software, conflicts of interest, and the use of personalized data to a lack of transparency. Equally, agents might trigger more seismic changes in the competitive landscape.
While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases. Previous studies have focused on biases inherited from training data, but whether stereotypes can emerge spontaneously in AI agent interactions merits further exploration. Through a novel experimental framework simulating workplace interactions with neutral initial conditions, we investigate the emergence and evolution of stereotypes in LLM-based multi-agent systems. Our findings reveal that (1) LLM-Based AI agents develop stereotype-driven biases in their interactions despite beginning without predefined biases; (2) stereotype effects intensify with increased interaction rounds and decision-making power, particularly after introducing hierarchical structures; (3) these systems exhibit group effects analogous to human social behavior, including halo effects, confirmation bias, and role congruity; and (4) these stereotype patterns manifest consistently across different LLM architectures. Through comprehensive quantitative analysis, these findings suggest that stereotype formation in AI systems may arise as an emergent property of multi-agent interactions, rather than merely from training data biases. Our work underscores the need for future research to explore the underlying mechanisms of this phenomenon and develop strategies to mitigate its ethical impacts.
The contribution of this work includes the realization of a self-learning artificial intelligence agent that can play the popular game Flappy Bird through the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. NEAT is a neuroevolutionary method that simultaneously evolves the architecture and weights of neural networks, enabling the agent to improve its performance through mechanisms of simulated natural selection and genetic variation. This work aims to demonstrate the efficiency of using NEAT to train an autonomous agent to play video games with no predefined strategy, emphasizing the adaptability of the evolved neural networks to the environment's dynamics. The game environment was simulated in Python with Pygame, allowing the population of neural networks to evolve across generations. The experimental results confirm that NEAT efficiently improves the AI agent's gameplay over time, leading to a significant increase in survival time and enhanced scores. Advantages of neuroevolution over traditional reinforcement learning methods are underlined, notably its capability to find complex neural network topologies fitted to the task at hand without human design. The study therefore suggests that NEAT can contribute to ongoing research in adaptive game bots, autonomous systems, and evolving AI architectures. Further work may extend the approach to more complex games and explore hybrid models that integrate neuroevolution with deep learning techniques. Keywords: Neuroevolution, NEAT, Flappy Bird, Self-Learning AI, Neural Network Evolution, Game Playing AI, Adaptive Game Bot, Reinforcement Learning
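The selection-and-mutation loop at the heart of neuroevolution can be sketched in a few lines. This toy version evolves only a single weight on a stand-in fitness function; real NEAT additionally evolves topology via speciation and crossover, which is omitted here.

```python
import random

random.seed(0)

def fitness(genome):
    # Toy stand-in task: maximize -(w - 3)^2, whose optimum is w = 3.
    return -(genome[0] - 3.0) ** 2

def evolve(pop_size=20, generations=30, sigma=0.5):
    # Random initial population of one-gene genomes.
    population = [[random.uniform(-5, 5)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]       # natural selection
        offspring = [[p[0] + random.gauss(0, sigma)]  # genetic variation
                     for p in survivors]
        population = survivors + offspring
    return max(population, key=fitness)

best = evolve()
print(best[0])  # converges toward the optimum at 3.0
```

In the Flappy Bird setting, `fitness` would instead be the survival time or score achieved by a genome's network in the Pygame environment.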
The current evolution of artificial intelligence introduces a paradigm shift toward agentic AI built upon multi-agent systems (MAS). Agent communications serve as a key to effective agent interactions in MAS and thus have a significant impact on the performance of agentic AI applications. Recent research on agent communications has made exciting, rapid progress, leading to a variety of protocol designs, among which the Agent2Agent (A2A) protocol is considered the most representative. Simultaneously, the rise of edge intelligence is expected to enable agentic AI at the network edge. In this paper, we evaluate the capabilities of agent communication technologies to address the challenges of edge computing, using the A2A protocol as a representative case. We first discuss the core functionalities of agent communications, present a landscape of agent communication protocols, and identify the main challenges introduced by edge computing. Then, we conduct a case study on the A2A protocol to examine the key technologies leveraged in the protocol for their effectiveness in meeting the requirements of agent communications in edge computing. Based on the insights obtained from this assessment, we identify open issues in the current agent communication technologies and discuss directions for future research to address these issues.
The rapid evolution of AI-generated images poses unprecedented challenges to information integrity and media authenticity. Existing detection approaches suffer from fundamental limitations: traditional classifiers lack interpretability and fail to generalize across evolving generative models, while vision-language models (VLMs), despite their promise, remain constrained to single-shot analysis and pixel-level reasoning. To address these challenges, we introduce AIFo (Agent-based Image Forensics), a novel training-free framework that emulates human forensic investigation through multi-agent collaboration. Unlike conventional methods, our framework employs a set of forensic tools, including reverse image search, metadata extraction, pre-trained classifiers, and VLM analysis, coordinated by specialized LLM-based agents that collect, synthesize, and reason over cross-source evidence. When evidence is conflicting or insufficient, a structured multi-agent debate mechanism allows agents to exchange arguments and reach a reliable conclusion. Furthermore, we enhance the framework with a memory-augmented reasoning module that learns from historical cases to improve future detection accuracy. Our comprehensive evaluation spans 6,000 images across both controlled laboratory settings and challenging real-world scenarios, including images from modern generative platforms and diverse online sources. AIFo achieves 97.05% accuracy, substantially outperforming traditional classifiers and state-of-the-art VLMs. These results demonstrate that agent-based procedural reasoning offers a new paradigm for more robust, interpretable, and adaptable AI-generated image detection.
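The debate step described above (agents reconciling conflicting tool verdicts) can be illustrated with a confidence-weighted consensus sketch. This is a hypothetical simplification, not AIFo's actual mechanism: the agents are stubbed as (label, confidence) pairs, and the minority-discount rule is an assumption.

```python
from collections import defaultdict

def debate(verdicts, rounds=2):
    """verdicts: list of (label, confidence) pairs from forensic agents.
    Each round, agents holding the minority view discount their
    confidence after hearing the majority's evidence."""
    verdicts = list(verdicts)
    for _ in range(rounds):
        weight = defaultdict(float)
        for label, conf in verdicts:
            weight[label] += conf
        majority = max(weight, key=weight.get)
        # Minority agents concede ground each round.
        verdicts = [(label, conf if label == majority else conf * 0.5)
                    for label, conf in verdicts]
    weight = defaultdict(float)
    for label, conf in verdicts:
        weight[label] += conf
    return max(weight, key=weight.get)

# Reverse image search says "real"; the classifier and VLM lean "ai-generated".
verdict = debate([("real", 0.9), ("ai-generated", 0.6), ("ai-generated", 0.7)])
print(verdict)
```

In a full system, each round would also exchange the textual arguments behind each confidence value, so agents can revise rather than merely discount.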
Research and practice in Intelligent Design (ID) have significantly enhanced engineering innovation, efficiency, quality, and productivity over recent decades, fundamentally reshaping how engineering designers think, behave, and interact with design processes. The recent emergence of Foundation Models (FMs), particularly Large Language Models (LLMs), has demonstrated general knowledge-based reasoning capabilities, and open new avenues for further transformation in engineering design. In this context, this paper introduces Intelligent Design 4.0 (ID 4.0) as an emerging paradigm empowered by foundation model-based agentic AI systems. We review the historical evolution of ID across four distinct stages: rule-based expert systems, task-specific machine learning models, large-scale foundation AI models, and the recent emerging paradigm of foundation model-based multi-agent collaboration. We propose an ontological framework for ID 4.0 and discuss its potential to support end-to-end automation of engineering design processes through coordinated, autonomous multi-agent-based systems. Furthermore, we discuss challenges and opportunities of ID 4.0, including perspectives on data foundations, agent collaboration mechanisms, and the formulation of design problems and objectives. In sum, these insights provide a foundation for advancing Intelligent Design toward greater adaptivity, autonomy, and effectiveness in addressing the growing complexity of engineering design.
Digital transformation forces enterprises to adopt agile Continuous Integration/Continuous Deployment (CI/CD) pipelines for competitiveness. However, traditional tools like Jenkins and GitHub Actions rely on manual configurations, causing errors in code migration, inefficient debugging, and complex multi-cloud deployments. To address these issues, we introduce AutoDevFlow, an innovative LLM-powered multi-agent system that fully automates the software development lifecycle. Through initial task analysis by the Control Layer and division of labor among the four agent teams of the Processing Layer, AutoDevFlow enables accurate, scalable, and adaptive deployments. Experiments show robust error correction, effective cross-language conversion, and automatic deployment, reducing operational burdens and streamlining workflows. Our approach transforms CI/CD practices, enhancing efficiency, delivery speed, and quality.
Developing medical software requires navigating complex regulatory, ethical, and operational challenges. A comprehensive framework that supports both technical maturity and clinical safety is essential for effective artificial intelligence and machine learning system deployment. This paper introduces the Clinical Artificial Intelligence Readiness Evaluator Lifecycle and the Clinical Artificial Intelligence Readiness Evaluator Agent—a framework and AI-driven tool designed to streamline technology readiness level assessments in medical software development. We developed the framework using an iterative process grounded in collaborative stakeholder analysis. Key institutional stakeholders—including clinical informatics experts, data engineers, ethicists, and operational leaders—were engaged to identify and prioritize the regulatory, ethical, and technical requirements unique to clinical AI/ML development. This approach, combined with a thorough review of existing methodologies, informed the creation of a lifecycle model that guides technology maturation from initial concept to full deployment. The AI-driven tool was implemented using a retrieval-augmented generation strategy and evaluated through a synthetic use case (the Diabetes Outcome Predictor). Evaluation metrics included the proportion of correctly addressed assessment questions and the overall time required for automated review, with human adjudication validating the tool’s performance. The findings indicate that the proposed framework effectively captures the complexities of clinical AI development. In the synthetic use case, the AI-driven tool identified that 32.8% of the assessment questions remained unanswered, while human adjudication confirmed discrepancies in 19.4% of these instances. These outcomes suggest that, when fully refined, the automated assessment process can reduce the need for extensive multi-stakeholder involvement, accelerate project timelines, and enhance resource efficiency. 
The Clinical Artificial Intelligence Readiness Evaluator Lifecycle and Agent offer a robust and methodologically sound approach for evaluating the maturity of medical AI systems. By integrating stakeholder-driven insights with an AI-based assessment process, this framework lays the groundwork for more streamlined, secure, and effective clinical AI development. Future work will focus on optimizing retrieval strategies and expanding validation across diverse clinical applications.
This paper presents a framework for automating software development security using a Security as Code approach enhanced with a multi-agent artificial intelligence system. The research addresses the limitations of traditional DevSecOps practices by deploying specialized AI agents to perform static code analysis, generate and enforce security policies, and monitor system behavior. The architecture integrates security throughout the CI/CD pipeline and runtime, enabling autonomous decision-making, adaptability to threats, and reduced developer overhead. Experimental evaluation demonstrates the framework’s effectiveness in early vulnerability detection, consistent policy enforcement, and reduced response time. The proposed solution balances automation speed with human oversight, enhancing the resilience and scalability of secure software development processes. Keywords: Artificial intelligence, DevSecOps, CI/CD, multi-agent system, security automation, Security as Code, security policies.
Modern software engineering is challenged by large, distributed teams maintaining complex systems with changing requirements, multiple toolchains, and very short release intervals. Because the Software Development Lifecycle (SDLC) requires manual coordination, it is prone to mistakes and inefficiencies, resulting in bottlenecks, poor communication, and productivity losses. The paper presents a scalable end-to-end lifecycle automation approach based on a reconfigurable agent-based framework that covers all principal phases: requirements engineering, testing, deployment, production, and maintenance. Its runtime supports extensibility, allowing new tools or agents to be integrated with minimal disturbance to the architecture. The framework was tested against two practical enterprise examples involving a multi-repo DevOps pipeline and CI constraints. Experimental findings show improved task throughput (21-36%), a decrease in cycle time of up to 28%, and increased traceability across workflow stages. The results confirm the framework's scalability, modularity, and practical suitability for complex engineering ecosystems. The work presents a flexible and intelligent automation model to support continuous improvement and phase-wise optimization in modern software engineering.
Cyber-Physical Systems (CPS) integrate distributed software and hardware, requiring systematic engineering approaches. This paper presents a Model-Driven Development (MDD) process that spans the entire CPS lifecycle, from conceptual modeling (CIM) to functional implementation (Code). By leveraging agent orientation, the approach simplifies system structuring and enables semi-automated transformations through a domain-specific language (DSL). A proof of concept validates the process in a greenhouse automation scenario, demonstrating that the generated software functions as expected on real hardware. The results confirm the feasibility of this end-to-end MDD workflow for CPS development.
AI-enabled tools for code generation have drastically changed software development, but security flaws in AI-generated code remain a significant concern. Several new studies show that security vulnerabilities can increase by 37.6% after five rounds of iterative software refinement using AI, with 19-50% of AI-generated code containing security flaws. This paper describes a new multi-agent security framework that integrates security-first principles throughout the Software Development Lifecycle (SDLC). The framework consists of seven specialized AI agents (Threat Modeling, Security Design, Secure Code Generation, Security Testing, CI/CD Security, Runtime Security, and Compliance), each of which handles a unique SDLC phase. The key differentiator is the Secure Code Generation Agent's continuous security gate mechanism, which keeps security on track during coding via confidence scoring and automated safety checkpoints. The framework integrates webhook-based triggers with current development tools such as Jira, GitHub, Jenkins, and SIEM platforms, and uses hybrid enforcement combining rule-based security tools (SAST, DAST, SCA) with LLM-based contextual analysis. The proposed approach targets >90% sensitivity for critical vulnerabilities and >85% specificity to minimize alert fatigue, with holistic metrics on detection accuracy, performance, and operational effectiveness. This solution proactively addresses security at every SDLC stage rather than reactively once deployed, allowing organizations to leverage AI-assisted development while maintaining a robust security posture and regulatory compliance.
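A security gate of the kind the abstract describes can be sketched as a scoring checkpoint: each generated snippet starts with full confidence, rule-based checks subtract penalties, and code below a threshold is blocked before it proceeds. The rules, penalties, and threshold below are illustrative assumptions, not the paper's actual configuration.

```python
import re

# Hypothetical rule set: (pattern, penalty, reason). A real gate would
# delegate to SAST/DAST/SCA tools plus LLM-based contextual analysis.
RULES = [
    (re.compile(r"eval\s*\("), 0.5, "dynamic eval"),
    (re.compile(r"password\s*=\s*['\"]"), 0.4, "hard-coded secret"),
    (re.compile(r"verify\s*=\s*False"), 0.3, "TLS verification disabled"),
]

def security_gate(snippet: str, threshold: float = 0.8):
    """Return (passed, score, findings); block when score < threshold."""
    score, findings = 1.0, []
    for pattern, penalty, reason in RULES:
        if pattern.search(snippet):
            score -= penalty
            findings.append(reason)
    return score >= threshold, round(score, 2), findings

ok, score, findings = security_gate("requests.get(url, verify=False)")
print(ok, score, findings)
```

Wired into a webhook-triggered pipeline, a failing gate would reject the commit and route the findings back to the generation agent for another pass.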
Automating radiology report generation poses a dual challenge: building clinically reliable systems and designing rigorous evaluation protocols. We introduce a multi-agent reinforcement learning framework that serves as both a benchmark and evaluation environment for multimodal clinical reasoning in the radiology ecosystem. The proposed framework integrates large language models (LLMs) and large vision models (LVMs) within a modular architecture composed of ten specialized agents responsible for image analysis, feature extraction, report generation, review, and evaluation. This design enables fine-grained assessment at both the agent level (e.g., detection and segmentation accuracy) and the consensus level (e.g., report quality and clinical relevance). We demonstrate an implementation using chatGPT-4o on public radiology datasets, where LLMs act as evaluators alongside medical radiologist feedback. By aligning evaluation protocols with the LLM development lifecycle, including pretraining, finetuning, alignment, and deployment, the proposed benchmark establishes a path toward trustworthy deviance-based radiology report generation.
Agent-based models (ABMs) are increasingly used in disaster evacuation simulation to capture system-level dynamics. While ABMs are often combined with human behavior models (HBMs), few approaches integrate these with infrastructure and demographic data that are carefully modeled using local knowledge, along with hazard-specific impacts and policy settings. Even fewer embed this integration within a co-creation loop that involves local stakeholders throughout the entire development lifecycle, from conception and design to implementation, testing, and beyond. This paper introduces the methodology that we developed to address this gap by combining a structured co-creation process with technical simulation development. The co-creation process engages local stakeholders, planners, and experts to iteratively shape evacuation scenarios, define assumptions, and validate outcomes, ensuring the model aligns with local realities. These inputs are translated into a multi-dimensional simulation framework built in MATSim, integrating network and infrastructure models, hazard effects, population, and behavior modeling enhanced through Belief-Desire-Intention cognitive architectures. We applied this methodology in different case study areas, demonstrating its capacity to simulate heterogeneous evacuation dynamics and provide diverse performance metrics. Finally, we explore how this methodology can be applied in other hazards, geographic regions, and evacuation scenarios, offering pathways for broader application and future development. Keywords: Agent-Based Models, Human Behavior Models, Co-creation Processes, Disaster Evacuation Simulation, Disaster Preparedness
This paper explores the integration of cognitive agents powered by Large Language Models (LLMs) into software project management within the Scaled Agile Framework (SAFe). We introduce the CogniSim framework, an ecosystem where virtual agents operate in a simulated software environment to fulfill key roles in IT project development. Emphasis is placed on the adaptability of these agents to the Scrum methodology, particularly in decision-making and problem-solving. By combining LLMs with Multi-Agent Systems (MAS), we focus on improvements in project management, development processes, and Agile methodologies. Through simulations and case studies, we demonstrate advancements in task delegation, communication, and project lifecycle management, highlighting the potential of LLM-augmented MAS to manage software projects with increased precision and intelligence. Our findings provide insights into essential components for an effective cognitive multi-agent ecosystem, including Dynamic Context techniques and Theory of Mind for enhanced agent collaboration, laying the groundwork for future research in this field.
The vision of End-User Software Engineering (EUSE) is to empower non-professional users with full control over the software development lifecycle. It aims to enable users to drive generative software development using only natural language requirements. However, since end-users often lack knowledge of software engineering, their requirement descriptions are frequently ambiguous, raising significant challenges to generative software development. Although existing approaches utilize structured languages like Gherkin to clarify user narratives, they still struggle to express the causal logic between preconditions and behavior actions. This paper introduces RequireCEG, a requirement elicitation and self-review agent that embeds causal-effect graphs (CEGs) in a neuro-symbolic collaboration architecture. RequireCEG first uses a feature tree to analyze user narratives hierarchically, clearly defining the scope of software components and their system behavior requirements. Next, it constructs the self-healing CEGs based on the elicited requirements, capturing the causal relationships between atomic preconditions and behavioral actions. Finally, the constructed CEGs are used to review and optimize Gherkin scenarios, ensuring consistency between the generated Gherkin requirements and the system behavior requirements elicited from user narratives. To evaluate our method, we created the RGPair benchmark dataset and conducted extensive experiments. It achieves an 87% coverage rate and raises diversity by 51.88%.
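The consistency check at the core of the approach above can be illustrated with a tiny cause-effect graph: preconditions map to the actions they can trigger, and a Gherkin-style scenario is flagged when its Given/Then pair has no supporting causal edge. This is a speculative sketch of the idea, not RequireCEG's implementation; the graph contents and scenario shape are invented for illustration.

```python
# Cause-effect graph: atomic precondition -> set of behavioral actions
# it can trigger (the "causal logic" a flat Gherkin file cannot express).
ceg = {
    "cart is not empty": {"show checkout button"},
    "user is logged in": {"show checkout button", "show order history"},
}

def review_scenario(given: str, then: str) -> bool:
    """A scenario is consistent iff the CEG has an edge given -> then."""
    return then in ceg.get(given, set())

print(review_scenario("cart is not empty", "show checkout button"))  # True
print(review_scenario("cart is not empty", "show order history"))    # False
```

In the full system, inconsistent scenarios would be routed back for self-review and the graph itself repaired ("self-healing") from the elicited requirements.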
Software testing is critical in the software development lifecycle, yet translating requirements into executable test scripts remains manual and error-prone. While Large Language Models (LLMs) can generate code, they often hallucinate non-existent UI elements. We present the Autonomous QA Agent, a Retrieval-Augmented Generation (RAG) system that grounds Selenium script generation in project-specific documentation and HTML structure. By ingesting diverse formats (Markdown, PDF, HTML) into a vector database, our system retrieves relevant context before generation. Evaluation on 20 e-commerce test scenarios shows our RAG approach achieves 100% (20/20) syntax validity and 90% (18/20, 95% CI: [85%, 95%], p<0.001) execution success, compared to 30% for standard LLM generation. While our evaluation is limited to a single domain, our method significantly reduces hallucinations by grounding generation in actual DOM structure, demonstrating RAG's potential for automated UI testing.
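The grounding step described above can be sketched as retrieve-then-prompt: rank documentation chunks by relevance to the test scenario, then pin the generator to selectors that actually exist in the documented DOM. This is a minimal stand-in, with token-overlap scoring in place of vector search, invented chunks and prompt template, and the LLM call omitted.

```python
def tokenize(text: str) -> set:
    return set(text.lower().split())

def retrieve(query: str, chunks: list) -> str:
    """Toy stand-in for vector-database retrieval: rank documentation
    chunks by bag-of-words overlap with the scenario."""
    q = tokenize(query)
    return max(chunks, key=lambda c: len(q & tokenize(c)))

# Illustrative documentation chunks (in practice ingested from
# Markdown/PDF/HTML plus the page's actual HTML structure).
chunks = [
    "login page username field id='user' password field id='pass'",
    "checkout page submit order button id='place-order'",
]
scenario = "fill the username and password fields on the login page"
context = retrieve(scenario, chunks)
prompt = (f"Write a Selenium script for: {scenario}\n"
          f"Use only elements present in this documentation:\n{context}")
print(context)
```

Because the prompt carries only element IDs that exist in the retrieved chunk, the generator has nothing plausible to hallucinate, which is the mechanism behind the reported drop in invalid selectors.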
The rapid development of Generative Artificial Intelligence (GenAI) has catalyzed a transformative technological revolution across all walks of life. As the backbone of wideband communication, optical networks are expected to achieve high-level autonomous operation and zero-touch management to accommodate their expanding network scales and escalating transmission bandwidth. The integration of GenAI is deemed the pivotal solution for realizing zero-touch optical networks. However, the lifecycle management of optical networks involves a multitude of tasks and necessitates seamless collaboration across multiple layers, which poses significant challenges to existing single-agent GenAI systems. In this article, we propose a GenAI-driven hierarchical multi-agent framework designed to streamline multi-task autonomous execution for zero-touch optical networks. We present the architecture, implementation, and applications of this framework. A field-deployed mesh network is utilized to demonstrate three typical scenarios throughout the optical network lifecycle: quality-of-transmission estimation in the planning stage, dynamic channel adding/dropping in the operation stage, and system capacity increase in the upgrade stage. The case studies illustrate the capabilities of the multi-agent framework in multi-task allocation, coordination, execution, evaluation, and summarization. This work provides a promising approach for the future development of intelligent, efficient, and collaborative network management solutions, paving the way for more specialized and adaptive zero-touch optical networks.
The rapid growth of artificial intelligence (AI) and machine learning (ML) has necessitated the automation of workflows to ensure scalability and efficiency in the development of intelligent applications. This paper proposes an AI agent-driven approach to automate ML workflows on Amazon Web Services (AWS) to facilitate the seamless deployment and management of scalable, intelligent applications. The integration of AI agents in the workflow automation process helps in optimizing tasks such as data preprocessing, model training, hyperparameter tuning, and model deployment. Leveraging AWS services like SageMaker, Lambda, and EC2, the proposed system automates the end-to-end ML lifecycle, significantly reducing the manual effort and time required for model iteration and deployment. We demonstrate the effectiveness of this approach by implementing a use case in a cloud-based environment, showcasing its impact on computational efficiency, resource optimization, and scalability. The results highlight that AI agent-driven automation can achieve faster development cycles, better resource utilization, and improved performance for large-scale intelligent applications.
This study introduces a novel methodology and framework for the verification, validation, and testing (VV&T) of agent-based simulation models: RatKit. Building on repeatable automated testing in ABMS, the present contribution significantly extends that foundation by proposing an integrated metamodel and systematic development methodology that embeds these activities throughout the simulation lifecycle. The RatKit methodology is both general, in that it applies to a wide range of agent-based simulation models through a well-defined metamodel, and comprehensive, in that it addresses the macro-level (societal), meso-level (interaction), and micro-level (agent) aspects of simulations. It also provides a generic infrastructure to support various VV&T techniques. RatKit is designed as a general VV&T framework applicable to any ABM framework; the accompanying implementation is built on the Repast ABM development framework. RatKit is demonstrated through a detailed case study of the Boids model, where the dynamics of alignment, cohesion, and separation are examined. Results from the case study show that a test-driven approach can enhance model reliability and ensure that individual agent behaviors coalesce into realistic emergent phenomena. Experience and feedback obtained during the case studies show that developing ABMs with a test-driven, VV&T-based method facilitates the creation of the desired models.
Large Language Model (LLM)-based autonomous agents provide state-of-the-art performance in various tasks throughout the software development lifecycle. Although applications employing such agents can be implemented in Integrated Development Environments (IDEs), standalone applications are not optimal for human programmers' use. This doctoral project proposes a framework in which a mediator agent acts as the sole medium between the programmer, IDE tools, agentic tools, and external multi-agent systems. The framework aims to simplify tool usage and automate the management of in-IDE tools and other agentic tools that may be implemented in the future. By implementing a unified mediator agent, this project aims to set the stage for human-computer interaction in the era of LLM-based autonomous agents, improve programmer productivity, and advance software engineering automation.
Agentic AI systems that make autonomous decisions and optimize and organize workflows on their own are becoming a disruptive layer in cloud-native software delivery. Karpenter, working with Kubernetes, enables intelligent node provisioning, simplifies resource allocation, and provides production-grade reliability at scale. As explored in this paper, Agentic AI enhances Kubernetes-based DevOps processes by optimizing CI/CD orchestration, forecasting resource demand, accelerating fault recovery, and improving application lifecycle management. Continuous monitoring and control of cluster behavior, advanced diagnostics, and targeted corrective actions are all aspects of integrated Agentic AI models that make largely independent operations with limited human control possible. The paper presents key technical underpinnings, including multi-agent systems (for example, MCP servers), reinforcement learning, operator-based AI control loops, and AI-based policy enforcement. Practical examples are examined, such as automated scaling, self-healing clusters, smart canary rollouts, drift detection, and cost-constrained resource allocation. Issues such as model reliability, governance, interpretability, and production-grade security are addressed with mitigating solutions. Combining existing practices and fresh innovations, this article serves as a comprehensive guide for engineering leaders, DevOps teams, and platform designers who want to use Agentic AI in Kubernetes-driven environments. The insights highlight how automated intelligence can greatly reduce development cycles, minimize operational friction, and enable continuous, dependable delivery in the evolving, dynamic ecosystem of production.
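The operator-based control loop and cost-constrained scaling mentioned above follow a common reconciliation pattern: observe demand, compute a desired state under a cost cap, and converge the cluster toward it one step at a time. The sketch below is illustrative only; the sizing numbers and single-step convergence rule are assumptions, not Karpenter's actual algorithm.

```python
import math

def desired_nodes(pending_pods: int, pods_per_node: int, max_nodes: int) -> int:
    """Scale to fit pending pods, clamped by a cost-driven node cap."""
    return min(max_nodes, math.ceil(pending_pods / pods_per_node))

def reconcile(current: int, pending_pods: int,
              pods_per_node: int = 10, max_nodes: int = 5) -> int:
    """One controller iteration: move the cluster one node toward target."""
    target = desired_nodes(pending_pods, pods_per_node, max_nodes)
    return current + (1 if target > current else -1 if target < current else 0)

state = 1
for _ in range(6):  # a demand spike of 42 pending pods under a 5-node cap
    state = reconcile(state, pending_pods=42)
print(state)
```

An agentic extension would replace the fixed `desired_nodes` heuristic with a learned or LLM-advised policy while keeping the same reconcile-until-converged loop for safety.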
Enterprise software development faces a significant challenge in harnessing the full potential of generative artificial intelligence (GenAI) tools, which often operate as disconnected point solutions across the software delivery lifecycle. This fragmentation leads to context loss, redundant workflows, and missed opportunities for true digital transformation. This article introduces Cognitive Software Delivery (CSD), a framework designed to integrate isolated AI tools into a unified, context-aware ecosystem. CSD is built upon four foundational architectural components: the Enterprise Context Mesh (ECM) for unified, versioned knowledge management; the Model Context Protocol (MCP) for standardized and secure data access; Retrieval-Augmented Generation (RAG) for producing contextually grounded AI outputs; and Agent-to-Agent Orchestration (A2A) for intelligent, automated workflow coordination. Through seamless integration across market research, requirements engineering, design, development, testing, deployment, and continuous improvement, CSD addresses the interoperability and traceability gaps that hinder enterprise-scale AI adoption. Validation through multi-industry case studies demonstrates measurable benefits, including reductions in cycle time (up to two-thirds), defect rates (up to 90%), and review delays, alongside qualitative gains in developer satisfaction and organizational adaptability. While challenges remain—including infrastructure readiness, change management complexity, and ethical considerations such as AI bias and transparency—CSD’s modular architecture and continuous learning capabilities offer a practical and strategic pathway to accelerated, AI-native software engineering.
Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using SEC-bench, we implement two critical software security tasks to rigorously evaluate LLM agents' capabilities: proof-of-concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.
This study explores agentic AI's transformative role in product management, proposing a conceptual co-evolutionary framework to guide its integration across the product lifecycle. Agentic AI, characterized by autonomy, goal-driven behavior, and multi-agent collaboration, redefines product managers (PMs) as orchestrators of socio-technical ecosystems. Using systems theory, co-evolutionary theory, and human-AI interaction theory, the framework maps agentic AI capabilities in discovery, scoping, business case development, development, testing, and launch. An integrative review of 70+ sources, including case studies from leading tech firms, highlights PMs' evolving roles in AI orchestration, supervision, and strategic alignment. Findings emphasize mutual adaptation between PMs and AI, requiring skills in AI literacy, governance, and systems thinking. Addressing gaps in traditional frameworks, this study provides a foundation for future research and practical implementation to ensure responsible, effective agentic AI integration in software organizations.
Large Language Model (LLM)-based autonomous agents provide state-of-the-art performance in various tasks throughout the Software Development Lifecycle (SDLC). Although these agents can be integrated into Integrated Development Environments (IDEs), standalone applications impose cognitive burdens on programmers due to fragmented tool usage across the SDLC. This paper proposes a framework for a mediator agent that serves as the unified interface between the programmer, IDE tools, agentic tools, and external multi-agent systems. The framework aims to simplify interaction, reduce context switching, and automate the orchestration of the IDE tools. By aligning with proposed levels of SE automation, the mediator agent sets the stage for next-generation human-computer collaboration, enhancing productivity, and advancing software engineering automation.
Agentic Artificial Intelligence (AI) is the emerging paradigm in which autonomous software agents coordinate reasoning, tool use, retrieval, and actuation to pursue enterprise goals. We present a research-grounded and practice-ready treatment of agentic AI for the enterprise, with Human Resources (HR) and recruitment as the primary locus of application. The paper synthesizes recent advances in large language models (LLMs), retrieval-augmented generation (RAG), and vector similarity search, and operationalizes them through an agentic architecture deployed for resume parsing, profile augmentation, and candidate–job matching. We report a development roadmap, acceptance criteria, and partner validation plan derived from a production case. We close with governance patterns for bias, privacy, and explainability, and a forward look at multi-agent orchestration across the employee lifecycle.
The rapid adoption of AI coding agents for software development has raised important questions about the quality and maintainability of the code they produce. While prior studies have examined AI-generated source code, the impact of AI coding agents on build systems, a critical yet understudied component of the software lifecycle, remains largely unexplored. This data mining challenge focuses on AIDev, the first large-scale, openly available dataset capturing agent-authored pull requests (Agentic-PRs) from real-world GitHub repositories. Our paper leverages this dataset to investigate (RQ1) whether AI coding agents generate build code with quality issues (e.g., code smells), (RQ2) to what extent AI agents can eliminate code smells from build code, and (RQ3) to what extent Agentic-PRs are accepted by developers. We identified 364 maintainability and security-related build smells across varying severity levels, indicating that AI-generated build code can introduce quality issues, such as lack of error handling and hardcoded paths or URLs, while also, in some cases, removing existing smells through refactorings (e.g., Pull Up Module and Externalize Properties). Notably, more than 61% of Agentic-PRs are approved and merged with minimal human intervention. This dual impact underscores the need for future research on AI-aware build code quality assessment to systematically evaluate, guide, and govern AI-generated build systems code.
Testing is a critical practice for ensuring software correctness and long-term maintainability. As agentic coding tools increasingly submit pull requests (PRs), it becomes essential to understand how testing appears in these agent-driven workflows. Using the AIDev dataset, we present an empirical study of test inclusion in agentic pull requests. We examine how often tests are included, when they are introduced during the PR lifecycle, and how test-containing PRs differ from non-test PRs in terms of size, turnaround time, and merge outcomes. Across agents, test-containing PRs are more common over time and tend to be larger and take longer to complete, while merge rates remain largely similar. We also observe variation across agents in both test adoption and the balance between test and production code within test PRs. Our findings provide a descriptive view of testing behavior in agentic pull requests and offer empirical grounding for future studies of autonomous software development.
Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications through expansion strategies that favor higher software engineering benchmark performance, assuming that this implies more promising subsequent self-modifications. However, we identify a mismatch between the agent's self-improvement potential (metaproductivity) and its coding benchmark performance, namely the Metaproductivity-Performance Mismatch. Inspired by Huxley's concept of clade, we propose a metric (CMP) that aggregates the benchmark performances of the descendants of an agent as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true CMP is sufficient to simulate how the Gödel Machine would behave under certain assumptions. We introduce the Huxley-Gödel Machine (HGM), which, by estimating CMP and using it as guidance, searches the tree of self-modifications. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding agent development methods while using fewer allocated CPU hours. Last but not least, HGM demonstrates strong transfer to other coding datasets and large language models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and evaluated on SWE-bench Lite with GPT-5 achieves human-level performance, matching the best officially checked results of human-engineered coding agents. Our code is publicly available at https://github.com/metauto-ai/HGM.
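The CMP idea above can be made concrete with a short sketch: aggregate the benchmark performances of an agent's descendants in the self-modification tree. The dict-based tree encoding, the mean aggregator, and the leaf fallback are illustrative assumptions for this sketch, not the paper's exact estimator:

```python
# Minimal sketch of clade metaproductivity (CMP): a node's score is an
# aggregate (here: the mean, an assumption) of the benchmark
# performances of all its descendants in the self-modification tree.

def descendants(tree, node):
    """Collect all descendants of `node` in a {parent: [children]} tree."""
    out = []
    for child in tree.get(node, []):
        out.append(child)
        out.extend(descendants(tree, child))
    return out

def cmp_score(tree, perf, node):
    """Aggregate descendant benchmark performances; fall back to the
    node's own performance for a leaf (an assumption for illustration)."""
    desc = descendants(tree, node)
    if not desc:
        return perf[node]
    return sum(perf[d] for d in desc) / len(desc)

# Toy self-modification tree: agent A spawned B and C; B spawned D.
tree = {"A": ["B", "C"], "B": ["D"]}
perf = {"A": 0.30, "B": 0.25, "C": 0.40, "D": 0.50}

# B's own benchmark score (0.25) is lower than C's (0.40), yet B's clade
# looks more promising once its descendant D is taken into account --
# exactly the metaproductivity-performance mismatch.
print(cmp_score(tree, perf, "B"))  # mean over {D} = 0.50
print(cmp_score(tree, perf, "C"))  # leaf: falls back to 0.40
```

The sketch shows why greedy expansion by a node's own benchmark score can disagree with expansion guided by clade-level aggregates.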
Autonomous LLM-based agents have emerged as a powerful paradigm for complex task execution, yet the field lacks standardized tools for development, deployment, distribution and discovery of agents. We present Cerebrum, an Agent SDK for AIOS that addresses this gap through three key components: (1) a comprehensive SDK featuring a modular four-layer architecture for agent development, encompassing LLM, memory, storage, and tool management; (2) a community-driven Agent Hub for sharing and discovering agents, complete with version control and dependency management; (3) an interactive web interface for testing and evaluating agents. The platform's effectiveness is demonstrated through implementations of various agent architectures, including Chain of Thought (CoT), ReAct, and tool-use agents. Cerebrum advances the field by providing a unified framework that standardizes agent development while maintaining flexibility for researchers and developers to innovate and distribute their agents. The live website is at https://app.aios.foundation, the code is at https://github.com/agiresearch/Cerebrum, and video is at https://app.aios.foundation/video-demo.
We develop an active inference route-planning method for the autonomous control of intelligent agents. The aim is to reconnoiter a geographical area to maintain a common operational picture. To achieve this, we construct an evidence map that reflects our current understanding of the situation, incorporating both positive and "negative" sensor observations of possible target objects collected over time, and diffusing the evidence across the map as time progresses. The generative model of active inference uses Dempster-Shafer theory and a Gaussian sensor model, which provides input to the agent. The generative process employs a Bayesian approach to update a posterior probability distribution. We calculate the variational free energy for all positions within the area by assessing the divergence between a pignistic probability distribution of the evidence map and a posterior probability distribution of a target object based on the observations, including the level of surprise associated with receiving new observations. Using the free energy, we direct the agents' movements in a simulation by taking an incremental step toward a position that minimizes the free energy. This approach addresses the challenge of exploration and exploitation, allowing agents to balance searching extensive areas of the geographical map while tracking identified target objects.
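Two of the ingredients named above can be sketched in a few lines: the pignistic transform, which flattens a Dempster-Shafer mass function into an ordinary probability distribution, and a KL-divergence term of the kind that appears inside a variational free energy. The mass values and the two-element frame are hypothetical, and the paper's full functional also includes a surprise term that is omitted here:

```python
import math

def pignistic(masses):
    """Pignistic transform BetP: each focal set's mass is split evenly
    among its elements (Dempster-Shafer theory)."""
    bet = {}
    for focal, m in masses.items():
        for x in focal:
            bet[x] = bet.get(x, 0.0) + m / len(focal)
    return bet

def kl(p, q):
    """KL divergence D(p || q) between discrete distributions given as
    dicts over the same support."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Hypothetical mass function over the frame {target, clutter}:
# 0.5 committed to "target", 0.2 to "clutter", 0.3 to the whole frame
# (i.e., ignorance, which the pignistic transform splits evenly).
m = {frozenset({"target"}): 0.5,
     frozenset({"clutter"}): 0.2,
     frozenset({"target", "clutter"}): 0.3}

bet = pignistic(m)                        # {"target": 0.65, "clutter": 0.35}
post = {"target": 0.8, "clutter": 0.2}    # posterior after new observations
print(kl(bet, post))                      # divergence term of the free energy
```

Positions where this divergence (plus the surprise term) is small are the ones the agents step toward.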
The intelligent scheduling of autonomous trucks at intersections in open-pit coal mines poses a fundamental challenge due to nonlinear dynamics and complex multi-truck interactions under gradient–load coupling. Unlike existing intersection control studies that are mainly designed for flat urban roads and neglect mining-specific physical constraints, this study proposes a Gradient-Load Aware Multi-Agent Attention-Enhanced DDPG (GLA-MA-MADDPG) approach that explicitly embeds mining-domain physical laws into multi-agent reinforcement learning for intersection scheduling. A gradient–load dynamic coupling model is developed to inject physics fidelity into MARL-based intersection scheduling, capturing nonlinear and load-sensitive kinematics overlooked by flat-road assumptions in prior studies. To handle coordination under these constraints, we design a task-oriented multi-dimensional attention mechanism that jointly interprets physical heterogeneity, asymmetric dynamics, priorities, and collision risks. Additionally, a gradient-aware adaptive priority strategy redefines right-of-way as a physics-grounded, state-dependent process, ensuring safe and preferential passage for loaded and downhill trucks. In the simulated four-branch gradient intersection environment, the proposed GLA-MA-MADDPG method achieves substantial performance gains over traditional MADDPG across 20 independent runs. Specifically, it improves throughput by 23.5%, reduces average transit time by 8.3%, decreases waiting time by 31.2%, and lowers the collision rate from 3.47‰ to 1.32‰, achieving efficiency gains of 39.2% in mixed uphill–downhill scenarios. Overall, this study contributes a first-of-its-kind integration of physics-informed gradient–load modeling with attention-driven MARL, providing a generalizable computational framework for intelligent intersection scheduling in open-pit mining and other large-scale, safety-critical engineering systems.
The increase in garbage generated in modern societies demands the implementation of a more sustainable model as well as new methods for efficient waste management. This article describes the development and implementation of a prototype of a smart bin that automatically sorts waste using a multi-agent system and blockchain integration. The proposed system has sensors that identify the type of waste (organic, plastic, paper, etc.) and uses collaborative intelligent agents to make instant sorting decisions. Blockchain has been implemented as a technology for the immutable and transparent control of waste registration, favoring traceability during the classification process, providing sustainability to the process, and making the audit of data in smart urban environments transparent. For the computer vision algorithm, three versions of YOLO (YOLOv8, YOLOv11, and YOLOv12) were used and evaluated with respect to their performance in automatic detection and classification of waste. The YOLOv12 version was selected due to its overall performance, which is superior to the others, with mAP@50 values of 86.2%, an overall accuracy of 84.6%, and an average F1 score of 80.1%. Latency was kept below 9 ms per image with YOLOv12, ensuring smooth and lag-free processing, even on modest embedded hardware. This allows for efficient deployment in near-real-time applications where speed and immediate response are crucial. These results confirm the viability of the system in both accuracy and computational efficiency. This work provides an innovative solution in the field of ambient intelligence, characterized by low equipment cost and high scalability, laying the foundations for the development of smart waste management infrastructures in sustainable cities.
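The detection metrics cited above (mAP@50, F1) all reduce to matching predicted boxes against ground truth at an IoU threshold of 0.5. A self-contained sketch of that matching, with hypothetical box coordinates rather than real detector output:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def f1_at_iou(preds, gts, thr=0.5):
    """Greedy one-to-one matching of predictions to ground truth at an
    IoU threshold; returns (precision, recall, F1)."""
    matched = set()
    tp = 0
    for p in preds:
        best, best_j = 0.0, None
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best >= thr:
            tp += 1
            matched.add(best_j)
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Hypothetical example: one detection overlaps a ground-truth item well,
# the other misses entirely, so precision = recall = F1 = 0.5.
preds = [(10, 10, 50, 50), (100, 100, 120, 120)]
gts = [(12, 12, 52, 52), (200, 200, 240, 240)]
print(f1_at_iou(preds, gts))
```

mAP@50 goes one step further by sweeping the detector's confidence threshold and averaging precision over the resulting precision-recall curve, but the per-threshold matching is exactly the step shown here.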
Vision-centric perception systems struggle with unpredictable and coupled weather degradations in the wild. Current solutions are often limited, as they either depend on specific degradation priors or suffer from significant domain gaps. To enable robust and autonomous operation in real-world conditions, we propose JarvisIR, a VLM-powered agent that leverages the VLM as a controller to manage multiple expert restoration models. To further enhance system robustness, reduce hallucinations, and improve generalizability in real-world adverse weather, JarvisIR employs a novel two-stage framework consisting of supervised fine-tuning and human feedback alignment. Specifically, to address the lack of paired data in real-world scenarios, the human feedback alignment enables the VLM to be fine-tuned effectively on large-scale real-world data in an unsupervised manner. To support the training and evaluation of JarvisIR, we introduce CleanBench, a comprehensive dataset consisting of high-quality and large-scale instruction-response pairs, including 150K synthetic entries and 80K real entries. Extensive experiments demonstrate that JarvisIR exhibits superior decision-making and restoration capabilities. Compared with existing methods, it achieves a 50% improvement in the average of all perception metrics on CleanBench-Real.
Manus AI is a general-purpose AI agent introduced in early 2025, marking a significant advancement in autonomous artificial intelligence. Developed by the Chinese startup Monica.im, Manus is designed to bridge the gap between "mind" and "hand", combining the reasoning and planning capabilities of large language models with the ability to execute complex, end-to-end tasks that produce tangible outcomes. This paper presents a comprehensive overview of Manus AI, exploring its core technical architecture, diverse applications across sectors such as healthcare, finance, manufacturing, robotics, and gaming, as well as its key strengths, current limitations, and future potential. Positioned as a preview of what lies ahead, Manus AI represents a shift toward intelligent agents that can translate high-level intentions into real-world actions, heralding a new era of human-AI collaboration.
With the advancement of communication technology from 5G to 6G, future communication networks will no longer be limited to land and air; the ocean will also become a battlefield for 6G networks. This expansion of the network has broadened the scope of Intelligent Autonomous Transport Systems (IATS). As a new type of underwater transport system, Autonomous Underwater Vehicles (AUVs) have gained popularity due to their advantages of autonomy, endurance, and concealment. In practical applications, it is necessary to fully consider the impact of uncertain marine environments on AUV motion, and also to design a stable control unit to achieve AUV formation. The core of the control unit is the AUV formation control algorithm, which should enable AUVs to complete path planning and obstacle avoidance while ensuring formation control. To solve these problems, an Intelligent Multi-agent path planning and formation control algorithm based on Value-decomposition networks (IMV) is proposed in this paper. Specifically, a three-dimensional high-resolution marine simulation environment located in the Mariana Trench is established; the state transition function and reward function are carefully designed under uncertain conditions for a stable Multi-Agent Reinforcement Learning (MARL) mechanism; and a Value-Decomposition Networks (VDN) based training framework is constructed to improve the convergence speed of the proposed method. The experimental results verify the excellent performance of the proposed IMV method, demonstrating that it outperforms other methods in terms of stability, adaptability, intelligence, and timeliness.
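The VDN mechanism underlying the training framework above fits in a few lines: the joint action-value is the sum of per-agent utilities, so a single team reward can train decentralized per-agent critics. A stdlib sketch with tabular stand-ins for the neural networks (the observation labels and Q-values are invented for illustration):

```python
# Value-Decomposition Networks (VDN) in one line: Q_tot(s, a) = sum_i
# Q_i(o_i, a_i). Because the sum is maximized agent-by-agent, each agent
# can act greedily on its own utility at execution time.

def q_tot(per_agent_q, obs, acts):
    """VDN mixer: joint value is the sum of per-agent action-values."""
    return sum(q[(o, a)] for q, o, a in zip(per_agent_q, obs, acts))

def td_target(reward, gamma, per_agent_q, next_obs, action_space):
    """One-step TD target using greedy per-agent maximization, which is
    valid under VDN because the sum decomposes over agents."""
    best = sum(max(q[(o, a)] for a in action_space)
               for q, o in zip(per_agent_q, next_obs))
    return reward + gamma * best

# Two AUVs, two actions each (0 = hold formation, 1 = avoid obstacle);
# hypothetical tabular Q-values for current and next observations.
q1 = {("o1", 0): 1.0, ("o1", 1): 0.5, ("o1n", 0): 2.0, ("o1n", 1): 0.0}
q2 = {("o2", 0): 0.2, ("o2", 1): 0.8, ("o2n", 0): 0.1, ("o2n", 1): 1.5}
agents = [q1, q2]

print(q_tot(agents, ["o1", "o2"], [0, 1]))                   # 1.0 + 0.8
print(td_target(1.0, 0.9, agents, ["o1n", "o2n"], [0, 1]))   # 1.0 + 0.9*(2.0+1.5)
```

Training then regresses each Q_i so that the summed Q_tot matches this TD target; only the team reward is shared.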
This paper proposes NetIQ, a hierarchical evaluation framework for intelligent agent networks that integrates Network Knowledge Linking (NKL), Group Collaboration Synergy (GCS), and Autonomous Cognitive Capability (ACC) into a dynamically quantifiable system via differential coupling equations. Grounded in information theory, game theory, and cognitive science, NetIQ implements a nine-tier intelligence classification with adaptive thresholds. Multimodal attack simulations demonstrate rapid recovery and several key innovations: improved OODA-cycle efficiency through dynamic feedback; higher root-cause diagnosis accuracy via modular causal tracing; and environment-adaptive thresholds for heterogeneous scenarios.
In autonomous driving (AD) tasks, data-driven deep reinforcement learning (DRL) outperforms rule-based methods in terms of continuous decision-making and adaptability. However, traditional DRL relies on hand-crafted reward functions, which introduce objective alignment challenges and reward loopholes. Moreover, the black-box structure makes it difficult to explain the decision-making process, which has a direct impact on DRL performance in complex driving situations. To address these shortcomings, a preference-based decomposable proximal policy optimization algorithm (PDPPO) is proposed for reliable interactive urban AD. The framework deconstructs the federated reinforcement learning (FRL) algorithm from various perspectives using a rule-based preference model, resulting in high-availability algorithmic performance for AD. PDPPO employs a data-rule fusion-driven hybrid vision transformer to overcome the objective alignment and high-dimensional state-space representation challenges of traditional DRL in complex urban traffic environments. Furthermore, to address the issue of algorithmic trustworthiness, PDPPO models the multi-agent FRL co-optimization process as an interpretable self-organized group collaboration process. This approach enables the algorithm to strike a balance between model robustness and sample efficiency using preference-heuristic parameter aggregation. The simulation results demonstrate that the proposed PDPPO algorithmic framework can implement interpretable single-agent decision control and multi-agent co-optimization processes. Furthermore, it exhibits competitive performance on various benchmark tests.
While powerful and well-established, tools like ParaView present a steep learning curve that can discourage many potential users. This work introduces ParaView-MCP, an autonomous agent that integrates modern multimodal large language models (MLLMs) with ParaView to not only lower the barrier to entry but also augment ParaView with intelligent decision support. By leveraging the state-of-the-art reasoning, command execution, and vision capabilities of MLLMs, ParaView-MCP enables users to interact with ParaView through natural language and visual inputs. Specifically, our system adopted the Model Context Protocol (MCP), a standardized interface for model-application communication, which facilitates direct interaction between MLLMs and ParaView’s Python API, allowing seamless information exchange between the user, the language model, and the visualization tool itself. Furthermore, by implementing a visual feedback mechanism that allows the agent to observe the viewport, we unlock a range of new capabilities, including recreating visualizations from examples, closed-loop visualization parameter updates based on user-defined goals, and even cross-application collaboration involving multiple tools.
Large Language Models (LLMs) have achieved impressive progress in decision-making and task automation for intelligent agents. However, multiple agents must cooperate to complete tasks in complex real-world applications, such as auto-annotating in autonomous driving. The primary challenges lie in how multiple agents effectively communicate and collaborate in a multi-modal environment and how to automatically refine annotating results to reduce human intervention. These challenges also hinder LLMs from fully evolving into embodied intelligent agents. Driven by these motivations, we propose ALGPT, a multi-agent cooperative framework for open-vocabulary multi-modal auto-annotation in autonomous driving. ALGPT dynamically assembles agent teams with different roles, and agents cooperate to complete annotation tasks according to requirements. By leveraging Chain of Thought (CoT) and In-Context Learning (ICL) techniques, ALGPT's reasoning capabilities are enhanced, allowing it to develop suitable plans autonomously without human intervention. Furthermore, drawing from project management standards, we introduce project management documents and Standard Operating Procedures (SOPs), which further align ALGPT's behavior with human expectations and mitigate the impact of GPT hallucinations caused by the cascading effects of multiple GPTs.
Learning from experience is a fundamental capability of intelligent agents. Autonomous systems rely on sensors that provide data about the environment and internal situations to their perception systems for learning and inference mechanisms. These systems can also learn Self-Aware and Situation-Aware generative modules from these data to localize themselves and interact with the environment. In this paper, we propose a self-aware cognitive architecture capable of performing tasks where the interactions between the self-state of an agent and the surrounding environment are explicitly and dynamically represented. We specifically develop a Deep Learning (DL) based Self-Aware interaction model, empowered by learning from Multi-Modal Perception (MMP) and World Models using multi-sensory data in a novel Multi-Agent Self-Awareness Architecture (MASAA). Two sub-modules are developed, the Situation Model (SM) and the First-Person Model (FPM), that address different and interrelated aspects of the World Model (WM). The MMP model, instead, aims at learning the mapping of different sensory perceptions into Exteroceptive (EI) and Proprioceptive (PI) latent information. The WM then uses the learned MMP model as experience to predict dynamic self-behaviors and interaction patterns within the experienced environment. WM and MMP models are learned in a data-driven way, starting from the lower-dimensional odometry data used to guide the learning of higher-dimensional video data, thus generating coupled Generalized State Hierarchical Dynamic Bayesian Networks (GS-HDBNs). We test our model on KITTI, CARLA, and iCab datasets, achieving high performance and a low average localization error (RMSE) of 2.897%, when considering two interacting agents.
Large language models (LLMs) and vision-language models (VLMs) have the potential to transform biological research by enabling autonomous experimentation. Yet, their application remains constrained by rigid protocol design, limited adaptability to dynamic lab conditions, inadequate error handling, and high operational complexity. Here we introduce BioMARS (Biological Multi-Agent Robotic System), an intelligent platform that integrates LLMs, VLMs, and modular robotics to autonomously design, plan, and execute biological experiments. BioMARS uses a hierarchical architecture: the Biologist Agent synthesizes protocols via retrieval-augmented generation; the Technician Agent translates them into executable robotic pseudo-code; and the Inspector Agent ensures procedural integrity through multimodal perception and anomaly detection. The system autonomously conducts cell passaging and culture tasks, matching or exceeding manual performance in viability, consistency, and morphological integrity. It also supports context-aware optimization, outperforming conventional strategies in differentiating retinal pigment epithelial cells. A web interface enables real-time human-AI collaboration, while a modular backend allows scalable integration with laboratory hardware. These results highlight the feasibility of generalizable, AI-driven laboratory automation and the transformative role of language-based reasoning in biological research.
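The three-agent hierarchy described above (Biologist drafts, Technician compiles, Inspector gates) can be illustrated purely schematically; nothing below is the BioMARS code, and the protocol steps and banned-pattern check are invented stand-ins for RAG-based synthesis and multimodal anomaly detection:

```python
# Schematic pipeline for a Biologist -> Technician -> Inspector hierarchy.

def biologist(goal):
    """Draft a protocol for the goal (stand-in for retrieval-augmented
    protocol synthesis)."""
    return ["warm medium", "aspirate medium", "add trypsin", "incubate"]

def technician(protocol):
    """Translate each protocol step into robotic pseudo-code."""
    return [f"robot.do({step!r})" for step in protocol]

def inspector(steps, banned=("add bleach",)):
    """Veto the plan if any step matches a banned pattern (stand-in for
    multimodal perception and anomaly detection)."""
    bad = [s for s in steps if any(b in s for b in banned)]
    return ("rejected", bad) if bad else ("approved", steps)

plan = technician(biologist("cell passaging"))
status, result = inspector(plan)
print(status, len(result))
```

The point of the structure is that generation, compilation, and verification are separated, so a faulty plan is stopped before it reaches the hardware.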
While autonomous driving systems and intelligent transportation infrastructures become increasingly software-defined and network-connected, ensuring their cybersecurity has become a critical component of traffic safety. Large language models (LLMs) have recently shown promise in automating aspects of penetration testing, yet most existing approaches remain limited to simple, single-step exploits. They struggle to handle complex, multi-stage vulnerabilities that demand precise coordination, contextual reasoning, and knowledge reuse. This is particularly problematic in safety-critical domains, such as autonomous vehicles, where subtle software flaws can cascade across interdependent subsystems. In this work, we present CurriculumPT, a novel LLM-based penetration testing framework specifically designed for the security of intelligent systems. CurriculumPT combines curriculum learning and a multi-agent system to enable LLM agents to progressively acquire and apply exploitation skills across Common Vulnerabilities and Exposures (CVE)-based tasks. Through a structured progression from simple to complex vulnerabilities, agents build and refine an experience knowledge base that supports generalization to new attack surfaces without requiring model fine-tuning. We evaluate CurriculumPT on 15 real-world vulnerability scenarios and demonstrate that it outperforms three state-of-the-art baselines by up to 18 percentage points in exploit success rate, while achieving superior efficiency in execution time and resource usage. Our results confirm that CurriculumPT is capable of autonomous, scalable penetration testing and knowledge transfer, laying the groundwork for intelligent security auditing of modern autonomous driving systems and other cyberphysical transportation platforms.
Autonomous Intersection Management (AIM) systems present a novel paradigm for the cooperative control of Connected and Automated Vehicles (CAVs) at unsignalized intersections in future cities. Although Reinforcement Learning (RL) offers potential for increased computational efficiency and optimized solutions, challenges remain. These include limited inference capabilities and poor generalization due to simplified neural networks, along with insufficient safety-focused policy optimization. This study presents a novel offline-to-online framework, Prior-Enhanced Multi-Agent Constrained Decision Transformer (PE-MACDT), designed to tackle these challenges. The process begins with sequential decision-making using offline safe RL, which determines optimal actions through autoregressive modeling based on past states, actions, and both reward and cost returns. Leveraging the superior reasoning abilities and strong generalization of large language models like GPT-x and BERT, the sequence modeling challenges are addressed using the Transformer architecture, enhanced by sequence-level entropy regularizers to foster policy exploration. Subsequently, the safety policy learned from the offline dataset is deployed in the online environment and fine-tuned using the Multi-Agent Constrained Policy Optimization (MACPO) method combined with prior knowledge. This approach employs trust and constraint domains for policy updates, ensuring adherence to high standards of safety, comfort, and efficiency in dynamic traffic environments. Simulation results show our methodology outperforms state-of-the-art AIM methods in training convergence speed and asymptotic performance, as well as post-deployment outcomes in traffic efficiency, driving safety, and passenger comfort. The integration of offline pre-training with MACDT and online fine-tuning using MACPO offers a groundbreaking approach with significant potential for advancements in intelligent transportation systems.
Driving decision-making in mixed traffic, characterized by high-dynamic interactions and stochastic behaviors of human-driven vehicles, poses significant challenges for autonomous driving systems. To address these issues, we propose a novel Transformer-based Spatial Temporal Fusion (TSTF) module integrated with an auxiliary contrastive learning task within a multi-agent reinforcement learning (MARL) framework. The TSTF module captures interaction-aware behaviors and long-term temporal dependencies that tackle mixed cooperative driving scenarios, while the auxiliary contrastive learning task refines feature representations to enhance exploration efficiency and decision stability. Experimental evaluations on the MetaDrive platform demonstrate that the proposed approach outperforms baseline algorithms in safety, adaptability and robustness to dynamic traffic scenarios. The results highlight the effectiveness of the TSTF module in enabling robust and context-aware collaborative driving behaviors, offering a scalable solution for real-world mixed traffic. This work advances MARL by addressing key challenges in interaction modeling and driving decision-making under uncertainty, with significant implications for the development of intelligent transportation systems.
This paper addresses the challenge of decentralized task allocation within heterogeneous multi-agent systems operating under communication constraints. We introduce a novel framework that integrates Graph Neural Networks (GNNs) with a centralized training and decentralized execution (CTDE) paradigm, further enhanced by a tailored Proximal Policy Optimization (PPO) algorithm for multi-agent deep reinforcement learning (MARL). Our approach enables unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to dynamically allocate tasks efficiently without necessitating central coordination in a 3D grid environment. The framework minimizes total travel time while simultaneously avoiding conflicts in task assignments. For cost calculation and routing, we employ reservation-based A* and R* path planners. Experimental results revealed that our method achieves a 92.5% conflict-free success rate, with only a 7.49% performance gap compared to the centralized Hungarian method, while outperforming a heuristic decentralized baseline based on a greedy approach. Additionally, the framework scales to 20 agents with an allocation processing time of 2.8 s and responds robustly to dynamically generated tasks, underscoring its potential for real-world applications in complex multi-agent scenarios.
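Reservation-based planning of the kind mentioned above can be sketched as A* over (cell, time) states: cells claimed by other agents at particular timesteps live in a reservation table, and a path must wait or detour around them. The 2D grid, the move set (including waiting in place), and the time horizon below are illustrative assumptions, not the paper's planner:

```python
import heapq

def spacetime_astar(grid, start, goal, reserved, max_t=50):
    """A* over (cell, time) states on a 4-connected grid with waiting.
    `reserved` is a set of (x, y, t) cells claimed by other agents."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan
    open_q = [(h(start), 0, start, [start])]  # (f, t, cell, path so far)
    seen = set()
    while open_q:
        f, t, pos, path = heapq.heappop(open_q)
        if pos == goal:
            return path  # path[t] is the cell occupied at time t
        if (pos, t) in seen or t >= max_t:
            continue
        seen.add((pos, t))
        x, y = pos
        for dx, dy in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if not (0 <= nx < len(grid) and 0 <= ny < len(grid[0])):
                continue
            if grid[nx][ny] or (nx, ny, t + 1) in reserved:
                continue  # static obstacle or cell reserved by another agent
            heapq.heappush(open_q, (t + 1 + h((nx, ny)), t + 1,
                                    (nx, ny), path + [(nx, ny)]))
    return None

grid = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # 1 = static obstacle
reserved = {(0, 1, 1)}                      # another agent holds (0,1) at t=1
path = spacetime_astar(grid, (0, 0), (2, 2), reserved)
print(path)
```

Allocating a task then amounts to running this planner per candidate assignment, using the path length as the travel-time cost, and committing the winning path's cells back into the reservation table.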
Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their ability to respond to more strategic threats, (ii) the attacks used in training struggle to cause truly safety-critical events (e.g., collisions) and instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.
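The abstract's "constrained formulation" is not detailed; a standard realization in constrained RL is a Lagrangian relaxation, where a dual-ascent step raises the multiplier whenever the policy's constraint cost exceeds its budget. A toy numeric sketch, where `dual_ascent` and the closed-form cost model are stand-ins for an actual trained policy, not the authors' method:

```python
def dual_ascent(costs, limit, lam=0.0, lr=0.5, iters=300):
    """Toy Lagrangian dual ascent for constrained RL: the multiplier lam
    rises while the average constraint cost exceeds the limit, so the
    penalized objective (reward - lam * cost) discourages violations.
    `costs(lam)` returns the average cost the policy incurs under lam."""
    for _ in range(iters):
        lam = max(0.0, lam + lr * (costs(lam) - limit))  # project onto lam >= 0
    return lam
```

With a toy cost model in which the incurred cost falls as the penalty rises, the multiplier settles where the cost meets the budget, which is the behavior such a constraint is meant to enforce against attack-induced policy drift.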
Path planning is one of the most crucial elements of autonomous driving (AD). Because they can make judgments directly from observations and learn from the environment, learning-based path planning techniques have attracted wide research interest. The standard deep Q-network approach to reinforcement learning has made major strides in AD, but the agent normally learns driving tactics solely from a hand-designed reward function, which is difficult to adapt to urban driving scenarios. Moreover, such methods rarely use global path data to address directional planning problems, such as turning around at an intersection. In addition, because steering and acceleration are governed independently in a real-world driving system, the coupling between different motion instructions can easily lead to erroneous predictions of path orders. This research proposes and implements a Provisional Cross-layered Deep Q-Network (PC-DQN) for path planning in end-to-end autonomous vehicles, where the global path is employed to direct the vehicle from the starting point to the ending point. To manage the dependence between distinct path instructions in Q-networks, we employ Improved Harmony Search-optimized fuzzy control (IHS-FC) and propose a defuzzification approach that increases the stability of predicting the values of the various path instructions. We carry out extensive tests in the CARLA simulator and compare our approach with state-of-the-art methods. Experimental findings show that the proposed strategy outperforms existing methods in learning efficiency and driving reliability.
In mobile networks, digital twin (DT)-aided, artificial intelligence (AI)-empowered link control is vital to enhancing the performance of wireless communication. This article proposes a DRL and convex optimization enhanced time-frequency domain power allocation scheme to reduce the long-term average bit-error-rate (BER) in multiuser orthogonal frequency division multiplexing (OFDM) systems. To alleviate the performance loss caused by trial-and-error during the training period of deep reinforcement learning (DRL) algorithms, we design a novel practical DT-aided “prediction-then-decision” autonomous wireless link control framework that accounts for the periodic interaction mechanism between the DT and its physical counterpart. A Transformer-based channel generator, Mucomformer, is implemented in the DT layer to generate large amounts of multiuser virtual channel state information (CSI) for future transmission frames. In addition, the DRL agent is trained over the DT channel in advance and executed in the real-world OFDM system to generate the optimal transmission strategy. The simulation results demonstrate that the proposed Mucomformer achieves a 2.51 dB lower average prediction error than the Transformer baseline. The DRL and convex-based power allocation scheme further outperforms the classic strategy. Moreover, the practical DT-aided autonomous link control framework effectively mitigates the performance impairment, achieving a 45.65% average BER performance gain over the framework without DT and faster convergence throughout the training period.
The increasing complexity of intelligent transportation systems driven by the deployment of connected autonomous vehicles necessitates advanced, adaptive traffic management solutions. This paper presents a novel hybrid learning architecture that integrates deep reinforcement learning with ant colony optimization to enable edge-aware, real-time routing decisions for connected vehicle networks. Each vehicle functions as an intelligent agent trained via a multi-agent actor-critic deep reinforcement learning model, enhanced with attention mechanisms for context-sensitive navigation. Ant colony optimization is used to refine global routing paths through pheromone-inspired collective learning, ensuring adaptive optimization under dynamic traffic conditions. The proposed approach is formalized as ADAPT-S (Ant-Driven Adaptive Policy Training with Simulation-Based Validation), combining swarm intelligence, policy learning, and Simulation of Urban Mobility (SUMO)-based validation. The model demonstrates superior performance in SUMO environments, outperforming traditional routing methods in reducing travel time, congestion, and signal wait periods. The results validate the framework’s effectiveness in addressing key limitations of existing methods, particularly in achieving decentralized coordination, scalability, and responsiveness in connected autonomous vehicle routing systems.
Recent advancements in vision-language models (VLMs) have opened new possibilities for intelligent cultural heritage preservation, where autonomous systems can assist in navigation, inspection, and trajectory prediction within complex heritage environments. To this end, this paper proposes a novel optimization framework that integrates zero-shot multimodal evaluation with full-parameter fine-tuning, forming an iterative “Evaluate–Train–Evaluate” closed loop for autonomous driving models. We design a multi-modal autonomous driving agent comprising four core modules—Perception, Evaluation, Policy, and Learning—to enable adaptive optimization and enhance both task generalization and engineering deployability. The proposed agent supports end-to-end multimodal reasoning and decision-making, allowing seamless transition from benchmarking to real-world deployment. Experimental results show that the framework not only improves trajectory prediction accuracy and stability but also ensures real-time inference capability, demonstrating the feasibility of bridging evaluation and optimization in autonomous driving VLMs.
Emerging mobility systems are an example of Cyber-Physical Systems (CPSs) in which multiple autonomous agents (vehicles) interact with each other as well as with the infrastructure resources (roadside units, traffic lights, etc.). Control-theoretic and optimization methods provide a rich framework for managing these complex mixed-traffic socio-economic multi-agent systems. Given the complexity involved and the abundance of data now available, it is essential to integrate learning-based methods not only to design optimal controllers with safety guarantees, but to also gain an understanding of human driving behavior, as well as user preferences for the mobility options that intelligent transportation systems provide. The three objectives of this tutorial paper are: (1) Set the stage for emerging mobility systems consisting of both autonomous and human-driven vehicles in a mixed traffic environment by formulating basic optimal control problems for autonomous vehicles that seek to jointly optimize travel time, energy, and comfort while ensuring that safety constraints are always satisfied. (2) Present methods for solving the formulated problems using a combination of optimization techniques and Control Barrier Functions (CBFs) that provide safety guarantees, as well as state-of-the-art learning-based methods to design effective controllers for mixed traffic transportation systems. (3) Address the societal issues accompanying emerging mobility systems, including new metrics that incorporate accessibility and fairness in a transportation network consisting of both autonomous and human-driven vehicles.
There is a lack of systematic research on the behavioral design of charging decision-making for Shared Autonomous Electric Vehicles (SAEVs), and the thresholds of “when to charge and where to charge” have not been clarified. Therefore, this paper investigates the optimization of charging decisions for SAEVs and the impact of different decision-making objectives to provide theoretical support and practical guidance for intelligent operation. A multi-agent simulation model (which accurately simulates complex interaction systems) is constructed to simulate the operation and charging behavior of SAEVs. Four charging decision optimization objective functions are defined, and a weighted multi-objective optimization method is adopted. A comprehensive solution process combining the multi-agent simulation model and a genetic algorithm (efficiently solving complex objective optimization problems) is applied to approximate the global optimal solution among 35 scenarios through 100 iterative runs. In this paper, factors such as passenger demand (e.g., average remaining battery power, demand response time) and operator demand (e.g., empty vehicle mileage, charging cost) are considered, and the impacts of different objectives and decision variables are analyzed. The optimization results show that (1) when a single optimization objective is selected, minimizing the total charging cost effectively balances the overall fleet operation; (2) there are trade-offs between different objectives, such as the conflict between the remaining battery power and charging cost, and the balance between the demand response time and the empty vehicle mileage; and (3) in order to satisfy the operational requirements, the weight distribution, charging probability, stopping probability, and recommended battery power should be adjusted.
In conclusion, this study provides optimal charging decision strategies for the intelligent operation of SAEVs in different scenarios, optimizing objective weights and charging parameters to achieve dynamic, balanced fleet management.
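The weighted multi-objective optimization with a genetic algorithm described above can be sketched in miniature. The two objectives below are placeholders for the paper's actual criteria (charging cost, response time, etc.), and `ga_weighted` with its tournament/blend/mutation choices is an illustrative assumption, not the authors' implementation:

```python
import random

def ga_weighted(objectives, weights, bounds, pop=30, gens=60, seed=1):
    """Minimise a weighted sum of objectives with a tiny real-coded GA:
    tournament selection, blend crossover, Gaussian mutation."""
    rng = random.Random(seed)
    fit = lambda x: sum(w * f(x) for w, f in zip(weights, objectives))
    clip = lambda v, lo, hi: max(lo, min(hi, v))
    popn = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop)]
    for _ in range(gens):
        nxt = []
        for _ in range(pop):
            # Pick two parents by size-3 tournament, blend, then mutate.
            a = min(rng.sample(popn, 3), key=fit)
            b = min(rng.sample(popn, 3), key=fit)
            child = [clip((x + y) / 2 + rng.gauss(0, 0.1), lo, hi)
                     for x, y, (lo, hi) in zip(a, b, bounds)]
            nxt.append(child)
        popn = nxt
    return min(popn, key=fit)
```

Changing the `weights` vector reproduces the paper's central experiment: shifting weight between conflicting objectives moves the solution along their trade-off.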
The rapid enterprise adoption of multi-cloud, microservice architectures introduces unprecedented complexity and security challenges. Traditional, reactive security models are proving inadequate, as code changes can propagate to global production systems within minutes, leaving minimal time for after-the-fact audits. Existing security solutions often operate in silos, failing to provide a coordinated and autonomous defense posture capable of addressing threats that span heterogeneous cloud environments. This paper introduces a novel framework for autonomous, cross-cloud threat mitigation that utilizes Multi-Agent Reinforcement Learning (MARL). In our proposed system, lightweight, self-defending artificial intelligence agents are deployed within each cloud environment to act as intelligent sentinels inside the software-delivery pipeline. These agents learn collaboratively to identify and remediate security risks in real-time, functioning as self-healing remediation agents. Through simulated multi-cloud failure scenarios, we demonstrate that this approach can significantly reduce mean-time-to-resolution for security incidents, projecting improvements comparable to the 60% reduction in vulnerability patch time observed in related empirical studies.
Distributed Space Systems (DSS) play a vital role in the success of multi-spacecraft missions, which are garnering considerable attention because of their affordability through lower costs of multiple smaller spacecraft, adaptability through reconfiguration, and resilience to failure through redundancy. These systems enable collaborative endeavours among spacecraft, thus amplifying exploration capabilities within such missions. Nevertheless, the presence of multiple satellites amplifies the system’s complexity and raises the probability of fault occurrences. Consequently, an efficient health and mission management (HMM) system capable of accurately detecting and identifying faults within such a complex system is imperative to enhance mission success. In this study, we introduce an innovative Intelligent Agent-based HMM (IHMM) architecture for multi-spacecraft systems, leveraging Intelligent Agents (IAs) to seamlessly integrate mission success with satellite health and resilience. A thorough exploration and classification of diverse data sources suitable for integration into IAs is conducted, categorised according to their deployment type and intended roles. To evaluate and validate our proposed architecture, we conducted a preliminary analysis using one-time and continuous friction faults on a reaction wheel. The experiments show our approach outperforms traditional methods by proactively adapting control strategies in real-time and preventing saturation of other reaction wheels.
With multi-agent systems developing rapidly in various fields, cooperative perception has attracted much attention as a critical technology to enhance the intelligence level of autonomous systems. However, in actual complex scenes, the perception ability of a single intelligent agent is often constrained by problems such as occlusion and limited perception range. To this end, inter-agent cooperative sensing has emerged to enhance overall perception performance by cooperatively sharing perception information. In this paper, we systematically study multi-agent cooperative perception in depth and propose a novel and comprehensive evaluation framework, addressing the fact that existing literature mainly focuses on latency or communication factors in cooperative perception evaluation. The framework covers four key modules: feature extraction, feature compression, feature fusion, and target detection, and aims to address the multifaceted challenges in multi-agent perception. Through an in-depth evaluation of existing cooperative perception algorithms, we comprehensively map the performance of each algorithm under the guidance of the framework. In particular, we find that without correct pose alignment, detection performance degrades drastically as latency increases. Our comprehensive framework will drive the development of multi-agent cooperative perception by providing researchers with a transparent and standardised methodology for evaluating, comparing, and improving existing cooperative perception approaches.
Forest wildfires present a substantial risk to biological systems and human populations, necessitating swift detection and coordinated response techniques. This project introduces a multi-agent drone swarm framework designed for autonomous fire detection and suppression in dynamic environments. The proposed system consists of two phases: an intelligent fire detection phase that employs Ant Colony Optimization (ACO) and a fire suppression phase that utilizes Particle Swarm Optimization (PSO) and Artificial Potential Fields (APF). The project investigates Multi-Agent Reinforcement Learning through Multi-Agent Proximal Policy Optimization (MAPPO) inside a centralized training and decentralized execution (CTDE) framework. A Markovian grid-based simulation environment was developed to evaluate swarm behavior, incorporating realistic limitations such as restricted energy and communication capability. A hardware prototype employing ESP32-based ground robots was developed, incorporating centralized vision-based tracking and wireless control to evaluate real-world viability. A comparative analysis highlights the advantages of each algorithm in terms of coverage, energy efficiency, and suppression efficiency, emphasizing the robustness and scalability of the proposed swarm technique. This study presents a novel method for improving fire response systems through the integration of control techniques, artificial intelligence, and biological inspirations, while minimizing human involvement.
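Of the swarm techniques named above, the PSO suppression phase is the most compact to sketch: each drone blends inertia with pulls toward its own best-known position and the swarm's global best. A minimal stdlib version, where `pso`, the coefficient values, and the toy "fire distance" objective are illustrative assumptions rather than the project's implementation:

```python
import random

def pso(target_fn, bounds, n=20, iters=80, w=0.6, c1=1.5, c2=1.5, seed=7):
    """Canonical PSO: each particle tracks a personal best, the swarm a
    global best; velocities blend inertia, cognitive, and social pulls."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]
    vel = [[0.0] * dim for _ in range(n)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=target_fn)[:]
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])   # cognitive pull
                             + c2 * r2 * (gbest[d] - pos[i][d]))     # social pull
                pos[i][d] += vel[i][d]
            if target_fn(pos[i]) < target_fn(pbest[i]):
                pbest[i] = pos[i][:]
                if target_fn(pbest[i]) < target_fn(gbest):
                    gbest = pbest[i][:]
    return gbest
```

In the suppression scenario, `target_fn` would score a drone position by its distance to the detected fire front, so the swarm converges on the blaze.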
The rapid growth of digital assets in higher education has intensified the need for intelligent, secure, and scalable campus knowledge hubs capable of supporting teaching, research, and administrative decision-making. Autonomous AI agents offer a promising paradigm by enabling proactive, context-aware, and adaptive knowledge services with minimal human intervention. This paper presents a secure and intelligent system architecture for autonomous AI agents designed specifically for campus knowledge hubs in 2025. The proposed architecture integrates layered knowledge management, event-driven orchestration, and specialized autonomous agents for search, reasoning, learning, and compliance. Core functionalities include intelligent knowledge ingestion, semantic enrichment, context-aware retrieval, and continuous learning, all governed by embedded security, privacy, and policy enforcement mechanisms. Trust-aware intelligence is achieved through identity and access management, secure agent communication, audit logging, and regulatory compliance by design. Performance evaluation using representative 2025 benchmarks demonstrates that the system achieves high accuracy, low latency, and efficient user interaction while maintaining minimal security overhead. The results indicate that autonomous AI agents significantly enhance knowledge retrieval efficiency and user experience compared to conventional AI-based systems. Overall, this work demonstrates that secure autonomous agent architectures can serve as a foundational enabler for next-generation campus knowledge hubs, supporting intelligent, transparent, and responsible knowledge management in higher education institutions.
Next-generation networks offer significant potential to advance Intelligent Transportation Systems (ITS), particularly through the integration of Digital Twins (DTs). However, ensuring the uninterrupted operation of DTs through efficient computing resource management remains an open challenge. This paper introduces a distributed computing architecture that integrates DTs and Mobile Edge Computing (MEC) within a software-defined vehicular networking framework to enable intelligent, low-latency transportation services. A network-aware, scalable collaborative task provisioning algorithm is developed to train an autonomous agent, which is evaluated using a realistic connected autonomous vehicle (CAV) traffic simulation. The proposed framework significantly enhances the robustness and scalability of DT operations, reducing synchronization errors to as low as 5% while achieving up to 99.5% utilization of edge computing resources.
The evolution of customer relationship management (CRM) platforms is entering a new era with the integration of generative AI and autonomous agent architectures. This article explores the transformation of CRM systems, from traditional models to AI-powered platforms that leverage large language models for automating customer engagement. We discuss the architectural primitives, design patterns, and agent-enablement infrastructures that support autonomous CRM agents. Security, compliance, and transparency in these systems are also addressed. Furthermore, we explore the skills and platform capabilities necessary to thrive in the new CRM landscape, emphasizing the potential for these systems to scale human-like customer interactions while enhancing business operations.
The growing complexity and interconnectivity of cyber-physical systems pose challenges to real-time decision-making, resource optimization, and security. Existing systems usually fail to integrate autonomous robotics, IoT networks, and intelligent energy management with secure, adaptable operation. This project envisions a next-generation AI-enabled cyber-physical ecosystem that overcomes such challenges through the integration of autonomous robot and drone swarms, distributed IoT sensors, cloud computing, smart energy modules, sophisticated cybersecurity, and brain-computer interfacing. Multi-modal AI architectures analyze real-time data for predictive analysis and adaptable system behavior, while multi-agent reinforcement learning coordinates autonomous agents to perform tasks such as environmental monitoring, disaster management, logistics, and smart city operations. Firebase facilitates secure, real-time data storage with cross-platform accessibility, and a brain-computer interface enables the direct neural control of devices. The innovation in this system is the fully autonomous, cloud-unified platform that offers intelligent, predictive, and secure management for future smart ecosystems.
Mass gatherings such as the Hajj pilgrimage and Kumbh Mela attract millions of participants, presenting unique challenges in crowd management, mobility coordination, and emergency response. Traditional systems often struggle with the scale, unpredictability, and real-time decision-making demands of such events. This article proposes the use of autonomous multi-agent systems (MAS) to address these challenges through decentralized, intelligent coordination among agents equipped with sensing, communication, and decision-making capabilities. Drawing upon a wide body of recent research in agent-based modeling, GIS integration, real-time evacuation management, and UAV autonomy, the study develops a conceptual MAS framework tailored for pilgrim mobility and emergency intervention. The framework is validated through simulation studies modeling crowd behaviors and agent responsiveness during hypothetical crises. Results demonstrate significant improvements in evacuation efficiency, congestion mitigation, and hazard avoidance when compared to centralized systems. Furthermore, the integration of real-time data analytics and autonomic computing enhances the adaptability and responsiveness of the system. This research contributes to the growing body of literature on AI-driven urban safety and offers practical implications for planners and policymakers managing high-density religious events in complex environments.
To address the challenge of insufficient interactivity and behavioral diversity in autonomous driving decision-making, this paper proposes a Cognitive Hierarchical Agent for Reasoning and Motion Stylization (CHARMS). By leveraging Level-k game theory, CHARMS captures human-like reasoning patterns through a two-stage training pipeline comprising reinforcement learning pretraining and supervised fine-tuning. This enables the resulting models to exhibit diverse and human-like behaviors, enhancing their decision-making capacity and interaction fidelity in complex traffic environments. Building upon this capability, we further develop a scenario generation framework that utilizes the Poisson cognitive hierarchy theory to control the distribution of vehicles with different driving styles through Poisson and binomial sampling. Experimental results demonstrate that CHARMS is capable of both making intelligent driving decisions as an ego vehicle and generating diverse, realistic driving scenarios as environment vehicles. The code for CHARMS is released at https://github.com/chuduanfeng/CHARMS.
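The Poisson cognitive hierarchy sampling mentioned above can be sketched directly: reasoning levels are drawn from a Poisson distribution truncated at a maximum level, so most environment vehicles reason shallowly while a few reason deeply. `level_distribution`, `sample_levels`, and the parameter values below are illustrative assumptions, not the CHARMS implementation:

```python
import math
import random

def level_distribution(tau, max_k):
    """Truncated Poisson(tau) over reasoning levels 0..max_k, as in
    Poisson cognitive hierarchy models: a level-k agent assumes the
    other agents reason at levels below k."""
    raw = [math.exp(-tau) * tau ** k / math.factorial(k) for k in range(max_k + 1)]
    z = sum(raw)
    return [p / z for p in raw]                 # renormalise after truncation

def sample_levels(tau, max_k, n, seed=0):
    """Assign a reasoning level to each of n environment vehicles."""
    rng = random.Random(seed)
    probs = level_distribution(tau, max_k)
    return rng.choices(range(max_k + 1), weights=probs, k=n)
```

Raising `tau` shifts mass toward higher levels, which is one knob a scenario generator could use to control how strategic the surrounding traffic behaves.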
Recent advances in closed-loop planning benchmarks have significantly improved the evaluation of autonomous vehicles. However, existing benchmarks still rely on rule-based reactive agents such as the Intelligent Driver Model (IDM), which lack behavioral diversity and fail to capture realistic human interactions, leading to oversimplified traffic dynamics. To address these limitations, we present nuPlan-R, a new reactive closed-loop planning benchmark that integrates learning-based reactive multi-agent simulation into the nuPlan framework. Our benchmark replaces the rule-based IDM agents with noise-decoupled diffusion-based reactive agents and introduces an interaction-aware agent selection mechanism to ensure both realism and computational efficiency. Furthermore, we extend the benchmark with two additional metrics to enable a more comprehensive assessment of planning performance. Extensive experiments demonstrate that our reactive agent model produces more realistic, diverse, and human-like traffic behaviors, leading to a benchmark environment that better reflects real-world interactive driving. We further reimplement a collection of rule-based, learning-based, and hybrid planning approaches within our nuPlan-R benchmark, providing a clearer reflection of planner performance in complex interactive scenarios and better highlighting the advantages of learning-based planners in handling complex and dynamic scenarios. These results establish nuPlan-R as a new standard for fair, reactive, and realistic closed-loop planning evaluation. We will open-source the code for the new benchmark.
Traditional automation and security management cannot handle the operational complexity brought about by the emergence of hyperscale cloud computing and dynamic DevSecOps workflows. We provide a novel system based on a Large Language Model (LLM)-Driven Autonomous Cloud Automation Agent to overcome these issues. This agent uses adaptive policy enforcement, continuous learning, and intelligent decision-making to manage, secure, and optimize cloud-native settings. Our agent uses the cognitive power of transformer-based LLMs to reason over multimodal telemetry, interpret natural language questions, and independently carry out security and operational regulations. The architecture seamlessly integrates with Kubernetes and service mesh environments and includes AI-based anomaly detection, dynamic access restriction, and enhanced identity verification. When our approach is evaluated using adversarial security events and simulated cloud operations, it outperforms traditional systems in terms of mean time to remediation (MTTR), policy enforcement accuracy, and responsiveness. This work introduces self-adaptive, LLM-driven orchestration in line with Zero Trust principles, which represents a major improvement in cloud security and automation.
The cyber-security ecosystem is evolving fast, with Artificial Intelligence (AI) giving rise to both stronger defenses and more sophisticated forms of Advanced Persistent Threats (APTs). AI-powered APTs are a new breed of intelligent, adaptive, and self-learning cyber attackers that can autonomously exploit vulnerabilities, evade detection, and persist within networks. Organizations, in turn, are shifting from static, rule-based controls to fully autonomous defensive agents able to conduct continuous monitoring, predict threats, interrupt attacks in real time, and respond actively. This paper examines the new paradigm of Agent-vs-Agent Cyber Warfare, where autonomous AI defenses confront AI-driven APTs on dynamic digital platforms. We describe the architecture of AI-based offensive APT agents, analyze defensive multi-agent systems (MAS), and propose a proactive cyber-battlefield model based on reinforcement learning (RL), large language models (LLMs), and self-evolving threat intelligence. Lastly, we outline constraints, ethical aspects, and the way forward for securing digital ecosystems in an era of autonomous cyber warfare.
The increasing integration of autonomous electric vehicles (EVs) into Intelligent Transportation Systems (ITSs) necessitates rigorous mechanisms to ensure their safe and effective operation in dynamic environments. The reliability of such vehicles depends not only on their internal capabilities but also on the suitability and safety of the environments in which they operate. This paper introduces a formal modelling framework that independently captures the dynamic evolution of the environmental context and the EV agent using multi-parameter transition systems. Two distinct models are defined: the Context Transition System (CTS), which models changes in environmental states, and the Agent Transition System (ATS), which captures the internal state evolution of the EV. Safety and liveness properties are formally specified in Computation Tree Logic (CTL) and verified using the nuXmv model checker. The framework is validated through two representative use cases: a dynamic urban delivery zone and an autonomous electric delivery vehicle. The results highlight the framework’s effectiveness in detecting unsafe conditions, verifying mission objectives, and supporting the reliable deployment of EVs in ITS.
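The paper verifies CTL properties with nuXmv; as a rough stdlib stand-in, an AG-style safety property over the product of the CTS and ATS can be checked by explicit reachability: if no unsafe (context, agent) pair is reachable, the property holds. The tiny `clear`/`blocked` context and `drive`/`stop` agent below are illustrative assumptions, not the paper's use cases:

```python
from collections import deque

def reachable(cts, agent_step, init):
    """BFS over the product of context and agent dynamics. `cts` maps a
    context state to its successor contexts; `agent_step(agent, next_ctx)`
    returns the agent states allowed under that next context. Checking
    AG !unsafe then reduces to asserting no unsafe pair is returned."""
    seen, queue = {init}, deque([init])
    while queue:
        c, a = queue.popleft()
        for c2 in cts[c]:
            for a2 in agent_step(a, c2):
                if (c2, a2) not in seen:
                    seen.add((c2, a2))
                    queue.append((c2, a2))
    return seen
```

With a context that alternates between `clear` and `blocked`, and an agent policy forced to `stop` whenever the context becomes `blocked`, the unsafe pair `("blocked", "drive")` is never reachable, mirroring the kind of unsafe-condition detection the framework targets.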
Despite the programmable architecture of Open RAN, today's deployments still rely heavily on static control and manual operations. To move beyond this limitation, we introduce AgentRAN, an AI-native, Open RAN-aligned agentic framework that generates and orchestrates a fabric of distributed AI agents based on natural language intents. Unlike traditional approaches that require explicit programming, AgentRAN's LLM-powered agents interpret natural language intents, negotiate strategies through structured conversations, and orchestrate control loops across the network. AgentRAN instantiates a self-organizing hierarchy of agents that decompose complex intents across time scales (from sub-millisecond to minutes), spatial domains (cell to network-wide), and protocol layers (PHY/MAC to RRC). A central innovation is the AI-RAN Factory, which continuously generates improved agents and algorithms from operational data, transforming the network into a system that evolves its own intelligence. We validate AgentRAN through live 5G experiments, demonstrating dynamic adaptation to changing operator intents across power control and scheduling. Key benefits include transparent decision-making (all agent reasoning is auditable), bootstrapped intelligence (no initial training data required), and continuous self-improvement via the AI-RAN Factory.
This position paper presents A4FN, an Agentic Artificial Intelligence (AI) architecture for intent-driven automation in Flying Networks (FNs) using Unmanned Aerial Vehicles (UAVs) as access nodes. A4FN leverages Generative AI and Large Language Models (LLMs) to enable real-time, context-aware network control via a distributed agentic system. It comprises two components: the Perception Agent (PA), which semantically interprets multimodal input – including imagery, audio, and telemetry data – from UAV-mounted sensors to derive Service Level Specifications (SLSs); and the Decision-and-Action Agent (DAA), which reconfigures the network based on inferred intents. A4FN embodies key properties of Agentic AI, including autonomy, goal-driven reasoning, and continuous perception-action cycles. Designed for mission-critical, infrastructure-limited scenarios such as disaster response, it supports adaptive reconfiguration, dynamic resource management, and interoperability with emerging wireless technologies. The paper details the A4FN architecture, its core innovations, and open research challenges in multi-agent coordination and Agentic AI integration in next-generation FNs.
The application of agentic AI systems to autonomous decision-making is growing in healthcare, smart cities, digital forensics, and supply chain management. While these systems are flexible and offer real-time reasoning, they also raise concerns about trust, oversight, and the integrity of the information and actions on which they are founded. This paper proposes a unified architectural model that combines a LangChain-based multi-agent system with a permissioned blockchain to guarantee continuous monitoring, policy enforcement, and immutable auditability of agentic actions. The framework links the perception-conceptualization-action cycle to a blockchain governance layer that verifies inputs, evaluates recommended actions, and records execution outcomes. A Hyperledger Fabric-based system, MCP-integrated action executors, and LangChain agents are introduced, and experiments on smart inventory management, traffic-signal control, and healthcare monitoring are conducted. The results suggest that blockchain-secured verification is effective in preventing unauthorized actions, offers traceability throughout the decision-making process, and keeps operational latency within reasonable bounds. The proposed framework provides a general approach for implementing high-impact agentic AI applications that are autonomous yet accountable.
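The core property this abstract relies on, immutable auditability of agent actions, can be illustrated without a full Hyperledger Fabric deployment: chain each perception/decision/outcome record to the hash of the previous one, so any later tampering breaks verification. This is a minimal stdlib sketch of the idea, not the paper's implementation.

```python
# Minimal hash-chained audit log: tampering with any earlier record
# invalidates every subsequent link. Illustrative only.
import hashlib
import json

def _digest(prev_hash: str, record: dict) -> str:
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

class AuditChain:
    def __init__(self):
        self.blocks = []  # list of (record, hash) pairs

    def append(self, record: dict) -> str:
        prev = self.blocks[-1][1] if self.blocks else "genesis"
        h = _digest(prev, record)
        self.blocks.append((record, h))
        return h

    def verify(self) -> bool:
        prev = "genesis"
        for record, h in self.blocks:
            if _digest(prev, record) != h:
                return False
            prev = h
        return True

chain = AuditChain()
chain.append({"step": "perceive", "input": "inventory low"})
chain.append({"step": "act", "action": "reorder", "approved": True})
ok_before = chain.verify()
chain.blocks[0][0]["input"] = "tampered"   # simulate after-the-fact tampering
ok_after = chain.verify()
```

A permissioned blockchain adds distributed consensus and access control on top of this basic tamper-evidence property.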
Traditional static cybersecurity models often struggle with scalability, real-time detection, and contextual responsiveness in current digital product ecosystems, which span cloud services, application programming interfaces (APIs), mobile platforms, and edge devices. This study introduces autonomous, goal-driven agents capable of dynamic learning and context-aware decision-making as part of an adaptive cybersecurity architecture driven by agentic artificial intelligence (AI). To facilitate autonomous threat mitigation, proactive policy enforcement, and real-time anomaly detection, the framework integrates agentic AI across the key ecosystem layers. Behavioral baselining, decentralized risk scoring, and federated threat intelligence sharing are important features. The system's capacity to identify zero-day attacks and dynamically modify access policies was demonstrated through cloud-native simulations. The evaluation results show increased adaptability, decreased response latency, and improved detection accuracy. The architecture provides an intelligent and scalable blueprint for safeguarding complex digital infrastructure and is compatible with zero-trust models, thereby supporting adherence to international cybersecurity regulations.
Semantic communications (SemCom), as one of the key technologies for 6G, is shifting networks from bit transmission to semantic information exchange. On this basis, introducing agentic artificial intelligence (AI) with perception, memory, reasoning, and action capabilities provides a practicable path to intelligent communications. This paper provides a systematic exposition of how agentic AI empowers SemCom from the perspectives of research foundations, system architecture, and application scenarios. We first provide a comprehensive review of existing studies by agent types, covering embedded agents, large language model (LLM)/large vision model (LVM) agents, and reinforcement learning (RL) agents. Additionally, we propose a unified agentic AI-enhanced SemCom framework covering the application layer, the semantic layer, and the cloud-edge collaboration layer, forming a closed loop from intent to encoding to transmission to decoding to action to evaluation. We also present several typical scenarios, including multi-vehicle collaborative perception, multi-robot cooperative rescue, and agentic operations for intellicise (intelligent and concise) networks. Furthermore, we introduce an agentic knowledge base (KB)-based joint source-channel coding case study, AKB-JSCC, where the source KB and channel KB are built by LLM/LVM agents and RL agents, respectively. Experimental results show that AKB-JSCC achieves higher information reconstruction quality under different channel conditions. Finally, we discuss future evolution and research directions, providing a reference for portable, verifiable, and controllable research and deployment of agentic SemCom.
Agentic AI introduces a new paradigm in enterprise architecture by enabling autonomous, communicative, and goal-driven agents that operate across distributed systems. Building upon generative AI, agentic frameworks deliver scalable, self-orchestrating architectures capable of optimizing performance, fortifying cybersecurity, and enhancing operational resilience. This paper presents a foundational architecture that integrates Agentic AI principles into enterprise systems, offering a unified approach to intelligent orchestration, security, and adaptability. This paper explores the convergence of agentic and generative AI through recent frameworks, emphasizing Agent-to-Agent (A2A) communication protocols, intent-based coordination via the Agent Communication Protocol (ACP), and integration with LLM-powered reasoning engines. Agentic architectures automate complex workflows, detect and respond to cyber threats, and simulate failure scenarios through decentralized, intelligent agents. The proposed framework incorporates event-driven communication, vectorized memory, and optional blockchain-backed verification to support trust, transparency, and traceability across agents. The result is a composable, adaptive infrastructure that redefines how enterprise systems achieve agility, security, and continuity. When implemented with robust governance and AI oversight, Agentic AI powered by A2A emerges as a transformative force for high-performance, resilient enterprise design.
Agentic AI systems increasingly serve critical enterprise functions (planning, decision support, and customer interactions), but their memory substrates remain brittle, ungoverned, and unauditable. We introduce EMA2, an Enterprise Memory Architecture for Agentic AI, which unifies multi-tier memory (context cache, working set, semantic store, cold archive) with policy-aware read/write enforcement and verifiable W3C PROV-based lineage. EMA2 embeds a Policy Guardrail Engine aligned with GDPR, the EU AI Act, NIST AI RMF, and ISO/IEC 42001 to enforce least-privilege access and retention controls, and uses a Retrieval Planner that optimizes a novel rate-distortion-risk (RDR) objective balancing task utility, compression loss, and governance risk. Experiments across enterprise QA and task agents show that EMA2 significantly improves retrieval quality, reduces hallucination, eliminates unauthorized memory access, and achieves 100% provenance traceability, without compromising latency or task performance. EMA2 provides a deployable, auditable memory substrate for trustworthy enterprise AI.
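The rate-distortion-risk idea named in this abstract can be sketched as a scoring rule: each memory candidate earns its task utility minus weighted penalties for compression loss and governance risk, with a hard policy ceiling on risk. The weights, field names, and thresholds below are assumptions for illustration, not EMA2's actual formulation.

```python
# Hedged sketch of an RDR-style retrieval planner. Weights (lam, mu),
# the risk ceiling, and the candidate schema are illustrative assumptions.

def rdr_score(utility: float, distortion: float, risk: float,
              lam: float = 0.5, mu: float = 2.0) -> float:
    """Utility minus weighted compression loss and governance risk."""
    return utility - lam * distortion - mu * risk

def plan_retrieval(candidates: list[dict], k: int = 2,
                   risk_ceiling: float = 0.5) -> list[str]:
    # Hard policy gate first: drop candidates whose risk exceeds the ceiling,
    # then rank the remainder by RDR score and keep the top k.
    allowed = [c for c in candidates if c["risk"] <= risk_ceiling]
    ranked = sorted(
        allowed,
        key=lambda c: rdr_score(c["utility"], c["distortion"], c["risk"]),
        reverse=True,
    )
    return [c["id"] for c in ranked[:k]]

candidates = [
    {"id": "m1", "utility": 0.9, "distortion": 0.1, "risk": 0.0},
    {"id": "m2", "utility": 0.8, "distortion": 0.0, "risk": 0.9},  # policy-violating
    {"id": "m3", "utility": 0.6, "distortion": 0.2, "risk": 0.1},
]
selected = plan_retrieval(candidates)
```

Separating the hard policy gate from the soft scoring is what lets a planner trade utility against distortion while still guaranteeing that no disallowed memory is ever retrieved.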
No abstract available
The emergence of Agentic AI autonomous systems that can make and execute decisions without human intervention has presented new and complex challenges in cybersecurity. Traditional trust models and defense mechanisms are insufficient to handle these dynamic, intelligent threats. In this paper, we propose a novel Cognitive Trust Architecture (CTA) aimed at detecting, assessing, and mitigating agentic AI-driven cyber threats. We introduce an adaptive trust reasoning framework that continuously adjusts trust levels based on behavioral indicators, intent inference, and contextual analysis. Additionally, the framework incorporates autonomous adversary modeling to predict and counter potential attack strategies. By leveraging this approach, we demonstrate the efficacy of CTA in enhancing system integrity, reducing false positives in trust assessments, and improving resilience against evolving AI-driven adversaries. This work represents a significant advancement in applying cognitive trust as a proactive defense mechanism to counter intelligent, autonomous cyber threats.
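The continuous trust adjustment described in this abstract can be illustrated with a simple exponentially weighted update: each new behavioral indicator nudges the trust level, and thresholds map trust to an enforcement tier. The update rule, smoothing factor, and thresholds are assumptions for illustration, not the CTA's actual mechanism.

```python
# Toy sketch of adaptive trust reasoning: trust drifts toward observed
# behavior; thresholds decide the enforcement tier. All constants assumed.

def update_trust(trust: float, indicator: float, alpha: float = 0.3) -> float:
    """Exponentially weighted update toward the latest behavioral indicator
    in [0, 1], where 1.0 means fully benign-looking behavior."""
    return (1 - alpha) * trust + alpha * indicator

def classify(trust: float) -> str:
    if trust >= 0.7:
        return "trusted"
    if trust >= 0.4:
        return "monitor"
    return "quarantine"

trust = 0.9                          # agent starts with high trust
for obs in [0.8, 0.2, 0.1, 0.0]:     # behavior degrades over time
    trust = update_trust(trust, obs)
label = classify(trust)
```

The point of the smoothing factor is that a single anomalous observation lowers trust gradually, while a sustained pattern of hostile behavior pushes the agent into quarantine, which helps reduce false positives.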
No abstract available
Agentic AI shifts from simple prediction to systems that can act on their own. These systems plan, reason, and carry out tasks across large and complex setups. They can improve work speed and help teams do more with less effort. But with more freedom comes real risk: teams must think about design, safety, and ethics before they deploy these systems. This paper explains how to bring agentic AI into an enterprise in a safe and clear way. We share design patterns that give agents room to act while keeping strong control and full visibility. We outline an ethics model that covers safety, human impact, and long-term use. We also present a governance plan with clear checks, human review, and risk-control steps. The goal is to give architects and leaders a simple path to adopting agentic AI while staying true to company values. The approach supports progress without losing security, trust, or accountability.
This paper explores **Agentic Artificial Intelligence (AI)** systems that exhibit autonomous, goal-directed behavior and independent decision-making. It discusses their **architectural foundations**, including reasoning engines, memory systems, and planning frameworks that enable self-directed actions. The study highlights key **capabilities** such as multi-step reasoning, tool use, and adaptive learning across various domains. It also examines **challenges** related to safety, alignment, and ethical deployment. Overall, the paper emphasizes the transformative potential of agentic AI and the need for responsible governance in its development.
Errors in Retrieval-Augmented Generation (RAG) and agentic AI systems are commonly attributed to limitations of large language models. Based on two years of developing ☸️SAIMSARA, a large-scale agentic AI system for medical scientific article review, this editorial argues that most failures arise instead from architectural design and workload allocation. Key pitfalls include prompt overload, excessive batch size, oversized input items, and the use of speed-optimized models in multi-stage workflows. Practical mitigation strategies are outlined, emphasizing the importance of aligning prompt complexity, batch structure, and input size with model capacity. The findings highlight that robustness in RAG and agentic AI systems is primarily a systems engineering challenge rather than a model selection problem.
No abstract available
The rise of agentic artificial intelligence is changing how businesses operate, manage systems, and oversee digital workflows. These systems differ from conventional automation or standalone AI models because they rely on structured reasoning, secure tool usage, advanced collaboration between multiple agents, and ongoing feedback in complex environments spanning hybrid and multi-cloud systems. But there is a major issue: businesses lack a clear framework for adopting, using, and scaling agentic AI while staying compliant. This paper tackles that problem by presenting the Enterprise Agentic Architecture Framework (EAAF), a detailed multi-layered reference model built to help large organizations safely deploy and manage agentic AI at scale. EAAF is built on six key layers: infrastructure, enterprise integration, orchestration and coordination, governance and safety, agent intelligence, and agent interaction. A central Control Plane ties these layers together, managing policies, identity, scheduling, observability, and the lifecycle of individual agents as well as multi-agent systems. Tests on real-world enterprise cases such as Opportunity-to-Order automation, DevOps and AIOps pipelines, integration workflows, and cross-domain multi-agent collaboration show that EAAF improves autonomy, ensures reliable reasoning, boosts execution efficiency, and strengthens operational resilience. The tests reveal significant gains, including workflows running 3 to 10 times faster, mean time to resolution (MTTR) cut by 60 to 80 percent, and clear policy-guided safety improvements. In sum, EAAF serves as a key framework for building future enterprise AI systems: it ensures safe autonomy, establishes consistent architecture, and organizes agent-driven operations for critical tasks.
Service-Oriented Architecture (SOA) structures applications into collections of modular, independent, and reusable services. We propose an SOA-based intelligent service agent framework for building AI applications that decomposes complex tasks into independent functional units. In the framework, the agent operates as an intelligent executor that dynamically orchestrates and invokes diverse services and tools to achieve its goals. The agent is exposed as a self-contained service with a well-defined API, allowing external applications to invoke it directly. By instrumenting requests and responses at both the service and agent layers, the framework enables tracing of the agent's capabilities, performance, and decision-making. We present the design of an operational scheme for the agent with decentralized identifier (DID) handling, verifiable credentials (VC), and verifiable presentations (VP). The agents collaborate on a shared blackboard-based workspace to handle tasks and reach a goal. Finally, we demonstrate feasibility through a proof-of-concept (PoC) for an Agentic AI service architecture. The PoC, structured across Phase 1 (discovery, verification, and scoped authorization) and Phase 2 (problem posting and blackboard-mediated collaboration), demonstrates that DID-backed credentialing can securely support multi-agent execution under a least-privilege operational model.
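The blackboard collaboration pattern this abstract refers to is simple to sketch: agents share a common workspace, each posts partial results under named keys, and each acts only when the entries it depends on appear. The agent roles and task below are illustrative, not the paper's PoC.

```python
# Minimal blackboard workspace: agents coordinate only through shared
# entries, never by calling each other directly. Roles are illustrative.

class Blackboard:
    def __init__(self, goal: str):
        self.goal = goal
        self.entries = {}          # key -> (author, value)

    def post(self, agent: str, key: str, value):
        self.entries[key] = (agent, value)

    def read(self, key):
        return self.entries[key][1] if key in self.entries else None

def planner(bb: Blackboard):
    # Contributes a plan if none exists yet.
    if bb.read("plan") is None:
        bb.post("planner", "plan", ["fetch-data", "summarize"])

def executor(bb: Blackboard):
    # Acts only once a plan has been posted.
    plan = bb.read("plan")
    if plan and bb.read("result") is None:
        bb.post("executor", "result", f"done: {', '.join(plan)}")

bb = Blackboard("produce report")
for agent in (planner, executor):   # agents take turns watching the board
    agent(bb)
result = bb.read("result")
```

Because all coordination passes through the board, the same instrumentation point can log every read and write, which is what makes the tracing and least-privilege checks in the abstract practical.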
No abstract available
As generative AI (GenAI) agents become more common in enterprise settings, they introduce security challenges that differ significantly from those posed by traditional systems. These agents aren’t just LLMs—they reason, remember, and act, often with minimal human oversight. This paper introduces a comprehensive threat model tailored specifically for GenAI agents, focusing on how their autonomy, persistent memory access, complex reasoning, and tool integration create novel risks. The work identifies nine primary threats and organizes them across five key domains: cognitive architecture vulnerabilities, temporal persistence threats, operational execution vulnerabilities, trust boundary violations, and governance circumvention. These threats aren’t just theoretical—they bring practical challenges such as delayed exploitability, cross-system propagation and lateral movement, and subtle goal misalignments that are hard to detect with existing frameworks and standard approaches. To help address this, the work presents two complementary frameworks: ATFAA (Advanced Threat Framework for Autonomous AI Agents), which organizes agent-specific risks, and SHIELD, a framework proposing practical mitigation strategies designed to reduce enterprise exposure. While this work builds on existing work in LLM and AI security, the focus is squarely on what makes agents different—and why those differences matter. Ultimately, this research argues that GenAI agents require a new lens for security. If we fail to adapt our threat models and defenses to account for their unique architecture and behavior, we risk turning a powerful new tool into a serious enterprise liability.
We are in a transformative era, and advances in Artificial Intelligence (AI), especially foundational models, are constantly in the news. AI has been an integral part of many applications that rely on automation for service delivery, among them mission-critical public safety applications. The problem with AI-oriented mission-critical applications is the reliance on human-in-the-loop systems and the lack of adaptability to dynamic conditions while maintaining situational awareness. Agentic AI (AAI) has gained much attention recently due to its ability to analyze textual data through a contextual lens while quickly adapting to conditions. In this context, this paper proposes an AAI framework for mission-critical applications. We propose a novel framework with a multi-layer architecture to realize the AAI, and present a detailed implementation of the AAI layer that bridges the gap between network infrastructure and mission-critical applications. Our preliminary analysis shows that the AAI reduces initial response time by 5.6 minutes on average, while alert generation time is reduced by 15.6 seconds on average and resource allocation is improved by up to 13.4%. We also show that the AAI methods increase the number of concurrent operations by 40, which reduces recovery time by up to 5.2 minutes. Finally, we highlight some of the issues and challenges that need to be considered when implementing AAI frameworks.
No abstract available
No abstract available
From automated intrusion testing to discovery of zero-day attacks before software launch, agentic AI holds great promise for security engineering. This strong capability comes bound with a corresponding threat: the security and research community must mature its models before the approach is leveraged by malicious actors for cybercrime. We therefore propose and evaluate RedTeamLLM, an integrated architecture with a comprehensive security model for the automation of pentest tasks. RedTeamLLM follows three key steps, summarize, reason, and act, which embed its operational capacity. This novel framework addresses four open challenges: plan correction, memory management, context window constraints, and generality vs. specialization. Evaluation is performed through the automated resolution of a range of entry-level, but not trivial, CTF challenges. The contribution of the reasoning capability of our agentic AI framework is specifically evaluated.
While microservices are revolutionizing cloud computing by offering unparalleled scalability and independent deployment, their decentralized nature poses significant security and management challenges that can threaten system stability. We propose a framework based on MAPE-K, which leverages agentic AI, for autonomous anomaly detection and remediation to address the daunting task of highly distributed system management. Our framework offers practical, industry-ready solutions for maintaining robust and secure microservices. Practitioners and researchers can customize the framework to enhance system stability, reduce downtime, and monitor broader system quality attributes such as system performance level, resilience, security, and anomaly management, among others.
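The MAPE-K loop named in this abstract (Monitor, Analyze, Plan, Execute over a shared Knowledge base) can be sketched end to end in a few functions. The metric, threshold, and remediation choices below are illustrative assumptions, not the framework's actual policies.

```python
# Compact MAPE-K sketch for microservice anomaly remediation.
# The latency metric, threshold, and restart action are assumed for
# illustration; a real loop would scrape telemetry and call an orchestrator.

KNOWLEDGE = {"latency_threshold_ms": 200, "restarts": []}

def monitor(service_metrics: dict) -> dict:
    # In practice: collect telemetry from each service.
    return service_metrics

def analyze(metrics: dict) -> list:
    # Flag services whose latency exceeds the knowledge-base threshold.
    return [s for s, latency in metrics.items()
            if latency > KNOWLEDGE["latency_threshold_ms"]]

def plan(anomalous: list) -> list:
    # Choose a remediation per anomalous service.
    return [("restart", s) for s in anomalous]

def execute(actions: list) -> None:
    # Apply actions and record them back into the knowledge base.
    for verb, service in actions:
        if verb == "restart":
            KNOWLEDGE["restarts"].append(service)

metrics = {"cart": 120, "payments": 480, "search": 90}
execute(plan(analyze(monitor(metrics))))
```

Keeping thresholds and action history in the shared knowledge base is what lets later iterations of the loop adapt, for example by raising the threshold for a service that restarts too often.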
This paper introduces a novel framework that integrates agentic Artificial Intelligence (AI) with Intent-Based Networks (IBN) to enable autonomous management, configuration, and optimization of mobile network services and resources. Leveraging the advanced reasoning and natural language processing capabilities of an Large Language Model (LLM), the proposed architecture translates high-level user intents into precise network actions, facilitating user-friendly and scalable network orchestration. The framework employs a distributed multi-agent system, where specialized agents collaborate to decompose user intents, provide computational infrastructure, and deploy services using industry-standard Infrastructure-as-Code (IaC) tools. By supporting natural language interactions, the system reduces operational complexity and enhances accessibility for users with varying technical expertise. Experimental evaluations demonstrate significant improvements in task completion rates, response accuracy, and operational efficiency compared to traditional manual methods, particularly for complex network management tasks. In essence, this work creates an intelligent network orchestration framework that adapts to user needs by automatically configuring network and computing resources while operating with minimal human intervention.
Large Language Model (LLM)-based agents increasingly interact, collaborate, and delegate tasks to one another autonomously with minimal human interaction. Industry guidelines for agentic system governance emphasize the need for users to maintain comprehensive control over their agents, mitigating potential damage from malicious agents. Several proposed agentic system designs address agent identity, authorization, and delegation, but remain purely theoretical, without concrete implementation and evaluation. Most importantly, they do not provide user-controlled agent management. To address this gap, we propose SAGA, a scalable Security Architecture for Governing Agentic systems, that offers user oversight over their agents' lifecycle. In our design, users register their agents with a central entity, the Provider, that maintains agent contact information, user-defined access control policies, and helps agents enforce these policies on inter-agent communication. We introduce a cryptographic mechanism for deriving access control tokens, that offers fine-grained control over an agent's interaction with other agents, providing formal security guarantees. We evaluate SAGA on several agentic tasks, using agents in different geolocations, and multiple on-device and cloud LLMs, demonstrating minimal performance overhead with no impact on underlying task utility in a wide range of conditions. Our architecture enables secure and trustworthy deployment of autonomous agents, accelerating the responsible adoption of this technology in sensitive environments.
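The idea of scope-bound access-control tokens for inter-agent calls can be illustrated with a keyed MAC: a provider-held secret derives a token over the agent identity and scope, so a token issued for one scope cannot be replayed for another. This particular HMAC construction is an assumption for illustration, not SAGA's actual cryptographic mechanism.

```python
# Toy sketch of scoped token derivation and verification. The HMAC-over-
# "agent|scope" construction and key handling are illustrative assumptions.
import hashlib
import hmac

PROVIDER_KEY = b"provider-root-secret"   # held only by the Provider

def derive_token(agent_id: str, scope: str) -> str:
    msg = f"{agent_id}|{scope}".encode()
    return hmac.new(PROVIDER_KEY, msg, hashlib.sha256).hexdigest()

def verify_token(agent_id: str, scope: str, token: str) -> bool:
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(derive_token(agent_id, scope), token)

token = derive_token("agent-42", "read:calendar")
ok = verify_token("agent-42", "read:calendar", token)
escalated = verify_token("agent-42", "write:calendar", token)  # scope mismatch
```

Binding the scope into the token is what gives the fine-grained control the abstract describes: possessing a read token confers nothing toward a write operation.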
When combining Large Language Models (LLMs) with autonomous agents, used in network monitoring and decision-making systems, this will create serious security issues. In this research, the MAESTRO framework consisting of the seven layers threat modeling architecture in the system was used to expose, evaluate, and eliminate vulnerabilities of agentic AI. The prototype agent system was constructed and implemented, using Python, LangChain, and telemetry in WebSockets, and deployed with inference, memory, parameter tuning, and anomaly detection modules. Two practical threat cases were confirmed as follows: (i) resource denial of service by traffic replay denial-of-service, and (ii) memory poisoning by tampering with the historical log file maintained by the agent. These situations resulted in measurable levels of performance degradation, i.e. telemetry updates were delayed, and computational loads were increased, as a result of poor system adaptations. It was suggested to use a multilayered defense-in-depth approach with memory isolation, validation of planners and anomaly response systems in real-time. These findings verify that MAESTRO is viable in operational threat mapping, prospective risk scoring, and the basis of the resilient system design. The authors bring attention to the importance of the enforcement of memory integrity, paying attention to the adaptation logic monitoring, and cross-layer communication protection that guarantee the agentic AI reliability in adversarial settings.
Self-evolving agentic artificial intelligence (AI) offers a new paradigm for future wireless systems by enabling autonomous agents to continually adapt and improve without human intervention. Unlike static AI models, self-evolving agents embed an autonomous evolution cycle that updates models, tools, and workflows in response to environmental dynamics. This paper presents a comprehensive overview of self-evolving agentic AI, highlighting its layered architecture, life cycle, and key techniques, including tool intelligence, workflow optimization, self-reflection, and evolutionary learning. We further propose a multi-agent cooperative self-evolving agentic AI framework, where multiple large language models (LLMs) are assigned role-specialized prompts under the coordination of a supervisor agent. Through structured dialogue, iterative feedback, and systematic validation, the system autonomously executes the entire life cycle without human intervention. A case study on antenna evolution in low-altitude wireless networks (LAWNs) demonstrates how the framework autonomously upgrades fixed antenna optimization into movable antenna optimization. Experimental results show that the proposed self-evolving agentic AI autonomously improves beam gain and restores degraded performance by up to 52.02%, consistently surpassing the fixed baseline with little to no human intervention and validating its adaptability and robustness for next-generation wireless intelligence.
No abstract available
AI agents have recently shown significant promise in software engineering. Much public attention has been fixed on code generation from Large Language Models (LLMs) via a prompt. However, software engineering is much more than programming, and AI agents go far beyond instructions given by a prompt. At the code level, common software tasks include code generation, testing, and program repair. Design-level software tasks may include architecture exploration, requirements understanding, and requirements enforcement at the code level. Each of these software tasks involves micro-decisions which can be taken autonomously by an AI agent, aided by program analysis tools. This creates the vision of an AI software engineer, where the AI agent can be seen as a member of a development team. Conceptually, the key to successfully developing trustworthy agentic AI-based software workflows will be to resolve the core difficulty in software engineering: the deciphering and clarification of developer intent. Specification inference, or deciphering the intent, thus lies at the heart of many software tasks, including software maintenance and program repair. A successful deployment of agentic technology into software engineering would involve making conceptual progress in such intent inference via agents. Trusting the AI agent becomes a key aspect as software engineering becomes more automated. Higher automation also leads to a higher volume of code being automatically generated and then integrated into code-bases. To deal with this explosion, an emerging direction is AI-based verification and validation (V&V) of AI-generated code. We posit that agentic software workflows in the future will include such AI-based V&V.
The electricity sector transition requires substantial increases in residential demand response capacity, yet Home Energy Management Systems (HEMS) adoption remains limited by user interaction barriers requiring translation of everyday preferences into technical parameters. While large language models have been applied to energy systems as code generators and parameter extractors, no existing implementation deploys LLMs as autonomous coordinators managing the complete workflow from natural language input to multi-appliance scheduling. This paper presents an agentic AI HEMS where LLMs autonomously coordinate multi-appliance scheduling from natural language requests to device control, achieving optimal scheduling without example demonstrations. A hierarchical architecture combining one orchestrator with three specialist agents uses the ReAct pattern for iterative reasoning, enabling dynamic coordination without hardcoded workflows while integrating Google Calendar for context-aware deadline extraction. Evaluation across three open-source models using real Austrian day-ahead electricity prices reveals substantial capability differences. Llama-3.3-70B successfully coordinates all appliances across all scenarios to match cost-optimal benchmarks computed via mixed-integer linear programming, while other models achieve perfect single-appliance performance but struggle to coordinate all appliances simultaneously. Progressive prompt engineering experiments demonstrate that analytical query handling without explicit guidance remains unreliable despite models' general reasoning capabilities. We open-source the complete system including orchestration logic, agent prompts, tools, and web interfaces to enable reproducibility, extension, and future research.
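The cost-optimal benchmark idea in this abstract can be conveyed with a toy scheduler: given a day-ahead price curve, place each appliance's required run-hours into the cheapest hours before its deadline. The paper uses mixed-integer linear programming; this greedy sketch is exact only for an independent, interruptible single-appliance load and is purely illustrative (prices and deadlines are made up).

```python
# Toy day-ahead scheduling: pick the cheapest feasible hour slots.
# Greedy selection is optimal here only because the load is a single
# interruptible appliance with no coupling constraints.

def schedule(prices: list[float], hours_needed: int, deadline: int) -> list[int]:
    """Pick the cheapest `hours_needed` hour slots strictly before `deadline`."""
    candidates = sorted(range(deadline), key=lambda h: prices[h])
    return sorted(candidates[:hours_needed])

def cost(prices: list[float], slots: list[int]) -> float:
    return sum(prices[h] for h in slots)

prices = [30, 12, 10, 25, 40, 8, 50, 45]  # illustrative EUR/MWh, hours 0..7
slots = schedule(prices, hours_needed=2, deadline=6)
total = cost(prices, slots)
```

Coupling constraints (contiguous cycles, shared power caps, multiple appliances) are what push the real problem into MILP territory, where greedy choices stop being optimal.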
Agentic AI networking (AgentNet) is a novel AI-native networking paradigm that relies on a large number of specialized AI agents to collaborate and coordinate for autonomous decision-making, dynamic environmental adaptation, and complex goal achievement. It has the potential to facilitate real-time network management alongside capabilities for self-configuration, self-optimization, and self-adaptation across diverse and complex networking environments, laying the foundation for fully autonomous networking systems in the future. Despite its promise, AgentNet is still at an early stage of development, and an effective networking framework to support automatic goal discovery and multi-agent self-orchestration and task assignment is still lacking. This paper proposes SANNet, a novel semantic-aware agentic AI networking architecture that can infer the semantic goal of the user and automatically assign agents associated with different layers of a mobile system to fulfill the inferred goal. Motivated by the fact that one of the major challenges in AgentNet is that different agents may have different and even conflicting objectives when collaborating toward certain goals, we introduce a dynamic weighting-based conflict-resolving mechanism to address this issue. We prove that SANNet provides theoretical guarantees for both conflict resolution and model generalization performance in multi-agent collaboration in dynamic environments. We develop a hardware prototype of SANNet based on the open RAN and 5GS core platform. Our experimental results show that SANNet can significantly improve the performance of multi-agent networking systems, even when agents with conflicting objectives are selected to collaborate on the same goal.
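The dynamic weighting-based conflict resolution mentioned in this abstract can be illustrated in miniature: each agent proposes an action value from its own objective, and the consensus weights proposals by how well each agent's objective aligns with the shared goal. The normalization rule and the alignment scores below are assumptions for illustration, not SANNet's actual mechanism.

```python
# Illustrative weighted-consensus step for conflicting agent objectives.
# The alignment-proportional weighting rule is an assumed stand-in.

def resolve(proposals: dict, goal_alignment: dict) -> float:
    """proposals: {agent: proposed action value};
    goal_alignment: {agent: score in [0, 1] vs. the shared goal}.
    Returns an alignment-weighted consensus action value."""
    total = sum(goal_alignment.values())
    weights = {a: goal_alignment[a] / total for a in proposals}
    return sum(weights[a] * proposals[a] for a in proposals)

proposals = {"throughput-agent": 10.0, "energy-agent": 2.0}
alignment = {"throughput-agent": 0.9, "energy-agent": 0.3}  # energy agent conflicts more
consensus = resolve(proposals, alignment)
```

Making the weights dynamic, recomputed as alignment scores change, is what lets a system like this down-weight an agent exactly when its objective starts to conflict with the inferred goal.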
No abstract available
Security Operations Centers (SOCs) face significant challenges due to the large volume, diversity, and dynamics of incident events. Alarm fatigue, delayed initiation of response, and the high share of false positives or missed threats limit team effectiveness and increase organizational risk. This study presents a methodology for automated management of key performance indicators (KPIs) in an SOC environment through an Agentic AI architecture and machine learning. Within the project, 214 CSV files were processed, comprising over 8.6 million data rows extracted from SIEM, Incident Management, Task Tracking, and CRM systems. Sixteen specific indicators were used, grouped into four categories: detection and filtering (TTD, FNR, FPR), response and resolution (TTR, IRR, SIHR), recovery and operations (MTTR, OE), satisfaction and risk management (CSR, SIER). The system includes ten specialized Agentic AI agents with clearly defined roles: monitoring time parameters, predicting false alarm probabilities, automatically triggering playbooks, calculating operational metrics, and analyzing customer satisfaction. Five machine learning models were trained: two XGBoost classifiers for FPR and FNR, two LightGBM regressors for TTR and MTTR, and a BERT model for textual feedback analysis. The results demonstrate reduced detection and response times, a lower rate of false alarms, and improved operational predictability in calculating KPI values. The methodology shows the applicability of Agentic AI for optimizing SOC processes on real and public data, without the need for manual intervention in most processing phases.
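Two of the KPIs this abstract groups into its categories, false positive rate (FPR) and mean time to repair (MTTR), are straightforward to compute from event rows. The field names and sample rows below are illustrative assumptions, not the study's schema.

```python
# Minimal KPI computation over incident rows. The schema is assumed
# for illustration; the study's data comes from SIEM/ITSM exports.

def false_positive_rate(alerts: list[dict]) -> float:
    """Share of non-threat events that were nonetheless flagged."""
    negatives = [a for a in alerts if not a["true_threat"]]
    fps = [a for a in negatives if a["flagged"]]
    return len(fps) / len(negatives) if negatives else 0.0

def mttr_minutes(incidents: list[dict]) -> float:
    """Mean open-to-resolved duration, in minutes."""
    durations = [i["resolved_at"] - i["opened_at"] for i in incidents]
    return sum(durations) / len(durations)

alerts = [
    {"flagged": True,  "true_threat": True},
    {"flagged": True,  "true_threat": False},   # false positive
    {"flagged": False, "true_threat": False},
    {"flagged": False, "true_threat": False},
]
incidents = [
    {"opened_at": 0,  "resolved_at": 30},
    {"opened_at": 10, "resolved_at": 90},
]
fpr = false_positive_rate(alerts)
mttr = mttr_minutes(incidents)
```

Agents that watch such metrics can then act on thresholds, for example triggering a playbook when FPR or MTTR drifts above its target.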
The integration of Agentic AI into SAP Finance represents a transformative advancement in enterprise financial management, combining autonomous decision-making capabilities with sophisticated data analytics to revolutionize traditional financial processes. This comprehensive article explores how Agentic AI is reshaping SAP Finance through enhanced automation of routine financial tasks, deployment of advanced predictive analytics for forecasting and risk assessment, and the provision of real-time financial intelligence that enables dynamic decision-making. By examining the technical architecture, implementation strategies, and organizational impacts, this article demonstrates how Agentic AI empowers finance professionals to transcend operational constraints and focus on strategic initiatives while simultaneously improving accuracy, compliance, and responsiveness in financial operations across the enterprise landscape.
Artificial intelligence is reshaping scientific discovery, yet its use in materials research remains limited by fragmented computational ecosystems, reproducibility challenges, and dependence on commercial large language models (LLMs). Here we introduce AGAPI (AtomGPT.org API), an open-access agentic AI platform that integrates more than eight open-source LLMs with over twenty materials-science API endpoints, unifying databases, simulation tools, and machine-learning models through a common orchestration framework. AGAPI employs an Agent-Planner-Executor-Summarizer architecture that autonomously constructs and executes multi-step workflows spanning materials data retrieval, graph neural network property prediction, machine-learning force-field optimization, tight-binding calculations, diffraction analysis, and inverse design. We demonstrate AGAPI through end-to-end workflows, including heterostructure construction, powder X-ray diffraction analysis, and semiconductor defect engineering requiring up to ten sequential operations. In addition, we evaluate AGAPI using 30+ example prompts as test cases and compare agentic predictions with and without tool access against experimental data. With more than 1,000 active users, AGAPI provides a scalable and transparent foundation for reproducible, AI-accelerated materials discovery. AGAPI-Agents codebase is available at https://github.com/atomgptlab/agapi.
The move toward open Sixth-Generation (6G) networks necessitates a novel approach to full-stack simulation environments for evaluating complex technology developments before prototyping and real-world implementation. This paper introduces an innovative approach that combines a multi-agent framework with the Network Simulator 3 (ns-3) to automate and optimize the generation, debugging, execution, and analysis of complex Fifth-Generation (5G) network scenarios (a lightweight, mock version of the code is available on GitHub at https://github.com/frezazadeh/LangChain-RAG-Technology). Our framework orchestrates a suite of specialized agents—namely, the Simulation Generation Agent, Test Designer Agent, Test Executor Agent, and Result Interpretation Agent—using advanced LangChain coordination. The Simulation Generation Agent employs a structured chain-of-thought (CoT) reasoning process, leveraging large language models (LLMs) and retrieval-augmented generation (RAG) to translate natural language simulation specifications into precise ns-3 scripts. Concurrently, the Test Designer Agent generates comprehensive automated test suites by integrating knowledge retrieval techniques with dynamic test case synthesis. The Test Executor Agent dynamically deploys and runs simulations, managing dependencies and parsing detailed performance metrics. At the same time, the Result Interpretation Agent utilizes LLM-driven analysis to extract actionable insights from the simulation outputs. By integrating external resources such as library documentation and ns-3 testing frameworks, our experimental approach can enhance simulation accuracy and adaptability, reducing reliance on extensive programming expertise. A detailed case study using the ns-3 5G-LENA module validates the effectiveness of the proposed approach. The code generation process converges in an average of 1.8 iterations, has a syntax error rate of 17.0%, a mean response time of 7.3 seconds, and receives a human evaluation score of 7.5.
No abstract available
This paper introduces SagaLLM, a structured multi-agent architecture designed to address four foundational limitations of current LLM-based planning systems: unreliable self-validation, context loss, lack of transactional safeguards, and insufficient inter-agent coordination. While recent frameworks leverage LLMs for task decomposition and multi-agent communication, they often fail to ensure consistency, rollback, or constraint satisfaction across distributed workflows. SagaLLM bridges this gap by integrating the Saga transactional pattern with persistent memory, automated compensation, and independent validation agents. It leverages LLMs' generative reasoning to automate key tasks traditionally requiring hand-coded coordination logic, including state tracking, dependency analysis, log schema generation, and recovery orchestration. Although SagaLLM relaxes strict ACID guarantees, it ensures workflow-wide consistency and recovery through modular checkpointing and compensable execution. Empirical evaluations across planning domains demonstrate that standalone LLMs frequently violate interdependent constraints or fail to recover from disruptions. In contrast, SagaLLM achieves significant improvements in consistency, validation accuracy, and adaptive coordination under uncertainty—establishing a robust foundation for real-world, scalable LLM-based multi-agent systems.
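The Saga transactional pattern at the core of SagaLLM can be illustrated with a minimal compensable-workflow sketch. The step and compensation functions here are hypothetical examples, not SagaLLM's API:

```python
# Minimal saga sketch: each step pairs an action with a compensation.
# On failure, already-completed steps are compensated in reverse order,
# restoring workflow-wide consistency without strict ACID guarantees.

class Saga:
    def __init__(self):
        self.steps = []                       # (action, compensation) pairs

    def add_step(self, action, compensation):
        self.steps.append((action, compensation))

    def run(self, log):
        done = []
        try:
            for action, comp in self.steps:
                action(log)                   # execute the step
                done.append(comp)             # remember how to undo it
        except Exception:
            for comp in reversed(done):       # rollback via compensation
                comp(log)
            return False
        return True

def fail_step(log):
    raise RuntimeError("car rental unavailable")

log = []
saga = Saga()
saga.add_step(lambda l: l.append("book_flight"),
              lambda l: l.append("cancel_flight"))
saga.add_step(lambda l: l.append("book_hotel"),
              lambda l: l.append("cancel_hotel"))
saga.add_step(fail_step, lambda l: None)
ok = saga.run(log)                            # third step fails, earlier steps undone
```

SagaLLM's contribution is to let an LLM generate the dependency analysis, log schemas, and compensation logic that are hand-coded here, with independent validation agents checking each step.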
Heterogeneous multi-agent systems (HMAS) comprise various intelligent agents with specialized functions, such as drones, ground robots, and automated devices, working in coordinated settings. This paper presents AutoHMA-LLM, a novel framework that combines Large Language Models (LLMs) with classical control algorithms to address the challenges of task coordination and scheduling in complex, dynamic environments. The framework is designed with a multi-tier architecture, utilizing a cloud-based LLM as the central planner alongside device-specific LLMs and Generative Agents to improve task execution efficiency and accuracy. Specifically targeting dynamic scenarios, the system enhances resource utilization and stabilizes task execution through refined task scheduling and real-time feedback mechanisms. In experiments conducted across logistics, inspection, and search & rescue scenarios, AutoHMA-LLM demonstrated a 5.7% improvement in task completion accuracy, a 46% reduction in communication steps, and a 31% decrease in token usage and API calls compared to baseline methods. These results highlight our framework’s scalability and efficiency, offering substantial support for effective multi-agent collaboration in complex, resource-constrained environments.
As large language models (LLMs) advance, there is growing interest in using them to simulate human social behavior through generative agent-based modeling (GABM). However, validating these models remains a key challenge. We present a systematic two-stage validation approach using social dilemma paradigms from psychological literature, first identifying the cognitive components necessary for LLM agents to reproduce known human behaviors in mixed-motive settings from two landmark papers, then using the validated architecture to simulate novel conditions. Our model comparison of different cognitive architectures shows that both persona-based individual differences and theory of mind capabilities are essential for replicating third-party punishment (TPP) as a costly signal of trustworthiness. For the second study on public goods games, this architecture is able to replicate an increase in cooperation from the spread of reputational information through gossip. However, an additional strategic component is necessary to replicate the additional boost in cooperation rates in the condition that allows both ostracism and gossip. We then test novel predictions for each paper with our validated generative agents. We find that TPP rates significantly drop in settings where punishment is anonymous, yet a substantial amount of TPP persists, suggesting that both reputational and intrinsic moral motivations play a role in this behavior. For the second paper, we introduce a novel intervention and see that open discussion periods before rounds of the public goods game further increase contributions, allowing groups to develop social norms for cooperation. This work provides a framework for validating generative agent models while demonstrating their potential to generate novel and testable insights into human social behavior.
Collaboration among multiple agents demands efficient communication. However, observational data in multi-agent systems are typically voluminous and redundant, posing substantial challenges to the communication system when transmitted directly. To address this issue, this paper introduces a multi-agent communication scheme based on large language models (LLMs), referred to as GPT-based semantic information extraction for multi-agent communication (GMAC). This scheme utilizes an LLM to extract semantic information and leverages its generative capabilities to predict subsequent actions, thereby enabling agents to make more informed decisions. By extracting key semantic data via the LLM, GMAC significantly reduces the signaling expenditure exchanged among agents. This method not only simplifies the communication process but also reduces communication overhead by approximately 53% compared to the baseline methods. Experimental results indicate that GMAC improves the convergence speed and accuracy of decision-making while substantially decreasing inter-agent signaling. Consequently, GMAC offers a straightforward and effective way to achieve efficient and economical communication in multi-agent systems.
Introduction: The surge in the capabilities of large language models (LLMs) has propelled the development of Artificial General Intelligence (AGI), highlighting generative agents as pivotal components for emulating complex AI behaviors. Given the high costs associated with individually training LLMs for each AI agent, there is a critical need for advanced memory retrieval mechanisms to maintain the unique characteristics and memories of individual AI agents. Methods: In this research, we developed a text-based simulation of a generative agent world, constructing a community with multiple agents and locations in which certain levels of interaction were enabled. Within this framework, we introduced a novel memory retrieval system using an Auxiliary Cross Attention Network (ACAN). This system calculates and ranks attention weights between an agent's current state and stored memories, selecting the most relevant memories for any given situation. In a novel approach, we incorporated LLM assistance, comparing memories retrieved by our model with those extracted using a base method during training, and constructing a novel loss function based on these comparisons to optimize the training process effectively. To our knowledge, this is the first study to utilize LLMs to train a dedicated agent memory retrieval network. Results: Our empirical evaluations demonstrate that this approach substantially enhances the quality of memory retrieval, thereby increasing the adaptability and behavioral consistency of agents in fluctuating environments. Discussion: Our findings not only introduce new perspectives and methodologies for memory retrieval in generative agents but also extend the utility of LLMs in memory management across varied AI agent applications.
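The core retrieval step, scoring stored memories against the agent's current state and keeping the top-ranked ones, can be sketched as follows. The toy embeddings and dot-product scoring are simplified stand-ins for the learned ACAN network:

```python
import math

# Sketch of attention-based memory retrieval: score each stored memory
# against the current state, softmax-normalize the scores into attention
# weights, and return the top-k memories.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def retrieve(state, memories, k=2):
    # state: embedding of the agent's current situation
    # memories: list of (text, embedding) pairs
    scores = [sum(s * m for s, m in zip(state, emb)) for _, emb in memories]
    weights = softmax(scores)
    ranked = sorted(zip(weights, memories), key=lambda p: -p[0])
    return [text for _, (text, _) in ranked[:k]]

memories = [
    ("met Alice at the cafe", [1.0, 0.0, 0.0]),
    ("bought groceries",      [0.0, 1.0, 0.0]),
    ("argued with Bob",       [0.9, 0.1, 0.0]),
]
state = [1.0, 0.0, 0.0]   # current situation resembles social encounters
top = retrieve(state, memories)
```

The paper's contribution replaces the fixed dot product with a trained cross-attention network whose loss is derived from LLM comparisons against a baseline retriever.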
With the proliferation of Large Language Models (LLMs), the detection of misinformation has become increasingly important and complex. This research proposes an innovative verifiable misinformation detection LLM agent that goes beyond traditional true/false binary judgments. The agent actively verifies claims through dynamic interaction with diverse web sources, assesses information source credibility, synthesizes evidence, and provides a complete verifiable reasoning process. Our designed agent architecture includes three core tools: precise web search tool, source credibility assessment tool and numerical claim verification tool. These tools enable the agent to execute multi-step verification strategies, maintain evidence logs, and form comprehensive assessment conclusions. We evaluate using standard misinformation datasets such as FakeNewsNet, comparing with traditional machine learning models and LLMs. Evaluation metrics include standard classification metrics, quality assessment of reasoning processes, and robustness testing against rewritten content. Experimental results show that our agent outperforms baseline methods in misinformation detection accuracy, reasoning transparency, and resistance to information rewriting, providing a new paradigm for trustworthy AI-assisted fact-checking.
This position paper examines the use of Large Language Models (LLMs) in social simulation, analyzing their potential and limitations from a computational social science perspective. We first review recent findings on LLMs' ability to replicate key aspects of human cognition, including Theory of Mind reasoning and social inference, while identifying persistent limitations such as cognitive biases, lack of grounded understanding, and behavioral inconsistencies. We then survey emerging applications of LLMs in multi-agent simulation frameworks, examining system architectures, scalability, and validation strategies. Projects such as Generative Agents (Smallville) and AgentSociety are analyzed with respect to their empirical grounding and methodological design. Particular attention is given to the challenges of behavioral fidelity, calibration, and reproducibility in large-scale LLM-driven simulations. Finally, we distinguish between contexts where LLM-based agents provide operational value, such as interactive simulations and serious games, and contexts where their use raises epistemic concerns, particularly in explanatory or predictive modeling. We argue that hybrid approaches integrating LLMs into established agent-based modeling platforms such as GAMA and NetLogo may offer a promising compromise between expressive flexibility and analytical transparency. Building on this analysis, we outline a conceptual research direction termed Hybrid Constitutional Architectures, which proposes a stratified integration of classical agent-based models (ABMs), small language models (SLMs), and LLMs within established platforms such as GAMA and NetLogo.
Large Language Models (LLMs) can perform chart question answering tasks but often generate unverified hallucinated responses. Existing answer attribution methods struggle to ground responses in source charts due to limited visual-semantic context, complex visual-text alignment requirements, and difficulties in bounding box prediction across complex layouts. We present ChartCitor, a multi-agent framework that provides fine-grained bounding box citations by identifying supporting evidence within chart images. The system orchestrates LLM agents to perform chart-to-table extraction, answer reformulation, table augmentation, evidence retrieval through pre-filtering and re-ranking, and table-to-chart mapping. ChartCitor outperforms existing baselines across different chart types. Qualitative user studies show that ChartCitor helps increase user trust in Generative AI by providing enhanced explainability for LLM-assisted chart QA and enables professionals to be more productive.
With the evolution of generative AI, multi-agent systems leveraging large language models (LLMs) have emerged as a powerful tool for complex tasks. However, these systems face challenges in quantifying agent performance and lack mechanisms to assess agent credibility. To address these issues, we introduce DRF, a dynamic reputation filtering framework. DRF constructs an interactive rating network to quantify agent performance, designs a reputation scoring mechanism to measure agent honesty and capability, and integrates an Upper Confidence Bound (UCB)-based strategy to enhance agent selection efficiency. Experiments show that DRF significantly improves task completion quality and collaboration efficiency in logical reasoning and code-generation tasks, offering a new approach for multi-agent systems to handle large-scale tasks.
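The UCB-based selection strategy mentioned above follows the standard UCB1 rule: pick the agent maximizing mean score plus an exploration bonus. The reputation values below are hypothetical stand-ins for DRF's scoring mechanism:

```python
import math

# UCB1-style agent selection: favor agents with high average reputation,
# plus an exploration bonus for rarely selected agents.

def select_agent(totals, counts, t, c=2.0):
    # totals[i]: cumulative reputation score of agent i
    # counts[i]: times agent i has been selected; t: total selections so far
    best, best_score = None, float("-inf")
    for i in range(len(totals)):
        if counts[i] == 0:
            return i                      # try every agent at least once
        mean = totals[i] / counts[i]
        bonus = math.sqrt(c * math.log(t) / counts[i])
        if mean + bonus > best_score:
            best, best_score = i, mean + bonus
    return best

totals = [8.0, 5.0, 9.0]    # cumulative reputation (hypothetical)
counts = [10, 10, 1]        # agent 2 barely explored
chosen = select_agent(totals, counts, t=21)
```

With these numbers the barely-explored agent 2 wins: its large exploration bonus outweighs the better-established averages of agents 0 and 1, which is exactly the exploration/exploitation trade-off the UCB strategy buys.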
The rapid expansion of optical networks has catalyzed the growth of data capacity, creating a new era of high-speed network services. However, as the scale and complexity of nodes and connections increase, coupled with increasingly stringent demands for service efficiency, quality, and resistance to interference, intelligent solutions are needed to achieve efficient and autonomous network operation and maintenance. In this study, we proposed an advanced artificial intelligence (AI) Agent empowered by large language model (LLM), aimed at providing a practical solution for autonomous optical network management. We leverage the powerful language processing and reasoning capabilities of the Generative Pre-trained Transformer (GPT-4), and integrate domain-specific knowledge and optical network tools to simplify maintenance workflows, reduce manual intervention, and improve operational efficiency. Acting as an intelligent assistant for optical network operations, the AI Agent is capable of providing real-time insights and optimization recommendations. In particular, we focus on four typical tasks: quality of transmission (QoT) estimation, performance analysis, optimization, and parameter calibration for physical-layer modeling, which are essential for ensuring service reliability and resource efficiency. Through the design, implementation, and evaluation of these tasks, we demonstrate the feasibility and effectiveness of the proposed agent in addressing key challenges of optical network maintenance. Furthermore, we provide an assessment of accuracy and reliability based on a predetermined scoring standard. The proposed solution not only enhances automation in network monitoring and optimization, but also provides a scalable and generalizable framework for LLM-based support in evolving optical transport environments.
Large language model multi-agent systems (LLM-MAS) offer a promising paradigm for harnessing collective intelligence to achieve more advanced forms of AI behaviour. While recent studies suggest that LLM-MAS can outperform LLM single-agent systems (LLM-SAS) on certain tasks, the lack of systematic experimental designs limits the strength and generality of these conclusions. We argue that a principled understanding of task complexity, such as the degree of sequential reasoning required and the breadth of capabilities involved, is essential for assessing the effectiveness of LLM-MAS in task solving. To this end, we propose a theoretical framework characterising tasks along two dimensions: depth, representing reasoning length, and width, representing capability diversity. We theoretically examine a representative class of LLM-MAS, namely the multi-agent debate system, and empirically evaluate its performance in both discriminative and generative tasks with varying depth and width. Theoretical and empirical results show that the benefit of LLM-MAS over LLM-SAS increases with both task depth and width, and the effect is more pronounced with respect to depth. This clarifies when LLM-MAS are beneficial and provides a principled foundation for designing future LLM-MAS methods and benchmarks.
Large language models (LLMs) now mediate many web-based mental-health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly understood and weakly evaluated. We present DialogGuard, a multi-agent framework for assessing psychosocial risks in LLM-generated responses along five high-severity dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. DialogGuard can be applied to diverse generative models through four LLM-as-a-judge pipelines, including single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting, grounded in a shared three-level rubric usable by both human annotators and LLM judges. Using PKU-SafeRLHF with human safety annotations, we show that multi-agent mechanisms detect psychosocial risks more accurately than non-LLM baselines and single-agent judging; dual-agent correction and majority voting provide the best trade-off between accuracy, alignment with human ratings, and robustness, while debate attains higher recall but over-flags borderline cases. We release DialogGuard as open-source software with a web interface that provides per-dimension risk scores and explainable natural-language rationales. A formative study with 12 practitioners illustrates how it supports prompt design, auditing, and supervision of web-facing applications for vulnerable users.
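The stochastic majority-voting pipeline can be sketched as repeated independent judgments aggregated per risk dimension. The deterministic toy judge below is a stand-in for an actual sampled LLM-as-a-judge call:

```python
from collections import Counter

# Majority voting over repeated LLM-judge calls: each call returns a risk
# level on a three-level rubric (0 = safe, 1 = borderline, 2 = high risk);
# the final score per dimension is the most common vote.

def majority_vote(judge, response, dimensions, n_votes=5):
    scores = {}
    for dim in dimensions:
        votes = [judge(response, dim) for _ in range(n_votes)]
        scores[dim] = Counter(votes).most_common(1)[0][0]
    return scores

# Deterministic stand-in judge for illustration; a real judge samples an
# LLM with the rubric in its prompt, so repeated votes can disagree.
def toy_judge(response, dim):
    return 2 if dim in response.lower() else 0

dims = ["manipulation", "insult", "privacy"]
result = majority_vote(toy_judge, "This reply contains manipulation.", dims)
```

Aggregating several stochastic votes smooths out single-call noise, which is why the paper finds majority voting more robust than single-agent judging on borderline cases.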
The rapid progress in Generative AI and Agent technologies is profoundly transforming enterprise data management and analytics. Traditional database applications and system deployment are fundamentally impacted by AI-driven tools, such as Retrieval-Augmented Generation (RAG) and vector database technologies, which provide new pathways for semantic querying over enterprise knowledge bases. Meanwhile, data security and compliance are top priorities for organizations adopting AI technologies. For enterprise data analysis, SQL generation powered by large language models (LLMs) and AI agents has emerged as a key bridge connecting natural language with structured data, effectively lowering the barrier to enterprise data access and improving analytical efficiency. This paper focuses on enterprise data analysis applications and system deployment, covering a range of innovative frameworks enabling complex query understanding, multi-agent collaboration, security verification, and computational efficiency. Through representative use cases, key challenges related to distributed deployment, data security, and inherent difficulties in SQL generation tasks are discussed.
Large Language Model-based Multi-Agent Systems (MASs) have emerged as a powerful paradigm for tackling complex tasks through collaborative intelligence. However, the topology of these systems, that is, how agents in MASs should be configured, connected, and coordinated, remains largely unexplored. In this position paper, we call for a paradigm shift toward topology-aware MASs that explicitly model and dynamically optimize the structure of inter-agent interactions. We identify three fundamental components (agents, communication links, and overall topology) that collectively determine the system's adaptability, efficiency, robustness, and fairness. To operationalize this vision, we introduce a systematic three-stage framework: 1) agent selection, 2) structure profiling, and 3) topology synthesis. This framework not only provides a principled foundation for designing MASs but also opens new research frontiers across language modeling, reinforcement learning, graph learning, and generative modeling to ultimately unleash their full potential in complex real-world applications. We conclude by outlining key challenges and opportunities in MASs evaluation. We hope our framework and perspectives offer critical new insights in the era of agentic AI.
We present an autonomous agent for cyber threat hunting that integrates structured knowledge from human-authored playbooks with the generative and reasoning capabilities of large language models (LLMs). Building on a prior LLM-based threat-hunting framework, this work introduces additional components for playbook ingestion, hunt script validation, co-occurrence modeling, prompt-based generation, and tuning-based adaptation. The agent generates, validates and refines hunt scripts in a sandboxed environment, identifies behaviorally consistent scripts using consensus voting, and constructs a threat co-occurrence matrix to inform hunt scheduling. Unlike the earlier system that relied solely on natural language prompting, our design enables the agent to learn from execution feedback and adapt its models over time. This work contributes a scalable, self-adaptive and resilient foundation for LLM-driven cyber threat hunting systems.
We explored the insufficient process-level support for designers employing text-to-image generative AI tools in visual design. Although these tools facilitate swift image creation via natural language prompts, they often reduce the design process to mere prompt editing, offering minimal assistance with task comprehension, ideation, or reflection. To bridge this gap, we propose a multi-agent workflow powered by a Large Language Model (LLM), aligned with the Double Diamond model, with agent roles corresponding to the Discover, Define, Develop, and Deliver phases. A functional prototype was developed on the n8n platform to enable structured and traceable image-generation processes. We conducted a small exploratory study with four design-trained participants and a blind review by three expert evaluators. Compared to baseline tools, the workflow received higher ratings in process controllability and outcome–intent alignment in this small sample (≈ +1.0 on a 5-point scale), though at the cost of higher perceived cognitive load and several additional minutes of task time per brief. Given the limited sample size, single-task scope, and reliance on subjective measures, these findings are preliminary. Nonetheless, our results provide initial evidence that structured agentic workflows can enhance co-creation with generative AI, laying the groundwork for future studies involving more complex tasks, toolchains, and participant groups.
Simulating consumer decision-making is vital for designing and evaluating marketing strategies before costly real-world deployment. However, post-event analyses and rule-based agent-based models (ABMs) struggle to capture the complexity of human behavior and social interaction. We introduce an LLM-powered multi-agent simulation framework that models consumer decisions and social dynamics. Building on recent advances in large language model simulation in a sandbox environment, our framework enables generative agents to interact, express internal reasoning, form habits, and make purchasing decisions without predefined rules. In a price-discount marketing scenario, the system delivers actionable strategy-testing outcomes and reveals emergent social patterns beyond the reach of conventional methods. This approach offers marketers a scalable, low-risk tool for pre-implementation testing, reducing reliance on time-intensive post-event evaluations and lowering the risk of underperforming campaigns.
This paper proposes an epistemological shift in the analysis of large generative models, replacing the category "Large Language Models" (LLM) with that of "Large Discourse Models" (LDM), and then with that of the Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the "fascination/fear" dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.
Modeling complex human behavior, such as voter decisions in national elections, is a long-standing challenge for computational social science. Traditional agent-based models (ABMs) are limited by oversimplified rules, while large-scale statistical models often lack interpretability. We introduce FlockVote, a novel framework that uses Large Language Models (LLMs) to build a "computational laboratory" of LLM agents for political simulation. Each agent is instantiated with a high-fidelity demographic profile and dynamic contextual information (e.g. candidate policies), enabling it to perform nuanced, generative reasoning to simulate a voting decision. We deploy this framework as a testbed on the 2024 U.S. Presidential Election, focusing on seven key swing states. Our simulation's macro-level results successfully replicate the real-world outcome, demonstrating the high fidelity of our "virtual society". The primary contribution is not only the prediction, but also the framework's utility as an interpretable research tool. FlockVote moves beyond black-box outputs, allowing researchers to probe agent-level rationale and analyze the stability and sensitivity of LLM-driven social simulations.
Advancements in artificial intelligence (AI) are having a significant impact across various domains. Among these, generative AI models – particularly large language models (LLMs) – have brought substantial changes to a wide range of activities and professions. As these models are capable of handling complex business workflows and interacting with external tools (e.g., via APIs), their agentic behaviour becomes especially valuable for supporting diverse workflows. Software development is one of the domains that can greatly benefit from the use of LLM-based agents. We have developed a software developer agent using the LangGraph platform, capable of generating and validating the software requirements specification (SRS), design document specification (DDS), project structure, function implementations, and corresponding unit tests – all based on a well-formulated input prompt. After successfully testing the agent on the relatively simple task of generating a Snake game, we applied it to a more complex problem: an optimization scenario. A simulation framework was developed to optimize port logistics processes, such as routing and scheduling trucks and vessels. This framework allows for port structure customization and, thanks to its modular design, supports the evaluation of various optimization algorithms. Using this framework, we investigated how our software developer agent performs on a significantly more complex coding task compared to the Snake game. We also compared the results with those generated by two widely used LLM-based systems: ChatGPT and Gemini Code Assist.
The cold-chain supply of perishable fruits continues to face challenges such as fuel wastage, fragmented stakeholder coordination, and limited real-time adaptability. Traditional solutions, based on static routing and centralized control, fall short in addressing the dynamic, distributed, and secure demands of modern food supply chains. This study presents a novel end-to-end architecture that integrates multi-agent reinforcement learning (MARL), blockchain technology, and generative artificial intelligence. The system features large language model (LLM)-mediated negotiation for inter-enterprise coordination, Pareto-based reward optimization balancing spoilage, energy consumption, delivery time, and climate and emission impact. Smart contracts and Non-Fungible Token (NFT)-based traceability are deployed over a private Ethereum blockchain to ensure compliance, trust, and decentralized governance. Modular agents—trained using centralized training with decentralized execution (CTDE)—handle routing, temperature regulation, spoilage prediction, inventory, and delivery scheduling. Generative AI simulates demand variability and disruption scenarios to strengthen resilient infrastructure. Experiments demonstrate up to 50% reduction in spoilage, 35% energy savings, and 25% lower emissions. The system also cuts travel time by 30% and improves delivery reliability and fruit quality. This work offers a scalable, intelligent, and sustainable supply chain framework, especially suitable for resource-constrained or intermittently connected environments, laying the foundation for future-ready food logistics systems.
The emergence of generative artificial intelligence (GenAI) marks a significant breakthrough in the realm of AI. Recently, GenAI and large language models (LLMs) have garnered tremendous attention due to their capability to automatically generate data based on the given original patterns and dataset. However, traditional GenAI-LLM mechanisms may result in low-quality output content and considerable creation time. The Internet of Agents (IoA) can address these challenges by providing a flexible and scalable platform for integrating diverse agents, enabling seamless communication and coordination in mobile environments. Therefore, in this work, we explore the integration of multi-agent GenAI-LLMs enlightened by IoA. Specifically, we first provide a brief introduction to multi-agent GenAI-LLMs and their applications in different domains. Then, we demonstrate the potential of deploying the multi-agent GenAI-LLMs in mobile edge networks. Subsequently, we discuss the emerging applications and challenges when deploying multi-agent GenAI-LLMs in mobile edge networks. In the following, we propose a novel multi-agent GenAI-LLM architecture for mobile edge networks. Moreover, we conduct a case study to show the effectiveness of the proposed architecture by applying it to generate high-quality solutions in uncrewed aerial vehicle (UAV) networks. Finally, several potential research directions for GenAI-LLMs in mobile edge networks are discussed.
The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D. Codes and pretrained models are available at https://github.com/Badi-Li/GOAL.
The rise of generative AI and the emergence of Large Language Models (LLMs) have sparked a new line of research on the evolution of strategic information retrieval ecosystems. My research adopts an agent-based perspective to analyze sponsored search and competitive search ecosystems. The primary motivation of my work is to examine how agent strategies influence social welfare and the existence of equilibrium. To that end, I leverage frameworks from game theory and mechanism design.
The potential of Large Language Models (LLMs) continues to grow as new applications are developed to enhance systems' ability to understand and generate human language. Recent advances in LLMs and the introduction of the Model Context Protocol (MCP) have paved the way for exploring possibilities to strengthen existing systems with Generative AI (GenAI). This paper explores the enhancement of a multimodal tourism bot with agentic capabilities to improve the services provided to tourists. It addresses common challenges faced by tourists, such as language barriers, travel arrangements, accommodation, cost, time management, safety, and comfort. The paper presents an MCP-enabled, agent-based voice assistance architecture, focusing on its communication and data fetching from multiple sources to provide real-time services to tourists. The implementation used the DistilBERT model for local database search. A key innovation is the hybrid reasoning framework that prioritizes structured local search and selectively uses GPT for web search as a fallback for unmatched queries, ensuring both performance and reliability. The architecture supports voice and text input, operates across languages, and maintains session-level memory for multi-turn interactions. It also improves the ability to integrate real-time data with traditional LLMs, making it better suited to dynamic needs.
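The hybrid local-first/fallback routing described above can be sketched roughly as follows; the `LOCAL_FAQ` table, keyword matching, and function names are illustrative assumptions standing in for the DistilBERT-based search and the GPT web-search fallback:

```python
# Minimal sketch of hybrid reasoning: try a structured local search first,
# use the web/LLM fallback only for unmatched queries. The table and the
# naive keyword match are assumptions, not the paper's implementation.

LOCAL_FAQ = {
    "museum hours": "Open 9:00-17:00, closed Mondays.",
    "airport shuttle": "Shuttles leave every 30 minutes from gate B.",
}

def local_search(query):
    """Naive keyword match standing in for the DistilBERT-based search."""
    for key, answer in LOCAL_FAQ.items():
        if key in query.lower():
            return answer
    return None

def answer_query(query, web_fallback):
    """Prefer the local database; call the fallback only when unmatched."""
    hit = local_search(query)
    if hit is not None:
        return ("local", hit)
    return ("fallback", web_fallback(query))
```

Keeping the fallback behind a miss on the local index is what gives the design its latency and reliability properties: most queries never leave the local system.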
Financial trading analytics increasingly demands modular, explainable, and adaptive intelligence systems capable of handling volatile market conditions and multimodal data streams. Recent advances in generative Artificial Intelligence (AI), Large Language Models (LLMs), and graph-based representations have enabled the creation of intelligent agents that can reason over complex, interconnected financial data. We introduce GWise, a graph-structured, generative AI-enabled multi-agent framework for real-time financial trading analytics delivered via secure web services. GWise models financial decision making as a directed computational graph of specialized analytical agents, including technical, fundamental, sentiment, and risk analysis crews, whose outputs are orchestrated through a memory-augmented LLM. This graph-structured design enables transparent, adaptive, and explainable trade recommendations that evolve over time. We demonstrate how the framework's agent orchestration forms a dynamic service graph, facilitating composability, fault isolation, and scalable deployment through cloud-native APIs. Extensive back-testing and simulated market conditions show that GWise outperforms traditional strategies in risk-adjusted returns while offering improved interpretability and service robustness. Our work illustrates how graph-based multi-agent coordination and generative reasoning can advance real-time financial analytics as a service.
While the swift advancement of cloud-based Large Language Models (LLMs) has significantly increased the efficiency and automation in business processes, it has also introduced considerable privacy concerns regarding Personally Identifiable Information (PII) and other protected data in multimodal forms, such as text, video, or images, being exported, potentially insecurely, outside the corporate environments. Although traditional anonymization-based techniques can alleviate these risks in offline applications, such as summarization or classification, incorporating them into online LLM workflows poses substantial challenges, particularly when these workflows encompass real-time transactions involving multiple stakeholders, as commonly observed in multi-agent generative AI applications. This study explores these challenges and proposes novel context-aware privacy frameworks and methods to address these issues. We employ a local privacy-focused gatekeeper LLM to contextually pseudonymize PII and assign unique identifiers as part of a new mapping process, thereby facilitating re-identification in real-time operations while safeguarding privacy when interacting with cloud-based LLMs. Our proposed methodologies and frameworks adeptly integrate privacy considerations into LLM and LLM Agent workflows, preserving both privacy and data utility while maintaining operational efficiency comparable to non-anonymized generative AI processes.
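The gatekeeper's pseudonymize-then-re-identify mapping might look something like this minimal sketch; the naive full-name regex and the `<PII_n>` token format are assumptions of this sketch, not the paper's context-aware method:

```python
# Sketch of the gatekeeper idea: replace PII with unique identifiers before
# calling a cloud LLM, keep a local mapping, and re-identify the response.
# The regex and token format are illustrative assumptions.

import itertools
import re

class Gatekeeper:
    def __init__(self):
        self._counter = itertools.count(1)
        self._mapping = {}        # pseudonym token -> original value

    def pseudonymize(self, text, pii_pattern=r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"):
        """Replace matches (here: a naive full-name regex) with stable tokens."""
        def repl(match):
            token = f"<PII_{next(self._counter)}>"
            self._mapping[token] = match.group(0)
            return token
        return re.sub(pii_pattern, repl, text)

    def reidentify(self, text):
        """Restore the original values in a cloud-LLM response."""
        for token, original in self._mapping.items():
            text = text.replace(token, original)
        return text
```

Because the mapping never leaves the local gatekeeper, the cloud model sees only the opaque tokens, while downstream consumers still receive fully re-identified output.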
This paper presents a prototype AI-powered decision support system that leverages large language models (LLMs) and digital agents to assist furnace-stage operations in transformer manufacturing. The system integrates real furnace sensor data with synthetic manuals and expert insights using a modular architecture that combines Retrieval-Augmented Generation (RAG), structured prompting, multi-agent coordination, and response validation. It processes natural language queries to generate context-aware responses grounded in process data and documentation, demonstrating the potential of domain-adapted generative AI for industrial support. Experiments show a 100% improvement over a baseline LLM, though performance remains 29% below ChatGPT-4o, indicating both promise and areas for future improvement.
Conceptual design decisions critically influence product performance, cost, and sustainability, yet integrating rigorous feasibility evaluation early in this creative phase remains challenging. While generative AI accelerates concept generation, current methods often lack mechanisms to assess the feasibility of proposed designs. To bridge this gap, this paper presents DesignAgent, an LLM-based multi-agent system that assists early-stage product design and evaluation. The system features specialised agents that collaborate to interpret requirements, produce 3D prototypes, and automatically evaluate feasibility through integrated finite element analysis. This agent-driven framework facilitates an automated, iterative design loop where simulation feedback informs concept refinement. We evaluated the system through a case study involving 104 simulated design sessions for three types of UAV landing gear. A comprehensive assessment involving human expert review, AI (LLM) evaluation, and quantitative analysis demonstrates the high proficiency of the system. The system achieved over 90% accuracy in core tasks, effectively utilised FEA feedback with an 84.3% meaningful refinement rate, and also showed excellent adherence to engineering constraints and effective parameter refinement towards improved design quality. A user study further showed that DesignAgent achieved higher design accuracy and solution quality, lower workload, and better human–AI collaboration than the baseline systems.
Existing generative-coding approaches to distribution grid modeling and optimization face several issues, such as complicated usage and high error rates in auto-generated code. This paper proposes OptDisPro, a novel LLM-based multi-agent framework that enables automatic optimal power flow (OPF) script modeling and solving. Driven by interactive linguistic instructions, it realizes automatic coding for customized requirements and flexibly adaptive heuristic optimization. Specifically, domain expertise and example scripts are encoded into structured prompt sequences to guide OptDisPro and enhance reasoning via Chain-of-Thought (CoT). To mitigate LLM hallucinations, a contextual feedback mechanism is introduced, which collects error messages from the run-time environment for self-correction. Furthermore, Adaptive Selection of Multiple Algorithms (ASMA) is applied in the solving process, flexibly selecting heuristic algorithms to reduce the likelihood of local optima. Case verification across multiple scenarios demonstrates the effectiveness and stability of OptDisPro on OPF problems in distribution networks. The results also encourage further exploration of LLM applications in online script self-updating, autonomous OPF problem-solving, and intelligent operation within distribution grids.
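The contextual feedback mechanism (collect run-time errors, feed them back for self-correction) can be sketched as a simple loop; `generate` below stands in for the LLM call and the round budget is an assumption of this sketch:

```python
# Hedged sketch of a run-time feedback loop: execute a generated script,
# capture any error, and pass the error message back to the generator so
# the next candidate can self-correct. `generate` is a stand-in callable.

def run_with_feedback(generate, max_rounds=3):
    """Iteratively execute candidate scripts until one runs cleanly."""
    error = None
    for _ in range(max_rounds):
        script = generate(error)           # the error message guides regeneration
        try:
            namespace = {}
            exec(script, namespace)        # stand-in for the run-time environment
            return namespace.get("result"), None
        except Exception as exc:           # collect the error for the next round
            error = f"{type(exc).__name__}: {exc}"
    return None, error
```

The key property is that the generator is never asked to fix code blindly: each retry is conditioned on the concrete error message from the previous execution.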
With the spread of generative AI in recent years, attacks known as Whaling have become a serious threat. Whaling is a form of social engineering that targets important high-authority individuals within organizations and uses sophisticated fraudulent emails. In the context of Japanese universities, faculty members frequently hold positions that combine research leadership with authority within institutional workflows. This structural characteristic leads to the wide public disclosure of high-value information such as publications, grants, and detailed researcher profiles. Such extensive information exposure enables the construction of highly precise target profiles using generative AI. This raises concerns that Whaling attacks based on high-precision profiling by generative AI will become prevalent. In this study, we propose a Whaling countermeasure framework for university faculty members that constructs personalized defense profiles and uses large language model (LLM)-based agents. We design agents that (i) build vulnerability profiles for each target from publicly available information on faculty members, (ii) identify potential risk scenarios relevant to Whaling defense based on those profiles, (iii) construct defense profiles corresponding to the vulnerabilities and anticipated risks, and (iv) analyze Whaling emails using the defense profiles. Furthermore, we conduct a preliminary risk-assessment experiment. The results indicate that the proposed method can produce judgments accompanied by explanations of response policies that are consistent with the work context of faculty members who are Whaling targets. The findings also highlight practical challenges and considerations for future operational deployment and systematic evaluation.
Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator's evolution. Our code is available at https://github.com/OpenBMB/ChatDev/tree/puppeteer.
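A toy version of the puppeteer-style orchestration might look like this; here the policy is an arbitrary callable rather than the RL-trained orchestrator, and all names are illustrative:

```python
# Toy sketch of the puppeteer paradigm: a central orchestrator observes the
# evolving task state and picks the next agent ("puppet") to act. The
# policy here is a plain function, not the RL-trained policy in the paper.

def orchestrate(agents, policy, state, max_steps=5):
    """Let the policy sequence agents until one marks the task done."""
    trace = []
    for _ in range(max_steps):
        name = policy(state)               # the puppeteer picks a puppet
        state = agents[name](state)        # the puppet updates shared state
        trace.append(name)
        if state.get("done"):
            break
    return state, trace
```

Training `policy` with reinforcement learning over such traces is what lets the sequencing itself adapt, instead of being fixed by a hand-written organizational structure.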
Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy with dynamic entropy regulation, progressively teaching the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL outperforms both open-source and proprietary Med-LVLMs. Notably, it achieves an average performance gain of 23.6% over strong baselines.
In the field of MLLM-based GUI agents, compared to smartphones, the PC scenario not only features a more complex interactive environment, but also involves more intricate intra- and inter-app workflows. To address these issues, we propose a hierarchical agent framework named PC-Agent. Specifically, from the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes decision-making processes into Instruction-Subtask-Action levels. Within this architecture, three agents (i.e., Manager, Progress and Decision) are set up for instruction decomposition, progress tracking and step-by-step decision-making respectively. Additionally, a Reflection agent is adopted to enable timely bottom-up error feedback and adjustment. We also introduce a new benchmark PC-Eval with 25 real-world complex instructions. Empirical results on PC-Eval show that our PC-Agent achieves a 32% absolute improvement of task success rate over previous state-of-the-art methods. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/PC-Agent.
The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.
Recent progress in large language model (LLM)-based multi-agent collaboration highlights the power of structured communication in enabling collective intelligence. However, existing methods largely rely on static or graph-based inter-agent topologies, lacking adaptability and flexibility in communication. In this work, we propose a new framework that rethinks multi-agent coordination through a sequential structure rather than a graph structure, offering a significantly larger topology space for multi-agent communication. Our method focuses on two key directions: (1) Next-Agent Prediction, which selects the most suitable agent role at each step, and (2) Next-Context Selection (NCS), which enables each agent to selectively access relevant information from any previous step. Together, these components construct task-adaptive communication pipelines that support both role flexibility and global information flow. Extensive evaluations across multiple benchmarks demonstrate that our approach achieves superior performance while substantially reducing communication overhead.
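The two components, Next-Agent Prediction and Next-Context Selection, can be sketched as a sequential pipeline; the predictors below are plain callables standing in for the learned modules, and all names are assumptions of this sketch:

```python
# Sketch of sequential coordination: at each step, predict which agent role
# should act next, then let that agent choose which earlier outputs to read.
# Both predictors are stand-in callables, not the paper's learned modules.

def run_pipeline(roles, next_agent, select_context, task, max_steps=4):
    """Build a task-adaptive pipeline one step at a time."""
    history = []                                  # (role, output) per step
    for _ in range(max_steps):
        role = next_agent(task, history)          # Next-Agent Prediction
        if role is None:                          # predictor ends the pipeline
            break
        context = select_context(role, history)   # Next-Context Selection (NCS)
        output = roles[role](task, context)
        history.append((role, output))
    return history
```

Because the agent sequence and the visible context are both chosen per step, the effective topology is constructed during execution rather than fixed in advance.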
Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation for complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, a hierarchical compositional diffusion module utilizes a Gaussian mask and filtering to refine bounding-box regions and enhance objects through region enhancement, resulting in the accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
Multi-agent systems (MAS) based on large language models (LLMs) have demonstrated significant potential in collaborative problem-solving. However, they still face substantial challenges of low communication efficiency and suboptimal task performance, making the careful design of the agents' communication topologies particularly important. Inspired by the management theory that roles in an efficient team are often dynamically adjusted, we propose AgentDropout, which identifies redundant agents and communication across different communication rounds by optimizing the adjacency matrices of the communication graphs and eliminates them to enhance both token efficiency and task performance. Compared to state-of-the-art methods, AgentDropout achieves an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption, along with a performance improvement of 1.14 on the tasks. Furthermore, the extended experiments demonstrate that AgentDropout achieves notable domain transferability and structure robustness, revealing its reliability and effectiveness. We release our code at https://github.
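One illustrative reading of the AgentDropout idea, pruning the communication graph's adjacency matrix and dropping agents left isolated, is sketched below; the paper optimizes the matrices rather than thresholding fixed weights, so the threshold and function names here are purely assumptions:

```python
# Illustrative sketch: represent inter-agent communication as a weighted
# adjacency matrix, zero out weak edges, and report agents that end up with
# no incoming or outgoing edges (candidates for dropout). The fixed
# threshold is an assumption; the paper learns the matrices instead.

def prune_topology(adj, edge_threshold=0.3):
    """Zero out weak edges; return the pruned matrix and isolated agents."""
    n = len(adj)
    pruned = [[w if w >= edge_threshold else 0.0 for w in row] for row in adj]
    isolated = [i for i in range(n)
                if all(pruned[i][j] == 0.0 for j in range(n))
                and all(pruned[j][i] == 0.0 for j in range(n))]
    return pruned, isolated
```

Removing an isolated agent saves every prompt and completion token it would otherwise have consumed, which is where the reported token reductions come from.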
The dawn of embodied intelligence has ushered in an unprecedented imperative for resilient, cognition-enabled multi-agent collaboration across next-generation ecosystems, revolutionizing paradigms in autonomous manufacturing, adaptive service robotics, and cyber-physical production architectures. However, current robotic systems face significant limitations, such as limited cross-embodiment adaptability, inefficient task scheduling, and insufficient dynamic error correction. While end-to-end VLA models demonstrate inadequate long-horizon planning and task generalization, hierarchical VLA models suffer from a lack of cross-embodiment and multi-agent coordination capabilities. To address these challenges, we introduce RoboOS, the first open-source embodied system built on a Brain-Cerebellum hierarchical architecture, enabling a paradigm shift from single-agent to multi-agent intelligence. Specifically, RoboOS consists of three key components: (1) Embodied Brain Model (RoboBrain), an MLLM designed for global perception and high-level decision-making; (2) Cerebellum Skill Library, a modular, plug-and-play toolkit that facilitates seamless execution of multiple skills; and (3) Real-Time Shared Memory, a spatiotemporal synchronization mechanism for coordinating multi-agent states. By integrating hierarchical information flow, RoboOS bridges the Embodied Brain and Cerebellum Skill Library, facilitating robust planning, scheduling, and error correction for long-horizon tasks, while ensuring efficient multi-agent collaboration through Real-Time Shared Memory. Furthermore, we enhance edge-cloud communication and cloud-based distributed inference to facilitate high-frequency interactions and enable scalable deployment. Extensive real-world experiments across various scenarios demonstrate RoboOS's versatility in supporting heterogeneous embodiments. Project website: https://github.com/FlagOpen/RoboOS
Recently, with the development of tool-calling capabilities in large language models (LLMs), these models have demonstrated significant potential for automating electronic design automation (EDA) flows by interacting with EDA tool APIs via EDA scripts. However, considering the limited understanding of EDA tools, LLMs face challenges in practical scenarios where diverse interfaces of EDA tools exist across different platforms. Additionally, EDA flow automation often involves intricate, long-chain tool-calling processes, increasing the likelihood of errors in intermediate steps. Any errors will lead to the instability and failure of EDA flow automation. To address these challenges, we introduce EDAid, a multi-agent collaboration system where multiple agents harboring divergent thoughts converge towards a common goal, ensuring reliable and successful EDA flow automation. Specifically, each agent is controlled by ChipLlama models, which are expert LLMs fine-tuned for EDA flow automation. Our experiments demonstrate the state-of-the-art (SOTA) performance of our ChipLlama models and validate the effectiveness of our EDAid in the automation of complex EDA flows, showcasing superior performance compared to single-agent systems.
The use of large language models (LLMs) for automated code generation has emerged as a significant focus within AI research. As these pretrained models continue to evolve, their ability to understand and generate complex code structures has opened new possibilities for automating intricate programming tasks and producing accurate code. Although contemporary foundational models demonstrate promising results, researchers continue to explore optimal post-training strategies to enhance code quality, including supervised fine-tuning, retrieval-augmented generation (RAG), debugging, and many others. In this paper, we combine two widely used approaches, namely multi-agent collaboration and debugging based on runtime execution information, to improve code generation functionality, reliability, and practical applicability. We perform an empirical study to extend the evaluation of the individual strategies as well as the proposed composition of both. Our study uses 19 LLMs to examine the performance of the individual and combined strategies, offering comprehensive insights into how different compositions of programming activities and training paradigms influence code generation effectiveness. In particular, we implement a chained system that combines both strategies to assess their combined impact on functional accuracy, code reliability, and generation latency using two benchmark datasets commonly used for code generation. Our findings provide valuable insights for organizations seeking robust AI-driven coding solutions by guiding them in selecting models that can better adapt to complex post-training strategies, ultimately fostering the adoption of more effective and reliable code generation technologies.
Recent advances in code understanding and generation demonstrate that code LLMs fine-tuned on a high-quality instruction dataset can gain powerful capabilities to address wide-ranging code-related tasks. However, most existing methods view each programming language in isolation and ignore knowledge transfer among different programming languages. To bridge the gap among different programming languages, we introduce a novel multi-agent collaboration framework to enhance multilingual instruction tuning for code LLMs, where multiple language-specific intelligent agents with generation memory work together to transfer knowledge from one language to another efficiently and effectively. Specifically, we first generate language-specific instruction data from code snippets and then provide the generated data as seed data for the language-specific agents. Multiple language-specific agents discuss and collaborate to formulate a new instruction and its corresponding solution, in either a new or an existing programming language. To further encourage cross-lingual transfer, each agent stores its generation history as memory and then summarizes its merits and faults. Finally, the high-quality multilingual instruction data is used to train Qwen2.5-xCoder and to encourage knowledge transfer among different programming languages. Experimental results on multilingual programming benchmarks demonstrate the superior performance of Qwen2.5-xCoder in sharing common knowledge, highlighting its potential to reduce the cross-lingual gap.
Multi-agent systems, where specialized agents collaborate to solve a shared task, hold great potential, from increased modularity to simulating complex environments. However, they also have a major caveat: a single agent can cause the entire system to fail. Consider a simple game where the knowledge to solve the task is distributed between agents, which share information in a communication channel. At each round, any of the agents can terminate the game and make the final prediction, even if they are uncertain about the outcome of their action. Detecting such rogue agents before they act may prevent the system's failure. In this work, we propose to monitor agents during action prediction and intervene when a future error is likely to occur. To test our approach, we introduce WhoDunitEnv, a multi-agent collaboration environment that allows modular control over task complexity and communication structure. Experiments on WhoDunitEnv, code generation tasks, and the GovSim environment for resource sustainability show that our approach leads to substantial performance gains of up to 17.4%, 2.5%, and 20%, respectively. Thorough analysis shows that our monitors successfully identify critical points of agent confusion and that our interventions effectively stop agent errors from propagating.
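The monitor-and-intervene step can be sketched minimally as gating a risky termination on the monitor's confidence estimate; the threshold, signature, and action names are assumptions of this sketch:

```python
# Minimal sketch of monitoring during action prediction: if an agent wants
# to terminate the game while the monitor's confidence in that action is
# low, intervene and keep the deliberation going. Names are illustrative.

def step_with_monitor(agent_action, confidence, threshold=0.7):
    """Allow the proposed action only if the monitor deems it safe enough."""
    if agent_action == "terminate" and confidence < threshold:
        return "continue"                  # intervention: block early termination
    return agent_action
```

The intervention only ever blocks the irreversible action; ordinary communication steps pass through unchanged, which keeps the monitor's overhead low.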
Communication has been widely employed to enhance multi-agent collaboration. Previous research has typically assumed delay-free communication, a strong assumption that is challenging to meet in practice. In reality, agents suffer from channel delays and receive messages sent at different time points, termed Asynchronous Communication, leading to cognitive biases and breakdowns in collaboration. This paper first defines two communication delay settings in MARL and emphasizes their harm to collaboration. To handle these delays, this paper proposes a novel framework, Communication Delay-Tolerant Multi-Agent Collaboration (CoDe). First, CoDe learns an intent representation as messages through future action inference, reflecting the stable future behavioral trends of the agents. Then, CoDe devises a dual alignment mechanism of intent and timeliness to strengthen the fusion process of asynchronous messages. In this way, agents can extract the long-term intent of others, even from delayed messages, and selectively utilize the most recent messages that are relevant to their intent. Experimental results demonstrate that CoDe outperforms baseline algorithms in three MARL benchmarks without delay and exhibits robustness under fixed and time-varying delays.
Multi-agent collaboration among models has shown promise in reasoning tasks but is underexplored in long-form generation tasks like summarization and question-answering. We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement, i.e., revising model-generated outputs to remove factual inconsistencies. We investigate how iterative collaboration among multiple instances and types of large language models (LLMs) enhances subtasks in the refinement process, such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. We design intrinsic evaluations for each subtask, with our findings indicating that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing. Additionally, reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. We consolidate these insights into a final "recipe" called Multi-Agent Multi-Model Refinement (MAMM-Refine), where multi-agent and multi-model collaboration significantly boosts performance on three summarization datasets as well as on long-form question answering, demonstrating the effectiveness and generalizability of our recipe.
Vision-language models (VLMs) have demonstrated remarkable capabilities in robotic planning, particularly for long-horizon tasks that require a holistic understanding of the environment for task decomposition. Existing methods typically rely on prior environmental knowledge or carefully designed task-specific prompts, making them struggle with dynamic scene changes or unexpected task conditions, e.g., a robot attempts to put a carrot in the microwave but finds the door closed. Such challenges underscore two critical issues: adaptability and efficiency. To address them, in this work, we propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution through continuous reflection and self-evolution. REMAC incorporates two key modules: a self-reflection module performing pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module dynamically adapting plans based on scene-specific reasoning. It offers several appealing benefits: 1) Robots can initially explore and reason about the environment without complex prompt design. 2) Robots can keep reflecting on potential planning errors and adapting the plan based on task-specific insights. 3) After iterations, a robot can call another one to coordinate tasks in parallel, maximizing the task execution efficiency. To validate REMAC's effectiveness, we build a multi-agent environment for long-horizon robot manipulation and navigation based on RoboCasa, featuring 4 task categories with 27 task styles and 50+ different objects. Based on it, we further benchmark state-of-the-art reasoning models, including DeepSeek-R1, o3-mini, QwQ, and Grok3, demonstrating REMAC's superiority by boosting average success rates by 40% and execution efficiency by 52.7% over the single robot baseline.
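The self-reflection module's pre-condition and post-condition checks can be sketched as a plan executor that inserts a recovery step when a check fails, mirroring the closed-microwave-door example from the abstract; the step representation and all names are assumptions of this sketch:

```python
# Toy sketch of reflection-in-the-loop: each plan step carries a
# pre-condition, an action, a post-condition, and a recovery step. When a
# pre-condition fails, the recovery step is inserted before a retry.
# This assumes the recovery step actually restores the pre-condition.

def execute_plan(plan, world):
    """Run (precond, action, postcond, recovery) steps with reflection."""
    log = []
    queue = list(plan)
    while queue:
        pre, action, post, recovery = queue.pop(0)
        if not pre(world):
            queue.insert(0, recovery)                      # e.g. open the door
            queue.insert(1, (pre, action, post, recovery)) # then retry the step
            log.append("recover")
            continue
        action(world)
        log.append("act")
        if not post(world):
            log.append("failed")
    return log
```

A real system would bound the number of recovery insertions; here the sketch relies on the recovery step succeeding, which is an assumption.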
Zero-shot composed image retrieval (ZS-CIR) is a challenging task that aims to retrieve images similar to a composed query of a reference image and a description, without relying on training on triplet datasets. Existing methods for this task often rely on predefined, fixed retrieval processes that combine the image and the modified text through hand-crafted templates, which suffer from two main issues: non-adaptive retrieval queries and user-unfriendly retrieval processes. To address these limitations, we propose a novel framework, Automatic Multi-Agent Collaboration for Zero-Shot Composed Image Retrieval (AutoCIR). AutoCIR consists of three training-free agents (a planner, a retriever, and a corrector) that work together to iteratively identify and rectify mismatches. The planner guides the retriever by generating a customized target caption for the composed query and further refines this caption to resolve any semantic discrepancies based on feedback. The corrector, equipped with a chain-of-thought reasoning mechanism, conducts an in-depth evaluation of the retrieved results and generates appropriate self-correction actions. Extensive experiments on three benchmarks demonstrate that AutoCIR consistently outperforms previous competitive methods for ZS-CIR.
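A minimal sketch of the planner-retriever-corrector loop, with toy stand-ins for the LLM- and vision-based agents (the real system generates captions and retrieves images with large models; the gallery, captions, and feedback strings below are hypothetical):

```python
def autocir_loop(query, plan, retrieve, correct, max_iters=3):
    """Planner writes a caption, retriever fetches, corrector gives feedback."""
    caption = plan(query, feedback=None)
    result = None
    for _ in range(max_iters):
        result = retrieve(caption)
        feedback = correct(query, result)
        if feedback is None:                 # corrector is satisfied
            return result
        caption = plan(query, feedback=feedback)
    return result

# Toy components: two gallery images, and a planner that fixes its caption
# once the corrector reports a color mismatch.
gallery = {"red car": "img_red", "blue car": "img_blue"}

def plan(query, feedback):
    return "blue car" if feedback else "red car"

def retrieve(caption):
    return gallery[caption]

def correct(query, result):
    return None if result == "img_blue" else "color mismatch"

best = autocir_loop({"ref": "car photo", "text": "make it blue"},
                    plan, retrieve, correct)
```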
Quality assurance of web applications is critical, as web applications play an essential role in people's daily lives. To reduce labor costs, automated web GUI testing (AWGT) is widely adopted, exploring web applications via GUI actions such as clicks and text inputs. However, these approaches face limitations in generating continuous and meaningful action sequences capable of covering complex functionalities. Recent work incorporates large language models (LLMs) for GUI testing. However, these approaches face various challenges, including low efficiency of LLMs, high complexity of rich web application contexts, and a low success rate of LLMs in executing GUI tasks. To address these challenges, in this paper, we propose Temac, an approach that enhances AWGT using LLM-based multi-agent collaboration to increase code coverage. Temac is motivated by our insight that LLMs can enhance AWGT in executing complex functionalities, while the information discovered during AWGT can, in turn, be provided as domain knowledge to improve LLM-based task execution. Specifically, given a web application, Temac initially runs an existing approach to broadly explore application states. When the testing coverage stagnates, Temac then employs LLM-based agents to summarize the collected information to form a knowledge base and to infer not-yet-covered functionalities. Guided by this knowledge base, Temac finally uses specialized LLM-based agents to target and execute the not-yet-covered functionalities, reaching deeper states beyond those explored by the existing approach. Our evaluation results show that Temac exceeds state-of-the-art approaches by 12.5% to 60.3% in average code coverage on six complex open-source web applications, while revealing 445 unique failures in the top 20 real-world web applications. These results strongly demonstrate the effectiveness and the general applicability of Temac.
Large Language Model-based multi-agent systems (MAS) have shown remarkable progress in solving complex tasks through collaborative reasoning and inter-agent critique. However, existing approaches typically treat each task in isolation, resulting in redundant computations and limited generalization across structurally similar tasks. To address this, we introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation. We model the task-solving workflow on a graph-structured multi-agent collaboration network, where agents propagate information and coordinate via explicit connectivity. During the experiential learning phase, we quantify the quality of each step in the task-solving workflow and store the resulting rewards, along with the corresponding inputs and outputs, in each agent's individual experience pool. During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step, thereby enabling more accurate and efficient multi-agent collaboration. Experimental results on diverse datasets demonstrate that MAEL empowers agents to learn effectively from prior task experiences, achieving faster convergence and producing higher-quality solutions on current tasks.
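The experience-pool mechanism can be illustrated with a toy sketch: workflow steps are stored with rewards, and high-reward, similar entries are retrieved as few-shot examples. The token-overlap similarity below is a deliberate simplification; the paper's retrieval and reward quantification are richer, and all entries here are invented.

```python
def similarity(a, b):
    """Toy task relevance: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

class ExperiencePool:
    def __init__(self):
        self.entries = []   # (task_input, output, reward)

    def add(self, task_input, output, reward):
        self.entries.append((task_input, output, reward))

    def retrieve(self, query, k=2, min_reward=0.5):
        """Return up to k high-reward entries most relevant to the query."""
        good = [e for e in self.entries if e[2] >= min_reward]
        good.sort(key=lambda e: similarity(query, e[0]), reverse=True)
        return good[:k]

pool = ExperiencePool()
pool.add("sort a list of numbers", "use sorted(xs)", 0.9)
pool.add("reverse a string", "use s[::-1]", 0.8)
pool.add("sort a list of tuples", "bad answer", 0.1)   # low reward, filtered

examples = pool.retrieve("sort a list of strings", k=1)
```

The retrieved entries would be prepended to the agent's prompt as few-shot examples for the current step.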
Malicious agents pose significant threats to the reliability and decision-making capabilities of Multi-Agent Systems (MAS) powered by Large Language Models (LLMs). Existing defenses often fall short due to reactive designs or centralized architectures which may introduce single points of failure. To address these challenges, we propose SentinelNet, the first decentralized framework for proactively detecting and mitigating malicious behaviors in multi-agent collaboration. SentinelNet equips each agent with a credit-based detector trained via contrastive learning on augmented adversarial debate trajectories, enabling autonomous evaluation of message credibility and dynamic neighbor ranking via bottom-k elimination to suppress malicious communications. To overcome the scarcity of attack data, it generates adversarial trajectories simulating diverse threats, ensuring robust training. Experiments on MAS benchmarks show SentinelNet achieves near-perfect detection of malicious agents, close to 100% within two debate rounds, and recovers 95% of system accuracy from compromised baselines. By exhibiting strong generalizability across domains and attack patterns, SentinelNet establishes a novel paradigm for safeguarding collaborative MAS.
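Bottom-k elimination itself is simple to sketch: each agent ranks its neighbors by credit and stops listening to the k lowest. The credit scores below are hand-set for illustration, whereas SentinelNet derives them from a contrastively trained detector.

```python
def bottom_k_eliminate(credits, k):
    """Return the neighbors kept after dropping the k lowest-credit ones."""
    ranked = sorted(credits, key=credits.get)   # ascending by credit
    dropped = set(ranked[:k])
    return {agent: c for agent, c in credits.items() if agent not in dropped}

# Hypothetical credit scores assigned by one agent's local detector.
credits = {"agent_a": 0.92, "agent_b": 0.15, "agent_c": 0.78, "agent_d": 0.05}
kept = bottom_k_eliminate(credits, k=2)
```

Because each agent runs this locally, suppression of suspected malicious peers needs no central coordinator.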
Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent's ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent's reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.
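Entropy-based uncertainty estimation, the gate for introspective reasoning in this style of pipeline, can be sketched as follows. The probability vectors and threshold are illustrative, not values from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_verification(token_probs, threshold=1.0):
    """High entropy -> uncertain answer -> trigger external verification."""
    return entropy(token_probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]   # peaked: model is sure
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform: maximal uncertainty
```

Only responses flagged as uncertain would be routed through the cross-modal verification and self-reflection agents, saving compute on confident answers.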
Audiobook interpretations are attracting increasing attention, as they provide accessible and in-depth analyses of books that offer readers practical insights and intellectual inspiration. However, their manual creation process remains time-consuming and resource-intensive. To address this challenge, we propose AI4Reading, a multi-agent collaboration system leveraging large language models (LLMs) and speech synthesis technology to generate podcast-like audiobook interpretations. The system is designed to meet three key objectives: accurate content preservation, enhanced comprehensibility, and a logical narrative structure. To achieve these goals, we develop a framework composed of 11 specialized agents, including topic analysts, case analysts, editors, a narrator, and proofreaders, that work in concert to explore themes, extract real-world cases, refine content organization, and synthesize natural spoken language. By comparing expert interpretations with our system's output, the results show that although AI4Reading still has a gap in speech generation quality, the generated interpretative scripts are simpler and more accurate.
This paper presents the MAPLE framework, which harnesses large language models (LLMs) to facilitate multi-agent collaboration for fully automated deployment and management of large-scale networks. Within MAPLE, a supervisor agent interprets natural language instructions from users, orchestrates specialized agents to execute tasks, and validates outcomes through integration with a network simulation platform. Experimental findings show that MAPLE outperforms single-agent approaches in terms of success rates for topology deployment and service configuration. Moreover, experiments reveal that by adaptively employing LLMs with varying capabilities according to task requirements and inter-agent dependencies, the framework effectively balances task success rates with cost efficiency.
Diet plays a central role in human health, and Nutrition Question Answering (QA) offers a promising path toward personalized dietary guidance and the prevention of diet-related chronic diseases. However, existing methods face three fundamental challenges: the limited reasoning capacity of single-agent systems, the complexity of designing effective multi-agent architectures, and contextual overload that hinders accurate decision-making. We introduce Nutritional-Graph Router (NG-Router), a novel framework that formulates nutritional QA as a supervised, knowledge-graph-guided multi-agent collaboration problem. NG-Router integrates agent nodes into heterogeneous knowledge graphs and employs a graph neural network to learn task-aware routing distributions over agents, leveraging soft supervision derived from empirical agent performance. To further address contextual overload, we propose a gradient-based subgraph retrieval mechanism that identifies salient evidence during training, thereby enhancing multi-hop and relational reasoning. Extensive experiments across multiple benchmarks and backbone models demonstrate that NG-Router consistently outperforms both single-agent and ensemble baselines, offering a principled approach to domain-aware multi-agent reasoning for complex nutritional health tasks.
Complex medical decision-making involves cooperative workflows operated by different clinicians. Designing AI multi-agent systems can expedite and augment human-level clinical decision-making. Existing multi-agent research primarily focuses on language-only tasks, yet its extension to multimodal scenarios remains challenging. A blind combination of diverse vision-language models (VLMs) can amplify erroneous outcome interpretations. Compared to large language models (LLMs) of comparable size, VLMs are generally less capable at instruction following and, importantly, self-reflection. This disparity largely constrains VLMs' ability to operate in cooperative workflows. In this study, we propose MedOrch, a mediator-guided multi-agent collaboration framework for medical multimodal decision-making. MedOrch employs an LLM-based mediator agent that enables multiple VLM-based expert agents to exchange and reflect on their outputs towards collaboration. We utilize multiple open-source general-purpose and domain-specific VLMs instead of costly GPT-series models, revealing the strength of heterogeneous models. We show that the collaboration among distinct VLM-based agents can surpass the capabilities of any individual agent. We validate our approach on five medical vision question answering benchmarks, demonstrating superior collaboration performance without model training. Our findings underscore the value of mediator-guided multi-agent collaboration in advancing medical multimodal intelligence.
To address the fragmented knowledge and data privacy risks of traditional intelligent teaching systems, this research proposes a technical framework of "knowledge base driven + multi-agent collaboration". Through vector databases and Retrieval-Augmented Generation (RAG) technology, semantic-level retrieval of subject knowledge is achieved. Combined with Federated Learning and blockchain technology, a cross-school collaboration mechanism is constructed. Experiments show that the "AI Learning Companion" system based on this framework can effectively improve the average scores of experimental classes and increase students' active learning time. The innovations are: (1) constructing a dynamically updated knowledge base system that supports multi-modal resource integration; (2) designing a multi-agent collaboration mechanism to achieve distributed processing of teaching tasks; and (3) proposing a blockchain-based cross-school knowledge base alliance scheme. The research results provide a transferable technical path for the large-scale application of intelligent education systems.
Requirements Engineering (RE) is an initial and critical phase in software development, with the aim of producing well-defined software requirements specifications (SRSs) from rough ideas of clients. It involves multiple tasks (e.g., elicitation, analysis) and roles (e.g., interviewer, analyst). With the rise of Large Language Models (LLMs), many studies have leveraged LLMs to support specific RE tasks. However, existing LLM-based agents often lack domain knowledge integration and fall short in simulating the complex collaboration of human experts across the full RE process. To address this gap, we propose KGMAF, a knowledge-guided multi-agent framework designed to assist requirements engineers in developing high-quality SRSs. KGMAF comprises six LLM-based agents and a shared artifact pool. Each agent is equipped with predefined actions, dedicated functions, and injected knowledge tailored to specific RE tasks. The artifact pool stores both intermediate and final artifacts, serving as a communication channel for inter-agent collaboration. A human-in-the-loop (HITL) mechanism is embedded to guide and validate agent outputs. We present the design of KGMAF, along with preliminary experiments and a case study to demonstrate its practicality. This work lays the foundation for future research on knowledge-driven multi-agent collaboration in RE and highlights key challenges in building trustworthy intelligent assistants for real-world RE tasks.
No abstract available
Large Language Model (LLM)-based multi-agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering the systematic reduction of uncertainty. Overcoming these limitations fundamentally requires a principled approach to exploration. We introduce PiFlow, an information-theoretic framework that treats automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). Extensive evaluations across three distinct scientific domains demonstrate that PiFlow (I) improves discovery efficiency by 31.18%~41.73% and solution quality by 12.47%~31.72% against state-of-the-art methods, (II) delivers a 5.6x speedup in time-to-solution while reducing token consumption by up to 27% compared to vanilla agents, and (III) serves as a plug-and-play module that generalizes across existing agent architectures. Overall, PiFlow establishes a novel paradigm shift in highly efficient agentic scientific discovery, paving the way for more robust and accelerated AI-driven research.
Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.
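The review-style control flow, reviewers judging independently and a meta-reviewer aggregating, can be sketched with toy functions standing in for LLM calls. The decision strings and aggregation rule below are hypothetical simplifications of the system's actual prompting.

```python
def mars_round(author, reviewers, meta, problem):
    """One MARS-style round: author drafts, reviewers judge independently,
    meta-reviewer aggregates. No reviewer-to-reviewer messages are exchanged."""
    solution = author(problem)
    reviews = [review(solution) for review in reviewers]  # independent
    return meta(solution, reviews)

# Toy agents: the author appends a draft, reviewers return (decision, comment),
# and the meta-reviewer accepts only on unanimous "accept".
author = lambda p: p + " -> draft answer"
reviewers = [
    lambda s: ("accept", "argument is clear"),
    lambda s: ("revise", "add a proof step"),
]
meta = lambda s, rs: {"solution": s,
                      "accept": all(d == "accept" for d, _ in rs),
                      "comments": [c for _, c in rs]}

result = mars_round(author, reviewers, meta, "prove sum formula")
```

Because reviewers never talk to each other, the number of messages grows linearly in the number of reviewers rather than quadratically as in round-table debate.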
Contemporary multi-agent systems encounter persistent challenges in cross-platform interoperability, dynamic task scheduling, and efficient resource sharing. Agents with heterogeneous implementations often lack standardized interfaces; collaboration frameworks remain brittle and hard to extend; scheduling policies are static; and inter-agent state synchronization is insufficient. We propose Hierarchical Agent Workflow (HAWK), a modular framework comprising five layers (User, Workflow, Operator, Agent, and Resource) and supported by sixteen standardized interfaces. HAWK delivers an end-to-end pipeline covering task parsing, workflow orchestration, intelligent scheduling, resource invocation, and data synchronization. At its core lies an adaptive scheduling and optimization module in the Workflow Layer, which harnesses real-time feedback and dynamic strategy adjustment to maximize utilization. The Resource Layer provides a unified abstraction over heterogeneous data sources, large models, physical devices, and third-party services and tools, simplifying cross-domain information retrieval. We demonstrate HAWK's scalability and effectiveness via CreAgentive, a multi-agent novel-generation prototype, which achieves marked gains in throughput, lowers invocation complexity, and improves system controllability. We also show how hybrid deployments of large language models integrate seamlessly within HAWK, highlighting its flexibility. Finally, we outline future research avenues (hallucination mitigation, real-time performance tuning, and enhanced cross-domain adaptability) and survey prospective applications in healthcare, government, finance, and education.
While Large Language Models (LLMs) have shown remarkable advancements in reasoning and tool use, they often fail to generate optimal, grounded solutions under complex constraints. Real-world travel planning exemplifies these challenges, evaluating agents' abilities to handle constraints that are explicit, implicit, and even evolving based on interactions with dynamic environments and user needs. In this paper, we present ATLAS, a general multi-agent framework designed to effectively handle such complex constraint awareness in real-world travel planning tasks. ATLAS introduces a principled approach to address the fundamental challenges of constraint-aware planning through dedicated mechanisms for dynamic constraint management, iterative plan critique, and adaptive interleaved search. ATLAS demonstrates state-of-the-art performance on the TravelPlanner benchmark, improving the final pass rate from 23.3% to 44.4% over its best alternative. More importantly, our work is the first to demonstrate quantitative effectiveness on real-world travel planning tasks with live information search and multi-turn feedback. In this realistic setting, ATLAS showcases its superior overall planning performance, achieving an 84% final pass rate which significantly outperforms baselines including ReAct (59%) and a monolithic agent (27%).
Mobile Edge Computing (MEC) distributes resources such as computing, storage, and bandwidth to the side close to users, which can provide low-latency services to in-vehicle users, thus promising a more efficient and safer driving environment. However, due to the dynamic number of vehicles and the variability of resource requirements, quickly obtaining effective task offloading in large-scale vehicular scenarios is a significant challenge. Existing studies generally adopt centralized decision-making, which incurs long decision times and high computational overhead and cannot achieve good offloading decisions in large-scale scenarios. To address these problems, we propose a Multi-agent Collaborative Method for vehicular task offloading using Federated Deep Reinforcement Learning, called MCM-FDRL. First, each vehicle, acting as an agent, independently makes offloading decisions based on local information. Next, the offloading decision model of each vehicle is obtained through federated reinforcement learning training. At runtime, an effective vehicle offloading plan can be gradually developed through multi-agent collaboration. Using two real-world datasets, experiments show that MCM-FDRL has good adaptability and scalability. Moreover, compared to state-of-the-art methods, MCM-FDRL reduces the average task response time by 9.75%-64.90%.
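The aggregation step at the heart of federated training can be illustrated with a toy FedAvg: each vehicle trains its policy locally, and only model weights (never raw driving data) are averaged into a global model. Plain lists stand in for real parameter tensors, and the weight values are invented.

```python
def fed_avg(local_weights):
    """Average per-parameter weights from all participating vehicles."""
    n = len(local_weights)
    return [sum(ws) / n for ws in zip(*local_weights)]

# Hypothetical local policy weights after one round of on-vehicle training.
vehicle_models = [
    [0.2, 0.4, 0.6],   # vehicle 1
    [0.4, 0.2, 0.6],   # vehicle 2
    [0.6, 0.6, 0.6],   # vehicle 3
]
global_model = fed_avg(vehicle_models)
```

The averaged global model is then broadcast back to the vehicles, which continue local training from it in the next round.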
As large language models (LLMs) become integral to multi-agent systems, new privacy risks emerge that extend beyond memorization, direct inference, or single-turn evaluations. In particular, seemingly innocuous responses, when composed across interactions, can cumulatively enable adversaries to recover sensitive information, a phenomenon we term compositional privacy leakage. We present the first systematic study of such compositional privacy leaks and possible mitigation methods in multi-agent LLM systems. First, we develop a framework that models how auxiliary knowledge and agent interactions jointly amplify privacy risks, even when each response is benign in isolation. Next, to mitigate this, we propose and evaluate two defense strategies: (1) Theory-of-Mind defense (ToM), where defender agents infer a questioner's intent by anticipating how their outputs may be exploited by adversaries, and (2) Collaborative Consensus Defense (CoDef), where responder agents collaborate with peers who vote based on a shared aggregated state to restrict sensitive information spread. Crucially, we balance our evaluation across compositions that expose sensitive information and compositions that yield benign inferences. Our experiments quantify how these defense strategies differ in balancing the privacy-utility trade-off. We find that while chain-of-thought alone offers limited protection to leakage (~39% sensitive blocking rate), our ToM defense substantially improves sensitive query blocking (up to 97%) but can reduce benign task success. CoDef achieves the best balance, yielding the highest Balanced Outcome (79.8%), highlighting the benefit of combining explicit reasoning with defender collaboration. Together, our results expose a new class of risks in collaborative LLM deployments and provide actionable insights for designing safeguards against compositional, context-driven privacy leakage.
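The Collaborative Consensus Defense can be sketched as a majority vote among peer agents over whether a query is safe to answer before the responder releases anything. The voting heuristics and query strings below are hypothetical stand-ins for the paper's LLM-based defenders.

```python
def codef_respond(query, answer, voters):
    """Release the answer only if a strict majority of peers votes 'safe'."""
    votes = [vote(query) for vote in voters]
    if votes.count("safe") > len(votes) / 2:
        return answer
    return "[withheld by consensus]"

# Toy defender agents with different sensitivities to private attributes.
voters = [
    lambda q: "unsafe" if "salary" in q else "safe",
    lambda q: "unsafe" if "salary" in q or "address" in q else "safe",
    lambda q: "safe",
]

blocked = codef_respond("what is Bob's salary?", "$90k", voters)
allowed = codef_respond("what city is the office in?", "Austin", voters)
```

The point of the compositional threat model is that even the "allowed" class of answers must be voted on, since benign responses can combine across turns into a sensitive inference.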
Video-to-audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video-to-audio dubbing for long-form content remains an unsolved challenge due to dynamic semantic shifts, temporal misalignment, and the absence of dedicated datasets. While existing methods excel in short videos, they falter in long scenarios (e.g., movies) due to fragmented synthesis and inadequate cross-scene consistency. We propose LVAS-Agent, a novel multi-agent framework that emulates professional dubbing workflows through collaborative role specialization. Our approach decomposes long-video synthesis into four steps including scene segmentation, script generation, sound design and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment. To enable systematic evaluation, we introduce LVAS-Bench, the first benchmark with 207 professionally curated long videos spanning diverse scenarios. Experiments demonstrate superior audio-visual alignment over baseline methods. Project page: https://lvas-agent.github.io
Embodied agents based on large language models (LLMs) face significant challenges in collaborative tasks, requiring effective communication and a reasonable division of labor to ensure efficient and correct task completion. Previous approaches with simple communication patterns can produce erroneous or incoherent agent actions, which introduce additional risks. To address these problems, we propose Cooperative Tree Search (CoTS), a framework designed to significantly improve collaborative planning and task execution efficiency among embodied agents. CoTS guides multiple agents to discuss long-term strategic plans within a modified Monte Carlo tree, searching along LLM-driven reward functions to provide a more thoughtful and promising approach to cooperation. Another key feature of our method is the introduction of a plan evaluation module, which not only prevents the agent action confusion caused by frequent plan updates but also ensures the plan is updated when the current plan becomes unsuitable. Experimental results show that the proposed method performs excellently in planning, communication, and collaboration in embodied environments (CWAH and TDW-MAT), efficiently completing long-term, complex tasks and significantly outperforming existing methods.
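Selection inside a Monte Carlo tree is typically driven by a UCT-style rule balancing exploitation and exploration; a sketch follows, with hand-set node statistics standing in for the LLM-driven reward estimates a system like CoTS would use. The plan names and values are invented.

```python
import math

def uct_select(children, total_visits, c=1.4):
    """Pick the child maximizing the UCT score: mean value + exploration bonus."""
    def uct(node):
        if node["visits"] == 0:
            return float("inf")                 # always try unvisited plans first
        exploit = node["value"] / node["visits"]
        explore = c * math.sqrt(math.log(total_visits) / node["visits"])
        return exploit + explore
    return max(children, key=uct)

# Hypothetical candidate joint plans with accumulated reward and visit counts.
plans = [
    {"name": "split rooms",   "value": 8.0, "visits": 10},
    {"name": "work together", "value": 3.0, "visits": 10},
    {"name": "untried plan",  "value": 0.0, "visits": 0},
]
chosen = uct_select(plans, total_visits=20)
```

Here the unvisited plan wins because UCT assigns it infinite priority; once every plan has statistics, the rule trades off average reward against how rarely a plan has been explored.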
Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e. equipping text-only DeepSeek-R1 with Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.
Humans do not passively observe the visual world -- we actively look in order to act. Motivated by this principle, we introduce EyeRobot, a robotic system with gaze behavior that emerges from the need to complete real-world tasks. We develop a mechanical eyeball that can freely rotate to observe its surroundings and train a gaze policy to control it using reinforcement learning. We accomplish this by first collecting teleoperated demonstrations paired with a 360 camera. This data is imported into a simulation environment that supports rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze on top of robot demonstrations. We then introduce a BC-RL loop to train the hand and eye jointly: the hand (BC) agent is trained from rendered eye observations, and the eye (RL) agent is rewarded when the hand produces correct action predictions. In this way, hand-eye coordination emerges as the eye looks towards regions which allow the hand to complete the task. EyeRobot implements a foveal-inspired policy architecture allowing high resolution with a small compute budget, which we find also leads to the emergence of more stable fixation as well as improved ability to track objects and ignore distractors. We evaluate EyeRobot on five panoramic workspace manipulation tasks requiring manipulation in an arc surrounding the robot arm. Our experiments suggest EyeRobot exhibits hand-eye coordination behaviors which effectively facilitate manipulation over large workspaces with a single camera. See project site for videos: https://www.eyerobot.net/
Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
No abstract available
Existing methods for multi-agent navigation typically assume fully known environments, offering limited support for partially known scenarios such as warehouses or factory floors. There, agents may need to plan trajectories that balance their own path optimality with their ability to collect and share information about the environment that can help their teammates reach their own goals. To these ends, we propose ORION, a novel deep reinforcement learning framework for cooperative multi-agent online navigation in partially known environments. Starting from an imperfect prior map, ORION trains agents to make decentralized decisions, coordinate to reach their individual targets, and actively reduce map uncertainty by sharing online observations in a closed perception-action loop. We first design a shared graph encoder that fuses the prior map with online perception into a unified representation, providing robust state embeddings under dynamic map discrepancies. At the core of ORION is an option-critic framework that learns to reason about a set of high-level cooperative modes that translate into sequences of low-level actions, allowing agents to switch between individual navigation and team-level exploration adaptively. We further introduce a dual-stage cooperation strategy that enables agents to assist teammates under map uncertainty, thereby reducing the overall makespan. Across extensive maze-like maps and large-scale warehouse environments, our simulation results show that ORION achieves high-quality, real-time decentralized cooperation over varying team sizes, outperforming state-of-the-art classical and learning-based baselines. Finally, we validate ORION on physical robot teams, demonstrating its robustness and practicality for real-world cooperative navigation.
Traditional Agent-based Models (ABMs) often struggle to capture the nuance of adaptive human decision-making during complex crises due to their reliance on static, predefined rules. Large Language Models (LLMs) offer a transformative solution by acting as cognitive engines that empower agents with human-like common-sense reasoning. In this paper, we introduce an LLM-driven Multi-Agent Simulation framework to investigate coupled epidemic–economic dynamics, incorporating a Perception-Deliberation-Action (PDA) loop. Agents, acting as heterogeneous cognitive entities, utilize Chain-of-Thought processes to autonomously balance health risks against economic necessities. This approach endogenously generates adaptive behaviors without explicit scripting. Extensive experiment results across diverse LLM backends confirm the framework’s robustness, revealing divergent socio-economic trajectories under distinct macroscopic conditions and effectively quantifying the trade-offs between public health and economic stability. This approach establishes a high-fidelity computational laboratory for investigating complex scenarios under distinct macroscopic conditions, effectively bridging the gap between micro-level cognition and macro-level societal outcomes.
Geometric Problem Solving (GPS) poses a unique challenge for Multimodal Large Language Models (MLLMs), requiring not only the joint interpretation of text and diagrams but also iterative visuospatial reasoning. While existing approaches process diagrams as static images, they lack the capacity for dynamic manipulation - a core aspect of human geometric reasoning involving auxiliary line construction and affine transformations. We present GeoSketch, a neural-symbolic framework that recasts geometric reasoning as an interactive perception-reasoning-action loop. GeoSketch integrates: (1) a Perception module that abstracts diagrams into structured logic forms, (2) a Symbolic Reasoning module that applies geometric theorems to decide the next deductive step, and (3) a Sketch Action module that executes operations such as drawing auxiliary lines or applying transformations, thereby updating the diagram in a closed loop. To train this agent, we develop a two-stage pipeline: supervised fine-tuning on 2,000 symbolic-curated trajectories followed by reinforcement learning with dense, symbolic rewards to enhance robustness and strategic exploration. To evaluate this paradigm, we introduce the GeoSketch Benchmark, a high-quality set of 390 geometry problems requiring auxiliary construction or affine transformations. Experiments on strong MLLM baselines demonstrate that GeoSketch significantly improves stepwise reasoning accuracy and problem-solving success over static perception methods. By unifying hierarchical decision-making, executable visual actions, and symbolic verification, GeoSketch advances multimodal reasoning from static interpretation to dynamic, verifiable interaction, establishing a new foundation for solving complex visuospatial problems.
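As a minimal illustration of the symbolic-reasoning step in GeoSketch's loop, the sketch below applies one geometric theorem (the triangle angle sum) to a structured logic form and derives a new fact. The fact encoding is invented for this sketch, not the paper's actual representation.

```python
# One symbolic deduction step on a toy logic form: if two angles of a
# triangle are known, the triangle angle-sum theorem yields the third.
def apply_angle_sum(facts):
    known = {k: v for k, v in facts.items() if k.startswith("angle_")}
    if len(known) == 2:
        missing = ({"angle_A", "angle_B", "angle_C"} - known.keys()).pop()
        facts = dict(facts)
        facts[missing] = 180 - sum(known.values())
    return facts

print(apply_angle_sum({"angle_A": 60, "angle_B": 80}))  # deduces angle_C = 40
```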
Spatial reasoning in partially observable environments has often been approached through passive predictive models, yet theories of embodied cognition suggest that genuinely useful representations arise only when perception is tightly coupled to action. Here we ask whether a recurrent agent, trained solely by sparse rewards to solve procedurally generated planar mazes, can autonomously internalize metric concepts such as direction, distance and obstacle layout. After training, the agent consistently produces near-optimal paths in unseen mazes, behavior that hints at an underlying spatial model. To probe this possibility, we cast the closed agent-environment loop as a hybrid dynamical system, identify stable limit cycles in its state space, and characterize behavior with a Ridge Representation that embeds whole trajectories into a common metric space. Canonical correlation analysis exposes a robust linear alignment between neural and behavioral manifolds, while targeted perturbations of the most informative neural dimensions sharply degrade navigation performance. Taken together, these dynamical, representational, and causal signatures show that sustained sensorimotor interaction is sufficient for the spontaneous emergence of compact, embodied world models, providing a principled path toward interpretable and transferable navigation policies.
This study presents a conceptual framework and a prototype assessment for Large Language Model (LLM)-based Building Energy Management System (BEMS) AI agents to facilitate context-aware energy management in smart buildings through natural language interaction. The proposed framework comprises three modules: perception (sensing), central control (brain), and action (actuation and user interaction), forming a closed feedback loop that captures, analyzes, and interprets energy data to respond intelligently to user queries and manage connected appliances. By leveraging the autonomous data analytics capabilities of LLMs, the BEMS AI agent seeks to offer context-aware insights into energy consumption, cost prediction, and device scheduling, thereby addressing limitations in existing energy management systems. The prototype's performance was evaluated using 120 user queries across four distinct real-world residential energy datasets and different evaluation metrics, including latency, functionality, capability, accuracy, and cost-effectiveness. The generalizability of the framework was demonstrated using ANOVA tests. The results revealed promising performance, measured by response accuracy in device control (86%), memory-related tasks (97%), scheduling and automation (74%), and energy analysis (77%), while more complex cost estimation tasks highlighted areas for improvement with an accuracy of 49%. This benchmarking study moves toward formalizing the assessment of LLM-based BEMS AI agents and identifying future research directions, emphasizing the trade-off between response accuracy and computational efficiency.
India’s defense drone ecosystem remains largely dependent on manual control and pre-programmed routines, restricting adaptability in rapidly evolving or GPS-denied environments. In this paper, we propose Drishti-AI (Drone-based Reinforcement System for Intelligent Human Tracking and Identification using AI), a fully autonomous UAV surveillance framework that tightly integrates Reinforcement Learning (RL)–based navigation with real-time Deep Learning (DL)–driven human detection. For autonomous mobility, we introduce AeroNav-PPO, a Proximal Policy Optimization (PPO) model trained in a 3D AirSim environment to enable intelligent navigation and dynamic obstacle avoidance. For perception, we present YOLOv8-HawkEye, a customized object detection model capable of identifying humans with real-time coordinate mapping and automated alert generation. Together, these modules form a closed perception-action loop that enables end-to-end autonomous operation. The AeroNav-PPO agent demonstrated an average episode reward of 347.0 with consistent patrol performance across complex terrains, while YOLOv8-HawkEye achieved a precision of 95.6%, recall of 87.8%, and mAP@0.5 of 93.8%. Our results validate Drishti-AI as a promising step toward fully autonomous, mission-ready surveillance drones for next-generation defense applications.
Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.
Artificial Intelligence is moving from models that only generate text to Agentic AI, where systems behave as autonomous entities that can perceive, reason, plan, and act. Large Language Models (LLMs) are no longer used only as passive knowledge engines but as cognitive controllers that combine memory, tool use, and feedback from their environment to pursue extended goals. This shift already supports the automation of complex workflows in software engineering, scientific discovery, and web navigation, yet the variety of emerging designs, from simple single-loop agents to hierarchical multi-agent systems, makes the landscape hard to navigate. In this paper, we investigate architectures and propose a unified taxonomy that breaks agents down into Perception, Brain, Planning, Action, Tool Use, and Collaboration. We use this lens to describe the move from linear reasoning procedures to native inference-time reasoning models, and the transition from fixed API calls to open standards like the Model Context Protocol (MCP) and Native Computer Use. We also group the environments in which these agents operate, including digital operating systems, embodied robotics, and other specialized domains, and we review current evaluation practices. Finally, we highlight open challenges, such as hallucination in action, infinite loops, and prompt injection, and outline future research directions toward more robust and reliable autonomous systems.
Towards human-robot coexistence, socially aware navigation is important for mobile robots. Yet existing studies in this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions to task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, 63% higher than that of the most competitive baseline, with most of the improvements observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/
Graphical user interface (GUI) agents have shown promise in automating mobile tasks but still struggle with input redundancy and decision ambiguity. In this paper, we present RecAgent, an uncertainty-aware agent that addresses these issues through adaptive perception. We distinguish two types of uncertainty in GUI navigation: (1) perceptual uncertainty, caused by input redundancy and noise from comprehensive screen information, and (2) decision uncertainty, arising from ambiguous tasks and complex reasoning. To reduce perceptual uncertainty, RecAgent employs a component recommendation mechanism that identifies and focuses on the most relevant UI elements. For decision uncertainty, it uses an interactive module to request user feedback in ambiguous situations, enabling intent-aware decisions. These components are integrated into a unified framework that proactively reduces input complexity and reacts to high-uncertainty cases via human-in-the-loop refinement. Additionally, we propose a dataset called ComplexAction to evaluate the success rate of GUI agents in executing specified single-step actions within complex scenarios. Extensive experiments validate the effectiveness of our approach. The dataset and code will be available at https://github.com/Fanye12/RecAgent.
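The decision-uncertainty branch described above can be caricatured with an entropy test over candidate actions: act when the distribution is peaked, defer to the user when it is flat. The 0.9-bit threshold and action names are illustrative assumptions, not RecAgent's actual mechanism.

```python
# Entropy gate over candidate actions: a peaked distribution -> act; a flat
# distribution -> ask the user. The 0.9-bit threshold is illustrative.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def decide(action_probs, threshold=0.9):
    """action_probs: dict mapping action -> probability (sums to 1)."""
    if entropy(action_probs.values()) > threshold:
        return "ask_user"                 # high decision uncertainty
    return max(action_probs, key=action_probs.get)

print(decide({"tap_ok": 0.95, "tap_cancel": 0.05}))  # confident: tap_ok
print(decide({"tap_ok": 0.5, "tap_cancel": 0.5}))    # ambiguous: ask_user
```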
Existing imitation learning methods decouple perception and action, which overlooks the causal reciprocity between sensory representations and action execution that humans naturally leverage for adaptive behaviors. To bridge this gap, we introduce Action-Guided Diffusion Policy (DP-AG), a unified representation learning that explicitly models a dynamic interplay between perception and action through probabilistic latent dynamics. DP-AG encodes latent observations into a Gaussian posterior via variational inference and evolves them using an action-guided SDE, where the Vector-Jacobian Product (VJP) of the diffusion policy's noise predictions serves as a structured stochastic force driving latent updates. To promote bidirectional learning between perception and action, we introduce a cycle-consistent contrastive loss that organizes the gradient flow of the noise predictor into a coherent perception-action loop, enforcing mutually consistent transitions in both latent updates and action refinements. Theoretically, we derive a variational lower bound for the action-guided SDE, and prove that the contrastive objective enhances continuity in both latent and action trajectories. Empirically, DP-AG significantly outperforms state-of-the-art methods across simulation benchmarks and real-world UR5 manipulation tasks. As a result, our DP-AG offers a promising step toward bridging biological adaptability and artificial policy learning.
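To make the action-guided SDE idea above concrete, here is a toy Euler-Maruyama update for a one-dimensional latent state. The drift stands in for the structured force DP-AG derives from the policy's noise predictions (a Vector-Jacobian Product); the step size, noise scale, and contracting drift are illustrative assumptions.

```python
# Toy Euler-Maruyama update for an "action-guided" latent SDE: the latent is
# nudged by a drift term plus Gaussian noise. The drift, step size, and noise
# scale are illustrative; DP-AG derives its drift from the diffusion policy's
# noise predictions via a Vector-Jacobian Product.
import random

rng = random.Random(0)  # fixed seed for reproducibility

def sde_step(z, drift, dt=0.01, sigma=0.1):
    """One update: z' = z + f(z)*dt + sigma*sqrt(dt)*eps, eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return z + drift(z) * dt + sigma * (dt ** 0.5) * eps

z = 1.0
for _ in range(100):
    z = sde_step(z, drift=lambda s: -s)  # contracting drift toward 0
print(z)  # latent has decayed toward the origin from its initial value 1.0
```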
Multi-agent reinforcement learning has shown promise for task scheduling in cloud-edge computing environments in recent years. However, a formidable challenge remains due to partial observation and the rigid coupling between action spaces and schedulable devices. These limit agents' ability to perceive global communication patterns and adapt to dynamic environments, resulting in unsatisfactory scheduling decisions. To address these issues, this work proposes SPAD, a novel spatial perception and action decoupling empowered distributed multi-agent AI task scheduling framework. By constructing a global spatial feature distillation mechanism, SPAD can approximate the implicit heterogeneous connection patterns and communication dynamics between devices and tasks under constrained observability, enhancing its ability to make robust decisions in dynamic environments with limited observations. Additionally, SPAD employs a Lyapunov-based action decoupling module to alleviate scalability challenges from rigid action-device coupling, while a novel intrinsic penalty mechanism augments the agent’s advantage function with the instantaneous Lyapunov cost, thereby aligning the policy optimization process with the decoupling module’s underlying stability constraints. Through a comprehensive empirical evaluation spanning synthetic, bursty, and real-world trace-driven workloads, we show that SPAD consistently outperforms state-of-the-art benchmarks in reducing task completion latency and improving resource utilization, while maintaining remarkable resilience and scalability across diverse network topologies and under non-stationary load conditions.
Autonomous Unmanned Surface Vehicle (USV) operations in complex ocean engineering scenarios necessitate robust navigation, guidance, and control technologies. These systems require reliable sensor-based object detection and efficient, safe, and energy-aware path planning. To address these multifaceted challenges, this paper proposes a novel synergistic AI framework. The framework integrates (1) a novel adaptation of the Swin-Transformer to generate a dense, semantic risk map from raw visual data, enabling the system to interpret ambiguous marine conditions like sun glare and choppy water, enabling real-time environmental understanding crucial for guidance; (2) a Transformer-enhanced A-star (T-ASTAR) algorithm with spatio-temporal attentional guidance to generate globally near-optimal and energy-aware static paths; (3) a domain-adapted TD3 agent featuring a novel energy-aware reward function that optimizes for USV hydrodynamic constraints, making it suitable for long-endurance missions tailored for USVs to perform dynamic local path optimization and real-time obstacle avoidance, forming a key control element; and (4) CUDA acceleration to meet the computational demands of real-time ocean engineering applications. Simulations and real-world data verify the framework’s superiority over benchmarks like A* and RRT, achieving 30% shorter routes, 70% fewer turns, 64.7% fewer dynamic collisions, and a 215-fold speed improvement in map generation via CUDA acceleration. This research underscores the importance of integrating powerful AI components within a hierarchical synergy, encompassing AI-based perception, hierarchical decision planning for guidance, and multi-stage optimal search algorithms for control. The proposed solution significantly advances USV autonomy, addressing critical ocean engineering challenges such as navigation in dynamic environments, object avoidance, and energy-constrained operations for unmanned maritime systems.
Large Language Models (LLMs) exhibit remarkable capabilities in the hierarchical decomposition of complex tasks through semantic reasoning. However, their application in embodied systems faces challenges in ensuring reliable execution of subtask sequences and achieving one-shot success in long-term task completion. To address these limitations in dynamic environments, we propose Closed-Loop Embodied Agent (CLEA)—a novel architecture incorporating four specialized open-source LLMs with functional decoupling for closed-loop task management. The framework features two core innovations: (1) Interactive task planner that dynamically generates executable subtasks based on the environmental memory, and (2) Multimodal execution critic employing an evaluation framework to conduct a probabilistic assessment of action feasibility, triggering hierarchical re-planning mechanisms when environmental perturbations exceed preset thresholds. To validate CLEA’s effectiveness, we conduct experiments in a real environment with manipulable objects, using two heterogeneous robots for object search, manipulation, and search-manipulation integration tasks. Across 12 task trials, CLEA outperforms the baseline model, achieving a 67.3% improvement in success rate and a 52.8% increase in task completion rate. These results demonstrate that CLEA significantly enhances the robustness of task planning and execution in dynamic environments. Our code is available at https://sp4595.github.io/CLEA/.
With the continuous development of artificial intelligence and high-performance computing, large models have transitioned from experimental testing to production use, offering more efficient solutions for traditional deep learning research. By annotating massive amounts of data, reinforcement learning algorithms are used to inject data features into the model, and once the model's parameter scale reaches a certain level, the phenomenon of intelligent emergence occurs. Such models have been widely applied to tasks including text-to-image generation, virtual scene recognition, robotic arm control, and manufacturing. However, reinforcement learning suffers from poor generalization in agent inference when the environment changes: during training, action rewards are generated from an environment model, but in practical applications, as environmental factors continuously change, the reward function may vary as well, degrading the agent's action strategy. This paper proposes a reinforcement learning method that integrates large models into the agent's situational-awareness loop. The processed information serves as the state for the agent's actions, guiding the agent to generate behaviors in its operational context. Finally, Miniworld experiments verify the correctness of the proposed agent training method. This work also provides a new approach to agent decision-making in practical applications where the agent faces environmental changes.
Executing natural-language instructions in dynamic environments requires robust integration of perception, planning, and control. Conventional vision–language models (VLMs) provide open-vocabulary recognition but often leave a perception–action gap and fail to ensure safety under uncertainty. To address this challenge, we propose a Vision-Language Action Model (VLAM) that directly maps visual observations and instructions to actions through adaptive perception and uncertainty-aware control. The algorithm leverages contrastive language-image pretraining (CLIP)-based vision–language similarity to score candidate actions while incorporating rule-based safety priors as a fallback mechanism when confidence is low or collision risk is detected. This design narrows the perception–action gap, maintains semantic grounding, and guarantees stable behavior. We evaluate VLAM in a dynamic multi-agent environment with moving obstacles. Experimental results demonstrate that the proposed method achieves higher task success and reduced inter-agent conflicts compared to baseline strategies.
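The low-confidence fallback described above can be sketched in a few lines: candidate actions are scored by a vision-language similarity, and when the best score is low the agent defers to a rule-based safe action. The scores are placeholders for CLIP similarities; the 0.5 threshold and "stop" fallback are invented for illustration.

```python
# Sketch of confidence-gated action selection: score candidates by a
# vision-language similarity and fall back to a rule-based safe action when
# the best score is low. Scores stand in for CLIP text-image similarities;
# the threshold and safe action are illustrative assumptions.
def choose_action(scores, safe_action="stop", threshold=0.5):
    """scores: dict mapping candidate action -> similarity in [0, 1]."""
    best_action, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score < threshold:   # low confidence: defer to the safety prior
        return safe_action
    return best_action

print(choose_action({"forward": 0.9, "left": 0.2}))  # confident: forward
print(choose_action({"forward": 0.3, "left": 0.2}))  # uncertain: stop
```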
The realization of intelligent robots, operating autonomously and interacting with other intelligent agents, human or artificial, requires the integration of environment perception, reasoning, and action. Classic Artificial Intelligence techniques for this purpose, focusing on symbolic approaches, long ago hit the scalability wall on compute and memory costs. Advances in Large Language Models in the past decade (neural approaches) have resulted in unprecedented displays of capability, at the cost of control, explainability, and interpretability. Large Action Models aim at extending Large Language Models to encompass the full perception, reasoning, and action cycle; however, they typically require substantially more comprehensive training and suffer from the same deficiencies in reliability. Here, we show it is possible to build competent Large Action Models by composing off-the-shelf foundation models, and that their control, interpretability, and explainability can be effected by incorporating symbolic wrappers and associated verification on their outputs, achieving verifiable neuro-symbolic solutions for intelligent robots. Our experiments on a multi-modal robot demonstrate that Large Action Model intelligence does not require massive end-to-end training, but can be achieved by integrating efficient perception models with a logic-driven core. We find that driving action execution through the generation of Planning Domain Definition Language (PDDL) code enables a human-in-the-loop verification stage that effectively mitigates action hallucinations. These results can support practitioners in the design and development of robotic Large Action Models across novel industries, and shed light on the ongoing challenges that must be addressed to ensure safety in the field.
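The PDDL-style verification stage described above can be illustrated with a tiny symbolic simulator: a proposed action sequence is executed against preconditions and effects, and any step whose preconditions fail is flagged before it reaches the robot. The two-action domain below is invented for this sketch, not taken from the paper.

```python
# Tiny symbolic plan checker: simulate each action's preconditions/effects
# and flag the first step whose preconditions fail (an "action
# hallucination"). The two-action domain is invented for illustration.
ACTIONS = {
    "pick":  {"pre": {"hand_empty"}, "add": {"holding"},    "del": {"hand_empty"}},
    "place": {"pre": {"holding"},    "add": {"hand_empty"}, "del": {"holding"}},
}

def verify_plan(state, plan):
    """Return (ok, index_of_failing_step_or_None)."""
    state = set(state)
    for i, act in enumerate(plan):
        spec = ACTIONS[act]
        if not spec["pre"] <= state:          # precondition violated
            return False, i
        state = (state - spec["del"]) | spec["add"]
    return True, None

print(verify_plan({"hand_empty"}, ["pick", "place"]))  # valid plan
print(verify_plan({"hand_empty"}, ["place"]))          # rejected at step 0
```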
The ubiquitous computing resources in 6G networks provide ideal environments for fusing large language models (LLMs) with intelligent services through the agent framework. With auxiliary modules and planning cores, LLM-enabled agents can autonomously plan and act to handle diverse environment semantics and user intentions. However, the limited resources of individual network devices significantly hinder the efficient operation of LLM-enabled agents with complex tool calls, highlighting the urgent need for efficient multi-level device collaboration. To this end, we propose a framework and method for an LLM-enabled multi-agent system with dual-loop terminal-edge collaboration in 6G networks. First, the outer loop consists of iterative collaboration between the global agent and multiple sub-agents deployed on edge servers and terminals, where planning capability is enhanced through task decomposition and parallel sub-task distribution. Second, the inner loop uses sub-agents with dedicated roles to cyclically reason about, execute, and replan each sub-task, and incorporates parallel tool-call generation with offloading strategies to improve efficiency. The improved task planning capability and task execution efficiency are validated through a case study in 6G-supported urban safety governance. Finally, open challenges and future directions for 6G networks are thoroughly analyzed, accelerating the advent of the 6G era.
Service robots operating in open environments face two core challenges in semantic task planning: adaptation to dynamic scenes and uncertainty in symbol-to-motion mapping. This paper proposes a multi-agent collaborative closed-loop planning-evaluation framework, an LLM-driven, closed-loop optimization pipeline for task planning and feasibility auditing aimed at robust planning in complex, dynamic settings. The framework adopts a layered architecture with three LLM-based agents: a planning agent, an evaluation (validation) agent, and an assessment agent. Using a unified geometric-semantic-constraint representation, the planner generates symbolic task sequences, maps them to a predefined skill library, and performs self-checks to revise plans. The evaluation agent audits plans against preset criteria, outputs feasibility judgments and corrective suggestions, and iterates with the planner until a feasible solution is obtained. A fine-tuned assessment agent produces quantitative scores by combining task completion, efficiency, and replanning penalties, forming a human preference-aligned evaluation system. Experiments on mobile manipulation in household scenarios show task success rates of 96.9% (simple short tasks), 87.4% (simple long tasks), and 72.4% (complex long tasks), substantially improving over classical symbolic planning (PDDL) and Chain-of-Thought-guided LLM baselines. Ablation studies validate the contribution of each module. The correlation between the assessment agent’s scores and human ratings reaches 0.87. This work provides an explainable, closed-loop optimization paradigm for perception-decision-execution in embodied AI, advancing practical autonomy in open-world robotic operation.
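As a toy stand-in for the assessment agent described above (which is a fine-tuned LLM, not a fixed formula), the snippet below combines task completion, efficiency, and a replanning penalty into one scalar score; the weights and penalty are invented for illustration.

```python
# Toy scoring function in the spirit of the assessment agent: weighted sum of
# task completion and time efficiency, minus a replanning penalty. The real
# agent is a fine-tuned LLM; these weights are illustrative assumptions.
def assess(completed_steps, total_steps, elapsed, budget, replans):
    completion = completed_steps / total_steps       # fraction of task done
    efficiency = max(0.0, 1.0 - elapsed / budget)    # time left in budget
    return round(0.6 * completion + 0.3 * efficiency - 0.05 * replans, 3)

print(assess(10, 10, 30, 60, replans=0))  # clean run: 0.75
print(assess(7, 10, 55, 60, replans=2))   # partial, slow, replanned twice
```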
Multimodal Large Language Models (MLLMs) show strong potential for interpreting and interacting with complex, pixel-rich Graphical User Interface (GUI) environments. However, building agents that are both efficient for high-level tasks and precise for fine-grained interactions remains challenging. GUI agents must perform routine actions efficiently while also handling tasks that demand exact visual grounding, yet existing approaches struggle when accuracy depends on identifying specific interface elements. These MLLMs also remain large and cannot adapt their reasoning depth to the task at hand. In this work, we introduce iSHIFT: Implicit Slow-fast Hybrid Inference with Flexible Tokens, a lightweight agent that integrates latent thinking (implicit chain-of-thought) with a perception control module. iSHIFT enables an MLLM to switch between a slow mode, which leverages detailed visual grounding for high precision, and a fast mode that uses global cues for efficiency. Special perception tokens guide attention to relevant screen regions, allowing the model to decide both how to reason and where to focus. Despite its compact 2.5B size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.
Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.
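The Tips/Shortcuts long-term memory described above can be caricatured as follows: Shortcuts map a recurring subroutine to a stored sequence of atomic operations that can be replayed instead of re-planned, while Tips are free-text lessons fed into future prompts. Class and operation names are invented for illustration, not Mobile-Agent-E's actual API.

```python
# Caricature of a Tips/Shortcuts long-term memory: Tips are free-text lessons
# fed into future prompts; Shortcuts are replayable operation sequences.
# All names and operations here are invented for illustration.
class EvolvingMemory:
    def __init__(self):
        self.tips = []        # general lessons from past tasks
        self.shortcuts = {}   # subroutine name -> list of atomic operations

    def record_shortcut(self, name, ops):
        self.shortcuts[name] = list(ops)

    def plan(self, subgoal):
        # Replay a stored shortcut when one matches; otherwise plan afresh.
        return self.shortcuts.get(subgoal, ["<plan from scratch>"])

mem = EvolvingMemory()
mem.record_shortcut("open_settings", ["tap(home)", "swipe(up)", "tap(settings)"])
print(mem.plan("open_settings"))  # replayed shortcut
print(mem.plan("send_email"))     # unseen subgoal -> fresh planning
```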
The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.
We present OpenDriveVLA, a Vision Language Action model designed for end-to-end autonomous driving, built upon open-source large language models. OpenDriveVLA generates spatially grounded driving actions by leveraging multimodal inputs, including 2D and 3D instance-aware visual representations, ego vehicle states, and language commands. To bridge the modality gap between driving visual representations and language embeddings, we introduce a hierarchical vision language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Furthermore, we incorporate structured agent environment ego interaction modeling into the autoregressive decoding process, enabling the model to capture fine-grained spatial dependencies and behavior-aware dynamics critical for reliable trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question answering tasks. Qualitative analyses further illustrate its capability to follow high-level driving commands and generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving.
The present paper sets forth a hybrid intelligence (HI) framework for collaborative assembly workbenches that integrates human and robotic agents through a shared Perception-Cognition-Action (PCA) cycle. The goal is a prototype HI assembly work-cell that leverages projection-based instruction, large language model (LLM) assistance, real-time activity tracking using convolutional neural networks (CNNs), and robot-side perception and manipulation. The human and the robot function as cognitive agents, each with its own distinct PCA loop; these interconnected loops facilitate situational awareness (SA), task execution, and the representation of shared context. The present study reports on the current system capabilities, details the architectural design, and outlines future extensions towards dynamic task allocation and adaptive collaboration. The objective of this research is to establish a foundation for adaptive, ergonomic, and intelligent assembly systems suitable for industrial small-batch and high-mix low-volume production.
Existing artificial intelligence systems, such as powerful Large Language Models (LLMs), Vision-Language Models (VLMs), and specialized tools like the Segment Anything Model (SAM), have made remarkable progress in specific domains. [1] However, in high-stakes environments marked by volatility, uncertainty, complexity, and ambiguity (VUCA), these architectures are unsuitable for dynamic multimodal data integration and the generation of auditable, actionable results. Traditional generalist agents, such as DeepMind's Gato and Google's Gemini, have attempted to unify these capabilities, but they frequently lack the explicit safety mechanisms and fine-grained control required for critical applications. [6] This paper presents the Unified Multimodal Cognitive Architecture (UMCA), a novel framework that integrates perception, reasoning, and action into a single, end-to-end pipeline. The architecture is built around three key innovations: a Latent Concept Model (LCM) for deep cross-modal alignment, a dynamic Mixture-of-Experts (MoE) routing layer for adaptive, resource-aware computation, and a Language-Action Model (LAM) for creating structured, verifiable action graphs. We demonstrate UMCA's superior performance by conducting extensive benchmarks on a variety of crisis response tasks, such as multimodal question answering, image-grounded summarization, and resource routing. Our comparative and ablation studies formally validate the importance of each architectural component, demonstrating that UMCA outperforms cutting-edge baselines by a significant margin. The UMCA framework represents a viable path to developing robust, explainable, and ethically grounded AI systems for high-stakes societal applications, directly supporting the global collaboration principles outlined in Sustainable Development Goal 17 (SDG-17).
Behavior analysis across species represents a fundamental challenge in neuroscience, psychology, and ethology, typically requiring extensive expert knowledge and labor-intensive processes that limit research scalability and accessibility. We introduce BehaveAgent, an autonomous multimodal AI agent designed to automate behavior analysis from video input without retraining or manual intervention. Unlike conventional methods that require manual behavior annotation, video segmentation, and task-specific model training, BehaveAgent leverages the reasoning capabilities of multimodal large language models (LLMs) to generalize across novel behavioral domains without the need for additional training. It integrates LLMs, vision-language models (VLMs), and large-scale visual grounding modules, orchestrated through a multimodal context memory and goal-directed attention mechanism, to enable robust zero-shot visual reasoning across species and experimental paradigms, including plants, insects, rodents, primates, and humans. Upon receiving a video input, BehaveAgent autonomously identifies the correct analysis strategy and performs end-to-end behavior analysis and interpretation without human supervision. Leveraging vision-language representations, it performs general-purpose tracking, pose estimation, and segmentation. We demonstrate BehaveAgent’s universal applicability to autonomously (1) identify the behavioral paradigm and develop an action plan specialized for the identified paradigm, (2) identify relevant subjects and objects, (3) track those features, (4) identify behavioral sequences with explicit reasoning, (5) generate and execute code for targeted analysis, and (6) generate comprehensive research reports that integrate behavioral findings with relevant scientific literature. Through interpretable agentic reasoning, BehaveAgent makes its internal decision-making process transparent, clarifying why particular features are tracked or behaviors inferred. By reducing the time and expertise required for behavior analysis, BehaveAgent introduces a scalable, generalizable, and explainable paradigm for advancing biological and behavioral research.
The rise of agentic artificial intelligence (Agentic AI) marks a transition from systems that optimize externally specified objectives to systems capable of representing, evaluating, and revising their own goals. Whereas earlier AI architectures executed fixed task specifications, agentic systems maintain recursive loops of perception, evaluation, goal-updating, and action, allowing them to sustain and adapt purposive activity across temporal and organizational scales. This paper argues that Agentic AI is not an incremental extension of large language models (LLMs) or autonomous agents in the sense we know it from classical AI and multi-agent systems, but a reconstitution of agency itself within computational substrates. Building on the logic of coordination, delegation, and self-regulation developed in early agent-based process management systems, we propose a general theory of synthetic purposiveness, where agency emerges as a distributed and self-maintaining property of artificial systems operating in open-ended environments. We develop the concept of synthetic teleology—the engineered capacity of artificial systems to generate and regulate goals through ongoing self-evaluation—and we formalize its dynamics through a recursive goal-maintenance equation. We further outline design patterns, computational semantics, and measurable indicators of purposiveness (e.g., teleological coherence, adaptive recovery, and reflective efficiency), providing a foundation for the systematic design and empirical investigation of agentic behaviour. By reclaiming agency as a first-class construct in artificial intelligence, we argue for a paradigm shift from algorithmic optimization toward goal-directed reasoning and purposive orchestration—one with far-reaching epistemic, societal, and institutional consequences.
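The paper formalizes a recursive goal-maintenance equation that is not reproduced here. As a loose, hedged illustration of the loop it describes (perception, evaluation, goal-updating, action), the following toy sketch revises a goal whenever its evaluation against observations falls below a threshold; the evaluation function, revision rule, and all numbers are invented:

```python
# Hedged toy rendering of a recursive goal-maintenance loop
# (not the paper's formalization; all dynamics are made up).

def maintain_goal(goal, observations, evaluate, revise, threshold=0.5):
    """Track a goal across observations, revising it when it no longer fits."""
    history = [goal]
    for obs in observations:
        if evaluate(goal, obs) < threshold:   # goal misaligned with evidence
            goal = revise(goal, obs)          # self-generated goal update
        history.append(goal)
    return history

# Toy instantiation: a numeric target tracked against a drifting signal.
evaluate = lambda g, o: 1.0 - min(abs(g - o) / 10.0, 1.0)
revise = lambda g, o: (g + o) / 2             # move the goal toward evidence
trace = maintain_goal(0.0, [1, 2, 9, 9], evaluate, revise)
```

Indicators like the paper's "teleological coherence" could, in this toy setting, be read off the trace (how rarely and how smoothly the goal is revised).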
Modern enterprise data platforms increasingly operate under conditions of extreme scale, heterogeneity, and uncertainty. Traditional data pipeline orchestration frameworks rely on static Directed Acyclic Graphs (DAGs) and deterministic retry semantics, which are fundamentally misaligned with environments characterized by schema volatility, infrastructure churn, and non-stationary workloads. This paper presents a comprehensive architectural model for Multi-Agent Orchestrated Data Pipelines (MODP), where autonomous agents replace task-centric orchestration with goal-driven reasoning. The architecture integrates four primary subsystems: an Agent Orchestrator, a Knowledge Plane grounded in Retrieval-Augmented Generation (RAG), a Unified Feature Store, and a Causal Tracing Engine. Together, these components enable self-healing execution, dynamic schema adaptation, and causal observability across the data lifecycle. Empirical evidence from large-scale distributed systems research demonstrates that agent-based orchestration improves fault tolerance, reduces mean time to recovery (MTTR), and significantly enhances developer productivity. This work formalizes agentic data engineering as a shift from procedural execution to intent-based systems, positioning autonomous multi-agent orchestration as a foundational design principle for next-generation data platforms.
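The contrast between deterministic retries and goal-driven self-healing can be made concrete with a small sketch. This is not the MODP implementation; the function names, schema-matching heuristic, and record layout are all invented. Instead of retrying a failed step unchanged, the agent inspects the failure, learns a column rename, and re-executes:

```python
# Hedged sketch of self-healing schema adaptation (invented names/heuristic).

def run_with_self_healing(task, record, schema_map, max_attempts=3):
    """Run task on record, learning column renames on KeyError failures."""
    for _ in range(max_attempts):
        try:
            # apply the learned rename map before executing the step
            return task({schema_map.get(k, k): v for k, v in record.items()})
        except KeyError as missing:
            # goal-driven adaptation: match the missing field to an existing
            # column by normalized name instead of retrying blindly
            field = missing.args[0]
            norm = field.replace("_", "").lower()
            candidates = [k for k in record
                          if norm == k.replace("_", "").lower()]
            if not candidates:
                raise
            schema_map[candidates[0]] = field
    raise RuntimeError("could not heal pipeline step")

def load_user(row):
    return {"user_id": row["user_id"], "email": row["email"]}

schema_map = {}
result = run_with_self_healing(load_user, {"UserId": 7, "email": "a@b.c"},
                               schema_map)
```

A real Knowledge Plane would consult lineage metadata or a RAG index rather than a string heuristic, but the control flow (fail, diagnose, adapt, re-execute) is the point.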
Achieving fully autonomous exploration and navigation remains a critical challenge in robotics, requiring integrated solutions for localisation, mapping, decision-making and motion planning. Existing approaches either rely on strict navigation rules lacking adaptability or on pre-training, which requires large datasets. These AI methods are often computationally intensive or based on static assumptions, limiting their adaptability in dynamic or unknown environments. This paper introduces a bio-inspired agent based on the Active Inference Framework (AIF), which unifies mapping, localisation, and adaptive decision-making for autonomous navigation, including exploration and goal-reaching. Our model creates and updates a topological map of the environment in real-time, planning goal-directed trajectories to explore or reach objectives without requiring pre-training. Key contributions include a probabilistic reasoning framework for interpretable navigation, robust adaptability to dynamic changes, and a modular ROS2 architecture compatible with existing navigation systems. Our method was tested in simulated and real-world environments. The agent successfully explores large-scale simulated environments and adapts to dynamic obstacles and drift, proving to be comparable to other exploration strategies such as Gbplanner, FAEL and Frontiers. The result is a scalable and transparent approach to navigating complex, unstructured environments.
In this paper, we introduce SYMPLEX (Symbolic Policy Learning from Experts/Exploration), an interactive framework that learns complex hierarchies of behavioral norms as interpretable logical constraints through a combination of autonomous exploration and expert imitation. The approach ensures that learned constraints are interpretable for human oversight, generalizable for transfer to similar environments, and defeasible - enabling adaptation to novel behaviors and facilitating the learning of exceptions in dynamic domains. We demonstrate the utility of our approach in a traffic simulation environment using a neuro-symbolic implementation of SYMPLEX that interleaves a Deep Q-Learning (DQL) component for policy optimization through goal-directed domain exploration, with interactive Inductive Logic Programming (ILP) for example-based symbolic constraint generation. At each iteration, inferred constraints are imposed on the DQL via penalty terms appended to the reward function, allowing the system to form exceptions to previously-learned constraints. We illustrate SYMPLEX's ability to identify concise human-readable constraints in complex environments, and evidence the efficacy of learning norms as defeasible constraints. Additionally, we exemplify the benefits of using an interactive rule induction system in expediting convergence to accurate norms.
Autonomous Valet Parking (AVP) requires planning under partial observability, where parking spot availability evolves as dynamic agents enter and exit spots. Existing approaches either rely only on instantaneous spot availability or make static assumptions, thereby limiting foresight and adaptability. We propose an approach that estimates the probability of future spot occupancy by distinguishing initially vacant and occupied spots while leveraging nearby dynamic agent motion. At its core is a probabilistic estimator that integrates partial, noisy observations from a limited Field-of-View with the evolving uncertainty of unobserved spots. Coupled with the estimator, we design a strategy planner that balances goal-directed parking maneuvers with exploratory navigation based on information gain, and incorporates wait-and-go behaviors at promising spots. Through randomized simulations emulating large parking lots, we demonstrate that our framework significantly improves parking efficiency and trajectory smoothness over existing approaches, while maintaining safety margins.
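The kind of probabilistic spot-occupancy estimator described above can be sketched with a simple Bayesian update. This is a minimal illustration, not the paper's estimator: the sensor accuracy, stationary occupancy rate, and decay rate are invented. Spots inside the field of view get a noisy Bayesian update; unobserved spots drift toward a lot-wide stationary rate as uncertainty grows:

```python
# Hedged sketch of a spot-occupancy estimator (all parameters invented).

def update_occupancy(prior, observed_free, in_fov,
                     p_correct=0.9, stationary=0.6, decay=0.1):
    """prior: dict spot -> P(occupied). Returns the posterior dict."""
    post = {}
    for spot, p in prior.items():
        if spot in in_fov:
            # Bayes rule with a sensor that is right with probability p_correct
            like_occ = (1 - p_correct) if spot in observed_free else p_correct
            like_free = p_correct if spot in observed_free else (1 - p_correct)
            num = like_occ * p
            post[spot] = num / (num + like_free * (1 - p))
        else:
            # unobserved: belief decays toward the lot-wide occupancy rate
            post[spot] = p + decay * (stationary - p)
    return post

prior = {"A1": 0.5, "A2": 0.5, "B1": 0.2}
post = update_occupancy(prior, observed_free={"A1"}, in_fov={"A1", "A2"})
```

A planner could then rank spots by posterior vacancy probability, weighing goal-directed approach against the information gain of visiting unobserved aisles.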
This paper extends Rodney Brooks’ subsumption architecture into the era of Agentic AI by replacing its priority arbiter with a Generative Orchestrator that performs semantic mediation—interpreting heterogeneous agent outputs and integrating them into a coherent action rather than merely arbitrating among them. Brooks’ original model (1986) demonstrated that autonomous behavior can emerge from parallel reactive layers without symbolic representation, establishing principles later recognized as foundational to agentic systems: environmental responsiveness, autonomy, and goal-directed action. Contemporary Agentic AI, however, requires capabilities beyond mechanical response—decision-making, adaptive strategy, and goal pursuit. We therefore reinterpret subsumption layers as four interacting agent types: reflex, model-based, goal-based, and utility-based, coordinated through semantic mediation. The Generative Orchestrator employs large language models not for content generation but for decision synthesis, enabling integrative agentic behavior. This approach merges real-time responsiveness with interpretive capacity for learning, reasoning, and explanation. An autonomous driving case study demonstrates how the architecture sustains behavioral autonomy while generating human-interpretable rationales for its actions. Validation was conducted through a Python-based proof-of-concept on an NVIDIA platform, reproducing the scenario to evaluate and confirm the architecture. This framework delineates a practical pathway toward advancing autonomous agents from reactive control to fully Agentic AI systems capable of operating in open, uncertain environments.
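The shift from priority arbitration to semantic mediation can be illustrated with a toy sketch. This is not the paper's architecture: the agent functions, state fields, and merge rule are invented. Four agent types each propose an action with a rationale; the mediator keeps safety-critical proposals dominant but merges the rationales instead of discarding the losing layers:

```python
# Hedged sketch: four agent types plus a mediating orchestrator
# (invented scenario; real semantic mediation would use an LLM).

def reflex_agent(state):
    return ("brake", "obstacle ahead") if state["obstacle"] else None

def model_agent(state):
    return ("slow", "wet road model") if state["wet"] else None

def goal_agent(state):
    return ("keep_lane", "route to goal")

def utility_agent(state):
    return ("eco_mode", "minimize energy")

def mediate(state, agents):
    proposals = [p for a in agents if (p := a(state))]
    actions = [a for a, _ in proposals]
    # safety-critical proposals dominate; rationales are merged, not discarded
    primary = "brake" if "brake" in actions else actions[0]
    rationale = "; ".join(r for _, r in proposals)
    return primary, rationale

state = {"obstacle": True, "wet": False}
action, why = mediate(state, [reflex_agent, model_agent,
                              goal_agent, utility_agent])
```

The merged rationale string stands in for the human-interpretable explanation the Generative Orchestrator is said to produce.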
Autonomous navigation in unfamiliar environments requires robots to simultaneously explore, localise, and plan under uncertainty, without relying on predefined maps or extensive training. We present a biologically inspired, Active Inference-based framework, Active Inference MAPping and Planning (AIMAPP). This model unifies mapping, localisation, and decision-making within a single generative model. Inspired by hippocampal navigation, it uses topological reasoning, place-cell encoding, and episodic memory to guide behaviour. The agent builds and updates a sparse topological map online, learns state transitions dynamically, and plans actions by minimising Expected Free Energy. This allows it to balance goal-directed and exploratory behaviours. We implemented a ROS-compatible navigation system that is sensor and robot-agnostic, capable of integrating with diverse hardware configurations. It operates in a fully self-supervised manner, is resilient to drift, and supports both exploration and goal-directed navigation without any pre-training. We demonstrate robust performance in large-scale real and simulated environments against state-of-the-art planning models, highlighting the system's adaptability to ambiguous observations, environmental changes, and sensor noise. The model offers a biologically inspired, modular solution to scalable, self-supervised navigation in unstructured settings. AIMAPP is available at https://github.com/decide-ugent/AIMAPP.
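Action selection by minimising Expected Free Energy can be shown numerically. This is a toy rendering, not AIMAPP's code: the decomposition used here (risk as KL divergence from preferred outcomes, plus an ambiguity term) is one common textbook form, and every number is made up:

```python
# Toy Expected Free Energy calculation (invented numbers, simplified form).
import math

def expected_free_energy(predicted, preferred, ambiguity):
    """EFE = risk (KL from preferred outcomes) + expected ambiguity."""
    risk = sum(p * math.log(p / q)
               for p, q in zip(predicted, preferred) if p > 0)
    return risk + ambiguity

preferred = [0.9, 0.1]                  # the agent prefers outcome 0 (goal)
actions = {
    "explore": ([0.5, 0.5], 0.2),       # uncertain but informative
    "go_goal": ([0.8, 0.2], 0.4),       # likely reaches the goal
    "stay":    ([0.1, 0.9], 0.1),       # safe but away from the goal
}
scores = {a: expected_free_energy(p, preferred, amb)
          for a, (p, amb) in actions.items()}
best = min(scores, key=scores.get)      # lowest EFE wins
```

With these numbers the goal-directed action wins, but raising its ambiguity or weakening the preference would tip the balance toward exploration, which is exactly the trade-off the abstract describes.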
Large Language Model (LLM) agents represent a promising shift in human-AI interaction, moving beyond passive prompt-response systems to autonomous agents capable of reasoning, planning, and goal-directed action. While LLM agents are technically capable of performing a broad range of tasks, not all of these capabilities translate into meaningful usability. This position paper argues that the central question for LLM agent usability is no longer whether a task can be automated, but whether it delivers sufficient Agentic Return on Investment (Agentic ROI). Agentic ROI reframes evaluation from raw performance to a holistic, utility-driven perspective, guiding when, where, and for whom LLM agents should be deployed. Despite widespread application in high-ROI tasks like coding and scientific research, we identify a critical usability gap in mass-market, everyday applications. To address this, we propose a zigzag developmental trajectory: first scaling up to improve information gain and time savings, then scaling down to reduce cost. We present a strategic roadmap across these phases to make LLM agents truly usable, accessible, and scalable in real-world applications.
The unique integration of Large Language Models (LLMs) as the reasoning center inside Agentic Artificial Intelligence (AAI) systems exposes new, systematic hazards that current research is totally ill-fitted to handle. The lack of a consistent, thorough framework that simultaneously addresses the basic problems of LLM unreliability as they spread and grow in autonomous, goal-directed, multi-agent systems reveals a major gap in the literature. These issues reach important, under-investigated fields like avoiding goal misalignment, lowering the danger of opaque decision-making, and guaranteeing strong long-term safety in complicated systems. Clearly establishing moral and legal responsibility and existing means for minimal human supervision are clearly lacking, therefore creating a hazardous hole as these automated systems approach actual deployment. This article provides an innovative, unified framework for responsible development, thus directly tackling this critical need. Specifically intended for LLM-powered agentic systems operating in challenging, high-stakes contexts, the author has presented the Trust, Risk, and Safety Management (TRiSM) governance framework. The core innovation of the framework is the Goal-Constraint Alignment (GCA) mechanism, which dynamically monitors and constrains LLM behavior inside set ethical and safety envelopes, hence acting as a dynamic barrier against both planned and unexpected goal misalignment. Furthermore, we install a Decentralized Oversight Ledger (DOL) to improve transparency and allow realistic accountability. The DOL offers real-time, tamper-proof, auditable tracking of all multi-agent interactions and decisions, therefore enhancing human oversight and establishing a clear chain of custody for agent behavior, which is vital to determine legal responsibility. 
Verifying the effectiveness of the framework against a fresh collection of high-stakes, multi-agent coordination scenarios shows a major improvement in systematic safety and a significant decrease in catastrophic failures compared to present baseline systems. This study offers the essential technical and governance structure needed for the responsible and safe distribution of next-generation autonomous artificial intelligence.
The advancement of smart factories and multi-stage logistics has accelerated the need for efficient and adaptive coordination of autonomous mobile robots (AMRs). While most existing research has focused on homogeneous automated guided vehicle (AGV) systems, the control of heterogeneous AMRs remains underexplored. This paper proposes a multi-agent reinforcement learning (MARL)-based control framework tailored for heterogeneous AMRs in smart factory environments. The framework incorporates AMR-specific characteristics such as field of view (FOV) into the observation space and models decision-making using a multi-discrete action structure with action masking to ensure feasibility. The reward function promotes efficient task execution by encouraging successful pickup, delivery, and goal-directed movement. Simulation results demonstrate that the proposed approach achieves stable learning, improves delivery completion rates, and reduces task execution time, validating its effectiveness in heterogeneous and complex factory settings.
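Action masking, as used above to keep the policy feasible, has a standard implementation idea: infeasible actions get their logits pushed to negative infinity before the argmax (or softmax), so they can never be selected. The setting below is invented for illustration, not taken from the paper:

```python
# Minimal action-masking sketch (invented AMR scenario).
import math

def masked_argmax(logits, feasible):
    """Select the best action among feasible ones only."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, feasible)]
    return max(range(len(masked)), key=lambda i: masked[i])

# An AMR facing a wall: "forward" (index 0) is infeasible, so it is masked
# out even though it carries the highest raw logit.
logits = [2.0, 0.5, -0.3, 1.1]          # forward, left, right, wait
feasible = [False, True, True, True]
action = masked_argmax(logits, feasible)
```

In a multi-discrete action structure, the same mask-then-select step is applied independently per action dimension.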
Multi-agent Epistemic Planning (MEP) is an autonomous planning framework for reasoning about both the physical world and the beliefs of agents, with applications in domains where information flow and awareness among agents are critical. The richness of MEP requires states to be represented as Kripke structures, i.e., directed labeled graphs. This representation limits the applicability of existing heuristics, hindering the scalability of epistemic solvers, which must explore an exponential search space without guidance, resulting often in intractability. To address this, we exploit Graph Neural Networks (GNNs) to learn patterns and relational structures within epistemic states, to guide the planning process. GNNs, which naturally capture the graph-like nature of Kripke models, allow us to derive meaningful estimates of state quality -- e.g., the distance from the nearest goal -- by generalizing knowledge obtained from previously solved planning instances. We integrate these predictive heuristics into an epistemic planning pipeline and evaluate them against standard baselines, showing improvements in the scalability of multi-agent epistemic planning.
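The idea of deriving a state-quality estimate from a Kripke-style graph can be shown with a hand-rolled, single round of message passing. This is not the paper's GNN: the graph, features, aggregation, and readout weights are all invented, and a trained model would learn the weights from solved planning instances:

```python
# Hand-rolled one-round message passing over a tiny Kripke-like graph
# (invented example; a real GNN would learn its parameters).

def message_pass(features, edges):
    """Each node averages its own feature with its out-neighbours'."""
    out = {}
    for node, feat in features.items():
        neigh = [features[v] for u, v in edges if u == node]
        msgs = [feat] + neigh
        out[node] = [sum(vals) / len(msgs) for vals in zip(*msgs)]
    return out

def readout(features, weights=(1.0, -0.5)):
    """Mean-pool node embeddings, then apply a (toy) linear heuristic head."""
    dims = len(weights)
    pooled = [sum(f[i] for f in features.values()) / len(features)
              for i in range(dims)]
    return sum(w * p for w, p in zip(weights, pooled))

# Three possible worlds with directed accessibility edges, 2 features each.
feats = {"w0": [1.0, 0.0], "w1": [0.0, 1.0], "w2": [1.0, 1.0]}
edges = [("w0", "w1"), ("w1", "w2"), ("w2", "w0")]
h = readout(message_pass(feats, edges))
```

The scalar `h` plays the role of the learned distance-to-goal estimate that guides the epistemic planner's search.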
The rapid rise of wind energy introduces persistent challenges related to variability, control, and operational reliability. While traditional machine learning supports forecasting and optimization, its scope is narrow and typically operator-dependent. We propose an agentic AI paradigm in which autonomous, goal-directed agents interact with turbines, the wind farm, and the grid to manage complexity in real time. A multi-agent architecture coordinates turbine-, farm-, and grid-level decisions to enhance efficiency and resilience. We present three concise case studies illustrating optimization at complementary scales. Case 1 (turbine-level): an AI agent adjusts upstream derating setpoints to mitigate wake effects and increase overall farm power production, at the cost of reduced power from the derated turbines. Case 2 (farm-level): a coordinating agent aligns multiple turbines to meet plant-wide energy and reliability objectives under operational and environmental constraints. Case 3 (grid-level): a system agent balances wind with other generators and consumer demand while honoring transmission limits and minimizing cost. Together, these cases show how local actions and global coordination increase energy yield, reduce structural loading, and improve reliability. Applications span adaptive forecasting, wake management, predictive maintenance, market participation, and cyber-physical security. We outline research needs in data quality, interoperability, safety, and regulation, and advocate hybrid designs that fuse reinforcement learning with digital twins to advance intelligent wind infrastructures.
No abstract available
The development of autonomous agents has seen a revival of enthusiasm due to the emergence of LLMs, such as GPT-4o. Deploying these agents in environments where they coexist with humans (e.g., as domestic assistants) requires special attention to trustworthiness and explainability. However, the use of LLMs and other deep learning models still does not resolve these key issues. Deep learning systems may hallucinate, be unable to justify their decisions as black boxes, or perform badly on unseen scenarios. In this work, we propose the use of s(CASP), a goal-directed common sense reasoner based on Answer Set Programming, to break down the high-level tasks of an autonomous agent into mid-level instructions while justifying the selection of these instructions. To validate its use in real applications we present a framework that integrates the reasoner into the VirtualHome simulator and compares its accuracy with GPT-4o, running some of the "real" use cases available in the domestic environments of VirtualHome. Additionally, since experiments with VirtualHome have shown the need to reduce the response time (which increases as the agent's decision space grows), we have proposed and evaluated a series of optimizations based on program analysis that exploit the advantages of the top-down execution of s(CASP).
This work presents a novel computer architecture that extends the Von Neumann model with a dedicated Reasoning Unit (RU) to enable native artificial general intelligence capabilities. The RU functions as a specialized co-processor that executes symbolic inference, multi-agent coordination, and hybrid symbolic-neural computation as fundamental architectural primitives. This hardware-embedded approach allows autonomous agents to perform goal-directed planning, dynamic knowledge manipulation, and introspective reasoning directly within the computational substrate at system scale. The architecture incorporates a reasoning-specific instruction set architecture, parallel symbolic processing pipelines, agent-aware kernel abstractions, and a unified memory hierarchy that seamlessly integrates cognitive and numerical workloads. Through systematic co-design across hardware, operating system, and agent runtime layers, this architecture establishes a computational foundation where reasoning, learning, and adaptation emerge as intrinsic execution properties rather than software abstractions, potentially enabling the development of general-purpose intelligent machines.
Autonomous agents often require multiple strategies to solve complex tasks, but determining when to switch between strategies remains challenging. This research introduces a reinforcement learning technique to learn switching thresholds between two orthogonal navigation policies. Using maze navigation as a case study, this work demonstrates how an agent can dynamically transition between systematic exploration (coverage) and goal-directed pathfinding (convergence) to improve task performance. Unlike fixed-threshold approaches, the agent uses Q-learning to adapt switching behavior based on coverage percentage and distance to goal, requiring only minimal domain knowledge: maze dimensions and target location. The agent does not require prior knowledge of wall positions, optimal threshold values, or hand-crafted heuristics; instead, it discovers effective switching strategies dynamically during each run. The agent discretizes its state space into coverage and distance buckets, then adapts which coverage threshold (20-60%) to apply based on observed progress signals. Experiments across 240 test configurations (4 maze sizes from 16×16 to 128×128 × 10 unique mazes × 6 agent variants) demonstrate that adaptive threshold learning outperforms both single-strategy agents and fixed 40% threshold baselines. Results show 23-55% improvements in completion time, 83% reduction in runtime variance, and 71% improvement in worst-case scenarios. The learned switching behavior generalizes within each size class to unseen wall configurations. Performance gains scale with problem complexity: 23% improvement for 16×16 mazes, 34% for 32×32, and 55% for 64×64, demonstrating that as the space of possible maze structures grows, the value of adaptive policy selection over fixed heuristics increases proportionally.
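The core mechanism (tabular Q-learning over coverage and distance buckets, with actions that pick a switching threshold) can be compressed into a short sketch. This is not the paper's setup: the bucket sizes, reward, and one-step episode structure are all toy simplifications invented for illustration:

```python
# Hedged sketch of learning a switching threshold with tabular Q-learning
# (toy dynamics, not the paper's experiments).
import random

THRESHOLDS = [20, 30, 40, 50, 60]       # candidate coverage switch points (%)

def bucket(coverage, distance):
    """Discretize (coverage %, distance) into a coarse state."""
    return (int(coverage // 20), int(distance // 25))

def q_learn(episodes, reward_fn, alpha=0.5, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):
        state = bucket(rng.uniform(0, 99), rng.uniform(0, 99))
        q = Q.setdefault(state, [0.0] * len(THRESHOLDS))
        if rng.random() < eps:                          # explore
            a = rng.randrange(len(THRESHOLDS))
        else:                                           # exploit
            a = max(range(len(THRESHOLDS)), key=q.__getitem__)
        # one-step terminal update with a toy reward
        q[a] += alpha * (reward_fn(state, THRESHOLDS[a]) - q[a])
    return Q

# Toy reward: when the goal is near (low distance bucket), switching early
# (low threshold) to goal-directed pathfinding pays off.
def reward(state, threshold):
    _, dist_bucket = state
    return 1.0 if (dist_bucket <= 1) == (threshold <= 30) else -1.0

Q = q_learn(2000, reward)
near_state = bucket(10, 10)
best = THRESHOLDS[max(range(len(THRESHOLDS)), key=Q[near_state].__getitem__)]
```

Even with this crude reward, the table learns to prefer an early switch when the goal is close, mirroring the adaptive-threshold behavior the abstract reports.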
The landscape of artificial intelligence agent architecture is undergoing a revolutionary paradigm shift from static classic inference systems to dynamic, goal-directed architectures that support autonomous reasoning and sophisticated task fulfillment. Modern agent architectures require high-level decision-making processes involving core model selection, system topology, and robust governance mechanisms. The transition from single-agent deployments to multi-agent systems brings in critical coordination challenges while facilitating unprecedented scalability and task specialization. Memory architecture design becomes an imperative element with the need to integrate short-term working memory, long-term episodic storage, and hybrid vector-graph database implementations to host a wide range of cognitive functions. Vector databases enable semantic memory operations with retrieval accuracies in excess of ninety percent, while structured representations allow for procedural knowledge management and temporal sequence processing. Production deployment requires pervasive observability frameworks with decision trace logging, behavior monitoring, and ongoing performance assessment in multiple operational planes. Governance controls provide ethical alignment through content filtering, action approval workflows, and runtime policy enforcement while sustaining optimal system performance. The architectural choices have a direct effect on reliability, scalability, and operational efficiency, with properly designed systems exhibiting substantially higher task fulfillment rates, better response correctness, and lower operational costs than conventional monolithic approaches.
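The semantic memory retrieval described above can be illustrated with a minimal cosine-similarity lookup. This is a hand-made sketch, not a production design: the three-dimensional "embeddings" are invented, whereas a real system would use a learned embedding model and a vector database:

```python
# Minimal semantic-memory retrieval sketch (tiny hand-made vectors).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stored memories mapped to toy embedding vectors (invented values).
memory = {
    "reset the router":        [0.9, 0.1, 0.0],
    "schedule a meeting":      [0.0, 0.8, 0.3],
    "summarize quarterly KPI": [0.1, 0.2, 0.9],
}

def retrieve(query_vec, k=1):
    """Return the k stored memories most similar to the query vector."""
    ranked = sorted(memory,
                    key=lambda m: cosine(memory[m], query_vec),
                    reverse=True)
    return ranked[:k]

hits = retrieve([0.85, 0.05, 0.1])
```

The high retrieval accuracies the passage cites depend on embedding quality and index design; this sketch only shows the similarity-ranking step itself.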
Task planning for autonomous agents has typically been done using deep learning models and simulation-based reinforcement learning. This research proposes combining inductive learning techniques with goal-directed answer set programming to increase the explainability and reliability of systems for task breakdown and completion. Preliminary research has led to the creation of a Python harness that utilizes s(CASP) to solve task problems in a computationally efficient way. Although this research is in the early stages, we are exploring solutions to complex problems in simulated task completion.
Agentic AI integrates multi-agent system technologies with conventional distributed systems to create agents capable of autonomous, goal-driven decisions. Deploying multiple agents in restricted, complex, distributed environments requires collaboration and coordination to achieve individual and collective objectives. The widespread adoption of distributed applications-including the Internet, mobile networks, peer-to-peer and volunteer computing, and sensor networks-has increased interest in empowering devices to operate with greater autonomy. A distributed system comprises multiple hosts that cooperate to appear as a single system, offering extensive resources, high availability, and scalability. Multi-agent systems represent another form of distributed systems comprising autonomous, interactive agents that collaborate or compete to solve complex problems exceeding individual capabilities. Agentic AI augments multi-agent systems with reasoning abilities that enable the selection and pursuit of high-level human-defined targets comprehensible by the agents. While some agents receive training for machine learning in multi-agent reinforcement learning settings, such training yields only low-level skills. This work addresses the problem (distributed systems need goal-oriented autonomy), the approach (hierarchical multi-agent reinforcement learning with centralized training and decentralized execution, CTDE), and the contribution (an architecture, a simulation, and a conflict-resolution mechanism).
GUI automation faces critical challenges in dynamic environments. MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge. Traditional fine-tuning methods are costly for app-specific knowledge updates. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms: (1) Autonomous Exploration of Function-aware Trajectory. To comprehensively cover all application functionalities, we design a Function-aware Task Goal Generator that automatically constructs exploration goals by analyzing GUI structural information (e.g., screenshots and activity hierarchies). This enables systematic exploration to collect diverse trajectories. (2) Unsupervised Mining of Transition-aware Knowledge. To establish precise screen-operation logic, we develop a Transition-aware Knowledge Extractor that extracts effective screen-operation logic through unsupervised analysis of state transitions in structured interaction triples (observation, action, outcome). This eliminates the need for human involvement in knowledge extraction. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents. It requires no parameter updates for new apps. GUI-explorer is open-sourced and publicly available at https://github.com/JiuTian-VL/GUI-explorer.
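The unsupervised mining of (observation, action, outcome) triples can be sketched as a simple frequency aggregation; the `mine_transition_knowledge` helper and the screen names below are illustrative assumptions, not GUI-explorer's actual extractor.

```python
from collections import defaultdict

def mine_transition_knowledge(triples):
    """Aggregate (observation, action) pairs to their most frequent outcome."""
    counts = defaultdict(lambda: defaultdict(int))
    for obs, action, outcome in triples:
        counts[(obs, action)][outcome] += 1
    # keep the dominant outcome as the learned screen-operation rule
    return {key: max(out, key=out.get) for key, out in counts.items()}

triples = [
    ("login_screen", "tap_submit", "home_screen"),
    ("login_screen", "tap_submit", "home_screen"),
    ("login_screen", "tap_submit", "error_dialog"),
]
kb = mine_transition_knowledge(triples)
assert kb[("login_screen", "tap_submit")] == "home_screen"
```

No labels are required: the knowledge base emerges purely from observed state transitions.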
Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
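The GRPO update mentioned above relies on group-relative advantages: each sampled trajectory's reward is normalized against the statistics of its own sampling group rather than a learned critic. A minimal sketch of that normalization step (the function name and toy rewards are hypothetical):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation,
    the group-relative baseline at the heart of GRPO."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# four rollouts of one task: successes get positive advantage, failures negative
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
assert adv[0] > 0 > adv[1]
```

These advantages then weight the policy-gradient update on successful actions, while failure actions can be penalized via the adversarial imitation term the paper describes.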
Object search in large-scale, unstructured environments remains a fundamental challenge in robotics, particularly in dynamic or expansive settings such as outdoor autonomous exploration. This task requires robust spatial reasoning and the ability to leverage prior experiences. While Large Language Models (LLMs) offer strong semantic capabilities, their application in embodied contexts is limited by a grounding gap in spatial reasoning and insufficient mechanisms for memory integration and decision consistency. To address these challenges, we propose GET (Goal-directed Exploration and Targeting), a framework that enhances object search by combining LLM-based reasoning with experience-guided exploration. At its core is DoUT (Diagram of Unified Thought), a reasoning module that facilitates real-time decision-making through a role-based feedback loop, integrating task-specific criteria and external memory. For repeated tasks, GET maintains a probabilistic task map based on a Gaussian Mixture Model, allowing for continual updates to object-location priors as environments evolve. Experiments conducted in real-world, large-scale environments demonstrate that GET improves search efficiency and robustness across multiple LLMs and task settings, significantly outperforming heuristic and LLM-only baselines. These results suggest that structured LLM integration provides a scalable and generalizable approach to embodied decision-making in complex environments.
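The probabilistic task map can be approximated with a small Gaussian-mixture scoring function. This pure-Python sketch uses isotropic 2-D components and hypothetical sightings; it is not GET's actual model, which updates the mixture online.

```python
import math

def component_density(pos, mean, var):
    """Isotropic 2-D Gaussian density."""
    d2 = (pos[0] - mean[0]) ** 2 + (pos[1] - mean[1]) ** 2
    return math.exp(-d2 / (2 * var)) / (2 * math.pi * var)

def location_prior(pos, components):
    """Mixture prior for one object class; components: (weight, mean, var)."""
    return sum(w * component_density(pos, m, v) for w, m, v in components)

# past sightings cluster the prior around (1, 1) and (5, 5)
mixture = [(0.7, (1.0, 1.0), 0.5), (0.3, (5.0, 5.0), 0.5)]
assert location_prior((1.0, 1.0), mixture) > location_prior((3.0, 3.0), mixture)
```

An exploration policy can then rank candidate waypoints by this prior, steering the search toward locations where the object was found before.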
Real-time autonomous systems utilize multi-layer computational frameworks to perform critical tasks such as perception, goal finding, and path planning. Traditional methods implement perception using occupancy grid mapping (OGM), segmenting the environment into discretized cells with probabilistic information. This classical approach is well-established and provides a structured input for downstream processes like goal finding and path planning algorithms. Recent approaches leverage a biologically inspired mathematical framework known as vector symbolic architectures (VSA), commonly known as hyperdimensional computing, to perform probabilistic OGM in hyperdimensional space. This approach, VSA-OGM, provides native compatibility with spiking neural networks, positioning VSA-OGM as a potential neuromorphic alternative to conventional OGM. However, for large-scale integration, it is essential to assess the performance implications of VSA-OGM on downstream tasks compared to established OGM methods. This study examines the efficacy of VSA-OGM against a traditional OGM approach, Bayesian Hilbert Maps (BHM), within reinforcement learning based goal finding and path planning frameworks, across a controlled exploration environment and an autonomous driving scenario inspired by the F1-Tenth challenge. Our results demonstrate that VSA-OGM maintains comparable learning performance across single and multi-scenario training configurations while improving performance on unseen environments by approximately 47%. These findings highlight the increased generalizability of policy networks trained with VSA-OGM over BHM, reinforcing its potential for real-world deployment in diverse environments.
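The binding and bundling operations underlying vector symbolic architectures can be sketched with bipolar hypervectors. This toy example (the dimension, seed, and cell/occupancy vectors are arbitrary illustrative choices) shows how a bound pair is recovered from a bundled memory, the core trick VSA-OGM applies to occupancy mapping.

```python
import random

DIM = 4096
random.seed(0)

def hv():
    """Random bipolar hypervector."""
    return [random.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):
    """Element-wise product: associates two concepts (self-inverse)."""
    return [x * y for x, y in zip(a, b)]

def bundle(vectors):
    """Element-wise majority: superposes several items into one memory vector."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*vectors)]

def sim(a, b):
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / DIM

cell_a, cell_b = hv(), hv()
occupied, free = hv(), hv()
# store "cell_a is occupied" and "cell_b is free" in a single memory vector
memory = bundle([bind(cell_a, occupied), bind(cell_b, free)])
# unbinding with cell_a recovers a vector close to 'occupied'
assert sim(bind(memory, cell_a), occupied) > 0.4
assert abs(sim(bind(memory, cell_b), occupied)) < 0.2
```

Because every operation is element-wise on ±1 values, the scheme maps naturally onto spiking and neuromorphic hardware, which is the compatibility argument the paper makes.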
How to enable human-like long-term memory in large language models (LLMs) has been a central question for unlocking more general capabilities such as few-shot generalization. Existing memory frameworks and benchmarks focus on finding the optimal memory compression algorithm for higher performance in tasks that require recollection and sometimes further reasoning. However, such efforts have ended up building more human bias into the compression algorithm, through the search for the best prompts and memory architectures that suit specific benchmarks, rather than finding a general solution that would work on other data distributions. On the other hand, goal-directed search on uncompressed information could potentially exhibit superior performance because compression is lossy, and a predefined compression algorithm will not fit all raw data distributions. Here we present SUMER (Search in Uncompressed Memory via Experience Replay), an end-to-end reinforcement learning agent with verifiable reward (RLVR) that learns to use search tools to gather information and answer a target question. On the LoCoMo dataset for long-context conversation understanding, SUMER with Qwen2.5-7B-Instruct learned to use search tools and outperformed all other biased memory compression approaches and also the full-context baseline, reaching SOTA performance (43% gain over the prior best). We demonstrate that a simple search method applied to raw data outperforms goal-agnostic and biased compression algorithms in current long-context memory tasks, arguing for new paradigms and benchmarks that are more dynamic and autonomously scalable. Code for SUMER and all implemented baselines is publicly available at https://github.com/zycyc/SUMER.
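The core idea, goal-directed search over raw, uncompressed memory rather than a lossy summary, can be sketched with a trivial term-overlap ranker. The scoring function and sample entries are illustrative only; SUMER itself learns tool use end-to-end via RL.

```python
def search_uncompressed(memory, query_terms, k=1):
    """Rank raw memory entries by query-term overlap; no compression step,
    so no information is lost before the search."""
    def score(entry):
        words = set(entry.lower().split())
        return sum(1 for t in query_terms if t.lower() in words)
    return sorted(memory, key=score, reverse=True)[:k]

memory = [
    "alice mentioned she moved to boston last spring",
    "bob talked about his new job at the bakery",
    "the weather was rainy all week",
]
hits = search_uncompressed(memory, ["alice", "boston"])
assert hits[0].startswith("alice")
```

The contrast with compression-based pipelines is that the raw entries remain queryable for any future goal, not only the ones the compression algorithm anticipated.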
Infants often exhibit goal-directed behaviors, such as reaching for a sensory stimulus, even when no external reward criterion is provided. These intrinsically motivated behaviors facilitate spontaneous exploration and learning of the body and environment during early developmental stages. Although computational modeling can offer insight into the mechanisms underlying such behaviors, many existing studies on intrinsic motivation focus primarily on how exploration contributes to acquiring external rewards. In this paper, we propose a novel density model for an agent's own multimodal sensory experiences, called the"self-prior,"and investigate whether it can autonomously induce goal-directed behavior. Integrated within an active inference framework based on the free energy principle, the self-prior generates behavioral references purely from an intrinsic process that minimizes mismatches between average past sensory experiences and current observations. This mechanism is also analogous to the acquisition and utilization of a body schema through continuous interaction with the environment. We examine this approach in a simulated environment and confirm that the agent spontaneously reaches toward a tactile stimulus. Our study implements intrinsically motivated behavior shaped by the agent's own sensory experiences, demonstrating the spontaneous emergence of intentional behavior during early development.
This study proposes a quantum reinforcement learning (QRL) approach for robotic applications, which incorporates a Grover-based autonomous quantum agent (GAQA) and a quantum environment represented as a quantum TicTacToe (QTTT) game. The QTTT environment is a quantum circuit of qubits in their superposition states, manipulated by the agent through quantum gates to establish a goal state. By utilizing amplitude estimation and Grover search techniques, the proposed reinforcement learning-based autonomous quantum agent (ReLAQA) enhances the probability amplitudes of the actions taken, which results in reducing the number of observed states required to reach a solution. Empirical results substantiate the quantum advantages of the proposed GAQA in reinforcement learning (RL) tasks, reaching a solution after observing only 6,300 states and thereby outperforming classical agents, signifying its potential to enhance complex problem-solving in robotics.
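Grover-style amplitude amplification can be simulated classically on a small search space to see why fewer observations suffice. This sketch implements the standard oracle-plus-diffusion iteration on a state vector; it illustrates the general technique, not the paper's GAQA agent.

```python
import math

def grover_iterations(n_items, n_marked=1):
    """Near-optimal iteration count, approximately (pi/4) * sqrt(N/M)."""
    return max(1, round(math.pi / 4 * math.sqrt(n_items / n_marked)))

def grover_success_probability(n_items, marked, iters):
    """Classical state-vector simulation of Grover's search."""
    amps = [1 / math.sqrt(n_items)] * n_items
    for _ in range(iters):
        # oracle: flip the sign of the marked amplitudes
        for m in marked:
            amps[m] = -amps[m]
        # diffusion: reflect every amplitude about the mean
        mean = sum(amps) / n_items
        amps = [2 * mean - a for a in amps]
    return sum(amps[m] ** 2 for m in marked)

# finding 1 marked item among 16 succeeds with high probability in 3 iterations,
# versus ~8 expected classical queries
p = grover_success_probability(16, {5}, grover_iterations(16))
assert p > 0.9
```

The quadratic reduction in queries is the mechanism behind the reduced state-observation counts the abstract reports.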
There has been unprecedented interest in developing agents that expand the boundary of scientific discovery, primarily by optimizing quantitative objective functions specified by scientists. However, for grand challenges in science, these objectives are only imperfect proxies. We argue that automating objective function design is a central, yet unmet requirement for scientific discovery agents. In this work, we introduce the Scientific Autonomous Goal-evolving Agent (SAGA) to address this challenge. SAGA employs a bi-level architecture in which an outer loop of LLM agents analyzes optimization outcomes, proposes new objectives, and converts them into computable scoring functions, while an inner loop performs solution optimization under the current objectives. This bi-level design enables systematic exploration of the space of objectives and their trade-offs, rather than treating them as fixed inputs. We demonstrate the framework through a broad spectrum of applications, including antibiotic design, inorganic materials design, functional DNA sequence design, and chemical process design, showing that automating objective formulation can substantially improve the effectiveness of scientific discovery agents.
Achieving rigorous latency SLAs in dynamic telecommunications environments necessitates continuous optimization of network topology and capacity. This paper presents NetAgent-SLA, a goal-driven autonomous agent framework designed to monitor real-time topological and QoS metrics, implement reinforcement learning policies, and initiate SDN and cloud-API reconfigurations to maintain SLA performance. NetAgent-SLA was deployed on a multi-cloud testbed comprising on-premises SDN, Azure, and AWS platforms, demonstrating the following results: it maintained 99.5% SLA compliance across request-rate fluctuations ranging from 500 to 2,000 requests per second; achieved a 28% reduction in 95th-percentile latency compared to static orchestration approaches; adapted to new optimal configurations within 30 seconds following topology changes; and incurred less than 1.2 ms of decision-loop overhead per cycle. This work provides a comprehensive explanation of system architecture, agent design, training methodology, performance evaluation, and addresses integration challenges as well as future directions for autonomous network operations.
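The SLA-monitoring decision loop can be illustrated by computing 95th-percentile latency over a sliding window and comparing it against the SLA threshold. The nearest-rank percentile and the trigger function below are illustrative assumptions, not NetAgent-SLA's actual policy.

```python
import math

def p95(latencies_ms):
    """95th-percentile latency via the nearest-rank method."""
    s = sorted(latencies_ms)
    return s[max(0, math.ceil(0.95 * len(s)) - 1)]

def needs_reconfiguration(latencies_ms, sla_ms):
    """Signal an SDN/cloud-API reconfiguration when the window violates the SLA."""
    return p95(latencies_ms) > sla_ms

window = [10, 12, 11, 9, 14, 13, 10, 11, 12, 48]  # one tail-latency spike
assert p95(window) == 48
assert needs_reconfiguration(window, sla_ms=20)
```

In a full agent this check would gate an RL-selected reconfiguration action rather than a fixed rule, which is where the paper's learned policy comes in.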
This paper presents a dynamic memory-based intelligent framework, Emotional Detours, for autonomous agent navigation that formalizes avoidance and recovery for an autonomous agent as a valence-coded state model: negative-valence (NV) events represent undesirable encounters, and positive-valence (PV) events represent recovery locations. The memory function keeps track of all NV visits by an agent, and an avoidance mechanism ensures that the agent does not revisit an NV region even if that NV region happens to be on the shortest path to the goal. The framework was evaluated by developing 72 simulated environments, each environment being a 25×50 base grid with 23% stochastic edge pruning and seeding of ≈3.3% NV and ≈1.3% PV cells. The results of comparing the proposed framework with the baseline approach for all these 72 environments highlighted that the proposed framework significantly reduced the exposure to NV regions via the novel no-revisit, dynamic memory mechanism, and robust policies. For instance, in one environment, the baseline approach caused the agent to take a 648-step route with 15 entries into undesirable regions. In comparison, the proposed framework generated a 93-step route from the source to the goal with only 3 entries, an 80 percent reduction. Across all these runs, the baseline produced an average path length of 182.73 steps with 5.13 entries into NV regions, while the proposed framework produced an average path length of 83.27 steps with 2.77 entries into NV regions. Finally, survival curves were used to measure recovery time in steps-to-recovery from NV→PV and compare the proposed framework with the baseline. The median number of steps to recovery for the proposed framework was observed to be 7 steps across all simulations, whereas the baseline approach used 11 steps.
The NV→PV mechanism of this framework can be viewed as a parallel to human behavior, where periods of difficulty prompt an active search for comfort or closure, and the attainment of that relief enables steady continuation of progress. This behavior-oriented perspective introduces a new dimension to autonomous navigation by aligning the agent’s decision-making with the way humans manage adversity, while remaining grounded in the formal structure of valence coding, recovery rules, and robust policies.
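The no-revisit avoidance mechanism can be sketched as a shortest-path search that treats remembered NV cells as blocked, forcing a detour even when an NV cell lies on the direct route. The grid and BFS details are illustrative, not the paper's implementation.

```python
from collections import deque

def shortest_path(grid, start, goal, nv_visited):
    """BFS on a 0/1 grid that treats remembered negative-valence cells as blocked."""
    rows, cols = len(grid), len(grid[0])
    queue, seen = deque([(start, [start])]), {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0
                    and nxt not in nv_visited and nxt not in seen):
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

grid = [[0] * 4 for _ in range(3)]
# the straight route passes (1, 1); remembering it as NV forces a detour
direct = shortest_path(grid, (1, 0), (1, 3), set())
detour = shortest_path(grid, (1, 0), (1, 3), {(1, 1)})
assert len(detour) > len(direct)
```

In the full framework the NV set grows dynamically from experience, and PV cells additionally reshape the policy toward recovery locations.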
In the context of autonomous navigation, the development of systems that enable vehicles to operate independently in controlled environments is a crucial step toward advancing autonomous technology. This work presents the design, implementation, and validation of a navigation system for autonomous vehicles using NeuroEvolution of Augmenting Topologies (NEAT). The primary objective was to create a vehicle capable of navigating a 2D map with a defined starting point and target. Virtual sensors enable the vehicle to identify navigable paths and boundaries. Distance metrics such as Euclidean, Manhattan, and Chebyshev were employed as reward systems, continuously calculating agent positions. The closer the vehicle is to the target, the higher its fitness score, forming the basis of the fitness function. A forced reinforcement acceleration method was designed and implemented to ensure progress when the vehicle's speed fell below 0.1, preventing it from becoming stalled. Validation tests were conducted to evaluate the system's performance under varying conditions. Results demonstrate that the autonomous vehicle can navigate the map effectively, improving its fitness score in each generation depending on the distance metric used. Chebyshev performed best in obstacle-free environments, while Euclidean excelled in the presence of obstacles. The forced reinforcement method significantly reduced the time required to achieve the target fitness. These findings provide valuable insights for researchers aiming to develop NEAT-based navigation systems for autonomous vehicles.
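The distance-metric-based fitness function can be sketched directly: the closer the agent is to the target, the higher the fitness, under any of the three metrics. The reciprocal form used here is one plausible choice, not necessarily the paper's exact formula.

```python
def euclidean(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def chebyshev(p, q):
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def fitness(agent_pos, target, metric):
    """Fitness grows as the chosen distance to the target shrinks."""
    return 1.0 / (1.0 + metric(agent_pos, target))

target = (10.0, 10.0)
near, far = (9.0, 9.0), (0.0, 0.0)
for metric in (euclidean, manhattan, chebyshev):
    assert fitness(near, target, metric) > fitness(far, target, metric)
```

NEAT then evolves network topologies against this fitness signal; swapping the metric changes the shape of the reward landscape, which is why Chebyshev and Euclidean perform differently with and without obstacles.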
In this paper we study how transforming regular reinforcement learning environments into goal-conditioned environments can let agents learn to solve tasks autonomously and reward-free. We show that an agent can learn to solve tasks by selecting its own goals in an environment-agnostic way, at training times comparable to externally guided reinforcement learning. Our method is independent of the underlying off-policy learning algorithm. Since our method is environment-agnostic, the agent does not value any goals higher than others, leading to instability in performance for individual goals. However, in our experiments, we show that the average goal success rate improves and stabilizes. An agent trained with this method can be instructed to seek any observations made in the environment, enabling generic training of agents prior to specific use cases.
This study aims to develop a quantum computing-based neurocognitive architecture that allows an agent to perform autonomous behaviors. Therefore, we present a brain-inspired cognitive architecture for autonomous agents that integrates a prefrontal cortex–inspired model with modern deep learning (a transformer-based reinforcement learning module) and quantum algorithms. In particular, our framework incorporates quantum computational routines (Deutsch–Jozsa, Bernstein–Vazirani, and Grover’s search) to enhance decision-making efficiency. As a novelty of this research, this comprehensive computational structure is empowered by quantum computing operations so that superiority in speed and robustness of learning compared to classical methods can be demonstrated. Another main contribution is that the proposed architecture offers some features, such as meta-cognition and situation awareness. The meta-cognition aspect is responsible for hierarchically learning sub-tasks, enabling the agent to achieve the master goal. The situation-awareness property identifies how spatial-temporal reasoning activities related to the world model of the agent can be extracted in a dynamic simulation environment with unstructured uncertainties by quantum computation-based machine learning algorithms with the explainable artificial intelligence paradigm. In this research, the Minecraft game-based simulation environment is utilized for the experimental evaluation of performance and verification tests within complex, multi-objective tasks related to the autonomous behaviors of a smart agent. By implementing several interaction scenarios, the results of the system performance and comparative superiority over alternative solutions are presented, and it is discussed how these autonomous behaviors and cognitive skills of a smart agent can be improved in further studies.
Results show that the quantum-enhanced agent achieves 2 × faster convergence to an 80% task success rate in exploration tasks and approximately 15% higher cumulative rewards compared to a classical deep RL baseline. These findings demonstrate the potential of quantum algorithms to significantly improve learning and performance in cognitive agent architectures. However, advantages are task-specific and less pronounced under high-uncertainty, reactive scenarios. Limitations of the simulation environment are acknowledged, and a structured future research roadmap is proposed involving high-fidelity simulation validation, hardware-in-the-loop robotic testing, and integration of advanced hybrid quantum-classical architectures.
Safe navigation is essential for autonomous systems operating in hazardous environments. Traditional planning methods are effective for solving long-horizon tasks but depend on the availability of a graph representation with predefined distance metrics. In contrast, safe Reinforcement Learning (RL) is capable of learning complex behaviors without relying on manual heuristics but fails to solve long-horizon tasks, particularly in goal-conditioned and multi-agent scenarios. In this paper, we introduce a novel method that integrates the strengths of both planning and safe RL. Our method leverages goal-conditioned RL (GCRL) and safe RL to learn a goal-conditioned policy for navigation while concurrently estimating cumulative distance and safety levels using learned value functions via an automated self-training algorithm. By constructing a graph with states from the replay buffer, our method prunes unsafe edges and generates a waypoint-based plan that the agent then executes by following those waypoints sequentially until their goal locations are reached. This graph pruning and planning approach via the learned value functions allows our approach to flexibly balance the trade-off between faster and safer routes especially over extended horizons. Utilizing this unified high-level graph and a shared low-level safe GCRL policy, we extend this approach to address the multi-agent safe navigation problem. In particular, we leverage Conflict-Based Search (CBS) to create waypoint-based plans for multiple agents allowing for their safer navigation over extended horizons. This integration enhances the scalability of goal-conditioned safe RL in multi-agent scenarios, enabling efficient coordination among agents. Extensive benchmarking against state-of-the-art baselines demonstrates the effectiveness of our method in achieving distance goals safely for multiple agents in complex and hazardous environments.
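The graph-pruning-and-planning step can be sketched as Dijkstra search over a graph whose edges fall below a learned safety threshold and are removed first. The edge weights, safety values, and API below are hypothetical; in the paper both quantities come from learned value functions over replay-buffer states.

```python
import heapq

def plan_waypoints(edges, start, goal, safety, min_safety):
    """Dijkstra over a graph with unsafe edges pruned.
    edges: {(u, v): distance}; safety: {(u, v): estimated safety in [0, 1]}."""
    adj = {}
    for (u, v), d in edges.items():
        if safety[(u, v)] >= min_safety:  # prune edges deemed unsafe
            adj.setdefault(u, []).append((v, d))
    dist, prev, pq = {start: 0.0}, {}, [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if goal not in dist:
        return None
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

edges = {("A", "B"): 1, ("B", "D"): 1, ("A", "C"): 2, ("C", "D"): 2}
safety = {("A", "B"): 0.3, ("B", "D"): 0.9, ("A", "C"): 0.9, ("C", "D"): 0.9}
# the short A-B-D route is pruned for safety; the longer A-C-D route is chosen
assert plan_waypoints(edges, "A", "D", safety, 0.5) == ["A", "C", "D"]
```

Raising or lowering `min_safety` is exactly the faster-versus-safer trade-off the abstract describes; the low-level GCRL policy then executes the returned waypoints in sequence.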
Our code and further details about our work are available at https://safe-visual-mapf-mers.csail.mit.edu/.
For groups of autonomous agents to achieve a particular goal, they must engage in coordination and long-horizon reasoning. However, designing reward functions to elicit such behavior is challenging. In this paper, we study how self-supervised goal-reaching techniques can be leveraged to enable agents to cooperate. The key idea is that, rather than have agents maximize some scalar reward, agents aim to maximize the likelihood of visiting a certain goal. This problem setting enables human users to specify tasks via a single goal state rather than implementing a complex reward function. While the feedback signal is quite sparse, we will demonstrate that self-supervised goal-reaching techniques enable agents to learn from such feedback. On MARL benchmarks, our proposed method outperforms alternative approaches that have access to the same sparse reward signal as our method. While our method has no explicit mechanism for exploration, we observe that self-supervised multi-agent goal-reaching leads to emergent cooperation and exploration in settings where alternative approaches never witness a single successful trial.
This report comprehensively traces the evolution of agents from single models into "autonomous entities" with a high degree of autonomy, collaboration capability, and environmental adaptability. The body of research spans innovation in underlying cognitive architectures (such as world models and the Model Context Protocol, MCP) and the optimization of multi-agent collaboration protocols; it ranges from automation in the digital world (software engineering, network operations) to deep integration with the physical world (embodied intelligence, autonomous driving). Meanwhile, as agents move into enterprise-grade applications, security governance, trustworthiness evaluation, and return-on-investment (ROI) analysis have become core research frontiers, together sketching the trajectory of agents as the central engine of interaction between the digital and physical worlds.