GLM 5.1 Technical Report
Multi-Agent Collaboration and Interactive Programming in Software Engineering
This group of papers focuses on using multi-agent frameworks, human-in-the-loop collaboration, and interactive feedback to tackle complex software development and code generation tasks. They highlight the shift from single-shot code generation to full-lifecycle workflows spanning planning, coding, and debugging.
- Human-In-The-Loop Software Development Agents (Wannita Takerngsaksiri, Jirat Pasuksmit, Patanamon Thongtanunam, C. Tantithamthavorn, Ruixiong Zhang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, Ming Wu, 2024, 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP))
- CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging (Md. Ashraful Islam, Mohammed Eunus Ali, Md. Rizwan Parvez, 2025, North American Chapter of the Association for Computational Linguistics)
- VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool (Chia-Tung Ho, Haoxing Ren, Brucek Khailany, 2024, AAAI Conference on Artificial Intelligence)
- A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement (Huan Zhang, Wei Cheng, Yuhan Wu, Wei Hu, 2024, Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering)
- InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback (John Yang, Akshara Prabhakar, Karthik Narasimhan, Shunyu Yao, 2023, Neural Information Processing Systems)
Neuro-Symbolic Planning for Repository-Level Code and Complex Dependencies
This group examines how to handle complex programming tasks that exceed a single prompt's context, in large code repositories or environments with intricate dependencies, through neuro-symbolic combinations, incremental dependency analysis, and multi-step edit chains.
- CodePlan: Repository-Level Coding using LLMs and Planning (Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Shankar Iyer, Suresh Parthasarathy, S. Rajamani, B. Ashok, Shashank Shet, 2023, Proceedings of the ACM on Software Engineering)
Complex Asynchronous Planning and Cognitive Reasoning Mechanisms
These papers focus on improving LLM planning in non-linear, asynchronous, or hierarchical settings. The work ranges from graph-enhanced prompting techniques to a unified predictive-coding model inspired by how the brain learns, aiming to address the performance degradation that arises as long-horizon plans grow more complex.
- Graph-enhanced Large Language Models in Asynchronous Plan Reasoning (Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony G. Cohn, J. Pierrehumbert, 2024, International Conference on Machine Learning)
- Active Predictive Coding: A Unified Neural Framework for Learning Hierarchical World Models for Perception and Planning (Rajesh P. N. Rao, Dimitrios C. Gklezakos, V. Sathish, 2022, Neural Computation)
Embodied Intelligence and Geospatial Planning Applications
This group extends LLM planning to the physical world and domain-specific spatial tasks, such as robot control (bridging abstract language and low-level actions via RL) and geospatial route planning, validating model performance under real-world constraints.
- Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks (Murtaza Dalal, Tarun Chiruvolu, Devendra Singh Chaplot, Ruslan Salakhutdinov, 2024, International Conference on Learning Representations)
- Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study (Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Q. Luo, Sensen Wu, Xinyue Ye, Hailin Feng, Zhenhong Du, 2025, International Journal of Digital Earth)
Together, these papers survey the frontier of large language models (LLMs) in complex planning and autonomous agents. The research spans highly specialized software engineering automation (multi-agent collaboration, interactive programming, repository-level code handling), general cognitive reasoning models (asynchronous planning, predictive coding), and practical applications in embodied intelligence and geospatial navigation. They point to a shared trend: using structured planning, feedback loops, and multi-agent coordination to overcome the limitations of LLMs on long-horizon, high-complexity tasks.
A total of 10 related papers.
Due to the growing complexity of modern Integrated Circuits (ICs), automating hardware design can remove a significant amount of human error from the engineering process and reduce the resulting defects. Verilog is a popular hardware description language for designing and modeling digital systems; thus, Verilog generation is one of the emerging areas of research to facilitate the design process. In this work, we propose VerilogCoder, a system of multiple Artificial Intelligence (AI) agents for Verilog code generation, to autonomously write Verilog code and fix syntax and functional errors using collaborative Verilog tools (i.e., syntax checker, simulator, and waveform tracer). Firstly, we propose a task planner that utilizes a novel Task and Circuit Relation Graph retrieval method to construct a holistic plan based on module descriptions. To debug and fix functional errors, we develop a novel and efficient abstract syntax tree (AST)-based waveform tracing tool, which is integrated within the autonomous Verilog completion flow. The proposed methodology successfully generates 94.2% syntactically and functionally correct Verilog code, surpassing the state-of-the-art methods by 33.9% on the VerilogEval-Human v2 benchmark.
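Illustratively, the waveform-tracing idea amounts to walking the AST of a failing signal's drivers and reporting the values of its source signals at the failing cycle. A minimal sketch, assuming a toy AST encoding and invented signal names (not the paper's actual tool or data structures):

```python
# Minimal sketch of AST-based waveform backtracing in the spirit of
# VerilogCoder's debugging tool. The AST encoding, signal names, and
# waveform format below are invented for illustration.

# Each signal maps to the expression that drives it: (op, operand_signals).
AST = {
    "out":   ("and", ["a_reg", "b"]),
    "a_reg": ("reg", ["a"]),   # registered copy of input a
}

# Simulated waveform: per-cycle values for every signal.
WAVE = {
    "out":   [0, 0, 1, 0],
    "a_reg": [0, 1, 1, 0],
    "a":     [1, 1, 0, 0],
    "b":     [0, 0, 1, 1],
}

def backtrace(signal, cycle, depth=0, seen=None):
    """Walk the driver AST of `signal` and report each contributing
    signal's value at `cycle`, so an agent can localize the bad driver."""
    seen = seen or set()
    if signal in seen:
        return
    seen.add(signal)
    print("  " * depth + f"{signal} = {WAVE[signal][cycle]}")
    op, operands = AST.get(signal, ("input", []))
    for src in operands:
        backtrace(src, cycle, depth + 1, seen)

# Suppose simulation flagged a mismatch on `out` at cycle 3:
backtrace("out", cycle=3)
```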
Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase, involve pervasively editing the entire repository of code. We formulate these activities as repository-level coding tasks. Recent tools like GitHub Copilot, which are powered by Large Language Models (LLMs), have succeeded in offering high-quality solutions to localized coding problems. Repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is inter-dependent and the entire repository may be too large to fit into the prompt. We frame repository-level coding as a planning problem and present a task-agnostic, neuro-symbolic framework called CodePlan to solve it. CodePlan synthesizes a multi-step chain-of-edits (plan), where each step results in a call to an LLM on a code location with context derived from the entire repository, previous code changes and task-specific instructions. CodePlan is based on a novel combination of an incremental dependency analysis, a change may-impact analysis and an adaptive planning algorithm (symbolic components) with the neural LLMs. We evaluate the effectiveness of CodePlan on two repository-level tasks: package migration (C#) and temporal code edits (Python). Each task is evaluated on multiple code repositories, each of which requires inter-dependent changes to many files (from 2 to 97 files). Coding tasks of this level of complexity have not been automated using LLMs before. Our results show that CodePlan matches the ground truth more closely than the baselines. CodePlan is able to get 5/7 repositories to pass the validity checks (i.e., to build without errors and make correct code edits) whereas the baselines (without planning but with the same type of contextual information as CodePlan) cannot get any of the repositories to pass them. We provide our (non-proprietary) data, evaluation scripts and supplementary material at https://github.com/microsoft/codeplan.
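The core loop can be pictured as a worklist over edit obligations that the dependency analysis keeps extending. A minimal sketch with assumed stand-ins for the LLM call and the may-impact analysis (the function names and dependency data here are hypothetical, not CodePlan's API):

```python
# Illustrative sketch of a CodePlan-style plan-and-propagate loop.
from collections import deque

def llm_edit(location, context):
    """Stand-in for an LLM call that edits code at `location`."""
    return f"edited({location})"

def impacted_locations(location):
    """Stand-in for change may-impact analysis: returns code locations
    whose validity may be affected by editing `location`."""
    deps = {"pkg/api.cs": ["pkg/client.cs"], "pkg/client.cs": []}
    return deps.get(location, [])

def codeplan(seed_edits):
    frontier, done = deque(seed_edits), set()
    plan = []  # the synthesized chain-of-edits
    while frontier:
        loc = frontier.popleft()
        if loc in done:
            continue
        plan.append(llm_edit(loc, context="derived from repository"))
        done.add(loc)
        # Adaptive planning: dependency analysis extends the plan with
        # follow-on obligations caused by the edit just made.
        frontier.extend(impacted_locations(loc))
    return plan

print(codeplan(["pkg/api.cs"]))
```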
Large Language Models (LLMs) have been shown to be capable of performing high-level planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL achieves state-of-the-art results on over 25 challenging robotics tasks with up to 10 stages. PSL solves long-horizon tasks from raw visual input spanning four benchmarks at success rates of over 85%, outperforming language-based, classical, and end-to-end approaches. Video results and code at https://mihdalal.github.io/planseqlearn/
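Schematically, the modular decomposition looks like the sketch below, with every component stubbed out; the plan format and function names are assumptions for illustration, not the authors' interfaces:

```python
# Schematic of a Plan-Seq-Learn-style loop: LLM plans stages, a motion
# planner reaches each stage's region, and an RL policy handles the
# fine-grained control. All components are placeholders.

def llm_high_level_plan(task):
    # e.g. "put the block in the drawer" -> ordered stage targets
    return ["reach block", "grasp block", "reach drawer", "release"]

def motion_plan_to(stage):
    print(f"[motion planner] moving near region for: {stage}")

def rl_policy_solve(stage):
    # Local control learned online by RL in the paper; here just a
    # placeholder that reports success.
    print(f"[RL policy] executing contact-rich control for: {stage}")
    return True

def plan_seq_learn(task):
    for stage in llm_high_level_plan(task):
        motion_plan_to(stage)          # bridge abstract language to space
        assert rl_policy_solve(stage)  # learn/execute low-level control

plan_seq_learn("put the block in the drawer")
```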
Large language models (LLMs) have achieved impressive performance on code generation. Although prior studies enhanced LLMs with prompting techniques and code refinement, they still struggle with complex programming problems due to rigid solution plans. In this paper, we draw on pair programming practices to propose PairCoder, a novel LLM-based framework for code generation. PairCoder incorporates two collaborative LLM agents, namely a Navigator agent for high-level planning and a Driver agent for specific implementation. The Navigator is responsible for proposing promising solution plans, selecting the current optimal plan, and directing the next iteration round based on execution feedback. The Driver follows the guidance of the Navigator to undertake initial code generation, code testing, and refinement. This interleaved and iterative workflow involves multi-plan exploration and feedback-based refinement, which mimics the collaboration of pair programmers. We evaluate PairCoder with both open-source and closed-source LLMs on various code generation benchmarks. Extensive experimental results demonstrate the superior accuracy of PairCoder, achieving relative pass@1 improvements of 12.00%–162.43% compared to prompting LLMs directly.
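The Navigator/Driver interplay can be sketched as a small loop, with both agents and the test harness stubbed out (names and behavior here are illustrative, not PairCoder's implementation):

```python
# Minimal sketch of a Navigator/Driver pair-programming loop.

def navigator_propose_plans(problem):
    return ["two-pointer scan", "sort then binary search"]

def driver_implement(plan):
    # Stand-in for LLM code generation following the chosen plan.
    return "def solve(xs):\n    return sorted(xs)"

def run_tests(code):
    # Stand-in for executing the candidate against visible tests.
    return (True, "all tests passed")

def pair_coder(problem, max_rounds=3):
    plans = navigator_propose_plans(problem)   # multi-plan exploration
    for round_idx in range(max_rounds):
        plan = plans[round_idx % len(plans)]   # Navigator picks a plan
        code = driver_implement(plan)
        ok, feedback = run_tests(code)
        if ok:
            return code
        # The Navigator would use `feedback` to redirect the next round.
    return None

print(pair_coder("sort a list"))
```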
Planning is a fundamental property of human intelligence. Reasoning about asynchronous plans is challenging since it requires sequential and parallel planning to optimize time costs. Can large language models (LLMs) succeed at this task? Here, we present the first large-scale study investigating this question. We find that a representative set of closed and open-source LLMs, including GPT-4 and LLaMA-2, behave poorly when not supplied with illustrations about the task-solving process in our benchmark AsyncHow. We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results. We show that although PLaG can boost model performance, LLMs still suffer from drastic degradation when task complexity increases, highlighting the limits of utilizing LLMs for simulating digital devices. We see our study as an exciting step towards using LLMs as efficient autonomous agents. Our code and data are available at https://github.com/fangru-lin/graph-llm-asynchow-plan.
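Underlying such benchmarks, optimal asynchronous scheduling of a task graph reduces to the longest (critical) path of a DAG. A small worked example using Python's standard library, independent of PLaG's actual prompt format:

```python
# Minimal completion time of an asynchronous plan = critical path of
# its dependency DAG. The cooking steps below are an invented example.
from graphlib import TopologicalSorter

durations = {"boil water": 6, "chop veg": 4, "cook veg": 5, "serve": 1}
deps = {"cook veg": {"chop veg"}, "serve": {"boil water", "cook veg"}}

finish = {}
for step in TopologicalSorter(deps).static_order():
    # A step can start once its slowest prerequisite has finished.
    ready = max((finish[d] for d in deps.get(step, ())), default=0)
    finish[step] = ready + durations[step]

print(max(finish.values()))  # 10: chop(4) -> cook(5) -> serve(1)
```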
Humans write code in a fundamentally interactive manner and rely on constant execution feedback to correct errors, resolve ambiguities, and decompose tasks. While LLMs have recently exhibited promising coding capabilities, current coding benchmarks mostly consider a static instruction-to-code sequence transduction process, which has the potential for error propagation and a disconnect between the generated code and its final execution environment. To address this gap, we introduce InterCode, a lightweight, flexible, and easy-to-use framework of interactive coding as a standard reinforcement learning (RL) environment, with code as actions and execution feedback as observations. Our framework is language and platform agnostic, uses self-contained Docker environments to provide safe and reproducible execution, and is compatible out-of-the-box with traditional seq2seq coding methods, while enabling the development of new methods for interactive code generation. We use InterCode to create three interactive code environments with Bash, SQL, and Python as action spaces, leveraging data from the static NL2Bash, Spider, and MBPP datasets. We demonstrate InterCode's viability as a testbed by evaluating multiple state-of-the-art LLMs configured with different prompting strategies such as ReAct and Plan&Solve. Our results showcase the benefits of interactive code generation and demonstrate that InterCode can serve as a challenging benchmark for advancing code understanding and generation capabilities. InterCode is designed to be easily extensible and can even be used to create new tasks such as Capture the Flag, a popular coding puzzle that is inherently multi-step and involves multiple programming languages. Project site with code and data: https://intercode-benchmark.github.io
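The environment abstraction is a standard RL interface with code as the action and execution output as the observation. A toy in-process version of that idea (the real framework runs actions in isolated Docker containers and is language-agnostic):

```python
# Sketch of an InterCode-style interactive coding environment. This toy
# executes Python snippets in-process; it is not the actual framework.
import io
import contextlib

class ToyCodeEnv:
    def __init__(self, goal_stdout):
        self.goal = goal_stdout
        self.globals = {}   # state persists across steps, like a shell

    def step(self, action):
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(action, self.globals)   # code as action
            obs = buf.getvalue()             # execution feedback as observation
        except Exception as e:
            obs = f"error: {e}"
        reward = float(obs == self.goal)
        return obs, reward

env = ToyCodeEnv(goal_stdout="8\n")
obs, r = env.step("x = 5\nprint(x + 3)")
print(repr(obs), r)  # '8\n' 1.0
```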
There is growing interest in predictive coding as a model of how the brain learns through predictions and prediction errors. Predictive coding models have traditionally focused on sensory coding and perception. Here we introduce active predictive coding (APC) as a unifying model for perception, action, and cognition. The APC model addresses important open problems in cognitive science and AI, including (1) how we learn compositional representations (e.g., part-whole hierarchies for equivariant vision) and (2) how we solve large-scale planning problems, which are hard for traditional reinforcement learning, by composing complex state dynamics and abstract actions from simpler dynamics and primitive actions. By using hypernetworks, self-supervised learning, and reinforcement learning, APC learns hierarchical world models by combining task-invariant state transition networks and task-dependent policy networks at multiple abstraction levels. We illustrate the applicability of the APC model to active visual perception and hierarchical planning. Our results represent, to our knowledge, the first proof-of-concept demonstration of a unified approach to addressing the part-whole learning problem in vision, the nested reference frames learning problem in cognition, and the integrated state-action hierarchy learning problem in reinforcement learning.
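One mechanism worth isolating is the hypernetwork: a higher-level state vector generates the weights of a lower-level network, so the same input is processed differently per task context. A bare-bones numpy sketch of that single idea, with arbitrary dimensions (not the paper's architecture):

```python
# Toy hypernetwork: a context vector generates lower-level weights.
import numpy as np

rng = np.random.default_rng(0)
IN, OUT, CTX = 4, 3, 8   # lower-level net shape, context size (arbitrary)

# Hypernetwork: linear map from a higher-level context vector to the
# flattened weight matrix of the lower-level network.
H = rng.normal(size=(CTX, IN * OUT)) * 0.1

def lower_level_forward(x, context):
    W = (context @ H).reshape(IN, OUT)   # weights generated on the fly
    return np.tanh(x @ W)

x = rng.normal(size=IN)
task_a, task_b = rng.normal(size=CTX), rng.normal(size=CTX)
# Same input, different task contexts -> different lower-level behavior.
print(lower_level_forward(x, task_a))
print(lower_level_forward(x, task_b))
```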
The emergence of large language models like ChatGPT and Gemini has highlighted the need to assess their diverse capabilities. However, their performance on geospatial tasks remains underexplored. This study introduces a novel multi-task spatial evaluation dataset to address this gap, covering twelve task types, including spatial understanding and route planning, with verified answers. We evaluated several models, including OpenAI’s gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI’s glm-4, Anthropic’s claude-3-sonnet-20240229, and MoonShot’s moonshot-v1-8k, using a two-phase testing approach: zero-shot testing followed by difficulty-based categorization and prompt tuning. Results show that gpt-4o had the highest overall accuracy in the first phase at 71.3%. Though moonshot-v1-8k performed slightly worse overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on performance, such as the Chain-of-Thought strategy, which boosted gpt-4o’s accuracy in route planning from 12.4% to 87.5%, and a one-shot strategy that raised moonshot-v1-8k’s accuracy in mapping tasks from 10.1% to 76.3%.
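The prompt strategies compared in the second phase can be pictured as simple template transforms around a model call. A hypothetical harness sketch; `ask_llm` and the templates are placeholders, not the study's exact prompts:

```python
# Sketch of comparing prompt strategies on a geospatial question.

def ask_llm(prompt):
    return "stub answer"   # replace with a real model call

def zero_shot(question):
    return ask_llm(question)

def chain_of_thought(question):
    return ask_llm(question + "\nLet's think step by step.")

def one_shot(question, example_q, example_a):
    return ask_llm(f"Q: {example_q}\nA: {example_a}\nQ: {question}\nA:")

question = "Plan a route from the museum to the station via the park."
for strategy in (zero_shot, chain_of_thought):
    print(strategy.__name__, "->", strategy(question))
print("one_shot ->", one_shot(question, "example question", "example answer"))
```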
Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. Just as humans verify their understanding of an algorithm by simulating it visually, CodeSim uniquely features a method of plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state-of-the-art pass@1 results (HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework at this link (https://kagnlp.github.io/codesim.github.io/).
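The distinguishing step is internal plan verification: simulating input/output before relying on external tools. A toy version of that check, with illustrative function names rather than CodeSim's API:

```python
# Simulation-driven verification: run the candidate on a sample input
# and compare against the expected output before accepting the plan.

def generate_code(plan):
    # Stand-in for the coding agent; implements "sum of squares".
    return "def solve(xs):\n    return sum(x * x for x in xs)"

def simulate(code, sample_in):
    ns = {}
    exec(code, ns)                 # internal simulation, no external tool
    return ns["solve"](sample_in)

plan = "iterate, square each element, accumulate"
code = generate_code(plan)
sample_in, expected = [1, 2, 3], 14
assert simulate(code, sample_in) == expected   # plan verified
print("plan verified by simulation")
```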
Recently, multi-agent paradigms based on Large Language Models (LLMs) have been introduced in software engineering to automatically resolve software development tasks (e.g., from a given issue to source code). However, existing work is evaluated on historical benchmark datasets, rarely considers human feedback at each stage of the automated software development process, and has not been deployed in practice. In this paper, we introduce a Human-in-the-loop LLM-based Agents framework (HULA) for software development that allows software engineers to refine and guide LLMs when generating coding plans and source code for a given task. We design, implement, and deploy the HULA framework into Atlassian JIRA for internal use. Through a multi-stage evaluation of the HULA framework, Atlassian software engineers perceive that HULA can minimize the overall development time and effort, especially in initiating a coding plan and writing code for straightforward tasks. On the other hand, challenges around code quality remain a concern in some cases. We draw lessons learned and discuss opportunities for future work, which will pave the way for the advancement of LLM-based agents in software development.
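The workflow reduces to a feedback gate between plan drafting and code generation. A schematic sketch with stubbed stages (in the deployed system the review happens inside JIRA; all names below are illustrative):

```python
# Human-in-the-loop gate: the agent drafts a plan, an engineer approves
# or revises it, and only then does code generation proceed.

def draft_plan(issue):
    return ["locate handler", "add validation", "update tests"]

def human_review(plan):
    # Auto-approve here to keep the sketch runnable; a real engineer
    # may edit, reject, or send the plan back for another draft.
    print("proposed plan:", plan)
    return plan

def generate_code(plan):
    return "\n".join(f"# step: {s}" for s in plan)

issue = "form accepts empty email"
plan = human_review(draft_plan(issue))   # human feedback gate
if plan:
    print(generate_code(plan))
```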