GLM 5.1 Technical Report
Multi-Agent Collaboration and Interactive Programming in Software Engineering
This group of papers focuses on using multi-agent frameworks, human-in-the-loop collaboration, and interactive feedback to tackle complex software development and code generation tasks. They highlight the shift from single-shot code generation to full-lifecycle workflows spanning planning, coding, and debugging.
- Human-In-The-Loop Software Development Agents (Wannita Takerngsaksiri, Jirat Pasuksmit, Patanamon Thongtanunam, C. Tantithamthavorn, Ruixiong Zhang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, Ming Wu, 2024, 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP))
- CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging (Md. Ashraful Islam, Mohammed Eunus Ali, Md. Rizwan Parvez, 2025, North American Chapter of the Association for Computational Linguistics)
- VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool (Chia-Tung Ho, Haoxing Ren, Brucek Khailany, 2024, AAAI Conference on Artificial Intelligence)
- A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement (Huan Zhang, Wei Cheng, Yuhan Wu, Wei Hu, 2024, Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering)
- InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback (John Yang, Akshara Prabhakar, Karthik Narasimhan, Shunyu Yao, 2023, Neural Information Processing Systems)
Neuro-Symbolic Planning for Repository-Level Code and Complex Dependencies
This group examines how to handle complex programming tasks that exceed a single prompt's context, in large code repositories or environments with intricate dependencies, through neuro-symbolic combinations, incremental dependency analysis, and multi-step edit chains.
- CodePlan: Repository-Level Coding using LLMs and Planning (Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Shankar Iyer, Suresh Parthasarathy, S. Rajamani, B. Ashok, Shashank Shet, 2023, Proceedings of the ACM on Software Engineering)
Complex Asynchronous Planning and Cognitive Reasoning Mechanisms
These papers focus on improving LLM planning in non-linear, asynchronous, or hierarchical settings. The work ranges from graph-enhanced prompting techniques to a unified predictive-coding model inspired by how the brain learns, aiming to address the performance degradation that arises as long-horizon plans grow more complex.
- Graph-enhanced Large Language Models in Asynchronous Plan Reasoning (Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony G. Cohn, J. Pierrehumbert, 2024, International Conference on Machine Learning)
- Active Predictive Coding: A Unified Neural Framework for Learning Hierarchical World Models for Perception and Planning (Rajesh P. N. Rao, Dimitrios C. Gklezakos, V. Sathish, 2022, Neural Computation)
Embodied Intelligence and Geospatial Planning Applications
This group extends LLM planning to the physical world and domain-specific spatial tasks, such as robot control (bridging abstract language and low-level actions via RL) and geospatial route planning, validating model performance under real-world constraints.
- Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks (Murtaza Dalal, Tarun Chiruvolu, Devendra Singh Chaplot, Ruslan Salakhutdinov, 2024, International Conference on Learning Representations)
- Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study (Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Q. Luo, Sensen Wu, Xinyue Ye, Hailin Feng, Zhenhong Du, 2025, International Journal of Digital Earth)
Together, these papers survey the frontier of large language models (LLMs) in complex planning and autonomous agents. The research spans highly specialized software engineering automation (multi-agent collaboration, interactive programming, repository-level code handling), general cognitive reasoning models (asynchronous planning, predictive coding), and practical applications in embodied intelligence and geospatial navigation. They point to a shared trend: using structured planning, feedback loops, and multi-agent coordination to overcome the limitations of LLMs on long-horizon, high-complexity tasks.
A total of 10 related papers.
Due to the growing complexity of modern Integrated Circuits (ICs), automating hardware design can remove a significant amount of human error from the engineering process and reduce the resulting defects. Verilog is a popular hardware description language for designing and modeling digital systems; thus, Verilog generation is one of the emerging areas of research to facilitate the design process. In this work, we propose VerilogCoder, a system of multiple Artificial Intelligence (AI) agents for Verilog code generation, to autonomously write Verilog code and fix syntax and functional errors using collaborative Verilog tools (i.e., syntax checker, simulator, and waveform tracer). Firstly, we propose a task planner that utilizes a novel Task and Circuit Relation Graph retrieval method to construct a holistic plan based on module descriptions. To debug and fix functional errors, we develop a novel and efficient abstract syntax tree (AST)-based waveform tracing tool, which is integrated within the autonomous Verilog completion flow. The proposed methodology successfully generates 94.2% syntactically and functionally correct Verilog code, surpassing the state-of-the-art methods by 33.9% on the VerilogEval-Human v2 benchmark.
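Illustratively, the waveform-tracing idea amounts to walking the AST of a failing signal's drivers and reporting the values of its source signals at the failing cycle. A minimal sketch, assuming a toy AST encoding and invented signal names (not the paper's actual tool or data structures):

```python
# Minimal sketch of AST-based waveform backtracing in the spirit of
# VerilogCoder's debugging tool. The AST encoding, signal names, and
# waveform format below are invented for illustration.

# Each signal maps to the expression that drives it: (op, operand_signals).
AST = {
    "out":   ("and", ["a_reg", "b"]),
    "a_reg": ("reg", ["a"]),   # registered copy of input a
}

# Simulated waveform: per-cycle values for every signal.
WAVE = {
    "out":   [0, 0, 1, 0],
    "a_reg": [0, 1, 1, 0],
    "a":     [1, 1, 0, 0],
    "b":     [0, 0, 1, 1],
}

def backtrace(signal, cycle, depth=0, seen=None):
    """Walk the driver AST of `signal` and report each contributing
    signal's value at `cycle`, so an agent can localize the bad driver."""
    seen = seen or set()
    if signal in seen:
        return
    seen.add(signal)
    print("  " * depth + f"{signal} = {WAVE[signal][cycle]}")
    op, operands = AST.get(signal, ("input", []))
    for src in operands:
        backtrace(src, cycle, depth + 1, seen)

# Suppose simulation flagged a mismatch on `out` at cycle 3:
backtrace("out", cycle=3)
```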
Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase, involve pervasively editing the entire repository of code. We formulate these activities as repository-level coding tasks. Recent tools like GitHub Copilot, which are powered by Large Language Models (LLMs), have succeeded in offering high-quality solutions to localized coding problems. Repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is inter-dependent and the entire repository may be too large to fit into the prompt. We frame repository-level coding as a planning problem and present a task-agnostic, neuro-symbolic framework called CodePlan to solve it. CodePlan synthesizes a multi-step chain-of-edits (plan), where each step results in a call to an LLM on a code location with context derived from the entire repository, previous code changes and task-specific instructions. CodePlan is based on a novel combination of an incremental dependency analysis, a change may-impact analysis and an adaptive planning algorithm (symbolic components) with the neural LLMs. We evaluate the effectiveness of CodePlan on two repository-level tasks: package migration (C#) and temporal code edits (Python). Each task is evaluated on multiple code repositories, each of which requires inter-dependent changes to many files (from 2 to 97 files). Coding tasks of this level of complexity have not been automated using LLMs before. Our results show that CodePlan matches the ground truth more closely than the baselines. CodePlan is able to get 5/7 repositories to pass the validity checks (i.e., to build without errors and make correct code edits) whereas the baselines (without planning but with the same type of contextual information as CodePlan) cannot get any of the repositories to pass them. We provide our (non-proprietary) data, evaluation scripts and supplementary material at https://github.com/microsoft/codeplan.
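The core loop can be pictured as a worklist over edit obligations that the dependency analysis keeps extending. A minimal sketch with assumed stand-ins for the LLM call and the may-impact analysis (the function names and dependency data here are hypothetical, not CodePlan's API):

```python
# Illustrative sketch of a CodePlan-style plan-and-propagate loop.
from collections import deque

def llm_edit(location, context):
    """Stand-in for an LLM call that edits code at `location`."""
    return f"edited({location})"

def impacted_locations(location):
    """Stand-in for change may-impact analysis: returns code locations
    whose validity may be affected by editing `location`."""
    deps = {"pkg/api.cs": ["pkg/client.cs"], "pkg/client.cs": []}
    return deps.get(location, [])

def codeplan(seed_edits):
    frontier, done = deque(seed_edits), set()
    plan = []  # the synthesized chain-of-edits
    while frontier:
        loc = frontier.popleft()
        if loc in done:
            continue
        plan.append(llm_edit(loc, context="derived from repository"))
        done.add(loc)
        # Adaptive planning: dependency analysis extends the plan with
        # follow-on obligations caused by the edit just made.
        frontier.extend(impacted_locations(loc))
    return plan

print(codeplan(["pkg/api.cs"]))
```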
Large Language Models (LLMs) have been shown to be capable of performing high-level planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL achieves state-of-the-art results on over 25 challenging robotics tasks with up to 10 stages. PSL solves long-horizon tasks from raw visual input spanning four benchmarks at success rates of over 85%, outperforming language-based, classical, and end-to-end approaches. Video results and code at https://mihdalal.github.io/planseqlearn/
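Schematically, the modular decomposition looks like the sketch below, with every component stubbed out; the plan format and function names are assumptions for illustration, not the authors' interfaces:

```python
# Schematic of a Plan-Seq-Learn-style loop: LLM plans stages, a motion
# planner reaches each stage's region, and an RL policy handles the
# fine-grained control. All components are placeholders.

def llm_high_level_plan(task):
    # e.g. "put the block in the drawer" -> ordered stage targets
    return ["reach block", "grasp block", "reach drawer", "release"]

def motion_plan_to(stage):
    print(f"[motion planner] moving near region for: {stage}")

def rl_policy_solve(stage):
    # Local control learned online by RL in the paper; here just a
    # placeholder that reports success.
    print(f"[RL policy] executing contact-rich control for: {stage}")
    return True

def plan_seq_learn(task):
    for stage in llm_high_level_plan(task):
        motion_plan_to(stage)          # bridge abstract language to space
        assert rl_policy_solve(stage)  # learn/execute low-level control

plan_seq_learn("put the block in the drawer")
```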
Large language models (LLMs) have achieved impressive performance on code generation. Although prior studies enhanced LLMs with prompting techniques and code refinement, they still struggle with complex programming problems due to rigid solution plans. In this paper, we draw on pair programming practices to propose PairCoder, a novel LLM-based framework for code generation. PairCoder incorporates two collaborative LLM agents, namely a Navigator agent for high-level planning and a Driver agent for specific implementation. The Navigator is responsible for proposing promising solution plans, selecting the current optimal plan, and directing the next iteration round based on execution feedback. The Driver follows the guidance of the Navigator to undertake initial code generation, code testing, and refinement. This interleaved and iterative workflow involves multi-plan exploration and feedback-based refinement, which mimics the collaboration of pair programmers. We evaluate PairCoder with both open-source and closed-source LLMs on various code generation benchmarks. Extensive experimental results demonstrate the superior accuracy of PairCoder, achieving relative pass@1 improvements of 12.00%–162.43% compared to prompting LLMs directly.
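The Navigator/Driver interplay can be sketched as a small loop, with both agents and the test harness stubbed out (names and behavior here are illustrative, not PairCoder's implementation):

```python
# Minimal sketch of a Navigator/Driver pair-programming loop.

def navigator_propose_plans(problem):
    return ["two-pointer scan", "sort then binary search"]

def driver_implement(plan):
    # Stand-in for LLM code generation following the chosen plan.
    return "def solve(xs):\n    return sorted(xs)"

def run_tests(code):
    # Stand-in for executing the candidate against visible tests.
    return (True, "all tests passed")

def pair_coder(problem, max_rounds=3):
    plans = navigator_propose_plans(problem)   # multi-plan exploration
    for round_idx in range(max_rounds):
        plan = plans[round_idx % len(plans)]   # Navigator picks a plan
        code = driver_implement(plan)
        ok, feedback = run_tests(code)
        if ok:
            return code
        # The Navigator would use `feedback` to redirect the next round.
    return None

print(pair_coder("sort a list"))
```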
Planning is a fundamental property of human intelligence. Reasoning about asynchronous plans is challenging since it requires sequential and parallel planning to optimize time costs. Can large language models (LLMs) succeed at this task? Here, we present the first large-scale study investigating this question. We find that a representative set of closed and open-source LLMs, including GPT-4 and LLaMA-2, behave poorly when not supplied with illustrations about the task-solving process in our benchmark AsyncHow. We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results. We show that although PLaG can boost model performance, LLMs still suffer from drastic degradation when task complexity increases, highlighting the limits of utilizing LLMs for simulating digital devices. We see our study as an exciting step towards using LLMs as efficient autonomous agents. Our code and data are available at https://github.com/fangru-lin/graph-llm-asynchow-plan.
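Underlying such benchmarks, optimal asynchronous scheduling of a task graph reduces to the longest (critical) path of a DAG. A small worked example using Python's standard library, independent of PLaG's actual prompt format:

```python
# Minimal completion time of an asynchronous plan = critical path of
# its dependency DAG. The cooking steps below are an invented example.
from graphlib import TopologicalSorter

durations = {"boil water": 6, "chop veg": 4, "cook veg": 5, "serve": 1}
deps = {"cook veg": {"chop veg"}, "serve": {"boil water", "cook veg"}}

finish = {}
for step in TopologicalSorter(deps).static_order():
    # A step can start once its slowest prerequisite has finished.
    ready = max((finish[d] for d in deps.get(step, ())), default=0)
    finish[step] = ready + durations[step]

print(max(finish.values()))  # 10: chop(4) -> cook(5) -> serve(1)
```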
Humans write code in a fundamentally interactive manner and rely on constant execution feedback to correct errors, resolve ambiguities, and decompose tasks. While LLMs have recently exhibited promising coding capabilities, current coding benchmarks mostly consider a static instruction-to-code sequence transduction process, which has the potential for error propagation and a disconnect between the generated code and its final execution environment. To address this gap, we introduce InterCode, a lightweight, flexible, and easy-to-use framework of interactive coding as a standard reinforcement learning (RL) environment, with code as actions and execution feedback as observations. Our framework is language and platform agnostic, uses self-contained Docker environments to provide safe and reproducible execution, and is compatible out-of-the-box with traditional seq2seq coding methods, while enabling the development of new methods for interactive code generation. We use InterCode to create three interactive code environments with Bash, SQL, and Python as action spaces, leveraging data from the static NL2Bash, Spider, and MBPP datasets. We demonstrate InterCode's viability as a testbed by evaluating multiple state-of-the-art LLMs configured with different prompting strategies such as ReAct and Plan&Solve. Our results showcase the benefits of interactive code generation and demonstrate that InterCode can serve as a challenging benchmark for advancing code understanding and generation capabilities. InterCode is designed to be easily extensible and can even be used to create new tasks such as Capture the Flag, a popular coding puzzle that is inherently multi-step and involves multiple programming languages. Project site with code and data: https://intercode-benchmark.github.io
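The environment abstraction is a standard RL interface with code as the action and execution output as the observation. A toy in-process version of that idea (the real framework runs actions in isolated Docker containers and is language-agnostic):

```python
# Sketch of an InterCode-style interactive coding environment. This toy
# executes Python snippets in-process; it is not the actual framework.
import io
import contextlib

class ToyCodeEnv:
    def __init__(self, goal_stdout):
        self.goal = goal_stdout
        self.globals = {}   # state persists across steps, like a shell

    def step(self, action):
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(action, self.globals)   # code as action
            obs = buf.getvalue()             # execution feedback as observation
        except Exception as e:
            obs = f"error: {e}"
        reward = float(obs == self.goal)
        return obs, reward

env = ToyCodeEnv(goal_stdout="8\n")
obs, r = env.step("x = 5\nprint(x + 3)")
print(repr(obs), r)  # '8\n' 1.0
```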
There is growing interest in predictive coding as a model of how the brain learns through predictions and prediction errors. Predictive coding models have traditionally focused on sensory coding and perception. Here we introduce active predictive coding (APC) as a unifying model for perception, action, and cognition. The APC model addresses important open problems in cognitive science and AI, including (1) how we learn compositional representations (e.g., part-whole hierarchies for equivariant vision) and (2) how we solve large-scale planning problems, which are hard for traditional reinforcement learning, by composing complex state dynamics and abstract actions from simpler dynamics and primitive actions. By using hypernetworks, self-supervised learning, and reinforcement learning, APC learns hierarchical world models by combining task-invariant state transition networks and task-dependent policy networks at multiple abstraction levels. We illustrate the applicability of the APC model to active visual perception and hierarchical planning. Our results represent, to our knowledge, the first proof-of-concept demonstration of a unified approach to addressing the part-whole learning problem in vision, the nested reference frames learning problem in cognition, and the integrated state-action hierarchy learning problem in reinforcement learning.
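One mechanism worth isolating is the hypernetwork: a higher-level state vector generates the weights of a lower-level network, so the same input is processed differently per task context. A bare-bones numpy sketch of that single idea, with arbitrary dimensions (not the paper's architecture):

```python
# Toy hypernetwork: a context vector generates lower-level weights.
import numpy as np

rng = np.random.default_rng(0)
IN, OUT, CTX = 4, 3, 8   # lower-level net shape, context size (arbitrary)

# Hypernetwork: linear map from a higher-level context vector to the
# flattened weight matrix of the lower-level network.
H = rng.normal(size=(CTX, IN * OUT)) * 0.1

def lower_level_forward(x, context):
    W = (context @ H).reshape(IN, OUT)   # weights generated on the fly
    return np.tanh(x @ W)

x = rng.normal(size=IN)
task_a, task_b = rng.normal(size=CTX), rng.normal(size=CTX)
# Same input, different task contexts -> different lower-level behavior.
print(lower_level_forward(x, task_a))
print(lower_level_forward(x, task_b))
```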
The emergence of large language models like ChatGPT and Gemini has highlighted the need to assess their diverse capabilities. However, their performance on geospatial tasks remains underexplored. This study introduces a novel multi-task spatial evaluation dataset to address this gap, covering twelve task types, including spatial understanding and route planning, with verified answers. We evaluated several models, including OpenAI’s gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI’s glm-4, Anthropic’s claude-3-sonnet-20240229, and MoonShot’s moonshot-v1-8k, using a two-phase testing approach: zero-shot testing followed by difficulty-based categorization and prompt tuning. Results show that gpt-4o had the highest overall accuracy in the first phase at 71.3%. Though moonshot-v1-8k performed slightly worse overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on performance, such as the Chain-of-Thought strategy, which boosted gpt-4o’s accuracy in route planning from 12.4% to 87.5%, and a one-shot strategy that raised moonshot-v1-8k’s accuracy in mapping tasks from 10.1% to 76.3%.
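The prompt strategies compared in the second phase can be pictured as simple template transforms around a model call. A hypothetical harness sketch; `ask_llm` and the templates are placeholders, not the study's exact prompts:

```python
# Sketch of comparing prompt strategies on a geospatial question.

def ask_llm(prompt):
    return "stub answer"   # replace with a real model call

def zero_shot(question):
    return ask_llm(question)

def chain_of_thought(question):
    return ask_llm(question + "\nLet's think step by step.")

def one_shot(question, example_q, example_a):
    return ask_llm(f"Q: {example_q}\nA: {example_a}\nQ: {question}\nA:")

question = "Plan a route from the museum to the station via the park."
for strategy in (zero_shot, chain_of_thought):
    print(strategy.__name__, "->", strategy(question))
print("one_shot ->", one_shot(question, "example question", "example answer"))
```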
Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. Just as humans verify their understanding of an algorithm by simulating it visually, CodeSim uniquely features a method of plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state-of-the-art pass@1 results (HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework at this link (https://kagnlp.github.io/codesim.github.io/).
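The distinguishing step is internal plan verification: simulating input/output before relying on external tools. A toy version of that check, with illustrative function names rather than CodeSim's API:

```python
# Simulation-driven verification: run the candidate on a sample input
# and compare against the expected output before accepting the plan.

def generate_code(plan):
    # Stand-in for the coding agent; implements "sum of squares".
    return "def solve(xs):\n    return sum(x * x for x in xs)"

def simulate(code, sample_in):
    ns = {}
    exec(code, ns)                 # internal simulation, no external tool
    return ns["solve"](sample_in)

plan = "iterate, square each element, accumulate"
code = generate_code(plan)
sample_in, expected = [1, 2, 3], 14
assert simulate(code, sample_in) == expected   # plan verified
print("plan verified by simulation")
```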
Recently, multi-agent paradigms based on Large Language Models (LLMs) have been introduced in software engineering to automatically resolve software development tasks (e.g., from a given issue to source code). However, existing work is evaluated on historical benchmark datasets, rarely considers human feedback at each stage of the automated software development process, and has not been deployed in practice. In this paper, we introduce a Human-in-the-loop LLM-based Agents framework (HULA) for software development that allows software engineers to refine and guide LLMs when generating coding plans and source code for a given task. We design, implement, and deploy the HULA framework into Atlassian JIRA for internal use. Through a multi-stage evaluation of the HULA framework, Atlassian software engineers perceive that HULA can minimize the overall development time and effort, especially in initiating a coding plan and writing code for straightforward tasks. On the other hand, challenges around code quality remain a concern in some cases. We draw lessons learned and discuss opportunities for future work, which will pave the way for the advancement of LLM-based agents in software development.
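The workflow reduces to a feedback gate between plan drafting and code generation. A schematic sketch with stubbed stages (in the deployed system the review happens inside JIRA; all names below are illustrative):

```python
# Human-in-the-loop gate: the agent drafts a plan, an engineer approves
# or revises it, and only then does code generation proceed.

def draft_plan(issue):
    return ["locate handler", "add validation", "update tests"]

def human_review(plan):
    # Auto-approve here to keep the sketch runnable; a real engineer
    # may edit, reject, or send the plan back for another draft.
    print("proposed plan:", plan)
    return plan

def generate_code(plan):
    return "\n".join(f"# step: {s}" for s in plan)

issue = "form accepts empty email"
plan = human_review(draft_plan(issue))   # human feedback gate
if plan:
    print(generate_code(plan))
```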