Vibe Coding
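Program Synthesis and Formal Verification Techniques
Focuses on combining large language models with program analysis, enumerative algorithms, and formal verification methods to address the correctness, semantic accuracy, and high-reliability requirements of code generation.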
- Guiding Enumerative Program Synthesis with Large Language Models(Yixuan Li, Julian Parsert, Elizabeth Polgreen, 2024, Lecture Notes in Computer Science)
- Verified Code Transpilation with LLMs(Sahil Bhatia, Alvin Cheung, Niranjan Hasabnis, Jie Qiu, Sanjit A. Seshia, 2024, Advances in Neural Information Processing Systems 37)
- PREFACE - A Reinforcement Learning Framework for Code Verification via LLM Prompt Repair(Manvi Jha, Jiaxin Wan, Huan Zhang, Deming Chen, 2025, Proceedings of the Great Lakes Symposium on VLSI 2025)
- LLM-based Interactive Code Generation: Empirical Evaluation(D. Shaikhelislamov, M. Drobyshevskiy, Andrey Belevantsev, 2024, 2024 Ivannikov Ispras Open Conference (ISPRAS))
- Intelligent program synthesis techniques: literature review(B GU, B YU, XG DONG, XF LI, RM ZHONG, 2021, Journal of …)
- Jigsaw: Large Language Models meet Program Synthesis(Naman Jain, Skanda Vaidyanath, Arun Shankar Iyer, Nagarajan Natarajan, Suresh Parthasarathy, S. Rajamani, Rahul Sharma, 2021, Proceedings of the 44th International Conference on Software Engineering)
- Automating Requirements Modelling with LLMs: An Iterative Contrastive Optimisation Approach(Chenxi Lv, S. Tyszberowicz, Zhiming Liu, Bo Liu, 2025, 2025 32nd Asia-Pacific Software Engineering Conference (APSEC))
- VeriGen: An LLM-Augmented Framework for End-to-End Automation of Software Development Lifecycle–From Requirements Specifications to Code Generation(M. Ahmed, Muhammad Waseem Anwar, W. H. Butt, F. Azam, 2026, IEEE Access)
- Enchanting Program Specification Synthesis by Large Language Models Using Static Analysis and Program Verification(Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, Cong Tian, 2024, Lecture Notes in Computer Science)
- Test-Driven Development and LLM-based Code Generation(N. Mathews, M. Nagappan, 2024, Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering)
- Optimal Neural Program Synthesis from Multimodal Specifications(Xi Ye, Qiaochu Chen, Işıl Dillig, Greg Durrett, 2020, Findings of the Association for Computational Linguistics: EMNLP 2021)
- LLM-Based Scheme for Synthesis of Formal Verification Algorithms(Itay Cohen, Doron A. Peled, 2024, Lecture Notes in Computer Science)
- LLM-Assisted Synthesis of High-Assurance C Programs(Prasita Mukherjee, Minghai Lu, Benjamin Delaware, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE))
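Interactive Prompt-Driven Development and Code Refinement Frameworks
Explores how natural-language prompts, self-correction mechanisms, clarification dialogues, and static-analysis feedback enable iterative generation and optimization from requirements to high-quality code.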
- Natural Language to Code Generation: Implementing a Transformer-Based Model for Python Code Synthesis and Integration into Conversational Ai(Poornima Devi M, Ayushmaan Das, Ruthvika Muchala, 2025, SSRN Electronic Journal)
- Program Synthesis Using Natural Language(Aditya Desai, Sumit Gulwani, V. Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, R. Sailesh, Subhajit Roy, 2015, Proceedings of the 38th International Conference on Software Engineering)
- Prompt-Driven Development with Claude Code: Developing a TUI Framework for the Ring Programming Language(M. S. Fayed, A. S. Fayed, 2026, Electronics)
- Prompt-Driven and Kubernetes Error Report-Aware Container Orchestration(Niklas Beuter, André Drews, Nane Kratzke, 2025, Future Internet)
- A Systematic Literature Review of 10 years of Research on Program Synthesis and Natural Language Processing(Rolando Ramírez-Rueda, E. Benítez-Guerrero, Carmen Mezura-Godoy, E. Bárcenas, 2024, Programming and Computer Software)
- On Program Synthesis and Large Language Models(Hans Hüttel, 2024, Communications of the ACM)
- Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs(Sylvain Kouemo Ngassom, Arghavan Moradi Dakhel, Florian Tambon, F. Khomh, 2024, Proceedings of the 1st ACM International Conference on AI-Powered Software)
- ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification(Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, Qing Wang, 2024, Proceedings of the ACM on Software Engineering)
- An Empirical Study on the Effectiveness of Iterative LLM-Based Improvements for Static Analysis Issues(João Gonçalves, M. Maia, 2025, Anais do XXXIX Simpósio Brasileiro de Engenharia de Software (SBES 2025))
- Prompting Techniques for Secure Code Generation: A Systematic Investigation(Catherine Tony, Nicolás E. Díaz Ferreyra, Markus Mutas, Salem Dhif, Riccardo Scandariato, 2024, ACM Transactions on Software Engineering and Methodology)
- LLM-based Iterative Requirements Refinement in FSM with IEC 61499 Code Generation(V. Vyatkin, Sandeep Patil, Dmitrii Drozdov, Anatoly Shalyto, 2025, 2025 IEEE 23rd International Conference on Industrial Informatics (INDIN))
- DesDD: A Design-Enabled Framework with Dual-Layer Debugging for LLM-based Iterative API Orchestrating(Zhuo Cheng, Zhou Zou, Qing Huang, Zhenchang Xing, Wei Zhang, Shaochen Wang, Xueting Yi, Huan Jin, Zhiping Liu, Zhaojin Lu, 2025, Proceedings of the 16th International Conference on Internetware)
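Human-AI Collaboration Paradigms and Developer Experience
Studies interaction patterns between human developers and AI assistants, focusing on conversational programming interfaces, collaborative workflows, and how AI can enhance human creativity and learning.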
- Human-Human-AI Triadic Programming: Uncovering the Role of AI Agent and the Value of Human Partner in Collaborative Learning(T. Daryanto, Xiaohan Ding, Kaike Ping, Lance T. Wilhelm, Yan Chen, Chris Brown, E. Rho, 2026, Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems)
- Spellburst: A Node-based Interface for Exploratory Creative Coding with Natural Language Prompts(Tyler Angert, Miroslav Suzara, Jenny Han, Christopher Pondoc, Hariharan Subramonyam, 2023, Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology)
- The Programmer’s Assistant: Conversational Interaction with a Large Language Model for Software Development(Steven I. Ross, Fernando Martinez, Stephanie Houde, Michael J. Muller, Justin D. Weisz, 2023, Proceedings of the 28th International Conference on Intelligent User Interfaces)
- Guiding Novice Programmers With CPS-GPT: A Prompt-Driven AI Assistant for Creative Problem Solving(Shu-Chen Wang, Pin Li, Yuen-Min Huang, 2026, Journal of Educational Computing Research)
- Performance Evaluation of Prompt Generation Strategies for AI Agents in Online Programming Education(Zan Li, Zijie Chen, 2025, Journal of Advanced Computing Systems)
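Complex System Development and Multi-Agent Collaboration
Focuses on using multi-agent systems and structured intermediate representations to manage collaborative development and complex logic in complex engineering tasks (such as hardware design and multi-module systems).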
- ChatCPU: An Agile CPU Design & Verification Platform with LLM(Xi Wang, Gwok-Waa Wan, Sam-Zaak Wong, Layton Zhang, Tianyang Liu, Qi Tian, Jianmin Ye, 2024, Proceedings of the 61st ACM/IEEE Design Automation Conference)
- LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead(Junda He, Christoph Treude, David Lo, 2024, ACM Transactions on Software Engineering and Methodology)
- Demystifying LLM-Based Software Engineering Agents(Chun Xia, Yinlin Deng, S. Dunn, Lingming Zhang, 2025, Proceedings of the ACM on Software Engineering)
- Athena: Intermediate Representations for Iterative Scaffolded App Generation with an LLM(Jon-Tait Beason, Ruijia Cheng, Eldon Schoop, Jeffrey Nichols, 2026, Proceedings of the 31st International Conference on Intelligent User Interfaces)
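Software Engineering Effectiveness Evaluation and Trust Mechanisms
Aims to establish rigorous evaluation benchmarks, validate the practical utility of LLM-generated code, and examine frameworks and empirical studies for building developer trust in AI tools in software engineering practice.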
- Empowering Future Software Engineers: Integrating AI Tools into Advanced CS Curriculum(N. Roy, Omojokun Olufisayo, Oleksandr Horielko, 2025, Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 2)
- The Impact of LLM-Assistants on Software Developer Productivity: A Systematic Review and Mapping Study(Amr Mohamed, Maram Assi, Mariam Guizani, 2025, ACM Transactions on Software Engineering and Methodology)
- Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems(Xiaoqing Wang, Keman Huang, Bin Liang, Hongyu Li, Xiaoyong Du, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
- AI-ASSISTED CODE GENERATION AND OPTIMIZATION: LEVERAGING MACHINE LEARNING TO ENHANCE SOFTWARE DEVELOPMENT PROCESSES(Swamy Prasadarao Velaga, 2020, International Journal of Innovations in Engineering Research and Technology)
- Self-Collaboration Code Generation via ChatGPT(Yihong Dong, Xue Jiang, Zhi Jin, Ge Li, 2023, ACM Transactions on Software Engineering and Methodology)
- CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models(Hao Yu, Shen Bo, Dezhi Ran, J. Y. Zhang, Qi Rong Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, Tao Xie, 2024, Proceedings of the IEEE/ACM 46th International Conference on Software Engineering)
- Multi-language Software Development in the LLM Era: Insights from Practitioners’ Conversations with ChatGPT(Lucas Aguiar, Matheus Paixão, R. Carmo, Edson Soares, António Leal, Matheus Freitas, Eliakim Gama, 2024, Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement)
- Rocks Coding, Not Development: A Human-Centric, Experimental Evaluation of LLM-Supported SE Tasks(Wei Wang, Huilong Ning, Gaowei Zhang, Libo Liu, Yi Wang, 2024, Proceedings of the ACM on Software Engineering)
- LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead(Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, David Lo, 2026, ACM Transactions on Software Engineering and Methodology)
- Analysis of LLM Code Synthesis in Software Productivity(Anurag Anand, Shivali Chopra, Mohit Arora, 2024, Applied Intelligence and Computing)
- Revolutionary transformations in twentieth century: making AI-assisted software development(Binayak Parashar, Inderjeet Kaur, Anupama Sharma, Pratima Singh, Deepti Mishra, 2022, Computational Intelligence in Software Modeling)
- A Pilot Study on AI-Assisted Code Generation with Large Language Models for Software Engineering(Hsiao-Chuan Liu, Chia-Tung Tsai, Min-Yuh Day, 2023, Communications in Computer and Information Science)
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation(Jiawei Liu, Chun Xia, Yuyao Wang, Lingming Zhang, 2023, Advances in Neural Information Processing Systems 36)
- Facilitating Trust in AI-assisted Software Tools(Brittany Johnson, Christian Bird, Denae Ford, Ebtesam Al Haque, Nicole Forsgren, Thomas Zimmermann, 2025, ACM Transactions on Software Engineering and Methodology)
- Prompt Driven Test Generation: Leveraging Large Language Models and Knowledge Graphs for Quality Assurance in Data Intensive Software System(Srinivas Reddy Kosna, 2025, Communications in Computer and Information Science)
- Recommendations for efficient and responsible LLM adoption within industrial software development(Krishna Ronanki, Beatriz Cabrero‐Daniel, Tomas Herda, Stefan Sitkovich, Jennifer Horkoff, Christian Berger, 2026, Information and Software Technology)
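Industrial Practice, Educational Applications, and Visions of Evolution
Analyzes, from a macro perspective, AI adoption experience in industry, teaching reform in education, and visions for the future evolution of AI-driven software engineering paradigms.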
- Industrial Experience Report on AI-Assisted Coding in Professional Software Development(Rudolf Ramler, M. Moser, Lukas Fischer, Markus Nissl, René Heinzl, 2024, Proceedings of the 1st International Workshop on Large Language Models for Code)
- Development of an AI-Driven Model for Advancing Software Engineering Practices(Aylin Güzel, A. Egesoy, 2025, International Journal of Innovative Research in Computer Science and Technology)
- “Ok Pal, we have to code that now”: interaction patterns of programming beginners with a conversational chatbot(Alina Mailach, Dominik Gorgosch, Norbert Siegmund, Janet Siegmund, 2024, Empirical Software Engineering)
- Augmented Agile: Human-Centered AI-Assisted Software Management(Rashina Hoda, H. Dam, C. Tantithamthavorn, Patanamon Thongtanunam, M. Storey, Tim Menzies, 2023, IEEE Software)
- The Impact of Structured Prompt-Driven Generative AI on Learning Data Analysis in Engineering Students(Ashish Garg, Ramkumar Rajendran, 2024, Proceedings of the 16th International Conference on Computer Supported Education)
- Investigating students' programming behaviors, interaction qualities and perceptions through prompt-based learning in ChatGPT(D Sun, A Boudouaia, J Yang, J Xu, 2024, Humanities and Social Sciences …)
- The Future of AI-Driven Software Engineering(Valerio Terragni, Annie Vella, Partha S. Roop, Kelly Blincoe, 2024, ACM Transactions on Software Engineering and Methodology)
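This report organizes Vibe Coding research into six core areas: foundational program synthesis and formal verification, interactive prompt-driven and refinement frameworks, human-AI collaboration paradigms, multi-agent collaboration on complex systems, engineering effectiveness evaluation and trust mechanisms, and finally industrial adoption and future visions. This structure covers the full ecosystem from technical implementation through engineering practice to education and industry evolution, offering a systematic perspective for understanding AI-assisted software engineering (AI4SE).
57 related publications in total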
Non-trivial software systems are commonly developed using more than a single programming language. However, multi-language development is not straightforward. Nowadays, tools powered by Large Language Models (LLMs), such as ChatGPT, have been shown to successfully assist practitioners in several aspects of software development. This paper reports a preliminary study that investigates to what extent ChatGPT is being used in multi-language development scenarios. To this end, we leveraged DevGPT, a dataset of conversations between software practitioners and ChatGPT. In total, we studied data from 3,584 conversations, comprising a total of 18,862 code snippets. Our analyses show that only 18.33% of the code snippets suggested by ChatGPT are written in the same programming language as the primary language in the repository where the conversation was shared. In an in-depth analysis, we observed expected scenarios, such as 31.54% of JavaScript snippets being suggested in CSS repositories. However, we also unveiled surprising ones, such as Python snippets being largely suggested in C++ repositories. After a qualitative open card sorting of the conversations, we found that in 70% of them developers were asking for coding support, while in 57% developers used ChatGPT as a tool to generate code. Our initial results not only indicate that LLMs are being used in multi-language development but also showcase the contexts in which such tools are assisting developers.
Integrating Large Language Models (LLMs) into autonomous agents marks a significant shift in the research landscape by offering cognitive abilities that are competitive with human planning and reasoning. This article explores the transformative potential of integrating Large Language Models into Multi-Agent (LMA) systems for addressing complex challenges in software engineering (SE). By leveraging the collaborative and specialized abilities of multiple agents, LMA systems enable autonomous problem-solving, improve robustness, and provide scalable solutions for managing the complexity of real-world software projects. In this article, we conduct a systematic review of recent primary studies to map the current landscape of LMA applications across various stages of the software development lifecycle (SDLC). To illustrate current capabilities and limitations, we perform two case studies to demonstrate the effectiveness of state-of-the-art LMA frameworks. Additionally, we identify critical research gaps and propose a comprehensive research agenda focused on enhancing individual agent capabilities and optimizing agent synergy. Our work outlines a forward-looking vision for developing fully autonomous, scalable, and trustworthy LMA systems, laying the foundation for the evolution of Software Engineering 2.0.
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically resolve software development issues. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents at the time of paper submission! Agentless also achieves more than 50% solve rate when using Claude 3.5 Sonnet on the new SWE-bench Verified benchmark. In fact, Agentless has already been adopted by OpenAI as the go-to approach to showcase the real-world coding performance of both GPT-4o and the new o1 models; more recently, Agentless has also been used by DeepSeek to evaluate their newest DeepSeek V3 and R1 models. Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patches or insufficient/misleading issue descriptions. 
As such, we construct SWE-bench Lite-𝑆 by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the currently overlooked potential of a simplistic, cost-effective technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction. We have open-sourced Agentless at: https://github.com/OpenAutoCoder/Agentless
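The three-phase process named in the abstract above (localization, repair, patch validation) can be sketched as a fixed pipeline in which no model decides the next action. This is an illustration only, not the Agentless implementation: `localize`, `propose_patches`, `validate`, `resolve_issue`, the keyword-overlap ranking, and the `generate` and `test` callables are all hypothetical stand-ins, and the LLM is replaced by a plain function.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical model of a repository: file path -> source text.
Repo = Dict[str, str]


def _tokens(text: str) -> set:
    """Lowercase identifier-like tokens, used only for the toy ranking below."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def localize(issue: str, repo: Repo, top_k: int = 1) -> List[str]:
    """Phase 1 (assumed heuristic): rank files by token overlap with the issue."""
    issue_tokens = _tokens(issue)
    ranked = sorted(
        repo,
        key=lambda path: len(issue_tokens & _tokens(repo[path])),
        reverse=True,
    )
    return ranked[:top_k]


def propose_patches(
    issue: str,
    repo: Repo,
    files: List[str],
    generate: Callable[[str, str], List[str]],
) -> List[Repo]:
    """Phase 2: collect candidate repositories, one per proposed file rewrite."""
    candidates = []
    for path in files:
        for new_source in generate(issue, repo[path]):
            patched = dict(repo)
            patched[path] = new_source
            candidates.append(patched)
    return candidates


def validate(candidates: List[Repo], test: Callable[[Repo], bool]) -> Optional[Repo]:
    """Phase 3: return the first candidate that passes the test, else None."""
    for candidate in candidates:
        if test(candidate):
            return candidate
    return None


def resolve_issue(issue, repo, generate, test) -> Optional[Repo]:
    """Run the three phases once, in order, with no feedback loop."""
    files = localize(issue, repo)
    candidates = propose_patches(issue, repo, files, generate)
    return validate(candidates, test)
```

The point of the sketch is the control flow: each phase runs once and hands its result to the next, so cost is bounded by the number of candidates rather than by an open-ended agent loop.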
The rapid advancement of Large Language Model (LLM)-driven multi-agent systems has significantly streamlined software developing tasks, enabling users with little technical expertise to develop executable applications. While these systems democratize software creation through natural language requirements, they introduce significant security risks that remain largely unexplored. We identify two risky scenarios: Malicious User with Benign Agents (MU-BA) and Benign User with Malicious Agents (BU-MA). We introduce the Implicit Malicious Behavior Injection Attack (IMBIA), demonstrating how multi-agent systems can be manipulated to generate software with concealed malicious capabilities beneath seemingly benign applications, and propose Adv-IMBIA as a defense mechanism. Evaluations across ChatDev, MetaGPT, and AgentVerse frameworks reveal varying vulnerability patterns, with IMBIA achieving attack success rates of 93%, 45%, and 71% in MU-BA scenarios, and 71%, 84%, and 45% in BU-MA scenarios. Our defense mechanism reduced attack success rates significantly, particularly in the MU-BA scenario. Further analysis reveals that compromised agents in the coding and testing phases pose significantly greater security risks, while also identifying critical agents that require protection against malicious user exploitation. Our findings highlight the urgent need for robust security measures in multi-agent software development systems and provide practical guidelines for implementing targeted, resource-efficient defensive strategies.
… survey with software … LLM-output evaluation, (iii) scoping the applicability of LLMs within SE tasks, (iv) the effect of LLMs on SE workflows, (v) the necessity and directions for developing …
Large language model assistants (LLM-assistants) present new opportunities to transform software development. Developers are increasingly adopting these tools across tasks, including coding, testing, debugging, documentation, and design. Yet, despite growing interest, there is no synthesis of how LLM-assistants affect software developer productivity. In this paper, we present a systematic review and mapping of 39 peer-reviewed studies published between January 2014 and December 2024 that examine this impact. Our analysis reveals that the majority of studies report considerable benefits from LLM-assistants, though a notable subset identifies critical risks. Commonly reported gains include accelerated development, minimized code search, and the automation of trivial and repetitive tasks. However, studies also highlight concerns around cognitive offloading and reduced team collaboration. Our study reveals that whether LLM-based assistants improve or degrade code quality remains unresolved, as existing studies report contradictory outcomes contingent on context and evaluation criteria. While the majority of studies (90%) adopt a multi-dimensional perspective by examining at least two SPACE dimensions, reflecting increased awareness of the complexity of developer productivity, only 15% extend beyond three dimensions, indicating substantial room for more integrated evaluations. Satisfaction, Performance, and Efficiency are the most frequently investigated dimensions, whereas Communication and Activity remain underexplored. Most studies are exploratory (59%) and methodologically diverse, but lack longitudinal and team-based evaluations. This review surfaces key research gaps and provides recommendations for future research and practice. All artifacts associated with this study are publicly available at https://zenodo.org/records/18489222.
Recently, large language model (LLM) based generative AI has been gaining momentum for its impressive, high-quality performance in multiple domains, particularly after the release of ChatGPT. Many believe that these models have the potential to perform general-purpose problem-solving in software development and replace human software developers. Nevertheless, there is a lack of serious investigation into the capability of these LLM techniques in fulfilling software development tasks. In a controlled 2 × 2 between-subject experiment with 109 participants, we examined whether and to what degree working with ChatGPT was helpful in a coding task and a typical software development task, and how people work with ChatGPT. We found that while ChatGPT performed well in solving simple coding problems, its performance in supporting typical software development tasks was not that good. We also observed the interactions between participants and ChatGPT and found relations between the interactions and the outcomes. Our study thus provides first-hand insights into using ChatGPT to fulfill software engineering tasks with real-world developers and motivates the need for novel interaction mechanisms that help developers effectively work with large language models to achieve desired outcomes.
The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks from code generation to program repair, producing a massive volume of software artifacts. This surge in automated creation has exposed a critical bottleneck: the lack of scalable and reliable methods to evaluate the quality of these outputs. Human evaluation, while effective, is very costly and time-consuming. Traditional automated metrics like BLEU rely on high-quality references and struggle to capture nuanced aspects of software quality, such as readability and usefulness. In response, the LLM-as-a-Judge paradigm, which employs LLMs for automated evaluation, has emerged. This approach leverages the advanced reasoning and coding capabilities of LLMs themselves to perform automated evaluations, offering a compelling path toward achieving both the nuance of human insight and the scalability of automated systems. Nevertheless, LLM-as-a-Judge research in the SE community is still in its early stages, with many breakthroughs needed. This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts, while also sharing potential research paths to achieve this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To validate this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap to guide future developments of LLM-as-a-Judge in software engineering. While not intended to be a definitive guide, our work aims to foster further research and adoption of LLM-as-a-Judge frameworks within the SE community, ultimately improving the effectiveness and scalability of software artifact evaluation methods.
In recent years, Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI) have gained significant momentum in the automation of software development. Although previous research studies have explored specific subtasks such as requirements formalization, UML class diagram generation, UML sequence diagram generation, and skeletal code generation, these efforts remain fragmented and do not provide an integrated, end-to-end software solution. Furthermore, unreliable outputs are frequently caused by instability, omissions, and hallucinations in LLM-generated artifacts. We propose VeriGen, a novel end-to-end generative AI framework for automated software development, to overcome these limitations. VeriGen begins with requirements written in plain natural language, which are structured using RUPPs templates into Structured Natural Language (SNL). LLM outputs are systematically processed through a unique Verifier-Optimizer loop at each subsequent stage, such as class diagram generation, sequence diagram generation, and skeletal Java code synthesis. The Verifier-Optimizer loop identifies correct, incorrect, missing, and extra instances while refining outputs for stability and correctness. This ensures the accuracy and completeness of generated artifacts in addition to the automation. VeriGen has been validated against four benchmark case studies, where it reliably generated verified and optimized outputs throughout the software development pipeline, showing its superiority over baseline LLM-generated outputs. The results confirm VeriGen as a significant step toward reliable, generative, end-to-end software engineering automation.
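The abstract above does not say how the Verifier sorts instances into its four categories. One plausible reading, given here purely as an assumption, is a comparison between the elements expected from the requirements and the elements the LLM generated. In the sketch below, `verify`, the dictionary representation of a class diagram (class name mapped to attribute names), and the matching rule are all hypothetical and are not taken from VeriGen.

```python
from typing import Dict, List, Set

# Hypothetical representation: class name -> set of attribute names.
Diagram = Dict[str, Set[str]]


def verify(expected: Diagram, generated: Diagram) -> Dict[str, List[str]]:
    """Classify class names under an assumed matching rule.

    correct:   present in both with identical attributes
    incorrect: present in both but attributes differ
    missing:   expected but not generated
    extra:     generated but not expected
    """
    report: Dict[str, List[str]] = {
        "correct": [],
        "incorrect": [],
        "missing": [],
        "extra": [],
    }
    for name, attributes in expected.items():
        if name not in generated:
            report["missing"].append(name)
        elif generated[name] == attributes:
            report["correct"].append(name)
        else:
            report["incorrect"].append(name)
    report["extra"] = [name for name in generated if name not in expected]
    # Sort each bucket so the report is deterministic.
    for bucket in report.values():
        bucket.sort()
    return report
```

An optimizer step would presumably feed the non-empty `incorrect`, `missing`, and `extra` buckets back to the model as a correction prompt and repeat until they are empty, but that loop is outside what this sketch shows.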
Large language models (LLMs) are increasingly used in software development, yet their ability to generate and maintain large, multi-module systems through natural language interaction remains insufficiently characterized. This study presents an empirical analysis of developing a 7420-line Terminal User Interface (TUI) framework for the Ring programming language using a prompt-driven workflow with Claude Code (Opus 4.5), employing iterative testing and corrective feedback. The system was produced through 107 prompts: 21 feature requests, 72 bug fix prompts, 9 prompts sharing information from Ring documentation, 4 prompts providing architectural guidance, and 1 prompt dedicated to generating documentation. Development progressed across five phases, with the Window Manager phase requiring the most interaction (35 prompts), followed by complex UI systems (25 prompts) and control expansion (20 prompts). Bug-related prompts covered redraw issues, event-handling faults, runtime errors, and layout inconsistencies, while feature requests focused primarily on new widgets, window-manager capabilities, and advanced UI components. Most prompts were brief (mean ≈ 258 characters; median = 207 characters), reflecting a highly iterative workflow in which the human role was limited to specifying requirements, validating behavior, and issuing corrective prompts, without writing any code manually. The resulting framework contains 28 classes and 334 methods and includes a windowing subsystem, event-driven architecture, interactive widgets, hierarchical menus, grid and tree components, tab controls, and a multi-window desktop environment.
By combining quantitative prompt analysis with qualitative assessment of model behavior, this study provides empirical evidence that modern LLMs can preserve architectural coherence across iterations and support the construction of new libraries and tools for emerging programming languages, highlighting prompt-driven development as a viable methodology within software-engineering practice.
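The prompt counts reported in this abstract can be cross-checked with a few lines of arithmetic. The five prompt types do sum to the stated 107, bug fixes make up roughly two thirds of them, and the three named phases account for 80 prompts, which leaves 27 for the two phases the abstract does not break out. The variable names below are illustrative; every number is taken from the abstract.

```python
# Figures quoted from the abstract; nothing here is measured independently.
PROMPTS_BY_TYPE = {
    "feature request": 21,
    "bug fix": 72,
    "Ring documentation": 9,
    "architectural guidance": 4,
    "documentation generation": 1,
}
PROMPTS_BY_NAMED_PHASE = {
    "window manager": 35,
    "complex UI systems": 25,
    "control expansion": 20,
}
TOTAL_PROMPTS = 107
LINES_OF_CODE = 7420


def percentage_shares(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


total_by_type = sum(PROMPTS_BY_TYPE.values())
shares = percentage_shares(PROMPTS_BY_TYPE)
# Prompts spent in the two phases the abstract does not quantify.
prompts_in_other_phases = TOTAL_PROMPTS - sum(PROMPTS_BY_NAMED_PHASE.values())
# Average amount of framework code per prompt issued.
lines_per_prompt = LINES_OF_CODE / TOTAL_PROMPTS

if __name__ == "__main__":
    for name, share in shares.items():
        print(f"{name}: {share:.1f}%")
    print(f"prompts in the two unnamed phases: {prompts_in_other_phases}")
    print(f"lines of code per prompt: {lines_per_prompt:.1f}")
```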
… Our quasi-experimental study assesses the impact of prompt-driven interaction on college students’ programming behaviors, the quality of their interactions, and their perceptions …
Non-computer-science novices tend to rely on memorization and imitation when learning programming-related courses. Applying structured learning strategies such as Creative Problem Solving (CPS) can scaffold learners through complex tasks, while timely and adaptive support from ChatGPT can foster active knowledge construction and the development of higher-order thinking (HOT). However, conventional ChatGPT interactions are primarily based on one-way question–answering, which is not conducive to beginners who lack well-established knowledge structures. To address this limitation, this study developed a ChatGPT-based instructional assistant integrated with CPS, namely CPS-GPT. The results indicate that CPS-GPT significantly outperformed the comparison condition in creativity (p = .013, Cohen’s d = .557), critical thinking (p = .003, Cohen’s d = .674), problem solving (p < .001, Cohen’s d = .780), knowledge construction (p = .015, Cohen’s d = .548), as well as cognitive (p = .039, Cohen’s d = .459), behavioral (p = .006, Cohen’s d = .618), and social engagement (p = .004, Cohen’s d = .656). Nevertheless, its task-oriented and highly structured design also emerged as a key factor underlying the non-significant effect on emotional engagement (p = .373, Cohen’s d = .201).
Large Language Models (LLMs) are gaining momentum in software development with prompt-driven programming enabling developers to create code from Natural Language (NL) instructions. However, studies have questioned their ability to produce secure code and, thereby, the quality of prompt-generated software. Alongside, various prompting techniques that carefully tailor prompts have emerged to elicit optimal responses from LLMs. Still, the interplay between such prompting strategies and secure code generation remains under-explored and calls for further investigations. Objective: In this study, we investigate the impact of different prompting techniques on the security of code generated from NL instructions by LLMs. Method: First, we perform a systematic literature review to identify the existing prompting techniques that can be used for code generation tasks. A subset of these techniques are evaluated on GPT-3, GPT-3.5, and GPT-4 models for secure code generation. For this, we used an existing dataset consisting of 150 NL security-relevant code generation prompts. Results: Our work (i) classifies potential prompting techniques for code generation (ii) adapts and evaluates a subset of the identified techniques for secure code generation tasks, and (iii) observes a reduction in security weaknesses across the tested LLMs, especially after using an existing technique called Recursive Criticism and Improvement (RCI), contributing valuable insights to the ongoing discourse on LLM-generated code security.
… This paper has presented Prompt Driven Test Generation (PDTG), a novel approach that integrates large language models (LLMs) and knowledge graphs to address the unique …
This paper investigates the use of Generative AI chatbots, especially large language models like ChatGPT, in enhancing data analysis skills through structured prompts in an educational setting. The study addresses the challenge of deploying AI tools for learners new to programming and data analysis, focusing on the role of structured prompt engineering as a facilitator. In this study, engineering students were trained to use structured prompts in conjunction with Generative AI to improve their data analysis skills. The t-test comparing pre-test and post-test scores on programming and data analysis shows a significant difference, indicating learning progress. Additionally, the task completion rate reveals that 45% of novice participants completed tasks using Generative AI and structured prompts. This finding highlights the transformative impact of Generative AI in education, indicating a shift in learning experiences and outcomes. The integration of structured prompts with Generative AI not only aids skill development but also marks a new direction in educational methodologies.
Creative coding tasks are often exploratory in nature. When producing digital artwork, artists usually begin with a high-level semantic construct such as a “stained glass filter” and programmatically implement it by varying code parameters such as shape, color, lines, and opacity to produce visually appealing results. Based on interviews with artists, it can be effortful to translate semantic constructs to program syntax, and current programming tools don’t lend well to rapid creative exploration. To address these challenges, we introduce Spellburst, a large language model (LLM) powered creative-coding environment. Spellburst provides (1) a node-based interface that allows artists to create generative art and explore variations through branching and merging operations, (2) expressive prompt-based interactions to engage in semantic programming, and (3) dynamic prompt-driven interfaces and direct code editing to seamlessly switch between semantic and syntactic exploration. Our evaluation with artists demonstrates Spellburst’s potential to enhance creative coding practices and inform the design of computational creativity tools that bridge semantic and syntactic spaces.
Background: Container orchestration systems like Kubernetes rely heavily on declarative manifest files, which serve as orchestration blueprints. However, managing these manifest files is often complex and requires substantial DevOps expertise. Methodology: This study investigates the use of Large Language Models (LLMs) to automate the creation of Kubernetes manifest files from natural language specifications, utilizing prompt engineering techniques within an innovative error- and warning-report–aware refinement process. We assess the capabilities of these LLMs using Zero-Shot, Few-Shot, Prompt-Chaining, and Self-Refine methods to address DevOps needs and support fully automated deployment pipelines. Results: Our findings show that LLMs can generate Kubernetes manifests with varying levels of manual intervention. Notably, GPT-4 and GPT-3.5 demonstrate strong potential for deployment automation. Interestingly, smaller models sometimes outperform larger ones, challenging the assumption that larger models always yield better results. Conclusions: This research highlights the crucial impact of prompt engineering on LLM performance for Kubernetes tasks and recommends further exploration of prompt techniques and model comparisons, outlining a promising path for integrating LLMs into automated deployment workflows.
Artificial intelligence (AI) has a massive impact on all industries and digital advances. Software development is also at the forefront of receiving this massive influence. According to industry experts, AI and machine learning technologies are anticipated to enhance every area of the software development life cycle (SDLC). AI and software engineering are two disciplines that have developed independently and one after the other. AI techniques strive to construct software systems that inculcate some sort of human intelligence in software. Software inspections have been used successfully to detect faults in various types of software documents, such as specifications. The study broadly inculcates the involvement of human intelligence in software and trains it to think and analyze like humans. Nowadays, use of AI assists practitioners in a variety of ways, from project timetable prediction to software delivery estimation, bug resolving, coding, and testing. This chapter focuses on each phase of the SDLC along with the AI conjunction.
This work introduces the Fuzzy Specification Tree Model (FST), a general-purpose framework designed to enhance AI-assisted software engineering. The paper begins by examining the intricate interplay between software engineering and artificial intelligence (AI), emphasizing how AI technologies are reshaping software development methodologies. Building on a foundation of requirements-driven approaches, the study presents a novel adaptation of classical feature modelling to create a versatile, fuzzy logic-based requirements specification model. This model not only facilitates the definition of functionalities for partially completed software but also supports formal methods for project management, version control, and reuse. By employing separate Fuzzy Specification Trees for requirements and the current state of a project, developers gain a dynamic perspective on project completeness and can leverage AI assistance to prioritize tasks, ensuring efficient progression toward project completion with minimal effort.
The field of code generation, influenced by deep learning, has become crucial in contemporary software engineering, facilitating the conversion of natural language to executable code. …
Agile methods have served software engineering well for over two decades, improving responsiveness to change, empowering teams, and facilitating better communication among various project stakeholders. But is it enough to lead us through the next era where balancing business value with human values has become more relevant than ever, especially in an increasingly artificial intelligence (AI)-assisted, hybrid world? We do not think so, and, in this article, we present our vision of “augmented agile” where agile practices are augmented with new capabilities made possible by AI while incorporating human-centered values.
The day to day of a software engineer involves a variety of tasks. While many of these tasks are collaborative, it is not always possible or feasible to engage with other engineers for task completion. Software tools, such as code generators and static analysis tools, aim to fill this gap by providing additional support for developers to effectively complete their tasks. With a steady stream of new tools emerging to support software engineers, including a new breed of tools that rely on artificial intelligence (AI), there are important questions to answer regarding the trust engineers can, and should, put into their software tools and what it means to build a trustworthy tool. To this end, this paper presents findings from a mixed methods investigation of the factors that contribute to trust in traditional and AI-assisted software tools. First, we introduce the PICSE (pronounced “pixie”) framework for trust in software tools that we developed based on a set of 18 interviews with software practitioners internal and external to Microsoft. We then discuss insights from a survey with 368 internal Microsoft responses on the relevance and importance of the factors in our framework with respect to traditional and AI-assisted tools.
A paradigm shift is underway in Software Engineering, with AI systems such as LLMs playing an increasingly important role in boosting software development productivity. This trend is anticipated to persist. In the next years, we expect a growing symbiotic partnership between human software developers and AI. The Software Engineering research community cannot afford to overlook this trend; we must address the key research challenges posed by the integration of AI into the software development process. In this article, we present our vision of the future of software development in an AI-driven world and explore the key challenges that our research community should address to realize this vision.
The aim of this paper is to explore the AI-Driven Code Generation and Optimization. The continuous evolution of code generators has also opened up new possibilities for automating repetitive tasks, allowing for greater focus on high-level problem-solving and design rather than low-level implementation details. As technology continues to advance, the role of code generators in software development is expected to expand even further, offering innovative solutions to the challenges of tomorrow's computing landscape. Whether the ultimate vision is that of a programmer taking "creative coding" to the next level, the expert use of specialized DSLs, or automated software development, we believe that the pathway to practical realization is indeed in their enclosure [1]. Within this context, it is evident that the integration of AI methodologies and flex space technologies is a significant area of interest for researchers, as it presents numerous opportunities for continued innovation and advancement in the field. As we delve deeper into the intricacies of code generation and optimization techniques, it becomes increasingly apparent that further exploration and refinement are crucial in order to unlock the full potential of these cutting-edge AI-driven approaches. Additionally, the identification and proactive mitigation of challenges inherent in the convergence of AI and flex space are pivotal to ensuring the successful development and deployment of impactful solutions. Through a nuanced understanding of these critical domains, researchers and practitioners can work towards realizing the transformative possibilities that lie at the intersection of AI and flex space technologies [1]. By addressing the complexities and nuances associated with these advanced methodologies, we can facilitate the evolution of programming practices and software development processes, ultimately leading to the materialization of the envisioned creative and efficient computing ecosystems.
Artificial Intelligence (AI) tools have transformed software development, making it crucial to equip computer science (CS) students with the skills to leverage these technologies. This talk presents an innovative curriculum approach, integrating AI tools into an advanced CS capstone course at a stage where students possess foundational skills in software engineering. This strategic timing ensures students can critically engage with AI, recognizing biases and managing challenges like hallucinations in AI-generated outputs. Before redesigning the curriculum, independent research was conducted to understand the strengths and limitations of various AI tools, such as Lucidchart and Eraser.io for design documentation, and GitHub Copilot, GPT-4, Codeium, Claude, and Gemini for implementation tasks like code generation, code completion, UI design, error handling, and API integration. This research guided the curriculum by shaping assignment design and delivering foundational lectures on prompt engineering to ease the learning curve for students. Experiments during the capstone course included AI-enhanced assignments and projects, where students applied these tools for software design and implementation. Quantitative data (prompt refinement counts, error rates, code accuracy) and qualitative reflections revealed increased confidence in AI tools, enhanced productivity, and greater readiness for industry roles. Despite these benefits, students faced challenges with complex tasks that required iterative refinement and oversight, but they gained skills in managing biases and hallucinations in AI outputs. The curriculum's "right-left" approach enables a smooth transition to AI-assisted development, preparing students for the evolving tech landscape. This talk shares key findings, best practices, and insights into balancing manual skills with AI-enhanced learning.
AI-based tools for software development are widely discussed in academic literature. They promise to boost software development performance, especially in code creation. This paper collects insights from practitioners about the use and implications of AI assistance in industrial software development, with a focus on SMEs. Through interviews with five developers from three software development organizations, we gathered and analyzed the experiences made in industrial practice, and we identified lessons learned and open challenges. ChatGPT and Copilot are used in industry projects. While they are considered useful for many code-related development activities, their integration in the development workflow remains mostly shallow. Contradicting observations about speed-ups due to AI support in development are reported. Legal issues are of minor concern although awareness exists. CCS Concepts: • Software and its engineering → Automatic programming.
Code generation models based on the pre-training and fine-tuning paradigm have been increasingly attempted by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To evaluate the effectiveness of these models, multiple existing benchmarks (e.g., HumanEval and AiXBench) are proposed, including only cases of generating a standalone function, i.e., a function that may invoke or access only built-in functions and standard libraries. However, non-standalone functions, which typically are not included in the existing benchmarks, constitute more than 70% of the functions in popular open-source projects, and evaluating models' effectiveness on standalone functions cannot reflect these models' effectiveness on pragmatic code generation scenarios (i.e., code generation for real settings of open source or proprietary code).
… when code generation is the only purpose of a conversation). … level, we analyze entire conversational structures by grouping … , multiple prompts of code generation form a single block). A …
The main goal of this paper is to create a PyTorch-based model that uses Natural Language prompts to generate Python code. The key component of the concept is a Transformer …
Large language models (LLMs) have recently been applied in software engineering to perform tasks such as translating code between programming languages, generating code from natural language, and autocompleting code as it is being written. When used within development tools, these systems typically treat each model invocation independently from all previous invocations, and only a specific limited functionality is exposed within the user interface. This approach to user interaction misses an opportunity for users to more deeply engage with the model by having the context of their previous interactions, as well as the context of their code, inform the model’s responses. We developed a prototype system – the Programmer’s Assistant – in order to explore the utility of conversational interactions grounded in code, as well as software engineers’ receptiveness to the idea of conversing with, rather than invoking, a code-fluent LLM. Through an evaluation with 42 participants with varied levels of programming experience, we found that our system was capable of conducting extended, multi-turn discussions, and that it enabled additional knowledge and capabilities beyond code generation to emerge from the LLM. Despite skeptical initial expectations for conversational programming assistance, participants were impressed by the breadth of the assistant’s capabilities, the quality of its responses, and its potential for improving their productivity. Our work demonstrates the unique potential of conversational interactions with LLMs for co-creative processes like software development.
Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HumanEval benchmark by 80x to build HumanEval+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+, while none of them could on HumanEval. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.
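The abstract reports drops in pass@k without defining the metric. A minimal sketch, assuming the unbiased estimator commonly used for this metric (with n samples per problem, of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k)). The function name and the sample counts below are illustrative and are not taken from EvalPlus.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated per problem; c: samples that pass all tests.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, which gives 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: stricter tests reject samples that used to pass,
# so c drops and pass@k drops with it.
before = pass_at_k(n=10, c=6, k=1)
after = pass_at_k(n=10, c=4, k=1)
```

Under this reading, adding test cases can only lower or preserve c for a fixed set of samples, which is why an augmented benchmark reports lower scores than the original.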
Although large language models (LLMs) have demonstrated remarkable code-generation ability, they still struggle with complex tasks. In real-world software development, humans usually tackle complex tasks through collaborative teamwork, a strategy that significantly controls development complexity and enhances software quality. Inspired by this, we present a self-collaboration framework for code generation employing LLMs, exemplified by ChatGPT. Specifically, through role instructions, (1) Multiple LLM agents act as distinct “experts,” each responsible for a specific subtask within a complex task; (2) Specify the way to collaborate and interact, so that different roles form a virtual team to facilitate each other’s work, ultimately the virtual team addresses code generation tasks collaboratively without the need for human intervention. To effectively organize and manage this virtual team, we incorporate software-development methodology into the framework. Thus, we assemble an elementary team consisting of three LLM roles (i.e., analyst, coder, and tester) responsible for software development’s analysis, coding, and testing stages. We conduct comprehensive experiments on various code-generation benchmarks. Experimental results indicate that self-collaboration code generation relatively improves 29.9–47.1% Pass@1 compared to the base LLM agent. Moreover, we showcase that self-collaboration could potentially enable LLMs to efficiently handle complex repository-level tasks that are not readily solved by the single LLM agent.
… We provide here a code generation example of an LLM-synthesized runtime verification monitor written in Python for the specification \(\boxminus (q_1 \leftrightarrow \ominus (\ominus …
We present SynVer — a novel, general purpose synthesizer for C programs equipped with machine-checked proofs of correctness using the Verified Software Toolchain. To do so, SynVer employs two Large Language Models (LLMs): the first generates candidate programs from user-provided specifications, and the second helps automatically construct formal proofs of their correctness in the Rocq proof assistant. To facilitate verification, SynVer places a set of syntactic restrictions on candidate programs that make them amenable to automated reasoning. SynVer uses a hybrid verification strategy that combines symbolic reasoning with LLM-powered proof generation to discharge proof obligations that the symbolic engine cannot handle on its own. We demonstrate the applicability of SynVer using a diverse set of benchmarks drawn from the program synthesis and verification literature.
Formal verification provides a rigorous and systematic approach to ensure the correctness and reliability of software systems. Yet, constructing specifications for the full proof relies on domain expertise and non-trivial manpower. In view of such needs, an automated approach for specification synthesis is desired. Existing automated approaches are limited in their versatility, i.e., they either focus only on synthesizing loop invariants for numerical programs, or are tailored for specific types of programs or invariants. Programs involving multiple complicated data types (e.g., arrays, pointers) and code structures (e.g., nested loops, function calls) are often beyond their capabilities. To help bridge this gap, we present AutoSpec, an automated approach to synthesize specifications for automated program verification. It overcomes the shortcomings of existing work in specification versatility, synthesizing satisfiable and adequate specifications for full proof. It is driven by static analysis and program verification, and is empowered by large language models (LLMs). AutoSpec addresses the practical challenges in three ways: (1) driving AutoSpec by static analysis and program verification, LLMs serve as generators to generate candidate specifications, (2) programs are decomposed to direct the attention of LLMs, and (3) candidate specifications are validated in each round to avoid error accumulation during the interaction with LLMs. In this way, AutoSpec can incrementally and iteratively generate satisfiable and adequate specifications. The evaluation shows its effectiveness and usefulness, as it outperforms existing works by successfully verifying 79% of programs through automatic specification synthesis, a significant improvement of 1.592x. It can also be successfully applied to verify the programs in a real-world X509-parser project.
The use of LLMs in code generation tools has introduced a paradigm shift in software development, streamlining the process and enhancing automation and efficiency. This study presents a comprehensive analysis of the applications and effectiveness of Large Language Models (LLMs) in code synthesis, based upon the analysis of various models. LLM techniques, in which program code is constrained by high-level and low-level programming paradigms, have emerged as a dominant strategy in software productivity due to their inherent ability to promote efficiency and minimize the time needed to build logic. Our research systematically explores the impact of LLMs on performance outcomes across various programming languages, comparing them to traditional coding practices. We analyze multiple case studies, quantitatively evaluating the success rates, efficiency, and problem-solving capacity of LLM-based solutions. Preliminary findings indicate that LLM use encourages a unique problem-solving approach that, despite its limitations, often results in highly efficient and innovative solutions. However, the technique also presents a steep learning curve that may deter novice programmers. This study aims to contribute to the body of knowledge on software productivity strategies and the continuing discourse on code efficiency and optimization.
Recently, large language models (LLMs), especially those pretrained on code, have demonstrated strong capabilities in generating programs from informal natural language intent. However, LLM-generated code is prone to bugs. Developers interacting with LLMs seek trusted code and, ideally, clear indications of potential bugs and vulnerabilities. Verified code can mitigate potential business risks associated with adopting generated code. We use the model-agnostic framework CodePatchLLM, an extension for LLMs that utilizes Svace feedback to enhance code generation quality. We evaluate CodePatchLLM on four popular LLMs across three datasets. Our experiments show an average absolute reduction of 19.1% in static analyzer warnings for Java across all datasets and models, while preserving pass@1 code generation accuracy.
LLM-based assistants, such as GitHub Copilot and ChatGPT, have the potential to generate code that fulfills a programming task described in a natural language description, referred to as a prompt. The widespread accessibility of these assistants enables users with diverse backgrounds to generate code and integrate it into software projects. However, studies show that code generated by LLMs is prone to bugs and may miss various corner cases in task specifications. Presenting such buggy code to users can impact their reliability and trust in LLM-based assistants. Moreover, significant efforts are required by the user to detect and repair any bug present in the code, especially if no test cases are available. In this study, we propose a self-refinement method aimed at improving the reliability of code generated by LLMs by minimizing the number of bugs before execution, without human intervention, and in the absence of test cases. Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code. These VQs target various nodes within the Abstract Syntax Tree (AST) of the initial code, which have the potential to trigger specific types of bug patterns commonly found in LLM-generated code. Finally, our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code. Our evaluation, based on programming tasks in the CoderEval dataset, demonstrates that our proposed method outperforms state-of-the-art methods by decreasing the number of targeted errors in the code by 21% to 62% and improving the number of executable code instances to 13%.
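To make the idea of questions that target AST nodes concrete, here is a small sketch using Python's standard `ast` module. The node types chosen, the question wording, and the function name `verification_questions` are assumptions made for illustration; the abstract does not specify which nodes or bug patterns the authors target.

```python
import ast


def verification_questions(source: str) -> list[str]:
    """Walk the AST and emit one question per potentially risky node."""
    questions = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Subscript):
            questions.append(
                f"Line {node.lineno}: can this index or key be out of range or missing?"
            )
        elif isinstance(node, ast.BinOp) and isinstance(
            node.op, (ast.Div, ast.FloorDiv, ast.Mod)
        ):
            questions.append(
                f"Line {node.lineno}: can the right-hand operand be zero?"
            )
        elif isinstance(node, ast.While):
            questions.append(
                f"Line {node.lineno}: is this loop guaranteed to terminate?"
            )
    return questions


# Hypothetical LLM output to inspect: a mean that fails on an empty list.
code = "def mean(xs):\n    return sum(xs) / len(xs)\n"
qs = verification_questions(code)
```

In a full pipeline, the questions in `qs` would be appended to a repair prompt together with the initial code; that re-prompting step is omitted here.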
Large Language Models (LLMs) have emerged as powerful tools for code generation. Yet, they often struggle to produce code that is both syntactically and semantically correct, particularly when correctness must be formally verified. Addressing this gap, we present a novel, model-agnostic framework that couples LLMs with a lightweight reinforcement learning (RL) agent to enable scalable, robust, and formally verifiable code generation without the need for costly model fine-tuning. Centered on the generation of Dafny code, a formally verifiable programming language, our system initiates with LLM-generated code, which is rigorously evaluated within an integrated verification environment. Upon failure, we feed the erroneous code and error metadata to an RL agent trained to explore the prompt-code space (i.e., the set of possible prompt edits paired with their resulting generated code variants). This agent strategically selects corrective prompts to minimize verification iterations, effectively steering the LLM toward correct outputs using its latent capabilities. Once verified, Dafny code can be systematically translated to C for high-level synthesis (HLS), ensuring correctness-by-construction in downstream hardware design workflows. On a 100‑task benchmark, PREFACE’s error‑guided prompt refinement raises verification success by up to 21% — 14% for ChatGPT‑4o, 17% for ChatGPT‑o1‑mini, 10% for Qwen2.5‑Coder‑14B, 4% for Qwen2.5‑7B, and 21% for Gemini‑2‑Flash—demonstrating substantial gains across diverse LLMs.
… While several verified lifting tools have been developed for various application domains, … propose an LLM-based approach (LLMLIFT) to building verified lifting tools. We use the LLM’s …
Performance Evaluation of Prompt Generation Strategies for AI Agents in Online Programming Education
The integration of artificial intelligence agents in online programming education has revolutionized how students receive instructional support and feedback. This research investigates the performance evaluation of different prompt generation strategies employed by AI agents to assist programming learners. The study examines three distinct prompt generation approaches: rule-based progressive prompting, data-driven adaptive prompting, and hybrid context-aware prompting. Through a controlled experimental design involving 180 undergraduate students enrolled in introductory Python programming courses, we evaluated these strategies across multiple performance dimensions including learning effectiveness, engagement metrics, code completion rates, and student satisfaction. Quantitative analysis revealed that the hybrid context-aware prompting strategy achieved superior learning outcomes with normalized gains averaging 0.51 compared to data-driven (0.42) and rule-based approaches (0.35). The evaluation framework incorporated behavioral analytics, cognitive load measurements, and longitudinal performance tracking over an eight-week period. Results demonstrate significant variations in strategy effectiveness based on student proficiency levels, problem complexity, and learning contexts. This research contributes empirical evidence for optimizing AI agent design in educational technology and provides practical guidelines for implementing adaptive prompting mechanisms in programming learning environments.
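The abstract reports normalized gains (0.51, 0.42, 0.35) without giving a formula. One common definition, assumed here and not confirmed by the abstract, divides the observed improvement by the maximum possible improvement: g = (post - pre) / (max - pre). The scores in the example are made up.

```python
def normalized_gain(pre: float, post: float, max_score: float = 100.0) -> float:
    """Fraction of the available headroom that a learner actually gained."""
    if not 0 <= pre < max_score:
        raise ValueError("pre-test score must be in [0, max_score)")
    return (post - pre) / (max_score - pre)


# Hypothetical learner: 50 on the pre-test, 75.5 on the post-test.
gain = normalized_gain(pre=50.0, post=75.5)
```

Because the denominator shrinks as the pre-test score rises, this measure lets groups with different starting proficiency be compared on the same scale.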
As AI assistance becomes embedded in programming practice, researchers have increasingly examined how these systems help learners generate code and work more efficiently. However, these studies often position AI as a replacement for human collaboration and overlook the social and learning-oriented aspects that emerge in collaborative programming. Our work introduces human-human-AI (HHAI) triadic programming, where an AI agent serves as an additional collaborator rather than a substitute for a human partner. Through a within-subjects study with 20 participants, we show that triadic collaboration enhances collaborative learning and social presence compared to the dyadic human–AI (HAI) baseline. In the triadic HHAI conditions, participants relied significantly less on AI generated code in their work. This effect was strongest in the HHAI-shared condition, where participants had an increased sense of responsibility to understand AI suggestions before applying them. These findings demonstrate how triadic settings activate socially shared regulation of learning by making AI use visible and accountable to a human peer, suggesting that AI systems that augment rather than automate peer collaboration can better preserve the learning processes that collaborative programming relies on.
Interacting with computers is a ubiquitous activity for millions of people. Repetitive or specialized tasks often require creation of small, often one-off, programs. End-users struggle with learning and using the myriad of domain-specific languages (DSLs) to effectively accomplish these tasks. We present a general framework for constructing program synthesizers that take natural language (NL) inputs and produce expressions in a target DSL. The framework takes as input a DSL definition and training data consisting of NL/DSL pairs. From these it constructs a synthesizer by learning optimal weights and classifiers (using NLP features) that rank the outputs of a keyword-programming based translation. We applied our framework to three domains: repetitive text editing, an intelligent tutoring system, and flight information queries. On 1200+ English descriptions, the respective synthesizers rank the desired program as the top-1 and top-3 for 80% and 90% descriptions respectively.
… as they pertain to program synthesis. This includes the … be adapted for interpreting natural language program specifications. … paper “Program synthesis and natural language processing: …
Why it is unlikely new developments in machine intelligence will eventually make programming obsolete.
Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques, that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool, Jigsaw, targeted at synthesizing code for using Python Pandas API using multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.
… synthesis technology, as well as the advantages and disadvantages of each method. Finally, we summarize the intelligent program synthesis … framework translating natural language to …
Multimodal program synthesis, which leverages different types of user input to synthesize a desired program, is an attractive way to scale program synthesis to challenging settings; however, it requires integrating noisy signals from the user (like natural language) with hard constraints on the program's behavior. This paper proposes an optimal neural synthesis approach where the goal is to find a program that satisfies user-provided constraints while also maximizing the program's score with respect to a neural model. Specifically, we focus on multimodal synthesis tasks in which the user intent is expressed using a combination of natural language (NL) and input-output examples. At the core of our method is a top-down recurrent neural model that places distributions over abstract syntax trees conditioned on the NL input. This model not only allows for efficient search over the space of syntactically valid programs, but also allows us to leverage automated program analysis techniques for pruning the search space based on infeasibility of partial programs with respect to the user's constraints. The experimental results on a multimodal synthesis dataset (StructuredRegex) show that our method substantially outperforms prior state-of-the-art techniques in terms of accuracy, finds model-optimal programs more frequently, and explores fewer states during search.
Pre-trained Large Language Models (LLMs) are beginning to dominate the discourse around automatic code generation with natural language specifications. In contrast, the best-performing synthesizers in the domain of formal synthesis with precise logical specifications are still based on enumerative algorithms. In this paper, we evaluate the abilities of LLMs to solve formal synthesis benchmarks by carefully crafting a library of prompts for the domain. When one-shot synthesis fails, we propose a novel enumerative synthesis algorithm, which integrates calls to an LLM into a weighted probabilistic search. This allows the synthesizer to provide the LLM with information about the progress of the enumerator, and the LLM to provide the enumerator with syntactic guidance in an iterative loop. We evaluate our techniques on benchmarks from the Syntax-Guided Synthesis (SyGuS) competition. We find that GPT-3.5 as a stand-alone tool for formal synthesis is easily outperformed by state-of-the-art formal synthesis algorithms, but our approach integrating the LLM into an enumerative synthesis algorithm shows significant performance gains over both the LLM and the enumerative synthesizer alone and the winning SyGuS competition tool.
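The idea of a weighted enumerative search can be illustrated with a minimal sketch. This is not the paper's algorithm: the grammar, the `WEIGHTS` table, and the `synthesize` function are hypothetical, the weights are hard-coded stand-ins for guidance that an LLM might supply, and input-output examples replace the logical specifications used in SyGuS.

```python
import heapq
import itertools
import math

# Toy grammar: E -> x | y | 0 | 1 | E + E | E - E | max(E, E)
# Hypothetical production weights, standing in for LLM-derived guidance.
WEIGHTS = {"x": 0.25, "y": 0.25, "0": 0.05, "1": 0.05,
           "+": 0.1, "-": 0.1, "max": 0.2}
LEAVES = ("x", "y", "0", "1")
BINARY_OPS = ("+", "-", "max")


def cost(symbol):
    """Negative log-weight: likelier productions are cheaper."""
    return -math.log(WEIGHTS[symbol])


def evaluate(term, env):
    """Evaluate a term (a leaf string or an (op, left, right) tuple)."""
    if isinstance(term, str):
        return env[term] if term in env else int(term)
    op, left, right = term
    a, b = evaluate(left, env), evaluate(right, env)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    return max(a, b)


def synthesize(examples, max_candidates=10_000):
    """Best-first enumeration: terms are tried in order of increasing cost."""
    tie = itertools.count()  # unique tie-breaker so terms are never compared
    heap = [(cost(leaf), next(tie), leaf) for leaf in LEAVES]
    heapq.heapify(heap)
    popped = []  # (cost, term) pairs already checked, reused as sub-terms
    for _ in range(max_candidates):
        if not heap:
            return None
        c, _tie, term = heapq.heappop(heap)
        if all(evaluate(term, env) == out for env, out in examples):
            return term
        popped.append((c, term))
        # Combine the new term with every term seen so far (including itself).
        for other_cost, other in popped:
            for op in BINARY_OPS:
                total = c + other_cost + cost(op)
                heapq.heappush(heap, (total, next(tie), (op, term, other)))
                if other is not term:
                    heapq.heappush(heap, (total, next(tie), (op, other, term)))
    return None


examples = [({"x": 1, "y": 2}, 2), ({"x": 5, "y": 3}, 5), ({"x": 0, "y": 0}, 0)]
print(synthesize(examples))  # a term equivalent to max(x, y)
```

Because `max` carries a higher weight than `+` or `-`, candidates built with it are reached earlier. In an iterative setting the weights would be updated between rounds rather than fixed.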
Requirements analysis is a crucial phase in software development. Manual conversion of natural language to models is error-prone and inefficient. Large Language Models (LLMs) offer a promising approach for automating requirement modelling, but there is a gap between their generated results and the needs of real applications. We introduce an interactive and iterative optimisation framework (GCSS) comprising generation, comparison, selection, and supplementation components. GCSS employs a staged strategy to guide LLMs in model generation. By continuously generating, comparing, and incorporating user decisions, GCSS explores and integrates various modelling options. This leads to an optimal solution. Automatically generated feedback is used as supplementary information to guide the next generation, enabling continuous optimisation of outcomes. We evaluated GCSS on different cases, and the experiments show that the models it generates align with expectations, while reducing workload.
It is challenging to generate the code for a complete user interface using a Large Language Model (LLM). User interfaces are complex and their implementations often consist of multiple, inter-related files that together specify the contents of each screen, the navigation flows between the screens, and the data model used throughout the application. It is challenging to craft a single prompt for an LLM that contains enough detail to generate a complete user interface, and even then the result is frequently a single large and intricate file that contains all of the generated screens. In this paper, we introduce Athena, a prototype application generation environment that demonstrates how the use of shared intermediate representations, including an app storyboard, data model, and GUI skeletons, can help a developer work with an LLM in an iterative fashion to craft a complete user interface. These intermediate representations also scaffold the LLM’s code generation process, producing organized and structured code in multiple files while limiting errors. We evaluated Athena with a user study with 12 developers. Participants appreciated Athena’s support for prototyping multi-screen iOS apps, acknowledged that the intermediate representations improved their control and understanding of generated code, and discussed the limitations of the system and potential directions for improvement.
This paper presents the Function Block Assistant (fbAssistant), an LLM-backed tool prototype for developing control logic in industrial automation. fbAssistant interprets natural language requirements and automatically generates state machines and their function blocks implementation. The study demonstrates iterative refinement, simulation validation, and deployment using EcoStruxure Automation Expert. The proposed approach aims at improved efficiency and accuracy in the development of automation software.
The increasing complexity of semiconductor designs necessitates agile hardware development methodologies to keep pace with rapid technological advancements. Following this trend, Large Language Models (LLMs) have emerged as a potential solution, providing new opportunities in hardware design automation. However, existing LLMs exhibit challenges in HDL design and verification, especially for complicated hardware systems. Addressing this need, we introduce ChatCPU, the first end-to-end agile hardware design and verification platform with LLMs. ChatCPU streamlines the ASIC design and verification process, guiding it from initial specifications to the final RTL implementations with enhanced design agility. Incorporating LLM fine-tuning and a processor description language design for CPU design automation, ChatCPU significantly enhances the hardware design capability of LLMs. Utilizing ChatCPU, we developed a 6-stage in-order RISC-V CPU prototype, achieving successful tape-out through the SkyWater 130nm MPW project with Efabless, which is currently the largest CPU design generated by an LLM. Our results demonstrate a remarkable improvement in CPU design efficiency, accelerating the design iteration process by an average of 3.81X, and peaking at 12X and 9.33X in HDL implementations and verification stages, respectively. ChatCPU also enhances the design capability of the LLM by 2.63X as compared to base Llama 2. These advancements in ChatCPU represent a significant milestone in LLM-driven ASIC design and optimization.
Maintaining and evolving software systems often demands more effort than their initial development. Improving source code quality through automated support can significantly reduce technical debt, increase maintainability, and enhance developer productivity. This paper presents an experimental approach that integrates static analysis with Large Language Models (LLMs) to automate source code improvement. The proposed pipeline iteratively processes Java classes by extracting issues detected by SonarQube and transforming them into prompts for LLMs, which generate improved code versions. Each version is reanalyzed, and the process repeats until convergence or a predefined iteration limit is reached. The experimental setup includes multiple configurations combining two LLMs (GPT-4-mini and Gemini), variation in temperature, prompt style, and number of iterations. Evaluations were conducted using multiple Java datasets, with three repeated runs for the Commons Lang repository to identify behavioral patterns. The analysis focuses on the reduction in the number of issues, the decrease in technical debt (measured in a SonarQube metric), and the evolution of issue severity. Functional correctness was assessed manually by inspecting and executing the improved code to ensure behavior preservation. The results demonstrate that combining SonarQube with LLMs is effective in reducing code issues, achieving over 58% average reduction in key scenarios, while preserving functionality. The iterative process proved successful in guiding the models to incrementally improve code quality based on real static analysis feedback. This work contributes a reproducible and extensible pipeline, offering insights into the impact of LLM configurations and supporting further research in the integration of AI and software quality engineering.
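The analyze, prompt, and reanalyze loop described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `refine`, `toy_analyze`, and `toy_improve` are hypothetical names, and the two toy functions are simple stand-ins for a SonarQube scan and an LLM call.

```python
from typing import Callable, List, Tuple


def refine(code: str,
           analyze: Callable[[str], List[str]],
           improve: Callable[[str, List[str]], str],
           max_iterations: int = 5) -> Tuple[str, List[int]]:
    """Repeat analyze -> improve until clean, unchanged, or out of budget.

    Returns the final code and the issue count observed in each round.
    """
    history: List[int] = []
    for _ in range(max_iterations):
        issues = analyze(code)
        history.append(len(issues))
        if not issues:
            break  # nothing left to fix
        new_code = improve(code, issues)
        if new_code == code:
            break  # converged: the improver made no further change
        code = new_code
    return code, history


def toy_analyze(code: str) -> List[str]:
    """Hypothetical stand-in for a static analyzer."""
    issues = []
    for number, line in enumerate(code.splitlines(), start=1):
        if "TODO" in line:
            issues.append(f"line {number}: unresolved TODO")
        if "System.out.println" in line:
            issues.append(f"line {number}: use a logger")
    return issues


def toy_improve(code: str, issues: List[str]) -> str:
    """Hypothetical stand-in for an LLM: fixes only the first reported issue."""
    lines = code.splitlines()
    target = int(issues[0].split(":")[0].split()[1]) - 1
    if "TODO" in lines[target]:
        del lines[target]
    else:
        lines[target] = lines[target].replace("System.out.println", "LOGGER.info")
    return "\n".join(lines)


source = "// TODO remove\nSystem.out.println(x);\nreturn x;"
final_code, counts = refine(source, toy_analyze, toy_improve)
print(counts)  # issue count per round
```

Passing the analyzer and the improver in as callables keeps the loop independent of any particular tool or model, which makes it straightforward to compare configurations.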
Recent Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements. This increasingly automated process mirrors traditional human-led software development, where code is often written in response to a requirement. Historically, Test-Driven Development (TDD) has proven its merit, requiring developers to write tests before the functional code, ensuring alignment with the initial problem statements. Applying TDD principles to LLM-based code generation offers one distinct benefit: it enables developers to verify the correctness of generated code against predefined tests. This paper investigates if and how TDD can be incorporated into AI-assisted code-generation processes. We experimentally evaluate our hypothesis that providing LLMs like GPT-4 and Llama 3 with tests in addition to the problem statements enhances code generation outcomes. We experimented with established function-level code generation benchmarks such as MBPP and HumanEval. Our results consistently demonstrate that including test cases leads to higher success in solving programming challenges. We assert that TDD is a promising paradigm for helping ensure that the code generated by LLMs effectively captures the requirements.
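Verifying generated code against predefined tests can be sketched in a few lines. The helper names `passes_tests` and `first_passing` are hypothetical, and the candidate strings below stand in for LLM outputs; this is a sketch of the checking step only, not the paper's evaluation harness. Running untrusted generated code with `exec` is unsafe outside a sandbox.

```python
from typing import Iterable, Optional


def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Run the candidate, then the tests, in one fresh namespace."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # defines the function under test
        exec(test_source, namespace)       # assertions raise on failure
    except Exception:
        return False
    return True


def first_passing(candidates: Iterable[str], test_source: str) -> Optional[str]:
    """Return the first candidate that satisfies every test, if any."""
    for source in candidates:
        if passes_tests(source, test_source):
            return source
    return None


# Tests are written first, as in TDD.
tests = "assert is_even(4)\nassert not is_even(7)\nassert is_even(0)"

# Hypothetical model outputs: the first is wrong, the second is correct.
candidates = [
    "def is_even(n):\n    return n % 2 == 1",
    "def is_even(n):\n    return n % 2 == 0",
]
chosen = first_passing(candidates, tests)
print(chosen)
```

The same test text can also be placed in the prompt, which is the setting the abstract evaluates; the check here only covers filtering candidates after generation.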
Large Language Models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in automatically generating code from provided natural language requirements. However, in real-world practice, it is inevitable that the requirements written by users might be ambiguous or insufficient. Current LLMs generate programs directly from such unclear requirements without asking for clarification, so the output is likely to deviate from the original user intent. To bridge that gap, we introduce a novel framework named ClarifyGPT, which aims to enhance code generation by empowering LLMs with the ability to identify ambiguous requirements and ask targeted clarifying questions. Specifically, ClarifyGPT first detects whether a given requirement is ambiguous by performing a code consistency check. If it is ambiguous, ClarifyGPT prompts an LLM to generate targeted clarifying questions. After receiving question responses, ClarifyGPT refines the ambiguous requirement and inputs it into the same LLM to generate a final code solution. To evaluate ClarifyGPT, we invite ten participants to use it for code generation on two benchmarks: MBPP-sanitized and MBPP-ET. The results show that ClarifyGPT elevates the performance (Pass@1) of GPT-4 from 70.96% to 80.80% on MBPP-sanitized. Furthermore, to conduct large-scale automated evaluations of ClarifyGPT across different LLMs and benchmarks without requiring user participation, we introduce a high-fidelity simulation method to simulate user responses. The results demonstrate that ClarifyGPT can significantly enhance code generation performance compared to the baselines. In particular, ClarifyGPT improves the average performance of GPT-4 and ChatGPT across five benchmarks from 62.43% to 69.60% and from 54.32% to 62.37%, respectively. A human evaluation also confirms the effectiveness of ClarifyGPT in detecting ambiguous requirements and generating high-quality clarifying questions. We believe that ClarifyGPT can effectively facilitate the practical application of LLMs in real-world development environments.
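One plausible reading of a code consistency check is to sample several candidate solutions for the same requirement and compare their behaviour on shared inputs: if they disagree, the requirement is treated as ambiguous. The sketch below illustrates that reading only. The abstract does not spell out the procedure, so `is_ambiguous` and the candidate functions are assumptions, with plain Python callables standing in for sampled LLM solutions.

```python
from typing import Callable, Iterable, Sequence, Tuple


def is_ambiguous(candidates: Iterable[Callable],
                 inputs: Sequence[Tuple]) -> bool:
    """Flag a requirement as ambiguous when candidate solutions disagree."""
    behaviours = set()
    for solution in candidates:
        outputs = []
        for args in inputs:
            try:
                outputs.append(repr(solution(*args)))
            except Exception as exc:
                # A crash is also a behaviour; record the exception type.
                outputs.append(type(exc).__name__)
        behaviours.add(tuple(outputs))
    return len(behaviours) > 1


# Requirement: "sort the list". The order is unspecified, so sampled
# solutions may reasonably differ.
ascending = lambda xs: sorted(xs)
descending = lambda xs: sorted(xs, reverse=True)

test_inputs = [([3, 1, 2],), ([5, 4],)]
print(is_ambiguous([ascending, descending], test_inputs))
print(is_ambiguous([ascending, lambda xs: sorted(list(xs))], test_inputs))
```

In a fuller pipeline, a positive result would trigger the generation of clarifying questions rather than an immediate final answer.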
In contemporary Software Engineering (SE), coordinated API calls are necessary to perform complex data retrieval operations and other tasks. Large Language Models (LLMs) offer strong capabilities for natural language parsing and task automation, which has sparked research into integrating API orchestration with LLMs. Although existing LLM-based frameworks have developed considerably, they still struggle with complex tasks that involve iterative, step-by-step problem-solving. Current frameworks lack structured guidance and rely on the LLMs' own capabilities, resulting in blind iterations, inefficient error correction, and inefficient token utilization. This work introduces DesDD (Design-enabled framework with Dual-layer Debugging), a structured framework for LLM-driven iterative API orchestration. By applying software engineering design principles, we organize API orchestration workflows into distinct design and coding phases. Our dual-layer debugging mechanism detects and corrects errors in both phases, making the orchestration process more reliable and efficient. DesDD provides a structured design-first, then-code pathway, enabling LLMs to solve tasks iteratively through a well-defined, stepwise approach that systematically guides each problem-solving stage. The built-in dual-layer debugging component provides hierarchical error detection in both the design and coding aspects, allowing targeted corrections. Through comprehensive testing, we found that DesDD performs better than existing frameworks in orchestration efficiency and accuracy while using significantly fewer tokens. DesDD offers an effective LLM-based solution for API orchestration in complex tasks, e.g., chemical continuous flow process control. Its use of SE design principles ensures wide applicability, making it a promising approach for LLM-driven automated task resolution across diverse scenarios.
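This report consolidates research related to Vibe Coding into six core areas: the underlying program synthesis and formal verification, interactive prompt-driven and refinement frameworks, human-AI collaboration paradigms, multi-agent coordination for complex systems, engineering-effectiveness evaluation and trust mechanisms, and finally industrial adoption and future vision. This structure covers the full ecosystem from technical implementation through engineering practice to education and industry evolution, providing a systematic perspective for understanding AI-assisted software engineering (AI4SE).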