Vibe Coding
Conceptual Definition, Paradigm Evolution, and Socio-Technical Impact of Vibe Coding
This group of works lays the theoretical foundation for Vibe Coding, examining the paradigm shift from manually writing code to intent-based natural-language interaction. The research covers its core definition, the reconfiguration of developer roles, the democratization of software engineering, and socio-technical phenomena such as "material disengagement" in human-AI collaboration.
- Vibe Coding as a Reconfiguration of Intent Mediation in Software Development: Definition, Implications, and Research Agenda(Christian Meske, Tobias Hermanns, Esther Von der Weiden, Kai-Uwe Loser, T. Berger, 2025, IEEE Access)
- Democratizing Software Engineering through Generative AI and Vibe Coding: The Evolution of No-Code Development(Akhilesh Gadde, 2025, Journal of Computer Science and Technology Studies)
- Vibe Coding: A Multivocal Systematic Mapping Study(Nicole Beaulieu, S. Dascalu, Emily Hand, 2025, 2025 IEEE/ACIS 29th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD))
- Beyond Syntax: The Paradigm Shift to System Architecture in Large Language Model (LLM) Driven Development(Rizwanul Islam Afraim, 2026, SSRN Electronic Journal)
- Vibe Coding(Andrew J. Harris, 2025, Journal of Computing Sciences in Colleges (JCSC; Formerly: Journal of Computing in Small Colleges))
- Paradigms of Generative Artificial Intelligence in Automating Corporate Code Writing(Ankit Agarwal, 2025, The American Journal of Engineering and Technology)
- Vibe Coding a Research Probe for Exploring AI/Voice Based Code Reviews(Martin Gundtoft, 2025, International Conference on AI Research)
- Vibe coding: programming through conversation with artificial intelligence(Advait Sarkar, Ian Drosos, 2025, arXiv.org)
- (R)evolution of Programming: Vibe Coding as a Post-Coding Paradigm(Kevin Krings, Nino S. Bohn, Thomas Ludwig, 2025, arXiv.org)
- Building Software by Rolling the Dice: A Qualitative Study of Vibe Coding(Yi-Hung Chou, Bo Jiang, Yi Chen, M. Weng, Victoria Jackson, Thomas Zimmermann, James A. Jones, 2025, arXiv.org)
Prompt Engineering, Retrieval-Augmented Generation (RAG), and Natural-Language Programming Frameworks
This group studies how structured prompting, dedicated declarative frameworks (e.g., PDL, TEML), and externally retrieved knowledge (RAG) can strengthen LLM code generation, aiming to resolve the ambiguity of natural language and achieve more precise intent communication and context awareness. A minimal sketch of such a retrieval-augmented prompting pipeline follows the list below.
- PDL: A Declarative Prompt Programming Language(M. Vaziri, Louis Mandel, Claudio Spiess, Martin Hirzel, 2024, arXiv.org)
- Improving ChatGPT Prompt for Code Generation(Chao Liu, Xuanlin Bao, Hongyu Zhang, Neng Zhang, Haibo Hu, Xiaohong Zhang, Meng Yan, 2023, arXiv.org)
- Prompt Sapper: A LLM-Empowered Production Tool for Building AI Chains(Yu Cheng, Jieshan Chen, Qing Huang, Zhenchang Xing, Xiwei Xu, Qinghua Lu, 2023, ACM Transactions on Software Engineering and Methodology)
- Code Semantic Zooming(Jinsheng Ba, Sverrir Thorgeirsson, Zhendong Su, 2025, arXiv.org)
- Sketch Then Generate: Providing Incremental User Feedback and Guiding LLM Code Generation through Language-Oriented Code Sketches(Zhuo Chen, Zeyu Xiong, Xiaoshuo Yao, E. Glassman, 2024, arXiv.org)
- Natural Language Outlines for Code: Literate Programming in the LLM Era(Kensen Shi, Deniz Altinbüken, Saswat Anand, Mihai Christodorescu, Katja Grünwedel, Alexa Koenings, S. Naidu, Anurag Pathak, M. Rasi, Fredde Ribeiro, Brandon Ruffin, Siddhant Sanyam, Maxim Tabachnyk, Sara Toth, Roy Tu, Tobias Welp, Pengcheng Yin, M. Zaheer, Satish Chandra, Charles Sutton, 2024, Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering)
- ProDec: Automated Prompt Decomposition(Ebtesam Al Haque, Brittany Johnson, 2025, 2025 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC))
- Feature Execution Graphs: A Human-AI Co-Programming Paradigm for Graph-Driven LLM Code Synthesis(H. Batatia, I. Svetlichnyĭ, 2025, Proceedings of the AAAI Symposium Series)
- Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation(Muzhaffar Hazman, Minh-Khoi Pham, Shweta Soundararajan, Gonçalo Mordido, L. Custode, David Lynch, Giorgio Cruciata, Yucheng Shi, Hongmeng Song, Wang Chao, Pan Yue, Aleksandar Milenovic, A. Agapitos, 2025, European Conference on Artificial Intelligence)
- Java Code Generation Using Prompt Engineering Techniques(Anh Truong, P. Le, Hau Tran, 2025, International Journal of Software Engineering and Knowledge Engineering)
- PathOCL: Path-Based Prompt Augmentation for OCL Generation with GPT-4(Seif Abukhalaf, Mohammad Hamdaqa, F. Khomh, 2024, Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering)
- AI Prompt Engineering for Reliable and Explainable Code Generation(Namanyay Goel, M. Sharath, Venkata Sesha Sai Praveen Tunikuntla, Ramkinker Singh, 2025, 2025 IEEE 11th International Conference on Computing, Engineering and Design (ICCED))
- Enhancing Code Generation in Low-Code Platforms through Systematic Prompt Engineering(Priyanshu Vishwakarma, 2026, International Journal for Research in Applied Science and Engineering Technology)
- Planning in Natural Language Improves LLM Search for Code Generation(Evan Z. Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean M. Hendryx, Summer Yue, Hugh Zhang, 2025, International Conference on Learning Representations)
- Automated Prompt Generation for Code Intelligence: An Empirical study and Experience in WeChat(Kexing Ji, Shiyun Fu, Cuiyun Gao, Yujia Chen, Zezhou Yang, Chaozheng Wang, Yuetang Deng, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE))
- Incorporating External Knowledge through Pre-training for Natural Language to Code Generation(Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, Graham Neubig, 2020, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics)
- From Words to Code: Harnessing Data for Program Synthesis from Natural Language(Anirudh Khatry, Joyce Cahoon, Jordan Henkel, Shaleen Deep, Venkatesh Emani, Avrilia Floratou, Sumit Gulwani, Vu Le, Mohammad Raza, Sherry Shi, Mukul Singh, A. Tiwari, 2023, arXiv.org)
- CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis(Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Haiquan Wang, Yingbo Zhou, S. Savarese, Caiming Xiong, 2022, International Conference on Learning Representations)
- Analysis of Natural language processing for code generation by using COPRAS Method(2024, REST Journal on Data Analytics and Artificial Intelligence)
- Automated Code Generation Algorithm Using GPT-Based Language Models for Secure Software Development(Saif Obbayed, Balasubramaniam Kumaraswamy, V.Hemamalini, Telidevara Naga, Pavana Madhuri, K. Bhuvaneshwari, Mohammed Abdul Malek, Al Saadi, 2025, 2025 3rd International Conference on Cyber Resilience (ICCR))
- Prompt Engineering, Tools and Methods for Immersive Experience Development(Alexander Rozo-Torres, Wilson J. Sarmiento, 2024, 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW))
- Promptware Engineering: Software Engineering for LLM Prompt Development(Zhenpeng Chen, Chong Wang, Weisong Sun, Guang Yang, Xuanzhe Liu, Jie M. Zhang, Yang Liu, 2025, arXiv.org)
- Compositional Program Synthesis from Natural Language and Examples(Mohammad Raza, Sumit Gulwani, Natasa Milic-Frayling, 2015, International Joint Conference on Artificial Intelligence)
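Most of the approaches above share a common prompt-construction shape: retrieve relevant context, then assemble it with the task and explicit constraints into a structured prompt. The sketch below illustrates only that shape; the toy lexical retriever, the prompt wording, and the `generate` callable (a stand-in for any code LLM) are illustrative assumptions, not drawn from any specific paper listed here.

```python
from typing import Callable, List

def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy lexical retriever: rank corpus snippets by word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda s: len(words & set(s.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(task: str, snippets: List[str], constraints: List[str]) -> str:
    """Assemble a structured prompt with explicit context, task, and constraint sections."""
    context = "\n".join(f"- {s}" for s in snippets)
    rules = "\n".join(f"- {c}" for c in constraints)
    return ("You are a careful Python programmer.\n"
            f"Relevant project context:\n{context}\n"
            f"Task:\n{task}\n"
            f"Constraints:\n{rules}\n"
            "Return only a single Python function.")

def rag_codegen(task: str, corpus: List[str], generate: Callable[[str], str]) -> str:
    """Retrieve context, build the structured prompt, and ask the model for code."""
    prompt = build_prompt(task, retrieve(task, corpus),
                          ["no external dependencies", "include a docstring"])
    return generate(prompt)

if __name__ == "__main__":
    corpus = ["def load_csv(path): ...  # project utility for reading data files",
              "Project convention: all public functions carry type hints."]
    echo = lambda p: "# model output would appear here; prompt was:\n" + p  # stub LLM call
    print(rag_codegen("Average a numeric CSV column.", corpus, echo))
```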
Agentic Architectures and Multi-Agent Collaborative Development Patterns
This group explores the use of LLM-based agents for autonomous planning, multi-role collaboration (e.g., division of labor among PM, SE, and PG roles), and end-to-end execution of complex tasks. The research focuses on multi-turn conversational feedback, self-evolving collaboration networks, and the interaction flow of human-AI pair programming. A minimal sketch of a role-based multi-agent loop follows the list below.
- A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement(Huan Zhang, Wei Cheng, Yuhan Wu, Wei Hu, 2024, Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering)
- Self-Evolving Multi-Agent Collaboration Networks for Software Development(Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, Siheng Chen, 2024, International Conference on Learning Representations)
- Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections(Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, Jeff Da, 2025, arXiv.org)
- CoPrompt: Supporting Prompt Sharing and Referring in Collaborative Natural Language Programming(Felicia Li Feng, Ryan Yen, Yuzhe You, Mingming Fan, Jian Zhao, Zhicong Lu, 2023, Proceedings of the CHI Conference on Human Factors in Computing Systems)
- Enhancing Automated Paper Reproduction via Prompt-Free Collaborative Agents(Zijie Lin, Qilin Cai, Liang Shen, Mingjun Xiao, 2025, arXiv.org)
- Beyond Code Generation: LLM-supported Exploration of the Program Design Space(J.D. Zamfirescu-Pereira, Eunice Jun, Michael Terry, Qian Yang, Bjorn Hartmann, 2025, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems)
- daVinci-Dev: Agent-native Mid-training for Software Engineering(Ji Zeng, Dayuan Fu, Tiantian Mi, Yumin Zhuang, Yaxing Huang, Xuefeng Li, Lyumanshan Ye, Muhang Xie, Qishuo Hua, Zhen Huang, Mohan Jiang, Han Wang, Jifan Lin, Yang Xiao, Jie Sun, Yunze Wu, Pengfei Liu, 2026, arXiv.org)
- Advancing LLM Agents for Code Generation: Observability, Orchestration, Reliable Performance(A. Kaplunovich, 2025, 2025 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS))
- Optimizing Agentic Code Generation: Cost Efficiency, Observability and Orchestration(A. Kaplunovich, 2025, 2025 IEEE International Conference on Big Data (BigData))
- VibeCodeHPC: An Agent-Based Iterative Prompting Auto-Tuner for HPC Code Generation Using LLMs(Shun-ichiro Hayashi, Koki Morita, Daichi Mukunoki, Tetsuya Hoshino, Takahiro Katagiri, 2025, arXiv.org)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis(Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, D. Eck, Aleksandra Faust, 2023, International Conference on Learning Representations)
- PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents(J. Islam, Md Ataullha, Saiful Azad, 2025, arXiv.org)
- BanglaForge: LLM Collaboration with Self-Refinement for Bangla Code Generation(Mahir Labib Dihan, Sadif Ahmed, Md Nafiu Rahman, 2025, arXiv.org)
- Supporting Contextual Conversational Agent-Based Software Development(Glaucia Melo dos Santos, Luis Fernando Lins, P. Alencar, Donald D. Cowan, 2023, 2023 IEEE/ACM 5th International Workshop on Bots in Software Engineering (BotSE))
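The agentic systems surveyed here typically decompose development into roles (e.g., planner, programmer, reviewer) that exchange messages over several rounds. A minimal sketch of that message-passing loop follows; the role prompts, the `APPROVE` convention, and the round budget are illustrative assumptions rather than any particular framework's API.

```python
from typing import Callable

LLM = Callable[[str], str]  # any text-in/text-out model call

def multi_agent_codegen(requirement: str, planner: LLM, programmer: LLM,
                        reviewer: LLM, rounds: int = 3) -> str:
    """Plan once, then alternate programmer and reviewer until approval or the budget runs out."""
    plan = planner(f"Break this requirement into implementation steps:\n{requirement}")
    code = programmer(f"Implement this plan in Python:\n{plan}")
    for _ in range(rounds):
        review = reviewer(f"Requirement:\n{requirement}\nCode:\n{code}\n"
                          "Reply APPROVE, or list concrete defects.")
        if review.strip().startswith("APPROVE"):
            break
        code = programmer(f"Revise the code to address this review:\n{review}\nCode:\n{code}")
    return code

if __name__ == "__main__":
    # Stub roles for demonstration; each would normally wrap a separately prompted LLM persona.
    planner = lambda p: "1. Parse input  2. Compute result  3. Return it"
    programmer = lambda p: "def double(x):\n    return x * 2"
    reviewer = lambda p: "APPROVE"
    print(multi_agent_codegen("Double a number.", planner, programmer, reviewer))
```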
Iterative Repair, Self-Debugging, and Execution-Feedback Optimization Loops
This group concentrates on improving the quality of generated code through closed-loop feedback, including compiler errors, static analysis, test-case feedback, and model self-correction mechanisms; these loops are the core technical ingredient for making Vibe Coding reliable. A minimal sketch of such a generate-test-repair loop follows the list below.
- Large Language Model Guided Self-Debugging Code Generation(Muntasir Adnan, Zhiwei Xu, Carlos C. N. Kuhn, 2025, arXiv.org)
- LLMLOOP: Improving LLM-Generated Code and Tests Through Automated Iterative Feedback Loops(R. Ravi, Dylan Bradshaw, Stefano Ruberto, Gunel Jahangirova, Valerio Terragni, 2025, 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME))
- Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach(Melika Sepidband, Hamed Taherkhani, Song Wang, Hadi Hemmati, 2025, 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC))
- SELF-REDRAFT: Eliciting Intrinsic Exploration-Exploitation Balance in Test-Time Scaling for Code Generation(Yixiang Chen, Tian Zheng, Shijue Huang, Zhitao He, Yi R. Fung, 2025, arXiv.org)
- Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback(Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fan Lu, Zili Zhang, Yulei Sui, Xuanhua Shi, Hai Jin, 2024, Annual Meeting of the Association for Computational Linguistics)
- A Reflexion-Driven, Document-Constrained Multi-Expert Framework for Reliable Program Synthesis in Graph-Based QA(Rui Guo, 2025, Proceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing)
- Neuro-Symbolic Program Synthesis for Multi-Hop Natural Language Navigation(William English, Dominic Simon, Md Rubel Ahmed, S. Jha, Rickard Ewetz, 2024, 2024 International Conference on Assured Autonomy (ICAA))
- LLM-based Iterative Refinement of Finite-State Machines with STPA Controller Constraints and Generation of IEC 61499 Code(Akira King, V. Vyatkin, 2025, 2025 IEEE 30th International Conference on Emerging Technologies and Factory Automation (ETFA))
- LEVER: Learning to Verify Language-to-Code Generation with Execution(Ansong Ni, Srini Iyer, Dragomir R. Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, Xi Victoria Lin, 2023, International Conference on Machine Learning)
- Program Synthesis via Test-Time Transduction(Kang-il Lee, Jahyun Koo, Seunghyun Yoon, Minbeom Kim, Hyukhun Koh, Dongryeol Lee, Kyomin Jung, 2025, arXiv.org)
- B-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis(Zishun Yu, Yunzhe Tao, Liyu Chen, Tao Sun, Hongxia Yang, 2023, International Conference on Learning Representations)
- CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement(Leitian Tao, Xiang Chen, Tong Yu, Tung Mai, Ryan A. Rossi, Yixuan Li, Saayan Mitra, 2024, Trans. Mach. Learn. Res.)
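These feedback loops differ in what signal they return to the model (compiler diagnostics, complexity metrics, test failures), but share a single cycle: generate, execute against checks, feed the failure back, regenerate. A minimal runnable sketch of that cycle is shown below; real systems sandbox execution rather than calling `exec` in-process, and `generate` again stands in for any code LLM.

```python
import traceback
from typing import Callable, Dict

def run_checks(code: str, checks: Callable[[Dict], None]) -> str:
    """Execute candidate code and its checks; return '' on success or a short error report."""
    namespace: Dict = {}
    try:
        exec(code, namespace)  # real systems isolate this step in a sandbox or subprocess
        checks(namespace)
        return ""
    except Exception:
        return traceback.format_exc(limit=1)

def self_debug(task: str, checks: Callable[[Dict], None],
               generate: Callable[[str], str], max_rounds: int = 4) -> str:
    """Regenerate code with the failure appended to the prompt until the checks pass."""
    prompt = f"Write Python code for: {task}"
    code = generate(prompt)
    for _ in range(max_rounds):
        error = run_checks(code, checks)
        if not error:
            break
        code = generate(f"{prompt}\nYour previous attempt failed with:\n{error}\nFix it.")
    return code

if __name__ == "__main__":
    def checks(ns):  # executable specification that doubles as the feedback signal
        assert ns["add"](2, 3) == 5
    # Stub model that 'fixes' its code on the second attempt, for demonstration only.
    attempts = iter(["def add(a, b): return a - b", "def add(a, b): return a + b"])
    print(self_debug("add two numbers", checks, lambda p: next(attempts)))
```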
Vibe Coding Practice in Specialized Domains and Complex Hardware
These papers demonstrate Vibe Coding applications in vertical domains such as hardware circuits (Verilog/RTL), industrial control (PLC), parallel computing (MPI), medical education, autonomous driving, and data science, highlighting the development potential of non-specialists working on complex systems. A minimal sketch of simulation-checked hardware code generation follows the list below.
- A Comparative Study on Self-Driving Scenario Code Generation Through Prompt Engineering Based on LLM-Specific Characteristics(Haneul Yang, Hyoeun Kim, Jonggu Kang, 2025, Applied Sciences)
- LLM-based Iterative Requirements Refinement in FSM with IEC 61499 Code Generation(V. Vyatkin, Sandeep Patil, Dmitrii Drozdov, Anatoly Shalyto, 2025, 2025 IEEE 23rd International Conference on Industrial Informatics (INDIN))
- Synthesis of Mathematical programs from Natural Language Specifications(Ganesh Prasath, S. Karande, 2023, arXiv.org)
- CTPGenius: Category Theoretic Prompt Graphs for Modular RTL Generation with Large Language Models(Anmol Bhasin, 2025, 2025 IEEE 6th International Women in Technology Conference (WINTECHCON))
- Chemical classification program synthesis using generative artificial intelligence(Chris Mungall, Adnan Malik, Daniel R. Korn, Justin T. Reese, Noel M. O'boyle, Janna Hastings, 2025, Journal of Cheminformatics)
- PSD2Code: Automated Front-End Code Generation from Design Files via Multimodal Large Language Models(Yongxi Chen, Lei Chen, 2025, arXiv.org)
- SmartShell: Automated Shell Scripts Synthesis from Natural Language(Hao Li, Yuping Wang, Jie Yin, Gang Tan, 2019, International Journal of Software Engineering and Knowledge Engineering)
- SolSearch: An LLM-Driven Framework for Efficient SAT-Solving Code Generation(Junjie Sheng, Yanqiu Lin, Jiehao Wu, Yanhong Huang, Jianqi Shi, Min Zhang, Xiangfeng Wang, 2025, 2025 IEEE/ACM 47th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER))
- Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning(Manvi Jha, Jiaxin Wan, Deming Chen, 2025, arXiv.org)
- Unveiling the Power of Neural Program Synthesis in High Level Code Generation(Gurram Kumar Sai, Koganti Sri, Sai Harshith, K. A. Varma, G. Kiran, 2024, 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT))
- Program Synthesis Using Natural Language(Aditya Desai, Sumit Gulwani, V. Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, R. Sailesh, Subhajit Roy, 2015, Proceedings of the 38th International Conference on Software Engineering)
- NL2Viz: natural language to visualization via constrained syntax-guided synthesis(Zhengkai Wu, Vu Le, A. Tiwari, Sumit Gulwani, Arjun Radhakrishna, Ivan Radicek, Gustavo Soares, Xinyu Wang, Zhenwen Li, Tao Xie, 2022, Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering)
- GENCNIPPET: Automated Generation of Code Snippets for Supporting Programming Questions(Saikat Mondal, C. Roy, 2025, arXiv.org)
- Vibe Coding in nephrology education: clinician-led, AI-assisted development of open-source interactive learning tools(Francesco Pesce, Wisit Cheungpasitporn, 2025, Renal Failure)
- User-Centered Design with AI in the Loop: A Case Study of Rapid User Interface Prototyping with "Vibe Coding"(Tianyi Li, Tanay Maheshwari, Alex Voelker, 2025, arXiv.org)
- Rapid Development of Omics Data Analysis Applications through Vibe Coding(Jesse G. Meyer, 2025, arXiv.org)
- Academic Vibe Coding: Opportunities for Accelerating Research in an Era of Resource Constraint(M. Crowson, L. Celi, 2025, arXiv.org)
- Vibe Coding in Practice: Building a Driving Simulator Without Expert Programming Skills(Margarida Fortes-Ferreira, Md Shadab Alam, P. Bazilinskyy, 2025, Adjunct Proceedings of the 17th International Conference on Automotive User Interfaces and Interactive Vehicular Applications)
- On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software(Ali Nouri, Johan Andersson, Kailash De Jesus Hornig, Zhennan Fei, Emil Knabe, Håkan Sivencrona, Beatriz Cabrero-Daniel, Christian Berger, 2025, Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering)
- AI-Driven Enterprise Integration: Leveraging MuleSoft, Micro-services, and vibe coding for a Scalable Cloud Ecosystem(2025, International Conference on Recent Trends in Computer Science and Information Technology)
- Design of REST API Client for Conversational Agent using Large Language Model with Open API System(Seong-Gyeol Park, Ahtae Kim, Sookyung Lee, Haeun Lee, Chayapol Kamyod, Cheong-Ghil Kim, 2024, 2024 IEEE/ACIS 22nd International Conference on Software Engineering Research, Management and Applications (SERA))
- RTLCoder: Fully Open-Source and Efficient LLM-Assisted RTL Code Generation Technique(Shang Liu, Wenji Fang, Yao Lu, Jing Wang, Qijun Zhang, Hongce Zhang, Zhiyao Xie, 2025, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
- VeriMind: Agentic LLM for Automated Verilog Generation with a Novel Evaluation Metric(Bardia Nadimi, Ghali Omar Boutaib, Hao Zheng, 2025, arXiv.org)
- ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis(Runkai Li, Jia Xiong, Xiuyuan He, Jieru Zhao, Qiang Xu, Xi Wang, 2025, arXiv.org)
- Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents(Zihan Liu, Ruinan Zeng, Dongxia Wang, G. Peng, Jingyi Wang, Qiang Liu, Peiyu Liu, Wenhai Wang, 2024, arXiv.org)
- ChatMPI: LLM-Driven MPI Code Generation for HPC Workloads(Pedro Valero-Lara, Aaron R. Young, Thomas Naughton III, Christian Engelmann, Al Geist, Jeffrey S. Vetter, Keita Teranishi, William F. Godoy, 2026, Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region)
- OPL4GPT: An Application Space Exploration of Optimal Programming Language for Hardware Design by LLM(Kimia Tasnia, Sazadur Rahman, 2025, Proceedings of the 30th Asia and South Pacific Design Automation Conference)
- Type-directed synthesis of visualizations from natural language queries(Qiaochu Chen, Shankara Pailoor, Celeste Barnaby, Abby Criswell, Chenglong Wang, Greg Durrett, Işıl Dillig, 2022, Proceedings of the ACM on Programming Languages)
- MTP: A Meaning-Typed Language Abstraction for AI-Integrated Programming(Jason Mars, Yiping Kang, Jayanaka L. Dantanarayana, Chandra Irugalbandara, Kugesan Sivasothynathan, Christopher Clarke, Baichuan Li, Lingjia Tang, 2024, Proceedings of the ACM on Programming Languages)
- Athena: Intermediate Representations for Iterative Scaffolded App Generation with an LLM(Jon-Tait Beason, Ruijia Cheng, E. Schoop, Jeffrey Nichols, 2025, International Conference on Intelligent User Interfaces)
- Interactive Program Synthesis for Modeling Collaborative Physical Activities from Narrated Demonstrations(Edward Kim, Daniel He, Jorge Chao, Wiktor Rajca, Mohammed Amin, Nishant Malpani, Ruta Desai, Antti Oulasvirta, Bjoern Hartmann, S. Seshia, 2025, arXiv.org)
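A recurring pattern in the hardware-oriented work above is to close the loop with a domain tool (an RTL simulator, a PLC verifier, a driving simulator) rather than with unit tests. The sketch below illustrates that pattern for Verilog using the open-source Icarus Verilog tools (`iverilog`/`vvp`), which must be installed separately; the prompt wording, the testbench "FAIL" convention, and the `generate` callable are illustrative assumptions.

```python
import subprocess, tempfile
from pathlib import Path
from typing import Callable, Tuple

def simulate_verilog(design: str, testbench: str) -> Tuple[bool, str]:
    """Compile design + testbench with Icarus Verilog and run the simulation."""
    with tempfile.TemporaryDirectory() as d:
        dut, tb, binary = Path(d, "dut.v"), Path(d, "tb.v"), Path(d, "sim.out")
        dut.write_text(design)
        tb.write_text(testbench)
        build = subprocess.run(["iverilog", "-o", str(binary), str(dut), str(tb)],
                               capture_output=True, text=True)
        if build.returncode != 0:
            return False, build.stderr
        run = subprocess.run(["vvp", str(binary)], capture_output=True, text=True)
        # Assumes the testbench prints "FAIL" on any mismatch (an illustrative convention).
        return run.returncode == 0 and "FAIL" not in run.stdout, run.stdout + run.stderr

def hardware_vibe_loop(spec: str, testbench: str,
                       generate: Callable[[str], str], rounds: int = 3) -> str:
    """Ask an LLM for RTL, simulate it, and feed the simulation log back on failure."""
    prompt = f"Write synthesizable Verilog for: {spec}"
    rtl = generate(prompt)
    for _ in range(rounds):
        ok, log = simulate_verilog(rtl, testbench)
        if ok:
            break
        rtl = generate(f"{prompt}\nSimulation reported problems:\n{log}\nRevise the module.")
    return rtl
```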
Evaluation Benchmarks, Security Risks, and Formal Verification
This group works toward rigorous evaluation frameworks that test LLMs across levels of task complexity, and investigates the security risks that Vibe Coding introduces (such as vulnerability generation and adversarial prompt attacks) as well as how formal verification ("vericoding") can be used to guarantee code correctness. A minimal sketch of a correctness-plus-security evaluation harness follows the list below.
- Program Synthesis with Large Language Models(Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, H. Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, Charles Sutton, 2021, arXiv.org)
- Evaluating Large Language Models in Class-Level Code Generation(Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, Yiling Lou, 2024, Proceedings of the IEEE/ACM 46th International Conference on Software Engineering)
- One-to-many testing for code generation from (just) natural language(Mansi Uniyal, Mukul Singh, Gust Verbruggen, Sumit Gulwani, Vu Le, 2024, Findings of the Association for Computational Linguistics: EMNLP 2024)
- CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis(Anjiang Wei, Tarun Suresh, Jiannan Cao, Naveen Kannan, Yuheng Wu, Kaiwen Yan, Thiago S. F. X. Teixeira, Ke Wang, Alex Aiken, 2025, arXiv.org)
- DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models(Yiming Huang, Jianwen Luo, Yang Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu, 2024, Conference on Empirical Methods in Natural Language Processing)
- Benchmarking Correctness and Security in Multi-Turn Code Generation(Ruchit Rawal, Jeffrey Yang Fan Chiang, Chihao Shen, Jeffery Siyuan Tian, Aastha Mahajan, Tom Goldstein, Yizheng Chen, 2025, arXiv.org)
- Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models(Marcelo Bruni, F. Gabrielli, Mohammad Ghafari, Martin Kropp, 2025, 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge))
- DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions(Fangzhou Wu, Xiaogeng Liu, Chaowei Xiao, 2023, arXiv.org)
- Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks(Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li, Lei Li, 2025, arXiv.org)
- Vibe Coding in Practice: Flow, Technical Debt, and Guidelines for Sustainable Use(Muhammad Waseem, Aakash Ahmad, Kai-Kristian Kemell, Jussi Rasku, Sami Lahti, Kalle Mäkelä, Pekka Abrahamsson, 2025, arXiv.org)
- Vibe Coding: Is Human Nature the Ghost in the Machine?(Cory Knobel, Nicole Radziwill, 2025, arXiv.org)
- mHumanEval - A Multilingual Benchmark to Evaluate Large Language Models for Code Generation(Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri, 2024, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers))
- NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations(Junkai Chen, Zhenhao Li, Xing Hu, Xin Xia, 2024, ACM Transactions on Software Engineering and Methodology)
- A benchmark for vericoding: formally verified program synthesis(Sergiu Bursuc, Theodore Ehrenborg, Shaowei Lin, L. Astefanoaei, Ionel Emilian Chiosa, Jure Kukovec, Alok Singh, Oliver Butterley, Adem Bizid, Quinn Dougherty, Miranda Zhao, Max Tan, Max Tegmark, 2025, arXiv.org)
- Natural Language to Code Generation in Interactive Data Science Notebooks(Pengcheng Yin, Wen-Ding Li, Kefan Xiao, A. Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, H. Michalewski, Oleksandr Polozov, Charles Sutton, 2022, Annual Meeting of the Association for Computational Linguistics)
- Research on Deep Learning Based Code Generation from Natural Language Description(Jiaqi Zhu, Mingzhu Shen, 2020, 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA))
- Large Language Models for Code Generation: The Practitioners Perspective(Zeeshan Rasheed, Muhammad Waseem, Kai-Kristian Kemell, Aakash Ahmad, Malik Abdul Sami, Jussi Rasku, Kari Systä, Pekka Abrahamsson, 2025, arXiv.org)
- FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding(Hao Chen, Chengze Li, Jia Li, 2025, arXiv.org)
- CADInstruct: A multimodal dataset for natural language-guided CAD program synthesis(Chaofan Lv, Jinsong Bao, 2025, Computer-Aided Design)
- Design of application to analyse differences in the nucleotide sequence of the P72 protein-encoding gene of the African swine fever virus using the Vibe-coding tool programming in natural language -Websim(2025, Ministry of Science and Technology, Vietnam)
- Multi-language Software Development in the LLM Era: Insights from Practitioners’ Conversations with ChatGPT(Lucas Aguiar, Matheus Paixão, R. Carmo, Edson Soares, Ant´onio Leal, Matheus Freitas, Eliakim Gama, 2024, Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement)
- Understanding and Mitigating Errors of LLM-Generated RTL Code(Jiazheng Zhang, Cheng Liu, Huawei Li, 2025, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
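Several of the benchmarks above separate functional correctness from security, since a solution can pass its tests yet still contain vulnerable patterns. The dependency-free harness below sketches how both dimensions might be scored for a batch of generated Python solutions; the tiny `DANGEROUS_CALLS` screen is an illustrative stand-in for a real static analyzer, not a usable vulnerability scanner.

```python
import ast
from typing import Callable, Dict, List

DANGEROUS_CALLS = {"eval", "exec", "system"}  # toy screen, not a real vulnerability scanner

def insecure_calls(code: str) -> List[str]:
    """Flag a few obviously risky call names by walking the AST."""
    hits = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            name = getattr(node.func, "id", getattr(node.func, "attr", ""))
            if name in DANGEROUS_CALLS:
                hits.append(name)
    return hits

def evaluate(solutions: List[str], checks: Callable[[Dict], None]) -> Dict[str, float]:
    """Score the fraction of solutions that are functionally correct, and correct and 'secure'."""
    functional = secure = 0
    for code in solutions:
        ns: Dict = {}
        try:
            exec(code, ns)  # benchmark harnesses normally sandbox this step
            checks(ns)
        except Exception:
            continue
        functional += 1
        if not insecure_calls(code):
            secure += 1
    n = len(solutions) or 1
    return {"functional_rate": functional / n, "secure_rate": secure / n}

if __name__ == "__main__":
    def checks(ns):
        assert ns["square"](4) == 16
    candidates = ["def square(x): return x * x",
                  "def square(x): return eval(str(x) + '*' + str(x))"]  # passes tests, gets flagged
    print(evaluate(candidates, checks))
```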
Transformation of Programming Education, Computational Thinking, and Socio-Psychological Research
These papers examine the impact of Vibe Coding on computing education, analyze how prompt-based learning (PbL) can cultivate students' computational thinking, and study the psychological barriers non-experts face during interaction as well as accessibility design requirements.
- Accessibility Heuristics for Vibe Coding Interfaces(Shalini Madan, Sreelakshmi Surabiyil Bindu, Venkatesh Potluri, 2025, Proceedings of the 27th International ACM SIGACCESS Conference on Computers and Accessibility)
- Vibe Coding in Education(G. Kusper, Csaba Szabó, 2025, 2025 International Conference on Emerging eLearning Technologies and Applications (ICETA))
- Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts(J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, Qian Yang, 2023, Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems)
- From Programming to Prompting: Developing Computational Thinking through Large Language Model-Based Generative Artificial Intelligence(Hsiao-Ping Hsu, 2025, TechTrends)
- Educational Effects of Generative AI and No-code Vibe Coding Classes : An Analysis of Non-CS Students’ Project Outputs(Hyun-Ju Lee, Dae-Jin Kim, 2025, The Journal of Next-generation Convergence Technology Association)
- Evaluating GPT for use in K-12 Block Based CS Instruction Using a Transpiler and Prompt Engineering(David Gonzalez-Maldonado, Jonathan Liu, Diana Franklin, 2025, Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1)
- Exploring undergraduates’ computational thinking and human-computer interaction patterns in generative progressive prompt-assisted programming learning(Xin Gong, Weiqi Xu, Ailing Qiao, 2025, International Journal of Educational Technology in Higher Education)
- Investigating students’ programming behaviors, interaction qualities and perceptions through prompt-based learning in ChatGPT(Dan Sun, Azzeddine Boudouaia, Junfeng Yang, Jie Xu, 2024, Humanities and Social Sciences Communications)
- A vibe coding learning design to enhance EFL students' talking to, through, and about AI(D. Woo, Kai Guo, Yangyang Yu, 2025, arXiv.org)
- Prompt Programming: A Platform for Dialogue-based Computational Problem Solving with Generative AI Models(Victor-Alexandru Pădurean, Paul Denny, Alkis Gotovos, A. Singla, 2025, Proceedings of the 30th ACM Conference on Innovation and Technology in Computer Science Education V. 1)
- A Generative Artificial Intelligence (AI)-Based Human-Computer Collaborative Programming Learning Method to Improve Computational Thinking, Learning Attitudes, and Learning Achievement(Gang Zhao, Lijun Yang, Biling Hu, Jing Wang, 2025, Journal of Educational Computing Research)
- "Can you feel the vibes?": An exploration of novice programmer engagement with vibe coding(Kiev Gama, Filipe Calegario, Victoria Jackson, Alexander Nolte, Luiz A. Morais, Vinicius Garcia, 2025, arXiv.org)
- Enhancing Computer Programming Education with LLMs: A Study on Effective Prompt Engineering for Python Code Generation(Tianyu Wang, Nianjun Zhou, Zhixiong Chen, 2024, arXiv.org)
Taken together, the groups cover the full lifecycle of Vibe Coding, from theoretical definition and technical implementation to social impact. The literature confirms a paradigm shift in software development from "syntax-oriented" to "intent-oriented" work; technically, agent architectures, iterative self-debugging, and fine-grained prompt engineering substantially raise productivity; in application, the practice has already penetrated vertical domains such as hardware design and high-performance computing; at the same time, the research warns of security vulnerabilities and technical-debt risks, and proposes new paths for programming education and the cultivation of computational thinking in the AI era.
188 related publications in total.

Selected abstracts:
We examine"vibe coding": an emerging programming paradigm where developers primarily write code by interacting with code-generating large language models rather than writing code directly. We present the first empirical study of vibe coding. We analysed over 8 hours of curated video capturing extended vibe coding sessions with rich think-aloud reflections. Using framework analysis, we investigated programmers'goals, workflows, prompting techniques, debugging approaches, and challenges encountered. We find that vibe coding follows iterative goal satisfaction cycles where developers alternate between prompting AI, evaluating generated code through rapid scanning and application testing, and manual editing. Prompts in vibe coding blend vague, high-level directives with detailed technical specifications. Debugging remains a hybrid process combining AI assistance with manual practices. Critically, vibe coding does not eliminate the need for programming expertise but rather redistributes it toward context management, rapid code evaluation, and decisions about when to transition between AI-driven and manual manipulation of code. Trust in AI tools during vibe coding is dynamic and contextual, developed through iterative verification rather than blanket acceptance. Vibe coding is an evolution of AI-assisted programming that represents an early manifestation of"material disengagement", wherein practitioners orchestrate code production and manipulation, mediated through AI, while maintaining selective and strategic oversight.
Vibe coding, a term coined by Andrej Karpathy in February 2025, has quickly become a compelling and controversial natural language programming paradigm in AI-assisted software development. Centered on iterative co-design with an AI assistant, vibe coding emphasizes flow and experimentation over strict upfront specification. While initial studies have begun to explore this paradigm, most focus on analyzing code artifacts or proposing theories with limited empirical backing. There remains a need for a grounded understanding of vibe coding as it is perceived and experienced by developers. We present the first systematic qualitative investigation of vibe coding perceptions and practice. Drawing on over 190,000 words from semi-structured interviews, Reddit threads, and LinkedIn posts, we characterize what vibe coding is, why and how developers use it, where it breaks down, and which emerging practices aim to support it. We propose a qualitatively grounded theory of vibe coding centered on conversational interaction with AI, co-creation, and developer flow and joy. We find that AI trust regulates movement along a continuum from delegation to co-creation and supports the developer experience by sustaining flow. We surface recurring pain points and risks in areas including specification, reliability, debugging, latency, code review burden, and collaboration. We also present best practices that have been discovered and shared to mitigate these challenges. We conclude with implications for the future of AI dev tools and directions for researchers investigating vibe coding.
The integration of generative artificial intelligence (AI) into software development processes represents a paradigm shift in how individuals interact with technology creation tools. This article examines the emergence of intuitive programming approaches colloquially termed "vibe coding" alongside traditional no-code and low-code platforms, analyzing their combined potential to democratize software engineering practices. Through systematic analysis of current research, it identifies key technological frameworks, implementation challenges, and potential socioeconomic implications of AI-assisted development environments. The article's findings suggest that generative AI fundamentally transforms the accessibility paradigm by bridging natural language expression with functional software creation, potentially reducing traditional barriers to entry while introducing new considerations regarding technical depth, sustainability, and equity in software production ecosystems.
Software development is undergoing a fundamental transformation as vibe coding becomes widespread, with large portions of contemporary codebases now being generated by Artificial Intelligence (AI). The disconnect between rapid adoption and limited conceptual understanding highlights the need for an inquiry into this emerging paradigm. Drawing on an intent perspective and historical analysis, we define vibe coding as a software development paradigm where humans and Generative AI (GenAI) engage in collaborative flow to co-create software artifacts through natural language dialogue, shifting the mediation of developer intent from deterministic instruction to probabilistic inference. By intent mediation, we refer to the fundamental process through which developers translate their conceptual goals into representations that computational systems can execute. Our results show that vibe coding redistributes epistemic labor between humans and machines, shifting expertise from technical implementation toward collaborative orchestration. We identify key opportunities, including democratization, acceleration, and systemic leverage, alongside risks such as black-box codebases, responsibility gaps, and ecosystem bias. We conclude with a research agenda spanning human-, technology-, and organization-centered directions to guide future investigations of this paradigm.
Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although vibe coding is increasingly adopted, are its outputs really safe to deploy in production? To answer this question, we propose SUSVIBES, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe-coding, particularly in security-sensitive applications.
Background and Context. Chat-based and inline-coding-based GenAI has already had substantial impact on the CS Education community. The recent introduction of “vibe coding” may further transform how students program, as it introduces a new way for students to create software projects with minimal oversight. Objectives. The purpose of this study is to understand how students in introductory programming and advanced software engineering classes interact with a vibe coding platform (Replit) to build software and how the interactions differ by programming background. Methods. Participants were asked to think aloud while building a web application using Replit. We qualitatively coded screen recordings and associated artifacts (prompts, code changes, and error logs) to construct an interaction labeling scheme capturing how students prompted, tested, debugged, and engaged with code. Findings. For both groups, the majority of student interactions with Replit were to test common cases in the prototype or use prompts to debug. Only rarely did students analyze or manually edit code. Prompts by advanced software engineering students were much more likely to include relevant app features and codebase contexts than those by introductory programming students.
No abstract available
We present a case study of using generative user interfaces, or "vibe coding," a method leveraging large language models (LLMs) for generating code via natural language prompts, to support rapid prototyping in user-centered design (UCD). Extending traditional UCD practices, we propose an AI-in-the-loop ideate-prototyping process. We share insights from an empirical experience integrating this process to develop an interactive data analytics interface for highway traffic engineers to effectively retrieve and analyze historical traffic data. With generative UIs, the team was able to elicit rich user feedback and test multiple alternative design ideas from user evaluation interviews and real-time collaborative sessions with domain experts. We discuss the advantages and pitfalls of vibe coding for bridging the gaps between design expertise and domain-specific expertise.
“Vibe coding” — the practice of developing software through iteratively conversing with a large language model (LLM) — has exploded in popularity within the last year. However, developers report key limitations including the accumulation of technical debt, security issues, and code churn to achieve satisfactory results. We argue that these pitfalls result from LLMs’ inability to reconcile accumulating human-imposed constraints during vibe coding, with developers inadvertently failing to resolve contradictions because LLMs prioritize user commands over code consistency. Given LLMs’ receptiveness to verification-based feedback, we argue that formal methods can mitigate these pitfalls, making vibe coding more reliable. However, we posit that integrating formal methods must transcend existing approaches that combine formal methods and LLMs. We advocate for a side-car system throughout the vibe coding process which: (1) Autoformalizes specifications (2) Validates against targets, (3) Delivers actionable feedback to the LLM, and (4) Allows intuitive developer influence on specifications.
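As a loose approximation of the side-car loop advocated above, the sketch below uses executable property checks in place of genuinely formalized specifications; a real side-car would rely on formal methods (e.g., SMT-backed checks or proof obligations) rather than runtime execution. All names and the control flow are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

Property = Tuple[str, Callable[[Dict], bool]]  # (human-readable requirement, executable check)

def validate(code: str, properties: List[Property]) -> List[str]:
    """Check generated code against the accumulated spec; return actionable failure messages."""
    ns: Dict = {}
    try:
        exec(code, ns)  # a production side-car would verify formally rather than execute
    except Exception as e:
        return [f"code does not load: {e}"]
    return [f"violated: {text}" for text, check in properties if not check(ns)]

def sidecar_vibe_loop(request: str, properties: List[Property],
                      generate: Callable[[str], str], rounds: int = 3) -> str:
    """Generate, validate against the spec, and feed violations back to the model."""
    code = generate(request)
    for _ in range(rounds):
        failures = validate(code, properties)
        if not failures:
            break
        code = generate(f"{request}\nThe specification side-car reports:\n" + "\n".join(failures))
    return code

if __name__ == "__main__":
    def keeps_input_intact(ns):
        xs = [2, 1]
        ns["sort"](xs)
        return xs == [2, 1]
    # Constraints accumulated (and editable by the developer) across conversation turns.
    spec: List[Property] = [
        ("sort() returns ascending order", lambda ns: ns["sort"]([2, 1]) == [1, 2]),
        ("sort() does not mutate its input", keeps_input_intact),
    ]
    model = lambda p: "def sort(xs):\n    return sorted(xs)"  # stub model for demonstration
    print(sidecar_vibe_loop("Write sort(xs).", spec, model))
```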
No abstract available
We explore vibe coding as a rapid prototyping approach powered by generative AI. We discuss how it lowers the barrier to creating high-fidelity prototypes, enabling nontechnical users to build apps, and examine its implications for communication, validation, and iterative software design.
The emergence of large language models has introduced new opportunities in software development, particularly through a revolutionary paradigm known as vibe coding or “coding by vibes”, in which developers express their software ideas in natural language and the LLM generates the code. This paper investigates the potential of vibe coding to support novice programmers. The first author, without coding experience, attempted to create a 3D driving simulator using the Cursor platform and Three.js. The iterative prompting process improved the simulation’s functionality and visual quality. The results indicated that LLMs can reduce barriers to creative development and expand access to computational tools. However, challenges remain: prompts often required refinement, the generated code could be logically flawed, and debugging demanded a foundational understanding of programming concepts. These findings highlight that while vibe coding increases accessibility, it does not completely eliminate the need for technical reasoning and an understanding of prompt engineering.
This exploratory study examined the consistency of human-AI collaboration by analyzing three extensive "vibe coding" sessions between a human product lead and an AI software engineer. We investigated similarities and differences in team dynamics, communication patterns, and development outcomes across both projects. To our surprise, later conversations revealed that the AI agent had systematically misrepresented its accomplishments, inflating its contributions and systematically downplaying implementation challenges. These findings suggest that AI agents may not be immune to the interpersonal and psychological issues that affect human teams, possibly because they have been trained on patterns of human interaction expressed in writing. The results challenge the assumption that human-AI collaboration is inherently more productive or efficient than human-human collaboration, and create a framework for understanding AI deception patterns. In doing so, the study makes a compelling case for extensive research in quality planning, quality assurance, and quality control applied to vibe coding.
This innovative practice article reports on the piloting of vibe coding (using natural language to create software applications with AI) for English as a Foreign Language (EFL) education. We developed a human-AI meta-languaging framework with three dimensions: talking to AI (prompt engineering), talking through AI (negotiating authorship), and talking about AI (mental models of AI). Using backward design principles, we created a four-hour workshop where two students designed applications addressing authentic EFL writing challenges. We adopted a case study methodology, collecting data from worksheets and video recordings, think-aloud protocols, screen recordings, and AI-generated images. Contrasting cases showed one student successfully vibe coding a functional application cohering to her intended design, while another encountered technical difficulties with major gaps between intended design and actual functionality. Analysis reveals differences in students' prompt engineering approaches, suggesting different AI mental models and tensions in attributing authorship. We argue that AI functions as a beneficial languaging machine, and that differences in how students talk to, through, and about AI explain vibe coding outcome variations. Findings indicate that effective vibe coding instruction requires explicit meta-languaging scaffolding, teaching structured prompt engineering, facilitating critical authorship discussions, and developing vocabulary for articulating AI mental models.
Vibe Coding (VC) is a form of software development assisted by generative AI, in which developers describe the intended functionality or logic via natural language prompts, and the AI system generates the corresponding source code. VC can be leveraged for rapid prototyping or developing Minimum Viable Products (MVPs); however, it may introduce several risks throughout the software development life cycle. Based on our experience from several internally developed MVPs and a review of recent industry reports, this article analyzes the flow-debt trade-offs associated with VC. The flow-debt trade-off arises when seamless code generation leads to the accumulation of technical debt through architectural inconsistencies, security vulnerabilities, and increased maintenance overhead. These issues originate from process-level weaknesses, biases in model training data, a lack of explicit design rationale, and a tendency to prioritize quick code generation over human-driven iterative development. Based on our experiences, we identify and explain how current model, platform, and hardware limitations contribute to these issues, and propose countermeasures to address them, informing research and practice towards more sustainable VC approaches.
No abstract available
Building custom data analysis platforms traditionally requires extensive software engineering expertise, limiting accessibility for many researchers. Here, I demonstrate that modern large language models (LLMs) and autonomous coding agents can dramatically lower this barrier through a process called 'vibe coding', an iterative, conversational style of software creation where users describe goals in natural language and AI agents generate, test, and refine executable code in real-time. As a proof of concept, I used vibe coding to create a fully functional proteomics data analysis website capable of performing standard tasks, including data normalization, differential expression testing, and volcano plot visualization. The entire application, including user interface, backend logic, and data upload pipeline, was developed in less than ten minutes using only four natural-language prompts, without any manual coding, at a cost of under $2. Previous works in this area typically require tens of thousands of dollars in research effort from highly trained programmers. I detail the step-by-step generation process and evaluate the resulting code's functionality. This demonstration highlights how vibe coding enables domain experts to rapidly prototype sophisticated analytical tools, transforming the pace and accessibility of computational biology software development.
Emerging alongside generative AI and the broader trend of AI-assisted coding, the term "vibe coding" refers to creating software via natural language prompts rather than direct code authorship. This approach promises to democratize software development, but its educational implications remain underexplored. This paper reports on a one-day educational hackathon investigating how novice programmers and mixed-experience teams engage with vibe coding. We organized an inclusive event at a Brazilian public university with 31 undergraduate participants from computing and non-computing disciplines, divided into nine teams. Through observations, an exit survey, and semi-structured interviews, we examined creative processes, tool usage patterns, collaboration dynamics, and learning outcomes. Findings reveal that vibe coding enabled rapid prototyping and cross-disciplinary collaboration, with participants developing prompt engineering skills and delivering functional demonstrations within time constraints. However, we observed premature convergence in ideation, uneven code quality requiring rework, and limited engagement with core software engineering practices. Teams adopted sophisticated workflows combining multiple AI tools in pipeline configurations, with human judgment remaining essential for critical refinement. The short format (9 hours) proved effective for confidence-building among newcomers while accommodating participants with limited availability. We conclude that vibe coding hackathons can serve as valuable low-stakes learning environments when coupled with explicit scaffolds for divergent thinking, critical evaluation of AI outputs, and realistic expectations about production quality.
No abstract available
Generative AI tools increasingly shape established software engineering practices such as code review, but the socio-technical implications of using AI for these practices remain understudied. In this paper we first introduce vibe coding (Andrej Karpathy [@karpathy], 2025) as a method for allowing researchers with limited coding experience to rapidly create custom-made probes for conducting research. Guided by Alami and Ernst’s (2025) findings on AI-generated feedback for code review, we introduce a vibe-coded AI/voice-based code review prototype as a provotype (Boer and Donovan, 2012). We then outline an explorative study to critically assess the socio-technical effects of using AI-based voice interfaces in code reviews. We propose a qualitative approach, based on the Disruptive Research Playbook (Storey et al., 2024), involving Danish software developers to investigate voice-based feedback's impact on topics including trust, collaboration, and perceived skill shifts. Initial methodological reflections emphasize the need for cautious exploration, using the provotype as an intervention for gathering data in the form of reactions, expectations, and concerns about the effects of AI interactions in the established professional practice of code review. Next steps are to finalize the provotype, complete the research design, and collect and analyze qualitative data from interventions with Danish software developer teams.
Vibe coding is an emergent, intuitive, and expressive approach to software development and engineering. Though not yet formally defined, the concept blends AI-assisted tooling, human-AI co-creation, and developer-centered aesthetics and intuition. The concept is the focus of blogs, essays, and practitioner-generated content. Few scholarly publications directly address the concept, although many engage with its adjacent practices and conceptual foundations. This study synthesizes and maps the current state of literature, discussing vibe coding and related topics to identify themes, publication trends, and future research opportunities. Drawing on eight seed publications, which yielded a total of 54 screened publications, including 30 grey literature sources and 24 peer-reviewed articles, this multivocal systematic mapping study employs structured rationale codes, rigorous screening, and established guidance for incorporating practitioner literature. Our analysis identified four primary themes: tooling impact, developer experience, human-AI collaboration, and cultural and organizational support and change. We present a catalog of publications that advance the understanding of vibe coding, a mapping of emergent themes, and a foundation for future research and design considerations related to human-AI co-creative development.
Recent advancements in generative artificial intelligence (GenAI), particularly large language models, have introduced new possibilities for software development practices. In our paper we investigate the emerging Vibe Coding (VC) paradigm that emphasizes intuitive, affect-driven, and improvisational interactions between developers and AI systems. Building upon the discourse of End-User Development (EUD), we explore how VC diverges from conventional programming approaches such as those supported by tools like GitHub Copilot. Through five semi-structured interview sessions with ten experienced software practitioners, we identify five thematic dimensions: creativity, sustainability, the future of programming, collaboration, and criticism. Our analysis conceptualizes VC within the metaphor of co-drifting, contrasting it with the prevalent co-piloting perspective of AI-assisted development. We argue that VC reconfigures the developer's role, blurring boundaries between professional developers and non-developers. While VC enables novel forms of expression and rapid prototyping, it also introduces challenges regarding reproducibility, scalability, and inclusivity. We propose that VC represents a meaningful shift in programming culture, warranting further investigation within human-computer interaction (HCI) and software engineering research.
Medical education increasingly incorporates digital technologies; however, many tools remain passive and text-based. Vibe Coding is a clinician-led design framework that embeds expert reasoning and the cognitive ‘feel’ of clinical decision-making into interactive educational tools. This study demonstrates its application in nephrology training through the rapid development of open-source, AI-assisted, web-based applications. We conducted a proof-of-concept development study using a structured, physician-led, AI-assisted process combining (1) deconstruction of clinical algorithms, (2) natural-language-to-code generation with modern large language models, and (3) iterative refinement of user interfaces. The target audience included nephrology trainees and educators, with source content derived from peer-reviewed educational literature. Four open-source, web-based applications were developed: (1) Kidney Stone Navigator for 24-hour urine analysis interpretation, (2) NephroFlow CKRT Clinical Copilot for dose and anticoagulation management, (3) Renal Tubular Acidosis Diagnostic Assistant for algorithmic diagnosis, and (4) Interactive Guide to Disorders of Volume for dynamic visualization of pathophysiology. Each tool mirrored expert reasoning, integrated automated calculations, and was publicly released on GitHub with live deployment for global educational use. Clinician-led, AI-assisted development enables the translation of static educational materials into interactive, open-access tools. The Vibe Coding framework demonstrates a scalable, reproducible model for innovation in medical education and supports transparent digital scholarship in nephrology.
The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (produced by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and roughly 40k matches demonstrate (i) a clear superiority of the human-coded agents (written by graduate students): the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as input and prompted to improve upon it, the best-performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
Large language models (LLMs) are reshaping software engineering by enabling "vibe coding," in which developers build software primarily through prompts rather than writing code. Although widely publicized as a productivity breakthrough, little is known about how practitioners actually define and engage in these practices. To shed light on this emerging phenomenon, we conducted a grounded theory study of 20 vibe-coding videos, including 7 live-streamed coding sessions (about 16 hours, 254 prompts) and 13 opinion videos (about 5 hours), supported by additional analysis of activity durations and prompt intents. Our findings reveal a spectrum of behaviors: some vibe coders rely almost entirely on AI without inspecting code, while others examine and adapt generated outputs. Across approaches, all must contend with the stochastic nature of generation, with debugging and refinement often described as "rolling the dice." Further, divergent mental models, shaped by vibe coders' expertise and reliance on AI, influence prompting strategies, evaluation practices, and levels of trust. These findings open new directions for research on the future of software engineering and point to practical opportunities for tool design and education.
Academic laboratories face mounting resource constraints: budgets are tightening, grant overheads are potentially being capped, and the market rate for data-science talent significantly outstrips university compensation. Vibe coding, which is structured, prompt-driven code generation with large language models (LLMs) embedded in reproducible workflows, offers one pragmatic response. It aims to compress the idea-to-analysis timeline, reduce staffing pressure on specialized data roles, and maintain rigorous, version-controlled outputs. This article defines the vibe coding concept, situates it against the current academic resourcing crisis, details a beginner-friendly toolchain for its implementation, and analyzes inherent limitations that necessitate governance and mindful application.
Large language models are reshaping programming by enabling 'vibe coding': the development of software through natural-language interaction with model-driven toolchains. This article argues that vibe coding is best understood as interface flattening, a reconfiguration in which previously distinct modalities (GUI, CLI, and API) appear to converge into a single conversational surface, even as the underlying chain of translation from intention to machinic effect lengthens and thickens. Drawing on Friedrich Kittler's materialist media theory and Alexander Galloway's account of interfaces as sites of protocol control, the paper situates programming as a historically localised interface arrangement rather than an essential relation to computation. Through a materialist reconstruction of the contemporary vibe-coding stack, it shows how remote compute infrastructures, latency and connectivity, structured outputs, function/tool calling, and interoperability standards such as the Model Context Protocol relocate control and meaning-making power to model and protocol providers. The apparent democratisation of technical capability therefore depends on new dependencies and new literacies. By foregrounding the tension between experiential flattening and infrastructural thickening, I demonstrate how LLM-mediated development redistributes symbolic labour/power, obscures responsibility, and privatises competencies previously dispersed across programming communities, contributing a critical lens on the political economy of AI-mediated human-computer interaction.
AI coding tools are transforming programming, shifting it from a highly editorial process into a conversational activity. Popular Vibe coding tools such as Replit and Cursor integrate natural language interfaces with traditional development environments. While these tools promise simplicity, increased productivity, and automation, they also introduce new accessibility challenges for blind or visually impaired (BVI) developers. A systematic identification of these accessibility challenges necessitates comprehensive guidelines that account for the complex interactions in these tools. To address this need, we develop accessibility heuristics to assess the accessibility of AI conversational programming tools. Our heuristics combine web accessibility guidelines, best practices to design conversational interfaces, and accessibility needs specific to BVI developers. Our evaluation of three widely used conversational programming tools shows that most accessibility challenges arise from complex keyboard interactions, poor focus management, and insufficient feedback and access to the various actions and output of the tools.
Vibe coding is an emerging approach to programming education in which students collaborate with AI assistants in a creative and immersive manner. In this paper we present the development of educational material designed to practice vibe coding and position it in relation to project-based learning, highlighting the novel perspective that AI can act as a teammate in the learning process. The present version of this paper reports a preliminary analysis based on data from the earlier study "We Are Not Afraid of the Wolf!". From this dataset, two hypotheses were examined: (H1) vibe coding increases students’ confidence in their programming solutions, and (H2) vibe coding enhances productivity, enabling more tasks to be completed in a given time. Preliminary results suggest that while confidence is only weakly improved, it is strongly associated with enjoyment and flow experiences, and that students perceive and demonstrate increased productivity when using AI support. The next version of this study will include the results of a dedicated, targeted experiment using a questionnaire specifically designed for vibe coding. This instrument will allow us to capture multiple dimensions of confidence, align subjective and objective productivity, and explore students’ perception of AI as a collaborator.
No abstract available
No abstract available
No abstract available
We present SLEAN (Simple Lightweight Ensemble Analysis Network), a deterministic framework for coordinating multiple LLM providers through text-based prompt orchestration. Unlike complex multi-agent systems requiring specialized infrastructure, SLEAN operates as a simple prompt bridge between LLMs using .txt templates, requiring no deep technical knowledge for deployment. The three-phase protocol formed by independent analysis, cross-critique, and arbitration, filters harmful AI-generated code suggestions before production deployment, addressing how AI-assisted debugging increasingly produces modifications that introduce unnecessary complexity, break existing functionality, or address problems. Evaluating 15 software bugs, we analyzed 69 AI-generated fix propositions. SLEAN's filtering accepted 22 fixes (31.9%, 95% CI 20.9-42.9%) while rejecting 47 that would have been harmful if applied verbatim. The arbitration process reduced code change surface by 83-90% relative to raw AI outputs, enforcing minimal causal edits over scope-expanding modifications. Minimal Type 2 inputs proved more efficient than detailed Type 1 inputs, requiring 2.85 versus 3.56 propositions per accepted fix (35.1% versus 28.1% acceptance, about a 20% efficiency gain). Agreement between AI systems showed weak correlation with fix quality: high convergence (at least 80%) occurred in 4 of 15 cases and improved acceptance by only 2.4% points; arbitration appeared only at exactly 10% convergence in 2 of 15 cases, although low convergence alone did not necessitate arbitration. The file-driven, provider-agnostic architecture enables deployment without specialized coding expertise, making it applicable to security auditing, code review, document verification, and other domains requiring reliable multi-provider synthesis with end-to-end auditability.
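The three-phase protocol described in the SLEAN abstract above (independent analysis, cross-critique, arbitration) amounts to a thin orchestration layer over plain-text prompt templates. The following is a minimal sketch of that idea in Python; the template strings, provider callables, and arbitration rule are hypothetical stand-ins, not SLEAN's actual .txt templates or decision logic.

```python
# Minimal sketch of a SLEAN-style three-phase prompt bridge. The template strings,
# provider callables, and arbitration rule below are hypothetical stand-ins; the
# actual framework drives this from .txt template files and its own phase prompts.

from typing import Callable, Dict, List

LLMCall = Callable[[str], str]  # prompt text in, completion text out

ANALYSIS_TMPL = "You are reviewing a proposed bug fix.\nBug: {bug}\nProposed fix: {fix}\nAssess it."
CRITIQUE_TMPL = "Another reviewer wrote:\n{peer}\nCritique that assessment of the fix:\n{fix}"
ARBITRATION_TMPL = "Given these critiques:\n{critiques}\nAnswer ACCEPT or REJECT for the fix:\n{fix}"


def slean_filter(bug: str, fix: str, providers: Dict[str, LLMCall]) -> bool:
    """Phase 1: independent analysis. Phase 2: cross-critique. Phase 3: arbitration."""
    analyses = {name: call(ANALYSIS_TMPL.format(bug=bug, fix=fix))
                for name, call in providers.items()}

    critiques: List[str] = []
    for name, call in providers.items():
        for peer, peer_analysis in analyses.items():
            if peer != name:  # each provider critiques every *other* provider's analysis
                critiques.append(call(CRITIQUE_TMPL.format(peer=peer_analysis, fix=fix)))

    arbiter = next(iter(providers.values()))  # here: simply the first provider
    verdict = arbiter(ARBITRATION_TMPL.format(critiques="\n---\n".join(critiques), fix=fix))
    return "ACCEPT" in verdict.upper()


if __name__ == "__main__":
    def toy_provider(prompt: str) -> str:
        # Pretend the model dislikes sweeping rewrites and accepts small, focused edits.
        return "REJECT: change surface too large" if "rewrite module" in prompt else "ACCEPT"

    providers = {"model_a": toy_provider, "model_b": toy_provider}
    print(slean_filter("crash on empty input", "add a missing None check", providers))            # True
    print(slean_filter("crash on empty input", "rewrite module parser from scratch", providers))  # False
```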
ChatGPT has proven to facilitate computer programming tasks through the strategic use of prompts, which effectively steer the interaction with the language model towards eliciting relevant information. However, the impact of specifically designed prompts on programming learning outcomes has not been rigorously examined through empirical research. This study adopted a quasi-experimental framework to investigate the differential effects of prompt-based learning (PbL) versus unprompted learning (UL) conditions on the programming behaviors, interaction qualities, and perceptions of college students. The study sample consisted of 30 college students who were randomly assigned to two groups. A mixed-methods approach was employed to gather multi-faceted data. Results revealed notable distinctions between the two learning conditions. First, the PbL group students frequently engaged in coding with Python and employed debugging strategies to verify their work, whereas their UL counterparts typically transferred Python code from PyCharm into ChatGPT and posed new questions within ChatGPT. Second, PbL participants were inclined to formulate more complex queries independently, prompted by the guiding questions, and consequently received more precise feedback from ChatGPT compared to the UL group. UL students tended to participate in more superficial-level interactions with ChatGPT, yet they also obtained accurate feedback. Third, there were noticeable differences in perception observed before and after the ChatGPT implementation: the UL group reported a more favorable perception of perceived ease of use in the pre-test, while the PbL group experienced an improvement in their mean scores for perceived usefulness, ease of use, behavioral intention to utilize, and a significant difference regarding the attitude towards utilizing ChatGPT. Specifically, the use of structured output and delimiters enhanced learners’ understanding of problem-solving steps and made learning more efficient with ChatGPT. Drawing on these outcomes, the study offers recommendations for the incorporation of ChatGPT into future instructional designs, highlighting the benefits of structured prompting in enhancing the programming learning experience.
In this research paper, we discuss our attempt to teach high school students introductory programming with Python using a custom learning platform that leverages ChatGPT to generate personalized learning materials based on each student’s educational background. The platform features topics and subtopics, each supported by prompts for Explanation, Example, Exercise, and Exercise Solution, with a context-setting prompt tailored to individual students’ backgrounds while respecting their privacy. The case study brought up compelling insights. Students exhibited heightened engagement, and the lecturers transitioned from being traditional instructors teaching content to becoming mentors who guide students on what to do next, clarifying misunderstandings and addressing potential questions. Furthermore, students gained hands-on programming experience during the learning process, eliminating the traditional post-class experimentation phase. This innovative approach not only enhances traditional CS1 education but also suggests a broader application of Large Language Models (LLMs) for personalized learning across diverse fields, providing tailored instruction and fostering engagement.
Computing students increasingly rely on generative AI tools for programming assistance, often without formal instruction or guidance. This highlights a need to teach students how to effectively interact with AI models, particularly through natural language prompts, to generate and critically evaluate code for solving computational tasks. To address this, we developed a novel platform for prompt programming that enables authentic dialogue-based interactions, supports problems involving multiple interdependent functions, and offers on-request execution of generated code. Data analysis from over 900 students in an introductory programming course revealed high engagement, with the majority of prompts occurring within multi-turn dialogues. Problems with multiple interdependent functions encouraged iterative refinement, with progression graphs highlighting several common strategies. Students were highly selective about the code they chose to test, suggesting that on-request execution of generated code promoted critical thinking. Given the growing importance of learning dialogue-based programming with AI, we provide this tool as a publicly accessible resource, accompanied by a corpus of programming problems for educational use.
No abstract available
Large language models (LLMs) and prompt engineering hold significant potential for advancing computer programming education through personalized instruction. This paper explores this potential by investigating three critical research questions: the systematic categorization of prompt engineering strategies tailored to diverse educational needs, the empowerment of LLMs to solve complex problems beyond their inherent capabilities, and the establishment of a robust framework for evaluating and implementing these strategies. Our methodology involves categorizing programming questions based on educational requirements, applying various prompt engineering strategies, and assessing the effectiveness of LLM-generated responses. Experiments with GPT-4, GPT-4o, Llama3-8b, and Mixtral-8x7b models on datasets such as LeetCode and USACO reveal that GPT-4o consistently outperforms others, particularly with the "multi-step" prompt strategy. The results show that tailored prompt strategies significantly enhance LLM performance, with specific strategies recommended for foundational learning, competition preparation, and advanced problem-solving. This study underscores the crucial role of prompt engineering in maximizing the educational benefits of LLMs. By systematically categorizing and testing these strategies, we provide a comprehensive framework for both educators and students to optimize LLM-based learning experiences. Future research should focus on refining these strategies and addressing current LLM limitations to further enhance educational outcomes in computer programming instruction.
Large language models (LLMs) have taken the world by storm by making many previously difficult uses of AI feasible. LLMs are controlled via highly expressive textual prompts and return textual answers. Unfortunately, this unstructured text as input and output makes LLM-based applications brittle. This motivates the rise of prompting frameworks, which mediate between LLMs and the external world. However, existing prompting frameworks either have a high learning curve or take away control over the exact prompts from the developer. To overcome this dilemma, this paper introduces the Prompt Declaration Language (PDL). PDL is a simple declarative data-oriented language that puts prompts at the forefront, based on YAML. PDL works well with many LLM platforms and LLMs. It supports writing interactive applications that call LLMs and tools, and makes it easy to implement common use-cases such as chatbots, RAG, or agents. We hope PDL will make prompt programming simpler, less brittle, and more enjoyable.
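To make the "prompts as declarative data" idea from the PDL abstract concrete, here is a toy interpreter for a PDL-like program expressed as a Python list of blocks. The block schema and field names are invented purely for illustration and are not PDL's actual YAML syntax; they only show the shape of executing a declarative sequence of text and model-call blocks.

```python
# Toy interpreter for a PDL-like declarative prompt program.
# The block schema and field names here are invented for illustration;
# they are NOT PDL's actual YAML syntax (see the PDL paper/repository for that).

from typing import Callable, Dict, List, Union

Block = Union[str, Dict[str, str]]


def run_program(blocks: List[Block], call_model: Callable[[str, str], str]) -> str:
    """Execute blocks in order; plain strings append text, {'model': ...} blocks call an LLM."""
    context = ""  # the accumulated prompt/answer transcript
    for block in blocks:
        if isinstance(block, str):
            context += block
        else:
            context += call_model(block["model"], context)
    return context


if __name__ == "__main__":
    def fake_model(name: str, prompt: str) -> str:
        return f"[{name} would answer the prompt ending with: {prompt[-35:]!r}]"

    chatbot = [
        "You are a helpful assistant.\n",
        "User: What does 'declarative' mean here?\n",
        {"model": "some-llm"},
    ]
    print(run_program(chatbot, fake_model))
```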
Large language models (LLMs) are revolutionizing the field of computing education with their powerful code-generating capabilities. Traditional pedagogical practices have focused on code writing tasks, but there is now a shift in importance towards reading, comprehending and evaluating LLM-generated code. Alongside this shift, an important new skill is emerging -- the ability to solve programming tasks by constructing good prompts for code-generating models. In this work we introduce a new type of programming exercise to hone this nascent skill: 'Prompt Problems'. Prompt Problems are designed to help students learn how to write effective prompts for AI code generators. A student solves a Prompt Problem by crafting a natural language prompt which, when provided as input to an LLM, outputs code that successfully solves a specified programming task. We also present a new web-based tool called Promptly which hosts a repository of Prompt Problems and supports the automated evaluation of prompt-generated code. We deploy Promptly in one CS1 and one CS2 course and describe our experiences, which include student perceptions of this new type of activity and their interactions with the tool. We find that students are enthusiastic about Prompt Problems, and appreciate how the problems engage their computational thinking skills and expose them to new programming constructs. We discuss ideas for the future development of new variations of Prompt Problems, and the need to carefully study their integration into classroom practice.
Natural language (NL) programming has become more approachable due to the powerful code-generation capability of large language models (LLMs). This shift to using NL to program enhances collaborative programming by reducing communication barriers and context-switching among programmers from varying backgrounds. However, programmers may face challenges during prompt engineering in a collaborative setting as they need to actively keep aware of their collaborators’ progress and intents. In this paper, we aim to investigate ways to assist programmers’ prompt engineering in a collaborative context. We first conducted a formative study to understand the workflows and challenges of programmers when using NL for collaborative programming. Based on our findings, we implemented a prototype, CoPrompt, to support collaborative prompt engineering by providing referring, requesting, sharing, and linking mechanisms. Our user study indicates that CoPrompt assists programmers in comprehending collaborators’ prompts and building on their collaborators’ work, reducing repetitive updates and communication costs.
The advancement of large language model-based generative artificial intelligence (LLM-based GenAI) has sparked significant interest in its potential to address challenges in computational thinking (CT) education. CT, a critical problem-solving approach in the digital age, encompasses elements such as abstraction, iteration, and generalisation. However, its abstract nature often poses barriers to meaningful teaching and learning. This paper proposes a constructionist prompting framework that leverages LLM-based GenAI to foster CT development through natural language programming and prompt engineering. By engaging learners in crafting and refining prompts, the framework aligns CT elements with five prompting principles, enabling learners to apply and develop CT in contextual and organic ways. A three-phase workshop is proposed to integrate the framework into teacher education, equipping future teachers to support learners in developing CT through interactions with LLM-based GenAI. The paper concludes by exploring the framework’s theoretical, practical, and social implications, advocating for its implementation and validation.
Though the increased availability of Large Language Models (LLMs) presents significant potential for change in the way students learn to program, the text-based nature of the available tools currently precludes block-based languages from much of that innovation. In an attempt to remedy this, we identify the strengths and weaknesses of using a transpiler to leverage the existing learning in commercially available LLMs and Scratch, a visual block-based programming language. Using only prompt engineering, we evaluate an LLM's performance on two common classroom tasks in a Scratch curriculum. We evaluate the LLM's ability to: 1) Create project solutions that compile and satisfy project requirements and 2) Analyze student projects' completion of project requirements using natural language. In both cases, we find results indicating that prompt-engineering alone is insufficient to reliably produce high-quality results. For projects of medium complexity, the LLM-generated solutions consistently failed to follow correct syntax or, in the few instances with correct syntax, produce correct solutions. When used for auto-grading, we found a correlation between scores assigned by the official Scratch Encore autograder and those generated by the LLM; nevertheless, the discrepancies between the 'real' scores and the scores assigned by the LLM remained too great for the tool to be reliable in a classroom setting.
Large language models trained on massive code corpora can generalize to new tasks without the need for task-specific fine-tuning. In few-shot learning, these models take as input a prompt, composed of natural language instructions, a few instances of task demonstration, and a query and generate an output. However, the creation of an effective prompt for code-related tasks in few-shot learning has received little attention. We present a technique for prompt creation that automatically retrieves code demonstrations similar to the developer task, based on embedding or frequency analysis. We apply our approach, Cedar, to two different programming languages, statically and dynamically typed, and two different tasks, namely, test assertion generation and program repair. For each task, we compare Cedar with state-of-the-art task-specific and fine-tuned models. The empirical results show that, with only a few relevant code demonstrations, our prompt creation technique is effective in both tasks with an accuracy of 76% and 52% for exact matches in test assertion generation and program repair tasks, respectively. For assertion generation, Cedar outperforms existing task-specific and fine-tuned models by 333% and 11%, respectively. For program repair, Cedar yields 189% better accuracy than task-specific models and is competitive with recent fine-tuned models. These findings have practical implications for practitioners, as Cedar could potentially be applied to multilingual and multitask settings without task or language-specific training with minimal examples and effort.
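The retrieval step described in the Cedar abstract above can be pictured as "find the demonstrations most similar to the developer's task and prepend them to the prompt." A minimal sketch follows; the similarity here is a simple bag-of-words cosine with a toy demo corpus, whereas the paper also supports learned embeddings, so treat this only as an illustration of the prompt-construction shape.

```python
# Sketch of retrieval-based few-shot prompt construction in the spirit of Cedar:
# pick the demonstrations most similar to the developer's query and prepend them.
# Similarity is a simple bag-of-words cosine here; the paper also uses embeddings.

import math
from collections import Counter
from typing import List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_prompt(query: str, demos: List[Tuple[str, str]], k: int = 2) -> str:
    """demos are (task, solution) pairs; keep the k most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(demos, key=lambda d: cosine(q, Counter(d[0].lower().split())), reverse=True)
    shots = "\n\n".join(f"# Task: {t}\n{s}" for t, s in ranked[:k])
    return f"{shots}\n\n# Task: {query}\n"


if __name__ == "__main__":
    demos = [
        ("assert that the list is sorted", "assert xs == sorted(xs)"),
        ("repair off-by-one loop bound", "for i in range(len(xs)):"),
        ("assert that the dict is empty", "assert not d"),
    ]
    print(build_prompt("assert that the returned list is sorted ascending", demos))
```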
Human-computer collaboration is an effective way to learn programming courses. However, most existing human-computer collaborative programming learning is supported by traditional computers with a relatively low level of personalized interaction, which greatly limits students’ efficiency in programming learning and the development of computational thinking. To address the above issues, this study introduces generative AI into human-computer collaborative programming learning and proposes a dialogue-negotiated human-computer collaborative programming learning method based on generative AI. The method focuses on the problem-solving process and constructs multiple agents through Prompt design, which enable students to improve their computational thinking and master programming skills in the process of human-computer interaction for problem-solving. Finally, a quasi-experiment was conducted to verify the effectiveness of the proposed method in a 10th grade computer programming course in a high school. 43 students in the experimental group learned with the proposed method, while 42 students in the control group adopted the traditional computer-supported human-computer collaborative programming learning method. The experimental results showed that the proposed method more significantly improved students’ computational thinking, programming learning attitudes, and learning achievement. This study provides theoretical foundations and application reference for future generative AI-assisted human-computer collaborative teaching.
The rapid progress of AI-powered programming assistants, such as GitHub Copilot, has facilitated the development of software applications. These assistants rely on large language models (LLMs), which are foundation models (FMs) that support a wide range of tasks related to understanding and generating language. LLMs have demonstrated their ability to express UML model specifications using formal languages like the Object Constraint Language (OCL). However, the context size of the prompt is limited by the number of tokens an LLM can process. This limitation becomes significant as the size of UML class models increases. In this study, we introduce PathOCL, a novel path-based prompt augmentation technique designed to facilitate OCL generation. PathOCL addresses the limitations of LLMs, specifically their token processing limit and the challenges posed by large UML class models. PathOCL is based on the concept of chunking, which selectively augments the prompts with a subset of UML classes relevant to the English specification. Our findings demonstrate that PathOCL, compared to augmenting the complete UML class model (UML-Augmentation), generates a higher number of valid and correct OCL constraints using the GPT-4 model. Moreover, the average prompt size crafted using PathOCL significantly decreases when scaling the size of the UML class models.
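The chunking idea in the PathOCL abstract, selecting only the UML classes relevant to the English specification rather than pasting the whole model into the prompt, can be sketched with a crude relevance heuristic. The toy class model, the token-overlap scoring, and the single-hop association expansion below are simplifications invented for illustration; PathOCL's actual path-based selection is richer.

```python
# Sketch of prompt chunking in the spirit of PathOCL: instead of pasting the whole
# UML class model into the prompt, keep only classes relevant to the English spec.
# The relevance heuristic (token overlap plus one hop of associations) and the toy
# model are simplifications of PathOCL's path-based selection.

from typing import Dict, List, Set

# Hypothetical toy class model: name -> {"attrs": [...], "assoc": [...]}
MODEL: Dict[str, Dict[str, List[str]]] = {
    "Library": {"attrs": ["name"], "assoc": ["Book", "Member"]},
    "Book":    {"attrs": ["title", "isbn", "copies"], "assoc": ["Loan"]},
    "Member":  {"attrs": ["id", "email"], "assoc": ["Loan"]},
    "Loan":    {"attrs": ["dueDate"], "assoc": []},
}


def relevant_classes(spec: str, top_k: int = 2) -> Set[str]:
    tokens = set(spec.lower().split())
    scored = sorted(
        MODEL,
        key=lambda c: len(tokens & ({c.lower()} | {a.lower() for a in MODEL[c]["attrs"]})),
        reverse=True,
    )
    keep = set(scored[:top_k])
    for c in list(keep):                      # one hop of associations for context
        keep.update(MODEL[c]["assoc"])
    return keep


if __name__ == "__main__":
    spec = "A member may have at most five loan records open at once"
    subset = relevant_classes(spec)
    chunk = "\n".join(f"class {c}: attrs={MODEL[c]['attrs']}" for c in sorted(subset))
    print("Prompt context:\n" + chunk)        # only Loan and Member are included
```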
Pre-trained large language models (“LLMs”) like GPT-3 can engage in fluent, multi-turn instruction-taking out-of-the-box, making them attractive materials for designing natural language interactions. Using natural language to steer LLM outputs (“prompting”) has emerged as an important design technique potentially accessible to non-AI-experts. Crafting effective prompts can be challenging, however, and prompt-based interactions are brittle. Here, we explore whether non-AI-experts can successfully engage in “end-user prompt engineering” using a design probe—a prototype LLM-based chatbot design tool supporting development and systematic evaluation of prompting strategies. Ultimately, our probe participants explored prompt designs opportunistically, not systematically, and struggled in ways echoing end-user programming systems and interactive machine learning systems. Expectations stemming from human-to-human instructional experiences, and a tendency to overgeneralize, were barriers to effective prompt design. These findings have implications for non-AI-expert-facing LLM-based tool design and for improving LLM-and-prompt literacy among programmers and the public, and present opportunities for further research.
Machine learning-based building load forecasting (BLF) is crucial for the building automation community, and numerous ML models have been developed for this purpose. However, a significant challenge arises when promoting these models for deployment in real buildings: building practitioners often struggle with ML-related programming. To address this issue, we propose BuildProg, a program generation tool that leverages prompt engineering to decompose user requirements and guide large language models (LLMs) in generating the necessary Python code. In its current version, BuildProg supports four tasks related to the testing of BLF models.
Code generation problems differ from common natural language problems - they require matching the exact syntax of the target language, identifying happy paths and edge cases, paying attention to numerous small details in the problem spec, and addressing other code-specific issues and requirements. Hence, many of the optimizations and tricks that have been successful in natural language generation may not be effective for code tasks. In this work, we propose a new approach to code generation by LLMs, which we call AlphaCodium - a test-based, multi-stage, code-oriented iterative flow, that improves the performances of LLMs on code problems. We tested AlphaCodium on a challenging code generation dataset called CodeContests, which includes competitive programming problems from platforms such as Codeforces. The proposed flow consistently and significantly improves results. On the validation set, for example, GPT-4 accuracy (pass@5) increased from 19% with a single well-designed direct prompt to 44% with the AlphaCodium flow. Many of the principles and best practices acquired in this work, we believe, are broadly applicable to general code generation tasks. Full implementation is available at: https://github.com/Codium-ai/AlphaCodium
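The core of the AlphaCodium abstract above is a test-driven generate / execute / feed-failures-back loop. A minimal sketch of that loop shape follows; the `fake_llm` stub and the two hard-coded drafts stand in for a real model, and the full AlphaCodium flow adds further stages (problem reflection, AI-generated tests, and so on) that are not shown here.

```python
# Sketch of a test-driven code-generation loop in the spirit of AlphaCodium:
# generate a candidate, execute it against the public tests, and feed the
# failures back into the next prompt. The `fake_llm` stub stands in for an LLM.

from typing import Callable, List, Tuple

Test = Tuple[tuple, object]  # (args, expected result)


def run_tests(src: str, fn_name: str, tests: List[Test]) -> List[str]:
    ns: dict = {}
    try:
        exec(src, ns)  # load the candidate into an isolated namespace
        fn = ns[fn_name]
    except Exception as e:                       # syntax errors, missing function, ...
        return [f"candidate failed to load: {e}"]
    failures = []
    for args, expected in tests:
        try:
            got = fn(*args)
            if got != expected:
                failures.append(f"{fn_name}{args} -> {got!r}, expected {expected!r}")
        except Exception as e:
            failures.append(f"{fn_name}{args} raised {e!r}")
    return failures


def iterate(generate: Callable[[str, List[str]], str], task: str,
            tests: List[Test], rounds: int = 3) -> str:
    feedback: List[str] = []
    src = ""
    for _ in range(rounds):
        src = generate(task, feedback)           # failures from the last round guide the next draft
        feedback = run_tests(src, "solve", tests)
        if not feedback:
            break
    return src


if __name__ == "__main__":
    drafts = ["def solve(xs):\n    return max(xs)",                  # ignores the empty-list case
              "def solve(xs):\n    return max(xs) if xs else None"]  # fixed after feedback

    def fake_llm(task: str, feedback: List[str]) -> str:
        return drafts[1] if feedback else drafts[0]

    tests = [(([3, 1, 2],), 3), (([],), None)]
    print(iterate(fake_llm, "return the largest element or None", tests))
```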
No abstract available
Prompt engineering has proven to be a crucial step in leveraging pretrained large language models (LLMs) in solving various real-world tasks. Numerous solutions have been proposed that seek to automate prompt engineering by using the model itself to edit prompts. However, the majority of state-of-the-art approaches are evaluated on tasks that require minimal prompt templates and on very large and highly capable LLMs. In contrast, solving complex tasks that require detailed information to be included in the prompt increases the amount of text that needs to be optimised. Furthermore, smaller models have been shown to be more sensitive to prompt design. To address these challenges, we propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases. In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes by searching the space of programmes populated by function compositions of syntactic, dictionary-based and LLM-based prompt-editing functions. In the second phase, local search is applied to explore the neighbourhoods of best-performing programmes in an attempt to further fine-tune their performance. Our approach outperforms three state-of-the-art prompt optimisation approaches, PromptWizard, OPRO, and RL-Prompt, on three relatively small general-purpose LLMs in four domain-specific challenging tasks. We also illustrate several examples where these benchmark methods suffer relatively severe performance degradation, while our approach improves performance in almost all task-model combinations, only incurring minimal degradation when it does not.
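The evolutionary-search idea in the abstract above can be reduced to a loop that mutates prompts with edit operators, scores each variant, and keeps the fittest. The sketch below is a heavy simplification invented for illustration: it evolves prompt strings directly with a handful of fixed operators and a toy fitness function, whereas the paper evolves prompt-creating programmes with grammar-guided genetic programming plus a separate local-search phase.

```python
# Much-simplified sketch of evolutionary prompt search: mutate a seed prompt with a
# few edit operators, score each variant on a held-out task set, and keep the best.
# The paper's method evolves prompt-*creating programmes* with grammar-guided GP
# plus a local-search phase; this sketch only shows the outer loop shape.

import random
from typing import Callable, List

EDITS: List[Callable[[str], str]] = [
    lambda p: p + " Think step by step.",
    lambda p: p + " Answer with a single word.",
    lambda p: p.replace("Classify", "Carefully classify"),
    lambda p: "You are a precise assistant. " + p,
]


def evolve(seed: str, score: Callable[[str], float], generations: int = 5,
           pop_size: int = 6, rng: random.Random = random.Random(0)) -> str:
    population = [seed]
    for _ in range(generations):
        while len(population) < pop_size:            # refill by mutating survivors
            parent = rng.choice(population)
            population.append(rng.choice(EDITS)(parent))
        population.sort(key=score, reverse=True)     # evaluate and keep the top half
        population = population[: pop_size // 2]
    return population[0]


if __name__ == "__main__":
    # Toy fitness: pretend the downstream model does better with terse, stepwise prompts.
    def toy_score(prompt: str) -> float:
        return ("step by step" in prompt) + ("single word" in prompt) - 0.01 * len(prompt)

    print(evolve("Classify the sentiment of the review.", toy_score))
```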
OPL4GPT: An Application Space Exploration of Optimal Programming Language for Hardware Design by LLM
Despite the emergence of Large Language Models (LLMs) as potential tools for automating hardware design, the optimal programming language to describe hardware functions remains unknown. Prior works extensively explored optimizing Verilog-based HDL design, which often overlooked the potential capabilities of alternative programming languages for hardware designs. This paper investigates the efficacy of C++ and Verilog as input languages in extensive application space exploration, tasking an LLM to generate implementations for various System-on-chip functional blocks. We proposed an automated Optimal Programming Language (OPL) framework that leverages OpenAI's GPT-4o LLM to translate natural language specifications into hardware descriptions using both high-level and low-level programming paradigms. The OPL4GPT demonstration initially employs a novel prompt engineering approach that decomposes design specifications into manageable submodules, presented to the LLM to generate code in both C++ and Verilog. A closed-loop feedback mechanism automatically incorporates error logs from the LLM's outputs, encompassing both syntax and functionality. Finally, functionally correct outputs are synthesized using either RTL (Register-Transfer Level) for Verilog or High-Level Synthesis for C++ to assess area, power, and performance. Our findings illuminate the strengths and weaknesses of each language across various application domains, empowering hardware designers to select the most effective approach.
The translation of natural language to formal constraint models requires expertise in the problem domain and modeling frameworks. To explore the effectiveness of agentic workflows, we propose CP-Agent, a Python coding agent that uses the ReAct framework with a persistent IPython kernel. We provide the relevant domain knowledge as a project prompt of under 50 lines. The algorithm works by iteratively executing code, observing the solver's feedback, and refining constraint models based on execution results. We evaluate CP-Agent on 101 constraint programming problems from CP-Bench. We made minor changes to the benchmark to address systematic ambiguities in the problem specifications and errors in the ground-truth models. On the clarified benchmark, CP-Agent achieves perfect accuracy on all 101 problems. Our experiments show that minimal guidance outperforms detailed procedural scaffolding. Our experiments also show that explicit task management tools can have both positive and negative effects on focused modeling tasks.
This paper presents relevant considerations about the challenges of integrating “prompt engineering” tools in the design and development of immersive systems. Thus, the contribution of this work is a general framework for the design and development of immersive systems using “prompt engineering” tools that use Generative Artificial Intelligence (GenAI). In other words, this proposal describes how to incorporate these tools into a traditional workflow of the extreme programming agile methodology, establishing creative collaboration between the user and the GenAI tools. This work includes a concrete example of this dynamic interaction in developing a virtual reality experience for preparing specialty coffee using the V60 method. Thus, the development example shows the set of GenAI tools used, the “prompt” inputs provided, their corresponding outputs, and how they are integrated into the final product at each stage of the process. Therefore, the recommendations and challenges presented in this document are based on the knowledge gained from this experience.
The emergence of foundation models, such as large language models (LLMs) GPT-4 and text-to-image models DALL-E, has opened up numerous possibilities across various domains. People can now use natural language (i.e., prompts) to communicate with AI to perform tasks. While people can use foundation models through chatbots (e.g., ChatGPT), chat, regardless of the capabilities of the underlying models, is not a production tool for building reusable AI services. APIs like LangChain allow for LLM-based application development but require substantial programming knowledge, thus posing a barrier. To mitigate this, we systematically review, summarise, refine and extend the concept of AI chain by incorporating the best principles and practices that have been accumulated in software engineering for decades into AI chain engineering, to systematize AI chain engineering methodology. We also develop a no-code integrated development environment, Prompt Sapper, which embodies these AI chain engineering principles and patterns naturally in the process of building AI chains, thereby improving the performance and quality of AI chains. With Prompt Sapper, AI chain engineers can compose prompt-based AI services on top of foundation models through chat-based requirement analysis and visual programming. Our user study evaluated and demonstrated the efficiency and correctness of Prompt Sapper.
This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
No abstract available
Solving navigation problems from natural language descriptions is essential for advancing human-robot interaction and enhancing the usability of autonomous systems. Symbolic approaches to path planning excel in well-defined environments but cannot cope with the ambiguity of natural language inputs. On the other hand, neural solutions centered on large language models (LLMs) can parse free-form natural language but lack the reasoning capabilities for solving complex multi-hop path planning problems. In this paper, we propose a neuro-symbolic framework based on program synthesis for multi-hop natural language navigation called NSPS. The framework uses an LLM to parse the problem definition in natural language, a graph of the environment, and an API of a graph library. Next, the code generation capabilities of the LLM are used to synthesize a program for path planning and verification. The path planning program is executed to generate a solution path that is checked by the verification program. A self-correction loop is used to fix both syntax and value errors. The framework is evaluated using 600 multi-hop navigation tasks with 1 to 10 hops. Compared with neural approaches, the NSPS framework improves the success rate and path efficiency by an average of 64.3% and 19.4% across all tasks, respectively.
Creating programs to correctly manipulate data is a difficult task, as the underlying programming languages and APIs can be challenging to learn for many users who are not skilled programmers. Large language models (LLMs) demonstrate remarkable potential for generating code from natural language, but in the data manipulation domain, apart from the natural language (NL) description of the intended task, we also have the dataset on which the task is to be performed, or the "data context". Existing approaches have utilized data context in a limited way by simply adding relevant information from the input data into the prompts sent to the LLM. In this work, we utilize the available input data to execute the candidate programs generated by the LLMs and gather their outputs. We introduce semantic reranking, a technique to rerank the programs generated by LLMs based on three signals coming from the program outputs: (a) semantic filtering and well-formedness based score tuning: do programs even generate well-formed outputs, (b) semantic interleaving: how do the outputs from different candidates compare to each other, and (c) output-based score tuning: how do the outputs compare to outputs predicted for the same task. We provide theoretical justification for semantic interleaving. We also introduce temperature mixing, where we combine samples generated by LLMs using both high and low temperatures. We extensively evaluate our approach in three domains, namely databases (SQL), data science (Pandas) and business intelligence (Excel's Power Query M) on a variety of new and existing benchmarks. We observe substantial gains across domains, with improvements of up to 45% in top-1 accuracy and 34% in top-3 accuracy.
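The output-based reranking described above can be sketched as: run every candidate program on the real data context, drop candidates that crash or produce ill-formed results, and prefer candidates whose outputs agree with many others. The code below is a crude stand-in for the paper's three signals (it collapses semantic filtering and interleaving into a single agreement vote and omits output-based score tuning and temperature mixing), with invented toy data and candidates.

```python
# Sketch of output-based reranking for data-manipulation code: run each candidate
# program on the actual input data, drop candidates whose output is ill-formed,
# and prefer candidates whose outputs agree with many others. This is a crude
# stand-in for the paper's semantic filtering / interleaving / score-tuning signals.

from collections import Counter
from typing import Any, Callable, List, Optional, Tuple

Candidate = Callable[[Any], Any]


def rerank(candidates: List[Candidate], data: Any,
           well_formed: Callable[[Any], bool]) -> List[Tuple[Candidate, str]]:
    outputs: List[Optional[str]] = []
    for c in candidates:
        try:
            out = c(data)
            outputs.append(repr(out) if well_formed(out) else None)
        except Exception:
            outputs.append(None)            # crashed candidates are filtered out
    votes = Counter(o for o in outputs if o is not None)
    ranked = sorted(
        (i for i, o in enumerate(outputs) if o is not None),
        key=lambda i: votes[outputs[i]], reverse=True,
    )
    return [(candidates[i], outputs[i]) for i in ranked]


if __name__ == "__main__":
    rows = [{"city": "Oslo", "pop": 700}, {"city": "Bergen", "pop": 290}]
    candidates = [
        lambda d: sorted(r["city"] for r in d),   # plausible answer
        lambda d: sorted(r["city"] for r in d),   # agrees with the first
        lambda d: [r["citty"] for r in d],        # crashes (typo'd key) -> filtered
        lambda d: "Oslo Bergen",                  # ill-formed (not a list) -> filtered
    ]
    for c, out in rerank(candidates, rows, well_formed=lambda o: isinstance(o, list)):
        print(out)
```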
We present Semantic Interpreter, a natural language-friendly AI system for productivity software such as Microsoft Office that leverages large language models (LLMs) to execute user intent across application features. While LLMs are excellent at understanding user intent expressed as natural language, they are not sufficient for fulfilling application-specific user intent that requires more than text-to-text transformations. We therefore introduce the Office Domain Specific Language (ODSL), a concise, high-level language specialized for performing actions in and interacting with entities in Office applications. Semantic Interpreter leverages an Analysis-Retrieval prompt construction method with LLMs for program synthesis, translating natural language user utterances to ODSL programs that can be transpiled to application APIs and then executed. We focus our discussion primarily on a research exploration for Microsoft PowerPoint.
Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in multi-turn fashion significantly improves program synthesis over that provided as a single turn. We make the training library JAXFORMER and model checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.
Pre-trained Large Language Models (LLMs) are beginning to dominate the discourse around automatic code generation with natural language specifications. In contrast, the best-performing synthesizers in the domain of formal synthesis with precise logical specifications are still based on enumerative algorithms. In this paper, we evaluate the abilities of LLMs to solve formal synthesis benchmarks by carefully crafting a library of prompts for the domain. When one-shot synthesis fails, we propose a novel enumerative synthesis algorithm, which integrates calls to an LLM into a weighted probabilistic search. This allows the synthesizer to provide the LLM with information about the progress of the enumerator, and the LLM to provide the enumerator with syntactic guidance in an iterative loop. We evaluate our techniques on benchmarks from the Syntax-Guided Synthesis (SyGuS) competition. We find that GPT-3.5 as a stand-alone tool for formal synthesis is easily outperformed by state-of-the-art formal synthesis algorithms, but our approach integrating the LLM into an enumerative synthesis algorithm shows significant performance gains over the LLM alone, over the enumerative synthesizer alone, and over the winning SyGuS competition tool.
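The "weighted probabilistic search guided by the LLM" idea above can be illustrated with a toy enumerator: expressions over a tiny grammar are sampled at random, and any production that also appears in an LLM-suggested draft has its sampling weight boosted, steering the search toward the draft without trusting it blindly. The grammar, the weight boost, and the hint string below are invented for illustration and are far simpler than the paper's SyGuS integration.

```python
# Sketch of LLM-guided weighted enumerative synthesis: a tiny grammar over x is
# sampled probabilistically, and productions that also appear in an LLM "hint"
# expression get their weights boosted, steering the enumerator toward the hint.
# The paper integrates this loop with SyGuS solvers; this shows only the core idea.

import random
from typing import Dict, List, Tuple

OPS = ["+", "-", "*"]
LEAVES = ["x", "1", "2"]


def sample_expr(weights: Dict[str, float], depth: int, rng: random.Random) -> str:
    if depth == 0 or rng.random() < 0.4:
        return rng.choices(LEAVES, weights=[weights[l] for l in LEAVES])[0]
    op = rng.choices(OPS, weights=[weights[o] for o in OPS])[0]
    return f"({sample_expr(weights, depth - 1, rng)} {op} {sample_expr(weights, depth - 1, rng)})"


def synthesize(examples: List[Tuple[int, int]], llm_hint: str,
               budget: int = 20000, seed: int = 0) -> str:
    rng = random.Random(seed)
    weights = {sym: 1.0 for sym in OPS + LEAVES}
    for sym in weights:                       # boost anything mentioned in the LLM's draft
        if sym in llm_hint:
            weights[sym] = 5.0
    for _ in range(budget):
        expr = sample_expr(weights, depth=2, rng=rng)
        if all(eval(expr, {"x": x}) == y for x, y in examples):
            return expr
    return "no expression found within budget"


if __name__ == "__main__":
    # Target: f(x) = 2*x + 1; the (slightly wrong) LLM draft still mentions * and +.
    examples = [(0, 1), (1, 3), (4, 9)]
    print(synthesize(examples, llm_hint="(x * 2) + 2"))
```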
Program synthesis from natural language (NL) is practical for humans and, once technically feasible, would significantly facilitate software development and revolutionize end-user programming. We present SAPS, an end-to-end neural network capable of mapping relatively complex, multi-sentence NL specifications to snippets of executable code. The proposed architecture relies exclusively on neural components, and is trained on abstract syntax trees, combined with a pretrained word embedding and a bi-directional multi-layer LSTM for processing of word sequences. The decoder features a doubly-recurrent LSTM, for which we propose novel signal propagation schemes and soft attention mechanism. When applied to a large dataset of problems proposed in a previous study, SAPS performs on par with or better than the method proposed there, producing correct programs in over 92% of cases. In contrast to other methods, it does not require post-processing of the resulting programs, and uses a fixed-dimensional latent representation as the only interface between the NL analyzer and the source code generator.
No abstract available
We present PanGu-Coder, a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation, i.e. the synthesis of programming language solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw programming language data, while the second stage uses a combination of Causal Language Modelling and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation and train on loosely curated pairs of natural language program definitions and code functions. Finally, we discuss PanGu-Coder-FT, which is fine-tuned on a combination of competitive programming problems and code with continuous integration tests. We evaluate PanGu-Coder with a focus on whether it generates functionally correct programs and demonstrate that it achieves equivalent or better performance than similarly sized models, such as CodeX, while attending a smaller context window and training on less data.
Several decision problems that are encountered in various business domains can be modeled as mathematical programs, i.e. optimization problems. The process of conducting such modeling often requires the involvement of experts trained in operations research and advanced algorithms. Surprisingly, despite the significant advances in the methods for program and code synthesis, AutoML, learning to optimize etc., there has been little or no attention paid to automating the task of synthesizing mathematical programs. We imagine a scenario where the specifications for modeling, i.e. the objective and constraints are expressed in an unstructured form in natural language (NL) and the mathematical program has to be synthesized from such an NL specification. In this work we evaluate the efficacy of employing CodeT5 with data augmentation and post-processing of beams. We utilize GPT-3 with back translation for generation of synthetic examples. Further we apply rules of linear programming to score beams and correct beams based on common error patterns. We observe that with these enhancements CodeT5 base gives an execution accuracy of 0.73 which is significantly better than zero-shot execution accuracy of 0.41 by ChatGPT and 0.36 by Codex.
Interacting with computers is a ubiquitous activity for millions of people. Repetitive or specialized tasks often require creation of small, often one-off, programs. End-users struggle with learning and using the myriad of domain-specific languages (DSLs) to effectively accomplish these tasks. We present a general framework for constructing program synthesizers that take natural language (NL) inputs and produce expressions in a target DSL. The framework takes as input a DSL definition and training data consisting of NL/DSL pairs. From these it constructs a synthesizer by learning optimal weights and classifiers (using NLP features) that rank the outputs of a keyword-programming based translation. We applied our framework to three domains: repetitive text editing, an intelligent tutoring system, and flight information queries. On 1200+ English descriptions, the respective synthesizers rank the desired program as the top-1 and top-3 for 80% and 90% descriptions respectively.
No abstract available
We propose a new technique based on program synthesis for automatically generating visualizations from natural language queries. Our method parses the natural language query into a refinement type specification using the intents-and-slots paradigm and leverages type-directed synthesis to generate a set of visualization programs that are most likely to meet the user's intent. Our refinement type system captures useful hints present in the natural language query and allows the synthesis algorithm to reject visualizations that violate well-established design guidelines for the input data set. We have implemented our ideas in a tool called Graphy and evaluated it on NLVCorpus, which consists of 3 popular datasets and over 700 real-world natural language queries. Our experiments show that Graphy significantly outperforms state-of-the-art natural language based visualization tools, including transformer and rule-based ones.
No abstract available
Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real-world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self-correction based on feedback. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3-mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine-tuning LLaMA-3.1-8B-Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM-based program synthesis and inductive reasoning. Our code, data, and models are publicly available at https://github.com/Anjiang-Wei/CodeARC
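The interactive setting in the CodeARC abstract, an agent that queries a hidden target function, proposes candidates, and is checked by a differential-testing oracle, can be sketched as a small loop. Below, the fixed list of guesses stands in for the LLM agent and the random differential tester is a simplification of the benchmark's oracle; both are assumptions made only for illustration.

```python
# Sketch of a CodeARC-style interaction loop: the agent queries a hidden target
# function on inputs of its choice, proposes a candidate implementation, and a
# differential-testing oracle searches for a counterexample. The fixed list of
# guesses below stands in for an LLM agent refining its hypothesis.

import random
from typing import Callable, List, Optional


def differential_oracle(candidate: Callable[[int], int], hidden: Callable[[int], int],
                        trials: int = 200, rng: random.Random = random.Random(0)) -> Optional[int]:
    """Return a counterexample input, or None if none was found."""
    for _ in range(trials):
        x = rng.randint(-50, 50)
        if candidate(x) != hidden(x):
            return x
    return None


def interactive_synthesis(hidden: Callable[[int], int],
                          guesses: List[Callable[[int], int]]) -> Callable[[int], int]:
    observations = []                     # (input, output) pairs gathered by querying
    for guess in guesses:
        cex = differential_oracle(guess, hidden)
        if cex is None:
            return guess
        observations.append((cex, hidden(cex)))  # query the hidden function on the counterexample
        # A real agent would feed `observations` back into the LLM to refine its next guess.
    raise RuntimeError(f"all guesses refuted; observations so far: {observations}")


if __name__ == "__main__":
    hidden = lambda x: abs(x) + 1                 # the function the agent must recover
    guesses = [lambda x: x + 1, lambda x: abs(x) + 1]
    solution = interactive_synthesis(hidden, guesses)
    print(solution(-7))                           # 8
```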
We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications - in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress over the past year has raised the success rate on pure Dafny verification from 68% to 96%. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark
Teaching systems physical tasks is a long-standing goal in HCI, yet most prior work has focused on non-collaborative physical activities. Collaborative tasks introduce added complexity, requiring systems to infer users' assumptions about their teammates' intent, which is an inherently ambiguous and dynamic process. This necessitates representations that are interpretable and correctable, enabling users to inspect and refine system behavior. We address this challenge by framing collaborative task learning as a program synthesis problem. Our system represents behavior as editable programs and uses narrated demonstrations, i.e. paired physical actions and natural language, as a unified modality for teaching, inspecting, and correcting system logic without requiring users to see or write code. The same modality is used for the system to communicate its learning to users. In a within-subjects study, 20 users taught multiplayer soccer tactics to our system. 70 percent (14/20) of participants successfully refined learned programs to match their intent and 90 percent (18/20) found it easy to correct the programs. The study surfaced unique challenges in representing learning as programs and in enabling users to teach collaborative physical activities. We discuss these issues and outline mitigation strategies.
We introduce transductive program synthesis, a new formulation of the program synthesis task that explicitly leverages test inputs during synthesis. While prior approaches to program synthesis--whether based on natural language descriptions or input-output examples--typically aim to generalize from training examples, they often struggle with robustness, especially in real-world settings where training examples are limited and test inputs involve various edge cases. To address this, we propose a novel framework that improves robustness by treating synthesis as active learning over a finite hypothesis class defined by programs' outputs. We use an LLM to predict outputs for selected test inputs and eliminate inconsistent hypotheses, where the inputs are chosen via a greedy maximin algorithm to minimize the number of LLM queries required. We evaluate our approach on four benchmarks: Playgol, MBPP+, 1D-ARC, and programmatic world modeling on MiniGrid. We demonstrate that our method significantly improves program synthesis in both accuracy and efficiency. We release our code at https://github.com/klee972/SYNTRA.
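The hypothesis-elimination loop described above can be sketched directly: candidate programs form a finite hypothesis class over the test inputs; repeatedly pick the input that splits the surviving candidates most evenly (so that whatever answer comes back, many candidates are eliminated), query an oracle for the output on that input, and discard inconsistent candidates. In the sketch below the oracle is an exact function rather than an LLM output predictor, and the informativeness criterion is one plausible reading of the greedy maximin rule, so treat it as an illustration only.

```python
# Sketch of the transductive filtering idea: candidate programs form a finite
# hypothesis class over the test inputs; repeatedly pick the input whose worst-case
# agreement group is smallest (one reading of the greedy maximin rule), query an
# oracle for that input's output (the paper uses an LLM), and drop inconsistent candidates.

from collections import Counter
from typing import Callable, List

Program = Callable[[str], str]


def most_informative_input(survivors: List[Program], test_inputs: List[str]) -> str:
    def worst_case_survivors(x: str) -> int:
        groups = Counter(p(x) for p in survivors)
        return max(groups.values())        # worst case: the oracle's answer matches the biggest group
    return min(test_inputs, key=worst_case_survivors)


def transductive_filter(candidates: List[Program], test_inputs: List[str],
                        oracle: Callable[[str], str], queries: int = 2) -> List[Program]:
    survivors = list(candidates)
    for _ in range(queries):
        if len(survivors) <= 1:
            break
        x = most_informative_input(survivors, test_inputs)
        y = oracle(x)                       # one oracle/LLM query per chosen input
        survivors = [p for p in survivors if p(x) == y]
    return survivors


if __name__ == "__main__":
    candidates: List[Program] = [
        lambda s: s.strip(),                # intended behaviour
        lambda s: s.strip().lower(),        # over-generalises
        lambda s: s,                        # under-generalises
    ]
    test_inputs = ["  Hello  ", "ok"]
    survivors = transductive_filter(candidates, test_inputs, oracle=str.strip)
    print(len(survivors), survivors[0]("  Hi  "))   # 1 Hi
```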
Program synthesis for graph-analysis question answering is difficult because natural language queries are complex, executable programs have strict structural requirements, and current generation methods are unstable. Existing solutions often fail to maintain structural consistency, to block invalid code, or to use feedback for improvement. This paper presents TRIDENT-Forge, a multi-expert framework that uses task decomposition, documentation-constrained retrieval, multi-model fusion decoding, execution-guided search, and reflection-based repair to achieve robust and accurate code generation. By aligning meaning with syntax in a single optimization process, TRIDENT-Forge improves program synthesis beyond single-model and retrieval-augmented approaches, and it yields more stable execution and clearer results.
Accurately classifying chemical structures is essential for cheminformatics and bioinformatics, including tasks such as identifying bioactive compounds of interest, screening molecules for toxicity to humans, finding non-organic compounds with desirable material properties, or organizing large chemical libraries for drug discovery or environmental monitoring. However, manual classification is labor-intensive and difficult to scale to large chemical databases. Existing automated approaches either rely on manually constructed classification rules, or are deep learning methods that lack explainability. This work presents an approach that uses generative artificial intelligence to automatically write chemical classifier programs for classes in the Chemical Entities of Biological Interest (ChEBI) database. These programs can be used for efficient deterministic run-time classification of SMILES structures, with natural language explanations. The programs themselves constitute an explainable computable ontological model of chemical class nomenclature, which we call the ChEBI Chemical Class Program Ontology (C3PO). We validated our approach against the ChEBI database, and compared our results against deep learning models and a naive SMARTS pattern based classifier. C3PO outperforms the naive classifier, but does not reach the performance of state of the art deep learning methods. However, C3PO has a number of strengths that complement deep learning methods, including explainability and reduced data dependence. C3PO can be used alongside deep learning classifiers to provide an explanation of the classification, where both methods agree. The programs can be used as part of the ontology development process, and iteratively refined by expert human curators. We demonstrate a novel knowledge distillation technique in which the classifiers are programs, leveraging the power of cheminformatics software libraries. We demonstrate applicability for classifying chemical structures, and for assisting curation of chemical databases.
Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.
Many real-world eligibility problems, ranging from medical diagnosis to tax planning, can be mapped to decision problems expressed in natural language, wherein a model must make a binary choice based on user features. Large-scale domains such as legal codes or frequently updated funding opportunities render human annotation (e.g., web forms or decision trees) impractical, highlighting the need for agents that can automatically assist in decision-making. Since relevant information is often only known to the user, it is crucial that these agents ask the right questions. As agents determine when to terminate a conversation, they face a trade-off between accuracy and the number of questions asked, a key metric for both user experience and cost. To evaluate this task, we propose BeNYfits, a new benchmark for determining user eligibility for multiple overlapping social benefits opportunities through interactive decision-making. Our experiments show that current language models struggle with frequent hallucinations, with GPT-4o scoring only 35.7 F1 using a ReAct-style chain-of-thought. To address this, we introduce ProADA, a novel approach that leverages program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action. Our agent, ProADA, improves the F1 score to 55.6 while maintaining nearly the same number of dialog turns.
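The "gaps in structured data determine the next action" idea attributed to ProADA above can be pictured as: the eligibility rule is ordinary code over structured user fields, and the agent's next question is simply whichever required field is still missing. The benefit rule, field names, and thresholds in the sketch below are invented for illustration and do not reproduce the BeNYfits benchmark or ProADA's actual planner.

```python
# Sketch of gap-driven dialog planning in the spirit of ProADA: eligibility logic is
# a program over structured user fields, and the next question is chosen by finding
# which required field is still missing ("a gap") rather than by free-form chat.
# The benefit rule, field names, and thresholds below are invented for illustration.

from typing import Dict, Optional

REQUIRED = {
    "household_income": "What is your annual household income in dollars?",
    "num_children": "How many children under 18 live with you?",
    "nyc_resident": "Do you live in New York City? (yes/no)",
}


def eligible(fields: Dict[str, str]) -> bool:
    return (fields["nyc_resident"] == "yes"
            and int(fields["num_children"]) > 0
            and int(fields["household_income"]) < 60000)


def next_question(fields: Dict[str, str]) -> Optional[str]:
    for name, question in REQUIRED.items():
        if name not in fields:            # the first gap determines the next action
            return question
    return None


if __name__ == "__main__":
    answers = {"nyc_resident": "yes"}     # pretend the user volunteered this already
    scripted = iter(["45000", "2"])       # scripted user replies for the demo
    while (q := next_question(answers)) is not None:
        reply = next(scripted)
        missing = next(n for n in REQUIRED if n not in answers)
        answers[missing] = reply          # fill the first missing field with the reply
        print(f"Q: {q}  A: {reply}")
    print("eligible" if eligible(answers) else "not eligible")
```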
Motivated by applications in robotics, we consider the task of synthesizing linear temporal logic (LTL) specifications based on examples and natural language descriptions. While LTL is a flexible, expressive, and unambiguous language to describe robotic tasks, it is often challenging for non-expert users. In this paper, we present an interactive method for synthesizing LTL specifications from a single example trace and a natural language description. The interaction is limited to showing a small number of behavioral examples to the user who decides whether or not they exhibit the original intent. Our approach generates candidate LTL specifications and distinguishing examples using an encoding into optimization modulo theories problems. Additionally, we use a grammar extension mechanism and a semantic parser to generalize synthesized specifications to parametric task descriptions for subsequent use. Our implementation in the tool LtlTalk starts with a domain-specific language that maps to a fragment of LTL and expands it through example-based user interactions, thus enabling natural language-like robot programming, while maintaining the expressive power and precision of a formal language. Our experiments show that the synthesis method is precise, quick, and asks only a few questions to the users, and we demonstrate in a case study how LtlTalk generalizes from the synthesized tasks to other, yet unseen, tasks.
No abstract available
No abstract available
In this research, we delve into a transformative exploration of the realm of code generation, unveiling three fundamental modules that constitute the backbone of our study: the Text-to-Code Generation Model, Code Autocompletion Model, and the formidable Neural Program Synthesis Model. At the forefront of our investigation, we underscore the paramount importance of Neural Program Synthesis—a dynamic force adept at crafting custom programs from input-output examples. Our exploration delves into the intricate landscape of natural language-to-code generation, offering profound insights and invaluable contributions to the field. The ability to automatically generate code from natural language instructions holds immense significance in streamlining software development processes, reducing manual effort, and enhancing productivity. Furthermore, this research addresses a pressing need within the software engineering community for efficient methods of code generation, particularly in domains where human-program interaction is prevalent. Through rigorous experimentation and meticulous analysis, we shed light on methodologies and promising avenues for future research in this burgeoning domain, paving the way for advancements in automated software development and programming assistance tools.
The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the ground truth distribution and demonstrate a proof-of-concept on a neural program synthesis task. We use ILF to improve a Codegen-Mono 6.1B model's pass@1 rate by 38% relative (and 10% absolute) on the Mostly Basic Python Problems (MBPP) benchmark, outperforming both fine-tuning on MBPP and fine-tuning on repaired programs written by humans. Overall, our results suggest that learning from human-written natural language feedback is both more effective and sample-efficient than training exclusively on demonstrations for improving an LLM's performance on code generation tasks.
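To make the ILF loop described above concrete, the following minimal sketch shows one round of training-time feedback collection: generate a draft, gather human-written feedback only for failing drafts, ask the model to refine, and keep refinements that pass the unit tests as fine-tuning data. The helper names (generate, get_feedback, refine, passes_tests) are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of an ILF-style round: generate code, collect human language
# feedback for failures, refine, and keep only test-passing refinements as
# supervised fine-tuning data. All helper callables are hypothetical.
from typing import Callable, List, Tuple

def ilf_round(
    tasks: List[str],
    generate: Callable[[str], str],            # model: task -> candidate code
    get_feedback: Callable[[str, str], str],   # human: (task, code) -> feedback text
    refine: Callable[[str, str, str], str],    # model: (task, code, feedback) -> new code
    passes_tests: Callable[[str, str], bool],  # (task, code) -> do unit tests pass?
) -> List[Tuple[str, str]]:
    """Return (task, refined_code) pairs suitable for supervised fine-tuning."""
    finetune_pairs = []
    for task in tasks:
        draft = generate(task)
        if passes_tests(task, draft):
            continue                            # no feedback needed for correct drafts
        feedback = get_feedback(task, draft)    # small amount of human-written feedback
        fixed = refine(task, draft, feedback)
        if passes_tests(task, fixed):
            finetune_pairs.append((task, fixed))
    return finetune_pairs
```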
Modern shell scripts provide interfaces with rich functionality for system administration. However, it is not easy for end-users to write correct shell scripts; misusing commands may cause unpredictable results. In this paper, we present SmartShell, an automated function-based tool for shell script synthesis, which uses natural language descriptions as input. It can help the computer system to “understand” users’ intentions. SmartShell is based on two insights: (1) natural language descriptions for system objects (such as files and processes) and operations can be recognized by natural language processing tools; (2) system-administration tasks are often completed by short shell scripts that can be automatically synthesized from natural language descriptions. SmartShell synthesizes shell scripts in three steps: (1) using natural language processing tools to convert the description of a system-administration task into a syntax tree; (2) using program-synthesis techniques to construct a SmartShell intermediate-language script from the syntax tree; (3) translating the intermediate-language script into a shell script. Experimental results show that SmartShell can successfully synthesize 53.7% of tasks collected from shell-script helping forums.
Program synthesis aims to create accurate, executable programs from problem specifications, specifically from natural language descriptions in our context. Recent studies have leveraged the power of reinforcement learning (RL) in conjunction with large language models (LLMs), significantly enhancing code generation capabilities. The application of RL focuses on directly optimizing for functional correctness, offering an advantage over conventional supervised methods. Despite policy-based RL methods dominating the literature on RL for program synthesis, the nature of program synthesis tasks hints at a natural alignment with value-based methods. This stems from the rich collection of off-policy programs, including those developed by human programmers and also historical samples, coupled with the straightforward verification of generated programs through automated unit testing, meaning rewards are easy to obtain. Diverging from the dominant use of policy-based algorithms, our work explores the feasibility of value-based approaches, leading to the development of our B-Coder (pronounced Bellman coder). Yet, training value-based methods presents challenges due to the enormous search space inherent to program synthesis. To this end, we introduce an initialization protocol for RL agents utilizing pre-trained LMs and a conservative Bellman operator to reduce training complexities. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrated B-Coder's capability in achieving state-of-the-art performance when compared to policy-based methods. Remarkably, this achievement is reached with minimal reward engineering effort, highlighting the effectiveness of value-based RL, independent of reward designs.
The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. However, the scientific evaluation of these systems is hampered by significant challenges, including overly simplistic benchmarks and the difficulty of conducting fair comparisons between different agent architectures due to confounding implementation variables. To address these limitations, we first construct a challenging and dynamically curated E2EDevBench to simulate realistic development scenarios. Second, we propose a hybrid evaluation framework that combines test-case-based functional assessment with fine-grained, LLM-based requirement verification. Using this framework, we conduct a controlled empirical study on three representative agent architectures implemented upon a unified foundation to isolate the impact of workflow design. Our findings reveal that state-of-the-art agents can fulfill approximately 50% of requirements on E2EDevBench, but their success is critically dependent on the architectural strategy for task decomposition and collaboration. Furthermore, our analysis indicates that the primary bottleneck is the omission of requirements and inadequate self-verification. This work provides the community with a more realistic benchmark, a comprehensive evaluation framework, and crucial insights into the current capabilities and core challenges of software development agents, guiding future research toward enhancing requirement comprehension and planning.
LLM-driven multi-agent collaboration (MAC) systems have demonstrated impressive capabilities in automatic software development at the function level. However, their heavy reliance on human design limits their adaptability to the diverse demands of real-world software development. To address this limitation, we introduce EvoMAC, a novel self-evolving paradigm for MAC networks. Inspired by traditional neural network training, EvoMAC obtains text-based environmental feedback by verifying the MAC network's output against a target proxy and leverages a novel textual backpropagation to update the network. To extend coding capabilities beyond function-level tasks to more challenging software-level development, we further propose rSDE-Bench, a requirement-oriented software development benchmark, which features complex and diverse software requirements along with automatic evaluation of requirement correctness. Our experiments show that: i) The automatic requirement-aware evaluation in rSDE-Bench closely aligns with human evaluations, validating its reliability as a software-level coding benchmark. ii) EvoMAC outperforms previous SOTA methods on both the software-level rSDE-Bench and the function-level HumanEval benchmarks, reflecting its superior coding capabilities. The benchmark can be downloaded at https://yuzhu-cai.github.io/rSDE-Bench/.
The advent of Large Language Models (LLMs) with advanced code generation capabilities marks a distinct inflection point in software engineering. This paper argues that the traditional role of the "coder"-defined by the manual translation of logic into syntax-is rapidly becoming obsolete. In its place, a new paradigm is emerging: System Architecture as the primary unit of engineering value. Drawing on case studies from the development of *Gaari*, a distributed mobility platform, and *The Trail*, an intelligence system, this paper demonstrates that as syntax becomes commoditized, the engineering bottleneck shifts to system design, data flow coherence, and architectural resilience. We propose that the modern engineer must transition from a syntax-first approach to a logic-first "System Architect" model to remain relevant in an AI-driven ecosystem.
Recent advances in large language models (LLMs) have enabled transformative approaches in software development, positioning Artificial Intelligence (AI) not just as an assistant but as an integral programming layer. We introduce a novel, five-tiered framework for LLM-driven code synthesis. At its base is a minimal Task Execution Meta-Language (TEML) defining atomic tasks with typed parameters, return schemas, synchronous/asynchronous and fork/join control, plus hooks for logging, security and data management. Layered atop TEML, a Domain Task Specification Language (DTSL) instantiates these primitives into semantically rich, field-specific operations and enforces valid invocation patterns. The centrepiece is the Feature Execution Graph (FXG), a directed, attributed graph whose nodes and edges encode configured tasks and their calls. A Generation Engine traverses the FXG, issues context-aware prompts to an LLM to synthesise code for each task, and packages the results either as local functions or as containerised services. Finally, an Orchestration Engine executes the synthesised pipeline by invoking tasks locally or orchestrating services in environments such as Kubernetes. Evaluated on two representative workflows, a six-node data-science pipeline and a twelve-node EEG signal-analysis pipeline, our FXG-driven approach cut manual development time by about 40%, produced code that passed unit tests on the first attempt in 90% of local runs (85% when containerised), and preserved baseline predictive accuracy while trimming up to 25% of boilerplate.
The Satisfiability (SAT) problem is a core challenge with significant applications in software engineering, including automated testing, configuration management, and program verification. This paper presents SolSearch, a novel framework that harnesses large language models (LLMs) to discover and optimize SAT-solving strategies automatically. Leveraging a curriculum-based, trial-and-error process, SolSearch enables the LLM to iteratively modify and generate SAT solver code, thereby improving solving efficiency and performance. This automated SAT-solving paradigm has the advantage of being plug-and-play, allowing integration with any SAT solver and accelerating the development or design process of new SAT solvers (new methods). Our preliminary experimental results are encouraging by demonstrating that the LLM-powered paradigm improves state-of-the-art SAT solvers on general SAT benchmarks and significantly enhances the performance of the widely used Z3 solver (11% on PAR-2 score). These results highlight the potential for using LLM-driven methods to advance solver adaptability and effectiveness in real-world software engineering challenges. Future research directions are discussed to further refine and validate this approach, offering a promising avenue for integrating AI with traditional software engineering tasks.
The latest paradigm shift in software development brings in the innovation and automation afforded by Large Language Models (LLMs), showcased by Generative Pre-trained Transformer (GPT), which has shown remarkable capacity to generate code autonomously, significantly reducing the manual effort required for various programming tasks. Although the potential benefits of LLM-generated code are vast, most notably in efficiency and rapid prototyping, complex and multifaceted challenges arise as LLMs become increasingly integrated into the software development lifecycle and hence the supply chain, because the code generated from these language models carries profound questions of quality and correctness. Research is required to comprehensively explore these critical concerns surrounding LLM-generated code. In this paper, we propose a novel solution called metamorphic prompt testing to address these challenges. Our intuitive observation is that intrinsic consistency always exists among correct code pieces but may not exist among flawed code pieces, so we can detect flaws in the code by detecting inconsistencies. Therefore, we can vary a given prompt into multiple prompts through paraphrasing and ask the LLM to produce multiple versions of generated code, so that we can validate whether the semantic relations still hold in the acquired code through cross-validation. Our evaluation on HumanEval shows that metamorphic prompt testing is able to detect 75 percent of the erroneous programs generated by GPT-4, with a false positive rate of 8.6 percent.
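The cross-validation idea in this abstract can be sketched in a few lines: paraphrase the prompt, generate one code version per paraphrase, run all versions on shared inputs, and flag the task if they disagree. The `llm` and `paraphrase` callables, the `solve` entry-point name, and the trusted `exec` of generated code are all assumptions of this sketch, not the paper's tooling.

```python
# Sketch of metamorphic prompt testing: generate code under several paraphrased
# prompts and flag the task if the versions disagree on any shared input.
from typing import Any, Callable, List

def run_generated(code: str, arg: Any, func_name: str = "solve") -> Any:
    """Execute generated code in a fresh namespace and call `func_name`.
    Sandboxed/trusted execution is assumed here."""
    ns: dict = {}
    exec(code, ns)
    return ns[func_name](arg)

def metamorphic_check(
    prompt: str,
    test_inputs: List[Any],
    llm: Callable[[str], str],                  # prompt -> generated code
    paraphrase: Callable[[str, int], List[str]],  # prompt, n -> n paraphrases
    n_variants: int = 3,
) -> bool:
    """Return True if all paraphrase-generated versions agree on every input."""
    versions = [llm(p) for p in paraphrase(prompt, n_variants)]
    for x in test_inputs:
        outputs = [run_generated(code, x) for code in versions]
        if len({repr(o) for o in outputs}) > 1:
            return False                         # inconsistency -> likely flawed code
    return True
```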
No abstract available
Autonomous agents powered by large language models (LLMs) show significant potential for achieving high autonomy in various scenarios such as software development. Recent research has shown that LLM agents can leverage past experiences to reduce errors and enhance efficiency. However, the static experience paradigm, reliant on a fixed collection of past experiences acquired heuristically, lacks iterative refinement and thus hampers agents' adaptability. In this paper, we introduce the Iterative Experience Refinement framework, enabling LLM agents to refine experiences iteratively during task execution. We propose two fundamental patterns: the successive pattern, refining based on nearest experiences within a task batch, and the cumulative pattern, acquiring experiences across all previous task batches. Augmented with our heuristic experience elimination, the method prioritizes high-quality and frequently-used experiences, effectively managing the experience space and enhancing efficiency. Extensive experiments show that while the successive pattern may yield superior results, the cumulative pattern provides more stable performance. Moreover, experience elimination facilitates achieving better performance using just 11.54% of a high-quality subset.
Recently, the frontier of Large Language Model (LLM) capabilities has shifted from single-turn code generation to agentic software engineering-a paradigm where models autonomously navigate, edit, and test complex repositories. While post-training methods have become the de facto approach for code agents, **agentic mid-training**-mid-training (MT) on large-scale data that mirrors authentic agentic workflows-remains critically underexplored due to substantial resource requirements, despite offering a more scalable path to instilling foundational agentic behaviors than relying solely on expensive reinforcement learning. A central challenge in realizing effective agentic mid-training is the distribution mismatch between static training data and the dynamic, feedback-rich environment of real development. To address this, we present a systematic study of agentic mid-training, establishing both the data synthesis principles and training methodology for effective agent development at scale. Central to our approach is **agent-native data**-supervision comprising two complementary types of trajectories: **contextually-native trajectories** that preserve the complete information flow an agent experiences, offering broad coverage and diversity; and **environmentally-native trajectories** collected from executable repositories where observations stem from actual tool invocations and test executions, providing depth and interaction authenticity. We verify the model's agentic capabilities on `SWE-Bench Verified`. We demonstrate our superiority over the previous open software engineering mid-training recipe `Kimi-Dev` under two post-training settings with an aligned base model and agentic scaffold, while using less than half mid-training tokens (73.1B). Besides relative advantage, our best performing 32B and 72B models achieve **56.1%** and **58.5%** resolution rates, respectively, which are ...
Software development has entered a new era where large language models (LLMs) now serve as general-purpose reasoning engines, enabling natural language interaction and transformative applications across diverse domains. This paradigm is now extending into computer-aided engineering (CAE). Recent applications of LLMs in CAE have successfully automated routine tasks, including CAD model generation and FEM simulations. Nevertheless, these contributions, which primarily serve to reduce manual labor, are often insufficient for addressing the significant computational challenges posed by large-scale, high-dimensional systems. To this aim, we first introduce the concept of LLM-empowered CAE agent, where LLMs act as autonomous collaborators that plan, execute, and adapt CAE workflows. Then, we propose an LLM-empowered CAE agent for data-free model order reduction (MOR), a powerful yet underused approach for ultra-fast large-scale parametric analysis due to the intrusive nature and labor-intensive redevelopment of solvers. LLMs can alleviate this barrier by automating derivations, code restructuring, and implementation, making intrusive MOR both practical and broadly accessible. To demonstrate feasibility, we present an LLM-empowered CAE agent for solving ultra-large-scale space-parameter-time (S-P-T) physical problems using Tensor-decomposition-based A Priori Surrogates (TAPS). Our results show that natural language prompts describing parametric partial differential equations (PDEs) can be translated into efficient solver implementations, substantially reducing human effort while producing high-fidelity reduced-order models. Moreover, LLMs can synthesize novel MOR solvers for unseen cases such as nonlinear and high-dimensional parametric problems based on their internal knowledge base. This highlights the potential of LLMs to establish the foundation for next-generation CAE systems.
Zero-knowledge proofs (ZKPs) are increasingly deployed in domains such as privacy-preserving authentication, verifiable computation, and secure finance. However, authoring ZK programs remains challenging: unlike conventional software development, ZK programming manifests a fundamental paradigm shift from imperative computation to declarative verification. This process requires rigorous reasoning about finite field arithmetic and complex constraint systems (which is rare in common imperative languages), making it knowledge-intensive and error-prone. While large language models (LLMs) have demonstrated strong code generation capabilities in general-purpose languages, their effectiveness for ZK programming, where correctness hinges on both language mastery and constraint-level reasoning, remains unexplored. To address this gap, we propose ZK-Eval, a domain-specific evaluation pipeline that probes LLM capabilities on ZK programming at three levels: language knowledge, algebraic primitive competence, and end-to-end program generation. Our evaluation of four state-of-the-art LLMs reveals that while models demonstrate strong proficiency in language syntax, they struggle when implementing and composing algebraic primitives to specify correct constraint systems, frequently producing incorrect programs. Based on these insights, we introduce ZK-Coder, an agentic framework that augments LLMs with constraint sketching, guided retrieval, and interactive repair. Experiments with GPT-o3 on Circom and Noir show substantial gains, with success rates improving from 20.29% to 87.85% and from 28.38% to 97.79%, respectively. With ZK-Eval and ZK-Coder, we establish a new basis for systematically measuring and augmenting LLMs in ZK code generation to lower barriers for practitioners and advance privacy computing.
A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy's behavior diverges from the expert's, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift for multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi-turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on-policy and imitation learning approaches on SWE agent problems and train models using a common rejection sampling (i.e., using environment reward) combined with supervised fine-tuning technique. Experiments find that OEC trajectories show a relative 14% and 13% improvement over traditional imitation learning in the 7b and 32b setting, respectively, on SWE-bench verified. Our results demonstrate the need for combining expert demonstrations with on-policy data for effective multi-turn LM agent training.
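The data generation step described here can be summarized in a short sketch: begin the rollout with the student policy, hand control to the expert partway through, and keep the full trajectory for fine-tuning. The policy and environment interfaces below are hypothetical placeholders, not the paper's SWE-agent tooling.

```python
# Sketch of on-policy expert correction (OEC) data generation: a partially
# on-policy rollout that switches from the student to the expert mid-trajectory.
import random
from typing import Callable, List, Tuple

def oec_trajectory(
    env_reset: Callable[[], str],                          # -> initial observation
    env_step: Callable[[str, str], str],                   # (state, action) -> next state
    student: Callable[[str, List[Tuple[str, str]]], str],  # (state, history) -> action
    expert: Callable[[str, List[Tuple[str, str]]], str],   # (state, history) -> action
    max_turns: int = 20,
) -> List[Tuple[str, str]]:
    switch_at = random.randint(1, max_turns - 1)   # switch point sampled per trajectory
    state, history = env_reset(), []
    for turn in range(max_turns):
        policy = student if turn < switch_at else expert   # student first, expert after
        action = policy(state, history)
        history.append((state, action))
        state = env_step(state, action)
    return history   # typically filtered by environment reward before fine-tuning
```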
Recent advances in Large Language Models (LLMs) have introduced a new paradigm for software development, where source code is generated directly from natural language prompts. While this paradigm significantly boosts development productivity, building complex, real-world software systems remains challenging because natural language offers limited control over the generated code. Inspired by the historical evolution of programming languages toward higher levels of abstraction, we advocate for a high-level abstraction language that gives developers greater control over LLM-assisted code writing. To this end, we propose Code Semantic Zooming, a novel approach based on pseudocode that allows developers to iteratively explore, understand, and refine code across multiple layers of semantic abstraction. We implemented Code Semantic Zooming as a VS Code extension and demonstrated its effectiveness through two real-world case studies.
This paper examines the paradigm shifts in leveraging generative artificial intelligence for automated code generation at the enterprise level. It offers a critical review of prevailing prescriptions for integrating LLM agents into the software development lifecycles of modern enterprises, assessing their impact on team productivity and the new risks they introduce to confidentiality and licensing matters. The study is therefore timely, as organizational adoption of generative AI is advancing rapidly from mere IDE autocompletion features, beyond the digital co-programmer, to autonomous agents capable of opening pull requests without a human in the loop, demanding new forms of legibility both organizationally and technically. The novelty of this research lies in its integration of material from scholarly works, industry reports, and case studies, along with lab pilot runs of Copilot and actual DevSecOps implementations, to triangulate the current state and future promise of this technology on a practical business level. Key findings include: a reduction of development cycle time by 50–60% without compromising code quality thanks to the integration of AI agents into IDEs and CI/CD pipelines; a shift of developers’ roles toward architects and reviewers as routine tasks are delegated to digital co-programmers; and a necessity for phased implementation that accounts for private code protection and compliance with licensing norms. Significant barriers identified include model hallucination management, ensuring the traceability of changes, and adapting organizational culture and regulations to new roles such as prompt designers and AI-agent curators. The article will be of use to IT department heads, software architects, DevSecOps specialists, and researchers in the field of artificial intelligence.
The increasing complexity of computational demands has spurred the adoption of domain-specific accelerators, yet traditional hardware design methodologies remain constrained by prolonged development and verification cycles. High-Level Synthesis (HLS) bridges the software-hardware gap by enabling hardware design from high-level languages. However, its widespread adoption is hindered by strict coding constraints and intricate hardware-specific optimizations. To address these challenges, we introduce ChatHLS, an agile HLS design automation workflow that leverages fine-tuned LLMs integrated within a multi-agent framework for HLS-specific error correction and design optimization. Through navigating LLM training with a novel verification-oriented data augmentation paradigm, ChatHLS achieves an average repair pass rate of 82.7% over 612 error cases. Furthermore, by enabling optimization reasoning within practical computational budgets, ChatHLS delivers performance improvements ranging from 1.9× to 14.8× on resource-constrained kernels, attaining a 3.6× average speedup compared to SOTA approaches. These results underscore the potential of ChatHLS in substantially expediting hardware development cycles while upholding rigorous standards of design reliability and quality.
Large language models (LLMs) are increasingly used as interactive systems rather than single-turn text generators. However, most prompt engineering paradigms remain oriented toward linear or reactive interactions and offer limited support for persistent state, structured workflows, or explicit user control. This paper introduces program simulation prompting, a prompting paradigm in which an LLM is instructed to simulate the behavior of a structured software program using natural language interaction alone. In this approach, the model adopts a persistent program identity, presents menu-driven interaction modes, manages explicit task state, and constrains behavior to a bounded domain. We analyze three domain-diverse program simulation prompts—spanning startup ideation, culinary recipe development, and creative writing—using a qualitative comparative methodology. Across these case studies, we identify recurring design patterns, including main menu abstractions, explicit separation between creating new and continuing existing artifacts, progressive elaboration, summarization-based state compression, and transparent signaling of memory constraints. Based on these findings, we propose a general, reusable framework for program simulation prompting that decomposes prompt design into five layers: initialization, interaction, state management, output constraints, and user control mechanisms. Our results demonstrate that program simulation prompting enables interactive, stateful, and user-controlled LLM behavior without fine-tuning, external tools, or autonomous agents. This paradigm bridges conversational AI and traditional software affordances, offering a lightweight and transparent alternative to agent-based systems. We conclude by discussing implications for LLM usability, system design, and future hybrid approaches.
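A minimal illustration of the five-layer decomposition named above is to assemble a system prompt from one block per layer (initialization, interaction, state management, output constraints, user control). The wording of each layer in this sketch is illustrative, not taken from the paper.

```python
# Sketch of a program simulation prompt assembled from the five layers the
# abstract names. The layer texts are illustrative placeholders.
def program_simulation_prompt(domain: str, menu: list[str]) -> str:
    layers = {
        "initialization": f"You are a persistent, menu-driven {domain} program.",
        "interaction": "Always show this main menu and wait for a choice:\n"
                       + "\n".join(f"  {i + 1}. {item}" for i, item in enumerate(menu)),
        "state management": "Keep a numbered list of artifacts created so far; "
                            "summarize older ones when the session grows long and say so.",
        "output constraints": f"Stay strictly within {domain}; refuse unrelated requests.",
        "user control": "Support the commands: 'menu', 'status', 'undo', 'exit'.",
    }
    return "\n\n".join(f"[{name.upper()}]\n{text}" for name, text in layers.items())

# Example usage: a culinary recipe-development simulation, echoing one of the
# paper's three case-study domains.
print(program_simulation_prompt(
    "culinary recipe development",
    ["Create a new recipe", "Continue an existing recipe", "Review saved recipes"],
))
```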
Software development is shifting from traditional programming to AI-integrated applications that leverage generative AI and large language models (LLMs) during runtime. However, integrating LLMs remains complex, requiring developers to manually craft prompts and process outputs. Existing tools attempt to assist with prompt engineering, but often introduce additional complexity. This paper presents Meaning-Typed Programming (MTP), a novel paradigm that abstracts LLM integration through intuitive language-level constructs. By leveraging the inherent semantic richness of code, MTP automates prompt generation and response handling without additional developer effort. We introduce (1) the by operator for seamless LLM invocation, (2) MT-IR, a meaning-based intermediate representation for semantic extraction, and (3) MT-Runtime, an automated system for managing LLM interactions. We implement MTP in Jac, a programming language that supersets Python, and find that MTP significantly reduces development complexity, lines of code modifications needed, and costs while improving run-time performance and maintaining or exceeding the accuracy of existing approaches. Our user study shows that developers using MTP completed tasks 3.2× faster with 45% fewer lines of code compared to existing frameworks. Moreover, MTP demonstrates resilience even when up to 50% of naming conventions are degraded, demonstrating robustness to suboptimal code. MTP is developed as part of the Jaseci open-source project and is available under the module byLLM.
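To convey the core idea of meaning-typed invocation, the sketch below shows a conceptual Python analogue: a decorator that builds the prompt automatically from a function's own semantics (name, parameters, type hints, docstring). This is not the Jac by operator or the byLLM module's API, only an illustration of how code semantics can replace hand-crafted prompts; the `llm` stub is a placeholder for any chat-completion call.

```python
# Conceptual Python analogue of meaning-typed LLM invocation (NOT the Jac
# `by` operator or byLLM API): the prompt is derived from the function's
# signature and docstring instead of being written by hand.
import functools
import inspect
from typing import Callable

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call; plug in a provider SDK here."""
    raise NotImplementedError

def by_llm(func: Callable) -> Callable:
    sig = inspect.signature(func)
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        prompt = (
            f"Act as the function `{func.__name__}{sig}`.\n"
            f"Purpose: {inspect.getdoc(func)}\n"
            f"Arguments: {dict(bound.arguments)}\n"
            f"Return only the value, formatted as the annotated return type."
        )
        return llm(prompt)
    return wrapper

@by_llm
def summarize_bug_report(report: str, max_words: int = 30) -> str:
    """Condense a bug report into a one-sentence summary."""
```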
Non-trivial software systems are commonly developed using more than a single programming language. However, multi-language development is not straightforward. Nowadays, tools powered by Large Language Models (LLMs), such as ChatGPT, have been shown to successfully assist practitioners in several aspects of software development. This paper reports a preliminary study aimed at investigating to what extent ChatGPT is being used in multi-language development scenarios. Hence, we leveraged DevGPT, a dataset of conversations between software practitioners and ChatGPT. In total, we studied data from 3,584 conversations, comprising a total of 18,862 code snippets. Our analyses show that only 18.33% of the code snippets suggested by ChatGPT are written in the same programming language as the primary language in the repository where the conversation was shared. In an in-depth analysis, we observed expected scenarios, such as 31.54% of JavaScript snippets being suggested in CSS repositories. However, we also unveiled surprising ones, such as Python snippets being largely suggested in C++ repositories. After a qualitative open card sorting of the conversations, we found that in 70% of them developers were asking for coding support while in 57% developers used ChatGPT as a tool to generate code. Our initial results indicate that LLMs are indeed being used in multi-language development, and they showcase the contexts in which such tools are assisting developers.
In this work, we explore explicit Large Language Model (LLM)-powered support for the iterative design of computer programs. Program design, like other design activity, is characterized by navigating a space of alternative problem formulations and associated solutions in an iterative fashion. LLMs are potentially powerful tools in helping this exploration; however, by default, code-generation LLMs deliver code that represents a particular point solution. This obscures the larger space of possible alternatives, many of which might be preferable to the LLM’s default interpretation and its generated code. We contribute an IDE that supports program design through generating and showing new ways to frame problems alongside alternative solutions, tracking design decisions, and identifying implicit decisions made by either the programmer or the LLM. In a user study, we find that with our IDE, users combine and parallelize design phases to explore a broader design space—but also struggle to keep up with LLM-originated changes to code and other information overload. These findings suggest a core challenge for future IDEs that support program design through higher-level instructions given to LLM-based agents: carefully managing attention and deciding what information agents should surface to program designers and when.
No abstract available
This paper presents the Function Block Assistant (fbAssistant), an LLM-backed tool prototype for developing control logic in industrial automation. fbAssistant interprets natural language requirements and automatically generates state machines and their function blocks implementation. The study demonstrates iterative refinement, simulation validation, and deployment using EcoStruxure Automation Expert. The proposed approach aims at improved efficiency and accuracy in the development of automation software.
Large language models (LLMs) substantially enhance developer productivity in repository-level code generation through interactive collaboration. However, as interactions progress, repository context must be continuously preserved and updated to integrate newly validated information. Meanwhile, the expanding session history increases cognitive burden, often leading to forgetting and the reintroduction of previously resolved errors. Existing memory management approaches show promise but remain limited by natural language-centric representations. To overcome these limitations, we propose CodeMEM, an AST-guided dynamic memory management system tailored for repository-level iterative code generation. Specifically, CodeMEM introduces the Code Context Memory component that dynamically maintains and updates repository context through AST-guided LLM operations, along with the Code Session Memory that constructs a code-centric representation of interaction history and explicitly detects and mitigates forgetting through AST-based analysis. Experimental results on the instruction-following benchmark CodeIF-Bench and the code generation benchmark CoderEval demonstrate that CodeMEM achieves state-of-the-art performance, improving instruction following by 12.2% for the current turn and 11.5% for the session level, and reducing interaction rounds by 2-3, while maintaining competitive inference latency and token efficiency.
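The general idea of a code-centric, AST-derived context (as opposed to carrying raw session text) can be illustrated with Python's standard ast module: reduce each repository file to its function and class signatures and carry that compact skeleton across turns. This is an illustration of the idea only, not CodeMEM's actual memory components.

```python
# Minimal illustration of AST-derived repository context (not CodeMEM itself):
# summarize each file into its function/class signatures so a compact,
# code-centric skeleton can be preserved and updated across interaction turns.
import ast
from pathlib import Path

def file_skeleton(path: str) -> list[str]:
    """Extract 'def name(args)' and 'class Name' entries from one Python file."""
    tree = ast.parse(Path(path).read_text())
    entries = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            entries.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            entries.append(f"class {node.name}")
    return entries

def repo_memory(py_files: list[str]) -> str:
    """Concatenate per-file skeletons into a prompt-sized repository summary."""
    parts = [f"# {path}\n" + "\n".join(file_skeleton(path)) for path in py_files]
    return "\n\n".join(parts)
```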
Large Language Models (LLMs) are increasingly being used in software development and in applications like code generation. While LLMs can provide significant value in the form of time savings in common programming languages like Python, their usability in generating automation software has yet to be studied extensively. In the context of generating control software in the form of IEC 61131-3 compliant code, initial studies suggest LLMs provide a promising avenue for increasing control engineer productivity. However, similar code generation for IEC 61499-based control applications is still scarce. While tools are being developed for this purpose, their capabilities are not yet fully understood, and they often require significant human input to generate the intended outcomes. This paper explores LLM-based code generation for IEC 61499-based applications through iterative prompting. The prompts for the experiments are derived from requirements generated by System-Theoretic Process Analysis (STPA), which provides a systematic approach to creating prompts that also connect to the larger systems engineering workflow. The results indicate that while the approach may be successful in some instances, more work is required to mitigate the issues arising from its application.
Automatic code generation has gained significant momentum with the advent of Large Language Models (LLMs) such as GPT-4. Although many studies focus on improving the effectiveness of LLMs for code generation, very limited work tries to understand the generated code’s characteristics and leverage that to improve failed cases. In this paper, as the most straightforward characteristic of code, we investigate the relationship between code complexity and the success of LLM-generated code. Using a large set of standard complexity metrics, we first conduct an empirical analysis to explore their correlation with LLM’s performance on code generation (i.e., Pass@1). Using logistic regression models, we identify which complexity metrics are most predictive of code correctness. Building on these findings, we propose an iterative feedback method, where LLMs are prompted to generate correct code based on complexity metrics from previous failed outputs. We validate our approach across multiple benchmarks (i.e., HumanEval, MBPP, LeetCode, and BigCodeBench) and various LLMs (i.e., GPT-4o, GPT-3.5 Turbo, Llama 3.1, and GPT-o3 mini), comparing the results with two baseline methods: (a) zero-shot generation, and (b) iterative execution-based feedback without our code complexity insights. Experiment results show that our approach makes notable improvements, particularly with a smaller LLM (GPT-3.5 Turbo), where, e.g., Pass@1 increased by 35.71% compared to the baseline’s improvement of 12.5% on the HumanEval dataset. The study expands experiments to BigCodeBench and integrates the method with the Reflexion code generation agent, leading to Pass@1 improvements of 20% (GPT-4o) and 23.07% (GPT-o3 mini). The results highlight that complexity-aware feedback enhances both direct LLM prompting and agent-based workflows.
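The feedback mechanism described above can be sketched with a single crude complexity proxy: count branching constructs in the failed attempt and ask the model to produce a simpler, correct retry. The proxy metric and the prompt wording below stand in for the paper's fuller set of complexity metrics; `llm` is a hypothetical completion call.

```python
# Sketch of complexity-aware retry prompting: measure a rough cyclomatic-style
# proxy on the failed attempt and feed it back so the next attempt stays simpler.
import ast
from typing import Callable

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

def complexity_proxy(code: str) -> int:
    """1 + number of branching constructs: a crude stand-in for cyclomatic complexity."""
    tree = ast.parse(code)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

def complexity_aware_retry(task: str, failed_code: str, llm: Callable[[str], str]) -> str:
    score = complexity_proxy(failed_code)
    prompt = (
        f"Task: {task}\n"
        f"The previous attempt failed its tests and had complexity {score}.\n"
        f"Previous attempt:\n{failed_code}\n"
        f"Write a correct solution with complexity no higher than {max(1, score - 1)}."
    )
    return llm(prompt)
```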
Large Language Models (LLMs) have shown remarkable progress in automated code generation. Yet, LLM-generated code may contain errors in API usage, class, data structure, or missing project-specific information. As much of this project-specific context cannot fit into the prompts of LLMs, we must find ways to allow the model to explore the project-level code context. We present CoCoGen, a new code generation approach that uses compiler feedback to improve the LLM-generated code. CoCoGen first leverages static analysis to identify mismatches between the generated code and the project's context. It then iteratively aligns and fixes the identified errors using information extracted from the code repository. We integrate CoCoGen with two representative LLMs, i.e., GPT-3.5-Turbo and Code Llama (13B), and apply it to Python code generation. Experimental results show that CoCoGen significantly improves the vanilla LLMs by over 80% in generating code dependent on the project context and consistently outperforms the existing retrieval-based code generation baselines.
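In the same spirit as the approach described above, the sketch below statically finds names that the generated code loads but that neither the snippet nor the project defines, retrieves their definitions from the repository, and asks the model to realign the code. The `llm` and `lookup_definition` callables and the symbol-table handling are assumptions of this sketch, not CoCoGen's implementation.

```python
# Sketch of a static-analysis feedback loop: detect undefined project symbols
# in generated code and iteratively repair the code with repository context.
import ast
import builtins
from typing import Callable, Set

def undefined_names(code: str, project_symbols: Set[str]) -> Set[str]:
    """Rough check: names loaded but never defined locally, in builtins, or in
    the project. Ignores imports and function parameters for brevity."""
    tree = ast.parse(code)
    defined = set(project_symbols) | set(dir(builtins))
    defined |= {node.name for node in ast.walk(tree)
                if isinstance(node, (ast.FunctionDef, ast.ClassDef))}
    defined |= {node.id for node in ast.walk(tree)
                if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)}
    used = {node.id for node in ast.walk(tree)
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)}
    return used - defined

def repair_with_context(
    code: str,
    project_symbols: Set[str],
    lookup_definition: Callable[[str], str],   # repo search: symbol -> source snippet
    llm: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    for _ in range(max_rounds):
        missing = undefined_names(code, project_symbols)
        if not missing:
            break
        context = "\n".join(lookup_definition(name) for name in sorted(missing))
        code = llm(f"Fix this code to use the project API below correctly.\n"
                   f"Project context:\n{context}\n\nCode:\n{code}")
    return code
```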
The rapid adoption of Large Language Models (LLMs) for code generation has transformed software development, yet little attention has been given to how security vulnerabilities evolve through iterative LLM feedback. This paper analyzes security degradation in AI-generated code through a controlled experiment with 400 code samples across 40 rounds of "improvements" using four distinct prompting strategies. Our findings show a 37.6% increase in critical vulnerabilities after just five iterations, with distinct vulnerability patterns emerging across different prompting approaches. This evidence challenges the assumption that iterative LLM refinement improves code security and highlights the essential role of human expertise in the loop. We propose practical guidelines for developers to mitigate these risks, emphasizing the need for robust human validation between LLM iterations to prevent the paradoxical introduction of new security issues during supposedly beneficial code "improvements."
Code generation has attracted increasing attention with the rise of Large Language Models (LLMs). Many studies have developed powerful code LLMs by synthesizing code-related instruction data and applying supervised fine-tuning. However, these methods are limited by teacher model distillation and ignore the potential of iterative refinement by self-generated code. In this paper, we propose Adaptive Critique Refinement (ACR), which enables the model to refine itself by self-generated code and external critique, rather than directly imitating the code responses of the teacher model. Concretely, ACR includes a composite scoring system with LLM-as-a-Judge to evaluate the quality of code responses and a selective critique strategy with LLM-as-a-Critic to critique self-generated low-quality code responses. We develop the RefineCoder series by iteratively applying ACR, achieving continuous performance improvement on multiple code generation benchmarks. Compared to the baselines of the same size, our proposed RefineCoder series can achieve comparable or even superior performance using less data.
An Automated Driving System (ADS) is a safety-critical software system responsible for interpreting the vehicle’s environment and making decisions accordingly. The unbounded complexity of the driving context, including unforeseeable events, necessitates continuous improvement, often achieved through iterative DevOps processes. However, DevOps processes are themselves complex, making these improvements both time- and resource-intensive. Automation in code generation for ADS using Large Language Models (LLM) is one potential approach to address this challenge. Nevertheless, the development of ADS requires rigorous processes to verify, validate, assess, and qualify the code before it can be deployed in the vehicle and used. In this study, we developed and evaluated a prototype for automatic code generation and assessment using a designed pipeline of an LLM-based agent, simulation model, and rule-based feedback generator in an industrial setup. The LLM-generated code is evaluated automatically in a simulation model against multiple critical traffic scenarios, and an assessment report is provided as feedback to the LLM for modification or bug fixing. We report on the experimental results of the prototype employing Codellama:34b, DeepSeek (r1:32b and Coder:33b), CodeGemma:7b, Mistral:7b, and GPT4 for Adaptive Cruise Control (ACC) and Unsupervised Collision Avoidance by Evasive Manoeuvre (CAEM). We finally assessed the tool with 11 experts at two Original Equipment Manufacturers (OEMs) by conducting an interview study.
Large Language Models (LLMs) are showing remarkable performance in generating source code, yet the generated code often has issues like compilation errors or incorrect code. Researchers and developers often face wasted effort in implementing checks and refining LLM-generated code, frequently duplicating their efforts. This paper presents LLMLOOP, a framework that automates the refinement of both source code and test cases produced by LLMs. LLMLOOP employs five iterative loops: resolving compilation errors, addressing static analysis issues, fixing test case failures, and improving test quality through mutation analysis. These loops ensure the generation of high-quality test cases that serve as both a validation mechanism and a regression test suite for the generated code. We evaluated LLMLOOP on HumanEval-X, a recent benchmark of programming tasks. Results demonstrate the tool's effectiveness in refining LLM-generated outputs. A demonstration video of the tool is available at https://youtu.be/2CLG9x1fsNI.
We introduce a general stochastic differential equation framework for modelling multiobjective optimization dynamics in iterative Large Language Model (LLM) interactions. Our framework captures the inherent stochasticity of LLM responses through explicit diffusion terms and reveals systematic interference patterns between competing objectives via an interference matrix formulation. We validate our theoretical framework using iterative code generation as a proof-of-concept application, analyzing 400 sessions across security, efficiency, and functionality objectives. Our results demonstrate strategy-dependent convergence behaviors with rates ranging from 0.33 to 1.29, and predictive accuracy achieving R² = 0.74 for balanced approaches. This work suggests the feasibility of dynamical systems analysis for multi-objective LLM interactions, with code generation serving as an initial validation domain.
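One plausible general form for the dynamics this abstract describes, written here only as a hedged illustration and not as the paper's exact equations, is an Ornstein-Uhlenbeck-style system in which the objective scores drift toward target values, an interference matrix couples the objectives, and a diffusion term models the stochasticity of LLM responses.

```latex
% Illustrative form (an assumption, not taken from the paper): x_t stacks the
% security, efficiency, and functionality scores; x* is the target profile;
% A is the interference matrix whose off-diagonal entries couple objectives;
% Sigma dW_t captures the stochasticity of LLM responses.
\mathrm{d}x_t = -A\,(x_t - x^{*})\,\mathrm{d}t + \Sigma\,\mathrm{d}W_t,
\qquad
A = \begin{pmatrix}
a_{ss} & a_{se} & a_{sf}\\
a_{es} & a_{ee} & a_{ef}\\
a_{fs} & a_{fe} & a_{ff}
\end{pmatrix}
```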
Recent advancements in code generation have shown remarkable success across software domains, yet hardware description languages (HDLs) such as Verilog remain underexplored due to their concurrency semantics, syntactic rigidity, and simulation complexity. In this work, we address these challenges by introducing a reinforcement learning (RL) framework tailored for Verilog code generation. We first construct Veribench-53K, a high-quality dataset curated from over 700K Verilog problems, enriched with structured prompts, complexity labels, and diverse testbenches. To tackle the problem of sparse and noisy reward signals, we propose a Trace-back based Rescore mechanism that leverages reasoning paths and iterative refinement to enhance feedback reliability and support reward model training. Furthermore, to mitigate catastrophic forgetting and overfitting during RL fine-tuning, we introduce a sample-balanced weighting strategy that adaptively balances learning dynamics based on reward-probability distributions. These innovations are integrated into an iterative RL pipeline that co-evolves the policy and reward models. In contrast to recent work such as CraftRTL, which relies on large-scale closed-source model distillation, and DeepSeekstyle approaches that struggle with sparse feedback, our method demonstrates superior performance using a smaller but high-quality dataset combined with RL optimization. Experiments on Verilog generation tasks demonstrate state-of-the-art performance, with substantial gains in test pass rate, functional correctness, and compilation robustness. Our findings highlight the potential of RL-driven approaches for structured code generation in hardware-centric domains. VeriRL is publicly available at https://github.com/omniAI-Lab/VeriRL.
LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. We address Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches relying on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought-Code-Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These results establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at github.com/jahidulzaid/PyBanglaCodeActAgent.
Bangla is a low-resource language for code generation, lacking large-scale annotated datasets and tools to transform natural language specifications into executable programs. This makes Bangla-to-code generation a challenging task requiring innovative solutions. To address this, we introduce BanglaForge, a novel framework for generating code from Bangla function descriptions. BanglaForge leverages a retrieval-augmented dual-model collaboration paradigm with self-refinement, combining in-context learning, llm-based translation, systematic prompt engineering, and iterative self-refinement based on execution feedback, where a coder generates initial solutions and a reviewer enhances them for robustness. On the BLP-2025 Bangla Code Generation benchmark, BanglaForge achieves a competitive Pass@1 accuracy of 84.00%, demonstrating the effectiveness of retrieval, model collaboration, and self-refinement for low-resource Bangla code generation.
Large Language Model (LLM) agents have gained popularity for constructing complex coding pipelines that incorporate multiple tools, iterative processes, and advanced libraries such as LlamaIndex and LangChain. Recognizing the critical need for observability, we emphasize techniques to monitor and control the process, costs, frequency of model calls, prompt usage, and overall logging. While contemporary agent libraries offer foundational classes for these purposes, many online examples and articles still rely on default implementations with ChatGPT models, allowing agents to invoke generative AI methods without robust flow control, prompt management, or detailed application logging. In our research, we identified effective ways to implement essential observability methods for popular frameworks, including LlamaIndex, LangChain, and CreAI. Our approach introduces workflows, custom model calls, comprehensive logging, and enhanced observability, which improve the reliability, controllability, predictability, and efficiency of LLM agents in software development applications. Through this methodology, we enhanced the architecture and structure of the evolving agentic framework for code generation, achieving a notable 98% accuracy in solving challenging coding tasks via specified workflows.
In this study, we propose VibeCodeHPC, a multi-agent system based on large language models (LLMs) for the automatic tuning of high-performance computing (HPC) programs on supercomputers. VibeCodeHPC adopts Claude Code as its backend and provides an integrated environment that facilitates program development in supercomputer settings. The system not only brings the Vibe Coding paradigm -- program development through natural language interaction with users -- to HPC programming, but also enables autonomous performance optimization with minimal user intervention through a sophisticated multi-agent design. To achieve these objectives, VibeCodeHPC implements three core functionalities: (1) configuration capabilities tailored to the unique development environments of supercomputers, (2) collaborative operation among multiple LLM agents with distinct roles -- Project Manager (PM), System Engineer (SE), Programmer (PG), and Continuous Deliverer (CD), and (3) long-term autonomous operation through agent activity monitoring and dynamic deployment mechanisms. This paper highlights one of the most powerful features of VibeCodeHPC: fully automated code optimization through autonomous operation without user intervention. Specifically, it demonstrates the performance optimization of CPU-based codes on GPU-equipped systems for matrix multiplication and a Poisson equation solver using Jacobi's iterative method. The results show that the multi-agent configuration employed in VibeCodeHPC enables faster and more reliable development of higher-performance code compared to a single-agent setup.
It is challenging to generate the code for a complete user interface using a Large Language Model (LLM). User interfaces are complex and their implementations often consist of multiple, inter-related files that together specify the contents of each screen, the navigation flows between the screens, and the data model used throughout the application. It is challenging to craft a single prompt for an LLM that contains enough detail to generate a complete user interface, and even then the result is frequently a single large and intricate file that contains all of the generated screens. In this paper, we introduce Athena, a prototype application generation environment that demonstrates how the use of shared intermediate representations, including an app storyboard, data model, and GUI skeletons, can help a developer work with an LLM in an iterative fashion to craft a complete user interface. These intermediate representations also scaffold the LLM’s code generation process, producing organized and structured code in multiple files while limiting errors. We evaluated Athena with a user study with 12 developers. Participants appreciated Athena’s support for prototyping multi-screen iOS apps, acknowledged that the intermediate representations improved their control and understanding of generated code, and discussed the limitations of the system and potential directions for improvement.
Large Language Models (LLMs) have revolutionized code generation but require significant resources and often over-generalize, limiting their task-specific efficiency. Fine-tuning smaller, open-source LLMs provides a cost-effective alternative. However, standard supervised approaches rely only on correct examples, missing valuable insights from failures. We introduce CodeLutra, a framework that leverages both correct and incorrect code attempts. Instead of using only correct solutions, CodeLutra applies iterative preference-based refinement, comparing successful and failed outputs to better approximate desired results. This approach narrows the performance gap with state-of-the-art larger models without requiring massive datasets or auxiliary models. For instance, on a challenging data science coding task, using only 500 samples improved Llama-3-8B's accuracy from 28.2% to 48.6%, approaching GPT-4's level. By learning from both successes and mistakes, CodeLutra provides a scalable and efficient path to high-quality code generation, making smaller open-source models more competitive with leading closed-source alternatives.
Large Language Models (LLMs) are widely used for tasks such as natural language and code generation, but their outputs often suffer from issues like hallucination, toxicity, and incorrect results. Current libraries for structured LLM generation rely on left-to-right decoding without support for backtracking, limiting the ability to correct or refine outputs mid-generation. To address this, we introduce IterGen, a user-friendly library for iterative, grammar-guided LLM generation that enables users to move both forward and backward within the generated output based on grammar symbols. By leveraging a symbol-to-position mapping and maintaining the key-value (KV) cache state, IterGen ensures efficient and structured generation while allowing for corrections during the process. We demonstrate IterGen's effectiveness in two important applications: reducing privacy leakage in LLM outputs and improving the accuracy of LLM-generated SQL and Vega-Lite queries. Our code and additional resources are available at https://structuredllm.com.
Prompt engineering reduces reasoning mistakes in Large Language Models (LLMs). However, its effectiveness in mitigating vulnerabilities in LLM-generated code remains underexplored. To address this gap, we implemented a benchmark to automatically assess the impact of various prompt engineering strategies on code security. Our benchmark leverages two peer-reviewed prompt datasets and employs static scanners to evaluate code security at scale. We tested multiple prompt engineering techniques on GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. Our results show that for GPT-4o and GPT-4o-mini, a security-focused prompt prefix can reduce the occurrence of security vulnerabilities by up to 56%. Additionally, all tested models demonstrated the ability to detect and repair between 41.9% and 68.7% of vulnerabilities in previously generated code when using iterative prompting techniques. Finally, we introduce a "prompt agent" that demonstrates how the most effective techniques can be applied in real-world development workflows.
Large language models (LLMs) have revolutionized automated code generation, yet the evaluation of their real-world effectiveness remains limited by static benchmarks and simplistic metrics. We present ProxyWar, a novel framework that systematically assesses code generation quality by embedding LLM-generated agents within diverse, competitive game environments. Unlike existing approaches, ProxyWar evaluates not only functional correctness but also the operational characteristics of generated programs, combining automated testing, iterative code repair, and multi-agent tournaments to provide a holistic view of program behavior. Applied to a range of state-of-the-art coders and games, our approach uncovers notable discrepancies between benchmark scores and actual performance in dynamic settings, revealing overlooked limitations and opportunities for improvement. These findings highlight the need for richer, competition-based evaluation of code generation. Looking forward, ProxyWar lays a foundation for research into LLM-driven algorithm discovery, adaptive problem solving, and the study of practical efficiency and robustness, including the potential for models to outperform hand-crafted agents. The project is available at https://github.com/xinke-wang/ProxyWar.
The Message Passing Interface (MPI) standard plays a crucial role in enabling scientific applications for parallel computing and is an essential component in high-performance computing (HPC). However, implementing MPI code manually—especially applying a proper domain decomposition and communication pattern—is a challenging and error-prone task. We present ChatMPI, an AI assistant for MPI parallelization of sequential C codes. In our analysis, we focus on testing six essential HPC workloads, which are based on Basic Linear Algebra Subprograms levels 1, 2, and 3 as well as sparse, stencil, and iterative operations. We analyze the process of creating ChatMPI by using the ChatHPC library. This lightweight large language model (LLM)–based infrastructure enables HPC experts to efficiently create and supervise trustworthy AI capabilities for critical HPC software tasks. We study the data required for training (fine-tuning) ChatMPI to generate parallel codes that not only use MPI syntax correctly but also apply HPC techniques to reduce memory communication and maximize performance by using proper work decomposition. With a relatively small training dataset composed of a few dozen prompts and fewer than 15 minutes of fine-tuning on one node equipped with two NVIDIA H100 GPUs, ChatMPI elevates trustworthiness for MPI code generation of current LLMs (e.g., Code Llama, ChatGPT-4o, and ChatGPT-5). Additionally, we evaluate the performance of the MPI codes generated by ChatMPI in comparison with the ones generated by ChatGPT-4o and ChatGPT-5. The codes generated by ChatMPI provide up to a 4× boost in performance by using better problem decomposition, communication patterns, and HPC techniques (e.g., communication avoiding).
Designing Verilog modules requires meticulous attention to correctness, efficiency, and adherence to design specifications. However, manually writing Verilog code remains a complex and time-consuming task that demands both expert knowledge and iterative refinement. Leveraging recent advancements in large language models (LLMs) and their structured text generation capabilities, we propose VeriMind, an agentic LLM framework for Verilog code generation that significantly automates and optimizes the synthesis process. Unlike traditional LLM-based code generators, VeriMind employs a structured reasoning approach: given a user-provided prompt describing design requirements, the system first formulates a detailed train of thought before the final Verilog code is generated. This multi-step methodology enhances interpretability, accuracy, and adaptability in hardware design. In addition, we introduce a novel evaluation metric, pass@ARC, which combines the conventional pass@k measure with Average Refinement Cycles (ARC) to capture both success rate and the efficiency of iterative refinement. Experimental results on diverse hardware design tasks demonstrated that our approach achieved up to 8.3% improvement on the pass@k metric and 8.1% on the pass@ARC metric. These findings underscore the transformative potential of agentic LLMs in automated hardware design, RTL development, and digital system synthesis.
Large language models (LLMs) have achieved impressive performance on code generation. Although prior studies enhanced LLMs with prompting techniques and code refinement, they still struggle with complex programming problems due to rigid solution plans. In this paper, we draw on pair programming practices to propose PairCoder, a novel LLM-based framework for code generation. PairCoder incorporates two collaborative LLM agents, namely a Navigator agent for high-level planning and a Driver agent for specific implementation. The Navigator is responsible for proposing promising solution plans, selecting the current optimal plan, and directing the next iteration round based on execution feedback. The Driver follows the guidance of the Navigator to undertake initial code generation, code testing, and refinement. This interleaved and iterative workflow involves multi-plan exploration and feedback-based refinement, which mimics the collaboration of pair programmers. We evaluate PairCoder with both open-source and closed-source LLMs on various code generation benchmarks. Extensive experimental results demonstrate the superior accuracy of PairCoder, achieving relative pass@1 improvements of 12.00%–162.43% compared to prompting LLMs directly.
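The Navigator/Driver division of labor amounts to a plan-propose-implement-feedback loop. The following is a rough sketch under assumed `navigator`, `driver`, and `run_tests` callables; it is not the authors' released code, only an illustration of the interleaved workflow.

```python
from typing import Callable, List, Tuple

def pair_coder_loop(
    problem: str,
    navigator: Callable[[str, str], List[str]],     # proposes/ranks plans from feedback (assumed)
    driver: Callable[[str, str], str],              # implements a plan as code (assumed)
    run_tests: Callable[[str], Tuple[bool, str]],   # returns (passed, feedback) (assumed)
    max_iters: int = 5,
) -> str:
    feedback = ""
    code = ""
    for _ in range(max_iters):
        plans = navigator(problem, feedback)   # multi-plan exploration
        plan = plans[0]                        # current optimal plan
        code = driver(problem, plan)           # initial implementation by the Driver
        passed, feedback = run_tests(code)     # execution feedback
        if passed:
            return code
        # otherwise the Navigator sees the feedback and redirects the next round
    return code
```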
Test-time scaling without interpreter feedback is essential for real-world code generation scenarios where test cases are not readily available. While existing paradigms often rely on either greedy exploitation (i.e., iterative refinement) or stochastic exploration (i.e., relying on sample-based voting or reranking mechanisms), the balance between these two dimensions remains underexplored. To investigate the LLM's intrinsic ability to balance exploitation and exploration, we introduce SELF-REDRAFT, a framework built upon Self-Refine that encourages the model to propose new drafts for solutions that are fundamentally flawed. Our results show that SELF-REDRAFT consistently achieves better performance than Self-Refine when converged under the same maximum number of iterations. Still, we observe that significant room for improvement remains, largely due to two core aspects of current self-redraft capabilities: constrained capacity for generating instructive feedback and fragile discriminative judgment. We also find that balancing strategies vary notably across different LLMs, reflecting distinct, model-specific behaviors. Overall, our study establishes a baseline for intrinsic exploration-exploitation balancing in test-time scaling and identifies feedback and discrimination as key areas with potential for future advances.
AI coding assistants powered by large language models (LLMs) have transformed software development, significantly boosting productivity. While existing benchmarks evaluate the correctness and security of LLM-generated code, they are typically limited to single-turn tasks that do not reflect the iterative nature of real-world development. We introduce MT-Sec, the first benchmark to systematically evaluate both correctness and security in multi-turn coding scenarios. We construct this using a synthetic data pipeline that transforms existing single-turn tasks into semantically aligned multi-turn interaction sequences, allowing reuse of original test suites while modeling the complexity of real-world coding processes. We evaluate 32 open- and closed-source models, and three agent scaffoldings, on MT-Sec and observe a consistent 20-27% drop in "correct and secure" outputs from single-turn to multi-turn settings, even among state-of-the-art models. Beyond full-program generation, we also evaluate models on multi-turn code-diff generation, an unexplored yet practically relevant setting, and find that models perform worse here, with increased rates of functionally incorrect and insecure outputs. Finally, we find that while agent scaffoldings boost single-turn code generation performance, they are not quite as effective in multi-turn evaluations. Together, these findings highlight the need for benchmarks that jointly evaluate correctness and security in multi-turn, real-world coding workflows.
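The benchmark's construction step, splitting a single-turn specification into a semantically aligned sequence of turns while reusing the original tests, can be illustrated roughly. The naive sentence-splitting decomposition below is a placeholder only; the paper uses a synthetic data pipeline, not this heuristic.

```python
from typing import Dict

def to_multi_turn(task: Dict, n_turns: int = 3) -> Dict:
    """Turn a single-turn coding task into an ordered list of user turns.
    A real pipeline would decompose the spec semantically (e.g., with an LLM);
    here the requirement sentences are simply split as a stand-in."""
    sentences = [s.strip() for s in task["spec"].split(".") if s.strip()]
    step = max(1, len(sentences) // n_turns)
    turns = [". ".join(sentences[i:i + step]) + "."
             for i in range(0, len(sentences), step)]
    return {
        "turns": turns,           # presented to the model one at a time
        "tests": task["tests"],   # the original test suite is reused unchanged
    }
```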
Despite limited success in large language model (LLM)-based register-transfer-level (RTL) code generation, the root causes of errors remain poorly understood. To address this, we conduct a comprehensive error analysis, finding that most failures arise not from deficient reasoning, but from a lack of RTL programming knowledge, insufficient circuit understanding, ambiguous specifications, or misinterpreted multimodal inputs. Leveraging in-context learning, we propose targeted correction techniques: a retrieval-augmented generation (RAG) knowledge base to supply domain expertise; design description rules with rule-checking to clarify inputs; external tools to convert multimodal data into LLM-compatible formats; and an iterative simulation-debugging loop for remaining errors. Integrating these into an LLM-based framework yields significant improvement, achieving 98.1% accuracy on the VerilogEval benchmark with DeepSeek-v3.2-Speciale, demonstrating the effectiveness of our approach.
The emergence of Large Language Models (LLMs) has transformed software development, enabling automated coding for serverless computing. However, several challenges remain regarding the correctness, performance, cost-efficiency, and code quality of automatically generated functions. The proposed work first characterizes the variable nature of serverless code generation with respect to these key challenges. We then propose the creation of new agentic LLM systems to generate, optimize, and adapt serverless code to improve its overall quality. We will develop approaches that provide iterative feedback between LLMs and serverless platforms (for performance assessment) and static code analysis (for code quality assessment) by leveraging LLM interoperability protocols. This research will ultimately contribute novel, performance-aware, and cross-platform generative AI tools for serverless computing.
Large Language Model (LLM) agents have recently shown promise in automating complex software development workflows, integrating multiple tools and iterative processes. However, existing agentic frameworks often fall short on observability (robust logging, cost and workflow monitoring, and prompt transparency), leading to unpredictable behavior and high operational overhead. In this paper, we introduce a unified, observability-driven multi-agent approach that integrates customizable LLM wrappers, explicit flow control (via LangGraph), and thorough logging into a cloud-native NoSQL DynamoDB database. Beyond describing a purely engineering-focused solution, we demonstrate via testing results that our observability enhancements yield improvements in reliability, performance, and cost control on standard benchmarks (LeetCode, MBPP, and HumanEval), with careful model selection bringing the cost down to $0.01 per correctly generated solution. We also show that carefully orchestrated role-specialized agents, combined with real-time prompt and cost tracking, can achieve state-of-the-art performance (up to 98% accuracy) on diverse and complex code generation tasks. Our findings emphasize how bridging best-practice engineering with multi-agent AI research can substantially improve the cost, reliability, and scalability of LLM-based development pipelines.
Conversational large-language models (LLMs), such as ChatGPT, are extensively used for issue resolution tasks, particularly for generating ideas to implement new features or resolve bugs. However, not all developer-LLM conversations are useful for effective issue resolution and it is still unknown what makes some of these conversations not helpful. In this paper, we analyze 686 developer-ChatGPT conversations shared within GitHub issue threads to identify characteristics that make these conversations effective for issue resolution. First, we empirically analyze the conversations and their corresponding issue threads to distinguish helpful from unhelpful conversations. We begin by categorizing the types of tasks developers seek help with (e.g., code generation, bug identification and fixing, test generation), to better understand the scenarios in which ChatGPT is most effective. Next, we examine a wide range of conversational, project, and issue-related metrics to uncover statistically significant factors associated with helpful conversations. Finally, we identify common deficiencies in unhelpful ChatGPT responses to highlight areas that could inform the design of more effective developer-facing tools. We found that only 62% of the ChatGPT conversations were helpful for successful issue resolution. Among different tasks related to issue resolution, ChatGPT was most helpful in assisting with code generation, and tool/library/API recommendations, but struggled with generating code explanations. Our conversational metrics reveal that helpful conversations are shorter, more readable, and exhibit higher semantic and linguistic alignment. Our project metrics reveal that larger, more popular projects and experienced developers benefit more from ChatGPT’s assistance. Our issue metrics indicate that ChatGPT is more effective on simpler issues characterized by limited developer activity and faster resolution times. These typically involve well-scoped technical problems such as compilation errors and tool feature requests. In contrast, it performs less effectively on complex issues that demand deep project-specific understanding, such as system-level code debugging and refactoring. The most common deficiencies in unhelpful ChatGPT responses include incorrect information and lack of comprehensiveness. Our findings have wide implications including guiding developers on effective interaction strategies for issue resolution, informing the development of tools or frameworks to support optimal prompt design, and providing insights on fine-tuning LLMs for issue resolution tasks.
Software Development (SD) is remarkably dynamic and is critically dependent on the knowledge acquired by the project's software developers as the project progresses. Software developers need to understand large amounts of information related to the tasks at hand. This information (context) is often not explicit, as it can be lost in large documentation repositories, a team member's brain, or beyond their cognitive memory capacity. These contexts include tool features, integration strategies, data structures, code syntax, approaches to tasks, project definitions, and even implicit or tacit contexts, which add significant complexity to the SD process. Current software development practices still lack techniques that use existing SD execution information and context to provide developers with relevant process guidance, augmenting their capacity to do their job with the applicable information available. This paper presents ongoing and future research on an approach to support conversational agent-based, knowledge-augmented software development. Developers benefit by receiving recommendations about task-related information and the workflows they need to execute. This work advances human-computer interaction patterns in workflow engines, from graphical user interfaces to conversational patterns in software engineering.
No abstract available
Recently, the demand for remote counseling using conversational agents built on Large Language Models (LLMs) has been on the rise in many areas. This LLM trend is expanding beyond generating simple words to constructing complex sentences, showing progress in various fields. The significant development of LLMs is attributed to progress in Natural Language Processing (NLP) technology, built upon extensive language data. This study introduces client technology for an Open API-based conversational agent using an LLM. The target agent is a chatbot that can generate health counseling messages, including empathetic conversations, for users managing chronic conditions that require ongoing health management. The overall system can be linked with the community care system through user apps and APIs for digital healthcare context management. The advantage of the REST API client is that APIs make it easy to integrate new applications with existing software systems, allowing target systems to meet requirements across a variety of platforms.
No abstract available
The demand for quick development cycles and the growing complexity of software systems support the need for intelligent automation in coding processes. Conventional software development approaches are time-consuming, error-prone, and usually unable to satisfy high standards. This work presents an automated code-generating method based on GPT-based language models to increase software development efficiency. The approach combines prompt engineering techniques and reinforcement learning strategies with fine-tuning a pre-trained GPT model on various high-quality programming datasets to optimize the relevance and accuracy of output. A validation module checks the generated code for both syntactic and semantic correctness. Based on experimental data, the method reduces development time by as much as 40% compared to hand coding and achieves an average code correctness rate of 88% over several programming languages. The results show that, with little to no human input, the model can produce code that fits the particular context and is functional. All things considered, GPT-based models, when customized to specific software domains and reinforced with validation systems, can transform future software development methods. They offer a dependable, customized, reasonably priced automated code-generating solution.
No abstract available
Test cases are essential for validating the reliability and quality of software applications. Recent studies have demonstrated the capability of Large Language Models (LLMs) to generate useful test cases for given source code. However, the existing work primarily relies on human-written plain prompts, which often leads to suboptimal results since the performance of LLMs can be highly influenced by the prompts. Moreover, these approaches use the same prompt for all LLMs, overlooking the fact that different LLMs might be best suited to different prompts. Given the wide variety of possible prompt formulations, automatically discovering the optimal prompt for each LLM presents a significant challenge. Although automated prompt optimization methods exist in the natural language processing field, they struggle to produce effective prompts for the test case generation task. First, these methods iteratively optimize prompts by simply combining and mutating existing ones without proper guidance, resulting in prompts that lack diversity and tend to repeat the same errors in the generated test cases. Second, the prompts generally lack domain contextual knowledge, limiting LLMs' performance on the task.
Large Code Models (LCMs) have demonstrated potential in advancing various code intelligence tasks. However, their effectiveness can be greatly influenced by the quality of the prompts. Current prompt design strategies in code intelligence studies are mostly manually generated, which can be time-consuming and depends heavily on the base LCMs and tasks. Although automated prompt generation (APG) has been investigated in the natural language processing field, it has not attracted sufficient attention or been well explored for code intelligence tasks. Considering the various tasks and the black-box nature of LCMs faced by developers in practice, it is essential to automate the prompt generation process. To mitigate the gap, we empirically investigate the two important parts of APG: Instruction Generation (IG) and Multi-Step Reasoning (MSR). The instruction generation part aims at providing a task-related description for instructing LCMs to effectively accomplish specific tasks, while the multi-step reasoning part aims at guiding LCMs to produce a series of logical steps before arriving at the final answer. For each part, we evaluate the widely-used APG methods on four open-source LCMs and three code intelligence tasks, i.e., code translation (PL-PL), code summarization (PL-NL), and API recommendation (NL-PL). Experimental results indicate that the two parts of APG can dramatically enhance the performance of code intelligence tasks compared with basic prompts. Based on the results, we further propose a novel APG approach by combining the best methods of the two studied parts of APG. Experiments show that the proposed APG approach achieves an average improvement of 28.38% with respect to CodeBLEU for code translation, 58.11% in terms of ROUGE-L for code summarization, and 84.53% in SuccessRate@1 for API recommendation over basic prompts, respectively. To validate the effectiveness in an industrial scenario, we further evaluate our approach on WeChat-Bench, a proprietary dataset from the WeChat Group in Tencent for API recommendation, achieving an average improvement of 148.89% in MRR.
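The two APG parts studied here compose naturally into a single prompt: a generated task instruction followed by a multi-step reasoning scaffold. The schematic sketch below is an assumption-laden illustration; the prompt wording and the `generate_instruction` callable are placeholders, not the paper's templates.

```python
from typing import Callable

def build_apg_prompt(
    task_example: str,
    generate_instruction: Callable[[str], str],   # IG: derive a task description (assumed)
) -> str:
    # Instruction Generation: ask a model to describe the task family
    instruction = generate_instruction(
        "Write a one-paragraph instruction for a model that must solve tasks like:\n"
        + task_example
    )
    # Multi-Step Reasoning: a fixed scaffold of logical steps before the answer
    reasoning_scaffold = (
        "Solve the task in steps:\n"
        "1. Restate what the input and expected output are.\n"
        "2. Outline the transformation or mapping required.\n"
        "3. Produce the final answer only after the steps above.\n"
    )
    return instruction + "\n\n" + reasoning_scaffold + "\nTask:\n" + task_example
```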
Automated code generation using large language models (LLMs) has attracted significant attention due to its potential to enhance software development. However, ensuring both accuracy and efficiency in generated code remains challenging. Prior research has mainly advanced along two directions: (i) enhancing models through architectural improvements, larger parameter scaling, and domain-specific fine-tuning; and (ii) refining prompt engineering techniques to better structure inputs and guide outputs. In this work, we pursue the latter direction and introduce a prompt engineering–based approach for Java code generation. Rather than directly generating Java code from natural language specifications, we propose a two-step pipeline: (i) generating intermediate Python code and (ii) translating Python into Java. This design leverages the strong performance of LLMs on Python while enabling systematic optimization of the translation stage. To achieve this, we propose a set of translation strategies combining prompt engineering principles—including explicit instructions, syntax guidance, and domain keyword constraints—with advanced reasoning strategies such as Zero-shot Chain of Thought (Zero-shot-CoT) to efficiently generate Java code. Experiments on the HumanEval-X benchmark using the CodeGeeX3 model show that the proposed strategies significantly improve the accuracy of Java code generation. We further evaluate across diverse programming tasks, including file operations, HTTP APIs, database connectivity, parallel computing, and graphical applications, confirming the robustness of our approach. Finally, we validate the generality of our findings using ChatGPT (GPT-4o), observing substantial improvements over baseline prompt designs.
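The two-step pipeline reduces to two prompts: one producing intermediate Python, one performing constrained translation to Java with Zero-shot-CoT. A minimal sketch with an assumed `generate` callable; the prompt texts and constraints are illustrative, not the paper's templates.

```python
from typing import Callable

def python_then_java(spec: str, generate: Callable[[str], str]) -> str:
    # Step 1: exploit the model's stronger performance on Python
    python_code = generate(
        "Write a correct, self-contained Python solution for:\n" + spec
    )
    # Step 2: constrained translation with syntax guidance and Zero-shot-CoT
    java_prompt = (
        "Translate the following Python program into Java.\n"
        "Constraints: use a public class Solution, keep method names, "
        "use explicit types, and handle checked exceptions.\n"
        "Let's think step by step before writing the final Java code.\n\n"
        + python_code
    )
    return generate(java_prompt)
```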
Recent developments in large language models (LLMs) are changing automated code generation, yet performance, explainability, and output consistency remain difficult to achieve. This study investigates how AI-augmented prompt engineering, as practiced in ML and DL, can improve the plausibility and interpretability of code generation. It applies a hybrid ML-DL system: a Prompt Optimization Model (POM) using Random Forest classifiers to rank the relevance of prompts, and a Transformer-based Deep Code Model (DCM) trained on a verified programming dataset in Python, Java, and C++. It demonstrates improvements over current baseline LLMs in code correctness (by 16.3%), semantic stability (by 11.8%), and explainability metrics (by 19.7%). The results indicate that optimized prompt engineering methods have the potential to enhance the predictive ability and interpretability of AI-produced code. The paper concludes with recommendations on how explainable AI (XAI) approaches should be incorporated into near-term design processes.
PSD2Code: Automated Front-End Code Generation from Design Files via Multimodal Large Language Models
Design-to-code generation has emerged as a promising approach to bridge the gap between design prototypes and deployable frontend code. However, existing methods often suffer from structural inconsistencies, asset misalignment, and limited production readiness. This paper presents PSD2Code, a novel multi-modal approach that leverages PSD file parsing and asset alignment to generate production-ready React+SCSS code. Our method introduces a ParseAlignGenerate pipeline that extracts hierarchical structures, layer properties, and metadata from PSD files, providing large language models with precise spatial relationships and semantic groupings for frontend code generation. The system employs a constraint-based alignment strategy that ensures consistency between generated elements and design resources, while a structured prompt construction enhances controllability and code quality. Comprehensive evaluation demonstrates significant improvements over existing methods across multiple metrics including code similarity, visual fidelity, and production readiness. The method exhibits strong model independence across different large language models, validating the effectiveness of integrating structured design information with multimodal large language models for industrial-grade code generation, marking an important step toward design-driven automated frontend development.
Large Language Models (LLMs) demonstrate potential in code generation capabilities, yet their applicability in autonomous vehicle control has not been sufficiently explored. This study verifies whether LLMs can generate executable MATLAB code for software-defined vehicle scenarios, comparing five models: GPT-4, Gemini 2.5 Pro, Claude Sonnet 4.0, CodeLlama-13B-Instruct, and StarCoder2. Thirteen standardised prompts were applied across three types of scenarios: programming-based driving scenarios, inertial sensor-based simulations, and vehicle parking scenarios. Multiple automated evaluation metrics—BLEU, ROUGE-L, ChrF, Spec-Compliance, and Runtime-Sanity—were used to assess code executability, accuracy, and completeness. The results showed GPT-4 achieved the highest score of 0.54 in the parking scenario with an overall average score of 0.27, followed by Gemini 2.5 Pro at 0.26. Commercial models demonstrated over 60% execution success rates across all scenarios, whereas open-source models like CodeLlama and StarCoder2 were limited to under 20%. Furthermore, the parking scenario yielded the lowest average score of 0.19, confirming that complex tasks involving sensor synchronisation and trajectory control represent a common limitation across all models. This study presents a new benchmark for quantitatively evaluating the quality of SDV control code generated by LLMs, empirically demonstrating that prompt design and task complexity critically influence model reliability and real-world applicability.
Smart contracts are important for digital finance, yet they are hard to patch once deployed. Prior work mostly studies LLMs for vulnerability detection, leaving their automated exploit generation (AEG) capability unclear. This paper closes that gap with ReX, a framework that links LLM-based exploit synthesis to the Foundry stack for end-to-end generation, compilation, execution, and verification. Five recent LLMs are evaluated across eight common vulnerability classes, supported by a curated dataset of 38+ real incident PoCs and three automation aids: prompt refactoring, a compiler feedback loop, and templated test harnesses. Results indicate strong performance on single-contract PoCs and weak performance on cross-contract attacks; outcomes depend mainly on the model and bug type, with code structure and prompt tuning contributing little. The study also surfaces gaps in current defenses against LLM-driven AEG, pointing to the need for stronger protections.
Large Language Models (LLMs) have demonstrated impressive performance in software engineering tasks. However, improving their accuracy in generating correct and reliable code remains challenging. Numerous prompt engineering techniques (PETs) have been developed to address this, but no single approach is universally optimal. Selecting the right PET for each query is difficult for two primary reasons: (1) interactive prompting techniques may not consistently deliver the expected benefits, especially for simpler queries, and (2) current automated prompt engineering methods lack adaptability and fail to fully utilize multi-stage responses. To overcome these challenges, we propose PET-Select, a PET-agnostic selection model that uses code complexity as a proxy to classify queries and select the most appropriate PET. By incorporating contrastive learning, PET-Select effectively distinguishes between simple and complex problems, allowing it to choose PETs that are best suited for each query's complexity level. Our evaluations on the MBPP and HumanEval benchmarks using GPT-3.5 Turbo and GPT-4o show up to a 1.9% improvement in pass@1 accuracy, along with a 74.8% reduction in token usage. Additionally, we provide both quantitative and qualitative results to demonstrate how PET-Select effectively selects the most appropriate techniques for each code generation query, further showcasing its efficiency in optimizing PET selection.
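PET-Select's routing idea, cheap techniques for simple queries and interactive ones for complex queries, can be caricatured with a crude complexity proxy. The thresholds, keyword heuristic, and technique names below are placeholders; the actual system learns this mapping with contrastive learning rather than hand-written rules.

```python
def select_pet(query: str) -> str:
    """Route a code-generation query to a prompt engineering technique
    using a naive complexity proxy (length and keyword heuristics)."""
    complexity_markers = ("recursion", "concurrency", "optimize", "parse",
                          "graph", "dynamic programming", "thread")
    score = len(query.split()) / 50.0
    score += sum(1 for m in complexity_markers if m in query.lower())
    if score < 1.0:
        return "zero-shot"     # cheap, low-token technique for simple queries
    if score < 2.0:
        return "few-shot"      # moderate complexity
    return "self-refine"       # interactive, multi-stage technique for hard queries
```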
Automated code generation can be a powerful technique for software development, significantly reducing developers' efforts and time required to create new code by generating it automatically based on requirements. Recently, OpenAI's language model ChatGPT has emerged as a powerful tool for generating human-like responses to a wide range of textual inputs (i.e., prompts), including those related to code generation. However, the effectiveness of ChatGPT for code generation is not well understood, and the generation performance could be heavily influenced by the choice of prompt. To answer these questions, we conducted experiments using the CodeXGlue dataset to evaluate ChatGPT's capabilities for two code generation tasks, including text-to-code and code-to-code generation. We designed prompts by leveraging the chain-of-thought strategy with multi-step optimizations. Our results showed that by carefully designing prompts to guide ChatGPT, the generation performance can be improved substantially. We also analyzed the factors that influenced the prompt design and provided insights that could guide future research.
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing. Among the myriad of applications that benefit from LLMs, automated code generation is increasingly promising. The potential to transform natural language prompts into executable code promises a major shift in software development practices and paves the way for significant reductions in manual coding efforts and the likelihood of human-induced errors. This paper reports the results of a study that evaluates the performance of various LLMs, such as Bard, ChatGPT-3.5, ChatGPT-4, and Claude-2, in generating Python for coding problems. We focus on how levels of prompt specificity impact the accuracy, time efficiency, and space efficiency of the generated code. A benchmark of 104 coding problems, each with four types of prompts with varying degrees of tests and specificity, was employed to examine these aspects comprehensively. Our results indicate significant variations in performance across different LLMs and prompt types, and its key contribution is to reveal the ideal prompting strategy for creating accurate Python functions. This study lays the groundwork for further research in LLM capabilities and suggests practical implications for utilizing LLMs in automated code generation tasks and test-driven development.
Large Language Models (LLMs) have demonstrated impressive capabilities in automated code generation but frequently produce code that fails formal verification, an essential requirement for hardware and safety-critical domains. To overcome this fundamental limitation, we previously proposed PREFACE, a model-agnostic framework based on reinforcement learning (RL) that iteratively repairs the prompts provided to frozen LLMs, systematically steering them toward generating formally verifiable Dafny code without costly fine-tuning. This work presents Proof2Silicon, a novel end-to-end synthesis framework that embeds the previously proposed PREFACE flow to enable the generation of correctness-by-construction hardware directly from natural language specifications. Proof2Silicon operates by: (1) leveraging PREFACE's verifier-driven RL agent to optimize prompt generation iteratively, ensuring Dafny code correctness; (2) automatically translating verified Dafny programs into synthesizable high-level C using Dafny's Python backend and PyLog; and (3) employing Vivado HLS to produce RTL implementations. Evaluated rigorously on a challenging 100-task benchmark, PREFACE's RL-guided prompt optimization consistently improved Dafny verification success rates across diverse LLMs by up to 21%. Crucially, Proof2Silicon achieved an end-to-end hardware synthesis success rate of up to 72%, generating RTL designs through Vivado HLS synthesis flows. These results demonstrate a robust, scalable, and automated pipeline for LLM-driven, formally verified hardware synthesis, bridging natural-language specification and silicon realization.
Context: Software developers often ask questions on Technical Q&A forums like Stack Overflow (SO) to seek solutions to their programming-related problems (e.g., errors and unexpected behavior of code). Problem: Many questions miss required code snippets due to the lack of readily available code, time constraints, employer restrictions, confidentiality concerns, or uncertainty about what code to share. Unfortunately, missing but required code snippets prevent questions from getting prompt and appropriate solutions. Objective: We plan to introduce GENCNIPPET, a tool designed to integrate with SO's question submission system. GENCNIPPET will generate relevant code examples (when required) to support questions for their timely solutions. Methodology: We first downloaded the SO April 2024 data dump, which contains 1.94 million questions related to Python that have code snippets and 1.43 million questions related to Java. Then, we filter these questions to identify those that genuinely require code snippets using a state-of-the-art machine learning model. Next, we select questions with positive scores to ensure high-quality data. Our plan is to fine-tune Llama-3 models (e.g., Llama-3-8B), using 80% of the selected questions for training and 10% for validation. The primary reasons for choosing Llama models are their open-source accessibility and robust fine-tuning capabilities, which are essential for deploying a freely accessible tool. GENCNIPPET will be integrated with the SO question submission system as a browser plugin. It will communicate with the fine-tuned model to generate code snippets tailored to the target questions. The effectiveness of the generated code examples will be assessed using automatic evaluation against ground truth, user perspectives, and live (wild) testing in real-world scenarios.
By enabling natural-language-to-code translation, recent advances in Large Language Models (LLMs) like Gemini, GPT-4, and Codex have revolutionized human-computer interaction. Despite these advancements, the quality, maintainability, and dependability of generated code are still inconsistent, primarily due to the way prompts are phrased. Prompt sensitivity becomes a major constraint in Low-Code/No-Code (LC/NC) systems, where automation and accessibility are given top priority. This review study compiles the literature on prompt engineering as the primary method for coordinating user intent with executable logic in AI-driven code development. It examines current frameworks for automated refining, contextual enrichment, iterative feedback, and structured prompting [1]–[5]. The study also looks at how these techniques might be integrated into LC/NC ecosystems to produce development workflows that are flexible and suitable for production. Through comparative analysis of works published between 2023 and 2025, we find that systematic prompt engineering, which bridges the semantic gap between natural language and executable syntax, is the primary enabler of deterministic, high-fidelity AI code generation.
Automated code generation is gaining significant importance in intelligent computer programming and system deployment. However, current approaches often face challenges in computational efficiency and lack robust mechanisms for code parsing and error correction. In this work, we propose a novel framework, PyCapsule, with a simple yet effective two-agent pipeline and efficient self-debugging modules for Python code generation. PyCapsule features sophisticated prompt inference, iterative error handling, and case testing, ensuring high generation stability, safety, and correctness. Empirically, PyCapsule achieves up to 5.7% improvement of success rate on HumanEval, 10.3% on HumanEval-ET, and 24.4% on BigCodeBench compared to state-of-the-art methods. We also observe a decrease in normalized success rate given more self-debugging attempts, potentially affected by limited and noisy error feedback in retention. PyCapsule demonstrates broader impacts on advancing lightweight and efficient code generation for artificial intelligence systems.
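PyCapsule's self-debugging module follows the familiar generate, execute, feed-back-the-traceback pattern. The stripped-down sketch below uses an assumed `generate` callable and a plain subprocess run; the real system adds prompt inference and case testing on top of this loop.

```python
import subprocess
import tempfile
import textwrap
from typing import Callable

def generate_with_self_debug(task: str, generate: Callable[[str], str],
                             max_attempts: int = 3) -> str:
    """Generate code, run it, and feed the error output back for repair."""
    prompt = task
    code = generate(prompt)
    for _ in range(max_attempts):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True,
                                text=True, timeout=30)
        if result.returncode == 0:
            return code
        # feed the (possibly noisy) error message back for repair
        prompt = textwrap.dedent(f"""\
            The following program failed:
            {code}
            Error output:
            {result.stderr[-2000:]}
            Return a fixed version of the program.""")
        code = generate(prompt)
    return code
```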
In industrial control systems, the generation and verification of Programmable Logic Controller (PLC) code are critical for ensuring operational efficiency and safety. While Large Language Models (LLMs) have made strides in automated code generation, they often fall short in providing correctness guarantees and specialized support for PLC programming. To address these challenges, this paper introduces Agents4PLC, a novel framework that not only automates PLC code generation but also includes code-level verification through an LLM-based multi-agent system. We first establish a comprehensive benchmark for the verifiable PLC code generation area, transitioning from natural language requirements to human-written-verified formal specifications and reference PLC code. We further enhance our agents specifically for industrial control systems by incorporating Retrieval-Augmented Generation (RAG), advanced prompt engineering techniques, and Chain-of-Thought strategies. Evaluation against the benchmark demonstrates that Agents4PLC significantly outperforms previous methods, achieving superior results across a series of increasingly rigorous metrics. This research not only addresses the critical challenges in PLC programming but also highlights the potential of our framework to generate verifiable code applicable to real-world industrial applications.
The growing demand for rapid, high-quality software development has intensified the need for intelligent automation in code generation and optimization. This paper presents an autonomous large language model (LLM) framework that integrates context-aware prompt engineering, multi-stage optimization, and a self-healing correction loop to produce efficient, reliable, and maintainable code with minimal human intervention. The system begins by analyzing functional and non-functional requirements, translating them into a structured context vector to guide the LLM. It then dynamically refines prompts based on real-time feedback, ensuring progressive improvement in output quality. Generated code undergoes automated syntax and semantic validation, followed by a multi-stage optimization pipeline targeting algorithmic efficiency, memory utilization, and compiler-level enhancements. A hybrid execution simulation, combining symbolic and sandboxed testing, further validates robustness and performance. Experimental evaluation across diverse programming tasks demonstrated significant performance improvements over baseline LLM code generation. The proposed framework achieved an average accuracy of 96.6% in hybrid execution reliability, with error rate reductions exceeding 61%, execution time improvements averaging 24%, and code quality scores improving by ~16%. These results confirm that the integration of adaptive prompt engineering and iterative optimization enables autonomous, high-accuracy code generation suitable for real-world applications. The framework’s adaptability positions it as a transformative solution for domains requiring both speed and quality in software development.
Automating software development processes through the orchestration of GitHub Action workflows has revolutionized the efficiency and agility of software delivery pipelines. This paper presents a detailed investigation into the use of Large Language Models (LLMs), specifically GPT-3.5 and GPT-4, to generate and evaluate GitHub Action workflows for DevOps tasks. Our methodology involves data collection from public GitHub repositories, prompt engineering for LLM utilization, and evaluation metrics encompassing exact match scores, BLEU scores, and a novel DevOps Aware score. The research scrutinizes the proficiency of GPT-3.5 and GPT-4 in generating GitHub workflows, while assessing the influence of various prompt elements in constructing the most efficient pipeline. Results indicate substantial advancements in GPT-4, particularly in DevOps awareness and syntax correctness. The research introduces a GitHub App built on Probot, empowering users to automate workflow generation within the GitHub ecosystem. This study contributes insights into the evolving landscape of AI-driven automation in DevOps practices.
Automated paper reproduction has emerged as a promising approach to accelerate scientific research, employing multi-step workflow frameworks to systematically convert academic papers into executable code. However, existing frameworks often lack mechanisms to verify and refine the outputs at each generation step, or rely heavily on manually designed prompts for self-refinement, which limits their adaptability and scalability. To address these limitations, we propose a prompt-free collaborative agent framework that automatically enhances the quality of paper-to-code generation. Our approach employs two collaborative agents: a verification agent that examines whether the outputs at each step satisfy the requirements specified in the corresponding system prompt, and a refinement agent that revises the outputs based on the identified issues. Unlike previous methods that require human experts to craft specific refinement prompts for each step, our framework achieves automatic verification and improvement by leveraging only the original system prompts. We integrate our collaborative agents into the Paper2Code framework and conduct comprehensive experiments on the PaperBench Code-Dev and Paper2CodeBench datasets. Experimental results demonstrate that our approach significantly improves the accuracy and completeness of reproduced code, achieving performance gains of approximately 15% and 13%, respectively, compared to the baseline without our agents. Furthermore, comparative experiments against Self-Refine validate the robustness and consistency of our prompt-free approach across different datasets.
Current AI-powered programming assistants generate large code blocks through single-shot interactions, violating established human-computer interaction principles and increasing cognitive load while reducing developer understanding and control. We present ProDec, a system that automatically decomposes complex programming prompts into structured subtasks with explicit dependencies, enabling interactive visual exploration and modification of AI reasoning steps. By transforming AI-assisted programming from black-box code generation into a transparent, collaborative problem-solving process, our approach aims to restore developer agency.
The automatic generation of register-transfer level (RTL) code from natural language has advanced rapidly with large language models (LLMs). However, existing LLM-driven hardware design flows often produce monolithic and inconsistent implementations, lacking modularity and interface consistency. This paper introduces CTPGenius, a Category-Theoretic Prompt Graph (CTPGs) approach integrated with LLM-based code synthesis to enable systematic, hierarchical RTL generation. By modeling hardware modules as categorical objects and their connections as morphisms, our approach guides the LLM to generate consistent submodules and assemble them into a correct top-level design. Functorial prompting ensures parameter and port consistency, while iterative LLM-aided correction addresses residual errors. Experiments show that CTPGenius significantly improves modularity, correctness, and extensibility in LLM-generated RTL, advancing automated hardware synthesis from natural language.
Large Language Models (LLMs) have shown promising results in automatic code generation by improving coding efficiency to a certain extent. However, generating high-quality and reliable code remains a formidable task because of LLMs' lack of good programming practice, especially in exception handling. In this paper, we first conduct an empirical study and summarize three crucial challenges of LLMs in exception handling, i.e., incomplete exception handling, incorrect exception handling, and abuse of try-catch. We then try prompts with different granularities to address such challenges, finding that fine-grained, knowledge-driven prompts work best. Based on our empirical study, we propose a novel Knowledge-driven Prompt Chaining-based code generation approach, named KPC, which decomposes code generation into an AI chain with iterative check-rewrite steps and chains fine-grained knowledge-driven prompts to assist LLMs in considering exception-handling specifications. We evaluate our KPC-based approach with 3,079 code generation tasks extracted from the Java official API documentation. Extensive experimental results demonstrate that the KPC-based approach has considerable potential to ameliorate the quality of code generated by LLMs. It achieves this through proficiently managing exceptions and obtaining remarkable enhancements of 109.86% and 578.57% with static evaluation methods, as well as a reduction of 18 runtime bugs in the sampled dataset with dynamic validation.
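The KPC idea, chaining fine-grained exception-handling knowledge into the prompt and looping check-rewrite steps, can be sketched generically. The knowledge snippets, callables, and prompt wording below are illustrative assumptions, not the paper's actual knowledge base or chain.

```python
from typing import Callable, List

# Illustrative fine-grained exception-handling rules (placeholders)
EXCEPTION_KNOWLEDGE = [
    "Wrap I/O calls (file, network) in try/except and release resources in finally or with-blocks.",
    "Catch the narrowest exception type the called API documents, never a bare except.",
    "Do not swallow exceptions silently; log or re-raise with context.",
]

def kpc_generate(
    task: str,
    generate: Callable[[str], str],        # code generator (assumed)
    check: Callable[[str], List[str]],     # returns exception-handling violations (assumed)
    max_rounds: int = 3,
) -> str:
    knowledge = "\n".join(f"- {k}" for k in EXCEPTION_KNOWLEDGE)
    code = generate(f"{task}\n\nFollow these exception-handling rules:\n{knowledge}")
    for _ in range(max_rounds):            # iterative check-rewrite steps
        issues = check(code)
        if not issues:
            break
        code = generate(
            "Rewrite the code to fix these exception-handling issues:\n- "
            + "\n- ".join(issues) + f"\n\nCode:\n{code}"
        )
    return code
```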
No abstract available
Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual codes or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.
LLMs have demonstrated significant potential in code generation tasks, achieving promising results at the function or statement level across various benchmarks. However, the complexities associated with creating code artifacts like classes, particularly within the context of real-world software repositories, remain underexplored. Prior research treats class-level generation as an isolated task, neglecting the intricate dependencies and interactions that characterize real-world software environments. To address this gap, we introduce RepoClassBench, a comprehensive benchmark designed to rigorously evaluate LLMs in generating complex, class-level code within real-world repositories. RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python, and C# from a selection of repositories. We ensure that each class in our dataset not only has cross-file dependencies within the repository but also includes corresponding test cases to verify its functionality. We find that current models struggle with the realistic challenges posed by our benchmark, primarily due to their limited exposure to relevant repository contexts. To address this shortcoming, we introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate and reason about repository-level context in an agent-based framework. Our experiments demonstrate that RRR significantly outperforms existing baselines on RepoClassBench, showcasing its effectiveness across programming languages and under various settings. Our findings emphasize the critical need for code-generation benchmarks to incorporate repo-level dependencies to more accurately reflect the complexities of software development. Our work shows the benefits of leveraging specialized tools to enhance LLMs' understanding of repository context. We plan to make our dataset and evaluation harness public.
Natural Language Processing has emerged as a powerful tool in various fields, including code generation. This paper explores the application of NLP techniques for the automatic generation of code from natural language specifications. The goal is to bridge the gap between human-readable requirements and machine-executable code, making software development more accessible and efficient. NLP for code generation serves as a bridge between human-readable natural language specifications and machine-executable code. This is particularly significant as it facilitates communication between developers and non-technical stakeholders, allowing them to contribute to the software development process without detailed programming knowledge. As the demand for software development continues to grow, there is a persistent talent gap in the industry. NLP for code generation can contribute to addressing this gap by allowing individuals with domain expertise to participate in software development without requiring extensive programming skills. The COPRAS-G method requires identifying selection criteria, evaluating information related to these criteria, and developing methods to assess how well each alternative meets the participant's needs in order to judge overall performance. Decision analysis involves a Decision Maker (DM) who considers a particular set of alternatives and selects one among them, usually under conflicting criteria. For this reason, the complex proportional assessment (COPRAS) method can be used. In the results, TRANX obtained the first rank, whereas Tree2Tree had the lowest rank.
MBPP is a popular dataset for evaluating the task of code generation from natural language. Despite its popularity, there are three problems: (1) it relies on providing test cases to generate the right signature, (2) there is poor alignment between instruction and evaluation test cases, and (3) contamination of the exact phrasing being present in training datasets. We adapt MBPP to emphasize on generating code from just natural language by (1) removing ambiguity about the semantics of the task from the descriptions, and (2) evaluating generated code on multiple sets of assertions to account for ambiguity in the syntax. We compare popular open and closed weight models on the original (MBPP) and adapted (MBUPP) datasets.
Large Language Models (LLMs) have shown remarkable potential in code generation, making them increasingly important in the field. However, the security issues of generated code have not been fully addressed, and the usability of LLMs in code generation still requires further exploration. This work introduces SecCode, a framework that leverages an innovative interactive encouragement prompting (EP) technique for secure code generation with only NL prompts. This approach ensures that the prompts can be easily shared and understood by general users. SecCode functions through three stages: 1) Code Generation using NL Prompts; 2) Code Vulnerability Detection and Fixing, utilising our proposed encouragement prompting; 3) Vulnerability Cross-Checking and Code Security Refinement. These stages are executed in multiple interactive iterations to progressively enhance security. By using both proprietary LLMs (i.e., GPT-3.5 Turbo, GPT-4 and GPT-4o) and open-source LLMs (i.e., Llama 3.1 8B Instruct, DeepSeek Coder V2 Lite Instruct) evaluated on three benchmark datasets, extensive experimental results show that our proposed SecCode greatly outperforms compared baselines, generating secure code with a high vulnerability correction rate. For example, SecCode exhibits a high fix success rate of over 76% after running 5 automated EP interactive iterations and over 89% after running 10 automated EP interactive iterations. To the best of our knowledge, this work is the first to formulate secure code generation with NL prompts only. We have open-sourced our code and encourage the community to focus on secure code generation.
With the advancement of Large Language Models (LLMs), significant progress has been made in code generation, enabling LLMs to transform natural language into programming code. These Code LLMs have been widely accepted by massive users and organizations. However, a danger is hidden in the code: the existence of fatal vulnerabilities. While some LLM providers have attempted to address these issues by aligning with human guidance, these efforts fall short of making Code LLMs practical and robust. Without a deep understanding of the performance of the LLMs under practical worst cases, it would be concerning to apply them to various real-world applications. In this paper, we answer the critical question: Are existing Code LLMs immune to generating vulnerable code? If not, what is the possible maximum severity of this issue in practical deployment scenarios? We introduce DeceptPrompt, a novel algorithm that can generate adversarial natural language instructions that drive Code LLMs to generate functionally correct code with vulnerabilities. DeceptPrompt is achieved through a systematic evolution-based algorithm with a fine-grained loss design. The unique advantage of DeceptPrompt enables us to find natural prefixes/suffixes with totally benign and non-directional semantic meaning that nevertheless have great power in inducing Code LLMs to generate vulnerable code. This feature enables us to conduct almost-worst-case red-teaming on these LLMs in a real scenario, where users are using natural language. Our extensive experiments and analyses of DeceptPrompt not only validate the effectiveness of our approach but also shed light on the huge weakness of LLMs in the code generation task. When applying the optimized prefix/suffix, the attack success rate (ASR) improves by an average of 50% compared with applying no prefix/suffix.
Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1078 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions. Arcade is publicly available at https://github.com/google-research/arcade-nl2code/.
No abstract available
Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at https://github.com/neulab/external-knowledge-codegen.
Training datasets for semantic parsing are typically small due to the higher expertise required for annotation than most other NLP tasks. As a result, models for this application usually need additional prior knowledge to be built into the architecture or algorithm. The increased dependency on human experts hinders automation and raises the development and maintenance costs in practice. This work investigates whether a generic transformer-based seq2seq model can achieve competitive performance with minimal code-generation-specific inductive bias design. By exploiting a relatively sizeable monolingual corpus of the target programming language, which is cheap to mine from the web, we achieved 81.03% exact match accuracy on Django and 32.57 BLEU score on CoNaLa. Both are SOTA to the best of our knowledge. This positive evidence highlights a potentially easier path toward building accurate semantic parsers in practice.
Recently, many large language models (LLMs) have been proposed, showing advanced proficiency in code generation. Meanwhile, many efforts have been dedicated to evaluating LLMs on code generation benchmarks such as HumanEval. Although very helpful for comparing different LLMs, existing evaluation focuses on a simple code generation scenario (i.e., function-level or statement-level code generation), which mainly asks LLMs to generate one single code unit (e.g., a function or a statement) for the given natural language description. Such evaluation focuses on generating independent and often small-scale code units, thus leaving it unclear how LLMs perform in real-world software development scenarios. To fill this knowledge gap, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. Compared with existing code generation benchmarks, it better reflects real-world software development because it comprises broader contextual dependencies and multiple, interdependent units of code. We first manually construct the first class-level code generation benchmark, ClassEval, of 100 class-level Python code generation tasks, with approximately 500 person-hours of effort. Based on the new benchmark, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. We find that all LLMs perform much worse on class-level code generation than on method-level generation. While GPT models still dominate other LLMs on class-level code generation, the performance rankings of the other models on method-level code generation no longer hold at the class level. Besides, most models (except GPT models) perform better when generating the class method by method, and they have limited ability to generate dependent code. Based on our findings, we call for software engineering (SE) researchers' expertise to build more LLM benchmarks based on practical and complicated software development scenarios.
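The "method by method" strategy mentioned above can be pictured as follows; `generate` is a hypothetical stand-in for a Code LLM call, and the skeleton format is invented for illustration.

```python
# Sketch of method-by-method class generation: each method is generated in
# turn, with the previously generated methods kept in the prompt context.
class_skeleton = {
    "name": "ShoppingCart",
    "methods": ["add_item(self, name, price)", "total(self)"],
}

def generate(prompt: str) -> str:
    # Placeholder for a Code LLM call (hypothetical).
    return "        pass"

def generate_class_method_by_method(skeleton):
    lines = [f"class {skeleton['name']}:"]
    for sig in skeleton["methods"]:
        context = "\n".join(lines)            # earlier methods stay in context
        body = generate(f"{context}\n    def {sig}:\n")
        lines += [f"    def {sig}:", body]
    return "\n".join(lines)

print(generate_class_method_by_method(class_skeleton))
```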
Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. With the increasing adoption of LLMs in software development, academic research and industry-based projects are developing various tools, benchmarks, and metrics to evaluate the effectiveness of LLM-generated code. However, there is a lack of solutions evaluated through empirically grounded methods that incorporate practitioners' perspectives to assess functionality, syntax, and accuracy in real-world applications. To address this gap, we propose and develop a multi-model unified platform to generate and execute code based on natural language prompts. We conducted a survey with 60 software practitioners from 11 countries across four continents, working in diverse professional roles and domains, to evaluate the usability, performance, strengths, and limitations of each model. The results present practitioners' feedback and insights into the use of LLMs in software development, including their strengths and weaknesses, key aspects overlooked by benchmarks and metrics, and a broader understanding of their practical applicability. These findings can help researchers and practitioners make informed decisions when systematically selecting and using LLMs in software development projects. Future research will focus on integrating more diverse models into the proposed system, incorporating additional case studies, and conducting developer interviews for deeper empirical insights into LLM-driven software development.
We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages, including Python and SQL, to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously designed the evaluation suite to ensure the accuracy and robustness of the evaluation. We developed the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at https://github.com/yiyihum/dabench.
Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation
The application of large language models (LLMs) to digital hardware code generation is an emerging field, with most LLMs primarily trained on natural language and software code. Hardware code like Verilog constitutes a small portion of training data, and few hardware benchmarks exist. The open-source VerilogEval benchmark, released in November 2023, provided a consistent evaluation framework for LLMs on code completion tasks. Since then, both commercial and open models have seen significant development. In this work, we evaluate new commercial and open models since VerilogEval’s original release—including GPT-4o, GPT-4 Turbo, Llama3.1 (8B/70B/405B), Llama3 70B, Mistral Large, DeepSeek Coder (33B and 6.7B), CodeGemma 7B, and RTL-Coder—against an improved VerilogEval benchmark suite. We find measurable improvements in state-of-the-art models: GPT-4o achieves a 63% pass rate on specification-to-RTL tasks. The recently released and open Llama3.1 405B achieves a 58% pass rate, almost matching GPT-4o, while the smaller domain-specific RTL-Coder 6.7B models achieve an impressive 34% pass rate. Additionally, we enhance VerilogEval’s infrastructure by automatically classifying failures, introducing in-context learning support, and extending the tasks to specification-to-RTL translation. We find that prompt engineering remains crucial for achieving good pass rates and varies widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is essential for continued model development and deployment.
Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent works have addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We employ established machine translation methods to compile the benchmark, coupled with a quality assurance process. Furthermore, we provide expert human translations for 15 diverse natural languages (NLs). We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs, offering insights into the current landscape of cross-lingual code generation.
Large language models (LLMs) have revolutionized various applications in natural language processing and exhibited proficiency in generating programming code. We propose a framework for evaluating the code generation ability of LLMs and introduce a new metric, pass-ratio@n, which captures the granularity of accuracy according to the pass rate of test cases. The framework is intended to be fully automatic, handling the repetitive work involved in generating prompts, conducting inferences, and executing the generated code. A preliminary evaluation focusing on prompt detail, problem publication date, and difficulty level demonstrates the successful integration of our framework with the LeetCode coding platform and highlights the applicability of the pass-ratio@n metric.
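The abstract does not spell out the formula, but one plausible reading of pass-ratio@n is sketched below: average the per-sample fraction of passed test cases over n samples, rather than scoring each sample all-or-nothing. Treat this as an assumption; the paper gives the authoritative definition.

```python
# One plausible reading of pass-ratio@n (an assumption): average, over n
# generated samples, the fraction of test cases each sample passes.
def pass_ratio_at_n(per_sample_results):
    """per_sample_results: list of n lists of booleans (one per test case)."""
    ratios = [sum(r) / len(r) for r in per_sample_results]
    return sum(ratios) / len(ratios)

# 3 samples, each run against 4 test cases
print(pass_ratio_at_n([[True, True, False, True],
                       [True, True, True, True],
                       [False, True, False, False]]))  # -> 0.666...
```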
The advent of large language models trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking using test cases or heuristics based on the execution results. However, it is challenging to obtain test cases for many real-world language-to-code applications, and heuristics cannot well capture semantic features of the execution results, such as data type and value range, which often indicate the correctness of the program. In this work, we propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs using their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLM is correct or not based on the natural language input, the program itself, and its execution results. The sampled programs are reranked by combining the verification score with the LLM generation probability and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA, and basic Python programming, LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.
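The reranking step is concrete enough to sketch: each sampled program is scored by the product of its generation probability and its verifier probability, and scores are summed over programs whose execution results agree. The numbers below are invented for illustration, not model outputs.

```python
# LEVER-style reranking sketch: joint score p_LM * p_verifier, marginalized
# over programs with the same execution result.
from collections import defaultdict

samples = [
    # (program, lm_probability, verifier_probability, execution_result)
    ("df['x'].sum()",   0.30, 0.90, 42),
    ("sum(df['x'])",    0.20, 0.85, 42),
    ("df['x'].count()", 0.35, 0.40, 10),
]

result_scores = defaultdict(float)
for program, p_lm, p_ver, result in samples:
    result_scores[result] += p_lm * p_ver   # aggregate same-result programs

best_result = max(result_scores, key=result_scores.get)
best_program = max((s for s in samples if s[3] == best_result),
                   key=lambda s: s[1])[0]
print(best_result, best_program)            # -> 42 df['x'].sum()
```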
Large language models achieve promising results in code generation based on a given natural language description. They have been integrated into open-source projects and commercial products to facilitate daily coding activities. The natural language description in the prompt is crucial for LLMs to comprehend users' requirements. Prior studies have uncovered that LLMs are sensitive to changes in the prompts, including slight changes that look inconspicuous. However, natural language descriptions often vary in real-world scenarios (e.g., different formats, grammar, and wording). Prior studies on the robustness of LLMs were often based on random perturbations, and such perturbations may not actually happen. In this article, we conduct a comprehensive study to investigate how robust code LLMs are to variations of natural language descriptions in real-world scenarios. We summarize 18 categories of natural language perturbations and three combinations of co-occurring categories based on our literature review and an online survey with practitioners. We propose an automated framework, NLPerturbator, which can perform perturbations of each category given a set of prompts. Through a series of experiments on code generation using seven code LLMs, we find that the perturbed prompts can decrease the performance of code generation by a considerable margin. Our study highlights the importance of enhancing the robustness of LLMs to real-world variations in the prompts, as well as the importance of carefully constructing the prompts.
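Two toy perturbation categories in the spirit of (but much simpler than) NLPerturbator's 18-category taxonomy; the synonym table and noise rate below are illustrative assumptions, not the paper's actual categories.

```python
# Minimal illustration of realistic prompt perturbations: synonym swaps and
# whitespace noise applied to an NL description before code generation.
import random, re

SYNONYMS = {"return": "give back", "list": "array", "compute": "calculate"}

def synonym_swap(prompt: str) -> str:
    for word, syn in SYNONYMS.items():
        prompt = re.sub(rf"\b{word}\b", syn, prompt)
    return prompt

def whitespace_noise(prompt: str) -> str:
    # Randomly double some spaces to mimic sloppy formatting.
    return re.sub(r" ", lambda m: "  " if random.random() < 0.2 else " ", prompt)

original = "Write a function to compute the sum of a list and return it."
print(synonym_swap(original))
print(whitespace_noise(original))
```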
With significant advances in deep learning, code generation from natural language descriptions has become a prevailing research topic. Existing studies demonstrate that these methods achieve high BLEU values. However, the datasets used in existing studies lack diversity, and BLEU is usually the only evaluation metric. To overcome these limitations, in this paper we crawl a dataset more suitable for code generation from an online judge system and re-run existing code generation models on it. We evaluate the generated code from five aspects: lexical similarity, tree similarity, syntactic legality, semantic legality, and functional correctness. This study provides a deeper analysis of the performance of existing code generation methods.
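Of the five aspects, syntactic legality and functional correctness are straightforward to check automatically for Python output; a minimal sketch follows. Lexical/tree similarity and semantic legality require reference solutions and deeper analysis, so they are omitted here.

```python
# Two of the five evaluation aspects for generated Python code:
# syntactic legality (does it parse?) and functional correctness (does it
# pass a test case?).
import ast

def syntactically_legal(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def functionally_correct(code: str, test: str) -> bool:
    env = {}
    try:
        exec(code, env)
        exec(test, env)          # the test raises AssertionError on failure
        return True
    except Exception:
        return False

generated = "def add(a, b):\n    return a + b"
print(syntactically_legal(generated))                            # True
print(functionally_correct(generated, "assert add(2, 3) == 5"))  # True
```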
Recent advancements in large language models (LLMs) have greatly improved code generation, specifically at the function level. For instance, GPT-4o has achieved a 91.0% pass rate on HumanEval. However, this draws into question the adequacy of existing benchmarks in thoroughly assessing function-level code generation capabilities. Our study analyzed two common benchmarks, HumanEval and MBPP, and found that these might not thoroughly evaluate LLMs' code generation capacities due to limitations in quality, difficulty, and granularity. To resolve this, we introduce the Mostly Hard Python Problems (MHPP) dataset, consisting of 210 unique human-curated problems. By focusing on the combination of natural language and code reasoning, MHPP gauges LLMs' abilities to comprehend specifications and restrictions, engage in multi-step reasoning, and apply coding knowledge effectively. Initial evaluations of 26 LLMs using MHPP showed that many high-performing models on HumanEval failed to achieve similar success on MHPP. Moreover, MHPP highlighted various previously undiscovered limitations within various LLMs, leading us to believe that it could pave the way for a better understanding of LLMs' capabilities and limitations. MHPP, the evaluation pipeline, and the leaderboard can be found at https://github.com/SparksofAGI/MHPP.
As LLMs become increasingly prevalent, it is interesting to consider how "creative" these models can be. From cognitive science, creativity consists of at least two key characteristics: convergent thinking (purposefulness to achieve a given goal) and divergent thinking (adaptability to explore new environments or constraints) (Runco, 2003). In this work, we introduce a framework for quantifying LLM creativity that incorporates two design ingredients: (1) we introduce DENIAL PROMPTING, which pushes LLMs to develop more creative solutions to a given problem by incrementally imposing new constraints on the previous solution, compelling LLMs to adopt new strategies; (2) we define NEOGAUGE, a metric that quantifies both convergent and divergent thinking in the creative responses generated by LLMs. We test the proposed framework on Codeforces problems, which serve as both a natural dataset for coding tasks and a collection of prior human solutions. We quantify NEOGAUGE for various proprietary and open-source models and find that even the most creative model, GPT-4, still falls short of demonstrating human-like creativity. We also experiment with advanced reasoning strategies (MCTS, self-correction, etc.) and observe no significant improvement in creativity. As a by-product of our analysis, we release the NEOCODER dataset for reproducing our results on future models.
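A toy rendering of the DENIAL PROMPTING loop: after each solution, one construct it used is banned and the problem is re-posed under the accumulated constraints. `solve` is a hypothetical stand-in for an LLM call, and the substring-based constraint detection is an assumption made for brevity.

```python
# Toy Denial-Prompting loop: repeatedly solve, then ban a construct the
# latest solution relied on, forcing a new strategy next round.
def solve(problem: str, banned: list[str]) -> str:
    # Placeholder: a real run would prompt an LLM with the ban list included.
    return "sorted(xs)" if "sorted" not in banned else "xs.sort() or xs"

def denial_prompting(problem: str, rounds: int = 3):
    banned, solutions = [], []
    for _ in range(rounds):
        sol = solve(problem, banned)
        solutions.append(sol)
        for construct in ("sorted", "sort", "recursion"):
            if construct in sol and construct not in banned:
                banned.append(construct)   # deny a construct used this round
                break
    return solutions, banned

print(denial_prompting("sort a list xs"))
```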
The automatic generation of RTL code (e.g., Verilog) from natural language instructions using large language models (LLMs) has attracted significant research interest recently. However, most existing approaches heavily rely on commercial LLMs such as ChatGPT, while open-source LLMs tailored for this specific design generation task exhibit notably inferior performance. The absence of high-quality open-source solutions restricts the flexibility and data privacy of this emerging technique. In this study, we present a new customized LLM solution with a modest parameter count of only 7B, achieving better performance than GPT-3.5 on all representative benchmarks for RTL code generation. Notably, it even outperforms GPT-4 on the VerilogEval Machine benchmark. This remarkable balance between accuracy and efficiency is made possible by leveraging our new RTL code dataset and a customized LLM algorithm, both of which have been made fully open-source. Furthermore, we have successfully quantized our LLM to 4-bit with a total size of 4 GB, enabling it to function on a single laptop with only slight performance degradation. This efficiency allows the RTL generator to serve as a local assistant for engineers, ensuring that all design privacy concerns are addressed.
We propose using natural language outlines as a novel modality and interaction surface for providing AI assistance to developers throughout the software development process. An NL outline for a code function comprises multiple statements written in concise prose, which partition the code and summarize its main ideas in the style of literate programming. Crucially, we find that modern LLMs can generate accurate and high-quality NL outlines in practice. Moreover, NL outlines enable a bidirectional sync between code and NL, where a developer can change either code or NL and have the LLM automatically update the other. We discuss many use cases for NL outlines: they can accelerate understanding and navigation of code and diffs, simplify code maintenance, augment code search, steer code generation, and more. We then propose and compare multiple LLM prompting techniques for generating outlines and ask professional developers to judge outline quality. Finally, we present two case studies applying NL outlines toward code review and malware detection.
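One simple way to picture an NL outline is as prose statements anchored to the line ranges they summarize. The dataclass below is an illustrative representation, not the paper's actual format, and only the outline-to-code rendering direction of the bidirectional sync is shown.

```python
# Illustrative representation of an NL outline: each statement is anchored to
# the span of code lines it summarizes, literate-programming style.
from dataclasses import dataclass

@dataclass
class OutlinePoint:
    text: str          # concise prose summary
    start_line: int    # first code line covered (1-based)
    end_line: int      # last code line covered

code = """def load(path):
    rows = read_csv(path)
    rows = [r for r in rows if r.valid]
    return rows"""

outline = [
    OutlinePoint("Read the raw rows from disk.", 2, 2),
    OutlinePoint("Drop invalid rows and return the rest.", 3, 4),
]

# Render the outline interleaved with the code it describes.
lines = code.splitlines()
for point in outline:
    print(f"# {point.text}")
    print("\n".join(lines[point.start_line - 1 : point.end_line]))
```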
Code assistance refers to the utilization of various tools, techniques, and models to help developers in the process of software development. As coding tasks become increasingly complex, code assistants play a pivotal role in enhancing developer productivity, reducing errors, and facilitating a more efficient coding workflow. This assistance can manifest in various forms, including code autocompletion, error detection and correction, code generation, documentation support, and context-aware suggestions. Language models have emerged as integral components of code assistance, offering developers the capability to receive intelligent suggestions, generate code snippets, and enhance overall coding proficiency. In this paper, we propose new hybrid models for code generation by combining the pre-trained language models BERT, RoBERTa, ELECTRA, and LUKE with the Marian Causal Language Model; these models were selected based on their strong performance in various natural language processing tasks. We evaluate the performance of these models on two datasets, CoNaLa and DJANGO, and compare them to existing state-of-the-art models, aiming to investigate the potential of pre-trained transformer language models to revolutionize code generation by offering improved precision and efficiency in navigating complex coding scenarios. We additionally conduct error analysis and refine the generated code. Our results show that these models, when combined with the Marian Decoder, significantly improve code generation accuracy and efficiency. Notably, the RoBERTaMarian model achieved a maximum BLEU score of 35.74 and an exact match accuracy of 13.8% on CoNaLa, while LUKE-Marian attained a BLEU score of 89.34 and an exact match accuracy of 78.50% on DJANGO. An implementation of this work is available at https://github.com/AhmedSSoliman/Leveraging-Pretrained-Language-Models-for-Code-Generation.
Crafting effective prompts for code generation or editing with Large Language Models (LLMs) is not an easy task. Particularly, the absence of immediate, stable feedback during prompt crafting hinders effective interaction, as users are left to mentally imagine possible outcomes until the code is generated. In response, we introduce Language-Oriented Code Sketching, an interactive approach that provides instant, incremental feedback in the form of code sketches (i.e., incomplete code outlines) during prompt crafting. This approach converts a prompt into a code sketch by leveraging the inherent linguistic structures within the prompt and applying classic natural language processing techniques. The sketch then serves as an intermediate placeholder that not only previews the intended code structure but also guides the LLM towards the desired code, thereby enhancing human-LLM interaction. We conclude by discussing the approach's applicability and future plans.
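A toy version of the prompt-to-sketch conversion described above, using only trivial linguistic cues; the rules (the first words become a function name, "each"/"every" triggers a loop skeleton) are invented for illustration and are far simpler than the approach's actual NLP pipeline.

```python
# Convert a partial prompt into an incomplete code sketch (a code outline
# with placeholders) using lightweight linguistic cues.
def sketch(prompt: str) -> str:
    words = prompt.lower().split()
    name = "_".join(words[:3]) if words else "todo"
    body = ("    for item in ...:\n        ..."
            if {"each", "every"} & set(words) else "    ...")
    return f"def {name}(...):\n{body}"

print(sketch("filter every row missing a price"))
```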
The capabilities of Large Language Models (LLMs) in code generation have been extensively studied, particularly for implementing target functionalities from natural-language descriptions. As an alternative to natural language, input-output (I/O) examples provide an accessible, unambiguous, and flexible way to describe functionalities. However, their inherent diversity, opaqueness, and incompleteness impose greater challenges for understanding and implementing the target requirements. Therefore, generating code from I/O examples (i.e., example-based code generation) provides a new perspective, allowing us to additionally evaluate LLMs’ capability to infer target functionalities from limited information and to process new-form requirements. However, related research about LLMs in example-based code generation remains largely unexplored. To fill this gap, this paper presents the first comprehensive study on example-based code generation using LLMs. To address the incorrectness caused by the incompleteness of I/O examples, we adopt an iterative evaluation framework and formalize the objective of example-based code generation as two sequential sub-objectives: generating code conforming to the given examples and generating code that successfully implements the target functionalities from (iteratively) given examples. We assess six state-of-the-art LLMs using a new benchmark of 172 diverse target functionalities (derived from HumanEval and CodeHunt). The results demonstrate that when requirements are described using iterative I/O examples rather than natural language, the LLMs’ score decreases by over 60%, indicating that example-based code generation remains challenging for the evaluated LLMs. Notably, the vast majority (even over 95%) of successfully implemented functionalities are achieved in the first round of the iterations, suggesting that the LLMs struggle to effectively utilize the iteratively supplemented requirements. Furthermore, we find that combining I/O examples with even imprecise and fragmental natural language descriptions greatly improves LLM performance, and the selection of initial I/O examples can also influence the score, suggesting opportunities for prompt optimization. These findings highlight the importance of early prompts during interactions and offer critical insights and implications for enhancing LLM-based code generation.
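The iterative evaluation framework described above can be sketched as the loop below: generate from the currently given I/O examples, check against the full hidden set, and supplement one failing example per round. `generate_from_examples` is a hypothetical placeholder for an LLM call, and the round and seed sizes are illustrative.

```python
# Iterative example-based generation/evaluation loop: succeed when the
# generated function matches every hidden I/O example.
def generate_from_examples(examples):
    # Placeholder: a real run would prompt an LLM with the I/O pairs.
    return lambda x: abs(x)

def iterative_eval(hidden_examples, rounds=3, seed_size=2):
    given = hidden_examples[:seed_size]
    for _ in range(rounds):
        fn = generate_from_examples(given)
        failing = [(x, y) for x, y in hidden_examples if fn(x) != y]
        if not failing:
            return True, given          # target functionality implemented
        given = given + failing[:1]     # iteratively supplement one example
    return False, given

hidden = [(-3, 3), (2, 2), (0, 0), (-7, 7)]
print(iterative_eval(hidden))
```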
This final grouping comprehensively covers the full lifecycle of Vibe Coding, from theoretical definition and technical implementation to social impact. The research confirms that software development is shifting from a "syntax-oriented" to an "intent-oriented" paradigm; technically, agent architectures, iterative self-debugging, and refined prompt engineering have significantly improved productivity; in application, the practice has penetrated deeply into vertical domains such as hardware design and high-performance computing; at the same time, the research warns of security vulnerabilities and technical-debt risks, and proposes new paths for programming education and the cultivation of computational thinking in the AI era.