基于大语言模型的漏洞挖掘和检测
综述、基准测试与实证研究
该组文献对LLM在漏洞检测领域的表现进行了系统性回顾,提出了多个高质量基准数据集(如PyCodeVul、SafeGenBench、SBAN)和评估框架,并深入探讨了数据污染、提示词敏感性及模型在现实场景中的性能边界。
- Large Language Model for Vulnerability Detection and Repair: Literature Review and the Road Ahead(Xin Zhou, Sicong Cao, Xiaobing Sun, David Lo, 2024, ACM Transactions on Software Engineering and Methodology)
- Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories(Alperen Yildiz, Sin G. Teo, Yiling Lou, Yebo Feng, Chong Wang, Dinil Mon Divakaran, 2025, ArXiv)
- Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask(Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xiuzhen Cheng, Fengyuan Xu, Sheng Zhong, 2025, ArXiv)
- SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis(Yansong Li, Paula Branco, Alexander M. Hoole, Manish Marwah, H. M. Koduvely, Guy-Vincent Jourdan, Stephan Jou, 2025, 2025 IEEE Symposium on Security and Privacy (SP))
- SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code(Xinghang Li, Jingzhe Ding, Chao Peng, Bing Zhao, Xiang Gao, Hongwan Gao, Xinchen Gu, 2025, ArXiv)
- Real-VulLLM: An LLM Based Assessment Framework in the Wild(Rijha Safdar, Danyail Mateen, Syed Taha Ali, Wajahat Hussain, 2025, ArXiv)
- CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics(Yikun Li, Ting Zhang, Ratnadira Widyasari, Yan Naing Tun, Huu Hung Nguyen, Tan Bui, I. Irsan, Yiran Cheng, Xiang Lan, Han Wei Ang, F. Liauw, M. Weyssow, Hong Jin Kang, Eng Lieh Ouh, Lwin Khin Shar, David Lo, 2024, ArXiv)
- Do LLMs consider security? an empirical study on responses to programming questions(Amirali Sajadi, B. Le, Anh Nguyen, Kostadin Damevski, Preetha Chatterjee, 2025, Empirical Software Engineering)
- Enhancing Vulnerability Mining with Large Language Model(Jinghua Lian, Zhigang Wang, Yuyu He, Yanli Chen, Youyu Liu, Hui Lu, 2025, 2025 IEEE 10th International Conference on Data Science in Cyberspace (DSC))
- LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights(Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, Jeff Huang, 2025, ACM Computing Surveys)
- Should We Evaluate LLM Based Security Analysis Approaches on Open Source Systems?(Kohei Dozono, Jonas Engesser, B. Hummel, Tobias Roehm, Alexander Pretschner, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE))
- It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective(Yunqian Wang, Xiaohong Li, Yao Zhang, Yuekang Li, Zhiping Zhou, Ruitao Feng, 2025, 2025 32nd Asia-Pacific Software Engineering Conference (APSEC))
- ChatGPT for Vulnerability Detection, Classification, and Repair: How Far Are We?(Michael Fu, C. Tantithamthavorn, Van-Anh Nguyen, Trung Le, 2023, 2023 30th Asia-Pacific Software Engineering Conference (APSEC))
- LLM-based Vulnerability Discovery through the Lens of Code Metrics(Felix Weissberg, Lukas Pirch, Erik Imgrund, Jonas Möller, Thorsten Eisenhofer, Konrad Rieck, 2025, ArXiv)
- The Impact of Prompt Language and Representation on LLM Reasoning: A Multilingual Empirical Study(Lina Ji, Linghua Yao, Wei Xu, J. Min, Li Yuan, 2026, IEEE Access)
- An Empirical Evaluation of LLM-Based Approaches for Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems(Md Hasan Saju, M. Muhtadi, Akramul Azim, 2025, 2025 IEEE International Conference on Collaborative Advances in Software and COmputiNg (CASCON))
- A Comparative Study of Machine Learning and Large Language Models for SQL and NoSQL Injection Vulnerability Detection(Nargiz Maligazhdarova, Avinash B. M., A. Mukasheva, D. Yedilkhan, Aidos Askhatuly, A. Berdyshev, 2025, 2025 IEEE 5th International Conference on Smart Information Systems and Technologies (SIST))
- Software Vulnerability Detection Using LLM: Does Additional Information Help?(Samiha Shimmi, Yash Saini, M. Schaefer, Hamed Okhravi, Mona Rahimi, 2024, 2024 Annual Computer Security Applications Conference Workshops (ACSAC Workshops))
- Transformer-based models application for bug detection in source code(Illia Vokhranov, Bogdan Bulakh, 2024, Technology audit and production reserves)
- Multi-source cross-domain vulnerability detection based on code pre-trained model(Yang Cao, Yunwei Dong, 2025, Inf. Softw. Technol.)
- VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities(Weizhe Wang, Wei Ma, Qiang Hu, Yao Zhang, Jian-gang Sun, Bin Wu, Yang Liu, Guangquan Xu, Lingxiao Jiang, 2025, ArXiv)
- Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection(Yuejun Guo, Constantinos Patsakis, Qiang Hu, Qiang Tang, Fran Casino, 2024, No journal)
- CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection(Richard A. Dubniczky, Krisztofer Zolt'an Horv'at, Tamás Bisztray, M. Ferrag, L. C. Cordeiro, Norbert Tihanyi, 2025, No journal)
- How to Select Pre-Trained Code Models for Reuse? A Learning Perspective(Zhangqian Bi, Yao Wan, Zhaoyang Chu, Yufei Hu, Junyi Zhang, Hongyu Zhang, Guandong Xu, Hai Jin, 2025, 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER))
- When Code Crosses Borders: A Security-Centric Evaluation of LLM-based Code Translation(Hailong Chang, Guozhu Meng, Shuhui Xiao, Kai Chen, Kun Sun, Yilin Li, 2025, ArXiv)
结构感知、多模态表征与模型微调优化
这组文献探讨了如何通过引入代码结构信息(AST、CFG、CPG)、多模态融合(图像、图表征)以及高效微调技术(LoRA、SFT、RLHF)来增强模型对复杂代码逻辑的底层理解能力。
- GraBit: A Sequential Model-Based Framework for Smart Contract Vulnerability Detection(Huijuan Zhu, Kaixuan Yang, Liangmin Wang, Zhi-cheng Xu, Victor S. Sheng, 2023, 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE))
- Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models(Aidan Z. H. Yang, Haoye Tian, He Ye, Ruben Martins, Claire Le Goues, 2024, ArXiv)
- CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection(Ruijun Feng, Hammond Pearce, Pietro Liguori, Yulei Sui, 2025, IEEE Transactions on Software Engineering)
- DefectHunter: A Novel LLM-Driven Boosted-Conformer-based Code Vulnerability Detection Mechanism(Jin Wang, Zishan Huang, Hengli Liu, Nianyi Yang, Yinhao Xiao, 2023, ArXiv)
- StagedVulBERT: Multigranular Vulnerability Detection With a Novel Pretrained Code Model(Yuan Jiang, Yujian Zhang, Xiaohong Su, Christoph Treude, Tiantian Wang, 2024, IEEE Transactions on Software Engineering)
- MACD: Source Code Vulnerability Detection Method Integrating Mamba and Attention(Jingen Li, 2025, 2025 7th International Conference on Electronics and Communication, Network and Computer Technology (ECNCT))
- Abundant Modalities Offer More Nutrients: Multi-Modal-Based Function-Level Vulnerability Detection(Chao Ni, Xin Yin, Xinrui Li, Xiaodan Xu, Zhi Yu, 2025, ACM Transactions on Software Engineering and Methodology)
- Enhancing vulnerability detection by fusing code semantic features with LLM-generated explanations(Zhenzhou Tian, Minghao Li, Jiaze Sun, Yanping Chen, Lingwei Chen, 2025, Inf. Fusion)
- Codesentry: Revolutionizing Real-Time Software Vulnerability Detection With Optimized GPT Framework(Angel Jones, Marwan Omar, 2024, Land Forces Academy Review)
- Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization(Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino, 2025, ArXiv)
- Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation(Weiliang Qi, Jiahao Cao, Darsh Poddar, Sophia Li, Xinda Wang, 2024, ArXiv)
- Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks(Zhongxin Liu, Zhijie Tang, Junwei Zhang, Xin Xia, Xiaohu Yang, 2024, 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE))
- VuL-MCBERT: A Vulnerability Detection Method Based on Self-Supervised Contrastive Learning(Yifan Wang, Penghao Liu, Yang Zhang, 2025, 2025 5th International Conference on Artificial Intelligence, Big Data and Algorithms (CAIBDA))
- LPASS: Linear Probes as Stepping Stones for vulnerability detection using compressed LLMs(Luis Ibañez-Lissen, L. González-Manzano, José María de Fuentes, Nicolas Anciaux, 2025, J. Inf. Secur. Appl.)
- DB-CBIL: A DistilBert-Based Transformer Hybrid Model Using CNN and BiLSTM for Software Vulnerability Detection(Ahmed Bahaa, Aya El-Rahman Kamal, Hanan Fahmy, Amr S. Ghoneim, 2024, IEEE Access)
- AI-Powered Vulnerability Detection in Code Using BERT-Based LLM with Transparency Measures(Ali E. Takieldeen, Merna Gamal, Abdelrahman Salah, M. Walid, A. M. Mahmoud, K. Ahmed, Saleh Tamer Abdallah, Nada Samir Elsayed, Aida A. Nasr, 2025, 2025 International Telecommunications Conference (ITC-Egypt))
- TOOLS FOR IDENTIFYING INFORMATION SECURITY VULNERABILITIES BASED ON DATA FROM INTERNET RESOURCES(A. Samuilova, 2025, Herald of the Kazakh-British Technical University)
- HeVulD: A Static Vulnerability Detection Method Using Heterogeneous Graph Code Representation(Yuanming Huang, Mingshu He, Xiaojuan Wang, Jie Zhang, 2024, IEEE Transactions on Information Forensics and Security)
- Detecting Source Code Vulnerabilities Using Fine-Tuned Pre-Trained LLMs(Jin Zhu, Hui Ge, Yun Zhou, Xiao Jin, Rui Luo, Yanchen Sun, 2024, 2024 IEEE 17th International Conference on Signal Processing (ICSP))
- Fine-Tuning Pre-trained Model with Optimizable Prompt Learning for Code Vulnerability Detection(Wei Chang, Chunyang Ye, Hui Zhou, 2024, 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE))
- VLD-LP: Vulnerability Detection and Root Cause Localization with Large Language Model and Parameter-efficient Language Model Tuning(Huanyu Wu, Yunlu Tu, Fan Huang, Dongrui Wu, 2025, 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC))
- Enhanced Vulnerability Localization: Harmonizing Task-Specific Tuning and General LLM Prompting(Wentong Tian, Yuanzhang Lin, Xiang Gao, Hailong Sun, 2025, 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME))
- Optimizing software vulnerability detection using RoBERTa and machine learning(Cho Xuan Do, N. Luu, Phuong Thi Lan Nguyen, 2024, Automated Software Engineering)
- Detecting Rust Code Vulnerabilities Through Transfer Learning(I. A. Khan, Yuxuan Luo, Weifeng Xu, Dianxiang Xu, 2026, International Journal of Software Engineering and Knowledge Engineering)
- Large language model based hybrid framework for automatic vulnerability detection with explainable AI for cybersecurity enhancement(Nihala Basheer, Shareeful Islam, Mohammed K. S. Alwaheidi, H. Mouratidis, Spyridon Papastergiou, 2025, Integrated Computer-Aided Engineering)
- XGV-BERT: Leveraging contextualized language model and graph neural network for efficient software vulnerability detection(Vu Le Anh Quan, C. Phat, Kiet Van Nguyen, Phan The Duy, V. Pham, 2023, The Journal of Supercomputing)
- CTVD: Collaborative Training of Deep Learning and Large Model for C/C++ Source Code Vulnerability Detection(Yaning Zheng, Dongxia Wang, Huayang Cao, Cheng Qian, Honglin Zhuang, 2025, 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC))
- MANDO-LLM: Heterogeneous Graph Transformers with Large Language Models for Smart Contract Vulnerability Detection(Nhat-Minh Nguyen, Hoang H. Nguyen, Long Le Thanh, Zahra Ahmadi, Thanh-Nam Doan, Daoyuan Wu, Lingxiao Jiang, 2025, ACM Transactions on Software Engineering and Methodology)
- Multimodal Fusion for Vulnerability Detection: Integrating Sequence and Graph-Based Analysis with LLM Augmentation(N. Ngan, Nghi Hoang Khoa, Van-Hau Pham, Phan The Duy, 2025, 2025 International Conference on Multimedia Analysis and Pattern Recognition (MAPR))
- Harnessing the Power of LLMs in Source Code Vulnerability Detection(A. A. Mahyari, 2024, MILCOM 2024 - 2024 IEEE Military Communications Conference (MILCOM))
- Automated Software Vulnerability Detection Using CodeBERT and Convolutional Neural Network(Rabaya Sultana Mim, Abdus Satter, Toukir Ahammed, Kazi Sakib, 2024, No journal)
- TransBug: Transformer-Assisted Bug Detection and Diagnosis in Deep Neural Networks(Abdul Haq Ayantayo, Johnson Chen, Muhammad Anas Raza, Mohammad Wardat, 2024, 2024 IEEE International Conference on Big Data (BigData))
- JSVulExplorer: a JavaScript vulnerability detection model based on transfer learning(S. Chen, Nan Jiang, Zheng Wu, Zichen Wang, 2023, No journal)
- On the Effect of Token Merging on Pre-trained Models for Code(M. Saad, Hao Li, Tushar Sharma, Ahmed E. Hassan, 2025, ArXiv)
- GRACE: Empowering LLM-based software vulnerability detection with graph structure and in-context learning(Guilong Lu, Xiaolin Ju, Xiang Chen, Wenlong Pei, Zhilong Cai, 2024, J. Syst. Softw.)
- Vulnerability Detection via Multiple-Graph-Based Code Representation(Fangcheng Qiu, Zhongxin Liu, Xing Hu, Xin Xia, Gang Chen, Xinyu Wang, 2024, IEEE Transactions on Software Engineering)
- FuncVul: An Effective Function Level Vulnerability Detection Model using LLM and Code Chunk(Sajal Halder, Muhammad Ejaz Ahmed, Seyit Ahmet Camtepe, 2025, No journal)
- AIDetectVul: Software Vulnerability Detection Method Based on Feature Fusion of Pre-trained Models(Shiying Xue, Lin Li, Tao Li, Haodong Chen, Jiapan Li, Yangqing Qin, 2025, 2025 5th International Conference on Consumer Electronics and Computer Engineering (ICCECE))
- VulTrLM: LLM-assisted vulnerability detection via AST decomposition and comment enhancement(Shaobo Zhang, Qianzhi Wang, Qin Liu, Entao Luo, Tao Peng, 2025, Empirical Software Engineering)
- The Richer Representation Fallacy: Are We Just Adding Noise to LLM-based Software Vulnerability Detectors?(Hazim Hanif, S. Maffeis, Nor Badrul Anuar, 2025, 2025 IEEE International Conference on Computing (ICOCO))
- DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection(Zhonghao Jiang, Weifeng Sun, Xiaoyan Gu, Jiaxi Wu, Tao Wen, Haibo Hu, Meng Yan, 2024, Proceedings of the 15th Asia-Pacific Symposium on Internetware)
- EFVD: A Framework of Source Code Vulnerability Detection via Fusion of Enhanced Graph Representation Learning and Pre-trained Transformer-Based Model(Lei Tian, Cheng Zhang, 2025, Proceedings of the 2025 5th International Conference on Computer Network Security and Software Engineering)
- LOSVER: Line-Level Modifiability Signal-Guided Vulnerability Detection and Classification(Doha Nam, Jong-Chan Baik, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE))
- RXF-CBC: a dual-transformer and ensemble-based framework for robust bug categorization and prioritization in imbalanced software repositories(Deepshikha Chhabra, Raman Chadha, 2025, International Journal of Information Technology)
- Vulnerability Detection Based on Pre-trained Code Language Model and Convolutional Neural Network(Tingfeng Liao, Lu Lu, Zhihong Liang, Siliang Suo, 2024, No journal)
提示工程、RAG 与多智能体协作推理
该组研究关注如何通过优化交互策略提升LLM的逻辑推理。包括设计代码特定的提示词、引入检索增强生成(RAG)以减少幻觉,以及构建多智能体(Multi-agent)协作流来模拟专家审计过程。
- Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents(Ratnadira Widyasari, M. Weyssow, I. Irsan, Han Wei Ang, F. Liauw, Eng Lieh Ouh, Lwin Khin Shar, Hong Jin Kang, David Lo, 2025, ArXiv)
- VulnTeam: A Team Collaboration Framework for LLM-based Vulnerability Detection(Jiayuan Li, Lei Cui, Wenyan Yu, Haiqiang Fei, Feng Cheng, Hongsong Zhu, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- LLM Agentic Workflow for Automated Vulnerability Detection and Remediation in Infrastructure-as-Code(Dheer Toprani, Vijay K. Madisetti, 2025, IEEE Access)
- From LLMs to Agents: A Comparative Evaluation of LLMs and LLM-based Agents in Security Patch Detection(Junxiao Han, Zheng Yu, Lingfeng Bao, Jiakun Liu, Yao Wan, Jianwei Yin, Shuiguang Deng, Song Han, 2025, ArXiv)
- Exploration On Prompting LLM With Code-Specific Information For Vulnerability Detection(Zhihong Liu, Zezhou Yang, Qing Liao, 2024, 2024 IEEE International Conference on Software Services Engineering (SSE))
- Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG(Xueying Du, G. Zheng, Kaixin Wang, Yi Zou, Yujia Wang, Wentai Deng, Jiayi Feng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, Yiling Lou, 2024, ACM Transactions on Software Engineering and Methodology)
- MulVul: Retrieval-augmented Multi-Agent Code Vulnerability Detection via Cross-Model Prompt Evolution(Zihan Wu, Jie Xu, Yun Peng, Chun Yong Chong, Xiaohua Jia, 2026, ArXiv)
- LegacyGuard: Hybrid LLM, RAG, and Static Analysis for Multi-Lingual Vulnerability Detection in Legacy Codebases(Y. Potdar, 2025, International Journal for Research in Applied Science and Engineering Technology)
- Steering Large Language Models for Vulnerability Detection(Jiayuan Li, Lei Cui, Jie Zhang, Haiqiang Fei, Yu Chen, Hongsong Zhu, 2025, ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- VulSolver: Vulnerability Detection via LLM-Driven Constraint Solving(Xiang Li, Yue Su, Jiahao Liu, Zhiwei Lin, Yunqing Hou, Peiming Gao, Yuanchao Zhang, 2025, ArXiv)
- VulnLLM-R: Specialized Reasoning LLM with Agent Scaffold for Vulnerability Detection(Yuzhou Nie, Hongwei Li, Chengquan Guo, R. Jiang, Zhun Wang, Bo Li, D. Song, Wenbo Guo, 2025, ArXiv)
- Large Language Model for Vulnerability Detection: Emerging Results and Future Directions(Xin Zhou, Ting Zhang, David Lo, 2024, 2024 IEEE/ACM 46th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER))
- Detecting Code Vulnerabilities using LLMs(Larry Huynh, Yinghao Zhang, Djimon Jayasundera, Woojin Jeon, Hyoungshick Kim, Tingting Bi, Jin B. Hong, 2025, 2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN))
- VulScribeR: Exploring RAG-based Vulnerability Augmentation with LLMs(Seyed Shayan Daneshvar, Yu Nong, Xu Yang, Shaowei Wang, Haipeng Cai, 2024, ACM Transactions on Software Engineering and Methodology)
LLM 与传统程序分析及模糊测试的混合集成
这组文献关注将LLM的语义推理能力与静态分析(Semgrep、LLVM IR)、动态分析(Fuzzing)、符号执行等传统工具相结合,以降低误报率并提升漏洞挖掘的自动化程度。
- DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection(Yanjing Yang, Xin Zhou, Runfeng Mao, Jinwei Xu, Lanxin Yang, Yu Zhangm, Haifeng Shen, He Zhang, 2024, J. Syst. Softw.)
- ResVul-LLM: A Neurosymbolic Framework Combining Large Language Models and Symbolic Reasoning for C/C++ Vulnerability Analysis(Md. Shazzad Hossain Shaon, Shapna Akter, Alfredo Cuzzocrea, 2025, 2025 IEEE International Conference on Big Data (BigData))
- Software Vulnerability Detection Based on LLM(Chengcheng Li, Bo Guan, Qilong Zheng, Binbin Liu, Ze Zhang, 2025, 2025 IEEE 7th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC))
- ♪ With a Little Help from My (LLM) Friends: Enhancing Static Analysis with LLMs to Detect Software Vulnerabilities(Amy Munson, Juanita Gomez, Alvaro A. Cárdenas, 2025, 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code))
- LLM-Based Unknown Function Automated Modeling in Sensor-Driven Systems for Multi-Language Software Security Verification(Liangjun Deng, Qi Zhong, Jingcheng Song, Hang Lei, Wenjuan Li, 2025, Sensors (Basel, Switzerland))
- All You Need Is A Fuzzing Brain: An LLM-Powered System for Automated Vulnerability Detection and Patching(Ze Sheng, Qingxiao Xu, Jianwei Huang, M. Woodcock, Heqing Huang, A. Donaldson, Guofei Gu, Jeff Huang, 2025, ArXiv)
- One Bug, Hundreds Behind: LLMs for Large-Scale Bug Discovery(Qiushi Wu, Yue Xiao, Dhilung Kirat, Kevin Eykholt, Jiyong Jang, D. Schales, 2025, ArXiv)
- Research on the Vulnerability Identification Efficiency of Enhanced Reverse-Analyzed LLM Model in Binary Program Fuzzy Testing(Shiyin Lin, 2025, 2025 IEEE 5th International Conference on Data Science and Computer Application (ICDSCA))
- Large Language Models Based JSON Parser Fuzzing for Bug Discovery and Behavioral Analysis(Zhiyuan Zhong, Zhezhen Cao, Zhanwei Zhang, 2024, ArXiv)
- LProtector: An LLM-driven Vulnerability Detection System(Ze Sheng, Fenghua Wu, Xiangwu Zuo, Chao Li, Yuxin Qiao, Lei Hang, 2024, ArXiv)
- Enhancing Large Language Models with Faster Code Preprocessing for Vulnerability Detection(Jos'e Gonccalves, Miguel Silva, Eva Maia, Isabel Praça, 2025, ArXiv)
- Towards Generalizable Instruction Vulnerability Prediction via LLM-Enhanced Code Representation(Bao Wen, Jingjing Gu, Jingxuan Zhang, Yang Liu, Pengfei Yu, Yanchao Zhao, 2025, 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE))
- QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration(Junze Hu, Xiangyu Jin, Yi Zeng, Yuling Liu, Yunpeng Li, Dan Du, Kaiyu Xie, Hongsong Zhu, 2025, ArXiv)
- SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection(Xinjie Wen, Cuiyun Gao, Shuzheng Gao, Yang Xiao, Michael R. Lyu, 2024, Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis)
- Enhanced LLM-Based Framework for Predicting Null Pointer Dereference in Source Code(Md. Fahim Sultan, Tasmin Karim, Md. Shazzad Hossain Shaon, Mohammad Wardat, Mst. Shapna Akter, 2024, ArXiv)
面向智能合约与 Web3 安全的专项检测
这组论文专门针对区块链智能合约的安全性,利用LLM检测重入攻击、逻辑漏洞等,并结合事件驱动分析、双视图感知及强化学习(DPO)对Web3领域语义进行深度适配。
- Robust Vulnerability Detection in Solidity-Based Ethereum Smart Contracts Using Fine-Tuned Transformer Encoder Models(Thi-Thu-Huong Le, Jaehyun Kim, Sangmyeong Lee, Howon Kim, 2024, IEEE Access)
- LLM-SmartAudit: Advanced Smart Contract Vulnerability Detection(Zhiyuan Wei, Jing Sun, Zijian Zhang, Xianhao Zhang, 2024, ArXiv)
- HMF: Enhancing reentrancy vulnerability detection and repair with a hybrid model framework(Mengliang Li, Q. Shen, Xiaoxue Ren, Han Fu, Zhuo Li, Jianling Sun, 2025, Automated Software Engineering)
- SCoVerLLM: Smart Contract Vulnerability Detection via LLM-Based In-Context and Chain-of-Thought Prompts(Kaiqi Yang, Xiguo Gu, Weili Xu, Zhanqi Cui, Liwei Zheng, 2025, 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC))
- CKG-LLM: LLM-Assisted Detection of Smart Contract Access Control Vulnerabilities Based on Knowledge Graphs(Xiaoqi Li, Hailu Kuang, Wenkai Li, Zongwei Li, Shipeng Ye, 2025, ArXiv)
- Collaborative LLM Reasoning for Vulnerability Detection in Smart Contracts(Amirreza Samari, Parsa Hedayatnia, Seyyed Javad Bozorg Zadeh Razavi, Mohammad Allahbakhsh, Haleh Amintoosi, 2025, 2025 15th International Conference on Computer and Knowledge Engineering (ICCKE))
- Agent4Vul: multimodal LLM agents for smart contract vulnerability detection(Wanqing Jie, Wangjie Qiu, Haofu Yang, Muyuan Guo, Xinpeng Huang, Tianyu Lei, Qinnan Zhang, Hongwei Zheng, Zhiming Zheng, 2025, Science China Information Sciences)
- Smart Contract Vulnerability Detection: The Role of Large Language Model (LLM)(Biagio Boi, Christian Esposito, Sokjoon Lee, 2024, ACM SIGAPP Applied Computing Review)
- ETrace : Event-Driven Vulnerability Detection in Smart Contracts via LLM-Based Trace Analysis(Chenyang Peng, Haijun Wang, Yin Wu, Hao Wu, Ming Fan, Yitao Zhao, Ting Liu, 2025, Proceedings of the 16th International Conference on Internetware)
- Enhancing Smart Contract Vulnerability Detection in DApps Leveraging Fine-Tuned LLM(Jiuyang Bu, Wenkai Li, Zongwei Li, Zeng Zhang, Xiaoqi Li, 2025, ArXiv)
- LLM Assisted Dual-View Awareness Framework for Smart Contract Vulnerability Detection(Jianrong Wang, Yuru Yue, Dengcheng Hu, Qi Li, Jinghui Li, Wenyu Zhu, 2025, 2025 IEEE 36th International Symposium on Software Reliability Engineering (ISSRE))
- Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection(Lei Yu, Zhirong Huang, Hang Yuan, Shiqi Cheng, Li Yang, Fengjun Zhang, Chenjie Shen, Jiajia Ma, Jingyuan Zhang, Junyi Lu, Chun Zuo, 2025, Proceedings of the ACM on Software Engineering)
- Prompt Engineering vs. Fine-Tuning for LLM-Based Vulnerability Detection in Solana and Algorand Smart Contracts(Biagio Boi, Christian Esposito, 2025, 2025 7th International Conference on Blockchain Computing and Applications (BCCA))
- Advanced Smart Contract Vulnerability Detection via LLM-Powered Multi-Agent Systems(Zhiyuan Wei, Jing Sun, Yuqiang Sun, Ye Liu, Daoyuan Wu, Zijian Zhang, Xianhao Zhang, Meng Li, Yang Liu, Chunmiao Li, Mingchao Wan, Jin Dong, Liehuang Zhu, 2025, IEEE Transactions on Software Engineering)
- Large Language Model-Powered Smart Contract Vulnerability Detection: New Perspectives(Sihao Hu, Tiansheng Huang, Fatih Ilhan, S. Tekin, Ling Liu, 2023, 2023 5th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA))
特定工业场景与底层硬件漏洞挖掘
该组研究将LLM应用于特定工业或硬件环境,包括云原生IaC配置、工控协议(ICS)、反编译二进制代码以及SoC硬件描述语言(Verilog/RTL)的漏洞发现。
- VerilogLAVD: LLM-Aided Rule Generation for Vulnerability Detection in Verilog(X. Long, Yingjie Xia, Xiyuan Chen, Li Kuang, 2025, ArXiv)
- LLM-CloudSec: Large Language Model Empowered Automatic and Deep Vulnerability Analysis for Intelligent Clouds(Daipeng Cao, W. Jun, 2024, IEEE INFOCOM 2024 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS))
- LLMIF: Augmented Large Language Model for Fuzzing IoT Devices(Jincheng Wang, Le Yu, Xiapu Luo, 2024, 2024 IEEE Symposium on Security and Privacy (SP))
- Automated Bug Discovery in Cloud Infrastructure-as-Code Updates with LLM Agents(Yiming Xiang, Zhenning Yang, Jingjia Peng, H. Bauer, Patrick Tser Jern Kon, Yiming Qiu, Ang Chen, 2025, 2025 IEEE/ACM International Workshop on Cloud Intelligence & AIOps (AIOps))
- DecLLM: LLM-Augmented Recompilable Decompilation for Enabling Programmatic Use of Decompiled Code(Wai Kin Wong, Daoyuan Wu, Huaijin Wang, Zongjie Li, Zhibo Liu, Shuai Wang, Qiyi Tang, Sen Nie, Shi Wu, 2025, Proceedings of the ACM on Software Engineering)
- MALF: A Multi-Agent LLM Framework for Intelligent Fuzzing of Industrial Control Protocols(Bowei Ning, Xuejun Zong, Kan He, 2025, ArXiv)
- VulBinLLM: LLM-powered Vulnerability Detection for Stripped Binaries(Nasir Hussain, Hao Chen, Chanh Tran, Philip Huang, Zhuohao Li, Pravir Chugh, William Chen, Ashish Kundu, Yuan Tian, 2025, ArXiv)
- LLM-Guided Security Claim Generation for Autonomous Vehicle in Smart Urban Systems(Ali Louati, Hassen Louati, Elham Kariri, 2025, IEEE Access)
- BugWhisperer: Fine-Tuning LLMs for SoC Hardware Vulnerability Detection(Shams Tarek, Dipayan Saha, Sujan Kumar Saha, Farimah Farahmandi, 2025, 2025 IEEE 43rd VLSI Test Symposium (VTS))
- LLM-Based Approach for Buffer Overflow Detection in Source Code(Emran Kaanan, Tasmin Karim, Md. Shazzad Hossain Shaon, Md. Fahim Sultan, Alfredo Cuzzocrea, Mst. Shapna Akter, 2024, 2024 27th International Conference on Computer and Information Technology (ICCIT))
漏洞自动修复、分类与生命周期管理
这组文献扩展了检测任务,涵盖了漏洞发现后的自动补丁生成(AVR)、CWE/CVE分类、CVSS评分预测、漏洞传播影响分析以及缺陷报告的自动分诊。
- Pre-Trained Model-Based Automated Software Vulnerability Repair: How Far are We?(Quanjun Zhang, Chunrong Fang, Bo-Chen Yu, Weisong Sun, Tongke Zhang, Zhenyu Chen, 2023, IEEE Transactions on Dependable and Secure Computing)
- Automated Bug Report Classification Using BERT: A Transformer-Based Approach for Efficient Bug Triage in Large-Scale Software Projects(B. Caldeira, Nuno Pombo, 2025, 2025 25th International Conference on Software Quality, Reliability and Security (QRS))
- Can LLMs Classify CVEs? Investigating LLMs Capabilities in Computing CVSS Vectors(Francesco Marchiori, Denis Donadel, Mauro Conti, 2025, 2025 IEEE Symposium on Computers and Communications (ISCC))
- On the Effectiveness of Instruction-Tuning Local LLMs for Identifying Software Vulnerabilities(Sangryu Park, G. Ko, Homook Cho, 2025, ArXiv)
- Detecting Code Comment Inconsistencies using LLM and Program Analysis(Yichi Zhang, 2024, Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering)
- Revisiting Vulnerability Patch Localization: An Empirical Study and LLM-Based Solution(Haoran Xu, Chen Zhi, Junxiao Han, Xinkui Zhao, Jianwei Yin, Shuiguang Deng, 2025, ArXiv)
- Large Language Models-Driven Bug Localization Framework for Automated Debugging(J. Dhanalakshmi, Darshana A. Naik, C. Evangeline, A. Shanthi, Maya Eapen, JoshuvaArockia Dhanraj, 2025, 2025 2nd International Conference on Artificial Intelligence and Knowledge Discovery in Concurrent Engineering (ICECONF))
- VulLibGen: Generating Names of Vulnerability-Affected Packages via a Large Language Model(Tianyu Chen, Lin Li, ZhuLiuchuan ZhuLiuchuan, Zongyang Li, Xueqing Liu, Guangtai Liang, Qianxiang Wang, Tao Xie, 2023, No journal)
- A Vulnerability Propagation Impact Analysis Approach Based on Code Semantics with LLM(Xun Long, Jun Ai, Jieyu Zhao, Yingxiang Huang, 2024, 2024 11th International Conference on Dependable Systems and Their Applications (DSA))
- VTT-LLM: Advancing Vulnerability-to-Tactic-and-Technique Mapping through Fine-Tuning of Large Language Model(Chenhui Zhang, Le Wang, Dunqiu Fan, Junyi Zhu, Tang Zhou, Liyi Zeng, Zhaohua Li, 2024, Mathematics)
- Improving Software Reliability Through Bug Detection and Automated Error Repair Using Transformer-Based Models(Nedal Nwasra, Jamal Zaraqu, Zahid Hussain Qaisar, 2025, 2025 1st International Conference on Computational Intelligence Approaches and Applications (ICCIAA))
- APPATCH: Automated Adaptive Prompting Large Language Models for Real-World Software Vulnerability Patching(Yu Nong, Haoran Yang, Long Cheng, Hongxin Hu, Haipeng Cai, 2024, No journal)
- Enhancing Automated Vulnerability Repair Through Dependency Embedding and Pattern Store(Qingao Dong, Yuanzhang Lin, Hailong Sun, Xiang Gao, 2025, 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER))
- Poster: A Case Study on Automated Vulnerability Repair Using Pre-Trained Language Models(Woorim Han, Miseon Yu, Younghan Lee, Hyungon Moon, Y. Paek, 2025, 2025 Silicon Valley Cybersecurity Conference (SVCC))
- Code Vulnerability Repair with Large Language Model Using Context-Aware Prompt Tuning(Arshiya Khan, Guannan Liu, Xing Gao, 2024, 2025 IEEE Security and Privacy Workshops (SPW))
- Fine Tuning Large Language Model for Secure Code Generation(Junjie Li, Aseem Sangalay, Cheng Cheng, Yuan Tian, Jinqiu Yang, 2024, 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (Forge) Conference Acronym:)
LLM 自身的安全性、鲁棒性与对抗攻防
该组研究关注LLM作为安全工具时的脆弱性,包括对抗性攻击(代码混淆、提示注入)、数据投毒、GPU软错误可靠性,以及如何通过对抗训练提升模型自身的防御能力。
- Improving Large Language Model Safety with Contrastive Representation Learning(Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin, 2025, No journal)
- Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model(Hao Cheng, Erjia Xiao, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, Renjing Xu, 2024, ArXiv)
- Analysis of LLM Vulnerability to GPU Soft Errors: An Instruction-Level Fault Injection Study(Duo Chai, Zizhen Liu, Shuhuai Wang, Songwei Pei, Cheng Liu, Huawei Li, Shangguang Wang, 2025, ArXiv)
- Security Charter Effectiveness in Large Language Model Code Generation: A Multi-Phase Experimental Analysis Revealing Task-Dependent Responsiveness and Architectural Differences(Shivani Shukla, Himanshu Joshi, 2025, 2025 IEEE International Conference on Data Mining Workshops (ICDMW))
- CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent(Liang-bo Ning, Shijie Wang, Wenqi Fan, Qing Li, Xin Xu, Hao Chen, Feiran Huang, 2024, Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
- Mining Glitch Tokens in Large Language Models via Gradient-based Discrete Optimization(Zihui Wu, Haichang Gao, Ping Wang, Shudong Zhang, Zhaoxiang Liu, Shiguo Lian, 2024, ArXiv)
- Statement-Level Adversarial Attack on Vulnerability Detection Models via Out-of-Distribution Features(Xiaohu Du, Ming Wen, Haoyu Wang, Zichao Wei, Hai Jin, 2025, Proceedings of the ACM on Software Engineering)
- Adversarial Training for Robustness Enhancement in LLM-Based Code Vulnerability Detection(Ying Zhao, Xin Guan, 2025, 2025 IEEE 7th International Conference on Communications, Information System and Computer Engineering (CISCE))
- A Systematic Study of Code Obfuscation Against LLM-based Vulnerability Detection(Xiao Li, Yue Li, Hao Wu, Yue Zhang, Yechao Zhang, Fengyuan Xu, Sheng Zhong, 2025, ArXiv)
- TrustGLM: Evaluating the Robustness of GraphLLMs Against Prompt, Text, and Structure Attacks(Qihai Zhang, Xin Sheng, Yuanfu Sun, Qiaoyu Tan, 2025, Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2)
- Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering(Hwan Chang, Yumin Kim, Yonghyun Jun, Hwanhee Lee, 2025, No journal)
- SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model(Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang, 2025, No journal)
- An Engorgio Prompt Makes Large Language Model Babble on(Jianshuo Dong, Ziyuan Zhang, Qingjie Zhang, Han Qiu, Tianwei Zhang, Hao Wang, Hewu Li, Qi Li, Chao Zhang, Ke Xu, 2024, ArXiv)
- PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning(Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, Fazl Barez, 2024, ArXiv)
- InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents(Qiusi Zhan, Zhixiang Liang, Zifan Ying, Daniel Kang, 2024, ArXiv)
- Milo: Attacking Deep Pre-trained Model for Programming Languages Tasks with Anti-analysis Code Obfuscation(Leo Song, Steven H. H. Ding, 2023, 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC))
- CodeBERT‐Attack: Adversarial attack against source code deep learning models via pre‐trained model(Huangzhao Zhang, Shuai Lu, Zhuo Li, Zhi Jin, Lei Ma, Yang Liu, Ge Li, 2023, Journal of Software: Evolution and Process)
知识增强、持续学习与跨领域基础研究
这组文献探讨了如何通过知识图谱、持续学习、知识蒸馏等手段优化LLM的长期演化能力,并包含了一些探讨Transformer基础架构及跨领域逻辑处理的通用性研究。
- Enhancing Continual Learning for Software Vulnerability Prediction: Addressing Catastrophic Forgetting via Hybrid‑Confidence‑Aware Selective Replay for Temporal LLM Fine-Tuning(Xuhui Dou, Hayretdin Bahşi, Alejandro Guerra-Manzanares, 2026, Proceedings of the 12th International Conference on Information Systems Security and Privacy)
- Resource-efficient automatic software vulnerability assessment via knowledge distillation and particle swarm optimization(Chaoyang Gao, Xiang Chen, Jiyu Wang, Jibin Wang, Guang Yang, 2025, ArXiv)
- Vulnerability to Stability: Scalable Large Language Model in Queue-Based Web Service(Md Abdul Barek, Md Bajlur Rashid, Md. Mostafizur Rahman, A.B.M Kamrul Islamc Riad, Guillermo A. Francia, Hossain Shahriar, S. Ahamed, 2025, 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC))
- Dynamic Vulnerability Knowledge Graph Construction via Multi-Source Data Fusion and Large Language Model Reasoning(Ruitong Liu, Yaxuan Xie, Zexu Dang, Jinyi Hao, Xiaowen Quan, Yongcai Xiao, Chunlei Peng, 2025, Electronics)
- Can An Old Fashioned Feature Extraction and A Light-weight Model Improve Vulnerability Type Identification Performance?(H. Vo, Son Nguyen, 2023, Inf. Softw. Technol.)
- Deep Semantic Modeling of Cyber Vulnerabilities via Ensemble Learning and Large-Scale Embeddings(Yiming Yu, 2025, 2025 International Conference on Artificial Intelligence, Human-Computer Interaction and Natural Language Processing (ICAHN))
- GPTVD: vulnerability detection and analysis method based on LLM’s chain of thoughts(Yinan Chen, Yuan Huang, Xiangping Chen, Pengfei Shen, Lei Yun, 2025, Automated Software Engineering)
- Attention Is All You Need for LLM-based Code Vulnerability Localization(Yue Li, Xiao Li, Hao Wu, Yue Zhang, Xiuzhen Cheng, Sheng Zhong, Fengyuan Xu, 2024, ArXiv)
- Large Language Model driven Policy Exploration for Recommender Systems(Jie Wang, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose, 2025, Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining)
- A fine-tuned large language model based molecular dynamics agent for code generation to obtain material thermodynamic parameters(Zhuo-Fan Shi, Chunxiao Xin, Tong Huo, Yun-Tao Jiang, Bowen Wu, Xing Chen, Wei Qin, Xinjian Ma, Gang Huang, Zhenyu Wang, Xiang Jing, 2025, Scientific Reports)
- Medical malpractice liability in large language model artificial intelligence: legal review and policy recommendations(David O. Shumway, Hayes J Hartman, 2024, Journal of Osteopathic Medicine)
- Advanced Deep Learning Models for Cloud-Based Bug Tracking and Software Defect Prediction: Integrating Transformer(Sathiyendran Ganesan, Venkata Sivakumar Musam, Nagendra Kumar Musham, 2025, International Journal of Multidisciplinary and Current Research)
- SBAN: A Framework & Multi-Dimensional Dataset for Large Language Model Pre-Training and Software Code Mining(Hamed Jelodar, Mohammad Meymani, Samita Bai, R. Razavi-Far, Ali A. Ghorbani, 2025, 2025 IEEE International Conference on Data Mining Workshops (ICDMW))
- Is a Large Language Model a Good Annotator for Event Extraction?(Ruirui Chen, Chengwei Qin, Weifeng Jiang, Dongkyu Choi, 2024, No journal)
- VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization(Youpeng Li, Fuxun Yu, Xinda Wang, 2025, ArXiv)
本报告系统性地整合了基于大语言模型(LLM)的漏洞挖掘与检测研究。当前研究已从早期的简单文本分类演进为深度语义与结构化逻辑理解的综合体系。核心趋势包括:1) 融合程序分析与图表征的混合架构成为主流,以弥补LLM在复杂逻辑上的短板;2) 提示工程与多智能体协作显著提升了推理的深度与准确性;3) 应用场景从通用软件扩展至智能合约、工控及硬件等垂直领域;4) 自动化闭环(从检测到修复)与模型自身的对抗鲁棒性成为新的研究热点。总体而言,LLM正引领漏洞分析从规则驱动向智能语义驱动的范式转型。
总计180篇相关文献
Decentralized applications (DApps) face significant security risks due to vulnerabilities in smart contracts, with traditional detection methods struggling to address emerging and machine-unauditable flaws. This paper proposes a novel approach leveraging fine-tuned Large Language Models (LLMs) to enhance smart contract vulnerability detection. We introduce a comprehensive dataset of 215 real-world DApp projects (4,998 contracts), including hard-to-detect logical errors like token price manipulation, addressing the limitations of existing simplified benchmarks. By fine-tuning LLMs (Llama3-8B and Qwen2-7B) with Full-Parameter Fine-Tuning (FFT) and Low-Rank Adaptation (LoRA), our method achieves superior performance, attaining an F1-score of 0.83 with FFT and data augmentation via Random Over Sampling (ROS). Comparative experiments demonstrate significant improvements over prompt-based LLMs and state-of-the-art tools. Notably, the approach excels in detecting non-machine-auditable vulnerabilities, achieving 0.97 precision and 0.68 recall for price manipulation flaws. The results underscore the effectiveness of domain-specific LLM fine-tuning and data augmentation in addressing real-world DApp security challenges, offering a robust solution for blockchain ecosystem protection.
No abstract available
Large Language Models (LLMs) have shown promise in software vulnerability detection, particularly on function-level benchmarks like Devign and BigVul. However, real-world detection requires interprocedural analysis, as vulnerabilities often emerge through multi-hop function calls rather than isolated functions. While repository-level benchmarks like ReposVul and VulEval introduce interprocedural context, they remain computationally expensive, lack pairwise evaluation of vulnerability fixes, and explore limited context retrieval, limiting their practicality. We introduce JitVul, a JIT vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits. Built from 879 CVEs spanning 91 vulnerability types, JitVul enables comprehensive evaluation of detection capabilities. Our results show that ReAct Agents, leveraging thought-action-observation and interprocedural context, perform better than LLMs in distinguishing vulnerable from benign code. While prompting strategies like Chain-of-Thought help LLMs, ReAct Agents require further refinement. Both methods show inconsistencies, either misidentifying vulnerabilities or over-analyzing security guards, indicating significant room for improvement.
Large Language Models are a promising tool for automated vulnerability detection, thanks to their success in code generation and repair. However, despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities? Current evaluations, which often assess models on isolated functions or files, ignore the broader execution and data-flow context essential for understanding vulnerabilities. This oversight leads to two types of misleading outcomes: incorrect conclusions and flawed rationales, collectively undermining the reliability of prior assessments. Therefore, in this paper, we challenge three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales. We argue that these beliefs are artifacts of context-deprived evaluations. To address this, we propose CORRECT (Context-Rich Reasoning Evaluation of Code with Trust), a new evaluation framework that systematically incorporates contextual information into LLM-based vulnerability detection. We construct a context-rich dataset of 2,000 vulnerable-patched program pairs spanning 99 CWEs and evaluate 13 LLMs across four model families. Our framework elicits both binary predictions and natural-language rationales, which are further validated using LLM-as-a-judge techniques. Our findings overturn existing misconceptions. When provided with sufficient context, SOTA LLMs achieve significantly improved performance (e.g., 0.7 F1-score on key CWEs), with 0.8 precision. We show that most false positives stem from reasoning errors rather than misclassification, and that while model and test-time scaling improve performance, they introduce diminishing returns and trade-offs in recall. Finally, we uncover new flaws in current LLM-based detection systems, such as limited generalization and overthinking biases.
No abstract available
Detecting vulnerabilities in source code remains a critical yet challenging task, especially when benign and vulnerable functions share significant similarities. In this work, we introduce VulTrial, a courtroom-inspired multi-agent framework designed to identify vulnerable code and to provide explanations. It employs four role-specific agents, which are security researcher, code author, moderator, and review board. Using GPT-4o as the base LLM, VulTrial almost doubles the efficacy of prior best-performing baselines. Additionally, we show that role-specific instruction tuning with small quantities of data significantly further boosts VulTrial's efficacy. Our extensive experiments demonstrate the efficacy of VulTrial across different LLMs, including an open-source, in-house-deployable model (LLaMA-3.1-8B), as well as the high quality of its generated explanations and its ability to uncover multiple confirmed zero-day vulnerabilities in the wild.
Although LLMs have shown promising potential in vulnerability detection, this study reveals their limitations in distinguishing between vulnerable and similar-but-benign patched code (only 0.06 - 0.14 accuracy). It shows that LLMs struggle to capture the root causes of vulnerabilities during vulnerability detection. To address this challenge, we propose enhancing LLMs with multi-dimensional vulnerability knowledge distilled from historical vulnerabilities and fixes. We design a novel knowledge-level Retrieval-Augmented Generation framework Vul-RAG, which improves LLMs with an accuracy increase of 16% - 24% in identifying vulnerable and patched code. Additionally, vulnerability knowledge generated by Vul-RAG can further (1) serve as high-quality explanations to improve manual detection accuracy (from 60% to 77%), and (2) detect 10 previously-unknown bugs in the recent Linux kernel release with 6 assigned CVEs.
No abstract available
The significant increase in software production driven by automation and faster development lifecycles has resulted in a corresponding surge in software vulnerabilities. In parallel, the evolving landscape of software vulnerability detection, highlighting the shift from traditional methods to machine learning and large language models (LLMs), provides massive opportunities at the cost of resource-demanding computations. This paper thoroughly analyses LLMs' capabilities in detecting vulnerabilities within source code by testing models beyond their usual applications to study their potential in cybersecurity tasks. We evaluate the performance of six open-source models that are specifically trained for vulnerability detection against six general-purpose LLMs, three of which were further fine-tuned on a dataset that we compiled. Our dataset, alongside five state-of-the-art benchmark datasets, were used to create a pipeline to leverage a binary classification task, namely classifying code into vulnerable and non-vulnerable. The findings highlight significant variations in classification accuracy across benchmarks, revealing the critical influence of fine-tuning in enhancing the detection capabilities of small LLMs over their larger counterparts, yet only in the specific scenarios in which they were trained. Further experiments and analysis also underscore the issues with current benchmark datasets, particularly around mislabeling and their impact on model training and performance, which raises concerns about the current state of practice. We also discuss the road ahead in the field suggesting strategies for improved model training and dataset curation.
Software vulnerability detection is a critical challenge in cyber security. With the rise of deep learning and large language models (LLMs), numerous studies have applied these technologies to vulnerability detection. Existing approaches directly employ prompt engineering, chain-of-thought reasoning, and fine-tuning methods on LLMs, but achieve suboptimal results. To effectively leverage LLMs’ powerful reasoning capabilities for vulnerability detection, we propose VulnTeam, a novel team collaboration framework for LLM vulnerability detection inspired by human expert team collaboration. Specifically, we introduce a dual-stage fine-tuning approach where expert models are first fine-tuned using low-rank adaptation to detect vulnerabilities related to different vulnerability syntactic features, followed by instruction fine-tuning of a leader model responsible for the final decision-making. Ultimately, team members (expert models) and the team leader (leader model) collaborate to detect vulnerabilities. Our experimental evaluation across three LLMs and two datasets demonstrates that VulnTeam significantly enhances LLMs’ vulnerability detection performance (average F1-score improvement of 12.51%). Moreover, VulnTeam-enhanced LLMs substantially outperform previous state-of-the-art (SOTA) vulnerability detection methods (average F1-score improvement of 7.78%). Additionally, we analyze computational costs to validate VulnTeam’s practical applicability.
Smart contracts are susceptible to various vulnerabilities that can lead to significant financial losses. The usage of tools for vulnerabilities is reducing the threats but presents some limitations related to the approach used by the tool itself. This paper presents a novel approach to smart contract vulnerability detection utilizing Large Language Models (LLMs), as a tool to detect all the vulnerabilities at once. Our proposed tool leverages the advanced natural language processing capabilities of LLMs to analyze smart contract code and identify potential security flaws. By training the LLM on a diverse dataset of known smart contract vulnerabilities and secure coding practices, we enhance its ability to recognize subtle and complex vulnerabilities that traditional static analysis tools might miss. The evaluation of our tool demonstrates its effectiveness in detecting a wide range of vulnerabilities with satisfaction and accuracy, providing developers with a robust mechanism to improve the security of their smart contracts before deployment. This approach signifies a significant advancement in the application of artificial intelligence for blockchain security, highlighting the potential of LLMs to enhance the reliability and safety of decentralized applications.
No abstract available
We propose VulnLLM-R, the~\emph{first specialized reasoning LLM} for vulnerability detection. Our key insight is that LLMs can reason about program states and analyze the potential vulnerabilities, rather than simple pattern matching. This can improve the model's generalizability and prevent learning shortcuts. However, SOTA reasoning LLMs are typically ultra-large, closed-source, or have limited performance in vulnerability detection. To address this, we propose a novel training recipe with specialized data selection, reasoning data generation, reasoning data filtering and correction, and testing-phase optimization. Using our proposed methodology, we train a reasoning model with seven billion parameters. Through extensive experiments on SOTA datasets across Python, C/C++, and Java, we show that VulnLLM-R has superior effectiveness and efficiency than SOTA static analysis tools and both open-source and commercial large reasoning models. We further conduct a detailed ablation study to validate the key designs in our training recipe. Finally, we construct an agent scaffold around our model and show that it outperforms CodeQL and AFL++ in real-world projects. Our agent further discovers a set of zero-day vulnerabilities in actively maintained repositories. This work represents a pioneering effort to enable real-world, project-level vulnerability detection using AI agents powered by specialized reasoning models. The code is available at~\href{https://github.com/ucsb-mlsec/VulnLLM-R}{github}.
Improving and understanding the training dynamics and reasoning of Large Language Models (LLMs) has become essential for their deployment in AI-based security tools, such as software vulnerability detection. In this work, we present an extensive study aimed at advancing recent RL-based finetuning techniques for LLMs in the context of vulnerability detection. We start by highlighting key limitations of commonly adopted LLMs, such as their tendency to over-predict certain types of vulnerabilities while failing to detect others. To address this challenge, we explore the use of Group Relative Policy Optimization (GRPO), a recent policy-gradient method, for guiding LLM behavior through structured, rule-based rewards. We enable its application to the vulnerability detection task by redefining its advantage functions and reward signals using annotations from widely used datasets in the field, including BigVul, DiverseVul, and CleanVul. The proposed methodology enables an extensive set of experiments, addressing multiple research questions regarding the impact of GRPO on generalization, reasoning capabilities, and performance improvements over standard supervised finetuning (SFT). Our findings offer valuable insights into the potential of RL-based training to enhance both the performance and reasoning abilities of LLMs in the context of software vulnerability detection.
Recognizing vulnerabilities in stripped binary files presents a significant challenge in software security. Although some progress has been made in generating human-readable information from decompiled binary files with Large Language Models (LLMs), effectively and scalably detecting vulnerabilities within these binary files is still an open problem. This paper explores the novel application of LLMs to detect vulnerabilities within these binary files. We demonstrate the feasibility of identifying vulnerable programs through a combined approach of decompilation optimization to make the vulnerabilities more prominent and long-term memory for a larger context window, achieving state-of-the-art performance in binary vulnerability analysis. Our findings highlight the potential for LLMs to overcome the limitations of traditional analysis methods and advance the field of binary vulnerability detection, paving the way for more secure software systems. In this paper, we present Vul-BinLLM , an LLM-based framework for binary vulnerability detection that mirrors traditional binary analysis workflows with fine-grained optimizations in decompilation and vulnerability reasoning with an extended context. In the decompilation phase, Vul-BinLLM adds vulnerability and weakness comments without altering the code structure or functionality, providing more contextual information for vulnerability reasoning later. Then for vulnerability reasoning, Vul-BinLLM combines in-context learning and chain-of-thought prompting along with a memory management agent to enhance accuracy. Our evaluations encompass the commonly used synthetic dataset Juliet to evaluate the potential feasibility for analysis and vulnerability detection in C/C++ binaries. Our evaluations show that Vul-BinLLM is highly effective in detecting vulnerabilities on the compiled Juliet dataset.
Software supply chain vulnerabilities arise when attackers exploit weaknesses by injecting vulnerable code into widely used packages or libraries within software repositories. While most existing approaches focus on identifying vulnerable packages or libraries, they often overlook the specific functions responsible for these vulnerabilities. Pinpointing vulnerable functions within packages or libraries is critical, as it can significantly reduce the risks associated with using open-source software. Identifying vulnerable patches is challenging because developers often submit code changes that are unrelated to vulnerability fixes. To address this issue, this paper introduces FuncVul, an innovative code chunk-based model for function-level vulnerability detection in C/C++ and Python, designed to identify multiple vulnerabilities within a function by focusing on smaller, critical code segments. To assess the model's effectiveness, we construct six code and generic code chunk based datasets using two approaches: (1) integrating patch information with large language models to label vulnerable samples and (2) leveraging large language models alone to detect vulnerabilities in function-level code. To design FuncVul vulnerability model, we utilise GraphCodeBERT fine tune model that captures both the syntactic and semantic aspects of code. Experimental results show that FuncVul outperforms existing state-of-the-art models, achieving an average accuracy of 87-92% and an F1 score of 86-92% across all datasets. Furthermore, we have demonstrated that our code-chunk-based FuncVul model improves 53.9% accuracy and 42.0% F1-score than the full function-based vulnerability prediction. The FuncVul code and datasets are publicly available on GitHub at https://github.com/sajalhalder/FuncVul.
Detecting vulnerabilities in source code remains a challenging task due to the complex and diverse ways security flaws can manifest. This study investigates how to effectively combine sequential code semantics and graph-based structural features for improved vulnerability detection. We hypothesize that the choice of fusion strategy plays a critical role in leveraging the complementary strengths of these two modalities. To address the limitation of labeled data, we employ the CodeQwen2.5-3B-Instruct large language model to generate augmented vulnerable samples, enriching the original PrimeVul dataset. The resulting dataset, MegaVul+, consists of both human-labeled and LLM- augmented functions, formatted in a standardized JSON structure. Our primary research question centers on identifying the most effective strategy for fusing sequential and structural representations of code. We conduct a comparative evaluation of three multimodal fusion techniques: simple concatenation, gated multi-modal units (GMU), and cross-attention mechanisms. Experimental results show that the concatenation-based fusion achieves the best F1-score of 31.34%, outperforming GMU (25.45%), cross-attention (25.72%), the sequence-only model (25.01%), and the graph-only model (16.45%). We hypothesize that this advantage arises from the simplicity of concatenation, which preserves raw information from both modalities without introducing additional complexity or overfitting risks—particularly important in settings with imbalanced or noisy data. These findings highlight the potential of combining diverse code representations and demonstrate the value of LLM-driven data augmentation in improving software vulnerability detection.
No abstract available
Timely detection of hardware vulnerabilities during the early design stage is critical for reducing remediation costs. Existing early detection techniques often require specialized security expertise, limiting their usability. Recent efforts have explored the use of large language models (LLMs) for Verilog vulnerability detection. However, LLMs struggle to capture the structure in Verilog code, resulting in inconsistent detection results. To this end, we propose VerilogLAVD, the first LLM-aided graph traversal rule generation approach for Verilog vulnerability detection. Our approach introduces the Verilog Property Graph (VeriPG), a unified representation of Verilog code. It combines syntactic features extracted from the abstract syntax tree (AST) with semantic information derived from control flow and data dependency graphs. We leverage LLMs to generate VeriPG-based detection rules from Common Weakness Enumeration (CWE) descriptions. These rules guide the rule executor that traversal VeriPG for potential vulnerabilities. To evaluate VerilogLAVD, we build a dataset collected from open-source repositories and synthesized data. In our empirical evaluation on 77 Verilog designs encompassing 12 CWE types, VerilogLAVD achieves an F1-score of 0.54. Compared to the LLM-only and LLM with external knowledge baselines, VerilogLAVD improves F1-score by 0.31 and 0.27, respectively.
Traditional vulnerability detection methods rely heavily on predefined rule matching, which often fails to capture vulnerabilities accurately. With the rise of large language models (LLMs), leveraging their ability to understand code semantics has emerged as a promising direction for achieving more accurate and efficient vulnerability detection. However, current LLM-based approaches face significant challenges: instability in model outputs, degraded performance with long context, and hallucination. As a result, many existing solutions either use LLMs merely to enrich predefined rule sets, thereby keeping the detection process fundamentally rule-based, or over-rely on them, leading to poor robustness. To address these challenges, we propose a constraint-solving approach powered by LLMs named VULSOLVER. By modeling vulnerability detection as a constraint-solving problem, and by integrating static application security testing (SAST) with the semantic reasoning capabilities of LLMs, our method enables the LLM to act like a professional human security expert. We assess VULSOLVER on the OWASP Benchmark (1,023 labeled samples), achieving 97.85% accuracy, 97.97% F1-score, and 100% recall. Applied to widely-used open-source projects, VULSOLVER identified 15 previously unknown high-severity vulnerabilities (CVSS 7.5-9.8), demonstrating its effectiveness in real-world security analysis.
Detecting vulnerabilities in smart contracts is vital for the security and reliability of decentralized apps. To facilitate vulnerability detection, contract codes, including bug patterns, are represented as heterogeneous graphs with various nodes and edges, like control-flow and function-call graphs. However, existing graph learning techniques struggle with large, complex graphs. This paper presents MANDO-LLM, a novel framework that combines heterogeneous graph transformers (HGTs) with large language models (LLMs) for detecting vulnerabilities in smart contracts represented as heterogeneous contract graphs built upon control-flow and call graphs. MANDO-LLM uses LLMs to capture code features from control-flow and call data, customizes HGTs to learn embeddings with specific node-edge meta relations, and employs classifiers for vulnerability detection in Solidity code at both contract and line levels. Our evaluation shows that MANDO-LLM significantly outperforms existing methods on real-world large-scale imbalanced datasets, with F1-score improvements from 0.59% to 80.72% at the contract level. It is also one of the first effective methods for identifying line-level vulnerabilities, with performance boosts ranging from 3.09% to over 95% across different vulnerability types. MANDO-LLM’s versatility allows easy retraining for various vulnerabilities without needing manually defined patterns.
Our team, All You Need Is A Fuzzing Brain, was one of seven finalists in DARPA's Artificial Intelligence Cyber Challenge (AIxCC), placing fourth in the final round. During the competition, we developed a Cyber Reasoning System (CRS) that autonomously discovered 28 security vulnerabilities - including six previously unknown zero-days - in real-world open-source C and Java projects, and successfully patched 14 of them. The complete CRS is open source at https://github.com/o2lab/afc-crs-all-you-need-is-a-fuzzing-brain. This paper provides a detailed technical description of our CRS, with an emphasis on its LLM-powered components and strategies. Building on AIxCC, we further introduce a public leaderboard for benchmarking state-of-the-art LLMs on vulnerability detection and patching tasks, derived from the AIxCC dataset. The leaderboard is available at https://o2lab.github.io/FuzzingBrain-Leaderboard/.
No abstract available
This paper proposes a new approach to code vulnerability detection that uses large language models (LLMs) and incorporates adversarial training techniques. It aims to enhance the robustness of the vulnerability detection model and improve the performance of vulnerability detection. The effect of adversarial training on model performance enhancement is explored through experimental validation. It is proved experimentally that the method proposed we propose can significantly improve the robustness and accuracy and of the vulnerability detection model. It provides effective technical support for security problems in software development.
This paper presents LProtector, an automated vulnerability detection system for C/C++ codebases driven by the large language model (LLM) GPT-4o and Retrieval-Augmented Generation (RAG). As software complexity grows, traditional methods face challenges in detecting vulnerabilities effectively. LProtector leverages GPT-4o's powerful code comprehension and generation capabilities to perform binary classification and identify vulnerabilities within target codebases. We conducted experiments on the Big-Vul dataset, showing that LProtector outperforms two state-of-the-art baselines in terms of F1 score, demonstrating the potential of integrating LLMs with vulnerability detection.
The immutable nature of blockchain technology, while revolutionary, introduces significant security challenges, particularly in smart contracts. These security issues can lead to substantial financial losses. Current tools and approaches often focus on specific types of vulnerabilities. However, a comprehensive tool capable of detecting a wide range of vulnerabilities with high accuracy is lacking. This paper introduces LLM-SmartAudit, a novel framework leveraging the advanced capabilities of Large Language Models (LLMs) to detect and analyze vulnerabilities in smart contracts. Using a multi-agent conversational approach, LLM-SmartAudit employs a collaborative system with specialized agents to enhance the audit process. To evaluate the effectiveness of LLM-SmartAudit, we compiled two distinct datasets: a labeled dataset for benchmarking against traditional tools and a real-world dataset for assessing practical applications. Experimental results indicate that our solution outperforms all traditional smart contract auditing tools, offering higher accuracy and greater efficiency. Furthermore, our framework can detect complex logic vulnerabilities that traditional tools have previously overlooked. Our findings demonstrate that leveraging LLM agents provides a highly effective method for automated smart contract auditing.
As a key application of the blockchain technology, smart contracts have been adopted in various domains such as finance and the Internet of Things. However, their potential vulnerabilities can lead to significant economic losses, efficient and accurate vulnerability detection methods are essential to guarantee their security. Existing methods mostly rely on predefined rules or classification models, which suffer from high maintenance costs and limited semantic understanding about the smart contracts. To address this issue, this paper proposes SCoVerLLM (Smart Contract Vulnerability Detection via LLM-Based In-Context and Chain-of-Thought Prompts), which is designed to enhance the performance of smart contract vulnerability detection by using LLMs. SCoVerLLM combines prediction information generated by deep learning models with similar contract examples, and leverages In-Context Learning prompts and structured Chain-of-Thought templates to guide LLMs in step-by-step analyzing the logic of contracts for vulnerability detection. Experimental results show that SCoVerLLM outperforms existing four methods, including MANDO and Mythril, in terms of multiple metrics, with improvements of 10.72% to 19.20% in Accuracy, 8.70% to 18.51% in Precision, and 10.09% to 25.08% in F1.
The widespread reliance on open-source software dramatically increases the risk of vulnerability exploitation, underscoring the need for effective and scalable vulnerability detection (VD). Existing VD techniques, whether traditional machine learning-based or LLM-based approaches like prompt engineering, supervised fine-tuning, or off-policy preference optimization, remain fundamentally limited in their ability to perform context-aware analysis: They depend on fixed inputs or static preference datasets, cannot adaptively explore repository-level dependencies, and are constrained by function-level benchmarks that overlook critical vulnerability context. This paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware VD. To support training and evaluation, we first construct ContextVul, a new dataset that augments high-quality function-level samples with lightweight method to extract repository-level context information. We then design multi-dimensional reward structuring that jointly captures prediction correctness, vulnerability localization accuracy, and the semantic relevance of vulnerability analysis, thereby guiding the model toward comprehensive contextual reasoning. To address the asymmetric difficulty of different vulnerability cases and mitigate reward hacking, VULPO incorporates label-level and sample-level difficulty-adaptive reward scaling, encouraging the model to explore challenging cases while maintaining balanced reward distribution. Extensive experiments demonstrate the superiority of our VULPO framework in context-aware VD: Our VULPO-4B substantially outperforms existing VD baselines based on prompt engineering and off-policy optimization, improving F1 by 85% over Qwen3-4B and achieving performance comparable to a 150x larger-scale model, DeepSeek-R1-0528.
Smart contracts play a pivotal role in decentralized applications but are subject to security vulnerabilities often difficult to detect. Traditional static and symbolic analysis tools cannot handle intricate logic and are limited in adaptability and explainability. Recent development of large language models (LLMs) provide new opportunities for vulnerability detection, but single-model methods often suffer from inconsistency and prompt sensitivity. This paper introduces a collaborative LLM-based model that enhances detection robustness through semantic similarity-based few-shot prompting and multi-LLM reasoning. Our model integrates diverse LLMs (ChatGPT, Gemini, Grok) as worker nodes, along with an aggregator model to resolve disagreements via justification analysis and final prediction consolidation. Experimental evaluations on the SmartBugs benchmark demonstrated a remarkable enhancement in detection accuracy (96.25%) and response time compared to other models. The proposed model provides a scalable and explainable solution for smart contract auditing, illustrating the strength of LLM collaboration in security-critical applications.
As large language models (LLMs) are increasingly adopted for code vulnerability detection, their reliability and robustness across diverse vulnerability types have become a pressing concern. In traditional adversarial settings, code obfuscation has long been used as a general strategy to bypass auditing tools, preserving exploitability without tampering with the tools themselves. Numerous efforts have explored obfuscation methods and tools, yet their capabilities differ in terms of supported techniques, granularity, and programming languages, making it difficult to systematically assess their impact on LLM-based vulnerability detection. To address this gap, we provide a structured systematization of obfuscation techniques and evaluate them under a unified framework. Specifically, we categorize existing obfuscation methods into three major classes (layout, data flow, and control flow) covering 11 subcategories and 19 concrete techniques. We implement these techniques across four programming languages (Solidity, C, C++, and Python) using a consistent LLM-driven approach, and evaluate their effects on 15 LLMs spanning four model families (DeepSeek, OpenAI, Qwen, and LLaMA), as well as on two coding agents (GitHub Copilot and Codex). Our findings reveal both positive and negative impacts of code obfuscation on LLM-based vulnerability detection, highlighting conditions under which obfuscation leads to performance improvements or degradations. We further analyze these outcomes with respect to vulnerability characteristics, code properties, and model attributes. Finally, we outline several open problems and propose future directions to enhance the robustness of LLMs for real-world vulnerability detection.
Legacy systems, characterized by their heterogeneity and outdated coding practices, present significant security challenges in modern software infrastructure. Recent advances in Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) offer promising solutions for vulnerability detection, as demonstrated by successful implementations of knowledge-level retrieval frameworks [1]. This research proposes LegacyGuard, a hybrid framework that integrates state-of-theart code-specific LLMs with traditional static analysis and RAG-enhanced knowledge retrieval to detect vulnerabilities in multilingual legacy codebases. The framework leverages LLM- based semantic analysis for deep code understanding, while incorporating external vulnerability intelligence through RAG to enhance detection accuracy. Through systematic evaluation using precision, recall, and F1-score metrics, this work aims to demonstrate improved vulnerability detection rates and provide actionable insights through chain-of-thought reasoning. The modular architecture ensures extensibility and adaptability for future security analysis applications, contributing to both theoretical foundations and practical implementations of AI-driven vulnerability detection in legacy systems.
The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection, a crucial task in securing modern codebases. This paper presents a comparative study on the effectiveness of LLM-based techniques for detecting software vulnerabilities. The study evaluates three approaches, Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and a Dual-Agent LLM framework, against a baseline LLM model. A curated dataset was compiled from Big-Vul [1] and real-world code repositories from GitHub, focusing on five critical Common Weakness Enumeration (CWE) categories: CWE-119, CWE399, CWE-264, CWE-20, and CWE-200. Our RAG approach, which integrated external domain knowledge from the internet and the MITRE CWE database, achieved the highest overall accuracy (0.86) and F1 score (0.85), highlighting the value of contextual augmentation. Our SFT approach, implemented using parameter-efficient QLoRA adapters, also demonstrated strong performance. Our Dual-Agent system, an architecture in which a secondary agent audits and refines the output of the first, showed promise in improving reasoning transparency and error mitigation, with reduced resource overhead. These results emphasize that incorporating a domain expertise mechanism significantly strengthens the practical applicability of LLMs in real-world vulnerability detection tasks.
Smart contract vulnerability detection is an important task in securing the blockchain. However, existing detection methods primarily extract single view features, such as semantic or structural features, which ignores the synergistic supplementation of them to smart contract, remaining room for improvement in feature representation. To this end, this paper proposes the LLM-assisted dual-view awareness framework for smart contract vulnerability detection, which incorporates significantly different semantic features and structural features. To address the limitation of large language model (LLM) in domain-specific expertise, we design semantic awareness module based on Retrieval-Augmented Generation (RAG), construct vulnerability knowledge base, and perform semantic reasoning on smart contracts. To capture crucial structural information, we propose structural awareness module based on Graph Neural Network (GNN), construct contract graphs, and perform structural analysis on smart contracts. We evaluated four types of vulnerabilities, and the experimental results show that our approach significantly outperforms state-of-the-art approaches, achieving 4.80% improvement in accuracy for timestamp dependence detection.
With the advance application of blockchain technology in various fields, ensuring the security and stability of smart contracts has emerged as a critical challenge. Current security analysis methodologies in vulnerability detection can be categorized into static analysis and dynamic analysis methods. However, these existing traditional vulnerability detection methods predominantly rely on analyzing original contract code, not all smart contracts provide accessible code. We present ETrace, a novel event-driven vulnerability detection framework for smart contracts, which uniquely identifies potential vulnerabilities through LLM-powered trace analysis without requiring source code access. By extracting fine-grained event sequences from transaction logs, the framework leverages Large Language Models (LLMs) as adaptive semantic interpreters to reconstruct event analysis through chain-of-thought reasoning. ETrace implements pattern-matching to establish causal links between transaction behavior patterns and known attack behaviors. Furthermore, we validate the effectiveness of ETrace through preliminary experimental results.
Vulnerabilities in source code are major risk in software-intensive systems, making their effective detection essential. Artificial Intelligence (AI) supports this process by analyzing large datasets and identifying threat patterns. However, AI-based methods face challenges in handling big data and understanding context. This research introduces a novel approach to enhance transparency in vulnerability detection using BERT-based Large Language Models (LLMs), integrated with eXplainable AI (XAI) techniques like SHAP, LIME, and attention heatmaps. This architecture ensures transparency throughout the model's lifecycle. An experiment on a large source code dataset achieved 85% accuracy, with XAI tools highlighting influential tokens such as “vulnerable,” “function,” “mysql_tmpdir_list,” and “strmov.” Attention heatmaps also provided insights into token-level interactions, improving the interpretability of the model's decisions.
With the development of software systems, the security threat posed by software vulnerabilities has become very serious. Static analysis is an extremely powerful technique that plays a crucial role in software vulnerability detection. However, the scalability of static analysis is poor, it must rely on experienced software engineers. Large language models(LLMs) are powerful tools that exhibit strong performance in natural language understanding and automated code generation. We find that this capability can effectively compensate for the lack of scalability in static analysis methods. Therefore, we attempt to combine the advantages of both, and provide an intelligent software vulnerability detection solution. We construct a vulnerability feature table based on vulnerability patterns, and LLMs extract vulnerability information from the project's historical patches according to the table. We need to ensure that the vulnerability feature and the vulnerability pattern are paired. Then, we construct a series of static analyzer templates based on vulnerability patterns, and fill the vulnerability information into the templates to build new static analyzers. Finally, we use these new static analyzers to detect vulnerabilities in the project. To date, we have discovered 44 bugs in the Linux kernel; 37 confirmed, 30 fixed. The evaluation of the Linux kernel demonstrates that this system can generate high-precision analyzers to detect hidden bugs.
No abstract available
Cybersecurity text classification is difficult because vulnerability descriptions are complex and specific to the domain. Traditional models cannot capture the semantic features needed for accurate classification. This paper presents VULNERNet. It is a deep learning framework that uses semantic embeddings from a fine-tuned Qwen-7B model. It also uses a multi-layer ensemble architecture that combines convolutional neural net-works, long short-term memory networks, and Transformers. The framework extracts local, sequential, and global features. This helps understand vulnerability texts in a complete way. The training uses cross-entropy and L2 regularization. This improves model stability and reduces overfitting. Experiments show that VULNERNet improves over existing methods. It is effective and works well in cybersecurity text mining.
The rapidly advancing capabilities of Large Language Models (LLMs) have garnered significant attention from both academia and industry. These models demonstrate immense potential across diverse domains. In the realm of software security, LLMs are increasingly pivotal, with active research exploring their capacity to enhance vulnerability discovery efficiency and strengthen defenses against escalating cyber threats. This article provides a systematic exposition of LLMs' applications in software vulnerability discovery, categorizing current approaches into three core paradigms: code vulnerability detection, LLMs-enhanced static analysis, and LLMs-guided fuzzing. We examine the foundational principles, key implementation methodologies, and significant technical advancements within each domain. Furthermore, we critically analyze the inherent challenges that persist and delineate promising avenues for future research, contributing to the development of more secure and reliable LLMs' applications for cybersecurity.
As cyber threats become more complex, traditional vulnerability detection methods lose their effectiveness. The purpose of this work is to develop and test an approach to identifying vulnerabilities based on the analysis of data from thematic Internet resources: forums, blogs and social networks. These sources contain a large amount of unstructured information, which requires the use of data mining methods. The work uses the integration of modern technologies: the pre-trained SecBERT language model (Security Bidirectional Encoder Representations from Transformers), designed for cybersecurity tasks, and the adaptive neuro-fuzzy inference system DENFIS (Dynamic Evolving Neural-Fuzzy Inference System). The proposed system allows you to filter irrelevant messages, highlight indicators of compromise and potential threats. The use of fuzzy logic makes it possible to efficiently process vague and incomplete information. Experiments confirmed high classification accuracy and stable fuzzy clustering performance (FPC = 0.93; PE = 0.28; XB = 0.042). The system demonstrated the ability to promptly detect signs of cyber threats and has scalability potential for monitoring and attack prediction tasks. The results indicate its potential in increasing the speed of response to cyber threats and strengthening the protection of information systems.
Smart contract vulnerability detection is a critical challenge in the rapidly evolving blockchain landscape. Existing vulnerability detection methods face two main issues: (1) Existing datasets lack comprehensiveness and sufficient quality, with limited vulnerability type coverage and insufficient distinction between high-quality and low-quality explanations for preference learning. (2) Large language models (LLMs) often struggle with accurately interpreting specific concepts in smart contract security. Through our empirical analysis, we found that even after continual pre-training and supervised fine-tuning, LLMs still exhibit limitations in precisely understanding the execution order of state changes in smart contracts, which can lead to incorrect vulnerability explanations despite making correct detection decisions. These limitations result in poor detection performance, leading to potentially severe financial losses. To address these challenges, we propose Smart-LLaMA-DPO, an advanced detection method based on the LLaMA-3.1-8B. First, we construct a comprehensive dataset covering four vulnerability types and machine-unauditable vulnerabilities, containing labels, detailed explanations, and precise vulnerability locations for Supervised Fine-Tuning (SFT), as well as paired high-quality and low-quality outputs for Direct Preference Optimization (DPO). Second, we perform continual pre-training using large-scale smart contract code to enhance the LLM's understanding of specific security practices in smart contracts. Futhermore, we conduct supervised fine-tuning with our comprehensive dataset. Finally, we apply DPO, which leverages human feedback to improve the quality of generated explanations. Smart-LLaMA-DPO utilizes a specially designed loss function that encourages the LLM to increase the probability of preferred outputs while decreasing the probability of non-preferred outputs, thereby enhancing the LLM's ability to generate high-quality explanations. We evaluate Smart-LLaMA-DPO on four major vulnerability types: reentrancy, timestamp dependence, integer overflow/underflow, and delegatecall, as well as machine-unauditable vulnerabilities. Our method significantly outperforms state-of-the-art baselines, with average improvements of 10.43% in F1 score and 7.87% in accuracy. Moreover, both LLM evaluation and human evaluation demonstrate the superior quality of explanations generated by Smart-LLaMA-DPO in terms of correctness, thoroughness, and clarity.
Previous learning-based vulnerability detection methods relied on either medium-sized pretrained models or smaller neural networks from scratch. Recent advancements in Large Pre-Trained Language Models (LLMs) have showcased remarkable few-shot learning capabilities in various tasks. However, the effectiveness of LLMs in detecting software vulnerabilities is largely unexplored. This paper aims to bridge this gap by exploring how LLMs perform with various prompts, particularly focusing on two state-of-the-art LLMs: GPT-3.5 and GPT-4. Our experimental results showed that GPT-3.5 achieves competitive performance with the prior state-of-the-art vulnerability detection approach and GPT-4 consistently outperformed the state-of-the-art.
With the increasing number of network security threats and the frequent occurrence of software vulnerability attacks, the effective management and large-scale retrieval of vulnerability data have become urgent needs. Existing vulnerability information is scattered across heterogeneous sources and is difficult to integrate, which in turn makes it hard for security analysts to quickly retrieve and analyze relevant security knowledge. To address this problem, this paper proposes a method to construct a vulnerability knowledge graph by integrating multi-source vulnerability data, combining graph embedding technology with large language model reasoning to aggregate, infer, and enrich vulnerability knowledge. Experiments demonstrated that our domain-tuned Bidirectional Long Short-Term Memory–Conditional Random Field (BiLSTM-CRF) named entity recognition (NER), enhanced with a cybersecurity dictionary, achieved a 90.1% F1-score for entity extraction. For link prediction, a hybrid Graph Attention Network fused with GPT-3 reasoning boosted Hits1 by 0.137, Hits3 by 0.116, and Hits10 by 0.101 over the baseline. These results confirm that our approach markedly enhanced entity identification and relationship inference, yielding a more complete and dynamically updatable cybersecurity knowledge graph.
Software vulnerability detection is generally supported by automated static analysis tools, which have recently been reinforced by deep learning (DL) models. However, despite the superior performance of DL-based approaches over rule-based ones in research, applying DL approaches to software vulnerability detection in practice remains a challenge due to the complex structure of source code, the black-box nature of DL, and the domain knowledge required to understand and validate the black-box results for addressing tasks after detection. Conventional DL models are trained by specific projects and, hence, excel in identifying vulnerabilities in these projects but not in others. These models with poor performance in vulnerability detection would impact the downstream tasks such as location and repair. More importantly, these models do not provide explanations for developers to comprehend detection results. In contrast, Large Language Models (LLMs) have made lots of progress in addressing these issues by leveraging prompting techniques. Unfortunately, their performance in identifying vulnerabilities is unsatisfactory. This paper contributes \textbf{\DLAP}, a \underline{\textbf{D}}eep \underline{\textbf{L}}earning \underline{\textbf{A}}ugmented LLMs \underline{\textbf{P}}rompting framework that combines the best of both DL models and LLMs to achieve exceptional vulnerability detection performance. Experimental evaluation results confirm that \DLAP outperforms state-of-the-art prompting frameworks, including role-based prompts, auxiliary information prompts, chain-of-thought prompts, and in-context learning prompts, as well as fine-turning on multiple metrics.
This paper introduces SBAN (Source code, Binary, Assembly, and Natural Language Description), a large-scale, multi-dimensional dataset designed to advance the pre-training and evaluation of large language models (LLMs) for software code analysis. SBAN comprises more than 3 million samples, including 2.9 million benign and 672,000 malware respectively, each represented across four complementary layers: binary code, assembly instructions, natural language descriptions, and source code. This unique multimodal structure enables research on cross-representation learning, semantic understanding of software, and automated malware detection. Beyond security applications, SBAN supports broader tasks such as code translation, code explanation, and other software mining tasks involving heterogeneous data. It is particularly suited for scalable training of deep models, including transformers and other LLM architectures. By bridging low-level machine representations and high-level human semantics, SBAN provides a robust foundation for building intelligent systems that reason about code. We believe that this dataset opens new opportunities for mining software behavior, improving security analytics, and enhancing LLM capabilities in pre-training and fine-tuning tasks for software code mining.
The advance of intelligent cloud applications has brought attention to potential security vulnerabilities. Vulnerability detection is a critical step in ensuring the security of cloud applications. However, traditional techniques for vulnerability detection, such as static and dynamic analysis, are challenging to apply in heterogeneous cloud environments. Using data-driven methods such as Machine Learning (ML) to automate vulnerability detection in cloud applications shows promise. However, current ML solutions are limited to coarse-grained vulnerability categorization and function-level analysis. Therefore, we propose LLM-CloudSec, an unsupervised approach to fine-grained vulnerability analysis based on the Large Language Model (LLM). LLM-CloudSec uses Retrieval Augmented Generation (RAG) and the Common Weakness Enumeration (CWE) as an external knowledge base to improve its ability to detect and analyze vulnerabilities. We conduct experiments on the Juliet $\mathrm{C}++$ test suite, and the results show that LLM-CloudSec enables CWE-based vulnerability classification and line-level vulnerability analysis. Additionally, we applied LLM-CloudSec to the D2A dataset, which was collected from real-world scenarios. We obtained 1230 data entries labelled with CWE and detailed vulnerability analysis. To foster related research, we publish our work on https://github.com/DPCa0/LLM-CloudSec.
Vulnerabilities are often accompanied by cyberattacks. CVE is the largest repository of open vulnerabilities, which keeps expanding. ATT&CK models known multi-step attacks both tactically and technically and remains up to date. It is valuable to correlate the vulnerability in CVE with the corresponding tactic and technique of ATT&CK which exploit the vulnerability, for active defense. Mappings manually is not only time-consuming but also difficult to keep up-to-date. Existing language-based automated mapping methods do not utilize the information associated with attack behaviors outside of CVE and ATT&CK and are therefore ineffective. In this paper, we propose a novel framework named VTT-LLM for mapping Vulnerabilities to Tactics and Techniques based on Large Language Models, which consists of a generation model and a mapping model. In order to generate fine-tuning instructions for LLM, we create a template to extract knowledge of CWE (a standardized list of common weaknesses) and CAPEC (a standardized list of common attack patterns). We train the generation model of VTT-LLM by fine-tuning the LLM according to the above instructions. The generation model correlates vulnerability and attack through their descriptions. The mapping model transforms the descriptions of ATT&CK tactics and techniques into vectors through text embedding and further associates them with attacks through semantic matching. By leveraging the knowledge of CWE and CAPEC, VTT-LLM can eventually automate the process of linking vulnerabilities in CVE to the attack techniques and tactics of ATT&CK. Experiments on the latest public dataset, ChatGPT-VDMEval, show the effectiveness of VTT-LLM with an accuracy of 85.18%, which is 13.69% and 54.42% higher than the existing CVET and ChatGPT-based methods, respectively. In addition, compared to fine-tuning without outside knowledge, the accuracy of VTT-LLM with chain fine-tuning is 9.24% higher on average across different LLMs.
Large Language Models (LLMs) have shown significant challenges in detecting and repairing vulnerable code, particularly when dealing with vulnerabilities involving multiple aspects, such as variables, code flows, and code structures. In this study, we utilize GitHub Copilot as the LLM and focus on buffer overflow vulnerabilities. Our experiments reveal a notable gap in GitHub Copilot's vulnerability repair abilities, with a 76% vulnerability detection rate but only a 15% vulnerability repair rate. To address this issue, we propose a context-aware prompt tuning technique to enhance Copilot's performance in repairing buffer overflow. By injecting a sequence of domain knowledge about the vulnerability, including various security and code contexts, we demonstrate that Copilot's vulnerability repair rate increases to 63%, representing more than four times the improvement compared to repairs without domain knowledge.
This paper provides a systematic analysis of the opportunities, challenges, and potential solutions of harnessing Large Language Models (LLMs) such as GPT-4 to dig out vulnerabilities within smart contracts based on our ongoing research. For the task of smart contract vulnerability detection, achieving practical usability hinges on identifying as many true vulnerabilities as possible while minimizing the number of false positives. Nonetheless, our empirical study reveals contradictory yet interesting findings: generating more answers with higher randomness largely boosts the likelihood of producing a correct answer but inevitably leads to a higher number of false positives. To mitigate this tension, we propose an adversarial framework dubbed GPTLENS that breaks the conventional one-stage detection into two synergistic stages - generation and discrimination, for progressive detection and refinement, wherein the LLM plays dual roles, i.e., AUDITOR and CRITIC, respectively. The goal of AUDITOR is to yield a broad spectrum of vulnerabilities with the hope of encompassing the correct answer, whereas the goal of CRITIC that evaluates the validity of identified vulnerabilities is to minimize the number of false positives. Experimental results and illustrative examples demonstrate that AUDITOR and CRITIC work together harmoniously to yield pronounced improvements over the conventional one-stage detection. GPTLENS is intuitive, strategic, and entirely LLM-driven without relying on specialist expertise in smart contracts, showcasing its methodical generality and potential to detect a broad spectrum of vulnerabilities. Our code is available at: https://github.com/git-disl/GPTLens.
Security practitioners maintain vulnerability reports (e.g., GitHub Advisory) to help developers mitigate security risks. An important task for these databases is automatically extracting structured information mentioned in the report, e.g., the affected software packages, to accelerate the defense of the vulnerability ecosystem. However, it is challenging for existing work on affected package identification to achieve a high accuracy. One reason is that all existing work focuses on relatively smaller models, thus they cannot harness the knowledge and semantic capabilities of large language models. To address this limitation, we propose VulLibGen, the first method to use LLM for affected package identification. In contrast to existing work, VulLibGen proposes the novel idea to directly generate the affected package. To improve the accuracy, VulLibGen employs supervised fine-tuning (SFT), retrieval augmented generation (RAG) and a local search algorithm. The local search algorithm is a novel postprocessing algorithm we introduce for reducing the hallucination of the generated packages. Our evaluation results show that VulLibGen has an average accuracy of 0.806 for identifying vulnerable packages in the four most popular ecosystems in GitHub Advisory (Java, JS, Python, Go) while the best average accuracy in previous work is 0.721. Additionally, VulLibGen has high value to security practice: we submitted 60pairs to GitHub Advisory (covers four ecosystems). 34 of them have been accepted and merged and 20 are pending approval. Our code and dataset can be found in the attachments.
Despite the efficacy of fuzzing in verifying the implementation correctness of network protocols, existing IoT protocol fuzzing approaches grapple with several limitations, including obfuscated message formats, unresolved message dependencies, and a lack of evaluations on the testing cases. These limitations significantly curtail the capabilities of IoT fuzzers in vulnerability identification. In this work, we show that the protocol specification contains fruitful descriptions of protocol messages, which can be used to overcome the above limitations and guide IoT protocol fuzzing. To automate the specification analysis, we augment the large language model with the specification contents, and drive it to perform two tasks (i.e., protocol information extraction, and device response reasoning). We further design and implement a fuzzing algorithm, LLMIF, which incorporates the LLM into IoT fuzzing. Finally, we select Zigbee as the target protocol and initiate comprehensive evaluations. The evaluation result shows that LLMIF successfully addressed the above limitations. Compared with the existing Zigbee fuzzers, it increases the protocol message coverage and code coverage by 55.2% and 53.9%, respectively. Besides the enhanced coverage, LLMIF unearthed 11 vulnerabilities on real-world Zigbee devices, which include eight previously unknown vulnerabilities. Seven of them are not covered by the existing Zigbee fuzzers.
AI pair programmers, such as GitHub's Copilot, have shown great success in automatic code generation. However, such large language model-based code generation techniques face the risk of introducing security vulnerabilities to codebases. In this work, we explore the direction of fine-tuning large language models for generating more secure code. We use real-world vulnerability fixes as our fine-tuning dataset. We craft a code-generation scenario dataset (C/C++) for evaluating and comparing the pre-trained and fine-tuned models. Our experiments on GPT-J show that the fine-tuned GPT-J achieved 70.4% and 64.5% ratios of non-vulnerable code generation for C and C++, respectively, which has a 10% increase for C and a slight increase for C++ compared with the pre-trained large language model.
Various approaches are proposed to help under-resourced security researchers to detect and analyze software vulnerabilities. It is still incredibly time-consuming and labor-intensive for security researchers to fix such reported vulnerabilities due to the increasing size and complexity of modern software systems. The time lag between the reporting and fixing of a security vulnerability causes software systems to suffer from significant exposure to possible attacks. Very recently, some techniques propose to apply pre-trained models to fix security vulnerabilities and have proved their success in improving repair accuracy. However, the effectiveness of existing pre-trained models has not been systematically compared and little is known about their advantages and disadvantages. To bridge this gap, we perform the first extensive study on applying various pre-trained models to automated vulnerability repair. The experimental results on two vulnerability datasets show that all studied pre-trained models consistently outperform the state-of-the-art technique VRepair with a prediction accuracy of 32.94%$\sim$∼44.96%. We also investigate the impact of three major phases (i.e., data pre-processing, model training and repair inference) in the vulnerability repair workflow. Inspired by the findings, we construct a simplistic vulnerability repair approach that adopts the transfer learning from bug fixing. Surprisingly, such a simplistic approach can further improve the prediction accuracy of pre-trained models by 9.40% on average. Besides, we provide additional discussion from different aspects (e.g., code representation and a preliminary study with ChatGPT) to illustrate the capacity and limitation of pre-trained model-based techniques. Finally, we further pinpoint various practical guidelines (e.g., the improvement of fine-tuning) for advanced pre-trained model-based vulnerability repair in the near future. Our study highlights the promising future of adopting pre-trained models to patch real-world security vulnerabilities and reduce the manual debugging effort of security experts in practice.
Data-driven deep learning models are constrained by the scale and diversity of training data, making them vulnerable to data bias. While large language models (LLMs) exhibit superior generalization in vulnerability detection, their low inference efficiency and high computational costs hinder practical deployment in industrial settings. To address these limitations, we propose AIDetectVul, a novel vulnerability detection framework leveraging feature fusion from pre-trained models. Our approach concurrently utilizes encoder-only and decoder-only architectures to extract complementary code embeddings, with feature fusion enhancing semantic diversity. These enriched representations are then processed by a Transformer model, where the self-attention mechanism effectively captures long-range code dependencies, ultimately improving both detection accuracy and generalization capability. Comprehensive evaluations on proprietary enterprise datasets and open-source benchmarks demonstrate that AIDetectVul achieves comparable detection accuracy to the state-of-the-art LineVul model while demonstrating measurable improvements in generalization performance. Compared to LLM-based approaches, our solution maintains significantly lower computational overhead and training costs, making it particularly suitable for industrial applications.
No abstract available
Software vulnerabilities represent one of the most pressing threats to computing systems. Identifying vulnerabilities in source code is crucial for protecting user privacy and reducing economic losses. Traditional static analysis tools rely on experts with knowledge in security to manually build rules for operation, a process that requires substantial time and manpower costs and also faces challenges in adapting to new vulnerabilities. The emergence of pre-trained code language models has provided a new solution for automated vulnerability detection. However, code pre-training models are typically based on token-level large-scale pre-training, which hampers their ability to effectively capture the structural and dependency relationships among code segments. In the context of software vulnerabilities, certain types of vulnerabilities are related to the dependency relationships within the code. Consequently, identifying and analyzing these vulnerability samples presents a significant challenge for pre-trained models. In this paper, we propose a data flow embedding technique to enhance the performance of pre-trained models in vulnerability detection tasks, named DFEPT, which provides effective vulnerability data flow information to pre-trained models. Specifically, we parse data flow graphs (DFG) from function-level source code, and use the data type of the variable as the node characteristics of the DFG. By applying graph learning techniques, we embed the data flow graph and incorporate relative positional information into the graph embedding using sine positional encoding to ensure the completeness of vulnerability data flow information. Our research shows that DFEPT can provide effective vulnerability semantic information to pre-trained models, achieving an accuracy of 64.97% on the Devign dataset and an F1-Score of 47.9% on the Reveal dataset. Compared with the pre-trained model that is only fine-tuned, the performance increases by 1.96%-17.26%.
The increasing software vulnerabilities underscore the urgency of effective vulnerability detection for safeguarding security and socio-economic stability. Existing pre-trained model-based methods for vulnerability detection often suffer from limitations such as the compression of high-dimensional features and the omission of valuable semantic label information, which collectively compromise performance metrics and fine-tuning efficacy. To address these issues, we propose a novel methodology for fine-tuning pre-trained models using prompt learning. By projecting classification labels as prompts into a high-dimensional feature space, our model preserves essential spatial structures and optimizes the fine-tuning process more effectively. To further enhance classification performance, we design a specialized prompt network for learning adaptable prompts, along with a code encoder network that synergistically captures the semantic and structural nuances of code. Extensive empirical evaluation on two publicly available datasets, SARD and CodeXGLUE demonstrates significant improvements in classification performance and prompt optimization compared to existing state-of-the-art models. Specifically, our method achieves an accuracy of 87.98% on the SARD dataset and 70.97% on the CodeXGLUE dataset, marking a remarkable improvement over state-of-the-art solutions.
With the increasing number of publicly disclosed security vulnerabilities, the threat of software vulnerabilities to national security and network security can no longer be ignored. The rise of open-source software has provided a wealth of code resources, facilitating the application of deep learning in the domain of source code vulnerability detection. Research on vulnerability detection focuses on how to effectively represent the syntactic and structural information of the code and how to make the deep learning model learn the relevant features of the vulnerable code effectively. Existing methods often hard to balance the local dependencies and global information of code when performing code representation, and it is easy to learn features unrelated to vulnerabilities when faced with unbalanced datasets. To solve these problems, we propose a vulnerability detection model EFVD, which combines sequence-based and graph-based methods. EFVD consists of three main parts: (1) a sequence-based module that uses pre-trained Codebert to learn the global information of the code; (2) Graph-based module, we propose a GGNN with edge attention mechanism (EA-GGNN). EA-GGNN can dynamically assign weights to different edge types, fuse heterogeneous edge information into node representation, and effectively capture local and remote dependencies between nodes. (3) Vulnerability classification module, we used focal loss to replace the cross-entropy loss to enhance the model's focus on minority classes. Compared to the state-of-the art methods, EFVD achieves a maximum absolute improvement of 35.63% in accuracy and 289.32% in F1 scores over the three publicly available benchmark datasets.
No abstract available
With the rapid development and widespread use of advanced network systems, software vulnerabilities pose a significant threat to secure communications and networking. Learning-based vulnerability detection systems, particularly those leveraging pre-trained language models, have demonstrated significant potential in promptly identifying vulnerabilities in communication networks and reducing the risk of exploitation. However, the shortage of accurately labeled vulnerability datasets hinders further progress in this field. Failing to represent real-world vulnerability data variety and preserve vulnerability semantics, existing augmentation approaches provide limited or even counterproductive contributions to model training. In this paper, we propose a data augmentation technique aimed at enhancing the performance of pre-trained language models for vulnerability detection. Given the vulnerability dataset, our method performs natural semantic-preserving program transformation to generate a large volume of new samples with enriched data diversity and variety. By incorporating our augmented dataset in fine-tuning a series of representative code pre-trained models (i.e., CodeBERT, GraphCodeBERT, UnixCoder, and PDBERT), up to 10.1% increase in accuracy and 23.6% increase in F1 can be achieved in the vulnerability detection task. Comparison results also show that our proposed method can substantially outperform other prominent vulnerability augmentation approaches.
Software vulnerabilities are among the significant causes of security breaches. Vulnerabilities can severely compromise software security if exploited by malicious attacks and may result in catastrophic losses. Hence, Automatic vulnerability detection methods promise to mitigate attack risks and safeguard software security. This paper introduces a novel model for automatic vulnerability detection of source code vulnerabilities dubbed DB-CBIL using a hybrid deep learning model based on Distilled Bidirectional Encoder Representations from Transformers (DistilBERT). The proposed model considers contextualized word embeddings using the language model for the syntax and semantics of source code functions based on the Abstract Syntax Tree (AST) representation. The model includes two main phases. First, using a vulnerable code dataset, the pre-trained DistilBert transformer is fine-tuned for word embedding. Second, a hybrid deep learning model detects which code functions are vulnerable. The hybrid model is built on two Deep Neural Networks (DNN). The first model is the Convolutional Neural Network (CNN), which is used for extracting features. The second model is Bidirectional-LSTM (BiLSTM), which has been used to maintain the sequential order of the data as it can handle lengthy token sequences. The utilized source code dataset is derived from the Software Assurance Reference Database (SARD) benchmark dataset. Final experimental findings show that the proposed model outperforms the state-of-the-art approaches’ performance by improving precision, recall, F1-score, and False Negative Rate (FNR) by 2.41%-8.95%, 4.0%-16.28%, 1.85%-12.74%, and 18% respectively. The proposed model reports the lowest FNR in the literature, a significant achievement given the cost-based nature of vulnerability detectors.
Pre-training a language model and then fine-tuning it has shown to be an efficient and effective technique for a wide range of code intelligence tasks, such as code generation, code summarization, and vulnerability detection. However, pre-training language models on a large-scale code corpus is compu-tationally expensive. Fortunately, many off-the-shelf Pre-trained Code Models (PCMs), such as CodeBERT, CodeT5, CodeGen, and Code Llama, have been released publicly. These models acquire general code understanding and generation capability during pre-training, which enhances their performance on downstream code intelligence tasks. With an increasing number of these public pre-trained models, selecting the most suitable one to reuse for a specific task is essential. In this paper, we systematically investigate the reusability of PCMs. We first explore three intuitive model selection methods that select by size, training data, or brute-force fine-tuning. Experimental results show that these straightforward techniques either perform poorly or suffer high costs. Motivated by these findings, we explore learning-based model selection strategies that utilize pre-trained models without altering their parameters. Specifically, we train proxy models to gauge the performance of pre-trained models, and measure the distribution deviation between a model's latent features and the task's labels, using their closeness as an indicator of model transferability. We conduct experiments on 100 widely-used open-source PCMs for code intelligence tasks, with sizes ranging from 42.5 million to 3 billion parameters. The results demonstrate that learning-based selection methods reduce selection time to 100 seconds, compared to 2,700 hours with brute-force fine-tuning, with less than 6% performance degradation across related tasks.
Recently, there has been a growing interest in automatic software vulnerability detection. Pre-trained model-based approaches have demonstrated superior performance than other Deep Learning (DL)-based approaches in detecting vulnerabilities. However, the existing pre-trained model-based approaches generally employ code sequences as input during prediction, and may ignore vulnerability-related structural information, as reflected in the following two aspects. First, they tend to fail to infer the semantics of the code statements with complex logic such as those containing multiple operators and pointers. Second, they are hard to comprehend various code execution sequences, which is essential for precise vulnerability detection. To mitigate the challenges, we propose a Structured Natural Language Comment tree-based vulnerAbiLity dEtection framework based on the pre-trained models, named . The proposed Structured Natural Language Comment Tree (SCT) integrates the semantics of code statements with code execution sequences based on the Abstract Syntax Trees (ASTs).Specifically, comprises three main modules: (1) Comment Tree Construction, which aims at enhancing the model’s ability to infer the semantics of code statements by first incorporating Large Language Models (LLMs) for comment generation and then adding the comment node to ASTs. (2) Structured Natural Language Comment Tree Construction, which aims at explicitly involving code execution sequence by combining the code syntax templates with the comment tree. (3) SCT-Enhanced Representation, which finally incorporates the constructed SCTs for well capturing vulnerability patterns. Experimental results demonstrate that outperforms the best-performing baseline, including the pre-trained model and LLMs, with improvements of 2.96%, 13.47%, and 3.75% in terms of F1 score on the FFMPeg+Qemu, Reveal, and SVulD datasets, respectively. Furthermore, can be applied to different pre-trained models, such as CodeBERT and UniXcoder, yielding the F1 score performance enhancements ranging from 1.37% to 10.87%.
The increasing complexity of software systems has led to a surge in cybersecurity vulnerabilities, necessitating efficient and scalable solutions for vulnerability assessment. However, the deployment of large pre-trained models in real-world scenarios is hindered by their substantial computational and storage demands. To address this challenge, we propose a novel resource-efficient framework that integrates knowledge distillation and particle swarm optimization to enable automated vulnerability assessment. Our framework employs a two-stage approach: First, particle swarm optimization is utilized to optimize the architecture of a compact student model, balancing computational efficiency and model capacity. Second, knowledge distillation is applied to transfer critical vulnerability assessment knowledge from a large teacher model to the optimized student model. This process significantly reduces the model size while maintaining high performance. Experimental results on an enhanced MegaVul dataset, comprising 12,071 CVSS (Common Vulnerability Scoring System) v3 annotated vulnerabilities, demonstrate the effectiveness of our approach. Our approach achieves a 99.4% reduction in model size while retaining 89.3% of the original model's accuracy. Furthermore, it outperforms state-of-the-art baselines by 1.7% in accuracy with 60% fewer parameters. The framework also reduces training time by 72.1% and architecture search time by 34.88% compared to traditional genetic algorithms.
With the advancement of deep learning in various fields, there are many attempts to reveal software vulnerabilities by data-driven approach. Natural language processing has emerged as a powerful tool for bridging the semantic gap between programming languages and natural language. However, a significant disparity between the two still exists. In this work, we propose XGV-BERT, a framework that combines the pre-trained CodeBERT model and graph neural network to detect software vulnerabilities. By jointly training the CodeBERT and graph neural network modules within XGV-BERT, the proposed model leverages the advantages of large-scale pre-training, harnessing vast raw data, and transfer learning by learning representations for training data through graph convolution. The research results demonstrate that the XGV-BERT method significantly improves vulnerability detection accuracy compared to two existing methods such as VulDeePecker and SySeVR. For the VulDeePecker dataset, XGV-BERT achieves an impressive F1-score of 97.5%, significantly outperforming VulDeePecker, which achieved an F1-score of 78.3%. Again, with the SySeVR dataset, XGV-BERT achieves an F1-score of 95.5%, surpassing the results of SySeVR with an F1-score of 83.5%.
Tokenization is a fundamental component of language models for code. It involves breaking down the input into units that are later passed to the language model stack to learn high-dimensional representations used in various contexts, from classification to generation. However, the output of these tokenizers is often longer than that traditionally used in compilers and interpreters. This could result in undesirable effects, such as increased computational overhead. In this work, we investigate the effect of merging the hidden representations of subtokens that belong to the same semantic unit, such as subtokens that form a single identifier. We propose two strategies: one based on averaging the representations and another that leverages a learning-based approach. Both methods can be seamlessly integrated with existing language models for code. We conduct experiments using six language models for code: CodeBERT, GraphCodeBERT, UniXCoder, CdoeT5, CodeT5+ (220M), and CodeT5+ (770M), across three software engineering tasks: vulnerability detection, code classification, and code translation. Results show that these strategies can reduce the number of floating-point operations by $1\%$ to $19\%$. Regarding downstream performance, the most significant degradation was observed in the vulnerability detection task, where the F1 score decreased by $1.82$ points compared to the baseline. In contrast, for code translation, we observed an improvement of $2.47$ points in CodeBLEU. This work contributes to the broader effort of improving language models for code across multiple dimensions, including both computational efficiency and downstream performance.
Pre-trained large language models (LLMs) have advanced capabilities in feature extraction and pattern discovery. Utilizing fine-tuning techniques effectively adapts LLMs to specific scenarios. Detecting vulnerabilities during the coding phase is crucial. In this paper, we collected the Java CWE vulnerability from the SARD dataset and then supervised fine-tuned the open-source Qwen2-7B model using the LoRA technique. We compared the results with vulnerability detection models based on Graph Neural Networks (GNN) and Long Short-Term Memory (LSTM). It is demonstrated that the approach of fine-tuning for LLMs can effectively detect source code vulnerabilities.
Abstract The escalating complexity and sophistication of software vulnerabilities demand innovative approaches in cybersecurity. This study introduces a groundbreaking framework, named “CodeSentry”, employing a transformer-based model for vulnerability detection in software code. “CodeSentry” leverages a finely-tuned version of the Generative Pre-trained Transformer (GPT), optimized for pinpointing vulnerable code patterns across various benchmark datasets. This approach stands apart by its remarkable computational efficiency, making it suitable for real-time applications − a significant advancement over traditional, resource-intensive deep learning models like CNNs and LSTMs. Empirical results showcase “CodeSentry” achieving an impressive 92.65% accuracy in vulnerability detection, surpassing existing state-of-the-art methods such as SyseVR and VulDeBERT. This novel methodology marks a paradigm shift in vulnerability detection, blending advanced AI with practical application efficiency.
No abstract available
: As software programs continue to grow in size and complexity, the prevalence of software vulnerabilities has emerged as a significant security threat. Detecting these vulnerabilities has become a major concern due to the potential security risks they pose. Though Deep Learning (DL) approaches have shown promising results, previous studies have encountered challenges in simultaneously maintaining detection accuracy and scalability. In response to this challenge, our research proposes a method of automated software Vulnerability detection using CodeBERT and Convolutional Neural Network called VulBertCNN. The aim is to achieve both accuracy and scalability when identifying vulnerabilities in source code. This approach utilizes pre-trained codebert embedding model in graphical analysis of source code and then applies complex network analysis theory to convert a function’s source code into an image taking into account both syntactic and semantic information. Subsequently, a text convolutional neural network is employed to detect vulnerabilities from the generated images of code. In comparison to three existing CNN based methods TokenCNN, VulCNN and ASVD, our experimental results demonstrate a noteworthy improvement in accuracy from 78.6% to 95.7% and F1 measure increasing from 62.6% to 89% which is a significant increase of 21.7% and 26.3%. This underscores the effectiveness of our approach in detecting vulnerabilities in large-scale source code. Hence, developers can employ these findings to promptly apply effective patches on vulnerable functions.
Vulnerability analysis is crucial for software security. Inspired by the success of pre-trained models on software engineering tasks, this work focuses on using pre-training techniques to enhance the understanding of vulnerable code and boost vulnerability analysis. The code understanding ability of a pre-trained model is highly related to its pre-training objectives. The semantic structure, e.g., control and data dependencies, of code is important for vulnerability analysis. However, existing pre-training objectives either ignore such structure or focus on learning to use it. The feasibility and benefits of learning the knowledge of analyzing semantic structure have not been investigated. To this end, this work proposes two novel pre-training objectives, namely Control Dependency Prediction (CDP) and Data Dependency Prediction (DDP), which aim to predict the statement-level control dependencies and token-level data dependencies, respectively, in a code snippet only based on its source code. During pre-training, CDP and DDP can guide the model to learn the knowledge required for analyzing fine-grained dependencies in code. After pre-training, the pre-trained model can boost the understanding of vulnerable code during fine-tuning and can directly be used to perform dependence analysis for both partial and complete functions. To demonstrate the benefits of our pre-training objectives, we pre-train a Transformer model named PDBERT with CDP and DDP, fine-tune it on three vulnerability analysis tasks, i.e., vulnerability detection, vulnerability classification, and vulnerability assessment, and also evaluate it on program dependence analysis. Experimental results show that PDBERT benefits from CDP and DDP, leading to state-of-the-art performance on the three downstream tasks. Also, PDBERT achieves F1-scores of over 99% and 94% for predicting control and data dependencies, respectively, in partial and complete functions.
As software systems grow in complexity, source code vulnerability detection becomes crucial for software security. Existing methods, whether sequence-based or graph-based, face limitations in accurately detecting vulnerabilities. Sequence-based models often struggle with capturing code structure, while graph-based models have difficulty handling long-distance contextual relationships. To overcome these challenges, we propose a collaborative training framework that unifies a graph-based deep learning module and a semantic-rich large model module. The deep learning module, based on graph neural networks (GNNs), captures code structural information, and the large model module, leveraging pre-trained large language models (LLMs), understands code semantics. Through an iterative collaborative training mechanism, the two modules exchange information and learn from each other.Experimental results on three public datasets (Big-Vul, Reveal, and Devign) demonstrate the superiority of our approach. Compared with baseline models, our collaborative training model (CTVD) achieves significant improvements in accuracy, recall, precision, and F1-score. For example, on the Big-Vul dataset, our model’s accuracy reaches 86.5%, outperforming the deep learning module alone by 8.3% and the large model module alone by 6.4%. Compared with the latest co-training method-Vul-LMGNN, CTVD outperforms Vul-LMGNN in the DiverseVul dataset. We applied CTVD in real projects and found seven undisclosed vulnerabilities, all of which were reported and included in the CNNVD. In conclusion, our proposed collaborative training framework effectively combines the strengths of deep learning and large model modules, providing a more accurate and reliable solution for source code vulnerability detection.
The emergence of pre-trained model-based vulnerability detection methods has significantly advanced the field of automated vulnerability detection. However, these methods still face several challenges, such as difficulty in learning effective feature representations of statements for fine-grained predictions and struggling to process overly long code sequences. To address these issues, this study introduces StagedVulBERT, a novel vulnerability detection framework that leverages a pre-trained code language model and employs a coarse-to-fine strategy. The key innovation and contribution of our research lies in the development of the CodeBERT-HLS component within our framework, specialized in hierarchical, layered, and semantic encoding. This component is designed to capture semantics at both the token and statement levels simultaneously, which is crucial for achieving more accurate multi-granular vulnerability detection. Additionally, CodeBERT-HLS efficiently processes longer code token sequences, making it more suited to real-world vulnerability detection. Comprehensive experiments demonstrate that our method enhances the performance of vulnerability detection at both coarse- and fine-grained levels. Specifically, in coarse-grained vulnerability detection, StagedVulBERT achieves an F1 score of 92.26%, marking a 6.58% improvement over the best-performing methods. At the fine-grained level, our method achieves a Top-5% accuracy of 65.69%, which outperforms the state-of-the-art methods by up to 75.17%.
The security of smart contracts has garnered considerable attention given the potential for substantial financial losses and erosion of trust in blockchain platforms. Numerous methods have been proposed to detect vulnerabilities in smart contracts. Notably, as the number of smart contracts continues to proliferate, automated techniques based on deep learning (DL) are making remarkable progress. However, a significant challenge persists in acquiring an efficient embedding representation that is compatible with DL models with input length restrictions. In this paper, we propose a novel detection method named GraBit for identifying reentrancy vulnerability-one of the most critical vulnerabilities in smart contracts. GraBit leverages the pre-trained model GraphCodeBERT to embed both the source code and concise key data flow graphs extracted from the code. Additionally, we customize a sequential model based on Bi-directional Long Short-Term Memory and attention mechanism to effectively capture contextual semantic information. To evaluate the performance of GraBit, we conduct extensive experiments on a public large-scale dataset. Our experimental results reveal that GraBit achieves a remarkable F1-score of 94.44% in detecting reentrancy vulnerability, outperforming state-of-the-art methods.
Recent advances in automated vulnerability detection have achieved potential results in helping developers determine vulnerable components. However, after detecting vulnerabilities, investigating to fix vulnerable code is a non-trivial task. In fact, the types of vulnerability, such as buffer overflow or memory corruption, could help developers quickly understand the nature of the weaknesses and localize vulnerabilities for security analysis. In this work, we investigate the problem of vulnerability type identification (VTI). The problem is modeled as the multi-label classification task, which could be effectively addressed by"pre-training, then fine-tuning"framework with deep pre-trained embedding models. We evaluate the performance of the well-known and advanced pre-trained models for VTI on a large set of vulnerabilities. Surprisingly, their performance is not much better than that of the classical baseline approach with an old-fashioned bag-of-word, TF-IDF. Meanwhile, these deep neural network approaches cost much more resources and require GPU. We also introduce a lightweight independent component to refine the predictions of the baseline approach. Our idea is that the types of vulnerabilities could strongly correlate to certain code tokens (distinguishing tokens) in several crucial parts of programs. The distinguishing tokens for each vulnerability type are statistically identified based on their prevalence in the type versus the others. Our results show that the baseline approach enhanced by our component can outperform the state-of-the-art deep pre-trained approaches while retaining very high efficiency. Furthermore, the proposed component could also improve the neural network approaches by up to 92.8% in macro-average F1.
Software vulnerabilities will make the system vulnerable to attack, affect the reliability of the software and cause information leakage, which will have a huge impact on enterprises or individuals. Vulnerabilities are inevitable in software development engineering. Therefore, relying on some methods or tools for continuous vulnerability analysis of code is the solution to minimize software vulnerabilities. We propose a neural network model, JSVulExplorer, for static vulnerability analysis of the dynamic programming language JavaScript. The JSVulExplorer focuses on feature enhancement of data. We use pre-training to learn the semantic similarity between code slices, utilize abstract syntax trees to generate path information, and design positional encoding to use the path information. Based on transfer learning, we combine the pre-trained model with path information to improve vulnerability detection performance. Experiments show that JSVulExplorer has significantly improved precision and recall compared to previous models. It is verified that the dynamic event-based programming language can also use the static analysis method for vulnerability detection.
Open-source software vulnerability patch detection is a critical component for maintaining software security and ensuring software supply chain integrity. Traditional manual detection methods face significant scalability challenges when processing large volumes of commit histories, while being prone to human errors and omissions. Existing automated approaches, including heuristic-based methods and pre-trained model solutions, suffer from limited accuracy, poor generalization capabilities, and inherent methodological constraints that hinder their practical deployment. To address these fundamental challenges, this paper conducts a comprehensive empirical study of existing vulnerability patch detection methods, revealing four key insights that guide the design of effective solutions: the critical impact of search space reduction, the superiority of pre-trained semantic understanding over architectural complexity, the temporal limitations of web crawling approaches, and the advantages of knowledge-driven methods. Based on these insights, we propose a novel two-stage framework that combines version-driven candidate filtering with large language model-based multi-round dialogue voting to achieve accurate and efficient vulnerability patch identification. Extensive experiments on a dataset containing 750 real vulnerabilities demonstrate that our method outperforms current approaches.
Code vulnerability detection is crucial to ensure software security. Recent advancements, particularly with the emergence of Code Pre-Trained Models (CodePTMs) and Large Language Models (LLMs), have led to significant progress in this area. However, these models are easily susceptible to adversarial attacks, where even slight input modifications can lead the models to generate opposite results. Existing adversarial approaches, such as identifier replacement, code transformation, and dead code insertion, demonstrate promising performance but still face several limitations. First, the perturbations applied to the target code are relatively constrained (e.g., identifier replacement can only be applied to a small subset of tokens within the entire codebase). Second, the design of perturbed tokens lacks specificity in forcing the model to make incorrect predictions (e.g., they are generated by random selection or context-based prediction). Such limitations lead to the inefficiency and ineffectiveness of existing attacks. To address these issues, we propose SLODA (Statement-level OOD Features driven Adversarial Attack), which introduces two types of out-of-distribution (OOD) features: universal features via code deoptimization and label-specific features extracted from existing mispredicted and adversarial examples. These statement-level OOD features not only expand the perturbation scope, but can also significantly reduce the search space due to their inherently adversarial nature. Moreover, since the OOD features are extracted from existing code and the attack considers the context of the target code, they are more difficult to detect. Our extensive experiments across 15 models demonstrate that SLODA surpasses existing five state-of-the-art approaches in terms of the effectiveness, efficiency, and detection resistance. Furthermore, the adversarial examples generated by SLODA also exhibit promising performance to enhance model robustness.
Software vulnerabilities are weaknesses in software systems that can lead to significant cybersecurity risks. Recently, several deep learning (DL)-based approaches have been proposed to detect vulnerabilities at the function level. These approaches typically utilize one or a few different modalities (e.g., text representation and graph-based representation) of the function, and have shown promising performance. However, existing studies have not fully leveraged diverse modalities, particularly those that use images to represent functions for vulnerability detection. These approaches often fail to make sufficient use of the important graph structure underlying the images. In this article, we propose MVulD+, a multi-modal-based function-level vulnerability detection approach, which fuses multi-modal features of the function (i.e., text representation, graph representation, and image representation) to detect vulnerabilities. Specifically, MVulD+ leverages a pre-trained model (i.e., UniXcoder) to capture the semantic information of the textual source code, uses a graph neural network to extract graph representations, and employs computer vision techniques to obtain image representations while preserving the graph structure of the function. To investigate the effectiveness of MVulD+, we conduct a large-scale experiment by comparing our approach with nine state-of-the-art baselines. Experimental results demonstrate that MVulD+ improves the DL-based baselines by 24.3–125.7%, 5.2–31.4%, 40.6–192.2%, and 22.3–186.9% in terms of F1-score, Accuracy, Precision, and PR-AUC, respectively.
Large language models (LLMs) like ChatGPT (i.e., gpt-3.5-turbo and gpt-4) exhibited remarkable advancement in a range of software engineering tasks associated with source code such as code review and code generation. In this paper, we undertake a comprehensive study by instructing ChatGPT for four prevalent vulnerability tasks: function and line-level vulnerability prediction, vulnerability classification, severity estimation, and vulnerability repair. We compare ChatGPT with state-of-the-art language models designed for software vulnerability purposes. Through an empirical assessment employing extensive real-world datasets featuring over 190,000 C/C++ functions, we found that ChatGPT achieves limited performance, trailing behind other language models in vulnerability contexts by a significant margin. The experimental outcomes highlight the challenging nature of vulnerability prediction tasks, requiring domain-specific expertise. Despite ChatGPT's substantial model scale, exceeding that of source code-pre-trained language models (e.g., CodeBERT) by a factor of 14,000, the process of fine-tuning remains imperative for ChatGPT to generalize for vulnerability prediction tasks. We publish the studied dataset, experimental prompts for ChatGPT, and experimental results at https://github.com/awsm-research/ChatGPT4Vul.
Software code defects are extremely harmful to society. Therefore, automated code vulnerability detection has become an increasingly critical technology in the field of software engineering. In recent studies, pre-trained models are widely used in software engineering tasks. We aim to enhance the understanding of vulnerable code through pre-training techniques. Most of the existing pre-trained models focus on the analysis of code text and perform vulnerability detection by fine-tuning. However, they do not highlight the specialized learning of the vulnerability code, and are not able to detect some vulnerabilities caused by subtle differences. To address these challenges, we propose VuL-MCBERT, a new pre-trained model. We augment the original code sample data with a designed heuristic method, and train the model in a dynamically updated way, so that it can learn vulnerable code patterns in the pre-training phase. Experimental results show that VuL-MCBERT achieves the best performance on the CodeXGLUE benchmark dataset and outperforms the best baseline by 2.78% on Acc, demonstrating the effectiveness of our approach.
With the escalating threat of software vulnerabilities to the security of modern software systems, an increasing number of deep learning (DL) model-based vulnerability detectors have been developed for vulnerability detection. However, their practical reliability, consistency in usage, and adaptability across diverse software contexts remain unclear. This uncertainty may lead to unreliable detection results in practical applications, increased false positives and false negatives, and limited adaptability to newly emerged vulnerabilities. Conducting a large-scale and in-depth analysis of DL-based vulnerability detectors can help uncover critical factors influencing detection performance, improve the design and training of these models, and enhance their practical deployment in real-world scenarios. In this paper, we present VulTegra, a novel evaluation framework that, for the first time, conducts a multidimensional assessment comparing scratch-trained models and pre-trained-based models for vulnerability detection, while verifying key factors influencing detection performance. Our framework reveals that state-of-the-art (SOTA) detectors still suffer from low consistency, limited practical detection capabilities, and limited adaptability. Moreover, comparative results indicate that the increasingly favored pre-trained-based models are not universally superior to scratch-trained models; instead, they exhibit distinct strengths and application scenarios. Most importantly, our study highlights the limitations of relying solely on CWE-based classification and reveals a set of critical factors that significantly influence detection performance. Experimental validation shows that these factors have a substantial impact: modifying only any single factor led to recall improvements across all seven evaluated SOTA detectors, with six detectors also achieving higher F1 scores. Our findings provide deep insights into model behavior, highlighting the need to consider both vulnerability types and inherent code features to ensure practical applicability in real-world software environments.
With the rapid development of information technology, software security has become increasingly critical. Source code vulnerability detection is essential for maintaining system stability and data security. While deep learning has shown promise through advances in code representation and pre-trained models, existing RNNs struggle with long-range dependencies, and Transformers, though powerful, are resource-intensive. To address these issues, this paper proposes a vulnerability detection method that integrates Mamba with self-attention. The approach uses BPE tokenizer, embeds code semantics, and extracts features via a Mamba-enhanced encoder with multi-head attention, ultimately predicting vulnerabilities accurately. Experiments show the model achieves strong performance across accuracy, F1 score, precision, and recall, combining high effectiveness with efficient inference.
The prevalence of software vulnerabilities necessitates accurate and scalable detection techniques. While Pre-trained Language Models (PLMs) have shown strong potential in vulnerability analysis, most existing methods provide no explicit guidance on which parts of the input code are more likely to be vulnerable. As a result, the model must infer token-level relevance without any indication of which parts are important, making it harder to learn the characteristics of vulnerable code during training. We address this by proposing LOSVER (Line-level mOdifiability Signal-guided VulnERability analyzer), a novel two-stage framework that enhances PLM-based vulnerability analysis using line-level modifiability signals. LOSVER first localizes modifiable lines, which are code segments likely to be changed in the future and often associated with vulnerabilities, and then assigns them greater importance, allowing the PLM to focus on potentially vulnerable regions during both training and inference. We evaluated LOSVER across three benchmark datasets (Devign, Big-Vul, and PrimeVul) for vulnerability detection, classification, and patch-pair analysis. Experimental results demonstrate that LOSVER consistently improves performance, increasing detection accuracy by 4 percentage points and the weighted F1-score for classification by over 2 points when applied on top of UniXcoder. These results demonstrate that integrating line-level modifiability signals significantly enhances the effectiveness of PLM-based software vulnerability analysis across both detection and classification tasks.
One of the most time-consuming and errorprone phases of software maintenance is bug localization. The recent developments in Large Language Models (LLMs) have created new possibilities of automating the process of debugging by using intelligent code understanding and reasoning. This paper presents a bug localization model based on LLM, which uses CodeBERT, a transformer-based model trained on source code, to identify faults accurately. The framework combines Information Retrieval (IR)-based approaches to prioritize suspicious code segments and uses the semantic representations trained by CodeBERT to improve contextual knowledge of source code. The retrieval-based ranking, together with deep semantic embeddings, allows the approach to greatly narrow down the search space and enhance the precision of detecting faulty lines of code. Experimental analysis on benchmark bug data shows better performance than traditional IR-only methods. The suggested framework not only speeds up the process of debugging but also demonstrates the promise of LLMs in promoting the practice of automated software engineering.
Fixing bugs in large programs is a challenging task that demands substantial time and effort. Once a bug is found, it is reported to the project maintainers, who work with the reporter to fix it and eventually close the issue. However, across the program, there are often similar code segments, which may also contain the bug, but were missed during discovery. Finding and fixing each recurring bug instance individually is labor intensive. Even more concerning, bug reports can inadvertently widen the attack surface as they provide attackers with an exploitable pattern that may be unresolved in other parts of the program. In this paper, we explore these Recurring Pattern Bugs (RPBs) that appear repeatedly across various code segments of a program or even in different programs, stemming from a same root cause, but are unresolved. Our investigation reveals that RPBs are widespread and can significantly compromise the security of software programs. This paper introduces BugStone, a program analysis system empowered by LLVM and a Large Language Model (LLM). The key observation is that many RPBs have one patched instance, which can be leveraged to identify a consistent error pattern, such as a specific API misuse. By examining the entire program for this pattern, it is possible to identify similar sections of code that may be vulnerable. Starting with 135 unique RPBs, BugStone identified more than 22K new potential issues in the Linux kernel. Manual analysis of 400 of these findings confirmed that 246 were valid. We also created a dataset from over 1.9K security bugs reported by 23 recent top-tier conference works. We manually annotate the dataset, identify 80 recurring patterns and 850 corresponding fixes. Even with a cost-efficient model choice, BugStone achieved 92.2% precision and 79.1% pairwise accuracy on the dataset.
Cloud environments are increasingly managed by Infrastructure-as-Code (IaC) platforms (e.g., Terraform), which allow developers to define their desired infrastructure as a configuration program that describes cloud resources and their dependencies. This shields developers from low-level operations for creating and maintaining resources, since they are automatically performed by IaC platforms when compiling and deploying the configuration. However, while IaC platforms are rigorously tested for initial deployments, they exhibit myriad errors for runtime updates, e.g., adding/removing resources and dependencies. IaC updates are common because cloud infrastructures are long-lived but user requirements fluctuate over time. Unfortunately, our experience shows that updates often introduce subtle yet impactful bugs. The update logic in IaC frameworks is hard to test due to the vast and evolving search space, which includes diverse infrastructure setups and a wide range of provided resources with new ones frequently added. We introduce TerraFault, an automated, efficient, LLM-guided system for discovering update bugs, and report our findings with an initial prototype. TerraFault incorporates various optimizations to navigate the large search space efficiently and employs techniques to accelerate the testing process. Our prototype has successfully identified bugs even in simple IaC updates, showing early promise in systematically identifying update bugs in today's IaC frameworks to increase their reliability.
Fuzzing has been incredibly successful in uncovering bugs and vulnerabilities across diverse software systems. JSON parsers play a vital role in modern software development, and ensuring their reliability is of great importance. This research project focuses on leveraging Large Language Models (LLMs) to enhance JSON parser testing. The primary objectives are to generate test cases and mutants using LLMs for the discovery of potential bugs in open-source JSON parsers and the identification of behavioral diversities among them. We aim to uncover underlying bugs, plus discovering (and overcoming) behavioral diversities.
Automated bug detection and repair are critical in determining the reliability and cost of software development. To address these concerns, this research uses a transformer-based sequence-to-sequence model called CodeT5, together with the Defects4J dataset, which contains real Java programs with officially identified buggy and fixed versions. The approach involves retrieving buggy and fixed code pairs, pretraining CodeT5 for sequence-to-sequence learning, and then evaluating the model using the BLEU, EM, and ED measures. This model got translated sentences' BLEU 78.4%, Exact Match, rate 64.2% and accuracy 82.0%. While having an average edit distance of 12.3 operations, the generated fixes suggest how slight modifications can be made to produce semantically and syntactically correct fixes with little human intuition or input. This work shows that program repair can benefit from applying state-of-the-art deep learning algorithms and realistic benchmark datasets. The outcomes point to possible uses for enhancing directions of software creation and shortening the hours spent on debugging. Future work will be dedicated to using similar techniques in different languages and handling different types of bugs.
Efficient bug triage is a critical aspect of large-scale software development, yet it remains a labor-intensive and error-prone task. This paper presents a novel approach to automated bug report classification by leveraging BERT, a state-of-the-art transformer-based LLM, to categorize bug reports into well-defined software defect types. We introduce a structured and scalable classification taxonomy designed to reflect the complexities of real-world bug reports. The proposed method incorporates fine-tuning of BERT on domain-specific datasets and evaluates performance across multiple bug categories using accuracy, precision, recall, and F1-score metrics. Empirical results demonstrate that our approach outperforms traditional machine learning methods, achieving an overall accuracy of 72% and delivering particularly strong performance in critical categories such as security and performance bugs. Using both the bug title and description as input produced the best results, underscoring the importance of contextual detail in effective bug triage. This work contributes to the field of software defect classification by providing a replicable and adaptable methodology for automated bug triage, with practical implications for enhancing software maintenance and quality in large-scale development projects.
No abstract available
Software defect prediction is crucial for maintaining software quality by detecting defective modules at an early stage of the development process. Conventional models suffer from problems like high-dimensionality, noisy features, and poor hyperparameters, resulting in lower accuracy. Even with the growth of machine learning and deep learning, accurate and efficient defect prediction is still a problem. This research suggests a more advanced deep learning model that combines (SSO) with (DNN) to solve feature selection and hyperparameter optimization problems. The suggested model is trained on the Kaggle Software Defect Prediction Dataset, which contains different software metrics like lines of code, cyclomatic complexity, and past bug reports. SSO is utilized for hyperparameter tuning and feature selection, tuning parameters such as number of layers, learning rate, and dropout percentages. The DNN is thereafter utilized for classifying defects. The model produced a classification rate of 94.4% compared to customary models. Quantitative measures in the form of Precision (98.6%), Recall (98%), and F1-score (98.4%) also assert its efficiency. Deployment testing at the cloud stage revealed latency as between 180ms and 250ms, throughput as from 38 Mbps to 45 Mbps, and availability as in excess of 99.7% for seven days. Scalability tests showed a linear increase in response time from 1.2s to 3.0s as the user count increased from 100 to 700. Optimization of resources was also witnessed between training iterations with CPU utilization decreasing from 65% to 50% and memory usage decreasing from 70% to 59%. This proves that integrating SSO and DNN leads to a correct, efficient, and scalable defect prediction model, well suited for real-time cloud-based applications
This paper explores the use of transformer-based models for bug detection in source code, aiming to better understand the capacity of these models to learn complex patterns and relationships within the code. Traditional static analysis tools are highly limited in their ability to detect semantic errors, resulting in numerous defects passing through to the code execution stage. This research represents a step towards enhancing static code analysis using neural networks. The experiments were designed as binary classification tasks to detect buggy code snippets, each targeting a specific defect type such as NameError, TypeError, IndexError, AttributeError, ValueError, EOFError, SyntaxError, and ModuleNotFoundError. Utilizing the «RunBugRun» dataset, which relies on code execution results, the models – BERT, CodeBERT, GPT-2, and CodeT5 – were fine-tuned and compared under identical conditions and hyperparameters. Performance was evaluated using F1-Score, Precision, and Recall. The results indicated that transformer-based models, especially CodeT5 and CodeBERT, were effective in identifying various defects, demonstrating their ability to learn complex code patterns. However, performance varied by defect type, with some defects like IndexError and TypeError being more challenging to detect. The outcomes underscore the importance of high-quality, diverse training data and highlight the potential of transformer-based models to achieve more accurate early defect detection. Future research should further explore advanced transformer architectures for detecting complicated defects, and investigate the integration of additional contextual information to the detection process. This study highlights the potential of modern machine learning architectures to advance software engineering practices, leading to more efficient and reliable software development.
Deep neural networks (DNNs) are increasingly used in critical applications like autonomous vehicles and medical diagnosis, where accuracy and reliability are crucial. However, debugging DNNs is challenging and expensive, often leading to unpredictable behavior and performance issues. Identifying and diagnosing bugs in DNNs is difficult due to complex and obscure failure symptoms, which are data-driven and compute-intensive. To address this, we propose TransBug a framework that combines transformer models for feature extraction with deep learning models for classification to detect and diagnose bugs in DNNs. We employ a pre-trained transformer model, which has been trained in programming languages, to extract semantic features from both faulty and correct DNN models. We then use these extracted features in a separate deep-learning model to determine whether the code contains bugs. If a bug is detected, the model further classifies the type of bug. By leveraging the powerful feature extraction capabilities of transformers, we capture relevant characteristics from the code, which are then used by a deep learning model to identify and classify various types of bugs. This combination of transformer-based feature extraction and deep learning classification allows our method to accurately link bug symptoms to their causes, enabling developers to take targeted corrective actions. Empirical results show that the TransBug shows an accuracy of 81% for binary classification and 91% for classifying bug types.
Recent work applies Large Language Models (LLMs) to source-code vulnerability detection, but most evaluations still rely on random train-test splits that ignore time and overestimate real-world performance. In practice, detectors are deployed on evolving code bases and must recognise future vulnerabilities under temporal distribution shift. This paper investigates continual fine-tuning of a decoder-style language model (microsoft/phi-2 with LoRA) on a CVE-linked dataset spanning 2018-2024, organised into bi-monthly windows. We evaluate eight continual learning strategies, including window-only and cumulative training, replay-based baselines and regularisation-based variants. We propose Hybrid Class-Aware Selective Replay (Hybrid-CASR), a confidence-aware replay method for binary vulnerability classification that prioritises uncertain samples while maintaining a balanced ratio of VULNERABLE and FIXED functions in the replay buffer. On bi-monthly forward evaluation Hybrid-CASR achieves a Macro-F1 of 0.667, improving on the window-only baseline (0.651) by 0.016 with statistically significant gains ($p = 0.026$) and stronger backward retention (IBR@1 of 0.741). Hybrid-CASR also reduces training time per window by about 17 percent compared to the baseline, whereas cumulative training delivers only a minor F1 increase (0.661) at a 15.9-fold computational cost. Overall, the results show that selective replay with class balancing offers a practical accuracy-efficiency trade-off for LLM-based temporal vulnerability detection under continuous temporal drift.
Smart contracts have emerged as key components within decentralized environments, enabling the automation of transactions through self-executing programs. While these innovations offer significant advantages, they also present potential drawbacks if the smart contract code is not carefully designed and implemented. This paper investigates the capability of large language models (LLMs) to detect OWASP-inspired vulnerabilities in smart contracts beyond the Ethereum Virtual Machine (EVM) ecosystem, focusing specifically on Solana and Algorand. Given the lack of labeled datasets for nonEVM platforms, we design a synthetic dataset of annotated smart contract snippets in Rust (for Solana) and PyTeal (for Algorand), structured around a vulnerability taxonomy derived from OWASP. We evaluate LLMs under three configurations: prompt engineering, fine-tuning, and a hybrid of both, comparing their performance on different vulnerability categories. Experimental results show that prompt engineering achieves general robustness, while fine-tuning improves precision and recall on less semantically rich languages such as TEAL. Additionally, we analyze how the architectural differences of Solana and Algorand influence the manifestation and detectability of vulnerabilities, offering platform-specific mappings that highlight limitations in existing security tooling. Our findings suggest that LLM-based approaches are viable for static vulnerability detection in smart contracts, provided domain-specific data and categorization are integrated into training pipelines.
The current landscape of system-on-chips (SoCs) security verification faces challenges due to manual, labor-intensive, and inflexible methodologies. These issues limit the scalability and effectiveness of security protocols, making bug detection at the Register-Transfer Level (RTL) difficult. This paper proposes a new framework named BugWhisperer that utilizes a specialized, fine-tuned Large Language Model (LLM) to address these challenges. By enhancing the LLM’s hardware security knowledge and leveraging its capabilities for text inference and knowledge transfer, this approach automates and improves the adaptability and reusability of the verification process. We introduce an open-source, fine-tuned LLM specifically designed for detecting security vulnerabilities in SoC designs. Our findings demonstrate that this tailored LLM effectively enhances the efficiency and flexibility of the security verification process. Additionally, we introduce a comprehensive hardware vulnerability database that supports this work and will further assist the research community in enhancing the security verification process.
Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models
Software security vulnerabilities allow attackers to perform malicious activities to disrupt software operations. Recent Transformer-based language models have significantly advanced vulnerability detection, surpassing the capabilities of static analysis based deep learning models. However, language models trained solely on code tokens do not capture either the explanation of vulnerability type or the data flow structure information of code, both of which are crucial for vulnerability detection. We propose a novel technique that integrates a multitask sequence-to-sequence LLM with pro-gram control flow graphs encoded as a graph neural network to achieve sequence-to-classification vulnerability detection. We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction. Our experiments demonstrate that MSIVD achieves superior performance, outperforming the highest LLM-based vulnerability detector baseline (LineVul), with a F1 score of 0.92 on the BigVul dataset, and 0.48 on the PreciseBugs dataset. By training LLMs and GNNs simultaneously using a combination of code and explanatory metrics of a vulnerable program, MSIVD represents a promising direction for advancing LLM-based vulnerability detection that generalizes to unseen data. Based on our findings, we further discuss the necessity for new labelled security vulnerability datasets, as recent LLMs have seen or memorized prior datasets' held-out evaluation data.
Large Language Models (LLMs) have shown significant potential for vulnerability localization in software security. However, current LLM-based approaches face a critical dilemma: direct application of general-purpose LLMs lacks crucial domainspecific expertise, while fine-tuning suffers from limited robustness when faced with unfamiliar data. These problems result in subpar performance in vulnerability localization and weak generalization capabilities. To address these limitations, we introduce ENVUL, a novel domain adaptation framework for vulnerability localization. ENVUL improves vulnerability localization by synergizing enhanced task-specific tuning with prompt engineering of general-purpose LLMs. ENVUL incorporates three key innovations for addressing two problems: (1) how to optimize fine-tuning for localization task, and (2) when to wisely choose tuning and prompting. To solve the first problem, we introduce: (a). a context Consolidator that captures rich statement-level code semantic, improving the model's understanding of code context; (b). a semantic Indicator employing attention rectification to highlight patterns indicative of vulnerabilities, focusing the model on critical security signals. To solve the second problem, we introduce a dynamic routing mechanism based on joint-representation similarity analysis that strategically delegates tasks between the fine-tuned model and the general LLM. It ensures ENVUL's robust performance across diverse real-world vulnerability types. Real-world evaluations demonstrate ENVUL's robust expertise in outperforming state-of-the-art vulnerability localization baselines, achieving absolute improvements of $\mathbf{2 2. 7 \% - 3 0. 3 \%}$ in top-1 accuracy. Notably, ENVul exhibits exceptional generalization, achieving $\mathbf{4 3. 6 \% - 5 0 \%}$ higher accuracy on unfamiliar vulnerability types.
Large language models (LLMs) have been proposed as powerful tools for detecting software vulnerabilities, where task-specific fine-tuning is typically employed to provide vulnerability-specific knowledge to the LLMs. However, existing fine-tuning techniques often treat source code as plain text, losing the graph-based structural information inherent in code. Graph-enhanced soft prompt tuning addresses this by translating the structural information into contextual cues that the LLM can understand. However, current methods are primarily designed for general graph-related tasks and focus more on adjacency information, they fall short in preserving the rich semantic information (e.g., control/data flow) within code graphs. They also fail to ensure computational efficiency while capturing graph-text interactions in their cross-modal alignment module. This paper presents CGP-Tuning, a new code graph-enhanced, structure-aware soft prompt tuning method for vulnerability detection. CGP-Tuning introduces type-aware embeddings to capture the rich semantic information within code graphs, along with an efficient cross-modal alignment module that achieves linear computational costs while incorporating graph-text interactions. It is evaluated on the latest DiverseVul dataset and three advanced open-source code LLMs, CodeLlama, CodeGemma, and Qwen2.5-Coder. Experimental results show that CGP-Tuning delivers model-agnostic improvements and maintains practical inference speed, surpassing the best graph-enhanced soft prompt tuning baseline by an average of four percentage points and outperforming non-tuned zero-shot prompting by 15 percentage points.
To address the issues of insufficient semantic understanding and low vulnerability recognition efficiency in traditional binary program fuzzy testing through reverse analysis, this paper proposes a Large Language Model (LLM) optimization scheme to enhance reverse analysis capabilities, and constructs an "encoding enhancement decoding" framework suitable for binary analysis scenarios. By designing a binary instruction structured representation mechanism and reverse analysis oriented pre training tasks, multi granularity fusion of instruction level, basic block level, and function level features is achieved, and model fine-tuning is completed based on vulnerability annotation datasets. Integrate the optimized LLM model into a fuzzy testing system and conduct experiments on the standard test set (LAVA-M) and three real binary programs (Libtif, FFmpeg, OpenSSL). The results show that compared with the traditional fuzzy testing tool AFL, the accuracy of system vulnerability identification has been improved by 38.2%, the testing coverage has been increased by 25.7%, and the average time for vulnerability mining has been reduced by 41.3%; Compared with the unenhanced CodeLlama model, the accuracy and efficiency were improved by 19.5% and 28.1%, respectively, verifying the effectiveness of the enhanced reverse analysis LLM model in improving the performance of fuzzy testing vulnerability identification.
As software complexity rises, research on vulnerability detection becomes increasingly important. Deep learning-based vulnerability detection, an emerging approach, can segment code and identify hidden vulnerability patterns. However, challenges remain: (1) accurately correlating code slices with scripts to minimize false positives; (2) enhancing the precision of root cause localization for vulnerable scripts. To alleviate these, this paper introduces a vulnerability detection and root cause localization approach leveraging large language models (LLMs). The approach preprocesses C/C++ source code, extracts graph structures, and combines them with the script to form prompts. A novel Hierarchical Regulation for Parameter-Efficient Language Model Tuning (HR-PELT) approach fine-tunes the LLM for vulnerability detection by maintaining parameters of shallow layers substantially preserved while enhancing the adaptability of deep layers. For root cause localization, we similarly create prompts and fine-tune another LLM. Experimental results on three datasets demonstrated improvements: in vulnerability detection, our approach boosted average Accuracy (ACC) by 4.82% and Macro-F1 (M-F1) by 5.29% over the state-of-the-art (SOTA); in root cause localization, it enhanced ACC10% by 4.80% and ACC20% by 3.93%.
The rapid expansion of blockchain technology, particularly Ethereum, has driven widespread adoption of smart contracts. However, the security of these contracts remains a critical concern due to the increasing frequency and complexity of vulnerabilities. This paper presents a comprehensive approach to detecting vulnerabilities in Ethereum smart contracts using pre-trained Large Language Models (LLMs). We apply transformer-based LLMs, leveraging their ability to understand and analyze Solidity code to identify potential security flaws. Our methodology involves fine-tuning eight distinct pre-trained LLM models on curated datasets varying in types and distributions of vulnerabilities, including multi-class vulnerabilities. The datasets-SB Curate, Benmark Solidity Smart Contract, and ScrawID-were selected to ensure a thorough evaluation of model performance across different vulnerability types. We employed over-sampling techniques to address class imbalances, resulting in more reliable training outcomes. We extensively evaluate these models using precision, recall, accuracy, F1 score, and Receiver Operating Characteristics (ROC) curve metrics. Our results demonstrate that the transformer encoder architecture, with its multi-head attention and feed-forward mechanisms, effectively captures the nuances of smart contract vulnerabilities. The models show promising potential in enhancing the security and reliability of Ethereum smart contracts, offering a robust solution to challenges posed by software vulnerabilities in the blockchain ecosystem.
The rapid expansion of the Internet of Things (IoT) has made software security and reliability a critical concern. With multi-language programs running on edge computing, embedded systems, and sensors, each connected device represents a potential attack vector, threatening data integrity and privacy. Symbolic execution is a key technique for automated vulnerability detection. However, unknown function interfaces, such as sensor interactions, limit traditional concrete or concolic execution due to uncertain function returns and missing symbolic expressions. Compared with system simulation, the traditional method is to construct an interface abstraction layer for the symbolic execution engine to reduce the cost of simulation. Nevertheless, the disadvantage of this solution is that the manual modeling of these functions is very inefficient and requires professional developers to spend hundreds of hours. In order to improve efficiency, we propose an LLM-based automated approach for modeling unknown functions. By fine-tuning a 20-billion-parameter language model, it automatically generates function models based on annotations and function names. Our method improves symbolic execution efficiency, reducing reliance on manual modeling, which is a limitation of existing frameworks like KLEE. Experimental results primarily focus on comparing the usability, accuracy, and efficiency of LLM-generated models with human-written ones. Our approach was integrated into one verification platform project and applied to the verification of smart contracts with distributed edge computing characteristics. The application of this method directly reduces the manual modeling effort from a month to just a few minutes. This provides a foundational validation of our method’s feasibility, particularly in reducing modeling time while maintaining quality. This work is the first to integrate LLMs into formal verification, offering a scalable and automated verification solution for sensor-driven software, blockchain smart contracts, and WebAssembly systems, expanding the scope of secure IoT development.
Industrial control systems (ICS) are vital to modern infrastructure but increasingly vulnerable to cybersecurity threats, particularly through weaknesses in their communication protocols. This paper presents MALF (Multi-Agent LLM Fuzzing Framework), an advanced fuzzing solution that integrates large language models (LLMs) with multi-agent coordination to identify vulnerabilities in industrial control protocols (ICPs). By leveraging Retrieval-Augmented Generation (RAG) for domain-specific knowledge and QLoRA fine-tuning for protocol-aware input generation, MALF enhances fuzz testing precision and adaptability. The multi-agent framework optimizes seed generation, mutation strategies, and feedback-driven refinement, leading to improved vulnerability discovery. Experiments on protocols like Modbus/TCP, S7Comm, and Ethernet/IP demonstrate that MALF surpasses traditional methods, achieving a test case pass rate (TCPR) of 88-92% and generating more exception triggers (ETN). MALF also maintains over 90% seed coverage and Shannon entropy values between 4.2 and 4.6 bits, ensuring diverse, protocol-compliant mutations. Deployed in a real-world Industrial Attack-Defense Range for power plants, MALF identified critical vulnerabilities, including three zero-day flaws, one confirmed and registered by CNVD. These results validate MALF's effectiveness in real-world fuzzing applications. This research highlights the transformative potential of multi-agent LLMs in ICS cybersecurity, offering a scalable, automated framework that sets a new standard for vulnerability discovery and strengthens critical infrastructure security against emerging threats.
Organizations nowadays rely on intensive software systems to support their business operations but vulnerabilities within these systems can cause potential risks for major disruption. AI-based techniques are now widely considered for vulnera-bility identification; however effectiveness heavily relies on the dataset’s size and quality. These techniques often lack contextual information while processing data and pose challenges in resource-constrained environments. AI models are generally black box in nature which creates additional challenges to understand decision making processes. This work proposes a novel hybrid framework using LLM model based on CodeBERT with integration of fine-tuning and Model-Agnostic Meta-Learning for performing effective vulnerability detection. It includes few-shot learning technique for new vulnerability detection tasks while maintaining high performance on known cases. The approach adopts Explainable AI techniques from four dimensions including attention mechanisms, layer-wise analysis, feature contribution, and model confidence scores to explain model decision making. An experiment demonstrates the framework’s effectiveness, show-ing steady decrease in meta-loss from 0.45 to 0.14, accompanied by increase in support accuracy from 85.2% to 92.5%. These findings establish the proposed framework as a robust and interpretable solution for vulnerability detection and management.
Large Language Models (LLMs) are being extensively used for cybersecurity purposes. One of them is the detection of vulnerable codes. For the sake of efficiency and effectiveness, compression and fine-tuning techniques are being developed, respectively. However, they involve spending substantial computational efforts. In this vein, we analyse how Linear Probes (LPs) can be used to provide an estimation on the performance of a compressed LLM at an early phase -- before fine-tuning. We also show their suitability to set the cut-off point when applying layer pruning compression. Our approach, dubbed $LPASS$, is applied in BERT and Gemma for the detection of 12 of MITRE's Top 25 most dangerous vulnerabilities on 480k C/C++ samples. LPs can be computed in 142.97 s. and provide key findings: (1) 33.3 \% and 72.2\% of layers can be removed, respectively, with no precision loss; (2) they provide an early estimate of the post-fine-tuning and post-compression model effectiveness, with 3\% and 8.68\% as the lowest and average precision errors, respectively. $LPASS$-based LLMs outperform the state of the art, reaching 86.9\% of accuracy in multi-class vulnerability detection. Interestingly, $LPASS$-based compressed versions of Gemma outperform the original ones by 1.6\% of F1-score at a maximum while saving 29.4 \% and 23.8\% of training and inference time and 42.98\% of model size.
Unlike conventional machine learning (ML) or deep learning (DL) methods, Large Language Models (LLM) possess the ability to tackle complex tasks through intricate chains of reasoning, a facet often overlooked in existing work on vulnerability detection. Nevertheless, these models have demonstrated variable performance when presented with different prompts (inputs), motivating a surge of research into prompt engineering – the process of optimizing prompts to enhance their performance. This paper studies different prompt settings (zero-shot and few-shot) when using LLMs for software vulnerability detection. Our exploration involves harnessing the power of both natural language (NL) unimodal and NL-PL (programming language) bimodal models within the prompt engineering process. Experimental results indicate that LLM, when provided only with source code or zero-shot prompts, tends to classify most code snippets as vulnerable, resulting in unacceptably high recall. These findings suggest that, despite their advanced capabilities, LLMs may not inherently possess the knowledge for vulnerability detection tasks. However, few-shot learning benefits from additional domain-specific knowledge, offering a promising direction for future research in optimizing LLMs for vulnerability detection.
Large Language Models (LLMs) struggle to automate real-world vulnerability detection due to two key limitations: the heterogeneity of vulnerability patterns undermines the effectiveness of a single unified model, and manual prompt engineering for massive weakness categories is unscalable. To address these challenges, we propose \textbf{MulVul}, a retrieval-augmented multi-agent framework designed for precise and broad-coverage vulnerability detection. MulVul adopts a coarse-to-fine strategy: a \emph{Router} agent first predicts the top-$k$ coarse categories and then forwards the input to specialized \emph{Detector} agents, which identify the exact vulnerability types. Both agents are equipped with retrieval tools to actively source evidence from vulnerability knowledge bases to mitigate hallucinations. Crucially, to automate the generation of specialized prompts, we design \emph{Cross-Model Prompt Evolution}, a prompt optimization mechanism where a generator LLM iteratively refines candidate prompts while a distinct executor LLM validates their effectiveness. This decoupling mitigates the self-correction bias inherent in single-model optimization. Evaluated on 130 CWE types, MulVul achieves 34.79\% Macro-F1, outperforming the best baseline by 41.5\%. Ablation studies validate cross-model prompt evolution, which boosts performance by 51.6\% over manual prompts by effectively handling diverse vulnerability patterns.
This paper explores the integration of Large Language Models (LLMs) with static analysis tools, specifically Semgrep, to enhance vulnerability detection in Java applications. Through a series of experiments, we evaluate the performance of various LLMs in triaging security weaknesses identified by Semgrep. We also study how LLMs perform across different types of vulnerabilities and assess the impact of various prompt engineering strategies. Our results reveal that while some LLM models reduce the accuracy of baseline results with static analysis, they show a consistent improvement with each new model released. In particular, o1-mini significantly outperformed others in our experiments in terms of their accuracy and false positive reduction. Although LLMs might not be ready for prime time in vulnerability detection yet, this study highlights their growing potential to complement existing tools and paves the way for future research to further optimize LLM-based vulnerability detection systems.
This paper presents a novel evaluation framework that uses Large Language Models (LLMs) to automatically generate formal security assertions for autonomous vehicle (AV) subsystems an area that remains insufficiently explored in the context of hardware-level safety verification. While LLMs have demonstrated capabilities in tasks such as perception, route planning, and user interaction, their role in generating formal assertions for low-level AV hardware components has not been systematically studied. The proposed framework addresses this gap by guiding LLMs to generate SystemVerilog Assertions (SVAs) across four AV-related benchmarks, each corresponding to a distinct hardware vulnerability classified by Common Weakness Enumerations (CWEs). These benchmarks include traffic signal controllers, AES encryption modules, privilege-level register controllers, and ADC reset logic. The framework incorporates structured prompt engineering, syntax correction, and simulation-based validation to assess the correctness of generated assertions. Experimental results using OpenAI’s Codex show that LLMs can produce correct assertions in more than 50% of cases when complete design context is provided, while performance drops significantly with minimal input. This study introduces the first comprehensive benchmark and evaluation pipeline for LLM-generated SVAs in AV systems and offers new insights into the potential of generative models to support formal verification in intelligent transportation and smart city infrastructures.
The widespread adoption of open-source software (OSS) has accelerated software innovation but also increased security risks due to the rapid propagation of vulnerabilities and silent patch releases. In recent years, large language models (LLMs) and LLM-based agents have demonstrated remarkable capabilities in various software engineering (SE) tasks, enabling them to effectively address software security challenges such as vulnerability detection. However, systematic evaluation of the capabilities of LLMs and LLM-based agents in security patch detection remains limited. To bridge this gap, we conduct a comprehensive evaluation of the performance of LLMs and LLM-based agents for security patch detection. Specifically, we investigate three methods: Plain LLM (a single LLM with a system prompt), Data-Aug LLM (data augmentation based on the Plain LLM), and the ReAct Agent (leveraging the thought-action-observation mechanism). We also evaluate the performance of both commercial and open-source LLMs under these methods and compare these results with those of existing baselines. Furthermore, we analyze the detection performance of these methods across various vulnerability types, and examine the impact of different prompting strategies and context window sizes on the results. Our findings reveal that the Data-Aug LLM achieves the best overall performance, whereas the ReAct Agent demonstrates the lowest false positive rate (FPR). Although baseline methods exhibit strong accuracy, their false positive rates are significantly higher. In contrast, our evaluated methods achieve comparable accuracy while substantially reducing the FPR. These findings provide valuable insights into the practical applications of LLMs and LLM-based agents in security patch detection, highlighting their advantage in maintaining robust performance while minimizing false positive rates.
Artificial Intelligence (AI) and more specifically Large Language Models (LLMs) have demonstrated exceptional progress in multiple areas including software engineering, however, their capability for vulnerability detection in the wild scenario and its corresponding reasoning remains underexplored. Prompting pre-trained LLMs in an effective way offers a computationally effective and scalable solution. Our contributions are (i)varied prompt designs for vulnerability detection and its corresponding reasoning in the wild. (ii)a real-world vector data store constructed from the National Vulnerability Database, that will provide real time context to vulnerability detection framework, and (iii)a scoring measure for combined measurement of accuracy and reasoning quality. Our contribution aims to examine whether LLMs are ready for wild deployment, thus enabling the reliable use of LLMs stronger for the development of secure software's.
The exponential growth of data in relational (SQL) and non-relational (NoSQL) databases has led to an increase in injection attacks, ranking them among the top cybersecurity threats. This study evaluates the effectiveness of machine learning (ML) models and large language models (LLMs) in detecting SQL and NoSQL injection vulnerabilities in source code. We train and compare various ML models—including Logistic Regression, Naïve Bayes, Decision Tree, K-Nearest Neighbors, Random Forest, Multilayer Perceptron (MLP), and Convolutional Neural Networks (CNN)—against a fine-tuned GPT-4o-mini-202407-18 model using both prompt engineering and supervised learning techniques. Our results demonstrate that the Random Forest model achieves 99% accuracy in detecting SQL injection vulnerabilities, while the fine-tuned LLM achieves 97% accuracy in detecting NoSQL injection vulnerabilities. These findings indicate that ML models and LLMs are not mutually exclusive; rather, they excel in different aspects of injection vulnerability detection. The research suggests a hybrid approach that leverages both ML and LLMs for more comprehensive security automation. By providing an open-source NoSQL injection dataset and benchmarking results, this study contributes to the automation of vulnerability detection in the software development lifecycle. The datasets used in this study are publicly available on GitHub.
Detecting vulnerabilities is vital for software security, yet deep learning-based vulnerability detectors (DLVD) face a data shortage, which limits their effectiveness. Data augmentation can potentially alleviate the data shortage, but augmenting vulnerable code is challenging and requires a generative solution that maintains vulnerability. Previous works have only focused on generating samples that contain single statements or specific types of vulnerabilities. Recently, large language models (LLMs) have been used to solve various code generation and comprehension tasks with inspiring results, especially when fused with retrieval augmented generation (RAG). Therefore, we propose VulScribeR, a novel LLM-based solution that leverages carefully curated prompt templates to augment vulnerable datasets. More specifically, we explore three strategies to augment both single and multi-statement vulnerabilities, with LLMs, namely Mutation, Injection, and Extension. Our extensive evaluation across four vulnerability datasets and DLVD models, using three LLMs, show that our approach beats two SOTA methods Vulgen and VGX, and Random Oversampling (ROS) by 27.48%, 27.93%, and 15.41% in f1-score with 5K generated vulnerable samples on average, and 53.84%, 54.10%, 69.90%, and 40.93% with 15K generated vulnerable samples. Our approach demonstrates its feasibility for large-scale data augmentation by generating 1K samples at as cheap as US$ 1.88.
No abstract available
Finite-State Machines (FSMs) are critical for modeling the operational logic of network protocols, enabling verification, analysis, and vulnerability discovery. However, existing FSM extraction techniques face limitations such as scalability, incomplete coverage, and ambiguity in natural language specifications. In this paper, we propose FlowFSM, a novel agentic framework that leverages Large Language Models (LLMs) combined with prompt chaining and chain-of-thought reasoning to extract accurate FSMs from raw RFC documents. FlowFSM systematically processes protocol specifications, identifies state transitions, and constructs structured rule-books by chaining agent outputs. Experimental evaluation across FTP and RTSP protocols demonstrates that FlowFSM achieves high extraction precision while minimizing hallucinated transitions, showing promising results. Our findings highlight the potential of agent-based LLM systems in the advancement of protocol analysis and FSM inference for cybersecurity and reverse engineering applications.
The widespread adoption of conversational LLMs for software development has raised new security concerns regarding the safety of LLM-generated content. Our motivational study outlines ChatGPT’s potential in volunteering context-specific information to the developers, promoting safe coding practices. Motivated by this finding, we conduct a study to evaluate the degree of security awareness exhibited by three prominent LLMs: Claude 3, GPT-4, and Llama 3. We prompt these LLMs with Stack Overflow questions that contain vulnerable code to evaluate whether they merely provide answers to the questions or if they also warn users about the insecure code, thereby demonstrating a degree of security awareness. Further, we assess whether LLM responses provide information about the causes, exploits, and the potential fixes of the vulnerability, to help raise users’ awareness. Our findings show that all three models struggle to accurately detect and warn users about vulnerabilities, achieving a detection rate of only 12.6% to 40% across our datasets. We also observe that the LLMs tend to identify certain types of vulnerabilities related to sensitive information exposure and improper input neutralization much more frequently than other types, such as those involving external control of file names or paths. Furthermore, when LLMs do issue security warnings, they often provide more information on the causes, exploits, and fixes of vulnerabilities compared to Stack Overflow responses. Finally, we provide an in-depth discussion on the implications of our findings, and demonstrated a CLI-based prompting tool that can be used to produce more secure LLM responses.
Common Vulnerability and Exposure (CVE) records are fundamental to cybersecurity, offering unique identifiers for publicly known software and system vulnerabilities. Each CVE is typically assigned a Common Vulnerability Scoring System (CVSS) score to support risk prioritization and remediation. However, score inconsistencies often arise due to subjective interpretations of certain metrics. As the number of new CVEs continues to grow rapidly, automation is increasingly necessary to ensure timely and consistent scoring. While prior studies have explored automated methods, the application of Large Language Models (LLMs), despite their recent popularity, remains relatively underexplored.In this work, we evaluate the effectiveness of LLMs in generating CVSS scores for newly reported vulnerabilities. We investigate various prompt engineering strategies to enhance their accuracy and compare LLM-generated scores against those from embedding-based models, which use vector representations classified via supervised learning. Our results show that while LLMs demonstrate potential in automating CVSS evaluation, embedding-based methods outperform them in scoring more subjective components, particularly confidentiality, integrity, and availability impacts. These findings underscore the complexity of CVSS scoring and suggest that combining LLMs with embedding-based methods could yield more reliable results across all scoring components.
Large language models (LLMs) have emerged as a promising tool for detecting code vulnerabilities, potentially offering advantages over traditional rule-based methods. This paper proposes an enhanced framework for vulnerability detection using LLMs, incorporating various prompt engineering strategies to improve performance. We evaluate several techniques, including role-based prompting, zero-shot chain-of-thought, and structured prompting approaches, on the DiverseVul dataset of C/C++ vulnerabilities. Our experiments assess the framework’s performance across different code structures, contextual information levels, and LLM capabilities. Our results show that using our dynamic prompt engineering technique, you can improve the F1 score by up to 100% with GPT-3.5, a widely used LLM model. We also observe that GPT-4o, Gemini 2.0 Flash, and Meta Llama 3.1 generally outperform GPT-3.5, and all models are very poor when it comes to correctly identifying the type of vulnerability in the code, with the best F1 score of 0.16 observed. However, our follow-up experiments on LLM-based vulnerability correction (i.e., patching) show a 45.77% success rate using GPT-4o, demonstrating promising results in leveraging LLMs for enhancing software security and providing insights into optimizing prompt engineering for vulnerability detection tasks.
Software security is crucial in any field where breaches can exploit sensitive data, and lead to financial losses. As a result, vulnerability detection becomes an essential part of the software development process. One of the key steps in maintaining software integrity is identifying vulnerabilities in the source code before deployment. A security breach like CWE-476, which stands for NULL pointer dereferences (NPD), is crucial because it can cause software crashes, unpredictable behavior, and security vulnerabilities. In this scientific era, there are several vulnerability checkers, where, previous tools often fall short in analyzing specific feature connections of the source code, which weakens the tools in real-world scenarios. In this study, we propose another novel approach using a fine-tuned Large Language Model (LLM) termed"DeLLNeuN". This model leverages the advantage of various layers to reduce both overfitting and non-linearity, enhancing its performance and reliability. Additionally, this method provides dropout and dimensionality reduction to help streamline the model, making it faster and more efficient. Our model showed 87% accuracy with 88% precision using the Draper VDISC dataset. As software becomes more complex and cyber threats continuously evolve, the need for proactive security measures will keep growing. In this particular case, the proposed model looks promising to use as an early vulnerability checker in software development.
In today’s high-risk environments, software security is critical for companies, as vulnerabilities can expose sensitive information. Robust cybersecurity measures are essential to reducing the risk of unauthorized access. One major security issue is CWE-120, which deals with buffer overflow vulnerabilities. This paper explores the use of automated tools for detecting such vulnerabilities. Traditional detection tools, which often rely on conventional computational models, struggle to identify complex, context-dependent flaws in source code. In this study, we propose an LLM-based method for detecting CWE-120 vulnerabilities. By integrating techniques like dropout and dimensionality reduction, our model improves speed and efficiency. The proposed approach achieved an accuracy of 0.90 and a precision rate of 0.91, demonstrating strong potential for early detection of vulnerabilities, particularly those related to CWE-120, during the software development process.
Existing research has demonstrated promising results when applying large language models (LLMs) to detect security vulnerabilities in source code. However, these studies have been exclusively evaluated on benchmarks from open-source systems, using publicly known vulnerabilities that are likely part of the LLMs’ training data. This raises concerns that reported performance metrics may be inflated due to data contamination, providing a misleading view of the models’ actual capabilities.In this paper, we quantify this effect with a case study that evaluates five frontier LLMs on two carefully curated datasets: CWE-Bench-Java (an open-source dataset) and TS-Vuls (based on a closed-source commercial codebase). To provide a second angle, we also split CWE-Bench-Java by CVE record date to explore temporal contamination based on LLM knowledge cutoff dates.Our results reveal that the average F1 score dropped by approximately 20 percentage points when comparing the open-source to the closed-source dataset. Additionally, the precision drops from 56% to 34% on average, which is statistically significant (p < 0.05) for four of five models. This declining trend is consistent across all tested LLMs and metrics. In contrast, the results for the temporal split on the open-source data are inconclusive, suggesting that using a knowledge cutoff may reduce but does not ensure the elimination of contamination effects.Although our study is based on a single closed-source system and thus not generalizable, these findings provide the first empirical evidence that evaluating LLM-based vulnerability detection on open-source benchmarks may lead to overly optimistic results. This motivates the inclusion of closed-source datasets in future LLM evaluations.
We present PyCode_Vul, a Python-based software vulnerability dataset constructed from 15 open-source GitHub projects. The corpus comprises 17,811 function-level instances, including 7,899 vulnerable and 9,912 non-vulnerable samples. Our pipeline mines commit histories, extracts code changes, and recovers complete functions with AST-validated parsing. Labels are assigned via CWE mapping that combines heuristic patterns with the Bandit static analysis tool, followed by rigorous deduplication to reduce leakage and near-duplicates. We benchmark ten large language models (LLMs) on PyCode_Vul and evaluate cross-dataset generalization on CVEfixes, VUDENC, PyData, Cod_Vulnerability_Python, Buggy_Python, and PCV+Merge, alongside our PyCode_Vul Test split. Results indicate that UniXcoder and CodeT5+ consistently achieve the best overall performance on our proposed test set and the merged split, indicating that PyCode_Vul exhibits a coherent, learnable distribution for LLM-based vulnerability detection. Dataset can be found in: https://github.com/TasminKarim-19/PyCode_Vul/tree/main
Large Language Models (LLMs) show significant promise in automating software vulnerability analysis, a critical task given the impact of security failure of modern software systems. However, current approaches in using LLMs to automate vulnerability analysis mostly rely on using online API-based LLM services, requiring the user to disclose the source code in development. Moreover, they predominantly frame the task as a binary classification(vulnerable or not vulnerable), limiting potential practical utility. This paper addresses these limitations by reformulating the problem as Software Vulnerability Identification (SVI), where LLMs are asked to output the type of weakness in Common Weakness Enumeration (CWE) IDs rather than simply indicating the presence or absence of a vulnerability. We also tackle the reliance on large, API-based LLMs by demonstrating that instruction-tuning smaller, locally deployable LLMs can achieve superior identification performance. In our analysis, instruct-tuning a local LLM showed better overall performance and cost trade-off than online API-based LLMs. Our findings indicate that instruct-tuned local models represent a more effective, secure, and practical approach for leveraging LLMs in real-world vulnerability management workflows.
As Large Language Models (LLMs) evolve in understanding and generating code, accurately evaluating their reliability in analyzing source code vulnerabilities becomes in-creasingly vital. While studies have examined LLM capabilities in tasks like vulnerability detection and repair, they often over-look the importance of both structure and semantic reasoning crucial for trustworthy vulnerability analysis. To address this gap, we introduce SV-TRUSTEVAL-C, a benchmark designed to evaluate LLMs' abilities for vulnerability analysis of code written in the C programming language through two key di-mensions: structure reasoning-assessing how models identify relationships between code elements under varying data and control flow complexities; and semantic reasoning-examining their logical consistency in scenarios where code is structurally and semantically perturbed. Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning. These findings underscore the effectiveness of the SV-TRUSTEVAL-C benchmark and highlight critical areas for enhancing the reasoning capabilities and trustworthiness of LLMs in real-world vulnerability analysis tasks. Our ini-tial benchmark dataset is available at https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0
Identifying vulnerabilities in source code is crucial, especially in critical software components. Existing methods such as static analysis, dynamic analysis, formal verification, and recently Large Language Models are widely used to detect security flaws. This paper introduces CASTLE (CWE Automated Security Testing and Low-Level Evaluation), a benchmarking framework for evaluating the vulnerability detection capabilities of different methods. We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs. We propose the CASTLE Score, a novel evaluation metric to ensure fair comparison. Our results reveal key differences: ESBMC (a formal verification tool) minimizes false positives but struggles with vulnerabilities beyond model checking, such as weak cryptography or SQL injection. Static analyzers suffer from high false positives, increasing manual validation efforts for developers. LLMs perform exceptionally well in the CASTLE dataset when identifying vulnerabilities in small code snippets. However, their accuracy declines, and hallucinations increase as the code size grows. These results suggest that LLMs could play a pivotal role in future security solutions, particularly within code completion frameworks, where they can provide real-time guidance to prevent vulnerabilities. The dataset is accessible at https://github.com/CASTLE-Benchmark.
No abstract available
Software vulnerabilities, caused by unintentional flaws in source code, are a primary root cause of cyberattacks. Static analysis of source code has been widely used to detect these unintentional defects introduced by software developers. Large Language Models (LLMs) have demonstrated human-like conversational abilities due to their capacity to capture complex patterns in sequential data, such as natural languages. In this paper, we harness LLMs’ capabilities to analyze source code and detect known vulnerabilities. To ensure the proposed vulnerability detection method is universal across multiple programming languages, we convert source code to LLVM IR and train LLMs on these intermediate representations. We conduct extensive experiments on various LLM architectures and compare their accuracy. Our comprehensive experiments on real-world and synthetic codes from NVD and SARD demonstrate high accuracy in identifying source code vulnerabilities.
Code comments are the most important medium for documenting program logic and design. Nevertheless, as modern software undergoes frequent updates and modifications, maintaining the accuracy and relevance of comments becomes a labor-intensive endeavor. Drawing inspiration from the remarkable performance of Large Language Model (LLM) in comprehending software programs, this paper introduces a program analysis based and LLM-driven methodology for identifying inconsistencies in code comments. Our approach capitalizes on LLMs' ability to interpret natural language descriptions within code comments, enabling the extraction of design constraints. Subsequently, we employ program analysis techniques to accurately identify the implementation of these constraints. We instantiate this methodology using GPT 4.0, focusing on three prevalent types of constraints. In the experiment on 13 open-source projects, our approach identified 160 inconsistencies, and 23 of them have been confirmed and fixed by the developers.
We introduce QLPro, a vulnerability detection framework that systematically integrates LLMs and static analysis tools to enable comprehensive vulnerability detection across entire open-source projects.We constructed a new dataset, JavaTest, comprising 10 open-source projects from GitHub with 62 confirmed vulnerabilities. CodeQL, a state-of-the-art static analysis tool, detected only 24 of these vulnerabilities while QLPro detected 41. Furthermore, QLPro discovered 6 previously unknown vulnerabilities, 2 of which have been confirmed as 0-days.
Discovering potential vulnerabilities has long been a fundamental goal in software security. Among them, bit flips, caused by hardware or environmental disturbances, are increasingly recognized as a new type of vulnerabilities that threaten program reliability at the instruction level. However, existing work is often restricted to individual programs and requires retraining when applied to unseen code, severely limiting their practicality and responsiveness. In this paper, we propose CIVP, a novel framework for context-aware instruction vulnerability prediction, generalizing to unseen programs without retraining. Specifically, to capture the rich contextual semantics of instructions, CIVP first leverages Large Language Models (LLMs) to accurately extract semantic embeddings of instructions. Then, CIVP further constructs an instruction execution graph containing complex relations of program execution, which implicates the potential path of error propagation. To improve instruction representation for vulnerability prediction, CIVP enhances GraphSAGE with multi-hop diffusion to capture inter-program structural patterns and contextual dependencies, and adopts pseudo-labeling to improve the model’s generalization for vulnerable instructions. Extensive experiments on a dataset of 26 real-world programs demonstrate that CIVP significantly outperforms the state-of-the-art approaches, achieving up to 20.5%↑ Recall and 18.5%↑ F1-score improvements. Notably, CIVP generalizes well to unseen programs, offering an efficient and scalable solution for proactive instruction-level hardening before software deployment.
Large Language Models (LLMs) have established strong baselines for software vulnerability detection, leading to a common assumption that their performance can be enhanced by augmenting them with supplementary information such as Abstract Syntax Trees (ASTs), software metrics, or expanded pre-training data. However, the actual efficacy of these computationally expensive techniques over a robust LLM baseline remains unevaluated, potentially misdirecting research efforts. This paper aims to empirically test this "more is better" assumption by conducting a large-scale study that evaluates four supplementary techniques: multi-task learning, software metrics injection, data expansion, and hybrid graph representations against a high-performing LLM baseline, VulBERTa, on the CodeXGLUE benchmark for C/C++ code. Our findings demonstrate that none of these complex techniques provides a statistically significant performance improvement, as the baseline model's tokenization and attention mechanisms already capture the necessary information, rendering the additions redundant. However, we identify software metrics injection as an effective method for tuning the precision-recall trade-off, a critical capability for practitioners needing to minimize false negatives. This paper concludes that for LLM-based vulnerability detection, adding external complexity offers diminishing returns, and future efforts should focus on core model improvements, supporting a "less is more" approach.
Large Language Models (LLMs) have reasoning abilities comparable to humans; however, they remain challenged when addressing complex logical operations, especially in fields such as programming languages. In this study, we present the ResVul-LLM framework that combines Large Language Models (LLMs) with neurosymbolic solvers to improve logical problem solving in C / C++ vulnerability identification. To assess its effectiveness, we explore three different neurosymbolic approaches: First-order logic (FOL) mapping using the Abstract Syntax Tree (AST), Program Dependence Graph (PDG), and event trace-based logical representation. Each technique is analyzed using three different input settings: source code only, FOL representation solely, and a combination of FOL and source code representation, over seven benchmark datasets. Our comparison research demonstrates that, when applied with source code-only inputs, the event trace-based logical representation generates the most effective results. In particular, it obtains 0.6655 accuracy with CodeBERT, 0.6437 with GraphCodeBERT, and 0.6608 with UnixCoder. These findings show that ResVul-LLM, by combining LLMs with symbolic reasoning, provides a more consistent and appropriate framework for logical reasoning in the identification of C / C++ vulnerabilities.
The application of Artificial Intelligence has become a powerful approach to detecting software vulnerabilities. However, effective vulnerability detection relies on accurately capturing the semantic structure of code and its contextual relationships. Given that the same functionality can be implemented in various forms, a preprocessing tool that standardizes code representation is important. This tool must be efficient, adaptable across programming languages, and capable of supporting new transformations. To address this challenge, we build on the existing SCoPE framework and introduce SCoPE2, an enhanced version with improved performance. We compare both versions in terms of processing time and memory usage and evaluate their impact on a Large Language Model (LLM) for vulnerability detection. Our results show a 97.3\% reduction in processing time with SCoPE2, along with an improved F1-score for the LLM, solely due to the refined preprocessing approach.
Large language models (LLMs) have emerged as promising tools for automated vulnerability detection (VD), yet their effectiveness is strongly shaped by prompt design and input representation. Existing studies largely focus on detection accuracy, leaving open how factors such as prompt language, structural form, and code representation influence performance. We conduct a systematic empirical study on these dimensions by building a curated dataset from patched NVD vulnerabilities in C and C++ projects, annotated with vulnerable functions, lines, and types. Six representative LLM families: GPT, Claude, Gemini, Qwen, LLaMA, and DeepSeek are evaluated. Our experiments span three axes: 1) prompt languages (high-resource, low-resource); 2) prompt structures (sequential, structural, chain-of-thought, role-based); and 3) input representations (source code, code graphs). Results show that VD performance degrades significantly in low-resource languages, while structured prompts, especially chain-of-thought, improve interpretability and robustness. Furthermore, graph-based representations such as PDGs and CPGs enhance precision for vulnerabilities. These findings highlight the sensitivity of LLMs for vulnerability detection to prompt and representation choices.
Traditional approaches for smart contract analysis often rely on intermediate representations such as abstract syntax trees, control-flow graphs, or static single assignment form. However, these methods face limitations in capturing both semantic structures and control logic. Knowledge graphs, by contrast, offer a structured representation of entities and relations, enabling richer intermediate abstractions of contract code and supporting the use of graph query languages to identify rule-violating elements. This paper presents CKG-LLM, a framework for detecting access-control vulnerabilities in smart contracts. Leveraging the reasoning and code generation capabilities of large language models, CKG-LLM translates natural-language vulnerability patterns into executable queries over contract knowledge graphs to automatically locate vulnerable code elements. Experimental evaluation demonstrates that CKG-LLM achieves superior performance in detecting access-control vulnerabilities compared to existing tools. Finally, we discuss potential extensions of CKG-LLM as part of future research directions.
Vulnerability detection remains a critical challenge in the field of security. Many existing approaches extract code representations for vulnerability detection. However, these methods often focus on the overall semantics of the code, neglecting to specifically target vulnerability-related semantics. To address this limitation, we propose a novel LLM steering method designed to steer LLMs to focus on vulnerability concepts, thereby enhancing their performance in vulnerability detection. Specifically, we introduce a vulnerability steering vector that represents the concept of vulnerability in the representation space. This vector is generated using a paired vulnerability-patch function dataset, effectively capturing the essence of vulnerabilities. Experimental results demonstrate that the proposed method significantly improves LLMs' performance and notably outperforms existing SOTA methods in vulnerability detection tasks. Furthermore, we validate the cross-language transferability of the steering vector and explore the explainability of vulnerability detection.
Rust is designed to prevent common memory safety issues, yet it remains susceptible to various security vulnerabilities. The limited availability of labeled vulnerability data in Rust presents a significant challenge for applying machine learning (ML) techniques. To address this, we propose TL-HGNN, a novel transfer learning framework that employs heterogeneous graph neural networks (HGNN) and LLMs to detect Rust vulnerabilities without the need for Rust-specific training. TL-HGNN utilizes Inter-procedural Compressed Code Property Graphs (ICCPGs) to represent source code from high-resource languages (C or Java), training HGNN models that capture semantic and structural relationships. Rust code is translated into the source language using an iterative LLM-based approach to ensure correct syntax. The translated code is then converted into an ICCPG representation and analyzed with pre-trained HGNN models. Evaluations on 82 real-world Rust CVE pairs across 44 CWEs indicate that TL-HGNN achieves higher performance than the evaluated baselines on this dataset, achieving an F1 score of 66.24%. Additionally, an ablation study suggests the advantages of ICCPG representation over prior graph methods and the effectiveness of LLM translation over dictionary-based mapping for transfer learning. Our results also indicate that Java serves as a more effective source language than C in this context, and GPT-4.0 achieves higher performance than several other LLMs in translation quality. TL-HGNN represents an early application of transfer learning to Rust vulnerability detection.
Large language models (LLMs) excel in many tasks of software engineering, yet progress in leveraging them for vulnerability discovery has stalled in recent years. To understand this phenomenon, we investigate LLMs through the lens of classic code metrics. Surprisingly, we find that a classifier trained solely on these metrics performs on par with state-of-the-art LLMs for vulnerability discovery. A root-cause analysis reveals a strong correlation and a causal effect between LLMs and code metrics: When the value of a metric is changed, LLM predictions tend to shift by a corresponding magnitude. This dependency suggests that LLMs operate at a similarly shallow level as code metrics, limiting their ability to grasp complex patterns and fully realize their potential in vulnerability discovery. Based on these findings, we derive recommendations on how research should more effectively address this challenge.
During software development and maintenance, vulnerability detection is an essential part of software quality assurance. Even though many program-analysis-based and machine-learning-based approaches have been proposed to automatically detect vulnerabilities, they rely on explicit rules or patterns defined by security experts and suffer from either high false positives or high false negatives. Recently, an increasing number of studies leverage deep learning techniques, especially Graph Neural Network (GNN), to detect vulnerabilities. These approaches leverage program analysis to represent the program semantics as graphs and perform graph analysis to detect vulnerabilities. However, they suffer from two main problems: (i) Existing GNN-based techniques do not effectively learn the structural and semantic features from source code for vulnerability detection. (ii) These approaches tend to ignore fine-grained information in source code. To tackle these problems, in this paper, we propose a novel vulnerability detection approach, named MGVD (M ultiple-G raph-Based V ulnerability D etection), to detect vulnerable functions. To effectively learn the structural and semantic features from source code, MGVD uses three different ways to represent each function into multiple forms, i.e., two statement graphs and a sequence of tokens. Then we encode such representations to a three-channel feature matrix. The feature matrix contains the structural feature and the semantic feature of the function. And we add a weight allocation layer to distribute the weights between structural and semantic features. To overcome the second problem, MGVD constructs each graph representation of the input function using multiple different graphs instead of a single graph. Each graph focuses on one statement in the function and its nodes denote the related statements and their fine-grained code elements. Finally, MGVD leverages CNN to identify whether this function is vulnerable based on such feature matrix. We conduct experiments on 3 vulnerability datasets with a total of 30,341 vulnerable functions and 127,931 non-vulnerable functions. The experimental results show that our method outperforms the state-of-the-art by 9.68% – 10.28% in terms of F1-score.
LLM Agentic Workflow for Automated Vulnerability Detection and Remediation in Infrastructure-as-Code
This paper presents a multi-agent, AI-driven strategy employing Large Language Models (LLMs), retrieval-augmented generation, and a continuously updated knowledge base for the detection and remediation of security vulnerabilities win cloud frameworks. By examining Infrastructure as Code (IaC) templates alongside pertinent best-practice snippets, the system discerns context-specific misconfigurations commonly overlooked by static tools, achieving a detection rate of 85% with some occurrences of false positives. Automated remediation guidance, anchored in current security standards, provides actionable solutions that seamlessly integrate into standard continuous integration/continuous development (CI/CD) workflows. Experimental results indicate the solution’s efficacy and scalability, heralding a proactive, context-aware approach to IaC security.
No abstract available
No abstract available
Software vulnerability detection is a software se- curity analysis technique that aims to recognize possible code vulnerabilities and weaknesses. The majority of previous research has primarily concentrated on deep learning models. Recently, with the rapid development of large language mod- els, researchers are exploring the use of G PT in the field of vulnerability detection. However, these works inadequately account for the features of vulnerability detection, lacking specific prompts and code-specific information. This paper primarily investigates the impact of prompts for GPT. We design the following prompts step by step: basic prompts, prompts with code-specific information, and chain-of-thought (CoT) prompts. Firstly, we explore the performance using basic prompts in the field of vulnerability detection. Subsequently, we incorporate code-specific information into basic prompts, focusing primarily on similar code and data flow graph (DFG). The similar code is obtained through our designed code search algorithm, which takes into account both code semantic and structural information. Finally, we further introduce CoT prompts to inspire GPT's ability for gradually detecting vulnerability. We also conduct experiments to explore the impact of the position of code- specific information in the prompts and the suitable temperature value for the vulnerability detection task. Our experiment results demonstrate that, combined with code-specific information and CoT, GPT can detect vulnerabilities more effectively.
Vulnerability propagation in software systems is always one of the most important problems in software reliability analysis. Previous methods primarily relied on the calling relationships between functions, which failed to accurately capture the vulnerability propagation process, resulting in a high false positive rate. To resolve this issue, this paper proposes a vulnerability propagation impact analysis method based on code semantics, aimed at providing a fine-grained analysis. Specifically, the research designs a prompt template for generating prompt for each function in the vulnerability propagation chain, enabling the extraction of intra-function constraint information through a Large Language Model (LLM). Additionally, the study proposes a constraint combination method based on inter-function data transfer relationships, which is used to aggregate the complete constraint information within the vulnerability propagation chain. Finally, the research incorporates a vulnerability trigger determination method based on Satisfiability Modulo Theory (SMT) and a vulnerability trigger probability estimation method based on Monte Carlo simulation. The result of case study demonstrates the effectiveness of the proposed method.
No abstract available
One of the most pressing threats to computing systems is software vulnerabilities, which can compromise both hardware and software components. Existing methods for vulnerability detection remain suboptimal. Traditional techniques are both time-consuming and labor-intensive, while machine-learning-based approaches often underperform when applied to complex datasets, due to their inability to capture high-dimensional relationships. Previous deep-learning strategies also fall short in capturing sufficient feature information. Although self-attention mechanisms can process information over long distances, they fail to capture structural information. In this paper, we introduce DefectHunter, an innovative model for vulnerability identification that employs the Conformer mechanism. This mechanism fuses self-attention with convolutional networks to capture both local, position-wise features and global, content-based interactions. Furthermore, we optimize the self-attention mechanisms to mitigate the issue of excessive attention heads introducing extraneous noise by adjusting the denominator. We evaluated DefectHunter against ten baseline methods using six industrial and two highly complex datasets. On the QEMU dataset, DefectHunter exhibited a 20.62\% improvement in accuracy over Pongo-70B, and for the CWE-754 dataset, its accuracy was 14.64\% higher. To investigate how DefectHunter comprehends vulnerabilities, we conducted a case study, which revealed that our model effectively understands the mechanisms underlying vulnerabilities.
Decompilers are widely used in reverse engineering (RE) to convert compiled executables into human-readable pseudocode and support various security analysis tasks. Existing decompilers, such as IDA Pro and Ghidra, focus on enhancing the readability of decompiled code rather than its recompilability, which limits further programmatic use, such as for CodeQL-based vulnerability analysis that requires compilable versions of the decompiled code. Recent LLM-based approaches for enhancing decompilation results, while useful for human RE analysts, unfortunately also follow the same path. In this paper, we explore, for the first time, how off-the-shelf large language models (LLMs) can be used to enable recompilable decompilation—automatically correcting decompiler outputs into compilable versions. We first show that this is non-trivial through a pilot study examining existing rule-based and LLM-based approaches. Based on the lessons learned, we design DecLLM, an iterative LLM-based repair loop that utilizes both static recompilation and dynamic runtime feedback as oracles to iteratively fix decompiler outputs. We test DecLLM on popular C benchmarks and real-world binaries using two mainstream LLMs, GPT-3.5 and GPT-4, and show that off-the-shelf LLMs can achieve an upper bound of around 70% recompilation success rate, i.e., 70 out of 100 originally non-recompilable decompiler outputs are now recompilable. We also demonstrate the practical applicability of the recompilable code for CodeQL-based vulnerability analysis, which is impossible to perform directly on binaries. For the remaining 30% of hard cases, we further delve into their errors to gain insights for future improvements in decompilation-oriented LLM design.
The code generation capabilities of large language models(LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code. In this work, we introduce SafeGenBench, a benchmark specifically designed to assess the security of LLM-generated code. The dataset encompasses a wide range of common software development scenarios and vulnerability types. Building upon this benchmark, we develop an automatic evaluation framework that leverages both static application security testing(SAST) and LLM-based judging to assess the presence of security vulnerabilities in model-generated code. Through the empirical evaluation of state-of-the-art LLMs on SafeGenBench, we reveal notable deficiencies in their ability to produce vulnerability-free code. Our findings highlight pressing challenges and offer actionable insights for future advancements in the secure code generation performance of LLMs. The data and code will be released soon.
Large language models (LLMs) are highly compute- and memory-intensive, posing significant demands on high-performance GPUs. At the same time, advances in GPU technology driven by shrinking transistor sizes and lower operating voltages have made these devices increasingly susceptible to soft errors. While prior work has examined GPU reliability, most studies have focused on general-purpose applications or conventional neural networks mostly used for vision tasks such as classification and detection. In contrast, systematic analysis of modern large-scale LLMs remains limited, despite their rapid adoption in diverse application scenarios. Given the unique characteristics of LLMs, their resilience to soft errors may differ substantially from earlier models. To bridge this gap, we conduct the first instruction-level fault injection study of LLM inference. Our approach reveals reliability characteristics from multiple perspectives, highlighting the effects of model architecture, parameter scale, and task complexity. These findings provide new insights into LLM reliability and inform the design of more effective fault tolerance mechanisms.
Large Language Models (LLMs) are emerging as transformative tools for software vulnerability detection. Traditional methods, including static and dynamic analysis, face limitations in efficiency, false-positive rates, and scalability with modern software complexity. Through code structure analysis, pattern identification, and repair suggestion generation, LLMs demonstrate a novel approach to vulnerability mitigation. This survey examines LLMs in vulnerability detection, analyzing problem formulation, model selection, application methodologies, datasets, and evaluation metrics. We investigate current research challenges, emphasizing cross-language detection, multimodal integration, and repository-level analysis. Based on our findings, we propose solutions addressing dataset scalability, model interpretability, and low-resource scenarios. Our contributions include: (1) a systematic analysis of LLM applications in vulnerability detection; (2) a unified framework examining patterns and variations across studies; and (3) identification of key challenges and research directions. This work advances the understanding of LLM-based vulnerability detection. The latest findings are maintained at https://github.com/OwenSanzas/LLM-For-Vulnerability-Detection
This study introduces regression discontinuity design to LLM security evaluation, analyzing charter effectiveness across 858 trials with Claude-3.5-Sonnet and GPT-4o. We discover that security charter responsiveness operates independently from baseline model performance: while GPT-4o's overall scores dropped 12.58 points between experimental phases, its sensitivity to security guidance increased dramatically through optimization ($12.8 \times$ effect size improvement from $\text{d = 0. 1 9 1}$ to $\text{d = 0. 3 4 6}$). Claude maintained stable performance with consistent charter responsiveness (+6.18 points, $\mathrm{p}=0.030$). Task-specific analysis reveals both models respond strongly to charters on 4-5 out of 8 vulnerability domains ($d \geq 0.8$), effects completely hidden in aggregate measures. Strategic placement comparison shows embedded and late charter positioning outperform early placement across models. Despite achieving perfect security compliance (no vulnerabilities across $\text{8 5 8}$ trials), charter influence operates through security practice enhancement rather than vulnerability elimination. Our findings demonstrate that charter effectiveness depends critically on task characteristics and model architecture, with single outlier tasks capable of masking significant intervention potential. These results provide the first causal evidence that security guidance and model capability represent distinct architectural systems in LLMs.
Recently, Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience and have attracted considerable attention. Despite the impressive progress, the research question regarding the safety vulnerability of LLM-empowered RecSys still remains largely under-investigated. Given the security and privacy concerns, it is more practical to focus on attacking the black-box RecSys, where attackers can only observe the system's inputs and outputs. However, traditional attack approaches employing reinforcement learning (RL) agents are not effective for attacking LLM-empowered RecSys due to the limited capabilities in processing complex textual inputs, planning, and reasoning. On the other hand, LLMs provide unprecedented opportunities to serve as attack agents to attack RecSys because of their impressive capability in simulating human-like decision-making processes. Therefore, in this paper, we propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs, where an LLM-based agent is developed to attack LLM-Empowered RecSys. Specifically, our method first identifies the insertion position for maximum impact with minimal input modification. After that, the LLM agent is designed to generate adversarial perturbations to insert at target positions. To further improve the quality of generated perturbations, we utilize the prompt tuning technique to improve attacking strategies via feedback from the victim RecSys iteratively. Extensive experiments across three real-world datasets demonstrate the effectiveness of our proposed attacking method.
Inspired by the success of large language models (LLMs), there is a significant research shift from traditional graph learning methods to LLM-based graph frameworks, formally known as GraphLLMs. GraphLLMs leverage the reasoning power of LLMs by integrating three key components: the textual attributes of input nodes, the structural information of node neighborhoods, and task-specific prompts that guide decision-making. Despite their promise, the robustness of GraphLLMs against adversarial perturbations remains largely unexplored-a critical concern for deploying these models in high-stakes scenarios. To bridge the gap, we introduce TrustGLM, a comprehensive study evaluating the vulnerability of GraphLLMs to adversarial attacks across three dimensions: text, graph structure, and prompt manipulations. We implement state-of-the-art attack algorithms from each perspective to rigorously assess model resilience. Through extensive experiments on six benchmark datasets from diverse domains, our findings reveal that GraphLLMs are highly susceptible to text attacks that merely replace a few semantically similar words in a node's textual attribute. We also find that standard graph structure attack methods can significantly degrade model performance, while random shuffling of the candidate label set in prompt templates leads to substantial performance drops. Beyond characterizing these vulnerabilities, we investigate defense techniques tailored to each attack vector through data-augmented training and adversarial training, which show promising potential to enhance the robustness of GraphLLMs. We hope that our open-sourced library will facilitate rapid, equitable evaluation and inspire further innovative research in this field. The benchmark code can be found in https://github.com/Palasonic5/TrustGLM.git.
Efficiently retrieving a concise set of candidates from a large doc- ument corpus remains a pivotal challenge in Information Retrieval (IR). Neural retrieval models, particularly dense retrieval models built with transformers and pretrained language models, have been popular due to their superior performance. However, criticisms have also been raised on their lack of explainability and vulnerability to adversarial attacks. In response to these challenges, we propose to improve the robustness of dense retrieval models by enhancing their sensitivity of fine-grained relevance signals. A model achieving sensitivity in this context should exhibit high variances when doc- uments' key passages determining their relevance to queries have been modified, while maintaining low variances for other changes in irrelevant passages. This sensitivity allows a dense retrieval model to produce robust results with respect to attacks that try to promote documents without actually increasing their relevance. It also makes it possible to analyze which part of a document is actually relevant to a query, and thus improve the explainability of the retrieval model. Motivated by causality and counterfactual analysis, we propose a se- ries of counterfactual regularization methods based on game theory and unsupervised learning with counterfactual passages. Specifically, we first introduce a cooperative game theory-based counterfactual passage extraction method, identifying the key passages that can influence relevance. Then we propose several subsequent unsuper- vised learning tasks, based on these counterfactual passages, serve to regularize the model's learning process to improve the robustness and sensitivity. Experiments show that, our method can extract key passages without reliance on the passage-level relevance annotations. Moreover, the regularized dense retrieval models exhibit heightened robustness against adversarial attacks, surpassing the state-of-the-art anti-attack methods.
No abstract available
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
The adoption of Large Language Models (LLMs) for automated software vulnerability patching has shown promising outcomes on carefully curated evaluation sets. Nevertheless, existing datasets predominantly rely on superficial validation methods rather than exploit-based verification, leading to overestimated performance in security-sensitive applications. This paper introduces VulnRepairEval, an evaluation framework anchored in functional Proof-of-Concept (PoC) exploits. Our framework delivers a comprehensive, containerized evaluation pipeline that enables reproducible differential assessment, where repair success requires the original exploit to fail execution against the modified code. The benchmark construction involved extensive data curation: we processed over 400 CVEs and approximately 2,500 potential sources to extract a collection of authentic vulnerability instances (23 Python CVEs) amenable to automated testing with working PoCs. Through VulnRepairEval, we conduct a comprehensive evaluation of 12 popular LLMs and observe a significant performance deficit: even the top-performing model successfully addresses merely 5/23 instances (about 21.7%), exposing critical weaknesses in security-focused applications. Our failure analysis reveals that most unsuccessful attempts stem from imprecise vulnerability identification and patches containing syntactic or semantic errors. Enhanced prompting strategies and multi-agent approaches yield minimal improvements, with overall effectiveness remaining largely unaffected. This work contributes a stringent, practical evaluation framework for LLM-driven vulnerability remediation and underscores the necessity for assessment protocols that authentically reflect real-world exploitation scenarios.
The significant advancements in Large Language Models (LLMs) have resulted in their widespread adoption across various tasks within Software Engineering (SE), including vulnerability detection and repair. Numerous studies have investigated the application of LLMs to enhance vulnerability detection and repair tasks. Despite the increasing research interest, there is currently no existing survey that focuses on the utilization of LLMs for vulnerability detection and repair. In this paper, we aim to bridge this gap by offering a systematic literature review of approaches aimed at improving vulnerability detection and repair through the utilization of LLMs. The review encompasses research work from leading SE, AI, and Security conferences and journals, encompassing 43 papers published across 25 distinct venues, along with 15 high-quality preprint papers, bringing the total to 58 papers. By answering three key research questions, we aim to (1) summarize the LLMs employed in the relevant literature, (2) categorize various LLM adaptation techniques in vulnerability detection, and (3) classify various LLM adaptation techniques in vulnerability repair. Based on our findings, we have identified a series of limitations of existing studies. Additionally, we have outlined a roadmap highlighting potential opportunities that we believe are pertinent and crucial for future research endeavors.
Recent years have seen an explosion of activity in Generative AI, specifically Large Language Models (LLMs), revolutionising applications across various fields. Smart contract vulnerability detection is no exception; as smart contracts exist on public chains and can have billions of dollars transacted daily, continuous improvement in vulnerability detection is crucial. This has led to many researchers investigating the usage of generative large language models (LLMs) to aid in detecting vulnerabilities in smart contracts. This paper presents a systematic review of the current LLM-based smart contract vulnerability detection tools, comparing them against traditional static and dynamic analysis tools Slither and Mythril. Our analysis highlights key areas where each performs better and shows that while these tools show promise, the LLM-based tools available for testing are not ready to replace more traditional tools. We conclude with recommendations on how LLMs are best used in the vulnerability detection process and offer insights for improving on the state-of-the-art via hybrid approaches and targeted pre-training of much smaller models.
Large Language Models (LLMs) have demonstrated exceptional capabilities in the field of Artificial Intelligence (AI) and are now widely used in various applications globally. However, one of their major challenges is handling high-concurrency workloads, especially under extreme conditions. When too many requests are sent simultaneously, LLMs often become unresponsive which leads to performance degradation and reduced reliability in real-world applications. To address this issue, this paper proposes a queue-based system that separates request handling from direct execution. By implementing a distributed queue, requests are processed in a structured and controlled manner, preventing system overload and ensuring stable performance. This approach also allows for dynamic scalability, meaning additional resources can be allocated as needed to maintain efficiency. Our experimental results show that this method significantly improves resilience under heavy workloads which prevents resource exhaustion and enables linear scalability. The findings highlight the effectiveness of a queue-based web service in ensuring LLMs remain responsive even under extreme workloads.
The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attack tasks by manipulating knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate the RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct RAG security evaluation dataset (i.e., SafeRAG dataset) primarily manually for each task. We then utilize the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks and even the most apparent attack task can easily bypass existing retrievers, filters, or advanced LLMs, resulting in the degradation of RAG service quality. Code is available at: https://github.com/IAAR-Shanghai/SafeRAG.
Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.
Finding interpretable factors for stock returns is the most vital issue in the empirical asset pricing domain. As data-driven methods, existing factor mining models can be categorized into symbol-based and neural-based models. Symbol-based models are interpretable but inefficient, while neural-based approaches are efficient but lack interpretability. Hence, mining interpretable factors effectively presents a significant challenge. Inspired by the success of Large Language Models (LLMs) in various tasks, we propose a FActor Mining Agent (FAMA) model that enables LLMs to integrate the strengths of both neural and symbolic models for factor mining. In this paper, FAMA consists of two main components: Cross-Sample Selection (CSS) and Chain-of-Experience (CoE). CSS addresses the homogeneity challenges in LLMs during factor mining by assimilating diverse factors as in-context samples, whereas CoE enables LLMs to leverage past successful mining experiences, expe-diting the mining of effective factors. Experimental evaluations on real-world stock market data demonstrate the effectiveness of our approach by surpassing the SOTA RankIC by 0.006 and RankICIR by 0.105 in predicting S&P 500 returns
No abstract available
In the field of materials science, addressing the complex relationship between the material structure and properties has increasingly involved leveraging the text generation capabilities of AI-generated content (AIGC) models for tasks that include literature mining and data analysis. However, theoretical calculations and code development remain labor-intensive challenges. This paper proposes a novel approach based on text-to-code generation, utilizing large language models to automate the implementation of simulation programs in materials science. The effectiveness of automated code generation and review is validated with thermodynamics simulations based on the LAMMPS software as a foundation. This study introduces Molecular Dynamics Agent (MDAgent), a framework designed to guide large models in automatically generating, executing, and refining simulation code. In addition, a thermodynamic simulation code dataset for LAMMPS was constructed to fine-tune the language model. Expert evaluation scores demonstrate that MDAgent significantly improves the code generation and review capabilities. The proposed approach reduces the average task time by 42.22%, as compared to traditional models, thus highlighting its potential applications in the field of materials science.
Recent advancements in Recommender Systems (RS) have incorporated Reinforcement Learning (RL), framing the recommendation as a Markov Decision Process (MDP). However, offline RL policies trained on static user data are vulnerable to distribution shift when deployed in dynamic online environments. Additionally, excessive focus on exploiting short-term relevant items can hinder exploration, leading to sub-optimal recommendations and negatively impacting long-term user gains. Online RL-based RS also face challenges in production deployment, due to the risks of exposing users to untrained or unstable policies. Large Language Models (LLMs) offer a promising solution to mimic user objectives and preferences for pre-training policies offline to enhance the initial recommendations in online settings. Effectively managing distribution shift and balancing exploration are crucial for improving RL-based RS, especially when leveraging LLM-based pre-training. To address these challenges, we propose an Interaction-Augmented Learned Policy (iALP) that utilizes user preferences distilled from an LLM. Our approach involves prompting the LLM with user states to extract item preferences, learning rewards based on feedback, and updating the RL policy using an actor-critic framework. Furthermore, to deploy iALP in an online scenario, we introduce an adaptive variant, A-iALP, that implements a simple fine-tuning strategy (A-iALPft), and an adaptive approach (A-iALPap) designed to mitigate issues with compromised policies and limited exploration. Experiments across three simulated environments demonstrate that A-iALP introduces substantial performance improvements.
Analogies inspire creative solutions to problems, and facilitate the creative expression of ideas and the explanation of complex concepts. They have widespread applications in scientific innovation, creative writing, and education. The ability to discover creative analogies that are not explicitly mentioned but can be inferred from the web is highly desirable to power all such applications dynamically and augment human creativity. Recently, Large Pre-trained Language Models (PLMs), trained on massive Web data, have shown great promise in generating mostly known analogies that are explicitly mentioned on the Web. However, it is unclear how they could be leveraged for mining creative analogies not explicitly mentioned on the Web. We address this challenge and propose Creative Analogy Mining (CAM), a novel framework for mining creative analogies, which consists of the following three main steps: 1) Generate analogies using PLMs with effectively designed prompts, 2) Evaluate their quality using scoring functions, and 3) Refine the low-quality analogies by another round of prompt-based generation. We propose both unsupervised and supervised instantiations of the framework so that it can be used even without any annotated data. Based on human evaluation using Amazon Mechanical Turk, we find that our unsupervised framework can mine 13.7% highly-creative and 56.37% somewhat-creative analogies. Moreover, our supervised scores are generally better than the unsupervised ones and correlate moderately with human evaluators, indicating that they would be even more effective at mining creative analogies. These findings also shed light on the creativity of PLMs 1.
As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical-especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. This failure is particularly severe against indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis reveals that while models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they exhibit a partial ability to revise outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.
Large Vision-Language Models (LVLMs) rely on vision encoders and Large Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in the joint space of vision and language. However, typographic attacks, which disrupt Vision-Language Models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), have also been expected to be a security threat to LVLMs. Firstly, we verify typographic attacks on current well-known commercial and open-source LVLMs and uncover the widespread existence of this threat. Secondly, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The Typographic Dataset not only considers the evaluation of typographic attacks under various multi-modal tasks but also evaluates the effects of typographic attacks, influenced by texts generated with diverse factors. Based on the evaluation results, we investigate the causes why typographic attacks impacting VLMs and LVLMs, leading to three highly insightful discoveries. During the process of further validating the rationality of our discoveries, we can reduce the performance degradation caused by typographic attacks from 42.07\% to 13.90\%. Code and Dataset are available in \href{https://github.com/ChaduCheng/TypoDeceptions}
Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense
Event extraction is an important task in natural language processing that focuses on mining event-related information from unstructured text. Despite considerable advancements, it is still challenging to achieve satisfactory performance in this task, and issues like data scarcity and imbalance obstruct progress. In this paper, we introduce an innovative approach where we employ Large Language Models (LLMs) as expert annotators for event extraction. We strategically include sample data from the training dataset in the prompt as a reference, ensuring alignment between the data distribution of LLM-generated samples and that of the benchmark dataset. This enables us to craft an augmented dataset that complements existing benchmarks, alleviating the challenges of data imbalance and scarcity and thereby enhancing the performance of fine-tuned models. We conducted extensive experiments to validate the efficacy of our proposed method, and we believe that this approach holds great potential for propelling the development and application of more advanced and reliable event extraction systems in real-world scenarios.
Abstract The emergence of generative large language model (LLM) artificial intelligence (AI) represents one of the most profound developments in healthcare in decades, with the potential to create revolutionary and seismic changes in the practice of medicine as we know it. However, significant concerns have arisen over questions of liability for bad outcomes associated with LLM AI-influenced medical decision making. Although the authors were not able to identify a case in the United States that has been adjudicated on medical malpractice in the context of LLM AI at this time, sufficient precedent exists to interpret how analogous situations might be applied to these cases when they inevitably come to trial in the future. This commentary will discuss areas of potential legal vulnerability for clinicians utilizing LLM AI through review of past case law pertaining to third-party medical guidance and review the patchwork of current regulations relating to medical malpractice liability in AI. Finally, we will propose proactive policy recommendations including creating an enforcement duty at the US Food and Drug Administration (FDA) to require algorithmic transparency, recommend reliance on peer-reviewed data and rigorous validation testing when LLMs are utilized in clinical settings, and encourage tort reform to share liability between physicians and LLM developers.
Auto-regressive large language models (LLMs) have yielded impressive performance in many real-world tasks. However, the new paradigm of these LLMs also exposes novel threats. In this paper, we explore their vulnerability to inference cost attacks, where a malicious user crafts Engorgio prompts to intentionally increase the computation cost and latency of the inference process. We design Engorgio, a novel methodology, to efficiently generate adversarial Engorgio prompts to affect the target LLM's service availability. Engorgio has the following two technical contributions. (1) We employ a parameterized distribution to track LLMs' prediction trajectory. (2) Targeting the auto-regressive nature of LLMs' inference process, we propose novel loss functions to stably suppress the appearance of thetoken, whose occurrence will interrupt the LLM's generation process. We conduct extensive experiments on 13 open-sourced LLMs with parameters ranging from 125M to 30B. The results show that Engorgio prompts can successfully induce LLMs to generate abnormally long outputs (i.e., roughly 2-13$\times$ longer to reach 90%+ of the output length limit) in a white-box scenario and our real-world experiment demonstrates Engergio's threat to LLM service with limited computing resources. The code is released at: https://github.com/jianshuod/Engorgio-prompt.
Over the past few years, the software engineering (SE) community has widely employed deep learning (DL) techniques in many source code processing tasks. Similar to other domains like computer vision and natural language processing (NLP), the state‐of‐the‐art DL techniques for source code processing can still suffer from adversarial vulnerability, where minor code perturbations can mislead a DL model's inference. Efficiently detecting such vulnerability to expose the risks at an early stage is an essential step and of great importance for further enhancement. This paper proposes a novel black‐box effective and high‐quality adversarial attack method, namely CodeBERT‐Attack (CBA), based on the powerful large pre‐trained model (i.e., CodeBERT) for DL models of source code processing. CBA locates the vulnerable positions through masking and leverages the power of CodeBERT to generate textual preserving perturbations. We turn CodeBERT against DL models and further fine‐tuned CodeBERT models for specific downstream tasks, and successfully mislead these victim models to erroneous outputs. In addition, taking the power of CodeBERT, CBA is capable of effectively generating adversarial examples that are less perceptible to programmers. Our in‐depth evaluation on two typical source code classification tasks (i.e., functionality classification and code clone detection) against the most widely adopted LSTM and the powerful fine‐tuned CodeBERT models demonstrate the advantages of our proposed technique in terms of both effectiveness and efficiency. Furthermore, our results also show (1) that pre‐training may help CodeBERT gain resilience against perturbations further, and (2) certain pre‐training tasks may be beneficial for adversarial robustness.
Deep neural networks, especially pre-trained BERT models, have been widely applied in programming language processing tasks and achieved promising results. Their down-stream applications such as code clone detection and code search play a crucial role in data-driven security solutions such as vulnerability analysis. However, the resilience of these models against anti-analysis attacks remains unexplored. Therefore, we try to investigate whether deep neural networks can remain the same performance on different types of code change and what types of biases are introduced in the learning process.We introduce a new code obfuscation tool, a Multi-programming-language Obfuscator (Milo), for programming language processing tasks. Milo can be used to generate adversarial data to verify the model’s generalizability and robustness against code obfuscations. Milo supports five obfuscation methods: variable renaming, method renaming, string splitting, operation substitution, and control flow shuffling on three mainstream programming languages including Java, Python, and JavaScript. It is designed to apply anti-analysis obfuscation techniques across different programming languages that alter the syntactic and semantic features of a code snippet. To better quantify the adverse effects of anti-analysis techniques on pre-trained models for programming languages, we have performed extensive experiments across several pre-trained models, BERT, CodeBERT, and GraphCodeBERT with four downstream tasks which are code documentation generation, code clone detection, code search, and code translation. Our results indicate that most pre-trained BERT models are susceptible to code obfuscations and rely heavily on the literal representations (name or string) of the code segment.
Pre-trained language models (PLMs) have recently emerged as a powerful tool in Automated Vulnerability Repair (AVR), showing great potential in automating the generation of vulnerability patches. By leveraging their learned contextual understanding of code, these models are increasingly fine-tuned for complex tasks such as vulnerability detection and repair, making them valuable assets for improving software security. In this paper, we present a case study exploring the effectiveness of state-of-the-art (SOTA) AVR techniques that utilize PLMs for vulnerability repair. Our evaluation aims to determine whether these models genuinely understand security vulnerabilities and can generate appropriate fixes. Our findings highlight key challenges, including overfitting and unrealistic train/test data splits, which hinder the generalization of current approaches. These limitations underscore the need for more rigorous evaluation methodologies and improvements in model design to enhance the real-world applicability of AVR systems.
Timely and effective vulnerability patching is essential for cybersecurity defense, for which various approaches have been proposed yet still struggle to generate valid and correct patches for real-world vulnerabilities. In this paper, we leverage the power and merits of pre-trained language language models (LLMs) to enable automated vulnerability patching using no test input/exploit evidence and without model training/fine-tuning. To elicit LLMs to effectively reason about vulnerable code behaviors, which is essential for quality patch generation, we introduce vulnerability semantics reasoning and adaptive prompting on LLMs and instantiate the methodology as APPATCH, an automated LLM-based patching system. Our evaluation of APPATCH on 97 zero-day vulnerabilities and 20 existing vulnerabilities demonstrates its superior performance to both existing prompting methods and state-of-the-art non-LLM-based techniques (by up to 28.33% in F1 and 182.26% in recall over the best baseline). Through APPATCH, we demonstrate what helps for LLM-based patching and how, as well as discussing what still lacks and why.
In recent years, the proliferation of software vulnerabilities has significantly increased the complexities and costs associated with manual remediation efforts. Although AI-based methods for automated vulnerability repair are gaining traction, many existing approaches have two limitations: 1) treat code as a sequence of tokens, neglecting critical structural information like control flow and data flow, and 2) do not fully utilize the repair patterns of vulnerabilities. To address these limitations, we introduce FAVOR, an innovative tool that utilizes both the vulnerable function's code and its control flow graph (CFG) as inputs. FAVOR incorporates a dependency embedding module to capture structural and dependency information and leverages CodeT5, a state-of-the-art model pre-trained for code generation tasks. To further enhance the repair process, we introduce a pattern store that uses KNN search to retrieve similar past repair patterns, which helps guide the model toward generating more contextually accurate patches. In our experiments, FAVOR, trained on a dataset of 6548 faulty C/C++ functions, repaired 45 more vulnerabilities compared to VULREPAIR, demonstrating improved accuracy and efficiency in automated vulnerability repair.
本报告系统性地整合了基于大语言模型(LLM)的漏洞挖掘与检测研究。当前研究已从早期的简单文本分类演进为深度语义与结构化逻辑理解的综合体系。核心趋势包括:1) 融合程序分析与图表征的混合架构成为主流,以弥补LLM在复杂逻辑上的短板;2) 提示工程与多智能体协作显著提升了推理的深度与准确性;3) 应用场景从通用软件扩展至智能合约、工控及硬件等垂直领域;4) 自动化闭环(从检测到修复)与模型自身的对抗鲁棒性成为新的研究热点。总体而言,LLM正引领漏洞分析从规则驱动向智能语义驱动的范式转型。