Technical Survey of Tools and Platforms for Training and Fine-Tuning Large Language Models
Algorithmic Evolution and Theoretical Mechanisms of Parameter-Efficient Fine-Tuning (PEFT)
This group of papers focuses on achieving efficient model adaptation under limited computational resources by improving LoRA and its variants (e.g., DoRA, LoRA Dropout). The work covers systematic surveys of PEFT, the mathematical principles behind parameter updates, and the combination of mixture-of-experts (MoE) models with LoRA, aiming to raise both the theoretical ceiling and the generalization ability of fine-tuning.
- PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models(Robert Belanec, Ivan Srba, Mária Bieliková, 2025, ArXiv.org)
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention(Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, Yu Qiao, 2023, arXiv (Cornell University))
- Full Parameter Fine-tuning for Large Language Models with Limited Resources(Kai Lv, Yuqing Yang, Tengxiao Liu, Qipeng Guo, Xipeng Qiu, 2024, No journal)
- IISAN: Efficiently Adapting Multimodal Representation for Sequential Recommendation with Decoupled PEFT(Junchen Fu, Xuri Ge, Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Jie Wang, Joemon M. Jose, 2024, No journal)
- Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment(Lingling Xu, Haoran Xie, S. Joe Qin, Xiaohui Tao, Fu Lee Wang, 2023, arXiv (Cornell University))
- DoRA: Weight-Decomposed Low-Rank Adaptation(Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang‐Ting Cheng, Min-Hung Chen, 2024, arXiv (Cornell University))
- LoRA Dropout as a Sparsity Regularizer for Overfitting Control(Lin Yang, Xinyu Ma, Xu Chu, Yujie Jin, Zhibang Yang, Yasha Wang, Hong Mei, 2024, arXiv (Cornell University))
- Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning(Vladislav Lialin, Vijeta Deshpande, Anna Rumshisky, 2023, arXiv (Cornell University))
- When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications(Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Derong Xu, Feng Tian, Yefeng Zheng, 2024, No journal)
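The LoRA mechanism underlying most of the papers above fits in a few lines: the frozen weight W is adapted as W + (alpha/r)·BA, and only the low-rank factors A and B are trained. A minimal NumPy sketch (illustrative only; the dimensions, zero-init of B, and alpha/r scaling follow the original LoRA convention, not any particular library's API):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 1024, 1024, 8, 16

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero-init so training starts at W

def lora_forward(x):
    # x: (batch, d_in) -> (batch, d_out); base path plus scaled low-rank update
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
y = lora_forward(x)

full_params = W.size
lora_params = A.size + B.size
print(y.shape)                    # (2, 1024)
print(lora_params / full_params)  # 0.015625: ~1.6% of full fine-tuning
```

With B zero-initialized, the adapted model is exactly the base model at step 0, which is why LoRA training is stable from the start.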
Model Compression, Quantization-Aware Fine-Tuning, and Edge-Device Deployment Optimization
These papers investigate resource-optimization techniques for large-scale language model training and inference. Core topics include quantization-aware fine-tuning (QA-LoRA, QLoRA, IR-QLoRA), weight pruning (LoRAPrune), outlier-aware quantization (OWQ), and lightweight deployment schemes for edge-computing scenarios such as 6G and MEC.
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models(Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhensu Chen, Xiaopeng Zhang, Qi Tian, 2023, arXiv (Cornell University))
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models(Yixiao Li, Yifan Yu, Liang Chen, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao, 2023, arXiv (Cornell University))
- ReALLM: A general framework for LLM compression and fine-tuning(Louis Leconte, Lisa Bedin, Van Minh Nguyen, Éric Moulines, 2024, arXiv (Cornell University))
- Pushing Large Language Models to the 6G Edge: Vision, Challenges, and Opportunities(Zheng Lin, Guanqiao Qu, Qiyuan Chen, Xianhao Chen, Zhe Chen, Kaibin Huang, 2023, arXiv (Cornell University))
- QLoRA: Efficient Finetuning of Quantized LLMs(Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023, arXiv (Cornell University))
- Accurate LoRA-Finetuning Quantization of LLMs via Information Retention(Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno, 2024, arXiv (Cornell University))
- OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models(Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, Eunhyeok Park, 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
- LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning(Mingyang Zhang, H. S. Chen, Chunhua Shen, Zhen Yang, Linlin Ou, Xinyi Yu, Bohan Zhuang, 2024, No journal)
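The common thread of the QLoRA-style methods above is that frozen base weights are stored in low precision while the small LoRA factors stay in full precision. A toy per-row absmax 4-bit quantizer sketches the storage side (a simplification for illustration; this is not the NF4 data type QLoRA actually uses):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(256, 256)).astype(np.float32)  # frozen base weight

def quantize_absmax(w, bits=4):
    # per-row scale maps values onto the signed integer grid [-7, 7]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # recover an approximation of the original weights for the forward pass
    return q.astype(np.float32) * scale

q, scale = quantize_absmax(W)
W_hat = dequantize(q, scale)

print(q.min(), q.max())                 # stays within [-7, 7]
print(float(np.abs(W - W_hat).mean()))  # small per-element reconstruction error
```

Fine-tuning then adds full-precision LoRA factors on top of `W_hat`, so the quantization error of the frozen base can be partially compensated during adaptation.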
Comprehensive Fine-Tuning Infrastructure, Distributed Architectures, and Unified Training-Inference Platforms
This group addresses the engineering side of deploying large models in practice: one-stop training frameworks such as LLaMA-Factory and SWIFT, throughput improvements from integrated Triton kernels, blockchain-based decentralized training (AIArena), cloud-edge collaborative automation platforms, and various distributed-training middleware.
- LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models(Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, 2024, No journal)
- 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training(Hejian Zou, Xiaowei Lv, Shi Jia, Chunlin Li, Xianmin Gong, Xiangzheng Zhang, 2025, ArXiv.org)
- SWIFT: A Scalable Lightweight Infrastructure for Fine-Tuning(Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, Yingda Chen, 2025, Proceedings of the AAAI Conference on Artificial Intelligence)
- LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models(Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee‐Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, Roy Lee, 2023, No journal)
- The Falcon Series of Open Language Models(Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo, 2023, arXiv (Cornell University))
- STAF-LLM: A scalable and task-adaptive fine-tuning framework for large language models in medical domain(Tianhan Xu, Ling Chen, Zhe Hu, Bin Li, 2025, Expert Systems with Applications)
- Optimizing throughput of Seq2Seq model training on the IPU platform for AI-accelerated CFD simulations(Paweł Rościszewski, Adam Krzywaniak, Sergio Iserte, Krzysztof Rojek, Paweł Gepner, 2023, Future Generation Computer Systems)
- Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-tuning(Bei Ouyang, Shengyuan Ye, Liekang Zeng, Tianyi Qian, Jingyi Li, Xu Chen, 2024, No journal)
- An AI Model Automatic Training and Deployment Platform Based on Cloud Edge Architecture for DC Energy-Saving(Chunfang Li, Zhou Guo, Xingmin He, Fei Hu, Weiye Meng, 2023, No journal)
- A Scalable AI Training Platform for Remote Sensing Data(Hendrik M. Würz, Kevin Kocon, Barbara Pedretscher, Eva Klien, Eva Eggeling, 2023, AGILE GIScience Series)
- Liger Kernel: Efficient Triton Kernels for LLM Training(Pin-Lun (Byron) Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Sadanori Shimizu, Shivam Sahni, Haichun Ning, 2024, arXiv (Cornell University))
- 训推一体平台架构设计与关键技术研究(梁秉豪, 张传刚, 2023, 计算机科学与应用)
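Beneath all of the distributed middleware above sits the same primitive: each worker computes gradients on its data shard, then an all-reduce averages them before the optimizer step. A framework-free toy simulation of that primitive (the quadratic loss and data here are placeholders, not any platform's actual workload):

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.array([1.0, -2.0])       # model parameters, replicated on every worker
data = rng.normal(size=(8, 2))  # global batch
shards = np.split(data, 4)      # 4 simulated data-parallel workers

def grad(w, x):
    # gradient of the mean loss 0.5 * ||x @ w||^2 with respect to w
    return (x.T @ (x @ w)) / len(x)

local_grads = [grad(w, shard) for shard in shards]
g = np.mean(local_grads, axis=0)  # the "all-reduce": average across workers

# With equal shard sizes, the averaged gradient equals the full-batch gradient,
# which is why data parallelism preserves the single-device training dynamics.
g_ref = grad(w, data)
print(np.allclose(g, g_ref))  # True
```

Real middleware overlaps this communication with backward computation and shards optimizer state, but the correctness argument is exactly this equivalence.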
Human Preference Alignment, Feedback Learning, and Safety Governance
This group studies how to make models meet human expectations while ensuring safety. It covers alignment algorithms such as DPO, KTO, and RLAIF; approaches to countering reward over-optimization; protection of training data via differential privacy (DP) and federated learning (FedLLM); and red-teaming benchmarks for multimodal models and laboratory safety.
- Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization(Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, Yu Qiao, 2024, No journal)
- KTO: Model Alignment as Prospect Theoretic Optimization(Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela, 2024, arXiv (Cornell University))
- Countering Reward Over-Optimization in LLM with Demonstration-Guided Reinforcement Learning(Mathieu Rita, Florian Strub, Rahma Chaabouni, Paul Michel, Emmanuel Dupoux, Olivier Pietquin, 2024, No journal)
- Kimi k1.5: Scaling Reinforcement Learning with LLMs(Kimi Team, 2025, ArXiv.org)
- Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond(Liang Wen, Yiyu Cai, Fengping Xiao, Xin He, Qi An, Zhaojun Duan, Y. Y. Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Hejian Zou, Yongchao Deng, Shi Jia, Xiangzheng Zhang, 2025, No journal)
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback(Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Cărbune, Abhinav Rastogi, Sushant Prakash, 2023, arXiv (Cornell University))
- Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation(Guanting Dong, Yutao Zhu, Chenghao Zhang, Zechen Wang, Ji-Rong Wen, Zhicheng Dou, 2025, No journal)
- Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic(Rishabh Bhardwaj, Duc Anh, Soujanya Poria, 2024, No journal)
- LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs(Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V. Chawla, Xiangliang Zhang, 2025, Code Ocean)
- Red Teaming Visual Language Models(Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu, 2024, No journal)
- Hardening LLM Fine-Tuning: From Differentially Private Data Selection to Trustworthy Model Quantization(Zehang Deng, Ruoxi Sun, Minhui Xue, Wanlun Ma, Sheng Wen, Surya Nepal, Yang Xiang, 2025, IEEE Transactions on Information Forensics and Security)
- AIArena: A Blockchain-Based Decentralized AI Training Platform(Zhipeng Wang, Rui Sun, Eric Lui, Tuo Zhou, Yizhe Wen, Jiahao Sun, 2024, arXiv (Cornell University))
- FATE-LLM: A Industrial Grade Federated Learning Framework for Large Language Models(Tao Fan, Yan Kang, Guoqiang Ma, Weijing Chen, Wenbin Wei, Lixin Fan, Qiang Yang, 2023, arXiv (Cornell University))
- FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning(Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, Jingren Zhou, 2024, No journal)
- Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs(Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Seyyedali Hosseinalipour, Christopher G. Brinton, 2025, ArXiv.org)
- SAP: Privacy-Preserving Fine-Tuning on Language Models with Split-and-Privatize Framework(Huan Tian, Guangsheng Zhang, Bo Liu, Tianqing Zhu, Ming Ding, Wanlei Zhou, Bing Duan, Zirui Huang, Yunlong Mao, Ye Wu, Sheng Zhong, 2024, No journal)
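Of the alignment methods listed above, DPO is compact enough to state directly: the loss is -log sigmoid(beta * margin), where the margin compares the policy/reference log-probability ratio of the chosen response against that of the rejected one. A per-example sketch (beta and the log-probabilities below are illustrative values, not from any experiment):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # implicit reward of a response = beta * (policy logprob - reference logprob);
    # the loss is -log sigmoid of the reward margin between chosen and rejected
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Zero margin (policy agrees with the reference) gives the chance-level loss
# log(2); a positive margin (policy already favours the chosen response) lowers it.
print(dpo_loss(-12.0, -12.0, -12.0, -12.0))  # log(2), about 0.693
print(dpo_loss(-10.0, -14.0, -11.0, -13.0))  # margin +2, loss below log(2)
```

No reward model or RL rollout appears anywhere: the preference pair and the frozen reference model are the entire supervision signal, which is what makes DPO cheap relative to RLHF.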
Knowledge Augmentation, Tool Learning, and Extension of Complex Specialized Capabilities
This group aims to extend the capability boundaries of LLMs, including knowledge injection with synthetic data (Ski), enhanced external tool invocation (ToolLLM), handling of graph-structured data (GraphGPT), retrieval-augmented generation (RAG), multimodal capability fusion, and adaptation for long-text processing.
- Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models(Jiaxin Zhang, Wendi Cui, Yiran Huang, Kamalika Das, Sricharan Kumar, 2024, No journal)
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs(Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun, 2023, arXiv (Cornell University))
- GraphGPT: Graph Instruction Tuning for Large Language Models(Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, Chao Huang, 2024, No journal)
- 大语言模型融合知识图谱的装备问答系统研究(王美华, 张友星, 2025, 人工智能与机器人研究)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages(Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, Yang Liu, 2024, No journal)
- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca(Yiming Cui, Ziqing Yang, Xin Yao, 2023, arXiv (Cornell University))
- LIFT: A Novel Framework for Enhancing Long-Context Understanding of LLMs via Long Input Fine-Tuning(Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang, 2025, arXiv (Cornell University))
- A Novel Multimodal Transformer Approach for Targeted Information Retrieval from Obscure Images(Kanishk Dukia, Utsav Gupta, Vasudev Dehalwar, Amit Kumar Nandanwar, 2025, No journal)
- 基于视觉–语言联合建模与LoRA微调的医疗废弃物检测模型(刘 奥, 曾 耀, 李 卓, 孙 强, 王孟飞, 2025, 人工智能与机器人研究)
- Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality(Jiahuan Pei, Irene Viola, Hao‐Chen Huang, Junxiao Wang, Moonisa Ahsan, Fanghua Ye, Yiming Jiang, Yao Sai, Di Wang, Zhumin Chen, Pengjie Ren, Pablo César, 2024, No journal)
- Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models(Seungduk Kim, Seungtaek Choi, Myeongho Jeong, 2024, arXiv (Cornell University))
- Open-source LLMs for text annotation: a practical guide for model setting and fine-tuning(Meysam Alizadeh, Maël Kubli, Zeynab Samei, Shirin Dehghani, Mohammadmasiha Zahedivafa, Juan Diego Bermeo, Maria Korobeynikova, Fabrizio Gilardi, 2024, Journal of Computational Social Science)
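The RAG pattern referenced above reduces to retrieve-then-prompt: score the query against a document store, take the top-k passages, and prepend them to the prompt. A self-contained sketch in which random vectors stand in for a real embedding model (the control flow, not the embedding quality, is the point):

```python
import numpy as np

rng = np.random.default_rng(3)

docs = [
    "LoRA adds trainable low-rank matrices to frozen weights.",
    "QLoRA fine-tunes on top of a 4-bit quantized base model.",
    "ToolLLM teaches models to call external APIs.",
]
doc_vecs = rng.normal(size=(len(docs), 32))  # stand-in document embeddings

def embed(text):
    # stand-in encoder: deterministic pseudo-embedding derived from the text
    seed = sum(ord(c) for c in text) % 2**32
    return np.random.default_rng(seed).normal(size=32)

def retrieve(query, k=2):
    # cosine similarity between query and every stored document, take top-k
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

query = "How does quantized fine-tuning work?"
context = retrieve(query)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
print(len(context))  # k = 2 passages prepended to the prompt
```

In a production pipeline the encoder is a trained embedding model and the store an approximate-nearest-neighbor index, but the generator never sees anything beyond this assembled prompt.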
Customized Fine-Tuning and Application Practice in Vertical Industry Domains
These papers demonstrate in-depth applications of LLMs in specific domains such as healthcare, finance, law, manufacturing, transportation, code review, and recommender systems. The focus is on fine-tuning with domain-specific data and professional instructions to improve model performance on domain logic, terminology understanding, and industry tasks.
- Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue(Songhua Yang, Hanjie Zhao, Senbin Zhu, Guangyu Zhou, Hongfei Xu, Yuxiang Jia, Hongying Zan, 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
- 基于LLM的智能阅卷系统设计(魏 明, 2025, 管理科学与工程)
- 基于DeepSeek微调和动态建模的交通流预测(高 畅, 2025, 交通技术)
- MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training Data(Tianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, Keno K. Bressem, 2023, arXiv (Cornell University))
- Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain(Aryo Pradipta Gema, Pasquale Minervini, Luke Daines, Tom Hope, Beatrice Alex, 2024, No journal)
- LLM-MANUF: An integrated framework of Fine-Tuning large language models for intelligent Decision-Making in manufacturing(Kui Du, Bo Yang, Keqiang Xie, Nan Dong, Zhengping Zhang, Shilong Wang, Fan Mo, 2025, Advanced Engineering Informatics)
- Fine-Tuning Large Language Models for Specialized Use Cases(DM Anisuzzaman, Jeffrey G. Malins, Paul A. Friedman, Zachi I. Attia, 2024, Mayo Clinic Proceedings Digital Health)
- 基于大语言模型的钻井智能系统构建技术研究(郭晓乐, 吴达越, 安思旭, 段 正, 周 超, 2025, 矿山工程)
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance(Qianqian Xie, Weiguang Han, Zhang Xiao, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, Jimin Huang, 2023, arXiv (Cornell University))
- 医疗电商平台中大语言模型驱动的中文医学对话系统研究(滚流海, 曾以春, 吴 娜, 2024, 电子商务评论)
- LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning(Junyi Lu, Lei Yu, Xiao-Jia Li, Yang Li, Chun Zuo, 2023, No journal)
- TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation(Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He, 2023, No journal)
- AutoRE: Document-Level Relation Extraction with Large Language Models(Lilong Xue, Dan Zhang, Yuxiao Dong, Jie Tang, 2024, No journal)
- A GAIL Fine-Tuned LLM Enhanced Framework for Low-Resource Knowledge Graph Question Answering(Zhiqiang Zhang, Liqiang Wen, Wen Zhao, 2024, No journal)
- Open-Source Large Language Models in Radiology: A Review and Tutorial for Practical Research and Clinical Deployment(Cody Savage, Adway Kanhere, Vishwa S. Parekh, Curtis P. Langlotz, Anupam Joshi, Heng Huang, Florence X. Doo, 2025, Radiology)
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct(Ziyang Luo, Can Xu, Pu Zhao, Qing‐Feng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang, 2023, arXiv (Cornell University))
- ITERTL: An Iterative Framework for Fine-tuning LLMs for RTL Code Generation(Peiyang Wu, Nan Guo, Xiao Xiao, Wenming Li, Xiaochun Ye, Dongrui Fan, 2025, No journal)
- AI-TA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs(Yann Hicke, Anmol Agarwal, Qianou Ma, Paul Denny, 2023, arXiv (Cornell University))
- A Comparative Analysis of Large Model Role-Dialogues Based on LoRA Fine-Tuning has been Conducted(Qiang Wang, Ning Ma, 2025, No journal)
- 大语言模型在企业信息化中的应用探讨(刘浩东, 2025, 电子商务评论)
- Fine Tuning LLM for Enterprise: Practical Guidelines and Recommendations(Mathav Raj J, Kushala VM, Harikrishna Warrier, Yogesh Kumar Gupta, 2024, arXiv (Cornell University))
- Machine Translation with Large Language Models: Prompting, Few-shot Learning, and Fine-tuning with QLoRA(Xuan Zhang, Navid Rajabi, Kevin Duh, Philipp Koehn, 2023, No journal)
- BB-GeoGPT: A framework for learning a large language model for geographic information science(Yifan Zhang, Zhiyun Wang, Zhengting He, Jingxuan Li, Gengchen Mai, Jianfeng Lin, Wei Cheng, Wenhao Yu, 2024, Information Processing & Management)
- 对比经微调的ERNIE-Lite-8K-0922和GPT-4在使用Prompt策略后在英语对话系统中的表现:以心理咨询师角色为例(季东霖, 郭子浩, 陈雨洁, 王欣然, 张梦林, 孙文韬, 2024, 人工智能与机器人研究)
- SPRec: Self-Play to Debias LLM-based Recommendation(Chongming Gao, Renqiang Chen, Shuai Yuan, Kexin Huang, Yuanqing Yu, Xiangnan He, 2025, No journal)
- Harnessing Large Language Models for Text-Rich Sequential Recommendation(Zhi Zheng, Wenshuo Chao, Zhaopeng Qiu, Hengshu Zhu, Hui Xiong, 2024, No journal)
- 基于思维链的通用语言模型推理能力研究(康睿哲, 2025, 人工智能与机器人研究)
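Most of the vertical-domain work above begins with the same mechanical step: casting raw domain records into instruction-response pairs for supervised fine-tuning. A sketch using the Alpaca-style schema (one common convention, not a requirement of any framework; the records below are made up for illustration):

```python
import json

# hypothetical raw domain records (medical and finance examples)
records = [
    {"question": "患者主诉头痛三天,应优先排查什么?", "answer": "..."},
    {"question": "What does a negative pledge clause imply for the borrower?",
     "answer": "..."},
]

def to_instruction_sample(rec, system="You are a domain expert assistant."):
    # Alpaca-style SFT record: instruction / input / output (+ optional system)
    return {
        "instruction": rec["question"],
        "input": "",
        "output": rec["answer"],
        "system": system,
    }

samples = [to_instruction_sample(r) for r in records]
sft_json = json.dumps(samples, ensure_ascii=False, indent=2)
print(len(samples))  # 2 records ready for an SFT data loader
```

The resulting JSON can be fed to most open fine-tuning frameworks after registering it as a dataset; the real effort in these papers lies in sourcing and cleaning the domain records, not in this conversion.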
The final grouping outlines the full technical landscape of large language models, from low-level algorithms to top-level applications. The body of work divides into six cores: an efficient-algorithm layer represented by PEFT and its quantized variants; a compression-and-deployment layer spanning quantization, pruning, and edge inference; an infrastructure layer of distributed and unified training-inference platforms; a governance layer of human preference alignment, privacy protection, and safety red-teaming; a capability-extension layer of multimodality, tool invocation, and RAG; and a vertical-application layer covering healthcare, finance, manufacturing, and other industries. Together, this reflects that LLMs are at a pivotal stage of transition from "general-purpose large models" to "efficient, safe, specialized, industrial-grade tools with complex interaction capabilities."
A total of 86 related references.

Appendix: Selected Abstracts
- A Scalable AI Training Platform for Remote Sensing Data: We present a platform to support the AI development lifecycle with a focus on large data such as remote sensing. We target developers who are not allowed to use existing commercial cloud platforms for legal or data-compliance reasons. The flexible implementation of our platform enables deployment on classic server infrastructures as well as on internal clouds. Our goals of scalable and resource-efficient execution, independence from specific AI frameworks and programming languages, and reproducibility of results are met through workflow-based computation combined with the tool Data Version Control. The capabilities of the platform are demonstrated by training an AI-based forest type classification.
- AIArena: A Blockchain-Based Decentralized AI Training Platform: The rapid advancement of AI has underscored critical challenges in its development and implementation, largely due to centralized control by a few major corporations. This concentration of power intensifies biases within AI models, resulting from inadequate governance and oversight mechanisms. Additionally, it limits public involvement and heightens concerns about the integrity of model generation. Such monopolistic control over data and AI outputs threatens both innovation and fair data usage, as users inadvertently contribute data that primarily benefits these corporations.
- Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-tuning: Large language models (LLMs) have unlocked a plethora of powerful applications at the network edge, such as intelligent personal assistants. Data privacy and security concerns have prompted a shift towards edge-based fine-tuning of personal LLMs, away from cloud reliance. However, this raises issues of computational intensity and resource scarcity, hindering training efficiency and feasibility. While current studies investigate parameter-efficient fine-tuning (PEFT) techniques to mitigate resource constraints, our analysis indicates that these techniques are not sufficiently resource-efficient for edge devices. Other studies focus on exploiting the potential of edge devices through resource management optimization, yet are ultimately bottlenecked by the resource wall of individual devices.
- TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation: Large Language Models (LLMs) have demonstrated remarkable performance across diverse domains, thereby prompting researchers to explore their potential for use in recommendation systems. Initial attempts have leveraged the exceptional capabilities of LLMs, such as rich knowledge and strong generalization through In-context Learning, which involves phrasing the recommendation task as prompts. Nevertheless, the performance of LLMs in recommendation tasks remains suboptimal due to a substantial disparity between the training tasks for LLMs and recommendation tasks, as well as inadequate recommendation data during pre-training. To bridge the gap, we consider building a Large Recommendation Language Model by tuning LLMs with recommendation data. To this end, we propose an efficient and effective Tuning framework for Aligning LLMs with Recommendations, namely TALLRec. We have demonstrated that the proposed TALLRec framework can significantly enhance the recommendation capabilities of LLMs in the movie and book domains, even with a limited dataset of fewer than 100 samples. Additionally, the proposed framework is highly efficient and can be executed on a single RTX 3090 with LLaMA-7B. Furthermore, the fine-tuned LLM exhibits robust cross-domain generalization. Our code and data are available at https://github.com/SAI990323/TALLRec.
- LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models: The success of large language models (LLMs), like GPT-4 and ChatGPT, has led to the development of numerous cost-effective and accessible alternatives that are created by finetuning open-access LLMs with task-specific data (e.g., ChatDoctor) or instruction data (e.g., Alpaca). Among the various fine-tuning methods, adapter-based parameter-efficient fine-tuning (PEFT) is undoubtedly one of the most attractive topics, as it only requires fine-tuning a few external parameters instead of the entire LLMs while achieving comparable or even better performance. To enable further research on PEFT methods of LLMs, this paper presents LLM-Adapters, an easy-to-use framework that integrates various adapters into LLMs and can execute these adapter-based PEFT methods of LLMs for different tasks. The framework includes state-of-the-art open-access LLMs such as LLaMA, BLOOM, and GPT-J, as well as widely used adapters such as Series adapters, Parallel adapters, Prompt-based learning and Reparametrization-based methods. Moreover, we conduct extensive empirical studies on the impact of adapter types, placement locations, and hyper-parameters on the best design for each adapter-based method. We evaluate the effectiveness of the adapters on fourteen datasets from two different reasoning tasks, Arithmetic Reasoning and Commonsense Reasoning. The results demonstrate that using adapter-based PEFT in smaller-scale LLMs (7B) with few extra trainable parameters yields comparable, and in some cases superior, performance to powerful LLMs (175B) in zero-shot inference on simple math reasoning datasets.
- A GAIL Fine-Tuned LLM Enhanced Framework for Low-Resource Knowledge Graph Question Answering: Recent studies on knowledge graph question answering (KGQA) have focused on tackling complex inquiries to enhance the applicability of models in real-life settings. Unfortunately, KGQA models encounter significant challenges due to the lack of high-quality annotated data, making it difficult to accurately answer the diverse range of complex natural language questions posed by users. Inspired by the recent success of Large Language Models (LLMs), the burden associated with manual annotation can be mitigated by utilizing LLMs. However, the data generated directly by LLMs may exhibit a potential distribution discrepancy with real user queries. In this paper, we present an enhancement framework that utilizes Generative Adversarial Imitation Learning (GAIL) to fine-tune LLMs, which can address the challenges inherent in the low-resource KGQA task. Specifically, based on GAIL, the LLMs act as the generator aiming to output samples resembling expert demonstrations. Meanwhile, we utilize a paired discriminator to assess the authenticity of generated sequences and their relevance to the input SPARQL queries. Additionally, proximal policy optimization is leveraged to stabilize the training of the generator. Furthermore, we employ an automated algorithm to controllably sample various SPARQL queries from the knowledge graph, subsequently transforming them into corresponding natural language questions using fine-tuned LLMs. The synthetic dataset can serve as supplementary data for training lightweight KGQA models in real-world scenarios. Experimental results on the WebQuestionsSP, ComplexWebQuestions, and GrailQA show that our framework achieves state-of-the-art performance in a low-resource setting, even approaching the performance of supervised models.
- GraphGPT: Graph Instruction Tuning for Large Language Models: Graph Neural Networks (GNNs) have evolved to understand graph structures through recursive exchanges and aggregations among nodes. To enhance robustness, self-supervised learning (SSL) has become a vital tool for data augmentation. Traditional methods often depend on fine-tuning with task-specific labels, limiting their effectiveness when labeled data is scarce. Our research tackles this by advancing graph model generalization in zero-shot learning environments. Inspired by the success of large language models (LLMs), we aim to create a graph-oriented LLM capable of exceptional generalization across various datasets and tasks without relying on downstream graph data. We introduce the GraphGPT framework, which integrates LLMs with graph structural knowledge through graph instruction tuning. This framework includes a text-graph grounding component to link textual and graph structures and a dual-stage instruction tuning approach with a lightweight graph-text alignment projector. These innovations allow LLMs to comprehend complex graph structures and enhance adaptability across diverse datasets and tasks. Our framework demonstrates superior generalization in both supervised and zero-shot graph learning tasks, surpassing existing benchmarks. The open-sourced model implementation of our GraphGPT is available at https://github.com/HKUDS/GraphGPT.
- ReALLM: A general framework for LLM compression and fine-tuning: We introduce ReALLM, a novel approach for compression and memory-efficient adaptation of pre-trained language models that encompasses most of the post-training quantization and fine-tuning methods for a budget of <4 bits. Pre-trained matrices are decomposed into a high-precision low-rank component and a vector-quantized latent representation (using an autoencoder). During the fine-tuning step, only the low-rank components are updated. Our results show that pre-trained matrices exhibit different patterns. ReALLM adapts the shape of the encoder (small/large embedding, high/low bit VQ, etc.) to each matrix. ReALLM proposes to represent each matrix with a small embedding on $b$ bits and a neural decoder model $\mathcal{D}_\phi$ with its weights on $b_\phi$ bits. The decompression of a matrix requires only one embedding and a single forward pass with the decoder. Our weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of $3$ bits without any training. With a budget of $2$ bits, ReALLM achieves state-of-the art performance after fine-tuning on a small calibration dataset.
- SplitLoRA: The scalability of large language models (LLMs) in handling high-complexity models and large-scale datasets has led to tremendous successes in pivotal domains. While there is an urgent need to acquire more training data for LLMs, a concerning reality is the depletion of high-quality public datasets within a few years. In view of this, the federated learning (FL) LLM fine-tuning paradigm recently has been proposed to facilitate collaborative LLM fine-tuning on distributed private data, where multiple data owners collaboratively fine-tune a shared LLM without sharing raw data. However, the staggering model size of LLMs imposes heavy computing and communication burdens on clients, posing significant barriers to the democratization of the FL LLM fine-tuning paradigm. To address this issue, split learning (SL) has emerged as a promising solution by offloading the primary training workload to a server via model partitioning while exchanging activation/activation's gradients with smaller data sizes rather than the entire LLM. Unfortunately, research on the SL LLM fine-tuning paradigm is still in its nascent stage. To fill this gap, in this paper, we propose the first SL LLM fine-tuning framework, named SplitLoRA. SplitLoRA is built on the split federated learning (SFL) framework, amalgamating the advantages of parallel training from FL and model splitting from SL and thus greatly enhancing the training efficiency. It is worth noting that SplitLoRA is the inaugural open-source benchmark for SL LLM fine-tuning, providing a foundation for research efforts dedicated to advancing SL LLM fine-tuning. Extensive simulations validate that SplitLoRA achieves target accuracy in significantly less time than state-of-the-art LLM fine-tuning frameworks, demonstrating the superior training performance of SplitLoRA. The project page is available at https://fduinc.github.io/splitlora/.
- An AI Model Automatic Training and Deployment Platform Based on Cloud Edge Architecture for DC Energy-Saving: The development of 5G, cloud computing, artificial intelligence (AI) and other new-generation information technologies has promoted the rapid development of the data center (DC) industry, which directly increases energy consumption and carbon emissions. In addition to traditional engineering-based methods, AI-based technology has been widely used in existing data centers. However, the existing AI model training schemes are time-consuming and laborious. To tackle these issues, we propose an automated training and deployment platform for AI models based on a cloud-edge architecture, including the processes of data processing, data annotation, model training optimization, and model publishing. The proposed system can generate specific models based on the room environment and realize standardization and automation of model training, which is helpful for large-scale data center scenarios. The simulation and experimental results show that the proposed solution can reduce the time required for single-model training by 76.2%, and multiple training tasks can run concurrently. Therefore, it can adapt to the large-scale energy-saving scenario and greatly improve model iteration efficiency, which improves the energy-saving rate and helps green energy conservation for data centers.
- LIFT: A Novel Framework for Enhancing Long-Context Understanding of LLMs via Long Input Fine-Tuning: Long context understanding remains challenging for large language models due to their limited context windows. This paper introduces Long Input Fine-Tuning (LIFT), a novel framework for long-context modeling that can enhance the long-context performance of arbitrary short-context LLMs by dynamically adapting their parameters to the given long input. Importantly, rather than endlessly extending the context window size to accommodate increasingly longer inputs in context, LIFT stores and absorbs the long input in parameters. By fine-tuning the long input into model parameters, LIFT allows short-context LLMs to answer questions even when the required information is not provided in the context during inference, avoiding the quadratic complexity w.r.t. input length of a normal long context model. Furthermore, LIFT does not simply perform continued pretraining on new, long contexts, but leverages carefully designed LLM-generated synthetic tasks to enhance the comprehension of long contexts, moving beyond mere memorization. To accommodate the additional cost of fine-tuning, we design a highly optimized pipeline that reduces the Time to First Token (TTFT) to less than 10 seconds for 8k context. We further provide a comprehensive analysis of LIFT's strengths and limitations in long-context understanding, discuss its feasibility for large-scale real-world deployment, and highlight valuable directions for future research.
- When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications: The recent surge in Large Language Models (LLMs) has garnered significant attention across numerous fields. Fine-tuning is often required to fit general LLMs for a specific domain, like the web-based healthcare system. However, two problems arise during fine-tuning LLMs for medical applications. One is the task variety problem, which involves distinct tasks in real-world medical scenarios. The variety often leads to sub-optimal fine-tuning for data imbalance and seesaw problems. Besides, the large amount of parameters in LLMs leads to huge time and computation consumption by fine-tuning. To address these two problems, we propose a novel parameter efficient fine-tuning framework for multi-task medical applications, dubbed as MOELoRA. The designed framework aims to absorb both the benefits of mixture-of-expert (MOE) for multi-task learning and low-rank adaptation (LoRA) for parameter efficient fine-tuning. For unifying MOE and LoRA, we devise multiple experts as the trainable parameters, where each expert consists of a pair of low-rank matrices to retain the small size of trainable parameters. Then, a task-motivated gate function for all MOELoRA layers is proposed, which can control the contributions of each expert and produce distinct parameters for various tasks. We conduct experiments on a multi-task medical dataset, indicating MOELoRA outperforms the existing parameter efficient fine-tuning methods. The code is available online.
Large language models (LLMs) have demonstrated great capabilities in various natural language understanding and generation tasks. These pre-trained LLMs can be further improved for specific downstream tasks by fine-tuning. However, the adoption of LLMs in real-world applications can be hindered by privacy concerns and the resource-intensive nature of model training and fine-tuning. When multiple entities have similar tasks of interest but cannot directly share their local data due to privacy regulations, federated learning (FL) is a mainstream solution for leveraging the data of different entities. Besides avoiding direct data sharing, FL can also achieve rigorous data privacy protection, model intellectual property protection, and model customization via composition with different techniques. Despite the aforementioned advantages of FL, fine-tuning LLMs in FL settings still lacks adequate support from existing frameworks and therefore faces challenges in optimizing the consumption of significant communication and computational resources, preparing various data for different tasks, and satisfying diverse information protection demands. In this paper, we discuss these challenges and introduce our package FederatedScope-LLM (FS-LLM) as a main contribution, which consists of: (1) a complete end-to-end benchmarking pipeline under real-world scenarios, automating the processes of dataset preprocessing, federated fine-tuning execution or simulation, and performance evaluation; (2) comprehensive and off-the-shelf federated parameter-efficient fine-tuning (PEFT) algorithm implementations and versatile programming interfaces for future extension, enhancing the capabilities of LLMs in FL scenarios with low communication and computation costs, even without accessing the full model; (3) several accelerating and resource-efficient operators, along with flexible pluggable sub-routines for interdisciplinary study.
We conduct extensive and reproducible experiments to show the effectiveness of FS-LLM and benchmark advanced LLMs with PEFT algorithms in FL. We release FS-LLM at https://github.com/alibaba/FederatedScope/tree/llm.
Fine-tuning large language models (LLMs) on resource-constrained clients remains a challenging problem. Recent works have fused low-rank adaptation (LoRA) techniques with federated fine-tuning to mitigate challenges associated with client model sizes and data scarcity. Still, the heterogeneity of resources remains a critical bottleneck: while higher-rank modules generally enhance performance, varying client capabilities constrain LoRA's feasible rank range. Existing approaches attempting to resolve this issue either lack analytical justification or impose additional computational overhead, leaving a wide gap for efficient and theoretically-grounded solutions. To address these challenges, we propose federated sketching LoRA (FSLoRA), which leverages a sketching mechanism to enable clients to selectively update submatrices of global LoRA modules maintained by the server. By adjusting the sketching ratios, which determine the ranks of the submatrices on the clients, FSLoRA flexibly adapts to client-specific communication and computational constraints. We provide a rigorous convergence analysis of FSLoRA that characterizes how the sketching ratios affect the convergence rate. Through comprehensive experiments on multiple datasets and LLM models, we demonstrate FSLoRA's performance improvements compared to various baselines.
Recently, large language models (LLMs) have demonstrated excellent performance, inspiring researchers to explore their use in automating register transfer level (RTL) code generation and improving hardware design efficiency. However, existing approaches to fine-tuning LLMs for RTL generation are typically conducted on fixed datasets, which do not fully stimulate the capability of LLMs and require large amounts of reference data that are costly to acquire. To mitigate these issues, we introduce an iterative training paradigm named ITERTL. During each iteration, samples are drawn from the model trained in the previous cycle; these new samples are then employed for training in the current loop. Furthermore, we introduce a plug-and-play data filtering strategy that encourages the model to generate high-quality, self-contained code. Our model outperforms GPT4 and state-of-the-art (SOTA) open-source models, achieving a remarkable 53.8% pass@1 rate on the VerilogEval-human benchmark. Under similar conditions of data quantity and quality, our approach significantly outperforms the baseline. Extensive experiments validate the effectiveness of the proposed method.
The automation of code review activities, a long-standing pursuit in software engineering, has been primarily addressed by numerous domain-specific pre-trained models. Despite their success, these models frequently demand extensive resources for pre-training from scratch. In contrast, Large Language Models (LLMs) provide an intriguing alternative, given their remarkable capabilities when supplemented with domain-specific knowledge. However, their potential for automating code review tasks remains largely unexplored. In response to this research gap, we present LLaMA-Reviewer, an innovative framework that leverages the capabilities of LLaMA, a popular LLM, in the realm of code review. Mindful of resource constraints, this framework employs parameter-efficient fine-tuning (PEFT) methods, delivering high performance while using less than 1% of trainable parameters. An extensive evaluation of LLaMA-Reviewer is conducted on two diverse, publicly available datasets. Notably, even with the smallest LLaMA base model consisting of 6.7B parameters and a limited number of tuning epochs, LLaMA-Reviewer equals the performance of existing code-review-focused models. The ablation experiments provide insights into the influence of various fine-tuning process components, including input representation, instruction tuning, and different PEFT methods. To foster continuous progress in this field, the code and all PEFT-weight plugins have been made open-source.
Although large language models (LLMs) have shown great performance on natural language processing (NLP) in the financial domain, there are no publicly available finance-tailored LLMs, instruction tuning datasets, or evaluation benchmarks, which are critical for continually pushing forward the open-source development of financial artificial intelligence (AI). This paper introduces PIXIU, a comprehensive framework including the first financial LLM based on fine-tuning LLaMA with instruction data, the first instruction dataset with 136K samples to support the fine-tuning, and an evaluation benchmark with 5 tasks and 9 datasets. We first construct the large-scale multi-task instruction data considering a variety of financial tasks, financial document types, and financial data modalities. We then propose a financial LLM called FinMA by fine-tuning LLaMA with the constructed dataset so that it can follow instructions for various financial tasks. To support the evaluation of financial LLMs, we propose a standardized benchmark that covers a set of critical financial tasks, including five financial NLP tasks and one financial prediction task. With this benchmark, we conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks. The model, datasets, benchmark, and experimental results are open-sourced to facilitate future research in financial AI.
No abstract
Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.
Critical infrastructures are increasingly integrating artificial intelligence (AI) technologies, including large language models (LLMs), into essential systems and services that are vital to societal functioning. Fine-tuning LLMs for specific domain tasks is crucial for their effective deployment in these contexts, but this process must carefully address both privacy and security concerns. Without proper safeguards, such integration can introduce additional risks, such as data leakage during training and diminished model trustworthiness due to the need for model compression to operate within limited bandwidth and computational capacity constraints. In this paper, we propose the Hardening LLM Fine-tuning framework (HARDLLM), which addresses these challenges through two key components: (i) we develop a differentially private data selection method that ensures privacy protection by training the model exclusively on sampled and synthesized public data, thereby preventing any direct use of private data and enhancing leakage resilience throughout the training process, and (ii) we introduce a trustworthiness-aware model quantization approach to improve LLM performance, such as reducing toxicity, enhancing adversarial robustness, and mitigating stereotypes, while maintaining negligible impact on model utility. Experimental results show that the proposed algorithm ensures differential privacy when the privacy budget is set at ϵ = 0.5, with only a 1% drop in accuracy, while other state-of-the-art methods experience an accuracy drop of at least 20% under the same privacy budget.
Additionally, our quantization approach improves the trustworthiness of fine-tuned LLMs by an average of 3-4%, with only a negligible utility loss (approximately 1%) at a 50% compression rate.
Recent advances in Large Language Models (LLMs) have been changing the paradigm of Recommender Systems (RS). However, when items in the recommendation scenarios contain rich textual information, such as product descriptions in online shopping or news headlines on social media, LLMs require longer texts to comprehensively depict the historical user behavior sequence. This poses significant challenges to LLM-based recommenders, such as over-length limitations, extensive time and space overheads, and suboptimal model performance. To this end, in this paper, we design a novel framework for harnessing Large Language Models for Text-Rich Sequential Recommendation (LLM-TRSR). Specifically, we first propose to segment the user historical behaviors and subsequently employ an LLM-based summarizer to summarize these user behavior blocks. Drawing inspiration from the successful application of Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) models in user modeling, we introduce two unique summarization techniques, namely hierarchical summarization and recurrent summarization. Then, we feed a prompt text encompassing the user preference summary, recent user interactions, and candidate item information into an LLM-based recommender, which is subsequently fine-tuned using Supervised Fine-Tuning (SFT) to yield our final recommendation model. We also use Low-Rank Adaptation (LoRA) for Parameter-Efficient Fine-Tuning (PEFT). We conduct experiments on two public datasets, and the results clearly demonstrate the effectiveness of our approach.
The rapid advancement of AI has underscored critical challenges in its development and implementation, largely due to centralized control by a few major corporations. This concentration of power intensifies biases within AI models, resulting from inadequate governance and oversight mechanisms. Additionally, it limits public involvement and heightens concerns about the integrity of model generation. Such monopolistic control over data and AI outputs threatens both innovation and fair data usage, as users inadvertently contribute data that primarily benefits these corporations. In this work, we propose AIArena, a blockchain-based decentralized AI training platform designed to democratize AI development and alignment through on-chain incentive mechanisms. AIArena fosters an open and collaborative environment where participants can contribute models and computing resources. Its on-chain consensus mechanism ensures fair rewards for participants based on their contributions. We instantiate and implement AIArena on the public Base blockchain Sepolia testnet, and the evaluation results demonstrate the feasibility of AIArena in real-world applications.
Adding sequence parallelism to LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and has been used in models such as Light-R1 (arXiv:2503.10460), TinyR1 (arXiv:2503.04872), and Kaggle AIMO math models, as well as in large companies' training frameworks. This technical report delves deeper into the different sequence-parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.
No abstract
In the domain of natural language processing, role-dialogue systems confront the challenge of generating dialogue content that is often unnatural and incongruent with the scenario setting. Fine-tuning based on large language models offers a novel approach to address this challenge. In this paper, we construct a dataset of Chinese role dialogues across four major domains (healthcare, finance, education, and e-commerce), covering different role-dialogue scenarios. We ensure the diversity and balance of the dataset through strict data cleaning and stratified sampling strategies. Using the unified LLaMA-Factory training framework, we employ supervised fine-tuning (SFT) and low-rank adaptation (LoRA) techniques to systematically analyze the optimization effects of two base models, Llama3-8B-Chinese-Chat and DeepSeek-LLM-7B-Chat, on evaluation metrics such as the BLEU and ROUGE series. The experimental results demonstrate that DeepSeek-LLM-7B-Chat exhibits a marked advantage in training efficiency and is well-suited for scenarios that prioritize inference speed, while Llama3-8B-Chinese-Chat demonstrates stronger optimization potential and is particularly adept at handling scenarios that demand high-quality generation.
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations, showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy for accurately evaluating the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
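The double-quantization idea, quantizing the per-block quantization constants themselves, can be illustrated with a simple uniform int4 grid standing in for NF4; the block size and rounding below are illustrative assumptions, not QLoRA's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
weights = rng.normal(size=(4096,)).astype(np.float32)
block = 64

# First-level quantization: per-block absmax scaling to 4-bit integers
# (a uniform grid here as a simple stand-in for the NF4 data type).
blocks = weights.reshape(-1, block)
scales = np.abs(blocks).max(axis=1)          # one fp32 constant per block
q4 = np.round(blocks / scales[:, None] * 7).astype(np.int8)  # in [-7, 7]

# Double quantization: the fp32 scales themselves are quantized to 8 bits,
# shrinking the per-parameter overhead of the quantization constants.
s_scale = np.abs(scales).max()
q8_scales = np.round(scales / s_scale * 127).astype(np.int8)

# Dequantize: recover the scales first, then the weights.
deq_scales = q8_scales.astype(np.float32) / 127 * s_scale
deq = q4.astype(np.float32) / 7 * deq_scales[:, None]

err = np.abs(deq.ravel() - weights).mean()
print(float(err))  # small mean absolute reconstruction error
```

With one fp32 constant per 64 weights, the constants alone cost 0.5 bits per parameter; quantizing them to 8 bits cuts that overhead down, which is exactly the saving double quantization targets.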
Pre-trained Language Models (PLM) have enabled a cost-effective approach to handling various downstream applications via Parameter-Efficient-Fine-Tuning (PEFT) techniques. In this context, service providers have introduced a popular fine-tuning-based product service known as Model-as-a-Service (MaaS). This service offers users access to extensive PLMs and training resources. With MaaS, users can fine-tune, deploy, and utilize their customized models seamlessly, leveraging a one-stop platform that allows them to work with their private datasets efficiently. However, this service paradigm has recently been exposed to the possibility of leaking user private data. To this end, we identify the data privacy leakage risks in MaaS-based PEFT and propose a Split-and-Privatize (SAP) framework, mitigating the privacy leakage by integrating split learning and differential privacy into MaaS PEFT. Furthermore, we propose Contributing-Token-Identification (CTI), a novel method to balance model utility degradation and privacy leakage. As a result, the proposed framework is comprehensively evaluated, demonstrating a 65% improvement in empirical privacy with only a 1% degradation in model performance on the Stanford Sentiment Treebank dataset, outperforming existing state-of-the-art baselines.
The LoRA-finetuning quantization of LLMs has been extensively studied to obtain accurate yet compact LLMs for deployment on resource-constrained hardware. However, existing methods cause the quantized LLM to severely degrade and even fail to benefit from the finetuning of LoRA. This paper proposes a novel IR-QLoRA for pushing quantized LLMs with LoRA to be highly accurate through information retention. The proposed IR-QLoRA mainly relies on two technologies derived from the perspective of unified information: (1) statistics-based Information Calibration Quantization allows the quantized parameters of the LLM to retain original information accurately; (2) finetuning-based Information Elastic Connection enables LoRA to utilize elastic representation transformation with diverse information. Comprehensive experiments show that IR-QLoRA can significantly improve accuracy across the LLaMA and LLaMA2 families under 2-4 bit-widths, e.g., 4-bit LLaMA-7B achieves a 1.4% improvement on MMLU compared with the state-of-the-art methods. The significant performance gain requires only a tiny 0.31% additional time consumption, revealing the satisfactory efficiency of our IR-QLoRA. We highlight that IR-QLoRA enjoys excellent versatility, compatible with various frameworks (e.g., NormalFloat and Integer quantization) and bringing general accuracy gains. The code is available at https://github.com/htqin/ir-qlora.
We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts and prepend them to the word tokens at higher transformer layers. Then, a zero-initialized attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA while effectively preserving its pre-trained knowledge. With our efficient training, LLaMA-Adapter can generate high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Besides language commands, our approach can be simply extended to multi-modal instructions for learning an image-conditioned LLaMA model, which achieves superior reasoning performance on the ScienceQA and COCO Caption benchmarks. Furthermore, we also evaluate the zero-initialized attention mechanism for fine-tuning other pre-trained models (ViT, RoBERTa) on traditional vision and language tasks, demonstrating the superior generalization capacity of our approach. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
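The zero-gating mechanism can be sketched as follows. This is a simplified single-head view in which a zero-initialized gate scales the adaption prompts' attention contribution; the paper applies the gate inside the attention computation over concatenated scores, and all shapes and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_tokens, n_prompts = 8, 5, 2

tokens = rng.normal(size=(n_tokens, d))   # frozen-model token features
prompts = rng.normal(size=(n_prompts, d)) # learnable adaption prompts
gate = np.zeros(1)                        # zero-initialized gating scalar

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_attention(query):
    """Attend over tokens and prompts, scaling the prompt contribution by
    the gate so the prompts inject nothing at step 0 and are introduced
    gradually as the gate is learned during fine-tuning."""
    w_t = softmax(tokens @ query / np.sqrt(d))
    w_p = softmax(prompts @ query / np.sqrt(d))
    return w_t @ tokens + gate[0] * (w_p @ prompts)

q = rng.normal(size=d)
out0 = gated_attention(q)
vanilla = softmax(tokens @ q / np.sqrt(d)) @ tokens
print(np.allclose(out0, vanilla))  # True: at init the adapter is a no-op
```

Starting as an exact no-op is what lets the adapter add instructional cues without disturbing the pre-trained model's behavior early in training.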
As large language models (LLMs) like OpenAI's GPT series continue to make strides, we witness the emergence of artificial intelligence applications in an ever-expanding range of fields. In medicine, these LLMs hold considerable promise for improving medical workflows, diagnostics, patient care, and education. Yet, there is an urgent need for open-source models that can be deployed on-premises to safeguard patient privacy. In our work, we present an innovative dataset consisting of over 160,000 entries, specifically crafted to fine-tune LLMs for effective medical applications. We investigate the impact of fine-tuning these datasets on publicly accessible pre-trained LLMs, and subsequently, we juxtapose the performance of pre-trained-only models against the fine-tuned models concerning the examinations that future medical doctors must pass to achieve certification.
This paper presents a systematic overview of parameter-efficient fine-tuning methods, covering over 50 papers published between early 2019 and mid-2024. These methods aim to address the challenges of fine-tuning large language models by training only a small subset of parameters. We provide a taxonomy that covers a broad range of methods and present a detailed method comparison with a specific focus on real-life efficiency in fine-tuning multibillion-scale language models. We also conduct an extensive head-to-head experimental comparison of 15 diverse PEFT methods, evaluating their performance and efficiency on models up to 11B parameters. Our findings reveal that methods previously shown to surpass a strong LoRA baseline face difficulties in resource-constrained settings, where hyperparameter optimization is limited and the network is fine-tuned only for a few epochs. Finally, we provide a set of practical recommendations for using PEFT methods and outline potential future research directions.
Recent developments in Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) have achieved superior performance and generalization capabilities, covering extensive areas of traditional tasks. However, existing large-model training frameworks support only a limited number of models and techniques, particularly lacking support for new models, which makes fine-tuning LLMs challenging for most developers. Therefore, we develop SWIFT, a customizable one-stop infrastructure for large models. With support for over 350 LLMs and 80 MLLMs, SWIFT stands as the open-source framework providing the most comprehensive support for fine-tuning large models. In particular, it is the first training framework that provides systematic support for MLLMs. Moreover, SWIFT integrates post-training processes such as inference, evaluation, and quantization to facilitate fast adoption of large models in various application scenarios, offering helpful utilities like benchmark comparisons among different training techniques.
Large language models (LLMs) with hundreds of billions of parameters require powerful server-grade GPUs for inference, limiting their practical deployment. To address this challenge, we introduce the outlier-aware weight quantization (OWQ) method, which aims to minimize LLM's footprint through low-precision representation. OWQ prioritizes a small subset of structured weights sensitive to quantization, storing them in high-precision, while applying highly tuned quantization to the remaining dense weights. This sensitivity-aware mixed-precision scheme reduces the quantization error notably, and extensive experiments demonstrate that 3.1-bit models using OWQ perform comparably to 4-bit models optimized by OPTQ. Furthermore, OWQ incorporates a parameter-efficient fine-tuning for task-specific adaptation, called weak column tuning (WCT), enabling accurate task-specific LLM adaptation with minimal memory overhead in the optimized format. OWQ represents a notable advancement in the flexibility, efficiency, and practicality of LLM optimization literature. The source code is available at https://github.com/xvyaward/owq.
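The mixed-precision scheme, keeping a few quantization-sensitive columns in full precision while aggressively quantizing the dense remainder, can be sketched as below; the column-norm sensitivity proxy and the uniform grid are illustrative stand-ins for OWQ's actual sensitivity metric and quantizer:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(8, 16)).astype(np.float32)

# Sensitivity scores mark a few columns as quantization-sensitive; here
# the per-column max magnitude stands in for the paper's metric.
sensitivity = np.abs(W).max(axis=0)
keep = np.argsort(sensitivity)[-2:]  # top-2 columns stay high-precision

def quantize_low_bit(x):
    """Coarse uniform quantization (a rough 3-bit-style grid)."""
    s = np.abs(x).max()
    return np.round(x / s * 3) / 3 * s

W_q = quantize_low_bit(W)
W_q[:, keep] = W[:, keep]  # restore the sensitive (outlier) columns exactly

print(np.allclose(W_q[:, keep], W[:, keep]))  # True: outlier columns intact
```

Only a small, structured set of columns pays the high-precision storage cost, which is why the overall footprint stays close to a pure low-bit model.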
While large language models have made remarkable advancements in natural language generation, their potential in machine translation, especially when fine-tuned, remains under-explored. In our study, we conduct comprehensive experiments, evaluating 15 publicly available language models on machine translation tasks. We compare the performance across three methodologies: zero-shot prompting, few-shot learning, and fine-tuning. Central to our approach is the use of QLoRA, an efficient fine-tuning method. On French-English, QLoRA fine-tuning outperforms both few-shot learning and models trained from scratch. This superiority is highlighted in both sentence-level and document-level translations, with a significant BLEU score improvement of 28.93 over the prompting method. Impressively, with QLoRA, the enhanced performance is achieved by fine-tuning a mere 0.77% of the model's parameters.
With the continuous growth in the number of parameters of transformer-based pretrained language models (PLMs), particularly the emergence of large language models (LLMs) with billions of parameters, many natural language processing (NLP) tasks have demonstrated remarkable success. However, the enormous size and computational demands of these models pose significant challenges for adapting them to specific downstream tasks, especially in environments with limited computational resources. Parameter Efficient Fine-Tuning (PEFT) offers an effective solution by reducing the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. The demands for fine-tuning PLMs, especially LLMs, have led to a surge in the development of PEFT methods, as depicted in Fig. 1. In this paper, we present a comprehensive and systematic review of PEFT methods for PLMs. We summarize these PEFT methods, discuss their applications, and outline future directions. Furthermore, we conduct experiments using several representative PEFT methods to better understand their effectiveness in parameter efficiency and memory efficiency. By offering insights into the latest advancements and practical applications, this survey serves as an invaluable resource for researchers and practitioners seeking to navigate the challenges and opportunities presented by PEFT in the context of PLMs.
Large Language Models (LLMs) have demonstrated exceptional abilities in comprehending and generating text, motivating numerous researchers to utilize them for Information Extraction (IE) purposes, including Relation Extraction (RE). Nonetheless, most existing methods are predominantly designed for Sentence-level Relation Extraction (SentRE) tasks, which typically encompass a restricted set of relations and triplet facts within a single sentence. Furthermore, certain approaches treat relations as candidate choices integrated into prompt templates, leading to inefficient processing and suboptimal performance when tackling Document-Level Relation Extraction (DocRE) tasks, which entail handling multiple relations and triplet facts distributed across a given document and pose distinct challenges. To overcome these limitations, we introduce AutoRE, an end-to-end DocRE model that adopts a novel RE extraction paradigm named RHF (Relation-Head-Facts). Unlike existing approaches, AutoRE does not rely on the assumption of known relation options, making it more reflective of real-world scenarios. Additionally, we have developed an easily extensible RE framework using a Parameter-Efficient Fine-Tuning (PEFT) algorithm (QLoRA). Our experiments on the RE-DocRED dataset showcase AutoRE's best performance, achieving state-of-the-art results and surpassing TAG by 10.03% and 9.03% on the dev and test sets, respectively. The code and a demonstration video are available.
Recent years have witnessed a rapid development of large language models (LLMs). Despite the strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization meanwhile decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/yuhuixu1993/qa-lora.
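The degrees-of-freedom trade-off can be sketched roughly: per-group quantization scales on one side, and a LoRA branch whose input is pooled within each group on the other. All shapes, the pooling, and the names below are illustrative assumptions, not the QA-LoRA implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, d_out, groups, r = 16, 8, 4, 2
W = rng.normal(size=(d_out, d_in)).astype(np.float32)

# Group-wise quantization: one scale per (output row, input group),
# which raises the degrees of freedom on the quantization side.
Wg = W.reshape(d_out, groups, d_in // groups)
scales = np.abs(Wg).max(axis=2, keepdims=True)
W_q = (np.round(Wg / scales * 7) / 7 * scales).reshape(d_out, d_in)

# LoRA side: the input to A is average-pooled within each group. This
# lowers the adapter's degrees of freedom so the learned update can be
# folded into the per-group quantization parameters after training.
A = rng.normal(size=(groups, r)) * 0.01
B = np.zeros((r, d_out))

def qalora_forward(x):
    pooled = x.reshape(groups, d_in // groups).mean(axis=1)
    return W_q @ x + (pooled @ A @ B)

x = rng.normal(size=d_in)
print(qalora_forward(x).shape)  # (8,)
```

Because the adapter only sees one value per input group, its merged update is constant within each group, matching the granularity of the group-wise quantization constants.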
Large Language Models (LLMs), such as LLaMA and T5, have shown exceptional performance across various tasks through fine-tuning. Although low-rank adaptation (LoRA) has emerged to cheaply fine-tune these LLMs on downstream tasks, their deployment is still hindered by the vast model scale and computational costs. Post-training model pruning offers a way to compress LLMs. However, the current pruning methods designed for LLMs are not compatible with LoRA. This is due to their utilization of unstructured pruning on LLMs, impeding the merging of LoRA weights, or their dependence on the gradients of pre-trained weights to guide pruning, which can impose significant memory overhead. To this end, we propose LoRAPrune, a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner. Specifically, we first design a LoRA-guided pruning criterion, which uses the weights and gradients of LoRA, rather than the gradients of pre-trained weights, for importance estimation. We subsequently integrate this criterion into an iterative pruning process, effectively removing redundant channels and heads. Extensive experimental results demonstrate the superior performance of our LoRAPrune over existing approaches on the LLaMA series models. At a 50% compression rate, LoRAPrune demonstrates superior performance over LLM-Pruner, achieving a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%. Besides, LoRAPrune also matches semi-structural pruning across multiple LLMs, proving its wide applicability. The code is available at https://github.com/aim-uofa/LoRAPrune.
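The LoRA-guided criterion can be sketched as follows; the Taylor-style score and the low-rank gradient approximation below are a rough illustration under assumed shapes, not the paper's exact formulation or iterative schedule:

```python
import numpy as np

rng = np.random.default_rng(6)
d_out, d_in, r = 8, 16, 4
W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01
B = rng.normal(size=(d_out, r)) * 0.01
gA = rng.normal(size=A.shape)        # LoRA gradients from backprop
gB = rng.normal(size=B.shape)

# LoRA-guided importance: approximate dL/dW from the low-rank gradients
# (gB @ A + B @ gA), so the frozen weight's own gradient is never stored,
# then score each input channel with a Taylor-style |w * dL/dw| sum.
approx_grad = gB @ A + B @ gA
importance = np.abs((W + B @ A) * approx_grad).sum(axis=0)

# Structured pruning: drop the lowest-scoring input channels.
n_prune = 4
pruned = np.argsort(importance)[:n_prune]
mask = np.ones(d_in, dtype=bool)
mask[pruned] = False
print(int(mask.sum()))  # 12 channels kept out of 16
```

The memory saving comes from the gradient approximation: only the small LoRA matrices and their gradients are held, never a full gradient of W.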
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
No abstract
Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM
Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning. In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together on a pre-trained model. In such cases it is common to observe a consistent gap in the performance on downstream tasks between full fine-tuning and quantization plus LoRA fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrepancy between the quantized and full-precision model and significantly improves generalization in downstream tasks. We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed precision regimes. The code is available on https://github.com/yxli2123/LoftQ.
Retrieval-augmented generation (RAG) has effectively mitigated the hallucination problem of large language models (LLMs). However, the difficulty of aligning the retriever with the LLMs' diverse knowledge preferences inevitably poses a challenge in developing a reliable RAG system. To address this issue, we propose DPA-RAG, a universal framework designed to align diverse knowledge preferences within RAG systems. Specifically, we initially introduce a preference knowledge construction pipeline and incorporate five novel query augmentation strategies to alleviate preference data scarcity. Based on preference data, DPA-RAG accomplishes both external and internal preference alignment: 1) It jointly integrates pairwise, pointwise, and contrastive preference alignment abilities into the reranker, achieving external preference alignment among RAG components. 2) It further introduces a pre-aligned stage before vanilla Supervised Fine-tuning (SFT), enabling LLMs to implicitly capture knowledge aligned with their reasoning preferences, achieving LLMs' internal alignment. Experimental results across four knowledge-intensive QA datasets demonstrate that DPA-RAG outperforms all baselines and seamlessly integrates both black-box and open-sourced LLM readers. Further qualitative analysis and discussions provide empirical guidance for achieving reliable RAG systems. Our code and example dataset are available at https://github.com/dongguanting/DPA-RAG.
No abstract
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
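The shape of the objective can be illustrated with a schematic numpy sketch: the implied reward is the policy/reference log-ratio, and a weighted sigmoid value function is applied around a reference point. The reference point `z0` here is a crude clipped batch mean (our simplification; the paper estimates it from a KL term), and all inputs are made-up numbers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kto_loss(logp_policy, logp_ref, desirable, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """Schematic KTO objective over a batch of (x, y) examples, each labeled
    only as desirable or undesirable -- no preference pairs required."""
    r = beta * (logp_policy - logp_ref)   # implied reward: policy/reference log-ratio
    z0 = max(float(r.mean()), 0.0)        # crude reference point (assumption)
    value = np.where(desirable,
                     lambda_d * sigmoid(r - z0),    # gains valued above z0
                     lambda_u * sigmoid(z0 - r))    # losses valued below z0
    weights = np.where(desirable, lambda_d, lambda_u)
    return float((weights - value).mean())

logp_policy = np.array([-4.0, -9.0, -5.0, -8.0])    # made-up sequence log-probs
logp_ref    = np.array([-6.0, -6.0, -6.0, -6.0])
desirable   = np.array([True, False, True, False])  # the binary signal
loss = kto_loss(logp_policy, logp_ref, desirable)
```

Note the asymmetry inherited from prospect theory: raising the reward of a desirable output and lowering that of an undesirable one are valued through opposite sides of the sigmoid.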
No abstract
There is a compelling necessity from enterprises for fine-tuning LLMs (Large Language Models) to get them trained on proprietary domain knowledge. The challenge is to imbue the LLMs with domain-specific knowledge using the most optimal resources and cost and in the best possible time. Many enterprises rely on RAG (Retrieval Augmented Generation), which does not need LLMs to be fine-tuned but is limited by the quality of vector databases and their retrieval capabilities rather than the intrinsic capabilities of the LLMs themselves. In our current work we focus on fine-tuning LLaMA, an open-source LLM, using proprietary documents and code from an enterprise repository and use the fine-tuned models to evaluate the quality of responses. As part of this work, we aim to guide beginners on how to start fine-tuning an LLM for documentation and code by making educated guesses on the size of GPU required and the options available for formatting the data. We also propose pre-processing recipes for both documentation and code to prepare datasets in different formats. The proposed methods of data preparation for document datasets are forming paragraph chunks, forming question-and-answer pairs, and forming keyword and paragraph-chunk pairs. For the code dataset we propose forming summary and function pairs. Further, we qualitatively evaluate the results of the models for domain-specific queries. Finally, we also propose practical guidelines and recommendations for fine-tuning LLMs.
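The first document recipe, paragraph chunking, might look like the following greedy packer (a sketch with a hypothetical `max_chars` budget, not the authors' code): whole paragraphs are packed into a chunk until adding the next one would exceed the budget.

```python
def paragraph_chunks(text, max_chars=400):
    """Greedy paragraph chunking: pack whole paragraphs into chunks of at
    most max_chars characters (hypothetical budget)."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paras:
        # start a new chunk when the next paragraph would overflow the budget
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "Intro paragraph.\n\nSecond paragraph with details.\n\nThird."
chunks = paragraph_chunks(doc, max_chars=40)
```

Each chunk can then be paired with a generated question or keyword to form the question-and-answer and keyword-paragraph recipes described above.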
Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because of avoiding additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding. Code is available at https://github.com/NVlabs/DoRA.
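The decomposition can be sketched in a few lines: a trainable magnitude per column times the unit direction of the LoRA-updated weight. This is a schematic illustration with toy shapes, not the NVlabs implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 32, 4
W0 = rng.normal(size=(d_out, d_in))            # frozen pre-trained weight
m = np.linalg.norm(W0, axis=0, keepdims=True)  # trainable magnitude, per column
B = np.zeros((d_out, r))                       # LoRA factors update the direction;
A = rng.normal(size=(r, d_in)) * 0.01          # B = 0 is the standard zero init

def dora_weight(W0, m, B, A):
    """Merged DoRA weight: magnitude times the unit direction of the
    LoRA-updated weight (column-wise decomposition)."""
    V = W0 + B @ A
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)

W = dora_weight(W0, m, B, A)   # at init this reproduces W0 exactly
```

Because the merged weight is a plain matrix, inference cost is identical to the base model, which is the no-overhead property the abstract emphasizes.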
Large language models (LLMs) are proficient in capturing factual knowledge across various domains. However, refining their capabilities on previously seen knowledge or integrating new knowledge from external sources remains a significant challenge. In this work, we propose a novel synthetic knowledge ingestion method called Ski, which leverages fine-grained synthesis, interleaved generation, and assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. We then integrate Ski and its variations with three knowledge injection techniques: Retrieval Augmented Generation (RAG), Supervised Fine-tuning (SFT), and Continual Pre-training (CPT) to inject and refine knowledge in language models. Extensive empirical experiments are conducted on various question-answering tasks spanning finance, biomedicine, and open-generation domains to demonstrate that Ski significantly outperforms baseline methods by facilitating effective knowledge injection. We believe that our work is an important step towards enhancing the factual accuracy of LLM outputs by refining knowledge representation and injection capabilities.
Large Language Models (LLMs), such as ChatGPT, LLaMA, GLM, and PaLM, have exhibited remarkable performances across various tasks in recent years. However, LLMs face two main challenges in real-world applications. One challenge is that training LLMs consumes vast computing resources, preventing LLMs from being adopted by small and medium-sized enterprises with limited computing resources. Another is that training LLM requires a large amount of high-quality data, which are often scattered among enterprises. To address these challenges, we propose FATE-LLM, an industrial-grade federated learning framework for large language models. FATE-LLM (1) facilitates federated learning for large language models (coined FedLLM); (2) promotes efficient training of FedLLM using parameter-efficient fine-tuning methods; (3) protects the intellectual property of LLMs; (4) preserves data privacy during training and inference through privacy-preserving mechanisms. We release the code of FATE-LLM at https://github.com/FederatedAI/FATE-LLM to facilitate the research of FedLLM and enable a broad range of industrial applications.
Large Language Models (LLMs), such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards Artificial General Intelligence (AGI). Nonetheless, the high costs associated with training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several large language models, such as LLaMA, have been open-sourced by the community, these predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with capabilities for understanding and generating Chinese text and its ability to follow instructions. We achieve this by extending LLaMA's existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets, significantly enhancing the model's ability to comprehend and execute instructions. Our experimental results indicate that the newly proposed model markedly enhances the original LLaMA's proficiency in understanding and generating Chinese content. Additionally, the results on the C-Eval dataset yield competitive performance among the models with several times the size of ours. We have made our pre-trained models, training scripts, and other resources available through GitHub, fostering open research for our community. Chinese LLaMA series: \url{https://github.com/ymcui/Chinese-LLaMA-Alpaca} and Chinese Llama-2 series: \url{https://github.com/ymcui/Chinese-LLaMA-Alpaca-2}
Large language models (LLMs), which have shown remarkable capabilities, are revolutionizing AI development and potentially shaping our future. However, given their multimodality, the status quo cloud-based deployment faces some critical challenges: 1) long response time; 2) high bandwidth costs; and 3) the violation of data privacy. 6G mobile edge computing (MEC) systems may resolve these pressing issues. In this article, we explore the potential of deploying LLMs at the 6G edge. We start by introducing killer applications powered by multimodal LLMs, including robotics and healthcare, to highlight the need for deploying LLMs in the vicinity of end users. Then, we identify the critical challenges for LLM deployment at the edge and envision the 6G MEC architecture for LLMs. Furthermore, we delve into two design aspects, i.e., edge training and edge inference for LLMs. In both aspects, considering the inherent resource limitations at the edge, we discuss various cutting-edge techniques, including split learning/inference, parameter-efficient fine-tuning, quantization, and parameter-sharing inference, to facilitate the efficient deployment of LLMs. This article serves as a position paper for thoroughly identifying the motivation, challenges, and pathway for empowering LLMs at the 6G edge.
Adapting pretrained language models to novel domains, such as clinical applications, traditionally involves retraining their entire set of parameters. Parameter-Efficient Fine-Tuning (PEFT) techniques for fine-tuning language models significantly reduce computational requirements by selectively fine-tuning small subsets of parameters. In this study, we propose a two-step PEFT framework and evaluate it in the clinical domain. Our approach combines a specialised PEFT adapter layer designed for clinical domain adaptation with another adapter specialised for downstream tasks. We evaluate the framework on multiple clinical outcome prediction datasets, comparing it to clinically trained language models. Our framework achieves a better AUROC score averaged across all clinical downstream tasks compared to clinical language models. In particular, we observe large improvements of 4-5% AUROC in large-scale multilabel classification tasks, such as diagnoses and procedures classification. To our knowledge, this study is the first to provide an extensive empirical analysis of the interplay between PEFT techniques and domain adaptation in an important real-world domain of clinical applications.
Parameter-efficient fine-tuning methods, represented by LoRA, play an essential role in adapting large-scale pre-trained models to downstream tasks. However, fine-tuning LoRA-series models also faces the risk of overfitting on the training dataset, and yet there's still a lack of theoretical guidance and practical mechanism to control overfitting on LoRA-based PEFT methods. In this paper, we propose a LoRA Dropout mechanism for the LoRA-based methods by introducing random noises to the learnable low-rank matrices and increasing parameter sparsity. We then demonstrate the theoretical mechanism of our LoRA Dropout mechanism from the perspective of sparsity regularization by providing a generalization error bound under this framework. Theoretical results show that appropriate sparsity would help tighten the gap between empirical and generalization risks and thereby control overfitting. Furthermore, based on the LoRA Dropout framework, we introduce a test-time ensemble strategy and provide theoretical evidence demonstrating that the ensemble method can further compress the error bound, and lead to better performance during inference time. Extensive experiments on various NLP tasks provide practical validations of the effectiveness of our LoRA Dropout framework in improving model accuracy and calibration.
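The mechanism can be sketched as follows (our own illustration with made-up shapes, not the paper's code): dropout is applied to rows/columns of the low-rank factors themselves, sparsifying the tunable parameters rather than the activations, and at test time several dropout-perturbed forward passes are averaged as an ensemble.

```python
import numpy as np

def lora_dropout_delta(x, A, B, p=0.5, rng=None, train=True):
    """LoRA update x @ (B A)^T with dropout on the low-rank factors
    (input dims of A and output dims of B randomly zeroed and rescaled)."""
    if train:
        rng = rng if rng is not None else np.random.default_rng()
        keep_in = (rng.random(A.shape[1]) > p) / (1.0 - p)   # mask A's input dims
        keep_out = (rng.random(B.shape[0]) > p) / (1.0 - p)  # mask B's output dims
        A = A * keep_in[None, :]
        B = B * keep_out[:, None]
    return x @ (B @ A).T

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 32))
A = rng.normal(size=(4, 32))
B = rng.normal(size=(16, 4))

# test-time ensemble: average several dropout-perturbed forward passes
ensemble = np.mean([lora_dropout_delta(x, A, B, rng=np.random.default_rng(s))
                    for s in range(8)], axis=0)
clean = lora_dropout_delta(x, A, B, train=False)
```

The `1/(1-p)` rescaling keeps the update unbiased in expectation, so the ensemble average approaches the clean update as more samples are drawn.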
Recent advances in Large Language Models (LLMs) have achieved remarkable breakthroughs in understanding and responding to user intents. However, their performance lags behind general use cases in some specialist domains, such as Chinese medicine. Existing efforts to incorporate Chinese medicine into LLMs rely on Supervised Fine-Tuning (SFT) with single-turn and distilled dialogue data. These models lack the ability for doctor-like proactive inquiry and multi-turn comprehension and cannot align responses with experts' intentions. In this work, we introduce Zhongjing, the first Chinese medical LLaMA-based LLM that implements an entire training pipeline from continuous pre-training and SFT to Reinforcement Learning from Human Feedback (RLHF). Additionally, we construct a Chinese multi-turn medical dialogue dataset of 70,000 authentic doctor-patient dialogues, CMtMedQA, which significantly enhances the model's capability for complex dialogue and proactive inquiry initiation. We also define a refined annotation rule and evaluation criteria given the unique characteristics of the biomedical domain. Extensive experimental results show that Zhongjing outperforms baselines in various capacities and matches the performance of ChatGPT in some abilities, despite ChatGPT having roughly 100x more parameters. Ablation studies also demonstrate the contributions of each component: pre-training enhances medical knowledge, and RLHF further improves instruction-following ability and safety. Our code, datasets, and models are available at https://github.com/SupritYoung/Zhongjing.
Aligned language models face a significant limitation as their fine-tuning often results in compromised safety. To tackle this, we propose a simple method, RESTA, that performs LLM safety realignment. RESTA stands for REstoring Safety through Task Arithmetic. At its core, it involves a simple arithmetic addition of a safety vector to the weights of the compromised model. We demonstrate the effectiveness of RESTA in both parameter-efficient and full fine-tuning, covering a wide range of downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math. We also showcase the generalizability of RESTA on three existing safety evaluation benchmarks and a multilingual benchmark dataset proposed as a part of this work, consisting of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm. Overall, RESTA decreases the harmfulness of the compromised model from 18.6% to 5.1% and from 9.2% to 1.5% in parameter-efficient and full fine-tuning, respectively, while maintaining most of the model's performance on the task. We release the source code at: https://github.com/declare-lab/resta.
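The arithmetic at RESTA's core can be sketched with stand-in weight vectors. Reading the safety vector as an aligned-minus-unaligned weight delta is our interpretation of the task-arithmetic framing, and the scaling factor `alpha` is hypothetical; this is not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                                              # toy parameter count
theta_unaligned = rng.normal(size=n)                 # weights before safety training
theta_aligned = theta_unaligned + rng.normal(scale=0.1, size=n)
theta_finetuned = theta_aligned + rng.normal(scale=0.1, size=n)  # drifted by task SFT

# safety vector: aligned-minus-unaligned weight delta (task-arithmetic reading)
safety_vector = theta_aligned - theta_unaligned
alpha = 1.0                                          # hypothetical scaling factor
theta_restored = theta_finetuned + alpha * safety_vector
```

The appeal of the approach is that realignment is a single vector addition over the fine-tuned weights, with no further training.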
Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory.
Multimodal foundation models are transformative in sequential recommender systems, leveraging powerful representation learning capabilities. While Parameter-efficient Fine-tuning (PEFT) is commonly used to adapt foundation models for recommendation tasks, most research prioritizes parameter efficiency, often overlooking critical factors like GPU memory efficiency and training speed. Addressing this gap, our paper introduces IISAN (Intra- and Inter-modal Side Adapted Network for Multimodal Representation), a simple plug-and-play architecture using a Decoupled PEFT structure and exploiting both intra- and inter-modal adaptation. IISAN matches the performance of full fine-tuning (FFT) and state-of-the-art PEFT. More importantly, it significantly reduces GPU memory usage - from 47GB to just 3GB for multimodal sequential recommendation tasks. Additionally, it accelerates training time per epoch from 443s to 22s compared to FFT. This is also a notable improvement over the Adapter and LoRA, which require 37-39 GB GPU memory and 350-380 seconds per epoch for training. Furthermore, we propose a new composite efficiency metric, TPME (Training-time, Parameter, and GPU Memory Efficiency) to alleviate the prevalent misconception that "parameter efficiency represents overall efficiency". TPME provides more comprehensive insights into practical efficiency comparisons between different methods. Besides, we give an accessible efficiency analysis of all PEFT and FFT approaches, which demonstrate the superiority of IISAN. We release our codes and other materials at https://github.com/GAIR-Lab/IISAN.
Jiahuan Pei, Irene Viola, Haochen Huang, Junxiao Wang, Moonisa Ahsan, Fanghua Ye, Jiang Yiming, Yao Sai, Di Wang, Zhumin Chen, Pengjie Ren, Pablo Cesar. Findings of the Association for Computational Linguistics: ACL 2024. 2024.
Integrating large language models (LLMs) into health care holds substantial potential to enhance clinical workflows and care delivery. However, LLMs also pose serious risks if integration is not thoughtfully executed, with complex challenges spanning accuracy, accessibility, privacy, and regulation. Proprietary commercial LLMs (eg, GPT-4 [OpenAI], Claude 3 Sonnet and Claude 3 Opus [Anthropic], Gemini [Google]) have received much attention from researchers in the medical domain, including radiology. Interestingly, open-source LLMs (eg, Llama 3 and LLaVA-Med) have received comparatively little attention. Yet, open-source LLMs hold several key advantages over proprietary LLMs for medical institutions, hospitals, and individual researchers. The wider adoption of open-source LLMs has been slower, perhaps in part due to the lack of familiarity, accessible computational infrastructure, and community-built tools to streamline their local implementation and customize them for specific use cases. Thus, this article provides a tutorial for the implementation of open-source LLMs in radiology, including examples of commonly used tools for text generation and techniques for troubleshooting issues with prompt engineering, retrieval-augmented generation, and fine-tuning. Implementation-ready code for each tool is provided at <i>https://github.com/UM2ii/Open-Source-LLM-Tools-for-Radiology</i>. In addition, this article compares the benefits and drawbacks of open-source and proprietary LLMs, discusses the differentiating characteristics of popular open-source LLMs, and highlights recent advancements that may affect their adoption.
This report introduces \texttt{EEVE-Korean-v1.0}, a Korean adaptation of large language models that exhibit remarkable capabilities across English and Korean text understanding. Building on recent highly capable but English-centric LLMs, such as SOLAR-10.7B and Phi-2, where non-English texts are inefficiently processed with English-centric tokenizers, we present an efficient and effective vocabulary expansion (EEVE) method, which encompasses parameter freezing and subword initialization. In contrast to previous efforts that believe new embeddings require trillions of training tokens, we show that our method can significantly boost non-English proficiency within just 2 billion tokens. Surpassing most instruction-tuned LLMs on the Open Ko-LLM Leaderboard, as of January 2024, our model \texttt{EEVE-Korean-10.8B-v1.0} ranks as the leading Korean pre-trained model in the open-source community, according to Hugging Face's leaderboard. We open-source our models on Huggingface to empower the open research community in various languages.
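The subword-initialization idea can be sketched as follows (a schematic with made-up token ids, not the EEVE code): a newly added token's embedding starts as the mean of the frozen embeddings of the old subword pieces that previously spelled it out, rather than a random vector.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 100, 8
old_emb = rng.normal(size=(vocab_size, dim))   # existing (frozen) embedding table

# a new token's embedding = mean of the old subword pieces it replaces
# (piece ids are made up for illustration)
piece_ids = [17, 42, 63]
new_row = old_emb[piece_ids].mean(axis=0)
emb = np.vstack([old_emb, new_row])            # expanded embedding table
```

Starting near the composition of its pieces gives the new token a sensible representation before any of the 2 billion adaptation tokens are seen, while the original rows can stay frozen.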
A single language model, even when aligned with labelers through reinforcement learning from human feedback (RLHF), may not suit all human preferences. Recent approaches therefore prefer customization, gathering multidimensional feedback, and creating distinct reward models for each dimension. Different language models are then optimized for various preferences using multi-objective RLHF (MORLHF) with varying reward weights. However, RL fine-tuning is unstable and resource-heavy, especially with diverse and usually conflicting objectives. In this paper, we present Multi-Objective Direct Preference Optimization (MODPO), an RL-free extension of Direct Preference Optimization (DPO) for multiple alignment objectives. Essentially, MODPO folds language modeling directly into reward modeling, training language models as implicit collective reward models that combine all objectives with specific weights. MODPO theoretically yields the same optimal solutions as MORLHF but is practically more stable and efficient. Empirical results in safety alignment and long-form question answering show that MODPO matches or outperforms existing methods, producing a Pareto front of language models catering to diverse preferences with three times less computational resources compared to MORLHF. Code is available at https://github.com/ZHZisZZ/modpo.
# LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

## 💡 Overview

Artificial Intelligence (AI) is revolutionizing scientific research, but its growing integration into laboratory environments brings critical safety challenges. As large language models (LLMs) and vision language models (VLMs) are increasingly used for procedural guidance and even autonomous experiment orchestration, there is a risk of an "illusion of understanding" where users may overestimate the reliability of these systems in safety-critical situations.

**LabSafety Bench** is a comprehensive evaluation framework designed to rigorously assess the trustworthiness of these models in laboratory settings. The benchmark includes two main evaluation components:

- **Multiple-Choice Questions (MCQs):** A set of 765 questions derived from authoritative lab safety protocols, comprising 632 text-only questions and 133 multimodal questions.
- **Real-World Scenario Evaluations:** A collection of 404 realistic laboratory scenarios that yield a total of 3128 open-ended questions, organized into:
  - **Hazards Identification Test:** Models identify all potential hazards in a given scenario.
  - **Consequence Identification Test:** Models predict the outcomes of executing specific hazardous actions.

Developed via expert-AI collaboration using sources such as OSHA, WHO, and established textbooks, LabSafety Bench ensures that every evaluation item is verified for clarity, accuracy, and practical relevance. For more details, please visit our [project website](https://yujunzhou.github.io/LabSafetyBench.github.io/).

### LabSafety Bench Overview

## 🔧 Installation

Install the required Python packages by running:

```bash
pip install -r requirements.txt
```

### Additional Setup

For SFT (Supervised Fine-Tuning), please follow [@LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) to install LLaMA-Factory.
For ChemCrow evaluation, please follow [@ChemCrow](https://github.com/ur-whitelab/chemcrow-public) and create a new environment for evaluation.

## 📖 Dataset Usage

### Data Downloading

The dataset is divided into five splits:

- **QA**: 632 text-only examples for standard evaluation.
- **QA_I**: 133 multimodal examples for standard evaluation.
- **sampledQA**: 80 text-only examples suitable for human evaluation, validation, or low-resource scenarios.
- **sampledQA_I**: 20 multimodal examples for similar use cases.
- **scenario**: 404 real-world scenarios combined with 3128 open-ended questions.

After installing [Huggingface Datasets](https://huggingface.co/docs/datasets/quickstart), download the dataset by running:

```python
from datasets import load_dataset

# Load MCQ configuration (default configuration)
MCQ_dataset = load_dataset("yujunzhou/LabSafety_Bench", name="MCQ")

# Or load a specific split from MCQ configuration
QA_split = load_dataset("yujunzhou/LabSafety_Bench", name="MCQ", split="QA")

# Load scenario configuration
scenario_dataset = load_dataset("yujunzhou/LabSafety_Bench", name="scenario", split="scenario")
```

### Data Format

#### MCQ Configuration ("MCQ")

Each sample in the MCQ configuration is a dictionary containing the following keys:

- **Question**: *string* A multiple-choice question with four options.
- **Explanation**: *string* A detailed explanation outlining why the correct answer is right and why the other options are not.
- **Correct Answer**: *string* The correct option (one of 'A', 'B', 'C', or 'D').
- **Category**: *list of strings* The lab safety category covered by the question.
- **Topic**: *string* A brief descriptor identifying the main hazard or equipment involved.
- **Level**: *string* "Easy" or "Hard", indicating whether the question can be answered with high school-level knowledge.
- **Image Path**: *string* The image file path for multimodal questions (None for text-only questions).
- **Decoded Image**: *Image* The actual image for multimodal questions.

### Example Question Display

#### Scenario Configuration ("scenario")

Each sample in the scenario configuration is a dictionary containing the following keys:

- **Scenario**: *string* A detailed description of the laboratory scenario.
- **LabSafety_Related_Issues**: *dict* Contains:
  - **Most_Common_Hazards**: *list of strings*
  - **Improper_Operation_Issues**: *list of strings*
  - **Negative_Lab_Environment_Impacts**: *list of strings*
  - **Most_Likely_Safety_Incidents**: *list of strings*
- **Topic**: *string* A brief descriptor identifying the main hazard or equipment involved.
- **SubCategory**: *string* A subcategory label.
- **Decisions**: *list of dicts* Each dictionary contains:
  - **Decision**: *string*
  - **Consequence**: *string*
- **Subject**: *string* A subject label.

## 📝 Evaluations

### 1. API Key Setup

Ensure that you have configured your OpenAI API key and any other required keys (e.g., for Claude or Gemini) in the `config.py` file.

### 2. Evaluations of Multiple-Choice Questions

LabSafety Bench supports evaluations for both text-only and multimodal tasks. Predefined models for text-only evaluations include, but are not limited to:

- **LLMs**: 'llama3-instruct-8b', 'vicuna-7b', 'mistral-7b', etc.
- **VLMs (for multimodal tasks)**: 'instructBlip-7B', 'Qwen-VL-Chat', 'InternVL2', etc.

Example commands for text-only MCQ evaluation on the sampled MCQ dataset:

```sh
cd code/test
python text_QA.py \
    --models gpt-4o-mini,o3-mini \
    --mode CoT \
    --n_shots 0 \
    --sampled
```

For text-with-image MCQ evaluation:

```sh
python text_with_image_QA.py \
    --model_name gpt-4o-mini \
    --CoT \
    --n_shots 0
```

Additional scripts such as `code/analysis/category_acc.py` and `code/analysis/level_acc.py` provide detailed breakdowns by safety category and difficulty level.

### 3. Evaluation of Real-World Scenario Tasks

The benchmark includes two additional real-world evaluation tasks:

- **Hazards Identification Test**: Assess the model's ability to comprehensively list potential hazards in realistic lab scenarios.
- **Consequence Identification Test**: Evaluate the model's capability to predict the outcomes of specific hazardous actions in a given scenario.

These tasks simulate dynamic and practical lab environments, addressing the critical need to ensure that AI systems are reliable when making safety-critical decisions.

Example commands for real-world scenario-based evaluation:

For the scenario identification test:

```sh
python scenario_hazards.py \
    --models gpt-4o-mini,o3-mini,llama3.3-70b \
    --mode DA
```

For the consequence identification test:

```sh
python decision_consequnce.py \
    --models gpt-4o-mini,o3-mini,llama3.3-70b \
    --mode CoT
```

For scenario hazards evaluation with set points:

```sh
python scenario_hazards_set_points.py \
    --models gpt-4o-mini \
    --mode DA \
    --num_points 10
```

### 4. Evaluation of Additional Models

To evaluate open-weight models not included in the predefined list in `code/config.py`, follow these steps:

1. **Configure Model Paths**: First, add your model to `code/config.py` by setting the model name and path correspondence:

   ```python
   model_path_dicts = {
       # ... existing models ...
       "your-model-name": "/path/to/your/model",
       "another-model": "/path/to/another/model"
   }
   ```

2. **Run Evaluations**: After configuring the model paths, run the evaluations from **Section 2 (Multiple-Choice Questions)** and **Section 3 (Real-World Scenario Tasks)** using your model names:

   ```sh
   # Example for MCQ evaluation
   python text_QA.py --models your-model-name --mode CoT --n_shots 0

   # Example for scenario evaluation
   python scenario_hazards.py --models your-model-name --mode DA
   ```

3. **Advanced Customization**: If needed, you can also modify the model loading and inference procedures in `code/utils` and adjust the corresponding evaluation scripts for specialized model architectures.

## 🚀 SFT Training and Evaluation

For all SFT settings, please first use LLaMA-Factory for training. The training datasets are located in `llamafactory_data`, which also includes `sft.yaml` as an SFT template. You only need to modify the dataset and output_dir to use it directly.

### Training with LLaMA-Factory

1. **Configure Dataset Registration**: First, modify the `LLaMA-Factory/data/dataset_info.json` file to register our SFT datasets.

2. **Modify Training Configuration**: Navigate to your LLaMA-Factory installation directory and modify the `sft.yaml` configuration file in `llamafactory_data` with your desired dataset and output directory.

3. **Run Training**:

   ```bash
   llamafactory-cli train sft.yaml
   ```

4. **Update Model Configuration**: After training completion, modify `code/config.py` to add the trained model path and name correspondence:

   ```python
   model_path_dicts = {
       # ... existing models ...
       "labsafety-text-qa": "/path/to/your/fine-tuned/text-qa-model",
       "labsafety-scenario": "/path/to/your/fine-tuned/scenario-model",
       "labsafety-decision": "/path/to/your/fine-tuned/decision-model"
   }
   ```

### Post-Training Evaluation

After training completion, use the following specialized SFT evaluation scripts for testing:

For MCQ evaluation with fine-tuned models:

```sh
cd code/test
python text_QA_sft.py \
    --models labsafety-text-qa \
    --mode CoT
```

For scenario hazards evaluation with fine-tuned models:

```sh
python scenario_hazards_sft.py \
    --models labsafety-scenario \
    --mode DA
```

For consequence identification with fine-tuned models:

```sh
python decision_consequence_sft.py \
    --models labsafety-decision \
    --mode CoT
```

These evaluation scripts are based on the existing `scenario_hazards_sft.py`, `decision_consequence_sft.py`, and `text_QA_sft.py` files, which have been specifically adapted for fine-tuned model evaluation with proper model loading and testing procedures.

### Further Analysis

For detailed analysis of results, you can directly use the following evaluation scripts:

- `code/analysis/category_acc.py` - Analyze accuracy by safety categories
- `co
Extracting specific information from diverse and complex image types is a challenging task, especially in real-world settings where images often contain noise, low resolution, occlusion, or poor contrast. Traditional approaches struggle in such cases because they rely heavily on clean, high-quality data, making them unsuitable for cluttered or ambiguous inputs. In this paper, we introduce a novel solution leveraging the Qwen2 multimodal transformer, fine-tuned with the Llama Factory framework, to address these limitations. This approach has practical applications such as extracting product details for e-commerce platforms to improve filters and product detail pages, and retrieving key information from medicine packaging to ensure accurate assortment and categorization. Evaluated on a large dataset with thousands of unique samples, our method shows significant improvements in accurately retrieving information from complex images. The results highlight the robustness and efficiency of our approach, setting a new benchmark for real-world image retrieval tasks.
Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, Yang Liu. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
Training Large Language Models (LLMs) efficiently at scale presents a formidable challenge, driven by their ever-increasing computational demands and the need for enhanced performance. In this work, we introduce Liger-Kernel, an open-source set of Triton kernels developed specifically for LLM training. With kernel optimization techniques such as operation fusion and input chunking, our kernels achieve on average a 20% increase in training throughput and a 60% reduction in GPU memory usage for popular LLMs compared to HuggingFace implementations. In addition, Liger-Kernel is designed with modularity, accessibility, and adaptability in mind, catering to both casual and expert users. Comprehensive benchmarks and integration tests are built in to ensure compatibility, performance, correctness, and convergence across diverse computing environments and model architectures. The source code is available under a permissive license at: github.com/linkedin/Liger-Kernel.
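The input-chunking idea mentioned above can be illustrated without Triton: project hidden states to vocabulary logits and reduce the loss one chunk of tokens at a time, so the full (tokens × vocab) logits matrix is never materialized. This is a minimal NumPy sketch of the principle only, not Liger-Kernel's actual fused implementation; all names and shapes are illustrative.

```python
import numpy as np

def chunked_cross_entropy(hidden, weight, targets, chunk_size=2):
    """Mean cross-entropy over a vocab projection, computed chunk by chunk.

    hidden:  (n_tokens, d) final hidden states
    weight:  (vocab, d) output-projection matrix
    targets: (n_tokens,) gold token ids
    Only a (chunk_size, vocab) slice of logits exists at any time.
    """
    n = hidden.shape[0]
    total = 0.0
    for start in range(0, n, chunk_size):
        h = hidden[start:start + chunk_size]          # (c, d)
        logits = h @ weight.T                         # (c, vocab) — one chunk only
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        t = targets[start:start + chunk_size]
        total += (logsumexp - logits[np.arange(len(t)), t]).sum()
    return total / n
```

The chunked result is numerically identical to the full computation; only peak memory changes.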
Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Tanglifu Tanglifu, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track). 2025.
We introduce the Falcon series: causal decoder-only models with 7B, 40B, and 180B parameters, trained on diverse, high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, allowing us to efficiently pretrain these models on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect. We release a 600B-token extract of our web dataset, as well as the Falcon-7/40/180B models under a permissive license to foster open science and accelerate the development of an open ecosystem of large language models.
Large language models (LLMs) have attracted significant attention in recommendation systems. Current work primarily applies supervised fine-tuning (SFT) to adapt the model for recommendation tasks. However, SFT on positive examples alone limits the model's ability to align with user preferences. To address this, researchers recently introduced Direct Preference Optimization (DPO), which explicitly aligns LLMs with user preferences using offline preference ranking data. However, we found that DPO inherently biases the model towards a few items, exacerbating the filter bubble issue and ultimately degrading user experience. In this paper, we propose SPRec, a novel self-play framework designed to mitigate over-recommendation and improve fairness without requiring additional data or manual intervention. In each self-play iteration, the model undergoes an SFT step followed by a DPO step, treating offline interaction data as positive samples and the predicted outputs from the previous iteration as negative samples. This effectively re-weights the DPO loss function using the model's logits, adaptively suppressing biased items. Extensive experiments on multiple real-world datasets demonstrate SPRec's effectiveness in enhancing recommendation accuracy and fairness. The code is available via https://github.com/RegionCh/SPRec.
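SPRec's per-iteration update plugs self-play negatives into an otherwise standard DPO loss. A minimal sketch of that loss on a single preference pair, assuming sequence log-probabilities are already computed; the function name, argument names, and the beta value are illustrative, not from the paper.

```python
import math

def dpo_loss(pi_pos, ref_pos, pi_neg, ref_neg, beta=0.1):
    """Standard DPO loss on one (positive, negative) pair of sequence
    log-probs under the policy (pi_*) and frozen reference (ref_*).

    In an SPRec-style self-play iteration, the positive would come from
    offline interaction data and the negative from the model's own
    previous-iteration recommendation.
    """
    margin = beta * ((pi_pos - ref_pos) - (pi_neg - ref_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy matches the reference the margin is zero and the loss is log 2; widening the positive margin drives the loss down.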
Responding to the thousands of student questions on online QA platforms each semester has a considerable human cost, particularly in computing courses with rapidly growing enrollments. To address the challenges of scalable and intelligent question-answering (QA), we introduce an innovative solution that leverages open-source Large Language Models (LLMs) from the LLaMA-2 family to ensure data privacy. Our approach combines augmentation techniques such as retrieval augmented generation (RAG), supervised fine-tuning (SFT), and learning from human preference data using Direct Preference Optimization (DPO). Through extensive experimentation on a Piazza dataset from an introductory CS course, comprising 10,000 QA pairs and 1,500 pairs of preference data, we demonstrate a significant 30% improvement in the quality of answers, with RAG being a particularly impactful addition. Our contributions include the development of a novel architecture for educational QA, extensive evaluations of LLM performance utilizing both human assessments and LLM-based metrics, and insights into the challenges and future directions of educational data processing. This work paves the way for the development of AI-TA, an intelligent QA assistant customizable for courses with an online QA platform.
VLMs (Vision-Language Models) extend the capabilities of LLMs (Large Language Models) to accept multimodal inputs. Since it has been verified that LLMs can be induced to generate harmful or inaccurate content through specific test cases (termed Red Teaming), how VLMs perform in similar scenarios, especially with their combination of textual and visual inputs, remains a question. To explore this problem, we present a novel red teaming dataset, RTVLM, which encompasses 12 subtasks (e.g., image misleading, multi-modal jailbreaking, face fairness) under 4 primary aspects (faithfulness, privacy, safety, fairness). Our RTVLM is the first red teaming dataset to benchmark current VLMs on these 4 aspects. Detailed analysis shows that 10 prominent open-source VLMs struggle with red teaming to different degrees, with up to a 31% performance gap from GPT-4V. Additionally, we simply apply red teaming alignment to LLaVA-v1.5 with supervised fine-tuning (SFT) on RTVLM, and this bolsters the model's performance by 10% on the RTVLM test set and 13% on MM-hallu, with no noticeable decline on MM-Bench, surpassing other LLaVA-based models of similar size trained with regular alignment data. This reveals that current open-source VLMs still lack red teaming alignment. Our code and datasets will be open-sourced.
Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).
While reinforcement learning (RL) has been proven essential for tuning large language models (LLMs), it can lead to reward overoptimization (ROO). Existing approaches address ROO by adding KL regularization, requiring computationally expensive hyperparameter tuning. Additionally, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we introduce Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations' and the LLM's rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation. We show the effectiveness of RCfD on three language tasks, achieving comparable performance to carefully tuned baselines while mitigating ROO.
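The objective shift behind RCfD fits in a few lines. A minimal sketch, assuming a squared distance between the model's and the demonstration's rewards (the abstract specifies only a "distance"; the function signatures here are illustrative):

```python
def rcfd_objective(reward_fn, prompt, llm_output, demo_output):
    """Reward Calibration from Demonstration, sketched.

    Instead of maximizing reward_fn(prompt, llm_output) directly, penalize
    the gap between the model's reward and the human demonstration's
    reward for the same prompt. Matching the demonstration's reward level
    removes the incentive to exploit (over-optimize) the reward model.
    """
    gap = reward_fn(prompt, llm_output) - reward_fn(prompt, demo_output)
    return gap ** 2  # minimized when the two rewards coincide
```

An output that over-shoots the demonstration's reward is penalized just like one that under-shoots it, which is exactly what distinguishes this from plain reward maximization.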
In the big-data era, the volume of Internet information is growing explosively, placing higher demands on the precision and efficiency of information retrieval. As enterprise informatization and equipment-management modernization advance, effectively distilling, managing, and exploiting massive enterprise equipment information is important for raising the application value of enterprise equipment knowledge and the utilization efficiency of enterprise resources. This study proposes a system that incorporates the natural-language-processing capability of large language models to understand user queries intelligently and return precise equipment information. Fine-tuning the large language model with P-Tuning v2 substantially improves its ability to recognize and extract keywords in the enterprise-equipment domain. An enterprise equipment knowledge graph serves as a local knowledge base, supplying domain knowledge that the model can learn from as question context. On this basis, prompt engineering is designed to guide the model toward more accurate replies, and the results are evaluated. Experiments show that, compared with using the large language model directly, the knowledge-graph-enhanced model answers enterprise-equipment questions more accurately, providing solid support for building enterprise equipment question-answering systems.
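The knowledge-graph-grounding step described above amounts to injecting retrieved triples into the prompt as question context. A minimal sketch of that prompt assembly; the helper name, template wording, and triple format are illustrative, not from the paper.

```python
def build_prompt(question, kg_facts):
    """Assemble a grounded prompt from knowledge-graph triples.

    kg_facts: iterable of (subject, predicate, object) triples retrieved
    from the local equipment knowledge graph for this question.
    """
    context = "\n".join(f"- {s} {p} {o}" for s, p, o in kg_facts)
    return (
        "Answer using only the equipment facts below.\n"
        f"Facts:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The assembled string is then sent to the fine-tuned model, so its answer is conditioned on graph facts rather than on parametric memory alone.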
Traffic-flow prediction is a key task in intelligent transportation systems and matters greatly for urban planning and traffic management. Traditional deep-learning methods improve prediction accuracy but are hard to interpret, and off-the-shelf large language models struggle to capture the complex spatiotemporal dynamics of traffic flow. This paper therefore proposes DynaSeek, a traffic-flow prediction method that combines fine-tuning of the DeepSeek large language model with dynamic modeling. Dynamic-modeling techniques and a data-correction mechanism are embedded in the prediction task to quantify the impact of spatiotemporal factors (e.g., region, weather) on traffic flow; the DeepSeek model is optimized with a LoRA fine-tuning strategy; and at prediction time the model dynamically corrects its output using historical data and real-time spatiotemporal information. Results show that the method outperforms baseline models on a multimodal dataset from California and yields clearer, more interpretable patterns of traffic-flow variation. The paper offers a new framework for multi-dimensional data fusion and dynamic modeling in traffic-flow prediction, markedly improving the model's practicality and reliability.
To address the high miss rate on small targets and severe cross-category confusion that existing medical-waste classification models exhibit in open scenarios, this paper proposes an improved GroundingDINO model with joint vision-language modeling. To strengthen the extraction of useful features and precise location information while suppressing irrelevant information, a cross-modal contrastive-learning framework is built into the model, and Low-Rank Adaptation (LoRA) provides lightweight optimization that preserves high accuracy while reducing compute. An Enhanced IoU (EIoU) loss function further improves bounding-box localization and strengthens robustness on complex medical-waste classification tasks. On a medical-waste image dataset covering 5 major classes and 20 subclasses, built according to China's national medical-waste management regulations, the fine-tuned GroundingDINO-MW surpasses both the GroundingDINO baseline and Alibaba Cloud's vision-understanding model Qwen2.5-vl-72B on precision, recall, mAP, and F1, demonstrating that it is better suited than the original model to open-scenario medical-waste classification and recognition.
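The EIoU loss referenced above augments the IoU term with penalties on center distance and on width/height differences, each normalized by the smallest enclosing box. A minimal single-pair sketch (illustrative, not the paper's implementation; boxes are assumed to be in (x1, y1, x2, y2) form):

```python
def eiou_loss(box_a, box_b):
    """EIoU loss for two axis-aligned boxes (x1, y1, x2, y2):
    (1 - IoU) + center-distance term + width term + height term,
    normalized by the smallest enclosing box."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # overlap and IoU
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # smallest enclosing box
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    # squared center distance and width/height differences
    center = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    dw = ((ax2 - ax1) - (bx2 - bx1)) ** 2
    dh = ((ay2 - ay1) - (by2 - by1)) ** 2
    return (1 - iou) + center / (cw ** 2 + ch ** 2) + dw / cw ** 2 + dh / ch ** 2
```

Unlike plain IoU loss, the extra terms still give a useful gradient when the predicted and ground-truth boxes barely overlap.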
With the rapid development of Internet technology and artificial intelligence, medical e-commerce platforms play an increasingly important role in modern pharmaceutical services. This study presents MedAsst, a Chinese medical dialogue model based on a large language model (LLM), and explores its application on medical e-commerce platforms. Built on Qwen2-7B, the model is supervised fine-tuned with LoRA on 1.47 million medical question-answer pairs. MedAsst's effectiveness is evaluated comprehensively on medical multiple-choice tests and a custom medical QA dataset. Experiments show that MedAsst outperforms baseline models on BLEU-4, ROUGE-1, ROUGE-2, and ROUGE-L, with a particularly clear advantage in medical question answering. Compared with LlaMa-3-8B, Gemma-7B, Mistral-7B, and the unfine-tuned Qwen2-7B, MedAsst's well-designed fine-tuning strategy delivers strong performance on domain-specific tasks, demonstrating the necessity and effectiveness of supervised fine-tuning. This work not only improves performance on Chinese medical QA but also illustrates the application potential of LLMs on medical e-commerce platforms, laying solid groundwork for optimization and deployment in more complex scenarios.
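The LoRA technique used for MedAsst's supervised fine-tuning has a compact mathematical form: the frozen weight W is augmented by a trainable low-rank product scaled by alpha/r, and B is initialized to zero so training starts exactly at the base model. A minimal forward-pass sketch (shapes and hyperparameters are illustrative, not the paper's settings):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Forward pass with a LoRA adapter: y = x @ (W + (alpha/r) * B @ A)^T.

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    are trained, so the trainable parameter count scales with r rather
    than with d_out * d_in.
    """
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T
```

With B at its zero initialization the adapter contributes nothing, which is why LoRA fine-tuning starts from the base model's behavior.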
As education scales up and demand for personalization grows, traditional manual grading faces serious challenges: low efficiency, strong subjectivity, and delayed feedback. Addressing these pain points, this study proposes an intelligent grading system based on a large language model (LLM), aiming to improve grading efficiency, fairness, and explainability through technical innovation. Built around the Transformer architecture, the system uses a hybrid scoring algorithm of instruction fine-tuning plus rule-constrained reinforcement learning, adapting the LLM to the domain with historical exam data and expert scoring rules to tackle the consistency problem in grading open-ended answers. A modular design automates the full pipeline: data preprocessing, multi-stage scoring, error-cause tracing, and personalized feedback generation. The innovations are threefold: combining the LLM's semantic understanding with the hard constraints of a rule engine to balance algorithmic flexibility and evaluation rigor; visualizing attention weights and highlighting scoring evidence to break the "black box" trust barrier in educational settings; and pairing lightweight fine-tuning (LoRA) with a vector database to keep high-concurrency deployment feasible. The system provides technical support for large-scale examinations and personalized teaching, pushing educational assessment from "experience-driven" toward "data-intelligent". Future work will extend to multimodal answer parsing and dynamic learning-status tracking, deepening the practical value of AI-education integration.
In recent years, the emergence of large language models (hereafter "large models") has provided a technical foundation for tackling complex problems in oil drilling. However, existing AI systems (e.g., DeepSeek) suffer from missing multimodal capabilities, isolated functional modules, knowledge-staleness boundaries, file-length limits, and file-format and memory compatibility issues; their role remains limited to general-purpose task optimization, and they respond poorly to specialized problems such as drilling engineering design and fault diagnosis. To address general-purpose open-source large models' misreading of professional terminology and insufficient fusion of industry knowledge, which cause poor scenario fit, and to couple large models deeply with drilling expertise, this paper builds DrillingGPT, a ChatGPT-like intelligent system for the oil-drilling industry (i.e., a drilling agent), based on Python and open-source platforms such as MaxKB and an innovative "drilling agent-workflow" architecture. The system markedly improves the accuracy and logical compliance of large models on drilling-specific question answering, plan generation, and similar tasks, offering a methodology and technical support for bringing general-purpose large models into vertical engineering domains.
In recent years, large pretrained models represented by ChatGPT have kept breaking through AI's technical bottlenecks, and the fragmentation of AI application scenarios may soon be fundamentally resolved. In the future, centralized AI application development is expected to replace traditional workshop-style production, a trend that places higher demands on the AI platforms supporting model training, fine-tuning, and deployment. Addressing shortcomings of mainstream AI platforms, this paper designs an integrated training-and-inference platform. The platform schedules machine-learning pipelines efficiently through a workflow engine, solves hardware resource allocation and scheduling with virtualization and containerization, and manages operators as components and plugins via automated form tools. The proposed platform will effectively lower the barrier to AI application development, promote centralized, large-scale production of AI applications, and accelerate the penetration of large pretrained models into vertical industry scenarios.
Grounded in the practical use of large models in English dialogue systems, this study compares a fine-tuned ERNIE-Lite-8K-0922 against GPT-4, both using prompt strategies. Quantitative metrics such as BLEU, ROUGE scores, and training loss demonstrate the effect of fine-tuning, while naturalness, logical coherence, context understanding, multi-turn dialogue handling, and emotional expression are used to assess response quality. Beyond documenting the performance gap between ERNIE-Lite-8K-0922 and GPT-4 in English dialogue systems, the study argues that the dataset and fine-tuning hyperparameters need further refinement to improve the fine-tuned ERNIE-Lite-8K-0922 in English dialogue and domain-specific settings. This work offers a useful reference for exploring more cost-effective ways to deploy large language models as English dialogue systems in practice, and contributes to the further development of English dialogue systems and related fields.
To mitigate the weak logical reasoning of smaller large models, as well as the high complexity and resource cost of fine-tuning, this paper uses in-context learning (ICL) with chain-of-thought (CoT) prompting to construct exemplars, exploring whether the reasoning performance of the general-purpose model ChatGLM2-6B can be improved without updating any model parameters. Taking Zero-Shot-CoT-generated chains of thought as the baseline, exemplar selection is optimized by combining random retrieval with several clustering methods. Experiments show that different exemplar-selection strategies raise reasoning performance by 10% on average, confirming that CoT prompting strengthens reasoning and that optimized exemplar selection enables efficient use of large models under resource constraints. The study offers a new route to improving language models' logical reasoning and downstream-task performance, and lays theoretical groundwork for large-model applications in low-resource settings.
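One family of the clustering-based exemplar-selection strategies compared above can be sketched as: embed the candidate CoT exemplars, cluster the embeddings, and take the exemplar nearest each centroid to obtain a diverse demonstration set. A minimal sketch assuming plain k-means (the paper compares several clustering methods; names are illustrative):

```python
import numpy as np

def select_exemplars(embeddings, k, n_iter=10, seed=0):
    """Pick k diverse exemplars: run k-means on candidate-exemplar
    embeddings and return the index of the candidate closest to each
    final centroid."""
    embeddings = np.asarray(embeddings, dtype=float)
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every candidate to its nearest centroid
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members):  # skip empty clusters
                centers[j] = members.mean(axis=0)
    dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=2)
    return sorted(set(dists.argmin(axis=0).tolist()))
```

The selected indices then determine which worked CoT demonstrations go into the ICL prompt, trading exemplar similarity for coverage of the candidate pool.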
With the rapid advance of artificial intelligence, large language models show great application potential and value in enterprise informatization. This paper examines their main application scenarios in this setting, including intelligent customer service, internal enterprise knowledge management, business-process automation, and market analysis with marketing optimization. It analyzes the challenges these applications face at the model, implementation, and societal levels, and proposes corresponding countermeasures. As large-language-model technology matures further and the application ecosystem improves, its role in enterprise informatization will become ever more important.
The final grouping sketches a full technical landscape of large language models, from low-level algorithms to top-level applications. The research body divides into five core layers: an efficiency-algorithm layer represented by PEFT and its quantized variants; an infrastructure layer represented by distributed and integrated training-inference platforms; a governance layer represented by human-preference alignment, privacy protection, and safety red teaming; a capability-extension layer represented by multimodality, tool calling, and RAG; and a vertical-application layer spanning healthcare, finance, manufacturing, and other industries. This reflects a pivotal transition in which LLMs are moving from "general-purpose large models" toward "efficient, safe, specialized, industrial-grade tools with complex interaction capabilities".