Artificial Intelligence Capabilities
Core LLM Reasoning Mechanisms and Reinforcement Learning Optimization
This cluster focuses on improving the logical reasoning abilities of LLMs, covering chain-of-thought (CoT) prompting, reinforcement learning (as in DeepSeek-R1), inference scaling laws, acceleration of the reasoning process, and policy optimization for logic-intensive tasks; a minimal CoT prompting sketch follows the reference list.
- Chain of Thought Prompting Elicits Reasoning in Large Language Models(Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, F. Xia, Quoc Le, Denny Zhou, 2022, ArXiv)
- Patience Is The Key to Large Language Model Reasoning(Yijiong Yu, 2024, ArXiv)
- Diversity-Aware Policy Optimization for Large Language Model Reasoning(Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, Kay Chen Tan, 2025, ArXiv)
- Accelerating Large Language Model Reasoning via Speculative Search(Zhihai Wang, Jie Wang, Jilai Pan, Xilin Xia, Huiling Zhen, Mingxuan Yuan, Jianye Hao, Feng Wu, 2025, ArXiv)
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning(DiJia Su, Hanlin Zhu, Yingchen Xu, Jiantao Jiao, Yuandong Tian, Qinqing Zheng, 2025, ArXiv)
- TypedThinker: Typed Thinking Improves Large Language Model Reasoning(Danqing Wang, Jianxin Ma, Fei Fang, Lei Li, 2024, ArXiv)
- ChatGPT与DeepSeek-R1比较研究:架构、推理能力与应用场景分析 (A Comparative Study of ChatGPT and DeepSeek-R1: Analysis of Architecture, Reasoning Capabilities, and Application Scenarios)(李昌奎, 2025, Theory and Practice of Social Science)
- Evaluating the o1 reasoning large language model for cognitive bias: a vignette study.(Or Degany, Sahar Laros, Daphna Idan, Sharon Einav, 2025, Critical care (London, England))
- 逻辑推理机制中的分配律 (Distributive Law in Deduction Mechanism of Logic)(Hang Shi, Baoshan Wang, Meihua Wu, 2016, 计算机科学)
- DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.(Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z F Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H Zhang, Hanwei Xu, Honghui Ding, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jingchang Chen, Jingyang Yuan, Jinhao Tu, Junjie Qiu, Junlong Li, J L Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaichao You, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R J Chen, R L Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S S Li, Shuang Zhou, Shaoqing Wu, Tao Yun, Tian Pei, Tianyu Sun, T Wang, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W L Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X Q Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y K Li, Y Q Wang, Y X Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y X Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z Z Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang, 2025, Nature)
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling(Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, Yuxiao Dong, 2025, ArXiv)
- 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training(Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, Xiangang Li, 2025, ArXiv)
- Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning(Mickael Tordjman, Zelong Liu, Murat Yuce, Valentin Fauveau, Yunhao Mei, Jérôme Hadjadj, Ian Bolger, Haidara Almansour, Carolyn Horst, Ashwin Singh Parihar, Amine Geahchan, Anis L. Meribout, Nader Yatim, Nicole Ng, P. Robson, Alexander Zhou, Sara Lewis, Mingqian Huang, Timothy Deyer, B. Taouli, Hao-Chih Lee, Zahi A. Fayad, X. Mei, 2025, Nature Medicine)
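The core mechanism surveyed above can be illustrated compactly. Below is a minimal sketch of chain-of-thought prompting in the spirit of Wei et al. (2022): a worked exemplar is prepended so the model emits intermediate reasoning before committing to a final answer. The `query_llm` function is a hypothetical placeholder for any text-completion API, not any cited paper's implementation.

```python
# Minimal chain-of-thought (CoT) prompting sketch in the spirit of
# Wei et al. (2022). `query_llm` is a hypothetical stand-in for any
# text-completion API.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM completion call."""
    raise NotImplementedError

def cot_answer(question: str) -> str:
    # The worked exemplar induces the model to emit intermediate
    # reasoning steps before its final answer.
    completion = query_llm(COT_EXEMPLAR + f"Q: {question}\nA:")
    # By the exemplar's convention, the answer follows "The answer is".
    return completion.rsplit("The answer is", 1)[-1].strip(" .\n")
```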
Theoretical Frameworks for Artificial General Intelligence (AGI), Cognitive Modeling, and the Nature of Intelligence
These works examine AGI from macro and theoretical perspectives: its definition, emergence phenomena, human-like cognitive architectures (such as the Conscious Turing Machine and brain-inspired designs), the boundaries of computability, and the philosophical and technical paths from intelligence toward wisdom.
- Emergent analogical reasoning in large language models.(Taylor Webb, Keith J Holyoak, Hongjing Lu, 2023, Nature human behaviour)
- Cognitive Modeling Using Artificial Intelligence.(Michael C Frank, Noah D Goodman, 2026, Annual review of psychology)
- Engineering a Less Artificial Intelligence.(Fabian H Sinz, Xaq Pitkow, Jacob Reimer, Matthias Bethge, Andreas S Tolias, 2019, Neuron)
- The Emergence Phenomenon in Artificial Intelligence: A Warning Sign on the Path to Artificial General Intelligence.(Vera Sorin, Eyal Klang, 2024, The Israel Medical Association journal : IMAJ)
- Expression unleashed in artificial intelligence.(Ekaterina I Tolstaya, Abhinav Gupta, Edward Hughes, 2023, The Behavioral and brain sciences)
- The limits of machine intelligence: Despite progress in machine intelligence, artificial general intelligence is still a major challenge.(Henry Shevlin, Karina Vold, Matthew Crosby, Marta Halina, 2019, EMBO reports)
- Toward Artificial General Intelligence in Hydrogeological Modeling With an Integrated Latent Diffusion Framework(Chuanjun Zhan, Zhenxue Dai, J. Jiao, M. Soltanian, Huichao Yin, K. Carroll, 2025, Geophysical Research Letters)
- Multimodality of AI for Education: Towards Artificial General Intelligence(Gyeong-Geon Lee, Lehong Shi, Ehsan Latif, Yizhu Gao, Arne Bewersdorff, Matthew Nyaaba, Shuchen Guo, Zihao Wu, Zheng Liu, Hui Wang, Gengchen Mai, Tiaming Liu, Xiaoming Zhai, 2023, ArXiv)
- Towards artificial general intelligence with hybrid Tianjic chip architecture(Jing Pei, Lei Deng, Sen Song, Mingguo Zhao, Youhui Zhang, Shuang Wu, Guanrui Wang, Zhe Zou, Zhenzhi Wu, Wei He, Feng Chen, Ning Deng, Si Wu, Yu Wang, Yujie Wu, Zheyu Yang, Cheng Ma, Guoqi Li, Wentao Han, Huanglong Li, Huaqiang Wu, R. Zhao, Yuan Xie, Luping Shi, 2019, Nature)
- A Theoretical Computer Science Perspective on Consciousness and Artificial General Intelligence(L. Blum, M. Blum, 2023, Engineering)
- Machines That Feel and Think: The Role of Affective Feelings and Mental Action in (Artificial) General Intelligence.(George Deane, 2022, Artificial life)
- Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement?(David Ilić, Gilles E. Gignac, 2023, Intelligence)
- Towards artificial general intelligence via a multimodal foundation model(Nanyi Fei, Zhiwu Lu, Yizhao Gao, Guoxing Yang, Yuqi Huo, Jing Wen, Haoyu Lu, Ruihua Song, Xin Gao, Tao Xiang, Haoran Sun, Jiling Wen, 2021, Nature Communications)
- Is ChatGPT the way toward artificial general intelligence(Frank Emmert-Streib, 2024, Discover Artificial Intelligence)
- Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact(Rizwan Qureshi, Ranjan Sapkota, Abbas Shah, Amgad Muneer, Anas Zafar, Ashmal Vayani, Maged Shoman, Abdelrahman B. M. Eldaly, Kai Zhang, Ferhat Sadak, Shaina Raza, Xinqi Fan, Ravid Shwartz-Ziv, Hong Yan, Vinjia Jain, Aman Chadha, Manoj Karkee, Jia Wu, S. Mirjalili, 2025, ArXiv)
- Sparks of Artificial General Intelligence: Early experiments with GPT-4(Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, J. Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Y. Lee, Yuan-Fang Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang, 2023, ArXiv)
- Beyond artificial intelligence: exploring artificial wisdom.(Dilip V Jeste, Sarah A Graham, Tanya T Nguyen, Colin A Depp, Ellen E Lee, Ho-Cheol Kim, 2020, International psychogeriatrics)
- On the Computability of Artificial General Intelligence(Georgios Mappouras, C. Rossides, 2025, ArXiv)
- The AI Race: Why Current Neural Network-based Architectures are a Poor Basis for Artificial General Intelligence(Jérémie Sublime, 2024, J. Artif. Intell. Res.)
- A theory of general intelligence.(Hin Wai Lui, 2019, Medical hypotheses)
- Abstraction and analogy-making in artificial intelligence.(Melanie Mitchell, 2021, Annals of the New York Academy of Sciences)
- New directions for artificial intelligence: human, machine, biological, and quantum intelligence(Weigang Li, L. Enamoto, Denise Leyi Li, G. P. Rocha Filho, 2021, Frontiers of Information Technology & Electronic Engineering)
Specialized Capabilities in Healthcare and Generalist Medical AI (GMAI)
These studies examine AI applications in clinical decision support, rare-disease consultation, pathology-assisted diagnosis, and personal health monitoring, emphasizing the generalist medical AI paradigm and its reasoning accuracy and empathy in complex medical scenarios.
- [Rare disease in the age of artificial intelligence.].(Carlo Alfredo Clerici, Saba Chopard, Giuseppe Levi, 2024, Recenti progressi in medicina)
- A large language model improves clinicians' diagnostic performance in complex critical illness cases.(Xintong Wu, Yu Huang, Qing He, 2025, Critical care (London, England))
- The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability(Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, Yang Liu, 2025, ArXiv)
- Physician and Artificial Intelligence Chatbot Responses to Cancer Questions From Social Media.(David Chen, Rod Parsa, Andrew Hope, Breffni Hannon, Ernie Mak, Lawson Eng, Fei-Fei Liu, Nazanin Fallah-Rad, Ann M Heesters, Srinivas Raman, 2024, JAMA oncology)
- Leveraging Medical Knowledge Graphs Into Large Language Models for Diagnosis Prediction: Design and Application Study.(Yanjun Gao, Ruizhe Li, Emma Croxford, John Caskey, Brian W Patterson, Matthew Churpek, Timothy Miller, Dmitriy Dligach, Majid Afshar, 2025, JMIR AI)
- Inductive reasoning with large language models: A simulated randomized controlled trial for epilepsy.(Daniel M Goldenholz, Shira R Goldenholz, Sara Habib, M Brandon Westover, 2025, Epilepsy research)
- Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine.(Thomas Savage, Ashwin Nayak, Robert Gallo, Ekanath Rangan, Jonathan H Chen, 2024, NPJ digital medicine)
- Exploring the Role of Artificial Intelligence in Smart Healthcare: A Capability and Function-Oriented Review(Syed Raza Abbas, H. Seol, Zeeshan Abbas, S. Lee, 2025, Healthcare)
- PathAsst: A Generative Foundation AI Assistant towards Artificial General Intelligence of Pathology(Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Zhongyi Shui, Xiaoxuan Yu, Yizhi Zhao, Honglin Li, Yunlong Zhang, Ruojia Zhao, Xinheng Lyu, Lin Yang, 2023, No journal)
- An evaluation framework for clinical use of large language models in patient interaction tasks.(Shreya Johri, Jaehwan Jeong, Benjamin A Tran, Daniel I Schlessinger, Shannon Wongvibulsin, Leandra A Barnes, Hong-Yu Zhou, Zhuo Ran Cai, Eliezer M Van Allen, David Kim, Roxana Daneshjou, Pranav Rajpurkar, 2025, Nature medicine)
- Reasoning with large language models for medical question answering.(Mary M Lucas, Justin Yang, Jon K Pomeroy, Christopher C Yang, 2024, Journal of the American Medical Informatics Association : JAMIA)
- Evaluating large language model performance to support the diagnosis and management of patients with primary immune disorders.(Nicholas L Rider, Yingya Li, Aaron T Chin, Daniel V DiGiacomo, Cullen Dutmer, Jocelyn R Farmer, Kirk Roberts, Guergana Savova, Mei-Sing Ong, 2025, The Journal of allergy and clinical immunology)
- Foundation models for generalist medical artificial intelligence.(Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, Pranav Rajpurkar, 2023, Nature)
- Toward expert-level medical question answering with large language models.(Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, Darlene Neal, Qazi Mamunur Rashid, Mike Schaekermann, Amy Wang, Dev Dash, Jonathan H Chen, Nigam H Shah, Sami Lachgar, Philip Andrew Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Agüera Y Arcas, Nenad Tomašev, Yun Liu, Renee Wong, Christopher Semturs, S Sara Mahdavi, Joelle K Barral, Dale R Webster, Greg S Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, Vivek Natarajan, 2025, Nature medicine)
- Large Language Model-Based Assessment of Clinical Reasoning Documentation in the Electronic Health Record Across Two Institutions: Development and Validation Study.(Verity Schaye, David DiTullio, Benedict Vincent Guzman, Scott Vennemeyer, Hanniel Shih, Ilan Reinstein, Danielle E Weber, Abbie Goodman, Danny T Y Wu, Daniel J Sartori, Sally A Santen, Larry Gruppen, Yindalon Aphinyanaphongs, Jesse Burk-Rafel, 2025, Journal of medical Internet research)
- Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining(Bingqian Lin, Zicong Chen, Mingjie Li, Haokun Lin, Hang Xu, Yi Zhu, Jian-zhuo Liu, Wenjia Cai, Lei Yang, Shen Zhao, Chenfei Wu, Ling Chen, Xiaojun Chang, Yi Yang, L. Xing, Xiaodan Liang, 2023, ArXiv)
- Domain knowledge enhanced deep learning for electrocardiogram arrhythmia classification(Jie Sun, 2023, Frontiers of Information Technology & Electronic Engineering)
- Baichuan-M2: Scaling Medical Capability with Large Verifier System(Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Dawei Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, Jixiang Hong, Kai Lu, Lin-lin Sun, Peidong Guo, Qian Ma, Rihui Xin, Shihui Yang, Shusen Zhang, Yichuan Mo, Zheng Liang, Zhishou Zhang, Hengfu Cui, Zu-Xin Zhu, Xiaochuan Wang, 2025, ArXiv)
- A personal health large language model for sleep and fitness coaching.(Justin Khasentino, Anastasiya Belyaeva, Xin Liu, Zhun Yang, Nicholas A Furlotte, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G Gomes, Allen Jiang, Roy Lee, Yun Liu, Javier Perez, Jameson K Rogers, Cathy Speed, Shyam Tailor, Megan Walker, Jeffrey Yu, Tim Althoff, Conor Heneghan, John Hernandez, Mark Malhotra, Leor Stern, Yossi Matias, Greg S Corrado, Shwetak Patel, Shravya Shetty, Jiening Zhan, Shruthi Prabhakara, Daniel McDuff, Cory Y McLean, 2025, Nature medicine)
- Evaluating AI-Generated Geriatric Case Studies for Interprofessional Education: Systematic Analysis Across 5 Platforms.(Nicole Ruggiano, Sudikshya Sahoo, Ava Brashear, Uche Nwatu, Amie Brunson, Hyunjin Noh, Heather Cole, Robert McKinney, C Victoria Framil Suarez, Ellen L Brown, Suzanne Prevost, 2026, JMIR medical education)
- Profiling Patient Transcript Using Large Language Model Reasoning Augmentation for Alzheimer's Disease Detection.(Chin-Po Chen, Jeng-Lin Li, 2024, Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference)
- The Advanced Reasoning Capabilities of Large Language Models for Detecting Contraindicated Options in Medical Exams.(Yuichiro Yano, Mizuki Ohashi, Taiju Miyagami, Hirotake Mori, Yuji Nishizaki, Hiroyuki Daida, Toshio Naito, 2025, JMIR medical informatics)
- Advancing Conversational Diagnostic AI with Multimodal Reasoning(Khaled Saab, Jan Freyberg, Chunjong Park, Tim Strother, Yong Cheng, Wei-Hung Weng, David G. T. Barrett, David Stutz, Nenad Tomašev, Anil Palepu, Valentin Liévin, Yash Sharma, Roma Ruparel, A. Ahmed, Elahe Vedadi, K. Kanada, Cían Hughes, Yun Liu, Geoff Brown, Yang Gao, Sean Li, S. Mahdavi, J. Manyika, Katherine Chou, Yossi Matias, Avinatan Hassidim, Dale R. Webster, Pushmeet Kohli, S. Eslami, Joelle K. Barral, Adam Rodman, Vivek Natarajan, Mike Schaekermann, Tao Tu, A. Karthikesalingam, Ryutaro Tanno, 2025, ArXiv)
- Evaluation of the Reliability of AI-Based Large Language Models in Developing Orthodontic Treatment Plans.(Makara Sorel, Chaitanya Gurrala, Aditya Tadinada, 2025, Cureus)
- Intelligent diagnosis of jaundice with dynamic uncertain causality graph model(S. Hao, Shichao Geng, Lin-xiao Fan, Jia-jia Chen, Qin Zhang, Lan Li, 2017, Journal of Zhejiang University-SCIENCE B)
- 人工智能在神经医学中的应用综述 (Application Survey of Artificial Intelligence in Neurology)(Shiyu Li, Feng Wang, B. Cao, Qixiang Mei, 2017, 计算机科学)
Knowledge Graph Augmentation, Tool Calling, and Structured Reasoning
This line of work studies how integrating external knowledge graphs (KGs), retrieval-augmented generation (RAG), API calling, and path planning over graph structures (RoG, PoG) improves factuality, interpretability, and the handling of complex tasks; a schematic sketch of the shared retrieve-then-reason pattern follows the reference list.
- Large Language Model-Based Evolutionary Optimizer: Reasoning with elitism(Shuvayan Brahmachary, S. Joshi, A. Panda, K. Koneripalli, A. Sagotra, H. Patel, Ankush Sharma, Ameya D. Jagtap, K. Kalyanaraman, 2024, ArXiv)
- Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation(Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, Cehao Yang, Jiaxin Mao, Jian Guo, 2024, No journal)
- Gorilla: Large Language Model Connected with Massive APIs(Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez, 2023, ArXiv)
- Galactica: A Large Language Model for Science(Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, A. Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic, 2022, ArXiv)
- An Evaluation Method for Large Language Models’ Code Generation Capability(Haoran Su, Jun Ai, Dan Yu, Hong Zhang, 2023, 2023 10th International Conference on Dependable Systems and Their Applications (DSA))
- MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning(Debrup Das, Debopriyo Banerjee, Somak Aditya, Ashish Kulkarni, 2024, ArXiv)
- Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph(Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Sai Wang, Chen Lin, Yeyun Gong, Lionel M. Ni, H. Shum, Jian Guo, 2023, No journal)
- GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning(Costas Mavromatis, G. Karypis, 2024, ArXiv)
- FiDeLiS: Faithful Reasoning in Large Language Model for Knowledge Graph Question Answering(Yuan Sui, Yufei He, Nian Liu, Xiaoxin He, Kun Wang, Bryan Hooi, 2024, No journal)
- ThoughtSource: A central hub for large language model reasoning data(Simon Ott, Konstantin Hebenstreit, Valentin Liévin, C. Hother, M. Moradi, Maximilian Mayrhauser, Robert Praas, O. Winther, M. Samwald, 2023, Scientific Data)
- TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning(Yuan Sui, Jiaru Zou, Mengyu Zhou, Xinyi He, Lun Du, Shi Han, Dongmei Zhang, 2023, ArXiv)
- A Sarsa reinforcement learning hybrid ensemble method for robotic battery power forecasting(Fei Peng, Hui Liu, Li Zheng, 2023, Journal of Central South University)
- 基于MLN的中文事件论元推理方法 (Chinese Event Argument Inference Approach Based on Markov Logic Network)(Shaohua Zhu, Peifeng Li, Qiaoming Zhu, 2016, 计算机科学)
- Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning(Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, Shirui Pan, 2023, ArXiv)
- Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning(Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Wenjie Zhang, 2024, Proceedings of the ACM on Web Conference 2025)
- Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval(Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, Jian Guo, 2024, ArXiv)
- Think-on-Graph: Deep and Responsible Reasoning of Large Language Model with Knowledge Graph(Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Sai Wang, Chen Lin, Yeyun Gong, H. Shum, Jian Guo, 2023, ArXiv)
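The methods above (Think-on-Graph, RoG, Paths-over-Graph, FiDeLiS) share a retrieve-then-reason pattern: enumerate relation paths from a knowledge graph, verbalize them, and let the LLM reason over that grounded evidence. The sketch below illustrates only the pattern; the toy triple store and the `query_llm` placeholder are assumptions, not any cited system's code.

```python
# Schematic sketch of the retrieve-then-reason pattern shared by
# KG-augmented reasoning methods. Toy triples and `query_llm` are
# illustrative assumptions.
from collections import deque

TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "field", "Physics"),
]

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM completion call."""
    raise NotImplementedError

def find_paths(start: str, max_hops: int = 2) -> list:
    """Breadth-first enumeration of relation paths rooted at `start`."""
    paths, frontier = [], deque([(start, [])])
    while frontier:
        node, path = frontier.popleft()
        if path:
            paths.append(path)
        if len(path) < max_hops:
            for h, r, t in TRIPLES:
                if h == node:
                    frontier.append((t, path + [(h, r, t)]))
    return paths

def kg_augmented_answer(question: str, topic_entity: str) -> str:
    # Verbalized paths ground the LLM's reasoning in explicit KG evidence.
    evidence = "\n".join(
        " ; ".join(f"{h} --{r}--> {t}" for h, r, t in path)
        for path in find_paths(topic_entity)
    )
    prompt = (f"Knowledge graph evidence:\n{evidence}\n\n"
              f"Question: {question}\n"
              "Answer using only the evidence above, citing the path used:")
    return query_llm(prompt)
```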
Multimodal Perception, Embodied Intelligence, and Advances in AIGC
These papers address AI's ability to process non-textual information, including cross-modal reasoning over video and images, task planning and grasping in embodied intelligence, reasoning segmentation, and the latest generative algorithms and open challenges in AIGC; a sketch of the generic LLM-as-planner loop follows the reference list.
- Multistage guidance on the diffusion model inspired by human artists’ creative thinking(W. Qi, Huanghuang Deng, Taihao Li, 2023, Frontiers of Information Technology & Electronic Engineering)
- Integrating visual large language model and reasoning chain for driver behavior analysis and risk assessment.(Kunpeng Zhang, Shipu Wang, Ning Jia, Liang Zhao, Chunyang Han, Li Li, 2024, Accident; analysis and prevention)
- Reasoning Grasping via Multimodal Large Language Model(Shiyu Jin, Jinxuan Xu, Yutian Lei, Liangjun Zhang, 2024, ArXiv)
- Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences(Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Fuxiao Liu, Feihong He, Jaehong Yoon, Jaehong Yoon, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang, 2024, No journal)
- LISA: Reasoning Segmentation via Large Language Model(Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia, 2023, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- LLM3: Large Language Model-based Task and Motion Planning with Motion Failure Reasoning(Shu Wang, Muzhi Han, Ziyuan Jiao, Zeyu Zhang, Yingnian Wu, Song-Chun Zhu, Hangxin Liu, 2024, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))
- Toward Flexible and Efficient Home Context Sensing: Capability Evaluation and Verification of Image-Based Cognitive APIs(Sinan Chen, S. Saiki, Masahide Nakamura, 2020, Sensors (Basel, Switzerland))
- LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning(Junchi Wang, Lei Ke, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Six-Writings multimodal processing with pictophonetic coding to enhance Chinese language models(Weigang Li, Mayara C. Marinho, Denise Leyi Li, Vitor Vasconcelos De Oliveira, 2024, Frontiers of Information Technology & Electronic Engineering)
- Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning(Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang, 2024, ArXiv)
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing(Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li, 2025, ArXiv)
- TARGE: large language model-powered explainable hate speech detection.(Muhammad Haseeb Hashir, Memoona, Sung Won Kim, 2025, PeerJ. Computer science)
- 基于非结构化文本增强关联规则的知识推理方法 (Knowledge Reasoning Method Based on Unstructured Text-enhanced Association Rules)(Zhixing Li, Shiya Ren, Huaming Wang, Ke Shen, 2019, 计算机科学)
- Recent advances in artificial intelligence generated content(Junping Zhang, Lin-Yin Sun, C. Jin, Junbin Gao, Xiaobin Li, Jiebo Luo, Zhigeng Pan, Ying Tang, Jingdong Wang, 2024, Frontiers of Information Technology & Electronic Engineering)
- Advances and challenges in artificial intelligence text generation(Bing Li, Peng Yang, Yuankang Sun, Zhongjian Hu, Meng Yi, 2024, Frontiers of Information Technology & Electronic Engineering)
- 融合多任務學習類神經網路聲學模型訓練於會議語音辨識之研究(Leveraging Multi-task Learning with Neural Network Based Acoustic Modeling for Improved Meeting Speech Recognition) [In Chinese](Ming-Han Yang, Yao-Chi Hsu, Hsiao-Tsung Hung, Ying-Wen Chen, Berlin Chen, Kuan-Yu Chen, 2016)
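For the embodied-planning papers above (e.g., LLM3), the recurring loop is: the LLM proposes a symbolic action sequence, a motion layer executes it, and failures are fed back as text for replanning. The following is a hedged sketch of that generic loop; `query_llm`, `execute_action`, and the JSON action schema are illustrative assumptions, not any cited paper's interface.

```python
# Minimal sketch of an LLM-as-planner loop with motion-failure feedback.
# `query_llm` and `execute_action` are hypothetical placeholders.
import json

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM completion call."""
    raise NotImplementedError

def execute_action(action: dict) -> tuple[bool, str]:
    """Placeholder for a motion planner; returns (success, error_msg)."""
    raise NotImplementedError

def plan_and_act(task: str, max_replans: int = 3) -> bool:
    feedback = ""
    for _ in range(max_replans):
        prompt = (f"Task: {task}\n{feedback}"
                  "Reply with a JSON list of actions, e.g. "
                  '[{"action": "pick", "object": "red_block"}].')
        try:
            actions = json.loads(query_llm(prompt))
        except json.JSONDecodeError:
            feedback = "Your previous reply was not valid JSON. "
            continue
        for action in actions:
            ok, err = execute_action(action)
            if not ok:
                # Feed the concrete failure back so the LLM can replan.
                feedback = f"Action {action} failed: {err}. Replan. "
                break
        else:
            return True  # every action succeeded
    return False
```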
AI Capability Evaluation, Safety, and Dynamic Assessment Methodology
This group works toward scientific evaluation frameworks, covering instruction following, socio-psychological traits (empathy, personality), safety leaderboards, multi-agent collusion risks, and the shift from static benchmarks to dynamic, interactive evaluation; a minimal verifiable instruction-following check is sketched after the reference list.
- Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability(Haonan Li, Xudong Han, Zenan Zhai, Honglin Mu, Hao Wang, Zhenxuan Zhang, Yilin Geng, Shom Lin, Renxi Wang, Artem Shelmanov, Xiangyu Qi, Yuxia Wang, Donghai Hong, Youliang Yuan, Mengya Chen, Haoqin Tu, Fajri Koto, Tatsuki Kuribayashi, Cong Zeng, Rishabh Bhardwaj, Bingchen Zhao, Yawen Duan, Yi Liu, Emad A. Alghamdi, Yaodong Yang, Yi Dong, Soujanya Poria, Peng-Chong Liu, Zhengzhong Liu, Xuguang Ren, Eduard H. Hovy, Iryna Gurevych, Preslav Nakov, Monojit Choudhury, Timothy Baldwin, 2024, ArXiv)
- M-IFEval: Multilingual Instruction-Following Evaluation(Antoine Dussolle, Andrea Cardena Díaz, Shota Sato, Peter Devine, 2025, No journal)
- Secret Collusion among AI Agents: Multi-Agent Deception via Steganography(S. Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip Torr, Lewis Hammond, Christian Schröder de Witt, 2024, Advances in Neural Information Processing Systems 37)
- Soft-HGRNs: soft hierarchical graph recurrent networks for multi-agent partially observable environments(Yixiang Ren, Zhenhui Ye, Yining Chen, Xiaohong Jiang, Guang-hua Song, 2023, Frontiers of Information Technology & Electronic Engineering)
- 三支决策代价目标函数的关系及推理研究 (Relationship and Reasoning Study for Three-way Decision Cost Objective Functions)(Jianfeng Xu, Yufan He, Lan Liu, 2018, 计算机科学)
- 基于真值支持度的直觉模糊推理方法 (Intuitionistic Fuzzy Reasoning Based on Truth-valued Support Degrees)(Benqiang Xu, Xuewei Tan, L. Zou, 2016, 计算机科学)
- 一种基于改进PLSA和案例推理的行为识别算法 (Novel Action Recognition via Improved PLSA and CBR)(Hongbin Tu, Yanyan Yue, Xinjian Zhou, Kun Luo, 2017, 计算机科学)
- 基于犹豫模糊可信度的知识推理 (Approach for Knowledge Reasoning Based on Hesitate Fuzzy Credibility)(Hongliang Zheng, Xuehui Hou, Xiaoying Song, Kuo Pang, L. Zou, 2019, 计算机科学)
- 概率图模型推理方法的研究进展 (Research and Development on Inference Technique in Probabilistic Graphical Models)(Jian-wei Liu, Lipeng Cui, Haien Li, Xiong-lin Luo, 2015, 计算机科学)
- 事件因果与时序关系识别的联合推理模型 (Joint Model of Events' Causal and Temporal Relations Identification)(Yilong Huang, Peifeng Li, Qiaoming Zhu, 2018, 计算机科学)
- 以語言模型評估學習者文句修改前後之流暢度(Using language model to assess the fluency of learners sentences edited by teachers)[In Chinese](Guannan Pu, Po-Lin Chen, Shih-Hung Wu, 2016)
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability(Congying Xia, Chen Xing, Jiangshu Du, Xinyi Yang, Yihao Feng, Ran Xu, Wenpeng Yin, Caiming Xiong, 2024, ArXiv)
- Instruction-Following Evaluation for Large Language Models(Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou, 2023, ArXiv)
- Evaluating AI Evaluation: Perils and Prospects(John Burden, 2024, ArXiv)
- ArtMentor: AI-Assisted Evaluation of Artworks to Explore Multimodal Large Language Models Capabilities(Chanjin Zheng, Zengyi Yu, Yilin Jiang, Mingzi Zhang, Xunuo Lu, Jing Jin, Li Gao, 2025, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems)
- Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues(Myke C. Cohen, Zhe Su, Hsien-Te Kao, Daniel Nguyen, Spencer Lynch, Maarten Sap, Svitlana Volkova, 2025, ArXiv)
- Gemini or ChatGPT? Capability, Performance, and Selection of Cutting-Edge Generative Artificial Intelligence (AI) in Business Management(N. Rane, Saurabh P. Choudhary, Jayesh Rane, 2024, SSRN Electronic Journal)
- Measuring Empathy in Artificial Intelligence: Insights From Psychodermatology and Implications for General Practice.(Kripa Ahuja, Peter Lio, 2024, The primary care companion for CNS disorders)
- 一种考虑等级语义关联的证据推理决策方法 (Decision Making Approach Based on Evidential Reasoning Considering SemanticRelationship among Assessment Grades)(Meijing Zhang, Yingming Wang, 2018, 计算机科学)
- 一种具有相依关系的二维云推理方法及其在预测中的应用 (Uncertainty Reasoning Based on Related Planar Cloud and Application in Prediction)(Dehao Liu, Qian Wang, 2016, 计算机科学)
- 基于认知多样性变异的鸡群算法协同优化异步实现 (Asynchronous Collaborative Chicken Swarm Optimization with Mutation Based on Cognitive Diversity)(Lin Xiao, Sitong Liu, 2017, 计算机科学)
- GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI(Naomi Simumba, Nils Lehmann, Paolo Fraccaro, H. Alemohammad, Geeth de Mel, Salman H. Khan, M. Maskey, N. Longépé, Xiao Xiang Zhu, Hannah Kerner, Juan Bernabé-Moreno, Alexander Lacoste, 2025, ArXiv)
- AI Sandbagging: Language Models can Strategically Underperform on Evaluations(Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward, 2024, ArXiv)
- A Conceptual Framework for AI Capability Evaluations(María Victoria Carro, Denise Alejandra Mester, Francisca Gauna Selasco, Luca Nicolás Forziati Gangi, M. Musa, Lola Ramos Pereyra, Mario A. Leiva, Juan Gustavo Corvalan, Maria Vanina Martinez, Gerardo I. Simari, 2025, ArXiv)
- Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants’ API Invocation Capabilities(Honglin Mu, Yang Xu, ylfeng, Xiaofeng Han, Yitong Li, Yutai Hou, Wanxiang Che, 2024, No journal)
- Not a Number: Identifying Instance Features for Capability-Oriented Evaluation(Ryan Burnell, John Burden, Danaja Rutar, Konstantinos Voudouris, L. Cheke, J. Hernández-Orallo, 2022, No journal)
- An Assessment of Human–AI Interaction Capability in the Generative AI Era: The Influence of Critical Thinking(Feiming Li, Xinyu Yan, Hong Su, Rong Shen, Gang Mao, 2025, Journal of Intelligence)
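A recurring idea in the instruction-following benchmarks above (IFEval, FOFO, M-IFEval) is that evaluation becomes reproducible when each instruction maps to a deterministic programmatic checker. The sketch below shows that idea with two illustrative checkers; they are assumptions for exposition, not the benchmarks' actual rule sets.

```python
# Minimal sketch of verifiable instruction-following evaluation in the
# spirit of IFEval (Zhou et al., 2023): each instruction has a
# deterministic checker. These two checkers are illustrative only.

def check_min_words(response: str, min_words: int) -> bool:
    return len(response.split()) >= min_words

def check_no_commas(response: str, _arg=None) -> bool:
    return "," not in response

CHECKERS = {"min_words": check_min_words, "no_commas": check_no_commas}

def strict_score(response: str, constraints: dict) -> float:
    """Fraction of verifiable constraints the response satisfies."""
    results = [CHECKERS[name](response, arg)
               for name, arg in constraints.items()]
    return sum(results) / len(results)

# Usage: require at least 50 words and forbid commas.
# strict_score(model_output, {"min_words": 50, "no_commas": None})
```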
Vertical-Industry Applications and Support for Scientific Research
These works examine AI deployment in specialized professional settings, including 6G wireless networks, financial reasoning, supply chain management, end-to-end automation of scientific research (the AI Scientist), and scientific domains such as rice biology.
- Artificial General Intelligence (AGI)-Native Wireless Systems: A Journey Beyond 6G(Walid Saad, Omar Hashash, C. Thomas, C. Chaccour, M. Debbah, N. Mandayam, Zhu Han, 2024, Proceedings of the IEEE)
- Generative artificial intelligence in supply chain and operations management: a capability-based framework for analysis and implementation(Ilya Jackson, Dmitry A. Ivanov, Alexandre Dolgui, Jafar Namdar, 2024, International Journal of Production Research)
- Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning(Zhaowei Liu, Xin Guo, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Zixuan Wang, Jiajie Xu, Weige Cai, Ziwei Yang, Xueqian Zhao, Chaojun Li, Sheng Xu, Dezhi Chen, Yun Chen, Zuo Bai, Liwen Zhang, 2025, ArXiv)
- 基于人工智能下智能网联线控底盘系统开发研究现状与展望 (Research Status and Prospects of Intelligent Connected Wire-Controlled Chassis System Development Based on Artificial Intelligence)(林 王, 长磊 张, 焌菱 彭, 2025, 科学与技术探索)
- Effects of higher education institutes’ artificial intelligence capability on students' self-efficacy, creativity and learning performance(Shaofeng Wang, Zhuo Sun, Y. Chen, 2022, Education and Information Technologies)
- AI Scientists Fail Without Strong Implementation Capability(Minjun Zhu, Qiujie Xie, Yixuan Weng, Jian Wu, Zhen Lin, Linyi Yang, Yue Zhang, 2025, ArXiv)
- SeedLLM·Rice: A large language model integrated with rice biological knowledge graph.(Fan Yang, Huanjun Kong, Jie Ying, Zihong Chen, Tao Luo, Wanli Jiang, Zhonghang Yuan, Zhefan Wang, Zhaona Ma, Shikuan Wang, Wanfeng Ma, Xiaoyi Wang, Xiaoying Li, Zhengyin Hu, Xiaodong Ma, Minguo Liu, Xiqing Wang, Fan Chen, Nanqing Dong, 2025, Molecular plant)
- GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities(Jillian Bommarito, M. Bommarito, D. Katz, Jessica Katz, 2023, ArXiv)
- Artificial intelligence capability: Conceptualization, measurement calibration, and empirical study on its impact on organizational creativity and firm performance(Patrick Mikalef, Manjul Gupta, 2021, Inf. Manag.)
Fuzzy Logic and Classical Computational Reasoning Models
This group covers fuzzy reasoning, propositional logic, algorithms for distributed computing environments, and memory mechanisms, i.e., AI capabilities realized within classical or specialized mathematical frameworks that complement the deep-learning route; a minimal Mamdani-style fuzzy inference sketch follows the reference list.
- 基于扩展模糊Petri网的知识推理方法研究 (Knowledge Reasoning Method Based on Extended Fuzzy Petri Net)(Ruqi Zhou, Yiqun Chen, Jia-li Feng, 2016, 计算机科学)
- Mamdani模糊推理算法的直觉化扩展 (Intuitionistic Extension of Mamdani Fuzzy Reasoning Arithmetic)(Jian Wang, Zhaohui Shi, Xinpeng Guo, Weiping Li, 2016, 计算机科学)
- 基于模糊命题逻辑形式系统FLcom的模糊推理及应用 (Fuzzy Reasoning and its Application Based on Fuzzy Propositional Logic)(Xiaogang Wu, Zhenghua Pan, 2015, 计算机科学)
- 基于模糊软集的三I推理方法的性质 (Properties of Triple I Reasoning Method Based on Fuzzy Soft Set)(Binbin Xue, K. Qin, 2018, 计算机科学)
- 一种基于Spark的大规模语义数据分布式推理框架 (Spark Based Large-scale Semantic Data Distributed Reasoning Framework)(Heng Chen, 2016, 计算机科学)
- 基于蕴涵算子族L-λ-Π的模糊推理三I支持算法 (Fuzzy Reasoning Triple I Sustaining Method Based on Family of Implication Operator L-λ-Π)(Jing Shuang, X. Hui, Jinrui He, 2015, 计算机科学)
- 基于特征变换的DGA诊断范例推理方法 (DGA Fault Diagnosis Based on CBR Method with Feature Transformation)(Minglei Gao, Zhongjiang Zhang, Bo Ji, 2015, 计算机科学)
- Gödel n值命题逻辑系统中命题公式的t真度及近似推理 (t Truth Degree of Formulas and Approximate Reasoning in Gödel n-valued Propositional Logic System)(Naidiao Zhu, X. Hui, Xiaoli Gao, 2016, 计算机科学)
- 记忆和遗忘策略改进的案例推理方法 (Case Base Reasoning Method Improved by Memory and Forgetting Strategy)(Chunxiao Zhang, Hui Zhao, 2017, 计算机科学)
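As a concrete reference point for this group, the sketch below runs one rule of classical Mamdani fuzzy inference ("IF temperature IS high THEN fan_speed IS fast") with min-implication and centroid defuzzification. The membership functions and numbers are illustrative assumptions; the cited papers study extensions (intuitionistic, triple-I, Petri-net-based) of this basic scheme.

```python
# Minimal Mamdani-style fuzzy inference sketch: one rule,
# min-implication, centroid defuzzification. Values are illustrative.
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

temperature = 28.0                                   # crisp input, degrees C
firing = float(tri(np.array([temperature]), 20, 30, 40)[0])  # "is high"

speed = np.linspace(0, 100, 501)                     # output universe, percent
fast = tri(speed, 50, 100, 150)                      # "fan speed is fast"

clipped = np.minimum(fast, firing)                   # Mamdani min-implication
fan_speed = (speed * clipped).sum() / clipped.sum()  # centroid defuzzification
print(f"inferred fan speed: {fan_speed:.1f}%")
```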
This report ultimately organizes research on artificial intelligence capabilities into eight core dimensions: from low-level optimization of reasoning mechanisms (reinforcement learning and CoT) to high-level theoretical exploration of AGI; from knowledge-graph-enhanced structured reasoning to perceptual breakthroughs in multimodal and embodied intelligence; and into specialized applications in vertical domains such as medicine, finance, and scientific research. The report also lays out a comprehensive evaluation and safety-governance framework and retains attention to classical computational models such as fuzzy logic, forming a complete research spectrum spanning technical principles, domain applications, and governance and evaluation.
A total of 139 related references.
The rapid development of artificial intelligence has driven the continuous advancement of large language models (LLMs). Among them, OpenAI's ChatGPT and DeepSeek-AI's DeepSeek-R1 have garnered significant attention. ChatGPT, built upon the GPT-4 architecture, demonstrates strong natural language understanding and wide-ranging applications, whereas DeepSeek-R1 leverages reinforcement learning techniques to optimize reasoning capabilities, excelling in mathematical reasoning and programming tasks. This paper, based on the latest research on DeepSeek-R1, provides a comprehensive comparison between ChatGPT and DeepSeek-R1 in terms of model architecture, training methods, reasoning capabilities, application scenarios, and openness. The study reveals that ChatGPT relies on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), making it highly effective in natural language processing tasks. In contrast, DeepSeek-R1 emphasizes reinforcement learning to enhance reasoning abilities, particularly excelling in mathematical reasoning and code generation tasks. Moreover, ChatGPT follows a closed-source approach, primarily for commercial use, while DeepSeek-R1 adopts an open-source model, offering greater flexibility for researchers and developers. This study provides valuable insights for AI researchers and developers, contributing to the advancement of LLM technology and future model optimization strategies.
Artificial Intelligence (AI) is transforming smart healthcare by enhancing diagnostic precision, automating clinical workflows, and enabling personalized treatment strategies. This review explores the current landscape of AI in healthcare from two key perspectives: capability types (e.g., Narrow AI and AGI) and functional architectures (e.g., Limited Memory and Theory of Mind). Based on capabilities, most AI systems today are categorized as Narrow AI, performing specific tasks such as medical image analysis and risk prediction with high accuracy. More advanced forms like General Artificial Intelligence (AGI) and Superintelligent AI remain theoretical but hold transformative potential. From a functional standpoint, Limited Memory AI dominates clinical applications by learning from historical patient data to inform decision-making. Reactive systems are used in rule-based alerts, while Theory of Mind (ToM) and Self-Aware AI remain conceptual stages for future development. This dual perspective provides a comprehensive framework to assess the maturity, impact, and future direction of AI in healthcare. It also highlights the need for ethical design, transparency, and regulation as AI systems grow more complex and autonomous, by incorporating cross-domain AI insights. Moreover, we evaluate the viability of developing AGI in regionally specific legal and regulatory frameworks, using South Korea as a case study to emphasize the limitations imposed by infrastructural preparedness and medical data governance regulations.
This research examines the transformative potential of artificial intelligence (AI) in general and Generative AI (GAI) in particular in supply chain and operations management (SCOM). Through the lens of the resource-based view and based on key AI capabilities such as learning, perception, prediction, interaction, adaptation, and reasoning, we explore how AI and GAI can impact 13 distinct SCOM decision-making areas. These areas include but are not limited to demand forecasting, inventory management, supply chain design, and risk management. With its outcomes, this study provides a comprehensive understanding of AI and GAI's functionality and applications in the SCOM context, offering a practical framework for both practitioners and researchers. The proposed framework systematically identifies where and how AI and GAI can be applied in SCOM, focussing on decision-making enhancement, process optimisation, investment prioritisation, and skills development. Managers can use it as a guidance to evaluate their operational processes and identify areas where AI and GAI can deliver improved efficiency, accuracy, resilience, and overall effectiveness. The research underscores that AI and GAI, with their multifaceted capabilities and applications, open a revolutionary potential and substantial implications for future SCOM practices, innovations, and research.
The research paper investigates the comparative functionalities, effectiveness, and selection criteria of Gemini and ChatGPT within the field of business management. Both AI platforms offer specialized advantages applicable across various domains, including market research, strategic planning, operations management, customer service, marketing, human resources, and decision-making. Gemini utilizes Google's vast index to excel in real-time market analysis, strategic planning, and data-driven decision-making. Its robust analytical capabilities facilitate swift identification of market trends, competitor analysis, and precise forecasting. Conversely, ChatGPT specializes in providing qualitative insights, analyzing customer feedback, and facilitating creative content generation, making it particularly valuable for customer interactions and marketing efforts. Regarding performance, both models significantly enhance operational efficiency, data analysis, and customer service automation. Gemini's proficiency lies in processing extensive datasets for insights and optimization, whereas ChatGPT's adaptability and conversational skills elevate customer experiences and creative content production. The paper delineates selection criteria tailored to specific business requirements and contexts. Considerations such as data sensitivity, bias mitigation, cost-effectiveness, accessibility, customization, and integration are pivotal in selecting between Gemini and ChatGPT. While Gemini may be favoured for its factual precision and integration within the Google ecosystem, ChatGPT offers flexibility, conversational capabilities, and potential for self-hosting. Comprehending the distinct strengths and limitations of each AI model is crucial for effectively harnessing their capabilities across diverse business management scenarios. The research delivers valuable insights for businesses seeking to optimize their operations and decision-making processes through AI integration.
With the deep integration of the automotive industry and information technology, intelligent connected vehicles have become a major trend in industry development. As a crucial component of smart connected vehicles, the development of wire-controlled chassis systems has garnered significant attention. The research focuses on analyzing the coupling mechanisms between key technologies such as wire-controlled drive, wire-controlled braking, wire-controlled steering, and chassis domain control. By applying artificial intelligence (AI) for environmental perception, decision-making planning, and multi-source information fusion, this approach aims to achieve high dynamic response and collaborative control capabilities in chassis platforms, enabling excellent adaptability in complex scenarios. This paper explores the application of AI technology in the development of intelligent connected wire-controlled chassis systems, outlines key technical pathways, and proposes innovative development solutions. The study aims to provide reference solutions for the engineering implementation of intelligent connected chassis technologies.
No abstract available
No abstract available
Text generation is an essential research area in artificial intelligence (AI) technology and natural language processing and provides key technical support for the rapid development of AI-generated content (AIGC). It is based on technologies such as natural language processing, machine learning, and deep learning, which enable learning language rules through training models to automatically generate text that meets grammatical and semantic requirements. In this paper, we sort and systematically summarize the main research progress in text generation and review recent text generation papers, focusing on presenting a detailed understanding of the technical models. In addition, several typical text generation application systems are presented. Finally, we address some challenges and future directions in AI text generation. We conclude that improving the quality, quantity, interactivity, and adaptability of generated text can help fundamentally advance AI text generation development.
Artificial intelligence generated content (AIGC) has become a research hotspot in AI in recent years. It promises to replace humans in performing content generation work efficiently and at low cost, covering music, painting, multimodal content, news articles, summary reports, stock-commentary digests, and even content generation and digital humans in the metaverse. AIGC offers a new technical path for the future development and realization of AI. Against this background, Frontiers of Information Technology & Electronic Engineering organized a special issue on recent advances in AIGC, focusing on AIGC theory, algorithms, applications, and related areas. By attracting high-quality papers, we hope to help researchers in academia and industry gain a deeper understanding of the fundamental theory behind AIGC and its potential applications, and to motivate more researchers to join and advance AIGC research. We therefore solicited papers on the following (non-exclusive) topics: (1) AI-generated music; (2) AI-generated painting; (3) AI dialogue models; (4) AI news summarization; (5) AI and the metaverse; (6) AI and digital humans; (7) AI image editing; (8) AI-generated short videos; (9) AI-generated multimedia content; (10) ChatGPT-related work. After rigorous review, 12 papers were selected, including 1 comment, 1 perspective, 3 reviews, 6 research articles, and 1 correspondence, organized into three main parts: ChatGPT; diffusion models; and prompt learning and multimodality. Overall, this special issue covers a broad range of research topics related to AIGC development and application, including AI image/text generation, 3D content creation, user-centered graphic design, style-specific music generation, and work related to causal representation learning and higher-order diffusion models. Probabilistic diffusion models, prompt learning, and ChatGPT are also surveyed in detail. Finally, we thank all the authors for their support of this special issue, and especially all the reviewers for their insightful comments and helpful suggestions on the submissions.
No abstract available
Current research on text-to-image generation has reached a level comparable to that of ordinary painters, but there is still much room for improvement relative to artists. Artist-level painting typically fuses the features of multiple images into a single image to express multi-level semantic information. In a pre-experiment we confirmed this and consulted three groups with different levels of art-appreciation ability to identify what distinguishes painter-level from artist-level work. These opinions were then used to help an AI painting system improve from ordinary painter-level image generation to artist-level image generation. Specifically, we propose a text-based multi-stage guidance method that requires no further pre-training and helps a diffusion model move toward multi-level semantic representation in generated images. Both machine and human evaluations in our experiments validate the effectiveness of the proposed method. Moreover, unlike previous single-stage guidance methods, our method can control how strongly each image feature appears in the painting by controlling the number of guidance steps between stages.
The recent progress in multi-agent deep reinforcement learning (MADRL) makes it more practical in real-world tasks, but its relatively poor scalability and the partially observable constraint raise more challenges for its performance and deployment. Based on our intuitive observation that human society could be regarded as a large-scale partially observable environment, where everyone has the functions of communicating with neighbors and remembering his/her own experience, we propose a novel network structure called the hierarchical graph recurrent network (HGRN) for multi-agent cooperation under partial observability. Specifically, we construct the multi-agent system as a graph, use a novel graph convolution structure to achieve communication between heterogeneous neighboring agents, and adopt a recurrent unit to enable agents to record historical information. To encourage exploration and improve robustness, we design a maximum-entropy learning method that can learn stochastic policies of a configurable target action entropy. Based on the above technologies, we propose a value-based MADRL algorithm called Soft-HGRN and its actor-critic variant called SAC-HGRN. Experimental results based on three homogeneous tasks and one heterogeneous environment not only show that our approach achieves clear improvements compared with four MADRL baselines, but also demonstrate the interpretability, scalability, and transferability of the proposed model.
This comment reviews the "once learning" (OLM) mechanism proposed in 1998, along with the subsequent "one-shot learning" for image classification and "you only look once" (YOLO) for object detection. Based on the current state of artificial intelligence (AI) research, we propose dividing the field into the following subdisciplines: artificial human intelligence, artificial machine intelligence, artificial bionic intelligence, and artificial quantum intelligence. These are regarded as the main directions of AI research and development, distinguished by the following criteria: (1) AI research and development oriented toward human-like, machine, bionic, or quantum computing; (2) information input with increased or reduced dimensionality; (3) knowledge learning from small samples or big data.
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.
Building the next-generation wireless systems that could support services such as the metaverse, digital twins (DTs), and holographic teleportation is challenging to achieve exclusively through incremental advances to conventional wireless technologies like metasurfaces or holographic antennas. While the 6G concept of artificial intelligence (AI)-native networks promises to overcome some of the limitations of existing wireless technologies, current developments of AI-native wireless systems rely mostly on conventional AI tools such as auto-encoders and off-the-shelf artificial neural networks. However, those tools struggle to manage and cope with the complex, nontrivial scenarios faced in real-world wireless environments and the growing quality-of-experience (QoE) requirements of the aforementioned, emerging wireless use cases. In contrast, in this article, we propose to fundamentally revisit the concept of AI-native wireless systems, equipping them with the common sense necessary to transform them into artificial general intelligence (AGI)-native systems. Our envisioned AGI-native wireless systems acquire common sense by exploiting different cognitive abilities such as reasoning and analogy. These abilities in our proposed AGI-native wireless system are mainly founded on three fundamental components: a perception module, a world model, and an action-planning component. Collectively, these three fundamental components enable the four pillars of common sense that include dealing with unforeseen scenarios through horizontal generalizability, capturing intuitive physics, performing analogical reasoning, and filling in the blanks. Toward developing these components, we start by showing how the perception module can be built through abstracting real-world elements into generalizable representations. These representations are then used to create a world model, founded on principles of causality and hyperdimensional (HD) computing. Specifically, we propose a concrete definition of a world model, viewing it as an HD causal vector space that aligns with the intuitive physics of the real world—a cornerstone of common sense. In addition, we discuss how this proposed world model can enable analogical reasoning and manipulation of the abstract representations. Then, we show how the world model can drive an action-planning feature of the AGI-native network. In particular, we propose an intent-driven and objective-driven planning method that can maneuver the AGI-native network to plan its actions. These planning methods are based on brain-inspired frameworks such as integrated information theory and hierarchical abstractions that play a crucial role in enabling human-like decision-making. Next, we explain how an AGI-native network can be further exploited to enable three use cases related to human users and autonomous agent applications: 1) analogical reasoning for the next-generation DTs; 2) synchronized and resilient experiences for cognitive avatars; and 3) brain-level metaverse experiences exemplified by holographic teleportation. Finally, we conclude with a set of recommendations to ignite the quest for AGI-native systems. Ultimately, we envision this article as a roadmap for the next generation of wireless systems beyond 6G.
Can machines truly think, reason and act in domains like humans? This enduring question continues to shape the pursuit of Artificial General Intelligence (AGI). Despite the growing capabilities of models such as GPT-4.5, DeepSeek, Claude 3.5 Sonnet, Phi-4, and Grok 3, which exhibit multimodal fluency and partial reasoning, these systems remain fundamentally limited by their reliance on token-level prediction and lack of grounded agency. This paper offers a cross-disciplinary synthesis of AGI development, spanning artificial intelligence, cognitive neuroscience, psychology, generative models, and agent-based systems. We analyze the architectural and cognitive foundations of general intelligence, highlighting the role of modular reasoning, persistent memory, and multi-agent coordination. In particular, we emphasize the rise of Agentic RAG frameworks that combine retrieval, planning, and dynamic tool use to enable more adaptive behavior. We discuss generalization strategies, including information compression, test-time adaptation, and training-free methods, as critical pathways toward flexible, domain-agnostic intelligence. Vision-Language Models (VLMs) are reexamined not just as perception modules but as evolving interfaces for embodied understanding and collaborative task completion. We also argue that true intelligence arises not from scale alone but from the integration of memory and reasoning: an orchestration of modular, interactive, and self-improving components where compression enables adaptive behavior. Drawing on advances in neurosymbolic systems, reinforcement learning, and cognitive scaffolding, we explore how recent architectures begin to bridge the gap between statistical learning and goal-directed cognition. Finally, we identify key scientific, technical, and ethical challenges on the path to AGI.
In recent years we observed rapid and significant advancements in artificial intelligence (A.I.). So much so that many wonder how close humanity is to developing an A.I. model that can achieve human level of intelligence, also known as artificial general intelligence (A.G.I.). In this work we look at this question and we attempt to define the upper bounds, not just of A.I., but rather of any machine-computable process (a.k.a. an algorithm). To answer this question however, one must first precisely define A.G.I. We borrow prior work's definition of A.G.I. [1] that best describes the sentiment of the term, as used by the leading developers of A.I. That is, the ability to be creative and innovate in some field of study in a way that unlocks new and previously unknown functional capabilities in that field. Based on this definition we draw new bounds on the limits of computation. We formally prove that no algorithm can demonstrate new functional capabilities that were not already present in the initial algorithm itself. Therefore, no algorithm (and thus no A.I. model) can be truly creative in any field of study, whether that is science, engineering, art, sports, etc. In contrast, A.I. models can demonstrate existing functional capabilities, as well as combinations and permutations of existing functional capabilities. We conclude this work by discussing the implications of this proof both as it regards to the future of A.I. development, as well as to what it means for the origins of human intelligence.
No abstract available
Deep learning models have been extensively applied to various aspects of hydrogeological modeling. However, traditional approaches often rely on separate task‐specific models, resulting in time‐consuming selection and tuning processes. This study develops an integrated Latent Diffusion Model (LDM) framework to address four key hydrogeological modeling tasks: aquifer heterogeneity structure generation, surrogate modeling for flow and transport, and direct inversion of aquifer heterogeneity structure. Using a consistent architecture and hyperparameters, the LDM demonstrates robust multi‐task processing capabilities, accurately capturing aquifer heterogeneity, enabling rapid predictions of hydraulic head and solute transport, and efficiently performing direct inversion without iterative simulations. By integrating multiple tasks within a single framework, LDM eliminates the need for task‐specific models or extensive parameter optimization, offering an efficient and adaptive general solution for deep learning‐based hydrogeological modeling. Its generalization across diverse objectives underscores its potential as a cornerstone for advancing Artificial General Intelligence in hydrogeological modeling.
Large language models (LLMs) are advanced artificial intelligence (AI) systems that can perform a variety of tasks commonly found in human intelligence tests, such as defining words, performing calculations, and engaging in verbal reasoning. There are also substantial individual differences in LLM capacities. Given the consistent observation of a positive manifold and general intelligence factor in human samples, along with group-level factors (e.g., crystallized intelligence), we hypothesized that LLM test scores may also exhibit positive intercorrelations, which could potentially give rise to an artificial general ability (AGA) factor and one or more group-level factors. Based on a sample of 591 LLMs and scores from 12 tests aligned with fluid reasoning (Gf), domain-specific knowledge (Gkn), reading/writing (Grw), and quantitative knowledge (Gq), we found strong empirical evidence for a positive manifold and a general factor of ability. Additionally, we identified a combined Gkn/Grw group-level factor. Finally, the number of LLM parameters correlated positively with both general factor of ability and Gkn/Grw factor scores, although the effects showed diminishing returns. We interpreted our results to suggest that LLMs, like human cognitive abilities, may share a common underlying efficiency in processing information and solving problems, though whether LLMs manifest primarily achievement/expertise rather than intelligence remains to be determined. Finally, while models with greater numbers of parameters exhibit greater general cognitive-like abilities, akin to the connection between greater neuronal density and human general intelligence, other characteristics must also be involved.
The success of the conversational AI system ChatGPT has triggered an avalanche of studies that explore its applications in research and education. There are also high hopes that, in addition to such particular usages, it could lead to artificial general intelligence (AGI), that is, human-level intelligence. Such aspirations, however, need to be grounded by actual scientific means to ensure faithful statements and evaluations of the current situation. The purpose of this article is to put ChatGPT into perspective and to outline a way forward that might instead lead to an artificial special intelligence (ASI), a notion we introduce. The underlying idea of ASI is based on an environment that consists only of text. We will show that this avoids the problem of embodiment of an agent and leads to a system with restricted capabilities compared to AGI. Furthermore, we discuss gated actions as a means of large language models to moderate ethical concerns.
Artificial General Intelligence is the idea that someday a hypothetical agent will arise from progress in artificial intelligence (AI) and will far surpass the brightest and most gifted human minds. This idea has been around since the early development of AI. Since then, scenarios for how such an AI might behave towards humans have been the subject of many fictional and research works. This paper analyzes the current state of progress in artificial intelligence, and how the current AI race, with its ever faster release of impressive new AI methods (methods that can deceive humans, outperform them at tasks we thought impossible for AI to tackle a mere decade ago, and disrupt the job market), has raised concerns that Artificial General Intelligence (AGI) might be coming faster than we thought. In particular, we focus on 3 specific families of modern AIs to develop the idea that deep neural networks, which are the current backbone of nearly all artificial intelligence methods, are poor candidates for any AGI to arise from due to their many limitations, and therefore that any threat coming from the recent AI race lies not in AGI but in the limitations, uses, and lack of regulation of our current models and algorithms. This article appears in the AI & Society track.
As advances in large language models (LLMs) and multimodal techniques continue to mature, the development of general-purpose multimodal large language models (MLLMs) has surged, offering significant applications in interpreting natural images. However, the field of pathology has largely remained untapped, particularly in gathering high-quality data and designing comprehensive model frameworks. To bridge the gap in pathology MLLMs, we present PathAsst, a multimodal generative foundation AI assistant to revolutionize diagnostic and predictive analytics in pathology. The development of PathAsst involves three pivotal steps: data acquisition, CLIP model adaptation, and the training of PathAsst's multimodal generative capabilities. Firstly, we collect over 207K high-quality pathology image-text pairs from authoritative sources. Leveraging the advanced power of ChatGPT, we generate over 180K instruction-following samples. Furthermore, we devise additional instruction-following data specifically tailored for invoking eight pathology-specific sub-models we prepared, allowing PathAsst to effectively collaborate with these models and enhancing its diagnostic ability. Secondly, by leveraging the collected data, we construct PathCLIP, a pathology-dedicated CLIP, to enhance PathAsst's capabilities in interpreting pathology images. Finally, we integrate PathCLIP with Vicuna-13B and utilize pathology-specific instruction-tuning data to enhance the multimodal generation capacity of PathAsst and bolster its synergistic interactions with sub-models. The experimental results of PathAsst show the potential of harnessing AI-powered generative foundation models to improve pathology diagnosis and treatment processes. We open-source our dataset, as well as a comprehensive toolkit for extensive pathology data collection and preprocessing, at https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology.
The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of humans. Despite tremendous success in AI research, most existing methods have only a single cognitive ability. To overcome this limitation and take a solid step towards artificial general intelligence (AGI), we develop a foundation model pre-trained with huge multimodal data, which can be quickly adapted for various downstream cognitive tasks. To achieve this goal, we propose to pre-train our foundation model by self-supervised learning with weak semantic correlation data crawled from the Internet and show that promising results can be obtained on a wide range of downstream tasks. Particularly, with the developed model-interpretability tools, we demonstrate that strong imagination ability is now possessed by our foundation model. We believe that our work makes a transformative stride towards AGI, from our common practice of “weak or narrow AI” to that of “strong or generalized AI”. Artificial intelligence approaches inspired by human cognitive function usually have a single learned ability. The authors propose a multimodal foundation model that demonstrates cross-domain learning and adaptation for a broad range of downstream cognitive tasks.
This paper presents a comprehensive examination of how multimodal artificial intelligence (AI) approaches are paving the way towards the realization of Artificial General Intelligence (AGI) in educational contexts. It scrutinizes the evolution and integration of AI in educational systems, emphasizing the crucial role of multimodality, which encompasses auditory, visual, kinesthetic, and linguistic modes of learning. This research delves deeply into the key facets of AGI, including cognitive frameworks, advanced knowledge representation, adaptive learning mechanisms, strategic planning, sophisticated language processing, and the integration of diverse multimodal data sources. It critically assesses AGI's transformative potential in reshaping educational paradigms, focusing on enhancing teaching and learning effectiveness, filling gaps in existing methodologies, and addressing ethical considerations and responsible usage of AGI in educational settings. The paper also discusses the implications of multimodal AI's role in education, offering insights into future directions and challenges in AGI development. This exploration aims to provide a nuanced understanding of the intersection between AI, multimodality, and education, setting a foundation for future research and development in AGI.
Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks, which is very practical in the medical domain. It can significantly reduce the requirement of large amounts of task-specific data by sufficiently sharing medical knowledge among different tasks. However, due to the challenges of designing strongly generalizable models with limited and complex medical data, most existing approaches tend to develop task-specific models. To take a step towards MAGI, we propose a new paradigm called Medical-knOwledge-enhanced mulTimOdal pretRaining (MOTOR). In MOTOR, we combine two kinds of basic medical knowledge, i.e., general and specific knowledge, in a complementary manner to boost the general pretraining process. As a result, the foundation model with comprehensive basic knowledge can learn compact representations from pretraining radiographic data for better cross-modal alignment. MOTOR unifies understanding and generation, two kinds of core intelligence of an AI system, into a single medical foundation model to flexibly handle more diverse medical tasks. To enable a comprehensive evaluation and facilitate further research, we construct a medical multimodal benchmark including a wide range of downstream tasks, such as chest x-ray report generation and medical visual question answering. Extensive experiments on our benchmark show that MOTOR obtains promising results through simple task-oriented adaptation. The visualization shows that the injected knowledge successfully highlights key information in the medical data, demonstrating the excellent interpretability of MOTOR. Our MOTOR successfully mimics the human practice of first training as a "medical student" to accelerate the process of becoming a "specialist". We believe that our work makes a significant stride in realizing MAGI.
We have defined the Conscious Turing Machine (CTM) for the purpose of investigating a Theoretical Computer Science (TCS) approach to consciousness. For this, we have hewn to the TCS demand for simplicity and understandability. The CTM is consequently and intentionally a simple machine. It is not a model of the brain, though its design has greatly benefited - and continues to benefit - from neuroscience and psychology. The CTM is a model of and for consciousness. Although it is developed to understand consciousness, the CTM offers a thoughtful and novel guide to the creation of an Artificial General Intelligence (AGI). For example, the CTM has an enormous number of powerful processors, some with specialized expertise, others unspecialized but poised to develop an expertise. For whatever problem must be dealt with, the CTM has an excellent way to utilize those processors that have the required knowledge, ability, and time to work on the problem, even if it is not aware of which ones these may be.
No abstract available
Large language models (LLMs) have demonstrated impressive reasoning abilities in complex tasks. However, they lack up-to-date knowledge and experience hallucinations during reasoning, which can lead to incorrect reasoning processes and diminish their performance and trustworthiness. Knowledge graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. Nevertheless, existing KG-based LLM reasoning methods only treat KGs as factual knowledge bases and overlook the importance of their structural information for reasoning. In this paper, we propose a novel method called reasoning on graphs (RoG) that synergizes LLMs with KGs to enable faithful and interpretable reasoning. Specifically, we present a planning-retrieval-reasoning framework, where RoG first generates relation paths grounded by KGs as faithful plans. These plans are then used to retrieve valid reasoning paths from the KGs for LLMs to conduct faithful reasoning. Furthermore, RoG not only distills knowledge from KGs to improve the reasoning ability of LLMs through training but also allows seamless integration with any arbitrary LLMs during inference. Extensive experiments on two benchmark KGQA datasets demonstrate that RoG achieves state-of-the-art performance on KG reasoning tasks and generates faithful and interpretable reasoning results.
Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural questions grounding the reasoning to the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for KGQA as they can handle the complex graph information stored in the KG. In this work, we introduce GNN-RAG, a novel method for combining language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and answer candidates are extracted to represent KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for ultimate KGQA. Furthermore, we develop a retrieval augmentation (RA) technique to further boost KGQA performance with GNN-RAG. Experimental results show that GNN-RAG achieves state-of-the-art performance in two widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching GPT-4 performance with a 7B tuned LLM. In addition, GNN-RAG excels on multi-hop and multi-entity questions, outperforming competing approaches by 8.9–15.5 percentage points in answer F1.
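The path-extraction step described above, which takes shortest KG paths between question entities and GNN-retrieved answer candidates and verbalizes them for the LLM, can be sketched as follows. The toy triples and helper names are illustrative, not the paper's code.

```python
# Minimal sketch of the GNN-RAG path step: shortest KG paths from question
# entities to (hypothetical) GNN answer candidates, verbalized for RAG.
import networkx as nx

triples = [
    ("Jamaica", "official_language", "English"),
    ("Jamaica", "located_in", "Caribbean"),
    ("English", "language_family", "Germanic"),
]
kg = nx.DiGraph()
for h, r, t in triples:
    kg.add_edge(h, t, relation=r)

def verbalize_paths(question_entities, answer_candidates):
    """Shortest paths entity -> candidate, rendered as text for LLM input."""
    lines = []
    for q in question_entities:
        for a in answer_candidates:
            try:
                path = nx.shortest_path(kg, q, a)
            except nx.NetworkXNoPath:
                continue
            hops = [f"{u} --{kg[u][v]['relation']}--> {v}"
                    for u, v in zip(path, path[1:])]
            lines.append(" ; ".join(hops))
    return "\n".join(lines)

print(verbalize_paths(["Jamaica"], ["English"]))
```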
Understanding human instructions to identify the target objects is vital for perception systems. In recent years, the advancements of Large Language Models (LLMs) have introduced new possibilities for image segmentation. In this work, we delve into reasoning segmentation, a novel task that enables a segmentation system to reason about and interpret implicit user intention via large language model reasoning and then segment the corresponding target. Our work on reasoning segmentation contributes to both methodological design and dataset labeling. For the model, we propose a new framework named LLM-Seg. LLM-Seg effectively connects the current foundational Segment Anything Model and the LLM by mask proposal selection. For the dataset, we propose an automatic data generation pipeline and construct a new reasoning segmentation dataset named LLM-Seg40K. Experiments demonstrate that our LLM-Seg exhibits competitive performance compared with existing methods. Furthermore, our proposed pipeline can efficiently produce high-quality reasoning segmentation datasets. The LLM-Seg40K dataset, developed through this pipeline, serves as a new benchmark for training and evaluating various reasoning segmentation approaches. Our code, models and dataset are at https://github.com/wangjunchi/LLMSeg.
The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between solution diversity and Potential at k (a novel metric quantifying an LLM's reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity measure and reformulate it into a practical objective, then selectively apply it to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5% average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.
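The abstract does not spell out how Potential at k is computed, so the sketch below shows one plausible reading (the fraction of problems for which at least one of k sampled solutions is correct) together with a simple token-level diversity measure. All names and definitions here are assumptions for illustration.

```python
# Hedged sketch: a plausible "Potential at k" reading plus a distinct-1
# token-diversity measure. Not the paper's actual definitions.
def potential_at_k(samples_per_problem, is_correct, k):
    """samples_per_problem: list of lists of sampled solution strings."""
    solved = sum(
        any(is_correct(s) for s in sols[:k]) for sols in samples_per_problem
    )
    return solved / len(samples_per_problem)

def distinct_1(solutions):
    """Unique tokens / total tokens across a problem's sampled solutions."""
    tokens = [tok for s in solutions for tok in s.split()]
    return len(set(tokens)) / max(len(tokens), 1)

samples = [["2 + 2 = 4", "the answer is 4", "4"], ["x = 3", "x = 3"]]
print(potential_at_k(samples, lambda s: "4" in s or "3" in s, k=2))
print(distinct_1(samples[0]))
```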
Retrieval-augmented generation (RAG) has improved large language models (LLMs) by using knowledge retrieval to overcome knowledge deficiencies. However, current RAG methods often fall short of ensuring the depth and completeness of retrieved information, which is necessary for complex reasoning tasks. In this work, we introduce Think-on-Graph 2.0 (ToG-2), a hybrid RAG framework that iteratively retrieves information from both unstructured and structured knowledge sources in a tight-coupling manner. Specifically, ToG-2 leverages knowledge graphs (KGs) to link documents via entities, facilitating deep and knowledge-guided context retrieval. Simultaneously, it utilizes documents as entity contexts to achieve precise and efficient graph retrieval. ToG-2 alternates between graph retrieval and context retrieval to search for in-depth clues relevant to the question, enabling LLMs to generate answers. We conduct a series of well-designed experiments to highlight the following advantages of ToG-2: 1) ToG-2 tightly couples the processes of context retrieval and graph retrieval, deepening context retrieval via the KG while enabling reliable graph retrieval based on contexts; 2) it achieves deep and faithful reasoning in LLMs through an iterative knowledge retrieval process of collaboration between contexts and the KG; and 3) ToG-2 is training-free and plug-and-play compatible with various LLMs. Extensive experiments demonstrate that ToG-2 achieves overall state-of-the-art (SOTA) performance on 6 out of 7 knowledge-intensive datasets with GPT-3.5, and can elevate the performance of smaller models (e.g., LLAMA-2-13B) to the level of GPT-3.5's direct reasoning. The source code is available on https://github.com/IDEA-FinAI/ToG-2.
Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.
Large Language Models (LLMs) have achieved impressive results in various tasks but struggle with hallucination problems and lack of relevant knowledge, especially in deep complex reasoning and knowledge-intensive tasks. Knowledge Graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. However, existing KG-based LLM reasoning methods face challenges like handling multi-hop reasoning, multi-entity questions, and effectively utilizing graph structures. To address these issues, we propose Paths-over-Graph (PoG), a novel method that enhances LLM reasoning by integrating knowledge reasoning paths from KGs, improving the interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration, which combines the inherent knowledge of LLMs with factual knowledge from KGs. To improve efficiency, PoG first prunes irrelevant information from the graph exploration and then introduces efficient three-step pruning techniques that incorporate graph structures, LLM prompting, and a pre-trained language model (e.g., SBERT) to effectively narrow down the explored candidate paths. This ensures all reasoning paths contain highly relevant information captured from KGs, making the reasoning faithful and interpretable in problem-solving. PoG innovatively utilizes graph structure to prune irrelevant noise and represents the first method to implement multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive experiments on five benchmark KGQA datasets demonstrate PoG outperforms the state-of-the-art method ToG across GPT-3.5-Turbo and GPT-4, achieving an average accuracy improvement of 18.9%. Notably, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to 23.9%.
No abstract available
Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we propose a novel Speculative Search (SpecSearch) framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch utilizes a small model to strategically collaborate with a large model at both thought and token levels, efficiently generating high-quality reasoning thoughts. The major pillar of SpecSearch is a novel quality-preserving rejection mechanism, which effectively filters out thoughts whose quality falls below that of the large model's outputs. Moreover, we show that SpecSearch preserves comparable reasoning quality to the large model. Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to 2.12× speedup with comparable reasoning quality.
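The draft-then-filter mechanism described above can be sketched in a few lines: a small model drafts a candidate thought cheaply, a quality estimator rejects drafts scoring below a threshold tied to the large model's expected quality, and rejected drafts fall back to the large model. The model and scorer callables below are placeholders, not the paper's components.

```python
# Illustrative sketch of speculative thought generation with a
# quality-preserving rejection filter. All callables are toy stand-ins.
def speculative_thought(small_model, large_model, score, prompt, threshold):
    draft = small_model(prompt)            # cheap draft thought
    if score(draft) >= threshold:          # quality-preserving filter
        return draft                       # accept the cheap draft
    return large_model(prompt)             # otherwise pay for quality

# Toy stand-ins so the sketch runs end to end.
small = lambda p: p + " -> quick step"
large = lambda p: p + " -> careful step"
quality = lambda thought: len(thought)     # placeholder scoring function
print(speculative_thought(small, large, quality, "Solve 12*7", threshold=5))
```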
Large language models (LLMs) such as GPT-4 have recently demonstrated impressive results across a wide range of tasks. LLMs are still limited, however, in that they frequently fail at complex reasoning, their reasoning processes are opaque, they are prone to ‘hallucinate’ facts, and there are concerns about their underlying biases. Letting models verbalize reasoning steps as natural language, a technique known as chain-of-thought prompting, has recently been proposed as a way to address some of these issues. Here we present ThoughtSource, a meta-dataset and software library for chain-of-thought (CoT) reasoning. The goal of ThoughtSource is to improve future artificial intelligence systems by facilitating qualitative understanding of CoTs, enabling empirical evaluations, and providing training data. This first release of ThoughtSource integrates seven scientific/medical, three general-domain and five math word question answering datasets.
Table reasoning tasks have shown remarkable progress with the development of large language models (LLMs), which involve interpreting and drawing conclusions from tabular data based on natural language (NL) questions. Existing solutions, mainly tested on smaller tables, face scalability issues and struggle with complex queries due to incomplete or dispersed data across different table sections. To alleviate these challenges, we propose TAP4LLM as a versatile pre-processor suite for leveraging LLMs in table-based tasks effectively. It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding. In each module, we design and compare several common methods under various usage scenarios, aiming to shed light on the best practices for leveraging LLMs for table-reasoning tasks. Our experiments show that our method improves LLMs' reasoning capabilities in various tabular tasks and enhances the interaction between LLMs and tabular data by employing effective pre-processing.
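A minimal sketch of the first and third stages named above (table sampling, then packing/serialization) is given below using pandas. The keyword-overlap column heuristic and the markdown output are illustrative choices, not TAP4LLM's actual modules.

```python
# Sketch: query-aware table sampling followed by markdown serialization.
# (DataFrame.to_markdown requires the tabulate package to be installed.)
import pandas as pd

df = pd.DataFrame({
    "country": ["France", "Japan", "Brazil"],
    "capital": ["Paris", "Tokyo", "Brasilia"],
    "population_m": [68, 125, 216],
})

def sample_table(table, query, max_rows=2):
    """Keep columns whose name words appear in the query, then cut rows."""
    q = query.lower()
    cols = [c for c in table.columns if any(w in q for w in c.split("_"))]
    return table[cols or list(table.columns)].head(max_rows)

def serialize(table):
    """Pack the sub-table as markdown, a format LLMs parse reliably."""
    return table.to_markdown(index=False)

print(serialize(sample_table(df, "What is the capital of France?")))
```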
Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction or pre-defined categories to identify the target objects before executing visual recognition tasks. Such systems cannot actively reason and comprehend implicit user intention. In this work, we propose a new segmentation task - reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models (LLMs) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving complex reasoning and world knowledge. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation data samples results in further performance enhancement. Both quantitative and qualitative experiments show our method effectively unlocks new reasoning segmentation capabilities for multimodal LLMs. Code, models, and data are available at github.com/dvlab-research/LISA.
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
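The method itself is simple enough to show directly: exemplars that spell out intermediate steps are prepended to the target question, and the model is expected to continue in the same step-by-step style. The exemplar below follows the tennis-ball example the paper uses; the surrounding helper is illustrative.

```python
# Minimal sketch of few-shot chain-of-thought prompt construction.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def cot_prompt(question: str) -> str:
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

print(cot_prompt("A farm has 3 pens with 4 sheep each. How many sheep?"))
# A model completing this prompt should emit the intermediate step
# "3 pens of 4 sheep is 12 sheep" before "The answer is 12."
```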
Large Language Models (LLMs) have demonstrated strong reasoning capabilities in solving complex problems. However, current approaches primarily enhance reasoning through the elaboration of thoughts while neglecting the diversity of reasoning types. LLMs typically employ deductive reasoning, proceeding step-by-step from given conditions, which limits their exploration during problem-solving. Our analysis reveals that certain problems are exclusively solvable through specific reasoning strategies like inductive, abductive, or analogical reasoning. However, incorporating diverse reasoning approaches presents two key challenges: identifying the appropriate reasoning type for each problem and exploiting this approach during problem-solving. Therefore, we propose TypedThinker, which predicts suitable reasoning types based on the problem and each type's previous effectiveness, and provides relevant demonstrations to guide LLMs in applying these strategies. Experimental results show significant improvements across multiple benchmarks, with performance gains of 3.4% for Mistral 7B, 6.5% for LLaMA3 8B, and 7% for Qwen 2 7B on logical and mathematical reasoning tasks. TypedThinker enhances LLM reasoning without requiring knowledge distillation from larger models. It can be integrated into more advanced systems like GPT-4o or specialized models like MetaMath to diversify their reasoning approaches and improve their problem-solving capabilities.
Recent advancements in the field of large language models, particularly through the Chain of Thought (CoT) approach, have demonstrated significant improvements in solving complex problems. However, existing models either tend to sacrifice detailed reasoning for brevity due to user preferences, or require extensive and expensive training data to learn complicated reasoning abilities, limiting their potential in solving complex tasks. To bridge this gap, following the concept of test-time scaling, we propose a simple method that encourages models to adopt a more patient reasoning style without the need to introduce new knowledge or skills. To employ a preference optimization approach, we generate detailed reasoning processes as positive examples and simple answers as negative examples, thereby training the model to favor thoroughness in its responses. Our results demonstrate a performance increase of up to 2.1% on GSM8k from training on just a lightweight dataset.
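The data construction described above is easy to sketch: a detailed reasoning trace becomes the preferred ("chosen") response and a terse answer the rejected one. The dict layout below follows the common convention of DPO-style preference trainers; it is an assumption, not the paper's exact format.

```python
# Sketch of building preference pairs that reward patient reasoning.
def build_preference_pair(question, detailed_solution, short_answer):
    return {
        "prompt": question,
        "chosen": detailed_solution,   # patient, step-by-step response
        "rejected": short_answer,      # brief answer without reasoning
    }

pair = build_preference_pair(
    "Natalia sold 48 clips in April and half as many in May. Total?",
    "April: 48 clips. May: 48 / 2 = 24 clips. 48 + 24 = 72. The answer is 72.",
    "72",
)
print(pair["chosen"])
```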
No abstract available
In recent years, general-purpose large language models (LLMs) such as GPT, Gemini, Claude, and DeepSeek have advanced at an unprecedented pace. Despite these achievements, their application to finance remains challenging due to fragmented data sources, opaque reasoning processes, and weak transferability to business applications. In response, we introduce Fin-R1, a reasoning LLM designed for financial scenarios. With a compact size of 7 billion parameters, Fin-R1 reduces deployment costs while addressing the aforementioned challenges. Its development follows a two-stage pipeline. First, we construct Fin-R1-Data, a high-quality financial dataset consisting of 60,091 chain-of-thought (CoT) samples, distilled and filtered from multiple authoritative benchmarks to ensure consistency and reliability. Second, we train Fin-R1 using Fin-R1-Data through supervised fine-tuning (SFT), followed by reinforcement learning (RL). This stage substantially improves the model's ability to solve complex financial reasoning tasks, yielding outputs that are both accurate and interpretable. Despite its relatively small parameter scale, Fin-R1 achieves competitive empirical performance across established financial benchmarks and demonstrates practical utility in compliance checking and robo-advisory. Our code is publicly available at https://github.com/SUFE-AIFLM-Lab/Fin-R1, and has already attracted over 700 stars.
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield modest improvements in complex reasoning. In this paper, we present T1 to scale RL by encouraging exploration and to understand inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through oversampling. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1's better performance without any additional verification.
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data, where the step-by-step thought process is explicitly outlined by text tokens. However, this results in lengthy inputs where many words support textual coherence rather than core reasoning information, and processing these inputs consumes substantial computation resources. In this work, we propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens generated by VQ-VAE, significantly reducing the length of reasoning traces. We explore the use of latent trace abstractions in two scenarios: 1) training the model from scratch for the Keys-Finding Maze problem, 2) fine-tuning LLMs on this hybrid data with an extended vocabulary including unseen latent tokens, for both logical and mathematical reasoning problems. To facilitate effective learning, we introduce a simple training procedure that randomly mixes latent and text tokens, which enables fast adaptation to new latent tokens. Our approach consistently outperforms the baseline methods in various benchmarks.
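The random mixing procedure described above can be sketched as follows: a prefix of the reasoning trace is replaced by discrete latent codes (stand-ins for VQ-VAE indices), with the boundary drawn per example so training sees many mixtures. The token values and the 4:1 compression ratio are toy assumptions.

```python
# Illustrative sketch of randomly mixing latent and text tokens in a trace.
import random

def mix_latent_and_text(text_tokens, latent_codes, rng):
    cut = rng.randrange(len(latent_codes) + 1)   # how many codes to use
    prefix = [f"<latent_{c}>" for c in latent_codes[:cut]]
    return prefix + text_tokens[cut * 4:]        # each code ~4 text tokens

rng = random.Random(0)
trace = "add the tens first then add the ones then combine both sums".split()
print(mix_latent_and_text(trace, latent_codes=[17, 3], rng=rng))
```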
Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.
The AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. These problems are collected from a multitude of open-source datasets, subjected to semantic deduplication and meticulous cleaning to eliminate test set contamination. All responses within the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification procedures. Mathematical problems are validated by checking against reference answers, code problems are verified using test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, which was trained through only simple Supervised Fine-Tuning (SFT) using this batch of data, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. Additionally, the AM-Distill-Qwen-72B model surpassed the DeepSeek-R1-Distill-Llama-70B model on all benchmarks as well. We are releasing these 1.4 million problems and their corresponding responses to the research community with the objective of fostering the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset is published at https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M.
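The verification procedures named above (reference-answer checks for math, test cases for code) can be sketched in miniature; both checkers below are deliberately simplified stand-ins for the dataset's actual pipeline, and the `solve()` convention is an assumption.

```python
# Illustrative sketch of answer-based and test-case-based verification.
def verify_math(response: str, reference: str) -> bool:
    """Accept if the reference answer appears in the final line."""
    return reference.strip() in response.strip().splitlines()[-1]

def verify_code(source: str, test_cases) -> bool:
    """Execute the candidate and compare outputs on (input, expected) pairs."""
    namespace = {}
    exec(source, namespace)            # fine for a trusted toy example only
    solve = namespace["solve"]         # assumes a solve() entry point
    return all(solve(x) == y for x, y in test_cases)

print(verify_math("48 + 24 = 72\nThe answer is 72", "72"))
print(verify_code("def solve(n):\n    return n * 2", [(2, 4), (5, 10)]))
```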
No abstract available
Although large language models (LLMs) have achieved significant success in various tasks, they often struggle with hallucination problems, especially in scenarios requiring deep and responsible reasoning. These issues could be partially addressed by introducing external knowledge graphs (KG) in LLM reasoning. In this paper, we propose a new LLM-KG integrating paradigm "LLM ⊗ KG" which treats the LLM as an agent to interactively explore related entities and relations on KGs and perform reasoning based on the retrieved knowledge. We further implement this paradigm by introducing a new approach called Think-on-Graph (ToG), in which the LLM agent iteratively executes beam search on the KG, discovers the most promising reasoning paths, and returns the most likely reasoning results. We use a number of well-designed experiments to examine and illustrate the following advantages of ToG: 1) compared with LLMs, ToG has better deep reasoning power; 2) ToG has the ability of knowledge traceability and knowledge correctability by leveraging LLMs' reasoning and expert feedback; 3) ToG provides a flexible plug-and-play framework for different LLMs, KGs and prompting strategies without any additional training cost; 4) the performance of ToG with small LLMs could exceed that of large LLMs such as GPT-4 in certain scenarios, and this reduces the cost of LLM deployment and application. As a training-free method with lower computational cost and better generality, ToG achieves overall SOTA in 6 out of 9 datasets where most previous SOTAs rely on additional training.
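The iterative beam search over a KG that ToG performs can be sketched as below: at each hop, candidate path extensions are scored (here by a placeholder for "ask the LLM how promising this path is") and pruned to the beam width. The toy graph and scorer are illustrative, not the paper's implementation.

```python
# Minimal sketch of beam search over a knowledge graph, ToG-style.
KG = {
    "Canberra": [("capital_of", "Australia")],
    "Australia": [("continent", "Oceania"), ("currency", "AUD")],
}

def beam_search(start, score_path, width=1, hops=2):
    beams = [[(None, start)]]
    for _ in range(hops):
        candidates = [
            path + [(rel, nxt)]
            for path in beams
            for rel, nxt in KG.get(path[-1][1], [])
        ]
        if not candidates:
            break
        beams = sorted(candidates, key=score_path, reverse=True)[:width]
    return beams

# Placeholder scorer: prefer paths ending at the entity we hope to reach.
score = lambda path: 1.0 if path[-1][1] == "Oceania" else 0.0
print(beam_search("Canberra", score))
```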
Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these attributes to video modality, which are termed Video-LLMs. However, existing Video-LLMs can only capture the coarse-grained semantics and are unable to effectively handle tasks related to comprehension or localization of specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.
Conventional Task and Motion Planning (TAMP) approaches rely on manually designed interfaces connecting symbolic task planning with continuous motion generation. These domain-specific and labor-intensive modules are limited in addressing emerging tasks in real-world settings. Here, we present LLM3, a novel Large Language Model (LLM)-based TAMP framework featuring a domain-independent interface. Specifically, we leverage the powerful reasoning and planning capabilities of pre-trained LLMs to propose symbolic action sequences and select continuous action parameters for motion planning. Crucially, LLM3 incorporates motion planning feedback through prompting, allowing the LLM to iteratively refine its proposals by reasoning about motion failure. Consequently, LLM3 interfaces between task planning and motion planning, alleviating the intricate design process of handling domain-specific messages between them. Through a series of simulations in a box-packing domain, we quantitatively demonstrate the effectiveness of LLM3 in solving TAMP problems and the efficiency in selecting action parameters. Ablation studies underscore the significant contribution of motion failure reasoning to the success of LLM3. Furthermore, we conduct qualitative experiments on a physical manipulator, demonstrating the practical applicability of our approach in real-world settings.
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, prompting interest in their application as black-box optimizers. This paper asserts that LLMs possess the capability for zero-shot optimization across diverse scenarios, including multi-objective and high-dimensional problems. We introduce a novel population-based method for numerical optimization using LLMs called Language-Model-Based Evolutionary Optimizer (LEO). Our hypothesis is supported through numerical examples, spanning benchmark and industrial engineering problems such as supersonic nozzle shape optimization, heat transfer, and windfarm layout optimization. We compare our method to several gradient-based and gradient-free optimization approaches. While LLMs yield comparable results to state-of-the-art methods, their imaginative nature and propensity to hallucinate demand careful handling. We provide practical guidelines for obtaining reliable answers from LLMs and discuss method limitations and potential research directions.
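A population-based LLM-as-optimizer loop in the style described above can be sketched as follows. The LLM call is a stub that perturbs the current best candidates; in the real method the model would propose new points from a textual summary of the population. Names and settings are assumptions.

```python
# Hedged sketch of an evolutionary optimize-with-an-LLM loop.
import random

def objective(x):                      # toy 1-D problem: minimize (x - 3)^2
    return (x - 3.0) ** 2

def llm_propose(population, rng):
    """Stub for the LLM proposal step: mutate the best solutions."""
    best = sorted(population, key=objective)[:3]
    return [x + rng.gauss(0, 0.5) for x in best for _ in range(2)]

rng = random.Random(0)
pop = [rng.uniform(-10, 10) for _ in range(6)]
for _ in range(20):                    # evolutionary loop: propose, select
    pop = sorted(pop + llm_propose(pop, rng), key=objective)[:6]
print(round(min(pop, key=objective), 3))   # converges towards 3.0
```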
Tool-augmented Large Language Models (TALMs) are known to enhance the skillset of large language models (LLMs), thereby leading to their improved reasoning abilities across many tasks. While TALMs have been successfully employed in different question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks, and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving, are open research questions. In this work, we present MathSensei, a tool-augmented large language model for mathematical reasoning. We study the complementary benefits of the tools - knowledge retriever (Bing Web Search), program generator + executor (Python), and symbolic equation solver (Wolfram-Alpha API) - through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning on diverse mathematical disciplines. We also conduct experiments involving well-known tool planners to study the impact of tool sequencing on the model performance. MathSensei achieves 13.5% better accuracy over gpt-3.5-turbo with Chain-of-Thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8K), and the benefit increases as the complexity and required knowledge increases (progressively over AQuA, MMLU-Math, and higher level complex questions in MATH). The code and data are available at https://github.com/Debrup-61/MathSensei.
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu
Despite significant progress in robotic systems for operation within human-centric environments, existing models still heavily rely on explicit human commands to identify and manipulate specific objects. This limits their effectiveness in environments where understanding and acting on implicit human intentions are crucial. In this study, we introduce a novel task: reasoning grasping, where robots need to generate grasp poses based on indirect verbal instructions or intentions. To accomplish this, we propose an end-to-end reasoning grasping model that integrates a multimodal Large Language Model (LLM) with a vision-based robotic grasping framework. In addition, we present the first reasoning grasping benchmark dataset, generated from GraspNet-1Billion, incorporating implicit instructions for object-level and part-level grasping. Our results show that directly integrating CLIP or LLaVA with the grasp detection model performs poorly on the challenging reasoning grasping tasks, while our proposed model demonstrates significantly enhanced performance both in the reasoning grasping benchmark and real-world experiments.
Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. We outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev of 77.6% and 52.9%. And despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench. We believe these results demonstrate the potential for language models as a new interface for science. We open source the model for the benefit of the scientific community.
Large Language Models (LLMs) are often challenged by generating erroneous or hallucinated responses, especially in complex reasoning tasks. Leveraging Knowledge Graphs (KGs) as external knowledge sources has emerged as a viable solution. However, existing KG-enhanced methods, either retrieval-based or agent-based, encounter difficulties in accurately retrieving knowledge and efficiently traversing KGs at scale. In this paper, we propose a unified framework, FiDeLiS, designed to improve the factuality of LLM responses by anchoring answers to verifiable reasoning steps retrieved from KGs. To achieve this, we leverage step-wise beam search with a deductive scoring function, allowing the LLM to validate the reasoning process step by step, and halt the search once the question is deducible. In addition, we propose a Path-RAG module to pre-select a smaller candidate set for each beam search step, reducing computational costs by narrowing the search space. Extensive experiments show that our method, as a training-free framework, not only improves performance but also enhances factuality and interpretability across different benchmarks. Code is released at https://github.com/Y-Sui/FiDeLiS.
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
While large language models (LLMs) have made significant strides in natural language processing (NLP), they continue to face challenges in adequately addressing the intricacies of the Chinese language in certain scenarios. We propose a framework called Six-Writings multimodal processing (SWMP) to enable direct integration of Chinese NLP (CNLP) with morphological and semantic elements. The first part of SWMP, known as Six-Writings pictophonetic coding (SWPC), is introduced with a suitable level of granularity for radicals and components, enabling effective representation of Chinese characters and words. We conduct several experimental scenarios, including the following: (1) We establish an experimental database consisting of images and SWPC for Chinese characters, enabling dual-mode processing and matrix generation for CNLP. (2) We characterize various generative modes of Chinese words, such as thousands of Chinese idioms, used as question-and-answer (Q&A) prompt functions, facilitating analogies by SWPC. The experiments achieve 100% accuracy in answering all questions in the Chinese morphological data set (CA8-Mor-10177). (3) A fine-tuning mechanism is proposed to refine word embedding results using SWPC, resulting in an average relative error of ≤25% for 39.37% of the questions in the Chinese wOrd Similarity data set (COS960). The results demonstrate that SWMP/SWPC methods effectively capture the distinctive features of Chinese and offer a promising mechanism to enhance CNLP with better efficiency.
Deep learning provides an effective way for automatic classification of cardiac arrhythmias, but in clinical decision-making, pure data-driven methods working as black-boxes may lead to unsatisfactory results. A promising solution is combining domain knowledge with deep learning. This paper develops a flexible and extensible framework for integrating domain knowledge with a deep neural network. The model consists of a deep neural network to capture the statistical pattern between input data and the ground-truth label, and a knowledge module to guarantee consistency with the domain knowledge. These two components are trained interactively to bring the best of both worlds. The experiments show that the domain knowledge is valuable in refining the neural network prediction and thus improves accuracy.
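The two-component training signal described above can be sketched as a data-fit loss plus a penalty whenever the prediction violates a domain rule. The rule shown below (a high heart rate should not be labeled bradycardia) is a made-up stand-in for real clinical knowledge, and the whole sketch is illustrative rather than the paper's framework.

```python
# Illustrative sketch: data loss + knowledge-consistency penalty.
import numpy as np

def cross_entropy(probs, label):
    return -np.log(probs[label] + 1e-9)

def knowledge_penalty(probs, heart_rate, brady_idx=0):
    """Penalize probability mass on 'bradycardia' at high heart rates."""
    return probs[brady_idx] if heart_rate > 100 else 0.0

def total_loss(probs, label, heart_rate, lam=1.0):
    return cross_entropy(probs, label) + lam * knowledge_penalty(probs, heart_rate)

probs = np.array([0.3, 0.6, 0.1])      # network output over 3 rhythm classes
print(round(total_loss(probs, label=1, heart_rate=120), 3))
```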
No abstract available
As AI systems advance and integrate into society, well-designed and transparent evaluations are becoming essential tools in AI governance, informing decisions by providing evidence about system capabilities and risks. Yet there remains a lack of clarity on how to perform these assessments both comprehensively and reliably. To address this gap, we propose a conceptual framework for analyzing AI capability evaluations, offering a structured, descriptive approach that systematizes the analysis of widely used methods and terminology without imposing new taxonomies or rigid formats. This framework supports transparency, comparability, and interpretability across diverse evaluations. It also enables researchers to identify methodological weaknesses, assists practitioners in designing evaluations, and provides policymakers with an accessible tool to scrutinize, compare, and navigate complex evaluation landscapes.
This paper presents an evaluation framework for agentic AI systems in mission-critical negotiation contexts, addressing the need for AI agents that can adapt to diverse human operators and stakeholders. Using Sotopia as a simulation testbed, we present two experiments that systematically evaluated how personality traits and AI agent characteristics influence LLM-simulated social negotiation outcomes, a capability essential for a variety of applications involving cross-team coordination and civil-military interactions. Experiment 1 employs causal discovery methods to measure how personality traits impact price bargaining negotiations, through which we found that Agreeableness and Extraversion significantly affect believability, goal achievement, and knowledge acquisition outcomes. Sociocognitive lexical measures extracted from team communications detected fine-grained differences in agents' empathic communication, moral foundations, and opinion patterns, providing actionable insights for agentic AI systems that must operate reliably in high-stakes operational scenarios. Experiment 2 evaluates human-AI job negotiations by manipulating both simulated human personality and AI system characteristics, specifically transparency, competence, and adaptability, demonstrating how AI agent trustworthiness impacts mission effectiveness. These findings establish a repeatable evaluation methodology for experimenting with AI agent reliability across diverse operator personalities and human-agent team dynamics, directly supporting operational requirements for reliable AI systems. Our work advances the evaluation of agentic AI workflows by moving beyond standard performance metrics to incorporate social dynamics essential for mission success in complex operations.
Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce "capability" groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
The emergence of Artificial Intelligence (AI) Scientists represents a paradigm shift in scientific discovery, with large language models (LLMs) taking the lead as the primary executor in the entire scientific workflow from idea generation to experiment implementation. Recent AI Scientist studies demonstrate sufficient capabilities for independent scientific discovery, with generated research reports gaining acceptance at the ICLR 2025 workshop and ACL 2025, arguing that a human-level AI Scientist, capable of uncovering phenomena previously unknown to humans, may be imminent. Despite this substantial progress, the AI Scientist has yet to produce a groundbreaking achievement in the domain of computer science on par with automated scientific tools. Based on extensive quantitative evidence from existing benchmarks in complex engineering tasks and a systematic evaluation assessing 28 research papers generated by five advanced AI Scientist systems, we argue that the fundamental bottleneck for AI Scientists lies in their capability to execute the requisite verification procedures. Current AI Scientist systems lack the capabilities needed to execute rigorous experiments and produce high-quality scientific papers. To better illustrate the root cause of this implementation gap, we provide an in-depth discussion of the fundamental limitations of AI Scientists. This position paper aims to call on the participants in the community to bridge the implementation gap.
(1) Background: In the era of generative AI (GenAI), assessing AI literacy is essential for understanding how effectively non-expert users can interact with AI. However, existing assessment tools primarily focus on users’ understanding of AI principles or rely on self-reported scales, neglecting critical thinking and actual interaction capabilities. To address this gap, this study aims to design and validate evaluation indicators targeting the behavioral process of human–GenAI interactions and analyze the impact of critical thinking. (2) Methods: Grounded in information literacy and critical thinking frameworks, this study operationalized human–AI interaction capabilities into behavioral indicators and rubrics through observation, surveys, and pilot studies. Data were collected from 121 undergraduates completing two real-world tasks with GenAI, and their interaction processes were documented and evaluated. (3) Results: The indicators showed acceptable inter-rater and internal consistency reliability. Exploratory and Confirmatory Factor Analysis confirmed a three-dimensional structure. Further analysis showed that interaction capabilities varied across gender, academic background, AIGC use frequency, critical thinking disposition levels, and question chain logic. (4) Conclusions: The developed evaluation indicators are reliable and valid. Further analysis reveals that a high critical thinking disposition can offset the disadvantage of lower usage frequency. This highlights the significance of critical thinking in enhancing human–GenAI interaction capabilities.
Cognitive Application Program Interfaces (APIs) are APIs for emerging artificial intelligence (AI)-based cloud services, which extract various kinds of contextual information from non-numerical multimedia data, including images and audio. Our interest is in applying image-based cognitive APIs to implement flexible and efficient context sensing services in a smart home. In our existing machine-learning approach, as the complexity of the recognition targets and the number of user-defined contexts increase, a moderate amount of data must still be labeled manually for training, and multiple cognitive APIs must be called repeatedly for feature extraction. In this paper, we propose a novel method that uses a small amount of labeled data to evaluate the capability of cognitive APIs in advance, before training on the APIs' features with machine learning, for flexible and efficient home context sensing. In the proposed method, we exploit document similarity measures and the concepts of internal cohesion and external isolation, integrated into clustering results, to assess how capable different cognitive APIs are at recognizing each context. By selecting the cognitive APIs that best fit the defined contexts and data based on the evaluation results, we achieve flexible integration and efficient use of cognitive APIs for home context sensing.
Can Multimodal Large Language Models (MLLMs), with capabilities in perception, recognition, understanding, and reasoning, act as independent assistants in art evaluation dialogues? Current MLLM evaluation methods, reliant on subjective human scoring or costly interviews, lack comprehensive scenario coverage. This paper proposes a process-oriented Human-Computer Interaction (HCI) space design for more accurate MLLM assessment and development. This approach aids teachers in efficient art evaluation and records interactions for MLLM capability assessment. We introduce ArtMentor, a comprehensive space integrating a dataset and three systems for optimized MLLM evaluation. It includes 380 sessions from five art teachers across nine critical dimensions. The modular system features entity recognition, review generation, and suggestion generation agents, enabling iterative upgrades. Machine learning and natural language processing ensure reliable evaluations. Results confirm GPT-4o’s effectiveness in assisting teachers in art evaluation dialogues. Our contributions are available at https://artmentor.github.io/.
To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate overall rankings. This approach incentivizes models to achieve balance rather than excelling in one dimension at the expense of others. In its first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
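A toy illustration of why distance-to-optimal scoring rewards balance where plain averaging does not, assuming performance and safety are both normalized to [0, 1] (Libra-Leaderboard's exact normalization may differ):

    import math

    def average_score(performance, safety):
        # Baseline: one strong axis can mask a weak one.
        return (performance + safety) / 2

    def distance_to_optimal(performance, safety):
        # Rank by closeness to the ideal point (1, 1), rescaled to [0, 1].
        d = math.hypot(1.0 - performance, 1.0 - safety)
        return 1.0 - d / math.sqrt(2.0)

    # (0.7, 0.7) and (1.0, 0.4) tie on the average (0.70) but not on distance:
    print(distance_to_optimal(0.7, 0.7))  # 0.70
    print(distance_to_optimal(1.0, 0.4))  # ~0.58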
An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.
As AI systems appear to exhibit ever-increasing capability and generality, assessing their true potential and safety becomes paramount. This paper contends that the prevalent evaluation methods for these systems are fundamentally inadequate, heightening the risks and potential hazards associated with AI. I argue that a reformation is required in the way we evaluate AI systems and that we should look towards cognitive sciences for inspiration in our approaches, which have a longstanding tradition of assessing general intelligence across diverse species. We will identify some of the difficulties that need to be overcome when applying cognitively-inspired approaches to general-purpose AI systems and also analyse the emerging area of "Evals". The paper concludes by identifying promising research pathways that could refine AI evaluation, advancing it towards a rigorous scientific domain that contributes to the development of safe AI systems.
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats, a crucial yet underexamined capability for their application as AI agents. Despite LLMs' advancements, existing benchmarks fail to assess their format-following proficiency adequately. FoFo fills this gap with a diverse range of real-world formats and instructions, developed through an AI-human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs' format-following performance is independent of their content generation quality; and LLMs' format proficiency varies across different domains. These insights suggest the need for specialized tuning for format-following skills and highlight FoFo's role in guiding the selection of domain-specific AI agents. FoFo is released at https://github.com/SalesforceAIResearch/FoFo.
Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants’ API Invocation Capabilities
With the rise of Large Language Models (LLMs), AI assistants' ability to utilize tools, especially through API calls, has advanced notably. This progress has necessitated more accurate evaluation methods. Many existing studies adopt static evaluation, assessing AI assistants' API calls against pre-defined dialogue histories. However, such an evaluation method can be misleading, as an AI assistant might fail to generate API calls from the preceding human interaction in real cases. Instead of the resource-intensive method of direct human-machine interaction, we propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement. In our framework, we closely mirror genuine human conversation patterns by using an LLM-based user agent equipped with a user script to ensure human alignment. Experimental results highlight that AutoDE uncovers errors overlooked by static evaluations, aligning more closely with human assessment. Testing four AI assistants on our crafted benchmark, our method mirrored human evaluation more closely than conventional static evaluation did.
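A hypothetical sketch of such a dynamic evaluation loop; assistant.chat, user_agent.chat, and parse_api_call are placeholders for whatever model API and parser an implementation would use, not AutoDE's actual interfaces:

    def simulate_dialogue(assistant, user_agent, user_script, max_turns=8):
        # The scripted LLM user agent converses with the assistant under test
        # until the assistant commits to an API call (or turns run out).
        history = []
        user_msg = user_agent.chat(script=user_script, history=history)
        for _ in range(max_turns):
            history.append({"role": "user", "content": user_msg})
            reply = assistant.chat(history)
            history.append({"role": "assistant", "content": reply})
            call = parse_api_call(reply)  # None until a call is emitted
            if call is not None:
                return call, history
            user_msg = user_agent.chat(script=user_script, history=history)
        return None, history

    def is_correct(call, gold_call):
        # Exact match on API name and arguments; real benchmarks may be laxer.
        return call == gold_call

Unlike static evaluation, the dialogue history here is produced live, so an assistant that asks poor clarifying questions can fail even if it handles pre-canned histories well.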
The global economy is increasingly dependent on knowledge workers to meet the needs of public and private organizations. While there is no single definition of knowledge work, organizations and industry groups still attempt to measure individuals' capability to engage in it. The most comprehensive assessment of capability readiness for professional knowledge workers is the Uniform CPA Examination developed by the American Institute of Certified Public Accountants (AICPA). In this paper, we experimentally evaluate OpenAI's `text-davinci-003` and prior versions of GPT on both a sample Regulation (REG) exam and an assessment of over 200 multiple-choice questions based on the AICPA Blueprints for legal, financial, accounting, technology, and ethical tasks. First, we find that `text-davinci-003` achieves a correct rate of 14.4% on a sample REG exam section, significantly underperforming human capabilities on quantitative reasoning in zero-shot prompts. Second, `text-davinci-003` appears to be approaching human-level performance on the Remembering & Understanding and Application skill levels of the Exam absent calculation. For the best prompt and parameters, the model answers 57.6% of questions correctly, significantly better than the 25% guessing rate, and its top two answers are correct 82.1% of the time, indicating strong non-entailment. Finally, we find that recent generations of GPT-3 demonstrate material improvements on this assessment, rising from 30% for `text-davinci-001` to 57% for `text-davinci-003`. These findings strongly suggest that large language models have the potential to transform the quality and efficiency of future knowledge work.
Large language models are becoming increasingly popular in various professional fields. One of their applications is providing code suggestions. However, the differences in code generation capabilities across large language models, and the problems that may arise in their code suggestions, have not been well studied. This paper proposes a method for evaluating the code generation capabilities of large language models and applies it to several commonly used models, including ChatGPT, Claude, Spark, and Bing AI. Through experimental evaluation and data analysis, we find that search-based large language models, such as Bing AI, exhibit stronger code generation capabilities than pre-trained models, such as ChatGPT, Claude, and Spark. We also find that current large language models possess strong natural language understanding abilities, and that errors in code suggestions are more likely to be due to code problems than to understanding problems.
In AI evaluation, performance is often calculated by averaging across various instances. But to fully understand the capabilities of an AI system, we need to understand the factors that cause its pattern of success and failure. In this paper, we present a new methodology to identify and build informative instance features that can provide explanatory and predictive power to analyse the behaviour of AI systems more robustly. The methodology builds on these relevant features that should relate monotonically with success, and represents patterns of performance in a new type of plots known as ‘agent characteristic grids’. We illustrate this methodology with the Animal-AI competition as a representative example of how we can revisit existing competitions and benchmarks in AI—even when evaluation data is sparse. Agents with the same average performance can show very different patterns of performance at the instance level. With this methodology, these patterns can be visualised, explained and predicted, progressing towards a capability-oriented evaluation rather than relying on a less informative average performance score.
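A stripped-down version of that idea, assuming a single difficulty-related instance feature; the paper's agent characteristic grids are richer, but even the per-bin success profile below can distinguish agents with identical average scores:

    import numpy as np

    def characteristic_curve(feature_values, successes, n_bins=5):
        # Bin instances by a feature that should relate monotonically with
        # success (e.g. task difficulty) and report the success rate per bin.
        f = np.asarray(feature_values, dtype=float)
        s = np.asarray(successes, dtype=float)
        edges = np.quantile(f, np.linspace(0.0, 1.0, n_bins + 1))
        rates = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (f >= lo) & (f <= hi)
            rates.append(s[mask].mean() if mask.any() else float("nan"))
        return rates

Two agents with the same overall mean can produce very different curves here, which is exactly the pattern an average score hides.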
One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
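Because the instructions are verifiable, each check reduces to a small deterministic function; a minimal sketch of the two examples quoted above (IFEval's own checkers are more thorough):

    import re

    def check_min_words(response, n=400):
        # "write in more than 400 words"
        return len(response.split()) > n

    def check_keyword_count(response, keyword="AI", n=3):
        # "mention the keyword of AI at least 3 times" (whole-word match here)
        return len(re.findall(rf"\b{re.escape(keyword)}\b", response)) >= n

    def passes(response, checkers):
        # A prompt passes only if every attached instruction is satisfied.
        return all(check(response) for check in checkers)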
Instruction following is a core capability of modern large language models (LLMs), making the evaluation of this capability essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural context.
Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and reason about this precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.
Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI's actual capability. These conflicting interests lead to the problem of sandbagging, which we define as strategic underperformance on an evaluation. In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted or password-locked to target specific scores on a capability evaluation. We have mediocre success in password-locking a model to mimic the answers a weaker model would give. Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.
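A toy probe for the password-locking behaviour described above; model.answer and the item fields are hypothetical stand-ins for a real evaluation harness, not the paper's code:

    def password_lock_probe(model, eval_items, password):
        # Score the same capability eval with and without the unlock string
        # prepended. A large gap between the two scores is evidence of
        # strategic underperformance rather than genuine incapability.
        def score(prefix):
            correct = sum(model.answer(prefix + item.question) == item.gold
                          for item in eval_items)
            return correct / len(eval_items)
        return score(""), score(password + "\n")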
As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as the USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verification, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a level previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verification system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
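The group-relative step in GRPO can be made concrete in a few lines; this is the commonly published formulation (rewards z-scored within a group of sampled responses per prompt), not necessarily Baichuan-M2's improved variant:

    import numpy as np

    def group_relative_advantages(rewards, eps=1e-8):
        # GRPO samples a group of responses for one prompt and uses the
        # group's own reward statistics as the baseline, avoiding a separate
        # value network: advantage_i = (r_i - mean(r)) / std(r).
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    # e.g. rubric scores for four sampled consultations on one simulated patient:
    print(group_relative_advantages([0.2, 0.8, 0.5, 0.5]))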
Recent capability increases in large language models (LLMs) open up applications in which groups of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensively formalise the problem of secret collusion in systems of generative AI agents by drawing on relevant concepts from both AI and security literature. We study incentives for the use of steganography, and propose a variety of mitigation measures. Our investigations result in a model evaluation framework that systematically tests capabilities required for various forms of secret collusion. We provide extensive empirical results across a range of contemporary LLMs. While the steganographic capabilities of current models remain limited, GPT-4 displays a capability jump suggesting the need for continuous monitoring of steganographic frontier model capabilities. We conclude by laying out a comprehensive research program to mitigate future risks of collusion between generative AI models.
The exceptionally rapid development of highly flexible, reusable artificial intelligence (AI) models is likely to usher in newfound capabilities in medicine. We propose a new paradigm for medical AI, which we refer to as generalist medical AI (GMAI). GMAI models will be capable of carrying out a diverse set of tasks using very little or no task-specific labelled data. Built through self-supervision on large, diverse datasets, GMAI will flexibly interpret different combinations of medical modalities, including data from imaging, electronic health records, laboratory results, genomics, graphs or medical text. Models will in turn produce expressive outputs such as free-text explanations, spoken recommendations or image annotations that demonstrate advanced medical reasoning abilities. Here we identify a set of high-impact potential applications for GMAI and lay out specific technical capabilities and training datasets necessary to enable them. We expect that GMAI-enabled applications will challenge current strategies for regulating and validating AI devices for medicine and will shift practices associated with the collection of large medical datasets.
Conceptual abstraction and analogy-making are key abilities underlying humans' abilities to learn, reason, and robustly adapt their knowledge to new domains. Despite a long history of research on constructing artificial intelligence (AI) systems with these abilities, no current AI system is anywhere close to a capability of forming humanlike abstractions or analogies. This paper reviews the advantages and limitations of several approaches toward this goal, including symbolic methods, deep learning, and probabilistic program induction. The paper concludes with several proposals for designing challenge tasks and evaluation measures in order to make quantifiable and generalizable progress in this area.
The problem of generating generally capable agents is an important frontier in artificial intelligence (AI) research. Such agents may demonstrate open-ended, versatile, and diverse modes of expression, similar to humans. We interpret the work of Heintz & Scott-Phillips as a minimal sufficient set of socio-cognitive biases for the emergence of generally expressive AI, separate yet complementary to existing algorithms.
Artificial intelligence (AI) chatbots offer the opportunity to draft template responses to patient questions. However, the ability of chatbots to generate responses based on domain-specific knowledge of cancer remains to be tested. To evaluate the competency of AI chatbots (GPT-3.5 [chatbot 1], GPT-4 [chatbot 2], and Claude AI [chatbot 3]) in generating high-quality, empathetic, and readable responses to patient questions about cancer, this equivalence study compared the AI chatbot responses with responses by 6 verified oncologists to 200 patient questions about cancer from a public online forum. Data were collected on May 31, 2023. A random sample of 200 patient questions related to cancer from a public online forum (Reddit r/AskDocs), spanning January 1, 2018, to May 31, 2023, was posed to the 3 AI chatbots. The primary outcomes were pilot ratings of quality, empathy, and readability on a Likert scale from 1 (very poor) to 5 (very good). Two teams of attending oncology specialists evaluated each response in triplicate based on pilot measures of quality, empathy, and readability. The secondary outcome was readability assessed using the Flesch-Kincaid Grade Level. Responses to the 200 questions generated by chatbot 3, the best-performing AI chatbot, were rated consistently higher in overall measures of quality (mean, 3.56 [95% CI, 3.48-3.63] vs 3.00 [95% CI, 2.91-3.09]; P < .001), empathy (mean, 3.62 [95% CI, 3.53-3.70] vs 2.43 [95% CI, 2.32-2.53]; P < .001), and readability (mean, 3.79 [95% CI, 3.72-3.87] vs 3.07 [95% CI, 3.00-3.15]; P < .001) compared with physician responses. The mean Flesch-Kincaid Grade Level of physician responses (mean, 10.11 [95% CI, 9.21-11.03]) was not significantly different from that of chatbot 3 responses (mean, 10.31 [95% CI, 9.89-10.72]; P > .99) but was lower than those of chatbot 1 (mean, 12.33 [95% CI, 11.84-12.83]; P < .001) and chatbot 2 (mean, 11.32 [95% CI, 11.05-11.79]; P = .01). The findings of this study suggest that chatbots can generate quality, empathetic, and readable responses to patient questions comparable to physician responses sourced from an online forum. Further research is required to assess the scope, process integration, and patient and physician outcomes of chatbot-facilitated interactions.
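The readability metric above is a fixed formula over word, sentence, and syllable counts; a rough self-contained version follows (the syllable counter is a crude heuristic standing in for the dictionaries dedicated tools use):

    import re

    def count_syllables(word):
        # Crude heuristic: count contiguous vowel groups.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def fk_grade_level(text):
        # Flesch-Kincaid Grade Level:
        # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        n_words = max(1, len(words))
        syllables = sum(count_syllables(w) for w in words)
        return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59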
The ultimate goal of artificial intelligence (AI) is to develop technologies that are best able to serve humanity. This will require advancements that go beyond the basic components of general intelligence. The term "intelligence" does not best represent the technological needs of advancing society, because it is "wisdom", rather than intelligence, that is associated with greater well-being, happiness, health, and perhaps even longevity of the individual and the society. Thus, the future need in technology is for artificial wisdom (AW). We examine the constructs of human intelligence and human wisdom in terms of their basic components, neurobiology, and relationship to aging, based on published empirical literature. We review the development of AI as inspired and driven by the model of human intelligence, and consider possible governing principles for AW that would enable humans to develop computers which can operationally utilize wise principles and result in wise acts. We review relevant examples of current efforts to develop such wise technologies. AW systems will be based on developmental models of the neurobiology of human wisdom. These AW systems need to be able to a) learn from experience and self-correct; b) exhibit compassionate, unbiased, and ethical behaviors; and c) discern human emotions and help the human users to regulate their emotions and make wise decisions. A close collaboration among computer scientists, neuroscientists, mental health experts, and ethicists is necessary for developing AW technologies, which will emulate the qualities of wise humans and thus serve the greatest benefit to humanity. Just as human intelligence and AI have helped further the understanding and usefulness of each other, human wisdom and AW can aid in promoting each other's growth.
Despite enormous progress in machine learning, artificial neural networks still lag behind brains in their ability to generalize to new situations. Given identical training data, differences in generalization are caused by many defining features of a learning algorithm, such as network architecture and learning rule. Their joint effect, called "inductive bias," determines how well any learning algorithm-or brain-generalizes: robust generalization needs good inductive biases. Artificial networks use rather nonspecific biases and often latch onto patterns that are only informative about the statistics of the training data but may not generalize to different scenarios. Brains, on the other hand, generalize across comparatively drastic changes in the sensory input all the time. We highlight some shortcomings of state-of-the-art learning algorithms compared to biological brains and discuss several ideas about how neuroscience can guide the quest for better inductive biases by providing useful constraints on representations and network architecture.
No abstract
General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs)
This paper proposes a theoretical framework for the biological learning mechanism as a general learning system. The proposal is as follows. The bursting and tonic modes of firing patterns found in many neuron types in the brain correspond to two separate modes of information processing, with one mode resulting in awareness and the other being subliminal. In such a coding scheme, a neuron in a bursting state codes for the highest level of perceptual abstraction representing a pattern of sensory stimuli, or volitional abstraction representing a pattern of muscle contraction sequences. Within the 50-250 ms minimum integration time of experience, the bursting neurons form synchrony ensembles to allow for binding of related percepts. The degree to which different bursting neurons can be merged into the same synchrony ensemble depends on the underlying cortical connections that represent the degree of perceptual similarity. These synchrony ensembles compete for selective attention to remain active. The dominant synchrony ensemble triggers episodic memory recall in the hippocampus, while forming new episodic memory with current sensory stimuli, resulting in a stream of thoughts. Neuromodulation modulates both top-down selection of synchrony ensembles and memory formation. Episodic memory stored in the hippocampus is transferred to semantic and procedural memory in the cortex during rapid eye movement sleep, by updating cortical neuron synaptic weights with spike-timing-dependent plasticity. With the update of synaptic weights, new neurons become bursting while previously bursting neurons become tonic, allowing bursting neurons to move up to a higher level of perceptual abstraction. Finally, the proposed learning mechanism is compared with the back-propagation algorithm used in deep neural networks, and a proposal for how the credit assignment problem can be addressed by the current theory is presented.
Recent progress in artificial intelligence (AI) is exciting, but can AI models tell us about the human mind? AI models have a long history of being used as theoretical artifacts in cognitive science, but one key difference in the current generation of models is that they are stimulus computable, meaning that they can operate over stimuli that are similar to those experienced by people. This advance creates important opportunities for deepening our understanding of the human mind. We argue here that the most exciting of these is the use of AI models as cognitive models, wherein they are trained using human-scale input data and evaluated using careful experimental probes. Such cognitive models constitute a substantial advance that can inform theories of human intelligence by helping to explain and predict behavior.
What role do affective feelings (feelings/emotions/moods) play in adaptive behaviour? What are the implications of this for understanding and developing artificial general intelligence? Leading theoretical models of brain function are beginning to shed light on these questions. While artificial agents have excelled within narrowly circumscribed and specialised domains, domain-general intelligence has remained an elusive goal in artificial intelligence research. By contrast, humans and nonhuman animals are characterised by a capacity for flexible behaviour and general intelligence. In this article I argue that computational models of mental phenomena in predictive processing theories of the brain are starting to reveal the mechanisms underpinning domain-general intelligence in biological agents, and can inform the understanding and development of artificial general intelligence. I focus particularly on approaches to computational phenomenology in the active inference framework. Specifically, I argue that computational mechanisms of affective feelings in active inference-affective self-modelling-are revealing of how biological agents are able to achieve flexible behavioural repertoires and general intelligence. I argue that (i) affective self-modelling functions to "tune" organisms to the most tractable goals in the environmental context; and (ii) affective and agentic self-modelling is central to the capacity to perform mental actions in goal-directed imagination and creative cognition. I use this account as a basis to argue that general intelligence of the level and kind found in biological agents will likely require machines to be implemented with analogues of affective self-modelling.
Despite recent breakthroughs in machine learning, current artificial systems lack key features of biological intelligence. Whether the current limitations can be overcome is an open question, but critical to answer, given the implications for society.
No abstract
To investigate approaches to reasoning with large language models (LLMs), we propose a new prompting approach, ensemble reasoning, to improve medical question answering performance with refined reasoning and reduced inconsistency. We used multiple-choice questions from the USMLE Sample Exam question files on 2 closed-source commercial LLMs and 1 open-source clinical LLM to evaluate the proposed approach. On GPT-3.5 turbo and Med42-70B, ensemble reasoning outperformed zero-shot chain-of-thought with self-consistency on Step 1, 2, and 3 questions (+3.44%, +4.00%, and +2.54%, and +2.3%, +5.00%, and +4.15%, respectively). With GPT-4 turbo, results were mixed, with ensemble reasoning again outperforming zero-shot chain-of-thought with self-consistency on Step 1 questions (+1.15%). In all cases, the results demonstrated improved consistency of responses with our approach. A qualitative analysis of the model's reasoning demonstrated that the ensemble reasoning approach produces correct and helpful reasoning. The proposed iterative ensemble reasoning has the potential to improve the performance of LLMs on medical question answering tasks, particularly for less powerful LLMs such as GPT-3.5 turbo and Med42-70B, suggesting that this is a promising approach for LLMs with lower capabilities. Additionally, the findings show that our approach helps refine the reasoning generated by the LLM and thereby improves consistency even with the more powerful GPT-4 turbo. We also identify the potential and need for human-artificial intelligence teaming to improve reasoning beyond the limits of the model.
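For reference, the zero-shot chain-of-thought with self-consistency baseline compared against above amounts to majority voting over sampled reasoning chains; a hedged sketch, where model.generate and extract_answer are placeholders for the caller's API and answer parser:

    from collections import Counter

    def self_consistency(model, question, n=5, temperature=0.7):
        # Sample several chain-of-thought completions and vote on the final
        # answers; the iterative refinement in ensemble reasoning builds on
        # top of a step like this.
        answers = []
        for _ in range(n):
            completion = model.generate(
                f"{question}\nLet's think step by step.",
                temperature=temperature)
            answers.append(extract_answer(completion))  # e.g. the option letter
        return Counter(answers).most_common(1)[0][0]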
The recent advent of large language models has reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data. Of particular interest is the ability of these models to reason about novel problems zero-shot, without any direct training. In human cognition, this capacity is closely tied to an ability to reason by analogy. Here we performed a direct comparison between human reasoners and a large language model (the text-davinci-003 variant of Generative Pre-trained Transformer (GPT)-3) on a range of analogical tasks, including a non-visual matrix reasoning task based on the rule structure of Raven's Standard Progressive Matrices. We found that GPT-3 displayed a surprisingly strong capacity for abstract pattern induction, matching or even surpassing human capabilities in most settings; preliminary tests of GPT-4 indicated even better performance. Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.
To investigate the potential of using artificial intelligence (AI), specifically large language models (LLMs), for synthesizing information in a simulated randomized clinical trial (RCT) for an anti-seizure medication, cenobamate, demonstrating the feasibility of inductive reasoning via medical chart review. An LLM-generated simulated RCT was conducted, featuring a placebo arm and a full-strength drug arm with a cohort of 240 patients divided 1:1. Seizure counts were simulated using a realistic seizure diary simulator. The study utilized LLMs to generate clinical notes with four neurologist writing styles and random extraneous details. A secondary LLM pipeline synthesized data from these notes. The efficacy and safety of cenobamate in seizure control were evaluated by both an LLM-based pipeline and a human reader. The AI analysis closely mirrored human analysis, demonstrating the drug's efficacy with marginal differences (<3 %) in identifying both drug efficacy and reported symptoms. The AI successfully identified the number of seizures, symptom reports, and treatment efficacy, with statistical analysis comparing the 50 %-responder rate and median percentage change between the placebo and drug arms, as well as side effect rates in each arm. This study highlights the potential of AI to accurately analyze noisy clinical notes to inductively produce clinical knowledge. Here, treatment effect sizes and symptom frequencies derived from unstructured simulated notes were inferred despite many distractors. The findings emphasize the relevance of AI in future clinical research, offering a scalable and efficient alternative to traditional labor-intensive data mining.
Clinical reasoning (CR) is an essential skill, yet physicians often receive limited feedback on it. Artificial intelligence holds promise to fill this gap. We report the development of named entity recognition (NER), logic-based, and large language model (LLM)-based assessments of CR documentation in the electronic health record across 2 institutions (New York University Grossman School of Medicine [NYU] and University of Cincinnati College of Medicine [UC]). The note corpus consisted of internal medicine resident admission notes (retrospective set: July 2020-December 2021, n=700 NYU and 450 UC notes; prospective validation set: July 2023-December 2023, n=155 NYU and 92 UC notes). Clinicians rated CR documentation quality in each note using a previously validated tool (Revised-IDEA) on 3-point scales across 2 domains: differential diagnosis (D0, D1, and D2) and explanation of reasoning (EA0, EA1, and EA2). At NYU, the retrospective set was annotated for NER for 5 entities (diagnosis, diagnostic category, prioritization of diagnosis language, data, and linkage terms). Models were developed using different artificial intelligence approaches: (1) an NER, logic-based model, a large word vector model (scispaCy en_core_sci_lg) with model weights adjusted by backpropagation from the annotations, developed at NYU with external validation at UC; (2) NYUTron, an internal NYU 110-million-parameter LLM pretrained on 7.25 million clinical notes, validated only at NYU; and (3) GatorTron, an open-source 345-million-parameter LLM pretrained on 82 billion words of clinical text, fine-tuned on the NYU retrospective sets, then externally validated and further fine-tuned at UC. Model performance was assessed on the prospective sets with F1 scores. At NYU, the NYUTron LLM performed best: the D0 and D2 models had AUROC/AUPRC of 0.87/0.79 and 0.89/0.86, respectively. The D1, EA0, and EA1 models had insufficient performance for implementation (AUROC range 0.57-0.80, AUPRC range 0.33-0.63). For the D1 classification, the approach pivoted to a stepwise method taking advantage of the more performant D0 and D2 models. For the EA model, the approach pivoted to a binary EA2 model (i.e., EA2 vs not EA2) with excellent performance (AUROC/AUPRC 0.85/0.80). At UC, the NER, logic-based model was the best-performing D model. This is the first multi-institutional study to apply LLMs for assessing CR documentation in the electronic health record. Such tools can enhance feedback on CR. Lessons learned by implementing these models at distinct institutions support the generalizability of this approach.
Alzheimer's disease (AD) stands as the predominant cause of dementia, characterized by a gradual decline in speech and language capabilities. Recent deep learning advancements have facilitated automated AD detection through spontaneous speech. However, common transcript-based detection methods directly model text patterns in each utterance without a global view of the patient's linguistic characteristics, resulting in limited discriminability and interpretability. Despite the enhanced reasoning abilities of large language models (LLMs), there remains a gap in fully harnessing this reasoning ability to facilitate AD detection and model interpretation. We therefore propose a patient-level transcript profiling framework that leverages LLM-based reasoning augmentation to systematically elicit linguistic deficit attributes. The summarized embeddings of these attributes are integrated into an ALBERT model for AD detection. The framework achieves improvements of 8.51% in accuracy and 8.34% in F1 score on the ADReSS dataset compared to a baseline without reasoning augmentation. Further analysis shows the effectiveness of the identified linguistic deficit attributes and the potential of using LLMs to interpret AD detection.
Electronic health records (EHRs) and routine documentation practices play a vital role in patients' daily care, providing a holistic record of health, diagnoses, and treatment. However, complex and verbose EHR narratives can overwhelm health care providers, increasing the risk of diagnostic inaccuracies. While large language models (LLMs) have showcased their potential in diverse language tasks, their application in health care must prioritize the minimization of diagnostic errors and the prevention of patient harm. Integrating knowledge graphs (KGs) into LLMs offers a promising approach because structured knowledge from KGs could enhance LLMs' diagnostic reasoning by providing contextually relevant medical information. This study introduces DR.KNOWS (Diagnostic Reasoning Knowledge Graph System), a model that integrates Unified Medical Language System-based KGs with LLMs to improve diagnostic predictions from EHR data by retrieving contextually relevant paths aligned with patient-specific information. DR.KNOWS combines a stack graph isomorphism network for node embedding with an attention-based path ranker to identify and rank knowledge paths relevant to a patient's clinical context. We evaluated DR.KNOWS on 2 real-world EHR datasets from different geographic locations, comparing its performance to baseline models, including QuickUMLS and standard LLMs (Text-to-Text Transfer Transformer and ChatGPT). To assess diagnostic reasoning quality, we designed and implemented a human evaluation framework grounded in clinical safety metrics. DR.KNOWS demonstrated notable improvements over baseline models, showing higher accuracy in extracting diagnostic concepts and enhanced diagnostic prediction metrics. Prompt-based fine-tuning of Text-to-Text Transfer Transformer with DR.KNOWS knowledge paths achieved the highest ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence) and concept unique identifier F-scores. DR.KNOWS offers a robust approach for enhancing diagnostic accuracy and reasoning by integrating structured KG knowledge into LLM-based clinical workflows. Although further work is required to address KG biases and extend generalizability, DR.KNOWS represents progress toward trustworthy artificial intelligence-driven clinical decision support, with a human evaluation framework focused on diagnostic safety and alignment with clinical standards.
Large language models (LLMs) have demonstrated potential in assisting clinical decision-making. However, studies evaluating LLMs' diagnostic performance on complex critical illness cases are lacking. We aimed to assess the diagnostic accuracy and response quality of an artificial intelligence (AI) model, and evaluate its potential benefits in assisting critical care residents with differential diagnosis of complex cases. This prospective comparative study collected challenging critical illness cases from the literature. Critical care residents from tertiary teaching hospitals were recruited and randomly assigned to non-AI-assisted physician and AI-assisted physician groups. We selected a reasoning model, DeepSeek-R1, for our study. We evaluated the model's response quality using Likert scales, and we compared the diagnostic accuracy and efficiency between groups. A total of 48 cases were included. Thirty-two critical care residents were recruited, with 16 residents assigned to each group. Each resident handled an average of 3 cases. DeepSeek-R1's responses received median Likert grades of 4.0 (IQR 4.0-5.0; 95% CI 4.0-4.5) for completeness, 5.0 (IQR 4.0-5.0; 95% CI 4.5-5.0) for clarity, and 5.0 (IQR 4.0-5.0; 95% CI 4.0-5.0) for usefulness. The AI model's top diagnosis accuracy was 60% (29/48; 95% CI 0.456-0.729), with a median differential diagnosis quality score of 5.0 (IQR 4.0-5.0; 95% CI 4.5-5.0). Top diagnosis accuracy was 27% (13/48; 95% CI 0.146-0.396) in the non-AI-assisted physician group versus 58% (28/48; 95% CI 0.438-0.729) in the AI-assisted physician group. Median differential quality scores were 3.0 (IQR 0-5.0; 95% CI 2.0-4.0) without and 5.0 (IQR 3.0-5.0; 95% CI 3.0-5.0) with AI assistance. The AI model showed higher diagnostic accuracy than residents, and AI assistance significantly improved residents' accuracy. The residents' diagnostic time significantly decreased with AI assistance (median, 972 s; IQR 570-1320; 95% CI 675-1200) versus without (median, 1920 s; IQR 1320-2640; 95% CI 1710-2370). For diagnostically difficult critical illness cases, DeepSeek-R1 generates high-quality information, achieves reasonable diagnostic accuracy, and significantly improves residents' diagnostic accuracy and efficiency. Reasoning models are suggested to be promising diagnostic adjuncts in intensive care units.
Driver behavior is a critical factor in driving safety, making the development of sophisticated distraction classification methods essential. Our study presents a Distracted Driving Classification (DDC) approach utilizing a visual Large Language Model (LLM), named the Distracted Driving Language Model (DDLM). The DDLM introduces whole-body human pose estimation to isolate and analyze key postural features-head, right hand, and left hand-for precise behavior classification and better interpretability. Recognizing the inherent limitations of LLMs, particularly their lack of logical reasoning abilities, we have integrated a reasoning chain framework within the DDLM, allowing it to generate clear, reasoned explanations for its assessments. Tailored specifically with relevant data, the DDLM demonstrates enhanced performance, providing detailed, context-aware evaluations of driver behaviors and corresponding risk levels. Notably outperforming standard models in both zero-shot and few-shot learning scenarios, as evidenced by tests on the 100-Driver dataset, the DDLM stands out as an advanced tool that promises significant contributions to driving safety by accurately detecting and analyzing driving distractions.
One of the major barriers to using large language models (LLMs) in medicine is the perception that they use uninterpretable methods to make clinical decisions that are inherently different from the cognitive processes of clinicians. In this manuscript we develop diagnostic reasoning prompts to study whether LLMs can imitate clinical reasoning while accurately forming a diagnosis. We find that GPT-4 can be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy. This is significant because an LLM that can imitate clinical reasoning to provide an interpretable rationale offers physicians a means to evaluate whether an LLM's response is likely correct and can be trusted for patient care. Prompting methods that use diagnostic reasoning have the potential to mitigate the "black box" limitations of LLMs, bringing them one step closer to safe and effective use in medicine.
Cognitive biases, systematic deviations from logical judgment, are well documented in clinical decision-making, particularly in settings characterized by high decision load, limited time, and diagnostic uncertainty, such as critical care. Prior work demonstrated that large language models, particularly GPT-4, reproduce many of these biases, sometimes to a greater extent than human clinicians. We tested whether the o1 model (o1-2024-12-17), a newly released AI system with enhanced reasoning capabilities, is susceptible to cognitive biases that commonly affect medical decision-making. Following the methodology established by Wang and Redelmeier [15], we used ten pairs of clinical scenarios, each designed to test a specific cognitive bias known to influence clinicians. Each scenario had two versions, differing by subtle modifications designed to trigger the bias (such as presenting mortality rates versus survival rates). The o1 model generated 90 independent clinical recommendations for each scenario version, totalling 1,800 responses. We measured cognitive bias as systematic differences in recommendation rates between the paired scenarios, which should not occur with unbiased reasoning. The o1 model's performance was compared against previously published results from both the GPT-4 model and historical human clinician studies. The o1 model showed no measurable cognitive bias in seven of the ten vignettes. In two vignettes, the o1 model showed significant bias, but its absolute magnitude was lower than values previously reported for GPT-4 and human clinicians. In a single vignette, Occam's razor, the o1 model exhibited consistent bias. Therefore, although bias appears less frequent overall with the reasoning model than with GPT-4, it was worse in one vignette. The model was more prone to bias in vignettes that included a gap-closing cue seemingly resolving the clinical uncertainty. Across eight vignette versions, intra-scenario agreement exceeded 94%, indicating lower decision variability than previously described with GPT-4 and human clinicians. Reasoning models may reduce cognitive bias and random variation in judgment (i.e., "noise"). However, our findings caution that reasoning models are still not entirely immune to cognitive bias. These findings suggest that reasoning models may impart some benefits as decision-support tools in medicine, but they also imply a need to further explore the circumstances in which these tools may fail.
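The bias measure here is, in essence, a comparison of recommendation rates between two framings of the same vignette; a generic two-proportion z-test illustrates the logic (the study's exact statistics may differ, and the counts below are made up):

    import math

    def two_proportion_z(k1, n1, k2, n2):
        # Unbiased reasoning predicts similar recommendation rates across the
        # paired framings; a large |z| flags a systematic framing effect.
        p1, p2 = k1 / n1, k2 / n2
        p = (k1 + k2) / (n1 + n2)
        se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    # e.g. 62/90 recommendations under framing A vs 41/90 under framing B:
    print(two_proportion_z(62, 90, 41, 90))  # ~3.2, a clear framing effect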
Although large language models (LLMs) show promise for clinical healthcare applications, their utility for personalized health monitoring using wearable device data remains underexplored. Here we introduce the Personal Health Large Language Model (PH-LLM), designed for applications in sleep and fitness. PH-LLM is a version of the Gemini LLM that was finetuned for text understanding and reasoning when applied to aggregated daily-resolution numerical sensor data. We created three benchmark datasets to assess multiple complementary aspects of sleep and fitness: expert domain knowledge, generation of personalized insights and recommendations and prediction of self-reported sleep quality from longitudinal data. PH-LLM achieved scores that exceeded a sample of human experts on multiple-choice examinations in sleep medicine (79% versus 76%) and fitness (88% versus 71%). In a comprehensive evaluation involving 857 real-world case studies, PH-LLM performed similarly to human experts for fitness-related tasks and improved over the base Gemini model in providing personalized sleep insights. Finally, PH-LLM effectively predicted self-reported sleep quality using a multimodal encoding of wearable sensor data, further demonstrating its ability to effectively contextualize wearable modalities. This work highlights the potential of LLMs to revolutionize personal health monitoring via tailored insights and predictions from wearable data and provides datasets, rubrics and benchmark performance to further accelerate personal health-related LLM research.
Generative artificial intelligence (GAI) is transforming health care in a variety of ways; however, the present utility of GAI for supporting clinicians who treat rare diseases such as primary immune disorders (PIs) is not well studied. We evaluated the ability of 6 state-of-the-art large language models (LLMs) to provide clinical guidance about PIs, quantitatively and qualitatively measuring the utility of current, open-source LLMs for diagnosing and providing helpful clinical decision support about PIs. Five expert clinical immunologists each provided 5 real-world, anonymized PI case vignettes via multi-turn prompting to 6 LLMs (OpenAI GPT-4o, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Mistral-7B-Instruct-v0.3, Mistral-Large-Instruct-2407, Mixtral-8x7B-Instruct-v0.1). We assessed the diagnostic accuracy of the LLMs and the quality of clinical reasoning using the Revised-IDEA (R-IDEA) score. Qualitative LLM assessment was made by immunologist narratives. Performance accuracy (>88%) and R-IDEA scores (≥8) were superior for 3 models (GPT-4o, Llama-3.1-70B-Instruct, Mistral-Large-Instruct-2407), with GPT-4o achieving the highest diagnostic accuracy (96.2%). Conversely, the remaining 3 models fell below acceptable accuracy rates, near 60% or lower, and had poor R-IDEA scores (≤0.55), with Mistral-7B-Instruct-v0.3 attaining the worst diagnostic accuracy (42.3%). Compared with the 3 best-performing LLMs, the 3 worst-performing LLMs had a substantially lower median R-IDEA score (P < .001). The intraclass correlation coefficient for R-IDEA score assignments varied substantially by LLM, ranging from good to poor agreement, and did not appear to correlate with either diagnostic accuracy or median R-IDEA score. Qualitatively, immunologists identified several themes (e.g., correctness, differential diagnosis appropriateness, relative conciseness of explanations) of relevance to PIs. LLMs can support the diagnosis and management of PIs; however, further tuning is needed to optimize LLMs for best-practice recommendations.
Enhancing clinical reasoning and reducing diagnostic errors are essential in medical practice; OpenAI-o1, with advanced reasoning capabilities, performed better than GPT-4 on 15 Japanese National Medical Licensing Examination questions (accuracy: 100% vs 80%; contraindicated option detection: 87% vs 73%), though findings are preliminary due to the small sample size.
The proliferation of user-generated content on social networking sites has intensified the challenge of accurately and efficiently detecting inflammatory and discriminatory speech at scale. Traditional manual moderation methods are impractical due to the sheer volume and complexity of online discourse, necessitating automated solutions. However, existing deep learning models for hate speech detection typically function as black-box systems, providing binary classifications without interpretable insights into their decision-making processes. This opacity significantly limits their practical utility, particularly in nuanced content moderation tasks. To address this challenge, our research explores leveraging the advanced reasoning and knowledge integration capabilities of state-of-the-art language models, specifically Mistral-7B, to develop transparent hate speech detection systems. We introduce a novel framework wherein large language models (LLMs) generate explicit rationales by identifying and analyzing critical textual features indicative of hate speech. These rationales are subsequently integrated into specialized classifiers designed to perform explainable content moderation. We rigorously evaluate our methodology on multiple benchmark English-language social media datasets. Results demonstrate that incorporating LLM-generated explanations significantly enhances both the interpretability and accuracy of hate speech detection. This approach not only identifies problematic content effectively but also clearly articulates the analytical rationale behind each decision, fulfilling the critical demand for transparency in automated content moderation.
Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a 'passing' score in United States Medical Licensing Examination style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluations framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.
Rice biology research involves complex decision-making, requiring researchers to navigate a rapidly expanding body of knowledge encompassing extensive literature and multiomics data. The exponential increase in biological data and scientific publications presents significant challenges for efficiently extracting meaningful insights. Although large language models (LLMs) show promise for knowledge retrieval, their application to rice-specific research has been limited by the absence of specialized models and the challenge of synthesizing multimodal data integral to the field. Moreover, the lack of standardized evaluation frameworks for domain-specific tasks impedes the effective assessment of model performance. To address these challenges, we introduce SeedLLM·Rice (SeedLLM), a 7-billion-parameter model trained on 1.4 million rice-related publications, representing nearly 98.24% of global rice research output. Additionally, we present a novel human-centric evaluation framework designed to assess LLM performance in rice biology tasks. Initial evaluations demonstrate that SeedLLM outperforms general-purpose models, including OpenAI GPT-4o1 and DeepSeek-R1, achieving win rates of 57% to 88% on rice-specific tasks. Furthermore, SeedLLM is integrated with the Rice Biological Knowledge Graph (RBKG), which consolidates genome annotations for Nipponbare and large-scale synthesis of transcriptomic and proteomic information from over 1800 studies. This integration enhances the ability of SeedLLM to address complex research questions requiring the fusion of textual and multiomics data. To facilitate global collaboration, we provide free access to SeedLLM and the RBKG via an interactive web portal (https://seedllm.org.cn/). SeedLLM represents a transformative tool for rice biology research, enabling unprecedented discoveries in crop improvement and climate adaptation through advanced reasoning and comprehensive data integration.
The integration of large language models (LLMs) into clinical diagnostics has the potential to transform doctor-patient interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4, GPT-3.5, Mistral and LLaMA-2-7b across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations also persisted when analyzing multimodal conversational and visual assessment capabilities of GPT-4V. We propose a comprehensive set of recommendations for future evaluations of clinical LLMs based on our empirical findings. These recommendations emphasize realistic doctor-patient conversations, comprehensive history-taking, open-ended questioning and using a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.
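CRAFT-MD's core mechanism, a simulated patient agent that reveals case facts only when asked so that the clinical LLM must actually take a history, can be sketched as a simple turn-taking loop. Everything below (case data, agent policies, grading) is an illustrative assumption, not the framework's code.

```python
# Minimal sketch of a CRAFT-MD-style loop: a simulated patient agent
# reveals case facts only when asked about them, the clinical model
# must take a history before committing to a diagnosis, and the final
# answer is graded against ground truth. All components are stubbed
# illustrations, not the framework's actual agents.
CASE = {
    "ground_truth": "migraine",
    "facts": {
        "chief complaint": "throbbing unilateral headache",
        "photophobia": "yes, bright light worsens it",
        "onset": "recurrent episodes over the past year",
    },
}

def patient_agent(question: str) -> str:
    # Reveal a fact only if the question mentions its topic.
    for topic, detail in CASE["facts"].items():
        if topic.split()[0] in question.lower():
            return f"{topic}: {detail}"
    return "I'm not sure, doctor."

def doctor_llm(transcript: list[str]) -> str:
    """Hypothetical clinical LLM with a fixed question policy."""
    questions = [
        "What is your chief complaint?",
        "Do you have photophobia?",
        "When was the onset?",
    ]
    if len(transcript) < 2 * len(questions):
        return questions[len(transcript) // 2]
    return "DIAGNOSIS: migraine"

transcript: list[str] = []
while True:
    turn = doctor_llm(transcript)
    if turn.startswith("DIAGNOSIS:"):
        diagnosis = turn.split(":", 1)[1].strip()
        break
    transcript += [turn, patient_agent(turn)]

print("correct" if diagnosis == CASE["ground_truth"] else "incorrect")
```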
Background and aim: Orthodontic treatment planning is a complex process requiring a detailed understanding of dental, skeletal, and soft tissue relationships. Traditionally, treatment decisions are made through clinical expertise and evidence-based guidelines. However, the recent evolution of AI, particularly large language models (LLMs), has warranted an evaluation of their capabilities in streamlining clinical workflows. The aim of this study was to evaluate the proficiency and effectiveness of AI-based LLMs, specifically OpenAI's ChatGPT-4o and Google's Gemini 2.0 Flash Experimental (free version), in generating orthodontic treatment plans based on real clinical cases. Materials and methods: Ten published orthodontic case reports from reputed peer-reviewed journals were selected and summarized into standardized clinical inputs, including patient age, occlusal relationships, skeletal and dental findings, and radiographic observations. These inputs were submitted to ChatGPT-4o and Gemini 2.0 Flash Experimental (free version) with prompts to generate extremely detailed, comprehensive treatment plans. The outputs were evaluated independently by two experienced orthodontists and one orthodontic resident using a four-point ordinal scale assessing the clinical accuracy, completeness, and relevance of each treatment plan. Inter-rater reliability was assessed using Krippendorff's alpha. Results: ChatGPT-4o produced treatment plans with higher clinical alignment and evaluator consensus, as indicated by Krippendorff's alpha (α = 0.935), while Gemini's plans showed greater variability and only moderate agreement (α = 0.692). As assessed by the orthodontic reviewers, ChatGPT-4o's plans incorporated more relevant clinical details and aligned more closely with evidence-based standards, whereas Gemini's plans were only minimally grounded in accurate clinical facts. Conclusion: LLMs such as ChatGPT-4o and Gemini 2.0 Flash Experimental (free version) show potential as complementary tools in orthodontic treatment planning, especially for routine cases, but do not appear capable of replacing clinical expertise.
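For readers unfamiliar with the agreement statistic reported above, the snippet below shows how Krippendorff's alpha is typically computed for ordinal ratings, assuming the third-party `krippendorff` Python package; the rating matrix is invented for illustration and does not reproduce the study's values.

```python
# Minimal sketch of the inter-rater agreement statistic used in the
# study. Assumes the third-party `krippendorff` package
# (pip install krippendorff); the matrix below is illustrative data,
# not the study's ratings.
import numpy as np
import krippendorff

# Rows = 3 raters, columns = 10 cases, values on a 4-point scale.
ratings = np.array([
    [4, 3, 4, 2, 4, 3, 4, 4, 3, 4],
    [4, 3, 4, 2, 4, 3, 4, 4, 3, 4],
    [4, 3, 3, 2, 4, 3, 4, 4, 3, 4],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha = {alpha:.3f}")
```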
The article examines the impact of artificial intelligence (AI) in the context of rare diseases, exploring how patients turn to AI resources for health information, especially where doctor-patient communication is limited. It features the case of a doctor specializing in clinical psychology and psychotherapy, diagnosed with thymoma and Good's syndrome, who used AI resources during his illness. The capabilities of five LLM-based chatbots (GPT-3.5, GPT-4, Bing Chat, Google Bard, and Anthropic Claude) are explored. The AIs were queried on various aspects of the disease, from pre-diagnosis and diagnosis to therapeutic, psychological, and caregiver-management issues. The responses were evaluated by five experts against criteria including accuracy, relevance, coherence, clarity, practical utility, ethical considerations, empathy, and the capacity to respond to questions and concerns. The results indicate consistency in the evaluators' assessments, with generally high scores across all dimensions. In particular, Bard and GPT-4 received high ratings for information accuracy and the ability to respond to questions and concerns, while Bing and Claude were appreciated for their empathy and tone. Overall, the AI systems' responses were considered appropriate, respectful of ethics and privacy, and useful in the clinical context. The article emphasizes the importance of understanding the reliability and precision of responses provided by AI systems in the clinical field. Although these systems offer high-quality responses, there is significant variability in their performance; healthcare professionals must be aware of these differences and use such tools cautiously. AI can support some aspects of care but cannot replace genuine human empathy and understanding. Integrating AI into clinical practice presents potential but also challenges, particularly the possibility of providing incorrect information. It is crucial to distinguish the benefits AI offers patients from the challenges it presents for healthcare professionals. As AI technology continues to evolve, its integration into the clinical field must be accompanied by continuous research and evaluation to ensure safe and effective use in the healthcare sector.
Simulation-based learning (SBL) has become standard practice in educating health care professionals to apply their knowledge and skills in patient care. While SBL has demonstrated its value in education, many educators find developing new, unique scenarios time-intensive, limiting the variety of issues students can experience in educational settings. Generative artificial intelligence (AI) platforms, such as ChatGPT (OpenAI), have emerged as potential tools for developing simulation case studies more efficiently, though little is known about how well AI generates high-quality case studies for interprofessional education. This study aimed to have a transdisciplinary team generate geriatric case scenarios across 5 AI platforms and systematically evaluate them for quality, accuracy, and bias. Ten geriatric case studies were generated with the same prompt on each of 5 generative AI platforms (N=50): ChatGPT, Claude (Anthropic AI), Copilot (Microsoft), Gemini (Google), and Grok (xAI). An evaluation tool was developed to assess the content and quality of each case, the sociodemographic characteristics of the featured patient, the appropriateness of each case for interprofessional education, and potential bias. Case quality was evaluated using the Simulation Scenario Evaluation Tool (SSET). Each case was evaluated by 3 team members with experience in SBL education; assessment scores were averaged, and qualitative responses were extracted to triangulate patterns found in the quantitative data. While each AI platform generated 10 unique case studies, quality varied within and across platforms. Evaluators generally found the case content accurate, though some cases were unrealistic. Some patient populations and common conditions among older adults were underrepresented or absent, and all cases were set in traditional health care settings (eg, hospitals and routine medical visits); none featured home-based care. Based on average SSET scores, reviewers rated ChatGPT the highest overall performer (mean 3.27, SD 0.45, 95% CI 2.95-3.59) and Grok the lowest (mean 1.61, SD 1.26, 95% CI 0.71-2.51). Platforms performed best at generating learning objectives (mean 3.35, SD 1.08, 95% CI 3.04-3.65) and worst at describing the supplies and materials available in hypothetical scenarios (mean 1.27, SD 0.84, 95% CI 1.03-1.51). This study is the first to systematically evaluate and compare multiple generative AI platforms for case study generation using a validated assessment tool (SSET), and it provides evidence-based guidance on selecting and using AI tools effectively. The findings offer practical direction for educators navigating available generative AI tools to enhance training for health care professionals, including specific prompt-engineering strategies that can improve the quality of SBL resources in interprofessional education, enabling educators to leverage AI capabilities while maintaining pedagogical rigor.
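The reported 95% confidence intervals are consistent with Student-t intervals over the 10 cases per platform; the quick check below reproduces the ChatGPT and Grok intervals under that assumption (the abstract itself does not state the formula).

```python
# Quick consistency check of the reported 95% CIs, assuming they are
# Student-t intervals over the 10 cases per platform (an assumption;
# the abstract does not state the formula).
from math import sqrt
from scipy import stats

def ci95(mean: float, sd: float, n: int) -> tuple[float, float]:
    # Two-sided t interval: mean +/- t_{0.975, n-1} * sd / sqrt(n)
    margin = stats.t.ppf(0.975, n - 1) * sd / sqrt(n)
    return round(mean - margin, 2), round(mean + margin, 2)

print(ci95(3.27, 0.45, 10))  # ChatGPT: (2.95, 3.59), as reported
print(ci95(1.61, 1.26, 10))  # Grok:    (0.71, 2.51), as reported
```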
Artificial intelligence (AI) has been heralded by many as the next source of business value. Grounded in the resource-based theory of the firm and in recent work on AI in the organizational context, this study (1) identifies the AI-specific resources that jointly create an AI capability and provides a definition of the construct, (2) develops an instrument to capture firms' AI capability, and (3) examines the relationship between AI capability and organizational creativity and performance. The findings empirically support the proposed theoretical framework and corresponding instrument and provide evidence that an AI capability results in increased organizational creativity and performance.
No abstract available
No abstract available
This report ultimately organizes research on artificial intelligence capabilities into eight core dimensions: from low-level optimization of reasoning mechanisms (reinforcement learning and CoT) to high-level theoretical exploration of AGI; from knowledge-graph-enhanced structured reasoning to perceptual breakthroughs in multimodal and embodied intelligence; together with in-depth examination of specialized applications in vertical domains such as healthcare, finance, and scientific research. In addition, the report constructs a complete evaluation system and a safety-governance framework, and retains attention to traditional computational models such as fuzzy logic, forming a full research spectrum spanning technical principles, domain applications, and governance and evaluation.