从提示词工程到可视化交互
提示词工程的自动化框架与体系化管理
该组文献探讨了如何将提示词设计从经验技巧转向系统工程,涵盖了自动化优化策略(如强化学习、压缩技术)、IDE集成管理、软件工程原则的应用以及形式化理论框架。
- RECAP-Reinforced, Explainable, and Cost-Aware Prompting: A Framework for Understandable Prompt Optimization Based on Cognitive Science(Raghupathi Appala, Maanasa Kotte, Pallavi Tejaswi Kakaraparthi, 2025, International Journal For Multidisciplinary Research)
- Interactive Learning for LLM Reasoning(Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yang, Juepeng Zheng, Chengwei Qin, 2025, ArXiv Preprint)
- Meta Prompting for AI Systems(Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao, 2023, ArXiv Preprint)
- When Prompt Engineering Meets Software Engineering: CNL-P as Natural and Robust "APIs" for Human-AI Interaction(Zhenchang Xing, Yang Liu, Zhuo Cheng, Qing Huang, Dehai Zhao, Daniel Sun, Chenhua Liu, 2025, ArXiv)
- Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates(Hui Wei, Shenghua He, Tian Xia, Fei Liu, Andy Wong, Jingyang Lin, Mei Han, 2024, ArXiv Preprint)
- Prompt-with-Me: in-IDE Structured Prompt Management for LLM-Driven Software Engineering(Ziyou Li, Agnia Sergeyuk, Maliheh Izadi, 2025, ArXiv Preprint)
- Reporting LLM Prompting in Automated Software Engineering: A Guideline Based on Current Practices and Expectations(Alexander Korn, Lea Zaruchas, Chetan Arora, Andreas Metzger, Sven Smolka, Fanyu Wang, Andreas Vogelsang, 2026, ArXiv Preprint)
- Foundation Model Engineering: Engineering Foundation Models Just as Engineering Software(Dezhi Ran, Mengzhou Wu, Wei Yang, Tao Xie, 2024, ArXiv Preprint)
- Prompt Engineering Guidelines for Using Large Language Models in Requirements Engineering(Krishna Ronanki, Simon Arvidsson, Johan Axell, 2025, ArXiv Preprint)
- Promptware Engineering: Software Engineering for Prompt-Enabled Systems(Zhenpeng Chen, Chong Wang, Weisong Sun, Xuanzhe Liu, Jie M. Zhang, Yang Liu, 2025, ArXiv Preprint)
- Universal Conditional Logic: A Formal Language for Prompt Engineering(Anthony Mikinka, 2025, ArXiv Preprint)
- RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning(Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric P. Xing, Zhiting Hu, 2022, ArXiv Preprint)
- PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models(Jinyi Li, Yihuai Lan, Lei Wang, Hao Wang, 2024, ArXiv Preprint)
- How and Where to Translate? The Impact of Translation Strategies in Cross-lingual LLM Prompting(Aman Gupta, Yingying Zhuang, Zhou Yu, Ziji Zhang, Anurag Beniwal, 2025, ArXiv Preprint)
- Optimizing Human-AI Interaction: Innovations in Prompt Engineering(Rushali Deshmukh, R. Raut, M. Bhavsar, S. Gurav, Y. Patil, 2025, 2025 3rd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT))
- Generative Query Reformulation Using Ensemble Prompting, Document Fusion, and Relevance Feedback(Kaustubh D. Dhole, Ramraj Chandradevan, Eugene Agichtein, 2024, ArXiv Preprint)
- Achieving Tool Calling Functionality in LLMs Using Only Prompt Engineering Without Fine-Tuning(Shengtao He, 2024, ArXiv)
思维链推理的可解释性分析与动态控制
此类文献深入挖掘大模型思维链(CoT)的内部机制,通过识别思维锚点、监控推理进度、自我纠错及可视化推理路径,提升模型逻辑的透明度、可靠性与可控性。
- Generating Descriptive Explanations of Machine Learning Models Using LLM(A. Pang, Hyeju Jang, Shiaofen Fang, 2024, 2024 IEEE International Conference on Big Data (BigData))
- Visualizing the Chain of Thought in Large Language Models(B. Ilgen, Georges Hattab, T. Rhyne, 2026, IEEE Computer Graphics and Applications)
- Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance(Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot, 2023, ArXiv Preprint)
- Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics(Sheldon Yu, Yuxin Xiong, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, Julian J. McAuley, 2025, No journal)
- Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process(Zhenyu Zhang, Shujian Zhang, John Lambert, Wenxuan Zhou, Zhangyang Wang, Mingqing Chen, Andrew Straiton Hard, Rajiv Mathews, Lun Wang, 2025, ArXiv)
- Understanding Reasoning in Chain-of-Thought from the Hopfieldian View(Lijie Hu, Liang Liu, Shu Yang, Xin Chen, Zhen Tan, Muhammad Asif Ali, Mengdi Li, Di Wang, 2024, ArXiv Preprint)
- AprèsCoT: Explaining LLM Answers with Knowledge Graphs and Chain of Thought(Moein Shirdel, Joel Rorseth, P. Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta, 2025, No journal)
- AutoCrit: A Meta-Reasoning Framework for Self-Critique and Iterative Error Correction in LLM Chains-of-Thought(Yinghao Sang, 2025, 2025 6th International Conference on Machine Learning and Computer Application (ICMLCA))
- Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs(Roy Eisenstadt, Itamar Zimerman, Lior Wolf, 2025, ArXiv)
- Thought Anchors: Which LLM Reasoning Steps Matter?(Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy, 2025, ArXiv)
- Layered Chain-of-Thought Prompting for Multi-Agent LLM Systems: A Comprehensive Approach to Explainable Large Language Models(Manish Sanwal, 2025, ArXiv)
- DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs(Shidong Cao, Hongzhan Lin, Yuxuan Gu, Ziyang Luo, Jing Ma, 2026, ArXiv Preprint)
- SIM-CoT: Supervised Implicit Chain-of-Thought(Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiao-wen Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin, 2025, ArXiv)
多模态视觉提示与交织推理技术
研究如何通过视觉线索(如热力图、坐标、草图)与文本提示的深度融合,构建“视觉思维链”,增强多模态模型在复杂空间、图像理解及跨模态任务中的推理能力。
- Attention Prompting on Image for Large Vision-Language Models(Runpeng Yu, Weihao Yu, Xinchao Wang, 2024, ArXiv Preprint)
- Prompt–RSVQA: Prompting visual context to a language model for Remote Sensing Visual Question Answering(C. Chappuis, Valérie Zermatten, Sylvain Lobry, B. L. Saux, D. Tuia, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Beyond Embeddings: The Promise of Visual Table in Visual Reasoning(Yiwu Zhong, Zi-Yuan Hu, Michael R. Lyu, Liwei Wang, 2024, ArXiv Preprint)
- Automated Prompt Generation for Creative and Counterfactual Text-to-image Synthesis(Aleksa Jelaca, Ying Jiao, Chang Tian, Marie-Francine Moens, 2025, ArXiv Preprint)
- Visual Agentic Reinforcement Fine-Tuning(Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang, 2025, ArXiv Preprint)
- Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models(Sangmin Woo, Kang Zhou, Yun Zhou, Shuai Wang, Sheng Guan, Haibo Ding, Lin Lee Cheong, 2025, No journal)
- CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update(Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li, 2023, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought(Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, Libo Qin, 2025, ArXiv Preprint)
- VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning(Lingxiao Li, Yifan Wang, Xinyan Gao, Chen Tang, Xiangyu Yue, Chenyu You, 2025, ArXiv Preprint)
- MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning(Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li, 2025, ArXiv Preprint)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning(Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Hao Zhang, Chuang Gan, 2023, ArXiv)
- March in Chat: Interactive Prompting for Remote Embodied Referring Expression(Yanyuan Qiao, Yuankai Qi, Zheng Yu, J. Liu, Qi Wu, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- Improving Visual Object Tracking through Visual Prompting(Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin, 2024, ArXiv Preprint)
- Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models(Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang, 2023, ArXiv)
- Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought(Kesen Zhao, Beier Zhu, Junbao Zhou, Xingyu Zhu, Zhongqi Yue, Hanwang Zhang, 2026, ArXiv Preprint)
- S-Chain: Structured Visual Chain-of-Thought For Medicine(Khai Le-Duc, Duy M. H. Nguyen, Phuong T. H. Trinh, Tien-Phat Nguyen, Nghiem T. Diep, An Ngo, Tung Vu, Trinh Vuong, Anh-Tien Nguyen, Mau Nguyen, Van Trung Hoang, Khai-Nguyen Nguyen, Hy Nguyen, Chris Ngo, Anji Liu, Nhat Ho, Anne-Christin Hauschild, Khanh Xuan Nguyen, Thanh Nguyen-Tang, Pengtao Xie, Daniel Sonntag, James Zou, Mathias Niepert, Anh Totti Nguyen, 2025, ArXiv Preprint)
- Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings(Daniel Rose, Vaishnavi Himakunthala, Andy Ouyang, Ryan He, Alex Mei, Yujie Lu, Michael Saxon, Chinmay Sonar, Diba Mirza, William Yang Wang, 2023, ArXiv Preprint)
- Charts-of-Thought: Enhancing LLM Visualization Literacy Through Structured Data Extraction(Amit Kumar Das, Mohammad Tarun, Klaus Mueller, 2025, ArXiv Preprint)
- ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization(Mengsha Liu, Daoyuan Chen, Yaliang Li, Guian Fang, Ying Shen, 2024, No journal)
- ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models(Xiwei Liu, Yulong Li, Xinlin Zhuang, Xuhui Li, Jianxu Chen, Haolin Yang, Imran Razzak, Yutong Xie, 2026, ArXiv Preprint)
- Exploring Text-Guided Information Fusion Through Chain-of-Reasoning for Pansharpening(Xueheng Li, Xuanhua He, Ke Cao, Jie Zhang, Chengjun Xie, Man Zhou, Danfeng Hong, Bo Huang, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- A Multiview‐Integrated Framework for Traffic Scene Understanding Based on YOLO and LLM(Yixuan Zhao, Tianwen Ma, Zihe Wang, Ziyu Zhang, Chenxi Li, Shuai Liu, Zhiyong Cui, Mengqi Lv, Haiyang Yu, Z. Peng, 2026, Journal of Advanced Transportation)
- Watch Wider and Think Deeper: Collaborative Cross-modal Chain-of-Thought for Complex Visual Reasoning(Wenting Lu, Didi Zhu, Tao Shen, Donglin Zhu, Ayong Ye, Chao Wu, 2026, ArXiv Preprint)
- When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought(Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye, 2025, ArXiv)
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought(Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli'c, Furu Wei, 2025, ArXiv)
- Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios(Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, Alexandre Alahi, 2025, ArXiv Preprint)
- OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning(Zhao-yu Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng, 2025, ArXiv)
交互式可视化界面生成与分析系统
这些文献聚焦于利用LLM自动生成图形用户界面(GUI)、交互式仪表盘及可视化分析工具,支持用户通过自然语言与直接操作相结合的方式探索数据、微调模型并实现低代码开发。
- ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing(Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, Elena L. Glassman, 2023, Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems)
- PromptAid: Visual Prompt Exploration, Perturbation, Testing and Iteration for Large Language Models(Aditi Mishra, Bretho Danzy, Utkarsh Soni, Anjana Arunkumar, Jinbin Huang, Bum Chul Kwon, Chris Bryan, 2025, IEEE Transactions on Visualization and Computer Graphics)
- Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models(Hendrik Strobelt, Albert Webson, Victor Sanh, Benjamin Hoover, Johanna Beyer, H. Pfister, Alexander M. Rush, 2022, IEEE Transactions on Visualization and Computer Graphics)
- PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation(Yingchaojie Feng, Xingbo Wang, Kam Kwai Wong, Sijia Wang, Yuhong Lu, Minfeng Zhu, Baicheng Wang, Wei Chen, 2023, ArXiv Preprint)
- GeoPet: Interactive Prompt Engineering for Enhancing Tool Calling of Large Language Models in Geospatial Tasks(Xuan Guo, Chenchen Gao, Weifan Niu, Xinzong Wei, Chengqi Hua, Junnan Liu, Mingliang Xu, 2025, 2025 IEEE 18th Pacific Visualization Conference (PacificVis))
- Low-code LLM: Visual Programming over LLMs(Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu, Wang You, Ting Song, Yan Xia, Jonathan Tien, Nan Duan, 2023, ArXiv)
- Prompt Sapper: A LLM-Empowered Production Tool for Building AI Chains(Yu Cheng, Jieshan Chen, Qing Huang, Zhenchang Xing, Xiwei Xu, Qinghua Lu, 2023, ACM Transactions on Software Engineering and Methodology)
- Zero-Shot Prompting Approaches for LLM-based Graphical User Interface Generation(Kristian Kolthoff, Felix Kretzer, Lennart Fiebig, Christian Bartelt, Alexander Maedche, Simone Paolo Ponzetto, 2024, ArXiv)
- InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions(Juntong Chen, Jiang Wu, Jiajing Guo, Vikram Mohanty, Xueming Li, Jorge Piazentin Ono, Wenbin He, Liu Ren, Dongyu Liu, 2025, Computer Graphics Forum)
- NL2INTERFACE: Interactive Visualization Interface Generation from Natural Language Queries(Yiru Chen, Ryan Li, Austin Mac, Tianbao Xie, Tao Yu, Eugene Wu, 2022, ArXiv Preprint)
- ChatGraPhT: A Visual Conversation Interface for Multi-Path Reflection with Agentic LLM Support(G. Kimm, Linus Tan, 2025, ArXiv)
- MisVisFix: An Interactive Dashboard for Detecting, Explaining, and Correcting Misleading Visualizations using Large Language Models(Amit Kumar Das, Klaus Mueller, 2025, ArXiv Preprint)
- SeaView: Software Engineering Agent Visual Interface for Enhanced Workflow(Timothy Bula, Saurabh Pujar, Luca Buratti, Mihaela A. Bornea, Avi Sil, 2025, ArXiv)
- A Visualization System for LLM-based Fault Entity Extraction of Power Equipment(Li-gang You, Xiaogang Guo, Dou Wang, Zhenwei Zhang, Kai Zhong, Nan Chen, Heyu Wang, 2025, 2025 IEEE 2nd International Conference on Big Data Science and Engineering (ICBDSE))
- Enhancing Interaction with Large Language Models: A Catalog of Prompt Engineering Techniques(Aniruddha Girish Pai, 2025, 2025 International Conference on Computing Technologies (ICOCT))
- VisPath: Automated Visualization Code Synthesis via Multi-Path Reasoning and Feedback-Driven Optimization(Wonduk Seo, Seungyong Lee, Daye Kang, Zonghao Yuan, Seunghyun Lee, 2025, ArXiv)
- PromptAid: Prompt Exploration, Perturbation, Testing and Iteration using Visual Analytics for Large Language Models(Aditi Mishra, Utkarsh Soni, Anjana Arunkumar, Jinbin Huang, Bum Chul Kwon, Chris Bryan, 2023, ArXiv)
- Visualizing Program Behavior: A Study of Enhanced Program Diagrams Using LLM(Ying Li, Runze Yang, ShiJie Gui, Peng Shi, Xuefei Huang, Da Yang, Xiaozhou Zhang, Yiming Gai, 2024, 2024 IEEE Frontiers in Education Conference (FIE))
- POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models(Jianben He, Xingbo Wang, Shiyi Liu, Guande Wu, Claudio Silva, Huamin Qu, 2024, 2025 IEEE 18th Pacific Visualization Conference (PacificVis))
- Interactive Reasoning: Visualizing and Controlling Chain-of-Thought Reasoning in Large Language Models(Rock Yuren Pang, K. Feng, Shangbin Feng, Chu Li, Weijia Shi, Yulia Tsvetkov, Jeffrey Heer, Katharina Reinecke, 2025, Proceedings of the 31st International Conference on Intelligent User Interfaces)
- DeepVIS: Bridging Natural Language and Data Visualization Through Step-Wise Reasoning(Zhihao Shuai, Boyan Li, Siyu Yan, Yuyu Luo, Weikai Yang, 2025, IEEE Transactions on Visualization and Computer Graphics)
- Analyzing the Sensitivity of Prompt Engineering Techniques in Natural Language Interfaces for 2.5D Software Visualization(Daniel Atzberger, Adrian Jobst, M. Tytarenko, Willy Scheibel, Jürgen Döllner, Tobias Schreck, 2025, Companion Proceedings of the ACM on Web Conference 2025)
- Prompt4VFD: A Visual Analytic System for Fault Diagnosis of Power Equipment with LLMs(Jin Wen, Xiaogang Guo, Xiangyu Zhu, Jiehang Cao, Qin Guo, Zhenwei Zhang, Heyu Wang, 2025, 2025 IEEE 2nd International Conference on Big Data Science and Engineering (ICBDSE))
- ComViewer: An Interactive Visual Tool to Help Viewers Seek Social Support in Online Mental Health Communities(Shiwei Wu, Mingxiang Wang, Chuhan Shi, Zhenhui Peng, 2024, Proceedings of the ACM on Human-Computer Interaction)
- Graphologue: Exploring Large Language Model Responses with Interactive Diagrams(Peiling Jiang, Jude Rayan, Steven W. Dow, Haijun Xia, 2023, Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology)
- Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning(Kaviraj Pather, E. Hadjigeorgiou, Arben Krasniqi, C. Schmit, I. Rusu, M. Pons, Kabir Khan, 2025, ArXiv)
- Psy-Copilot: Visual Chain of Thought for Counseling(Keqi Chen, Zekai Sun, Huijun Lian, Yingming Gao, Ya Li, 2025, ArXiv)
- Interactive Exploration and Explanation of Spatio-Temporal Anomalies with Graph-LLM Integration(Juanpablo Heredia, Leighton Leandro Estrada-Rayme, Jeremy Matos-Cangalaya, Jorge Poco, 2025, 2025 38th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI))
- Low-code LLM: Graphical User Interface over Large Language Models(Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu, Wang You, Ting Song, Yan Xia, Jonathan Tien, Nan Duan, Furu Wei, 2023, No journal)
垂直领域智能体与复杂任务自动化
展示了LLM在医疗诊断、金融分析、工业机器人、科学实验及具身智能等特定领域的深度应用,强调了领域知识与多智能体协作在解决复杂工作流中的关键作用。
- An Efficient Voice-Interactive Grasping Method for Humanoid Robots Based on LLM(Shiming Yang, Yiwen Liu, Yiwen Zhan, Changhai Zha, Zhiyong Huang, Jason Gu, 2025, 2025 37th Chinese Control and Decision Conference (CCDC))
- Enhancing visual-LLM for construction site safety compliance via prompt engineering and Bi-stage retrieval-augmented generation(Koi Xiaowen Guo, Peter Kok-Yiu Wong, Jack C. P. Cheng, Chak-Fu Chan, P. Leung, Xingyu Tao, 2025, Automation in Construction)
- MetaOpenFOAM 2.0: Large Language Model Driven Chain of Thought for Automating CFD Simulation and Post-Processing(Yuxuan Chen, Xu Zhu, Hua Zhou, Zhuyin Ren, 2025, ArXiv)
- LLM-based ambiguity detection in natural language instructions for collaborative surgical robots(Ana Davila, Jacinto Colan, Yasuhisa Hasegawa, 2025, ArXiv Preprint)
- An End-to-End Large Model Framework of Wearable Augmented Vision Device for the Visually Impaired(Yang Song, Zhijun Li, Yu Kang, Guoxin Li, Haisheng Xia, 2025, IEEE Transactions on Automation Science and Engineering)
- LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology(Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, Rafael Ferreira da Silva, 2025, ArXiv Preprint)
- Language-Vision Embodied Agents in Robotic Systems for Industrial Sorting(Jiahao Zhang, Junsuo Qu, Dan Yang, Shiwen Chen, Jiale Chen, Jinghui Chao, Peng Li, 2025, 2025 International Conference on Meta-Networking (MEET))
- Chat2Layout: Interactive 3D Furniture Layout With a Multimodal LLM(Can Wang, Hongliang Zhong, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao, 2024, IEEE Transactions on Visualization and Computer Graphics)
- SageCopilot: An LLM-Empowered Autonomous Agent for Data Science as a Service(Yuan Liao, Jiang Bian, Yuhui Yun, Shuo Wang, Yubo Zhang, Jiaming Chu, Tao Wang, Yuchen Li, Xuhong Li, Shilei Ji, Haoyi Xiong, 2026, IEEE Transactions on Services Computing)
- Domain-Specific Interactive Prompting for Generalized Nuclei Classification(Binbin Zheng, Aiqiu Wu, Kai Fan, Ao Li, Minghui Wang, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- An Intelligent LLM-Powered Personalized Assistant for Digital Banking Using LangGraph and Chain of Thoughts(Md. Easin Arafat, Sourav Saha, Tamás Orosz, 2024, 2024 IEEE 22nd Jubilee International Symposium on Intelligent Systems and Informatics (SISY))
- Prompt and circumstance: A word-by-word LLM prompting approach to interlinear glossing for low-resource languages(Micha Elsner, David Y. Liu, 2025, ArXiv)
- VOICE: Visual Oracle for Interaction, Conversation, and Explanation(Donggang Jia, Alexandra Irger, Ondřej Strnad, Johanna Björklund, A. Ynnerman, I. Viola, 2023, IEEE Transactions on Visualization and Computer Graphics)
- Chat, Summary and Diagnosis: A LLM - Enhanced Conversational Agent for Interactive Depression Detection(Xiaoheng Zhang, Weigang Cui, Junjie Wang, Yang Li, 2024, 2024 4th International Conference on Industrial Automation, Robotics and Control Engineering (IARCE))
- DocCHA: Towards LLM-Augmented Interactive Online diagnosis System(Xinyi Liu, Dachun Sun, Yi R. Fung, Dilek Hakkani-Tur, Tarek F. Abdelzaher, 2025, ArXiv)
- MedPromptExtract (Medical Data Extraction Tool): Anonymization and Hi-fidelity Automated data extraction using NLP and prompt engineering(Roomani Srivastava, Suraj Prasad, Lipika Bhat, Sarvesh Deshpande, Barnali Das, Kshitij Jadhav, 2024, ArXiv)
- I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots(Giulio Antonio Abbo, Tony Belpaeme, 2023, 2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI))
- FathomGPT: A Natural Language Interface for Interactively Exploring Ocean Science Data(Nabin Khanal, Chun Meng Yu, Jui-Cheng Chiu, Anav Chaudhary, Ziyue Zhang, Kakani Katija, Angus G. Forbes, 2024, ArXiv Preprint)
- RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit(Jiongnan Liu, Jiajie Jin, Zihan Wang, Jiehan Cheng, Zhicheng Dou, Ji-Rong Wen, 2023, ArXiv Preprint)
- MedPromptExtract (Medical Data Extraction Tool): Anonymization and High-Fidelity Automated Data Extraction Using Natural Language Processing and Prompt Engineering.(Roomani Srivastava, Lipika Bhat, Suraj Prasad, Sarvesh Deshpande, Barnali Das, Kshitij Jadhav, 2025, The journal of applied laboratory medicine)
- ROBoto2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment(Anthony Hevia, S. Chintalapati, Veronica Ka Wai Lai, T. Nguyen, W. Wong, Terry Klassen, L. Wang, 2025, ArXiv)
- Question Answering for Decisionmaking in Green Building Design: A Multimodal Data Reasoning Method Driven by Large Language Models(Yihui Li, Xiaoyue Yan, Hao Zhou, Borong Lin, 2024, ArXiv)
- VST-LLM HRI: Multimodal Human-Robot Interaction via Large Language Model Prompts(Weikai Ding, Shijun Xiao, Zhengguo Zhu, Teng Chen, Guoteng Zhang, 2025, 2025 IEEE 5th International Conference on Computer Communication and Artificial Intelligence (CCAI))
- Visual-Conversational Interface for Evidence-Based Explanation of Diabetes Risk Prediction(Reza Samimi, Aditya Bhattacharya, Lucija Gosak, Gregor Štiglic, K. Verbert, 2025, Proceedings of the 7th ACM Conference on Conversational User Interfaces)
- LLM-IE: a python package for biomedical generative information extraction with large language models(Enshuo Hsu, Kirk Roberts, 2024, JAMIA Open)
- A Preliminary Fundamental Financial Analysis Framework Using Structured LLM Prompting - A Case Study(Ishan Gupta, N. Sharma, Abhay Kaushal, Rajeswara Rao Kvs, 2025, 2025 9th International Conference on Computational System and Information Technology for Sustainable Solutions (CSITSS))
- Scalable Blockchain Analytics: an LLM-Powered Approach(Mostafa Chegenizadeh, Zixian Pang, Junyong Cao, Lundrim Azemi, 2025, 2025 IEEE International Conference on Blockchain and Cryptocurrency (ICBC))
- Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study(Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, Dongmei Zhang, 2023, ArXiv Preprint)
- Visual Prompt Selection Framework for Real-Time Object Detection and Interactive Segmentation in Augmented Reality Applications(Eungyeol Song, Doeun Oh, B. Oh, 2024, Applied Sciences)
- From Solitary Directives to Interactive Encouragement! LLM Secure Code Generation by Natural Language Prompting(Shigang Liu, Bushra Sabir, Seung Ick Jang, Yuval Kansal, Yansong Gao, Kristen Moore, A. Abuadbba, Surya Nepal, 2024, ArXiv)
- CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts(Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik, 2026, ArXiv)
- Towards Automated Data Sciences with Natural Language and SageCopilot: Practices and Lessons Learned(Yuan Liao, Jiang Bian, Yuhui Yun, Shuo Wang, Yubo Zhang, Jiaming Chu, Tao Wang, Kewei Li, Yuchen Li, Xuhong Li, Shilei Ji, Haoyi Xiong, 2024, ArXiv)
- Optimizing LLM Strategies for Playing Mendikot using Prompt Engineering(Aadi Juthani, 2024, International Journal For Multidisciplinary Research)
- The Art of Tool Interface Design(Yunnan Wu, P. Chen, Deshank Baranwal, J. Zhou, J. Yuan, 2025, ArXiv)
- Integrating Vision-Language Models for Enhanced Robotic Grasping and Interaction Using RGB Image and Prompt(Nguyen Khac, Nguyen Truong Thinh, 2025, International Journal of Mechanical Engineering and Robotics Research)
- VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection(Zeyi Huang, Yuyang Ji, A. Rajan, Zefan Cai, Wen Xiao, Junjie Hu, Yong Jae Lee, 2025, ArXiv)
- ChatHSI: Reliable LLM-Powered Human-Swarm Interaction Framework(Yiheng Zhang, Shen Bohan, Le Liu, Shizhou Zhang, Peng Wang, Lingyun Yu, Di Xu, 2025, Proceedings of the 18th International Symposium on Visual Information Communication and Interaction)
- LLM-based Interactive Imitation Learning for Robotic Manipulation(Jonas Werner, Kun Chu, C. Weber, Stefan Wermter, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- SVGThinker: Instruction-Aligned and Reasoning-Driven Text-to-SVG Generation(Hanqi Chen, Zhongyin Zhao, Ye Chen, Zhujin Liang, Bingbing Ni, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Efficient Visual Prompt Engineering for Creative Story Writing(Felix Deng, 2024, Journal of Student Research)
- Visual Story-Writing: Writing by Manipulating Visual Representations of Stories(Damien Masson, Zixin Zhao, Fanny Chevalier, 2024, ArXiv Preprint)
- Chain-of-Cooking: Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance(Mengling Xu, Ming Tao, Bing-Kun Bao, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling(Min Zhang, Zilin Wang, Liyan Chen, Kunhong Liu, Juncong Lin, 2024, 2025 IEEE International Conference on Multimedia and Expo (ICME))
- RetouchGPT: LLM-based Interactive High-Fidelity Face Retouching via Imperfection Prompting(Wen Xue, Chun Ding, Ruotao Xu, Si Wu, Yong Xu, Hau-San Wong, 2025, No journal)
- Harnessing AI for Scientific Illustration: Exploring Tropical Cyclone Dynamics Using ChatGPT and Midjourney(Hung-Cheng Chen, 2025, 2025 IEEE International Conference on Computation, Big-Data and Engineering (ICCBE))
- ChatVis: Large Language Model Agent for Generating Scientific Visualizations(Tom Peterka, Tanwi Mallick, Orcun Yildiz, David Lenz, Cory Quammen, Berk Geveci, 2025, 2025 IEEE 15th Symposium on Large Data Analysis and Visualization (LDAV))
- StackSight: Unveiling WebAssembly through Large Language Models and Neurosymbolic Chain-of-Thought Decompilation(Weike Fang, Zhejian Zhou, Junzhou He, Weihang Wang, 2024, No journal)
- The implementation solution for automatic visualization of tabular data in relational databases based on large language models(Hao Yang, Zhaoyong Yang, Ruyang Zhao, Xiaoran Li, Gaoqi Rao, 2024, 2024 International Conference on Asian Language Processing (IALP))
- Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study(Yang Wu, Yao Wan, Hongyu Zhang, Yulei Sui, Wucai Wei, Wei Zhao, Guandong Xu, Hai Jin, 2024, Proceedings of the ACM on Management of Data)
以人为本的交互设计、安全性与协作伦理
从用户研究视角出发,探讨人机协作中的认知偏差、无障碍设计、创意共创模式,以及提示词注入攻击等安全漏洞与伦理对齐问题。
- DeBiasMe: De-biasing Human-AI Interactions with Metacognitive AIED (AI in Education) Interventions(Chaeyeon Lim, 2025, ArXiv Preprint)
- Improving Student-AI Interaction Through Pedagogical Prompting: An Example in Computer Science Education(Ruiwei Xiao, Xinying Hou, Runlong Ye, Majeed Kazemitabaar, Nicholas Diana, Michael Liut, John Stamper, 2025, ArXiv)
- VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents(Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng, Lin Lu, Nay Oo, Shuicheng Yan, Bryan Hooi, 2025, ArXiv)
- The Role of Interface Design on Prompt-mediated Creativity in Generative AI(M. Torricelli, Mauro Martino, Andrea Baronchelli, L. Aiello, 2023, Proceedings of the 16th ACM Web Science Conference)
- How Developers Interact with AI: A Taxonomy of Human-AI Collaboration in Software Engineering(Christoph Treude, Marco A. Gerosa, 2025, ArXiv Preprint)
- PromptPilot: Improving Human-AI Collaboration Through LLM-Enhanced Prompt Engineering(Niklas Gutheil, Valentin Mayer, Leopold Müller, Jörg Römmelt, Niklas Kühl, 2025, ArXiv)
- Multi-Turn Human-LLM Interaction Through the Lens of a Two-Way Intelligibility Protocol(Harshvardhan Mestha, Karan Bania, Shreyas V Sathyanarayana, Sidong Liu, Ashwin Srinivasan, 2024, ArXiv Preprint)
- Prompt Culture and Visual Creativity in AI-Assisted Design: Perspectives from University Students(A. Utami, Trias Widha Andari, Putranta Satrio, Sonhaji Arif, Ragil Noviyanti, Amalia Hartiningrum, Akhya' Muhammad Khaidzir, 2025, 2025 International Conference on ICT for Smart Society (ICISS))
- Coffee Masterclass: An Experience of Co-Creation with Prompt Engineering and Generative AI for Immersive Environments Development(Alexander Rozo-Torres, Wilson J. Sarmiento, 2024, 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW))
- Infusing Theory of Mind into Socially Intelligent LLM Agents(EunJeong Hwang, Yuwei Yin, Giuseppe Carenini, Peter West, V. Shwartz, 2025, ArXiv)
- Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts(J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, Qian Yang, 2023, Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems)
- Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines(Cansu Koyuturk, Emily Theophilou, Sabrina Patania, Gregor Donabauer, Andrea Martinenghi, Chiara Antico, Alessia Telari, Alessia Testa, Sathya Bursic, Franca Garzotto, Davinia Hernandez-Leo, Udo Kruschwitz, Davide Taibi, Simona Amenta, Martin Ruskov, Dimitri Ognibene, 2025, ArXiv Preprint)
- Understanding and Improving Accessibility in AI-Generated Interfaces through Interactive Prompt Engineering Methods(Alexandra E. Gurita, 2025, Companion Proceedings of the 30th International Conference on Intelligent User Interfaces)
- SAID: A Social Media AI-generated Interface Dataset Using Prompt Engineering Methods Focused On Accessibility(Alexandra E. Gurita, 2025, No journal)
- Reflexive Prompt Engineering: A Framework for Responsible Prompt Engineering and AI Interaction Design(Christian Djeffal, 2025, Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency)
- Reflexive Prompt Engineering: A Framework for Responsible Prompt Engineering and Interaction Design(Christian Djeffal, 2025, ArXiv Preprint)
- A Systematization of Security Vulnerabilities in Computer Use Agents(Daniel Jones, Giorgio Severi, Martin Pouliot, Gary Lopez, Joris de Gruyter, Santiago Zanella-Béguelin, Justin Song, Blake Bullwinkel, Pamela Cortez, Amanda Minnich, 2025, ArXiv)
- Pay Attention! Human-Centric Improvements of LLM-based Interfaces for Assisting Software Test Case Development(Bill Shi, P. O. Kristensson, 2024, Adjunct Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology)
- How People Prompt to Create Interactive VR Scenes(Setareh Aghel Manesh, Tianyi Zhang, Yuki Onishi, Kotaro Hara, Scott Bateman, Jiannan Li, Anthony Tang, 2024, ArXiv Preprint)
- Exploring Diagnostic Prompting Approach for Multimodal LLM-based Visual Complexity Assessment: A Case Study of Amazon Search Result Pages(Divendar Murtadak, Yoon Kim, Trilokya Akula, 2025, ArXiv Preprint)
- Counterfeit medicine detection by visual inspection of package design using multimodal LLMs with text and image prompt engineering(Yona Zakaria, Eiki Ishidera, Rui Ishiyama, T. Matsui, Hiroiko Suwa, Yuki Matsuda, K. Yasumoto, 2025, No journal)
- Intelligent Interaction Strategies for Context-Aware Cognitive Augmentation(Xiangrong, Zhu, Yuan Xu, Tianjian Liu, Jingwei Sun, Yu Zhang, Xin Tong, 2025, ArXiv Preprint)
- Visual Attention Prompted Prediction and Learning(Yifei Zhang, Siyi Gu, Bo Pan, Guangji Bai, Meikang Qiu, Xiaofeng Yang, Liang Zhao, 2023, ArXiv Preprint)
- A Behavior-Driven Adaptive User Interface Generation Framework with Iterative Preference Modeling and Prompt Fusion(Juan Chen, Bo Chen, Jingyi Lei, Xiao-Hui He, Ling Chen, Won SukLing Kim, 2026, Signal, Image and Video Processing)
- Research on the Practical Paths of Computer Technology Application in Visual Communication Design Teaching in the Context of Artificial Intelligence(Fan Zhang, 2025, Proceedings of the 2025 International Conference on Artificial Intelligence, Virtual Reality and Interaction Design)
- Prompting LLM for Embedded SRS Generation: A Case Study of Elder Care System(Chunhui Wang, Jiaqi Zhao, Zhi Jin, 2025, 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW))
生成式可视化内容的理解、验证与摘要
侧重于大模型对现有图表和可视化内容的逆向解析,包括数据提取、自动摘要生成以及对生成图像内容正确性的验证研究。
- Validation of Generative Visual Solutions Using Prompt Engineering and Caption Based Visual Reasoning Models(Manali Arora, Chirag Garg, Deepanshu Mangla, 2025, 2025 International Conference on Machine Learning and Autonomous Systems (ICMLAS))
- SimVecVis: A Dataset for Enhancing MLLMs in Visualization Understanding(Can Liu, Chunlin Da, Xiaoxiao Long, Yuxiao Yang, Yu Zhang, Yong Wang, 2025, 2025 IEEE Visualization and Visual Analytics (VIS))
- An Empirical Study of Counterfactual Visualization to Support Visual Causal Inference(Arran Zeyu Wang, David Borland, David Gotz, 2024, ArXiv Preprint)
- End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models(Raymond Choi, Frank Burns, Chase Lawrence, 2025, ArXiv Preprint)
本报告综合了从底层提示词工程方法论到高层可视化交互系统的全方位研究。研究脉络呈现出明显的演进趋势:首先,提示词工程正从“炼金术”走向体系化的软件工程管理;其次,推理机制通过思维链(CoT)及其可视化手段实现了从黑盒到透明的跨越;第三,多模态技术的介入使得“视觉思维”成为增强模型逻辑能力的关键;第四,交互式可视化界面与自然语言的融合,极大地降低了垂直领域(如医疗、工业、创意设计)应用AI的门槛;最后,研究视角回归“以人为本”,深入探讨了交互安全性、认知负荷及人机协作的伦理边界。整体而言,该领域正致力于构建一个逻辑透明、交互直观且行业深耕的智能交互生态。
总计165篇相关文献
As interface designers increasingly adopt AI-powered design tools, ensuring accessibility compliance presents unique challenges for human-AI collaboration. Through a mixed-methods study combining systematic evaluation of 200 generated interfaces across five AI design tools and interviews with six professional designers with extensive accessibility experience, we investigated patterns in designer-AI collaboration for accessible interface design. While 72% of interfaces exhibited accessibility violations, these were relatively minor (M = 1.2, SD = 0.4). Our analysis revealed distinct interaction patterns between designers and AI tools: iterative refinement (42% of cases) where designers progressively improved accessibility through dialogue, requirement prioritization between accessibility compliance and visual design, design consistency during targeted modifications. Component-level iterations had a median of 3 rounds, with average values suggesting similar behavior (M = 3.2, SD = 0.8), each targeting specific accessibility issues. Notably, the difference between control and accessibility-oriented prompts was minimal (Δ = 0.2), suggesting that successful accessibility implementation requires active designer engagement beyond initial prompting. We contribute empirical insights into designer-AI dialogue patterns and provide implications for developing more effective accessibility-focused AI design tools.
This work presents the design and development process of an immersive experience applying a co-creation approach between humans and generative artificial intelligence tools. From the point of view of any user, Coffee Masterclass is an immersive experience that brings anyone to the art and pleasure of preparing specialty coffees. However, the Coffee Masterclass is the result of the inclusion of prompt engineering outputs in each stage of the building process. The co-creation approach is included in all development processes, i.e., from the narrative to the visual content generated through code writing, which has been co-created between the creative team and GenAI. This work tells details of this approach, including how the generative artificial intelligence tools were used in each stage of immersive experience development. This work shows the advantage of involvement in a development team of people with skills in prompt engineering and interaction with Large Language Models. Also, it includes recommendations to other development teams, including generative artificial intelligence tools by future developments.
This study investigates the integration of a large language model (LLM) enhanced by prompt engineering and game theory to effectively engage in the strategic card game Mendikot. By refining complex prompts and leveraging a tailored visual understanding of game dynamics, we significantly bolster the decision-making prowess of the LLM. Our methodology involved the systematic simplification of game prompts to facilitate deeper learning and faster response times, coupled with the implementation of a visual recognition system to interpret and react to game states dynamically. The results illustrate that the adapted LLM outperforms traditional AI approaches in strategic decision-making tasks, underscoring a substantial improvement in both the accuracy and efficiency of game-play. This research not only demonstrates a viable model for enhancing AI interaction in recreational gaming but also opens avenues for deploying advanced AI strategies in complex strategic environments, offering insights into the broader application of AI in leisure and competitive arenas. The findings suggest that AI can transcend conventional gaming roles, potentially transforming strategic gameplay in digital and physical platforms.
—Object detection and grasping is one of the critical challenges in robotic research, particularly when working in complex environments with diverse objects in terms of shape and position. Although methods using RGB images have shown promising results in simpler scenarios, they still face numerous issues in more complex scenes, especially when objects overlap. Furthermore, prior research has primarily focused on object grasping, without focusing on addressing the interaction capabilities between robots and users during the grasping process. Recent advancements in vision-language models have opened up significant potential for the development of human-robot interaction systems based on multimodal data. This paper presents an integrated model combining computer vision and language models to enhance object detection and grasping capabilities in real-world environments. The proposed approach consists of three key steps (1) identifying the locations of objects and generating segmentation masks using a visual-language model; (2) grasp candidates are predicted from the generated masks and bounding boxes via the Grasp Detection Head; and (3) the candidates are optimized and refined using the Grasp Refinement Head. The integration of vision-language models in the proposed approach not only enhances the ability of robot to understand the semantics of language, enabling more accurate grasping decisions, but also strengthens the interaction capabilities of robot with users. Experimental results demonstrate that the proposed model achieves higher grasping accuracy compared to existing methods, particularly in complex scenes with multiple objects. Additionally, the model also shows its ability to understand complex contexts through Interactive Grasp experiments.
We present VOICE, a novel approach to science communication that connects large language models’ conversational capabilities with interactive exploratory visualization. VOICE introduces several innovative technical contributions that drive our conversational visualization framework. Based on the collected design requirements, we introduce a two-layer agent architecture that can perform task assignment, instruction extraction, and coherent content generation. We employ fine-tuning and prompt engineering techniques to tailor agents’ performance to their specific roles and accurately respond to user queries. Our interactive text-to-visualization method generates a flythrough sequence matching the content explanation. In addition, natural language interaction provides capabilities to navigate and manipulate 3D models in real-time. The VOICE framework can receive arbitrary voice commands from the user and respond verbally, tightly coupled with a corresponding visual representation, with low latency and high accuracy. We demonstrate the effectiveness of our approach by implementing a proof-of-concept prototype and applying it to the molecular visualization domain: analyzing three 3D molecular models with multiscale and multi-instance attributes. Finally, we conduct a comprehensive evaluation of the system, including quantitative and qualitative analyses on our collected dataset, along with a detailed public user study and expert interviews. The results confirm that our framework and prototype effectively meet the design requirements and cater to the needs of diverse target users.
The rise of Large Language Models (LLMs) and generative visual analytics systems has transformed data‐driven insights, yet significant challenges persist in accurately interpreting users analytical and interaction intents. While language inputs offer flexibility, they often lack precision, making the expression of complex intents inefficient, error‐prone, and time‐intensive. To address these limitations, we investigate the design space of multimodal interactions for generative visual analytics through a literature review and pilot brainstorming sessions. Building on these insights, we introduce a highly extensible workflow that integrates multiple LLM agents for intent inference and visualization generation. We develop InterChat, a generative visual analytics system that combines direct manipulation of visual elements with natural language inputs. This integration enables precise intent communication and supports progressive, visually driven exploratory data analyses. By employing effective prompt engineering, and contextual interaction linking, alongside intuitive visualization and interaction designs, InterChat bridges the gap between user interactions and LLM‐driven visualizations, enhancing both interpretability and usability. Extensive evaluations, including two usage scenarios, a user study, and expert feedback, demonstrate the effectiveness of InterChat. Results show significant improvements in the accuracy and efficiency of handling complex visual analytics tasks, highlighting the potential of multimodal interactions to redefine user engagement and analytical depth in generative visual analytics.
Large language models (LLMs) have exhibited impressive abilities for multimodal content comprehension and reasoning with proper prompting in zero- or few-shot settings. Despite the proliferation of interactive systems developed to support prompt engineering for LLMs across various tasks, most have primarily focused on textual or visual inputs, thus neglecting the complex interplay between modalities in multimodal inputs. This oversight hinders the development of effective prompts that guide models’ multimodal reasoning processes by fully exploiting the rich context provided by multiple modalities. In this paper, we present POEM, a visual analytics system to facilitate efficient prompt engineering for steering the multimodal reasoning performance of LLMs. The system enables users to explore the interaction patterns across modalities at varying levels of detail for a comprehensive understanding of the multimodal knowledge elicited by various prompts. Through diverse recommendations of demonstration examples and instructional principles, POEM supports users in iteratively crafting and refining prompts to better align and enhance model knowledge with human insights. The effectiveness and efficiency of our system are validated through quantitative and qualitative evaluations with experts.
With the rapid development of generative artificial intelligence technology, profound changes have taken place in the ecological environment of the visual communication design industry, and the traditional teaching model has struggled to meet the industry's demands. This research constructs a systematic reform framework covering courses, teaching, and support systems. In terms of curriculum reconstruction, core knowledge such as artificial intelligence algorithms and human-computer interaction is incorporated. As for the innovation of the teaching model, the Bayesian optimization algorithm is utilized to optimize the AI prompt engineering. Meanwhile, natural language processing technology (NLP) is employed to explore design requirements, and AI tools like Adobe Firefly and MidJourney are adopted to carry out school-enterprise collaborative projects. Furthermore, the teacher AI ability training system is improved, a multi-dimensional evaluation model is established, and ethical norms are formulated. Eventually, a “technology-empowered and humanistic-oriented” teaching system will be formed to cultivate visual communication design talents with the ability of human-computer collaborative innovation.
Fault diagnosis is essential for identifying malfunctions in power equipment. However, traditional model-based methods face challenges in scalability, adaptability, and the ability of integrating evolving expert knowledge. In this paper, we propose a novel end-to-end fault diagnosis framework integrating a LLM and a fault knowledge base. The proposed framework constructs a knowledge base for efficient fault retrieval and uses prompt engineering to enhance LLMs’ reasoning capabilities for fault diagnosis. We also develop Prompt4VFD, namely, a visual analytic system that facilitates rapid prompt construction and refinement through visualizations and achieves user-friendly interaction. A quantitative study and two case studies are conducted to validate the effectiveness of the proposed fault diagnosis framework and the usability of the Prompt4VFD system.
No abstract available
Human-swarm interaction (HSI) is critical for scalable control of UAV swarm systems. Traditional interfaces struggle with generalization and user workload, especially in immersive environments. Hence, we present ChatHSI, a framework leveraging large language models (LLMs) for swarm task planning. ChatHSI integrates prompt engineering, action validation, and a human-in-the-loop mechanism to improve planning feasibility and executability. We implement ChatHSI in an immersive simulation to improve users’ spatial and situational awareness. Our method shows improved task efficiency, reduced workload, and higher usability in user studies. Ablation study proves the effectiveness of prompt context and action validation. The results show the feasibility of LLM-driven interaction for immersive swarm control and point toward adaptive, intuitive, and scalable HSI systems.
This paper proposes a Visual-Speech-Text Large Language Model framework for Human-Robot Interaction (VSTLLM HRI). By designing a Modality Language Model (MLM), the framework achieves a closed-loop system for robot perception, task planning, and control. Without requiring fine-tuning of the Large Language Model (LLM), the framework leverages visual semantic extraction, speech command conversion, and prompt engineering guidance to accomplish tasks. We conducted experiments on a bipedal robot to validate the adaptability and control performance of the framework in complex terrain task scenarios. The experimental results demonstrated that the proposed method exhibited good generalization capabilities. The related project files and programs have been uploaded to https://github.com/dwk-Suga/LLMandVLM.git.
Pansharpening aims to enhance the spatial resolution of low-resolution multispectral (LRMS) images by integrating high-frequency information from a corresponding texture-rich panchromatic (PAN) image, while maintaining the spectral integrity of the LRMS image. Although text-guided multimodal learning has made considerable strides in the natural image domain, its potential for pansharpening remains underexplored, primarily due to the limited availability of multimodal remote sensing datasets. To this end, we construct an entirely new pansharpening framework by making efforts from three key aspects: 1) text-equipped multimodal data collection through chain-of-reasoning; 2) large model prior-driven multimodal information fusion; and 3) visual information interaction through prompt engineering, leveraging textual information to guide the pansharpening process within a multimodal fusion framework. We initially utilize the generic large language model priors to generate descriptive captions for multispectral (MS) images, forming a multimodal pansharpening dataset. By integrating super-resolved imagery and segmentation maps generated by segment anything, we apply chain-of-thought (CoT) prompting to generate spatially focused captions across diverse satellite datasets. These captions enhance visual features and provide high-level contextual information, improving semantic understanding for pansharpening. Building on the aforementioned multimodal data, we tailor two text-guided information fusion modules: a textual enhancement block (TEB) standing on a large model prior and a textual modulated block (TMB) utilizing text information to effectively guide and refine the pansharpening fusion process. Extensive experiments on multiple satellite datasets demonstrate that our proposed framework outperforms state-of-the-art (SOTA) methods, highlighting its effectiveness and superior performance in pansharpening.
Aiming at the limitations of multimodal interaction and decision-making in industrial sorting robots, this paper proposes a multi-agent collaborative embodied intelligence architecture. By reconstructing the input space through prompt engineering, the vision-language foundation models (DeepSeek-V3/Qwen-VL) are integrated without weight updates, establishing a three-tier agent framework with linguistic, visual, and executive modules. Prompt templates activate the implicit kinematic priors of large models to achieve end-to-end zero-shot mapping from natural language commands and RGB scenes to robotic arm joint trajectories, forming a closed-loop “perception-decisionexecution” chain. Experiments demonstrate high instruction execution rates, low joint angle errors, and millisecond-level responses under zero-shot conditions, advancing industrial sorting toward high generalizability and low-cost deployment.
With the advancement of assistant robots, the focus has increasingly shifted towards enabling robots to perform complex tasks through human language. To achieve this, three major challenges must be overcome: insufficient human-robot interaction, ambiguity of human language in the open world, and cost-effective deployment. In this paper, we propose an efficient grasping method that allows non-expert users to effortlessly obtain the desired object transferred by a robot solely through intuitive human-robot dialogue. Our approach is centered around an Large Language Model(LLM)-based decision-making module and integrates voice interaction, visual perception, and grasping control module. The vision module perceives the object's class and pose, and then the decision-making module aligns the visual and linguistic data to conduct causal and spatial reasoning. Our approach is cost-effective and easy to implement, requiring only prompt engineering and visual network post-processing. The system has been successfully implemented on our self-independently developed D11 humanoid robot, achieving a 92% grasping success rate in various scenarios and meeting high user satisfaction regarding intelligence and efficiency.
Blockchain analytics is essential for understanding network centralization and operational dynamics, yet it faces challenges such as human error, technical complexity, diverse data formats, and multi-step processes. To address these limitations, this paper proposes a scalable framework powered by large language models (LLMs) to streamline blockchain data analysis. The proposed framework adopts a modular multi-agent architecture with specialized agents: the SQL Agent for data retrieval, the Analysis Agent for data processing, and the Visualization Agent for generating visual representations. This design ensures scalability, efficiency, and adaptability while minimizing reliance on technical expertise and reducing errors. The framework’s effectiveness is demonstrated through a case study on Cardano blockchain data analysis, encompassing tasks such as stake address analysis, staking pool dynamics, delegated stake distribution, and rewards analysis. Comprehensive prompt engineering further optimizes the interaction between LLM agents and blockchain data. As the first study to explore the integration of LLM agents in blockchain analytics, this work highlights the transformative potential of LLM-powered systems for scalable and automated blockchain data analysis.
Traffic scene understanding plays a crucial role in reasoning about and predicting relationships among entities in traffic images. It focuses on analyzing behavioral interaction patterns and global semantic associations to support higher‐level traffic requirements. However, few existing frameworks can achieve comprehensive scene understanding and semantic description in complex traffic environments. In particular, effective multiview semantic association modeling is still lacking. To address these challenges, we propose multiview large language model (MVLLM), which integrates YOLO‐based object detection with the reasoning ability of large language models (LLMs). Through prompt engineering, MVLLM utilizes the visual information extracted by YOLO to constrain the semantic space and guide the reasoning behavior, thereby enhancing the scene parsing capability. Meanwhile, we design a Chain‐of‐Thought (CoT) reasoning mechanism to establish spatiotemporal associations across multiple views and to integrate their scene understanding with semantic descriptions. The framework enables intent understanding for vehicles in dynamic environments, enhancing driving safety. It also provides comprehensive semantic descriptions for traffic management agencies, supporting holistic analyses of vehicles, roads, and environmental contexts.
Using generative AI tools, particularly ChatGPT and Midjourney, scientifically accurate and visually compelling illustrations of tropical cyclone dynamics are created focusing on the eyewall's structure and intensification processes. The novel approach using AI-driven prompt generation and image synthesis allows for detailed visual representations that balance scientific rigor and visual aesthetics. AI-generated illustrations are created effectively in the capturing and critical meteorological process, such as angular momentum conservation, boundary layer inflow, and latent heat release. For example, the interaction between deep convection and boundary layer dynamics is visualized, resulting in images to depict the vortex and intensification of wind speeds near the cyclone's core. Human expertise is integrated into the AI workflow. Using ChatGPT, generated images were quantitatively and qualitatively evaluated, assessing the alignment with established scientific facts. This evaluation is used to iteratively refine prompts, significantly enhancing the clarity and accuracy of subsequent visualizations. The results underscore the potential of combining AI-generated content to create scientifically and aesthetically appealing illustrations. However, challenges persist, particularly in ensuring that the AI-generated images maintain high scientific accuracy. Ongoing optimization and careful curation of prompts are essential to fully harness the capabilities of AI tools in advancing scientific communication in meteorology.
Visual impairments significantly affect individuals’ ability to perform essential tasks such as communication, object search, and navigation. Traditional wearable augmented vision devices rely on modular designs that separate functions like perception and path planning, leading to cumulative errors and inefficiencies in real-world applications. To address these challenges, we propose an end-to-end multimodal large model framework, UniANS, specifically designed for wearable augmented vision devices. UniANS integrates visual perception, speech interaction, and path planning into a unified framework. Such integration improves task coordination, reduces error propagation, and enhances overall performance. We also propose a prompt design strategy with a mixture of cluster-conditional low-rank adaptation experts architectures and dual-branch encoders, combined with advanced preprocessing techniques for visual and speech modules. The framework has been validated through ablation studies, showing superior performance in accuracy and task effectiveness compared to existing methods. We further showcase its capabilities in addressing challenges related to communication, object search, and indoor navigation tasks. The design of UniANS enhances mobility and quality of life for visually impaired individuals. Note to Practitioners—UniANS is an end-to-end multimodal framework for wearable vision devices that assist visually impaired users. Unlike traditional modular designs, which suffer from error accumulation due to separated perception, interaction, and planning, UniANS integrates these functions into a unified system; such integration improves coordination and robustness. It employs a progressive training strategy with cluster-conditional low-rank adaptation experts and dual-branch encoders, along with advanced preprocessing, to enhance performance across dynamic environments. In practice, UniANS achieves strong results in communication, object search, and indoor navigation without pre-built maps, making it well-suited for unfamiliar settings. Experimental results show it outperforms existing methods in accuracy and effectiveness, offering practitioners a reliable tool to enhance user mobility. We advocate UniANS as a foundation for next-generation, user-centered assistive technologies with potential for further expansion into complex tasks and improved user experience.
Utilizing Large Language Models (LLMs) for complex tasks is challenging, often involving a time-consuming and uncontrollable prompt engineering process. This paper introduces a novel human-LLM interaction framework, Low-code LLM. It incorporates six types of simple low-code visual programming interactions to achieve more controllable and stable responses. Through visual interaction with a graphical user interface, users can incorporate their ideas into the process without writing trivial prompts. The proposed Low-code LLM framework consists of a Planning LLM that designs a structured planning workflow for complex tasks, which can be correspondingly edited and confirmed by users through low-code visual programming operations, and an Executing LLM that generates responses following the user-confirmed workflow. We highlight three advantages of the low-code LLM: user-friendly interaction, controllable generation, and wide applicability. We demonstrate its benefits using four typical applications. By introducing this framework, we aim to bridge the gap between humans and LLMs, enabling more effective and efficient utilization of LLMs for complex tasks. The code, prompts, and experimental details are available at https://github.com/moymix/TaskMatrix/tree/main/LowCodeLLM. A system demonstration video can be found at https://www.youtube.com/watch?v=jb2C1vaeO3E.
In the rapidly evolving landscape of human-robot interaction, the integration of vision capabilities into conversational agents stands as a crucial advancement. This paper presents a ready-to-use implementation of a dialogue manager that leverages the latest progress in Large Language Models (e.g., GPT-4o mini) to enhance the traditional text-based prompts with real-time visual input. LLMs are used to interpret both textual prompts and visual stimuli, creating a more contextually aware conversational agent. The system's prompt engineering, incorporating dialogue with summarisation of the images, en-sures a balance between context preservation and computational efficiency. Six interactions with a Furhat robot powered by this system are reported, illustrating and discussing the results obtained. The system can be customised and is available as a stand-alone application, a Furhat robot implementation, and a ROS2 package.
Large Vision Language Models (LVLMs) often suffer from object hallucination, which undermines their reliability. Surprisingly, we find that simple object-based visual prompting -- overlaying visual cues (e.g., bounding box, circle) on images -- can significantly mitigate such hallucination; however, different visual prompts (VPs) vary in effectiveness. To address this, we propose Black-Box Visual Prompt Engineering (BBVPE), a framework to identify optimal VPs that enhance LVLM responses without needing access to model internals. Our approach employs a pool of candidate VPs and trains a router model to dynamically select the most effective VP for a given input image. This black-box approach is model-agnostic, making it applicable to both open-source and proprietary LVLMs. Evaluations on benchmarks such as POPE and CHAIR demonstrate that BBVPE effectively reduces object hallucination.
We introduce the novel approach of validation of artificially generated images which helps to validate the images based on the prompt given for the generated image. Existing methods involve generation of images based on the prompts, diffusion of images and modification of images, but fail to determine the correctness of the generated image with respect to the generated content and prompt given at user end. The prompt used for generating the image using generative artificial intelligence solutions can be comprehensive and can hold more than single perspective. To address this issue while validating computer visual solutions, we propose a method for the validation of generative visual solutions using prompt engineering and caption based visual reasoning models. The proposed solution determines the different perspectives and comprehensiveness of the prompts based on entities and attributes and then, multiple test cases are formed considering different perspectives, in more detailed and comprehensive format. Hence, proposed solution validates the generated image based on the text prompt engineered for comprehensive understanding based on the complexity of the prompt suitable for visual reasoning models.
No abstract available
With the growing capabilities of large language models (LLMs), they are increasingly applied in areas like intelligent customer service, code generation, and knowledge management. Natural language (NL) prompts act as the ``APIs''for human-LLM interaction. To improve prompt quality, best practices for prompt engineering (PE) have been developed, including writing guidelines and templates. Building on this, we propose Controlled NL for Prompt (CNL-P), which not only incorporates PE best practices but also draws on key principles from software engineering (SE). CNL-P introduces precise grammar structures and strict semantic norms, further eliminating NL's ambiguity, allowing for a declarative but structured and accurate expression of user intent. This helps LLMs better interpret and execute the prompts, leading to more consistent and higher-quality outputs. We also introduce an NL2CNL-P conversion tool based on LLMs, enabling users to write prompts in NL, which are then transformed into CNL-P format, thus lowering the learning curve of CNL-P. In particular, we develop a linting tool that checks CNL-P prompts for syntactic and semantic accuracy, applying static analysis techniques to NL for the first time. Extensive experiments demonstrate that CNL-P enhances the quality of LLM responses through the novel and organic synergy of PE and SE. We believe that CNL-P can bridge the gap between emerging PE and traditional SE, laying the foundation for a new programming paradigm centered around NL.
Responsible prompt engineering has emerged as a critical pracitce for ensuring that generative artificial intelligence (AI) systems are aligned with ethical, legal, and social principles. As generative AI applications become increasingly powerful and ubiquitous, the way we instruct and interact with them through prompts has profound implications for fairness, accountability, and transparency. It is, therefore, necessary to examine how strategic prompt engineering can embed ethical and legal considerations and societal values directly into AI interactions, moving beyond mere technical optimization for functionality. This article proposes “Reflexive Prompt Engineering”, a comprehensive framework for responsible prompt engineering that encompasses five interconnected components: prompt design, system selection, system configuration, performance evaluation, and prompt management. Drawing from empirical evidence, the paper demonstrates how each component can be leveraged to promote improved societal outcomes while mitigating potential risks. The analysis reveals that effective prompt engineering requires a delicate balance between technical precision and ethical consciousness, combining the systematic rigor and focus on functionality with the nuanced understanding of social impact. Through examination of emerging practices, this article illustrates how responsible prompt engineering serves as a crucial connection between AI development and deployment, enabling organizations to align AI outputs without modifying underlying model architectures. This approach links with broader “Responsibility by Design” principles, embedding ethical considerations directly into the implementation process rather than treating them as post-hoc additions. The article concludes by identifying key research directions and practical guidelines for advancing the field of responsible prompt engineering as an essential component of AI literacy.
This study explores the potential use of multimodal large language models (LLMs) in detecting counterfeit drugs through visual inspection of medicine packaging designs. Specifically, we investigate how ChatGPT-4o can provide clear explanations of design differences between genuine and counterfeit packaging. We combine structured textual prompts with three distinct image configurations: (1) query-only images, (2) query-plus-reference images, and (3) query-plus-reference-plus-difference. This setup allows for context-aware comparative analysis, helping the model to effectively identify and explain packaging design inconsistencies—key indicators of counterfeit or substandard medicines. Experimental results show that ChatGPT-4o achieves a binary classification accuracy of up to 74.6% in distinguishing authentic from counterfeit medicine packaging. Furthermore, user evaluations reveal that ChatGPT-4o delivers high levels of clarity, ease of understanding, reliability in identifying discrepancies, detail, and overall quality of analysis. These findings underscore the notable potential of ChatGPT-4o to enhance explainability and usability in counterfeit detection workflows, particularly by enabling accurate, actionable insights without requiring training on counterfeit-specific datasets, which are often challenging to collect.
In the realm of artificial intelligence., effective communication between humans and large language models (LLMs) such as ChatGPT is vital. This research paper proposes a structured methodology for prompt generation., developed through an in-depth analysis of existing research studies., to optimize interactions with LLMs. The exploration in the crucial role of prompt engineering in enhancing the quality and relevance of AI-generated outputs is also made. By conducting a comprehensive review of the literature., the aim is to identify best practices for prompt design., emphasizing the importance of clarity., specificity., and contextualization. Our findings indicate that tailored prompts can significantly improve accuracy across various applications., including health care., education., and customer service. Furthermore., the challenges faced are address the challenges of maintaining consistency and reliability in AI responses., highlighting the necessity for standardized guidelines in prompt engineering. This work underscores the transformative potential of prompt engineering., advocating for its integration in high-stakes environments to promote effective and ethical communication with AI systems.
No abstract available
Large Language Models (LLMs) are extensively utilized for generating stories, showcasing their ability to handle complex, creative tasks. To begin the process of story generation, an initial textual prompt is required. The prompt is iteratively refined such that the discrepancy between the user’s expectations and the story generated from the prompt is minimized. Each iteration is a time-consuming process; the user needs to read and analyze the story in order to refine the prompt. A key insight from cognitive research suggests that analyzing visual data is 60,000 times faster than textual analysis. This paper proposes visual prompt engineering for story generation wherein textual prompts are transformed into images using a diffusion model, then refined based on the discrepancy between the user’s expectations and the generated image. This refined prompt is then used to generate a story. The entire process is repeated until the user is satisfied with the story. This method leverages the relative speed of image processing to enhance the quality of text generation per iteration. Experiments show that for the same number of iterations, stories generated by visual prompt engineering outperformed those generated by text-based prompts in terms of story quality.
The elderly care system, as a typical software-hardware co-designed embedded system, requires its requirements specification process to start with understanding user intention. By clarifying core elements such as control software, devices, interfaces, and interactive information, it systematically acquires system requirements and software requirements to ultimately form standardized documentation. This process faces complexity challenges arising from the intersection of multi-dimensional knowledge, notably the difficulty in maintaining the completeness, consistency, and traceability of requirements. This paper proposes a guided large language model (LLM)-based software requirements specification (SRS) generation method. This method starts with user intention analysis, integrates requirements modeling knowledge to design guidance templates, directs LLMs to generate hierarchical requirements information from software-hardware environments to interaction relationship construction, and organizes it into structured documents according to the IEEE requirements specification template. Validated through the elderly care system case, this method effectively improves the quality of requirements documents.
Face retouching aims to remove facial imperfections from image and videos while at the same time preserving face attributes. The existing methods are designed to perform non-interactive end-to-end retouching, while the ability to interact with users is highly demanded in downstream applications. In this paper, we propose RetouchGPT, a novel framework that leverages Large Language Models (LLMs) to guide the interactive retouching process. Towards this end, we design an instruction-driven imperfection prediction module to accurately identify imperfections by integrating textual and visual features. To learn imperfection prompts, we further incorporate a LLM-based embedding module to fuse multi-modal conditioning information. The prompt-based feature modification is performed in each transformer block, such that the imperfection features are suppressed and replaced with the features of normal skin progressively. Extensive experiments have been performed to verify effectiveness of our design elements and demonstrate that RetouchGPT is a useful tool for interactive face retouching and achieves superior performance over state-of-the-arts.
Large Language Models (LLMs) have shown remarkable potential in code generation, making them increasingly important in the field. However, the security issues of generated code have not been fully addressed, and the usability of LLMs in code generation still requires further exploration. This work introduces SecCode, a framework that leverages an innovative interactive encouragement prompting (EP) technique for secure code generation with \textit{only NL} prompts. This approach ensures that the prompts can be easily shared and understood by general users. SecCode functions through three stages: 1) Code Generation using NL Prompts; 2) Code Vulnerability Detection and Fixing, utilising our proposed encouragement prompting; 3) Vulnerability Cross-Checking and Code Security Refinement. These stages are executed in multiple interactive iterations to progressively enhance security. By using both proprietary LLMs (i.e., GPT-3.5 Turbo, GPT-4 and GPT-4o) and open-source LLMs (i.e., Llama 3.1 8B Instruct, DeepSeek Coder V2 Lite Instruct) evaluated on three benchmark datasets, extensive experimental results show that our proposed SecCode greatly outperforms compared baselines, generating secure code with a high vulnerability correction rate. For example, SecCode exhibits a high fix success rate of over 76\% after running 5 automated EP interactive iterations and over 89\% after running 10 automated EP interactive iterations. To the best of our knowledge, this work is the first to formulate secure code generation with NL prompts only. We have open-sourced our code and encourage the community to focus on secure code generation.
Large Language Models (LLMs) leverage chain-of-thought (CoT) prompting to provide step-by-step rationales, improving performance on complex tasks. Despite its benefits, vanilla CoT often fails to fully verify intermediate inferences and can produce misleading explanations. In this work, we propose Layered Chain-of-Thought (Layered-CoT) Prompting, a novel framework that systematically segments the reasoning process into multiple layers, each subjected to external checks and optional user feedback. We expand on the key concepts, present three scenarios -- medical triage, financial risk assessment, and agile engineering -- and demonstrate how Layered-CoT surpasses vanilla CoT in terms of transparency, correctness, and user engagement. By integrating references from recent arXiv papers on interactive explainability, multi-agent frameworks, and agent-based collaboration, we illustrate how Layered-CoT paves the way for more reliable and grounded explanations in high-stakes domains.
Accurate nuclei classification serves as a critical cornerstone for disease diagnosis and treatment, yet challenged by the heterogeneity of tissue types, staining procedures, and imaging techniques. Recently, vision-language models (VLMs) have demonstrated impressive success in the natural image field and advanced potential in medical imaging. However, the adaptation of VLMs to nuclei classification still poses several challenges, including limited generalization capability and coarse image-text feature alignment. In this paper, we propose SIGNPrompt, a domain-Specific Interactive Prompt learning framework for Generalized Nuclei classification. Specifically, to unleash the generalization capability of VLM-based models, we introduce a prior-guided domain adapting module that integrates nuclei prior information from the large language model (LLM), enabling flexible and robust adaptation to the inherent heterogeneity across pathological domains. Moreover, we develop a multi-modal interactive prompting mechanism to refine image-text feature alignment by leveraging the interdependence between visual and language prompting, thus enhancing the discriminability of nuclei categories. In addition, a simple yet effective noise-adding strategy is proposed to mitigate the overfitting problem in prompt learning. Extensive experiments on diverse public benchmarks and challenging zero-shot scenarios validate that SIGNPrompt consistently outperforms state-of-the-art (SOTA) methods in both accuracy and generalization.
Large pre-trained vision and language models have demonstrated remarkable capacities for various tasks. However, solving the knowledge-based visual reasoning tasks remains challenging, which requires a model to comprehensively understand image content, connect the external world knowledge, and perform step-by-step reasoning to answer the questions correctly. To this end, we propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning. IPVR contains three stages, see, think and confirm. The see stage scans the image and grounds the visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to attend to the key concepts from candidates adaptively. It then transforms them into text context for prompting with a visual captioning model and adopts the LLM to generate the answer. The confirm stage further uses the LLM to generate the supporting rationale to the answer, verify the generated rationale with a cross-modality classifier and ensure that the rationale can infer the predicted output consistently. We conduct experiments on a range of knowledge-based visual reasoning datasets. We found our IPVR enjoys several benefits, 1). it achieves better performance than the previous few-shot learning baselines; 2). it enjoys the total transparency and trustworthiness of the whole reasoning process by providing rationales for each reasoning step; 3). it is computation-efficient compared with other fine-tuning baselines.
Many Vision-and-Language Navigation (VLN) tasks have been proposed in recent years, from room-based to object-based and indoor to outdoor. The REVERIE (Remote Embodied Referring Expression) is interesting since it only provides high-level instructions to the agent, which are closer to human commands in practice. Nevertheless, this poses more challenges than other VLN tasks since it requires agents to infer a navigation plan only based on a short instruction. Large Language Models (LLMs) show great potential in robot action planning by providing proper prompts. Still, this strategy has not been explored under the REVERIE settings. There are several new challenges. For example, the LLM should be environment-aware so that the navigation plan can be adjusted based on the current visual observation. Moreover, the LLM planned actions should be adaptable to the much larger and more complex REVERIE environment. This paper proposes a March-in-Chat (MiC) model that can talk to the LLM on the fly and plan dynamically based on a newly proposed Room-and-Object Aware Scene Perceiver (ROASP). Our MiC model outperforms the previous state-of-the-art by large margins by SPL and RGSPL metrics on the REVERIE benchmark. The source code is available at https://github.com/YanyuanQiao/MiC
Recent advancements in machine learning provide methods to train autonomous agents capable of handling the increasing complexity of sequential decision-making in robotics. Imitation Learning (IL) is a prominent approach, where agents learn to control robots based on human demonstrations. However, IL commonly suffers from violating the independent and identically distributed (i.i.d) assumption in robotic tasks. Interactive Imitation Learning (IIL) achieves improved performance by allowing agents to learn from interactive feedback from human teachers. Despite these improvements, both approaches come with significant costs due to the necessity of human involvement. Leveraging the emergent capabilities of Large Language Models (LLMs) in reasoning and generating human-like responses, we introduce LLM-iTeach — a novel IIL framework that utilizes an LLM as an interactive teacher to enhance agent performance while alleviating the dependence on human resources. Firstly, LLM-iTeach uses a hierarchical prompting strategy that guides the LLM in generating a policy in Python code. Then, with a designed similarity-based feedback mechanism, LLM-iTeach provides corrective and evaluative feedback interactively during the agent’s training. We evaluate LLM-iTeach against baseline methods such as Behavior Cloning (BC), an IL method, and CEILing, a state-of-the-art IIL method using a human teacher, on various robotic manipulation tasks. Our results demonstrate that LLM-iTeach surpasses BC in the success rate and achieves or even outscores that of CEILing, highlighting the potential of LLMs as cost-effective, human-like teachers in interactive learning environments. We further demonstrate the method’s potential for generalization by evaluating it on additional tasks. The code and prompts are provided at: https://github.com/Tubicor/LLM-iTeach.
Partly automated creation of interlinear glossed text (IGT) has the potential to assist in linguistic documentation. We argue that LLMs can make this process more accessible to linguists because of their capacity to follow natural-language instructions. We investigate the effectiveness of a retrieval-based LLM prompting approach to glossing, applied to the seven languages from the SIGMORPHON 2023 shared task. Our system beats the BERT-based shared task baseline for every language in the morpheme-level score category, and we show that a simple 3-best oracle has higher word-level scores than the challenge winner (a tuned sequence model) in five languages. In a case study on Tsez, we ask the LLM to automatically create and follow linguistic instructions, reducing errors on a confusing grammatical feature. Our results thus demonstrate the potential contributions which LLMs can make in interactive systems for glossing, both in making suggestions to human annotators and following directions.
Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only modest increase in dialogue turns. These results demonstrate the effectiveness of DocCHA in enabling structured, transparent, and efficient diagnostic conversations -- paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.
We present ROBOTO2, an open-source, web-based platform for large language model (LLM)-assisted risk of bias (ROB) assessment of clinical trials. ROBOTO2 streamlines the traditionally labor-intensive ROB v2 (ROB2) annotation process via an interactive interface that combines PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review. Users can upload clinical trial reports, receive preliminary answers and supporting evidence for ROB2 signaling questions, and provide real-time feedback or corrections to system suggestions. ROBOTO2 is publicly available at https://roboto2.vercel.app/, with code and data released to foster reproducibility and adoption. We construct and release a dataset of 521 pediatric clinical trial reports (8954 signaling questions with 1202 evidence passages), annotated using both manually and LLM-assisted methods, serving as a benchmark and enabling future research. Using this dataset, we benchmark ROB2 performance for 4 LLMs and provide an analysis into current model capabilities and ongoing challenges in automating this critical aspect of systematic review.
Understanding spatiotemporal anomalies is critical in domains such as urban safety, mobility, and environmental monitoring. These scenarios involve complex dynamics that are effectively modeled using graph-based representations, where the spatial structure is encoded through data connectivity, and each node corresponds to a time series. Anomaly detection in such data is crucial for identifying unusual or significant events, but it requires complex methods involving pattern recognition, prediction, and classification. Interpreting these anomalies remains challenging. To address this, we introduce an interactive system that combines spatiotemporal visualizations with Large Language Models (LLMs) to generate context-aware explanations by unifying temporal, spatial, and textual insights. We guide the LLM using a structured prompting strategy grounded in the data to reduce hallucinations and improve plausibility. As a demonstration of functionality, we analyze crime anomalies in São Paulo, uncovering links to events such as Carnival and religious holidays.
A Preliminary Fundamental Financial Analysis Framework Using Structured LLM Prompting - A Case Study
Financial analysts routinely calculate standard ratios for company evaluation, often using inconsistent Excel templates that require manual updates and lack interactive visualization capabilities. This paper presents a web-based financial analysis platform that automates preliminary analysis through 15 fundamental financial ratios across five categories (liquidity, solvency, profitability, efficiency, and risk assessment), providing structured inputs to large language models (LLMs) for intelligent insights. Our research demonstrates that LLMs generate significantly more accurate analysis when provided with pre-calculated, contextually-rich metrics rather than raw financial statements - achieving 73% higher relevance scores, 81% better risk identification, and 65% more accurate comparative analysis. The platform, built with modular architecture supporting any state-of-the-art LLM API (GPT-4, Claude, Gemini), processes CSV data to calculate metrics and generate interactive dashboards with AI-powered commentary. We validate the framework through comprehensive comparative analysis of companies with contrasting business models, showing how structured inputs enable nuanced, context-aware insights that adapt to specific financial situations. The tool reduces analysis time required significantly while ensuring computational consistency and providing institutional-quality interpretation that would typically require senior analyst expertise.
Large language models (LLMs) show strong reasoning via chain-of-thought (CoT) prompting, but the process is opaque, which makes verification, debugging, and control difficult in high-stakes settings. We present Vis-CoT, a human-in-the-loop framework that converts linear CoT text into an interactive reasoning graph. Users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting new, user-defined premises. This shifts interaction from passive observation to active collaboration, steering models toward more accurate and trustworthy conclusions. Across GSM8K and StrategyQA, Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines. A user study also shows large gains in perceived usability and trust. Vis-CoT points to a practical path for more reliable, understandable, and collaborative reasoning by combining LLMs with targeted human oversight.
Automatic furniture layout is long desired for convenient interior design. Leveraging the remarkable visual reasoning capabilities of multimodal large language models (MLLMs), recent methods address layout generation in a static manner, lacking the feedback-driven refinement essential for interactive user engagement. We introduce Chat2Layout, a novel interactive furniture layout generation system that extends the functionality of MLLMs into the realm of interactive layout design. To achieve this, we establish a unified vision-question paradigm for in-context learning, enabling seamless communication with MLLMs to steer their behavior without altering model weights. Within this framework, we present a novel training-free visual prompting mechanism. This involves a visual-text prompting technique that assist MLLMs in reasoning about plausible layout plans, followed by an Offline-to-Online search (O2O-Search) method, which identifies the minimal set of informative references to provide exemplars for visual-text prompting. By employing an agent system with MLLMs as the core controller, we enable bidirectional interaction. The agent not only comprehends the 3D environment and user requirements through linguistic and visual perception but also plans tasks and reasons about actions to generate and arrange furniture within the virtual space. Furthermore, the agent iteratively updates based on visual feedback from execution results. Experimental results demonstrate that our approach facilitates language-interactive generation and arrangement for diverse and complex 3D furniture.
Graphical user interface (GUI) prototyping represents an essential activity in the development of interactive systems, which are omnipresent today. GUI prototypes facilitate elicitation of requirements and help to test, evaluate, and validate ideas with users and the development team. However, creating GUI prototypes is a time-consuming process and often requires extensive resources. While existing research for automatic GUI generation focused largely on resource-intensive training and fine-tuning of LLMs, mainly for low-fidelity GUIs, we investigate the potential and effectiveness of Zero-Shot (ZS) prompting for high-fidelity GUI generation. We propose a Retrieval-Augmented GUI Generation (RAGG) approach, integrated with an LLM-based GUI retrieval re-ranking and filtering mechanism based on a large-scale GUI repository. In addition, we adapt Prompt Decomposition (PDGG) and Self-Critique (SCGG) for GUI generation. To evaluate the effectiveness of the proposed ZS prompting approaches for GUI generation, we extensively evaluated the accuracy and subjective satisfaction of the generated GUI prototypes. Our evaluation, which encompasses over 3,000 GUI annotations from over 100 crowd-workers with UI/UX experience, shows that SCGG, in contrast to PDGG and RAGG, can lead to more effective GUI generation, and provides valuable insights into the defects that are produced by the LLMs in the generated GUI prototypes.
Depression is a significant mental illness that affects how individuals express their emotions and engage with others, making communication challenging. Most depression assessment tools utilize self-report questionnaires, such as the Patient Health Questionnaire (PHQ-9). These psychometric instruments can be easily adapted to electronic forms. However, this approach cannot provide human-like explanations and interactions, leading to a poor interactivity. Furthermore, we have identified critical limitations in previous prompting methods. They are either constrained to queries using a single identifiable relation, or being agnostic to input contexts, making it difficult to capture variabilities that occur across different inference steps. To solve these issues, we develop a large language models LLM-enhanced conversational agent for depression detection, which makes it more effective and interactive. Specifically, we first explore an iterative knowledge-aware prompter (IKP), a new prompting paradigm which inject specific knowledge from language models progressively for multi-step reasoning, which learns to synthesize prompts conditioned on the current step's contexts. Second, our proposed system introduces a multi-step diagnosis (MSD) approach. Our system not only delivers a diagnosis but also generates a symptom summary through interactive conversations. Our proposed agent enables users to have interactive natural language dialogues with the system, enhancing their personalized comprehension of mental states. Our experiments demonstrate the effectiveness of the iterative knowledge-aware prompter design.
The revolution of artificial intelligence content generation has been rapidly accelerated with the booming text-to-image (T2I) diffusion models. Within just two years of development, it was unprecedentedly of high-quality, diversity, and creativity that the state-of-the-art models could generate. However, a prevalent limitation persists in the effective communication with these popular T2I models, such as Stable Diffusion, using natural language descriptions. This typically makes an engaging image hard to obtain without expertise in prompt engineering with complex word compositions, magic tags, and annotations. Inspired by the recently released DALLE3 - a T2I model directly built-in ChatGPT that talks human language, we revisit the existing T2I systems endeavoring to align human intent and introduce a new task - interactive text to image (iT2I), where people can interact with LLM for interleaved high-quality image generation/edit/refinement and question answering with stronger images and text correspondences using natural language. In addressing the iT2I problem, we present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models. We evaluate our approach for iT2I in a variety of common-used scenarios under different LLMs, e.g., ChatGPT, LLAMA, Baichuan, and InternLM. We demonstrate that our approach could be a convenient and low-cost way to introduce the iT2I ability for any existing LLMs and any text-to-image models without any training while bringing little degradation on LLMs' inherent capabilities in, e.g., question answering and code generation. We hope this work could draw broader attention and provide inspiration for boosting user experience in human-machine interactions alongside the image quality of the next-generation T2I systems.
Pre-trained large language models (“LLMs”) like GPT-3 can engage in fluent, multi-turn instruction-taking out-of-the-box, making them attractive materials for designing natural language interactions. Using natural language to steer LLM outputs (“prompting”) has emerged as an important design technique potentially accessible to non-AI-experts. Crafting effective prompts can be challenging, however, and prompt-based interactions are brittle. Here, we explore whether non-AI-experts can successfully engage in “end-user prompt engineering” using a design probe—a prototype LLM-based chatbot design tool supporting development and systematic evaluation of prompting strategies. Ultimately, our probe participants explored prompt designs opportunistically, not systematically, and struggled in ways echoing end-user programming systems and interactive machine learning systems. Expectations stemming from human-to-human instructional experiences, and a tendency to overgeneralize, were barriers to effective prompt design. These findings have implications for non-AI-expert-facing LLM-based tool design and for improving LLM-and-prompt literacy among programmers and the public, and present opportunities for further research.
Large Language Models (LLMs) have shown impressive reasoning abilities with the use of chain-of-thought (CoT) prompting. However, reasoning is still brittle: small errors early on propagate forward to lead to confidently asserted but erroneous conclusions. This paper presents AutoCrit, a metareasoning system that incorporates structured self-criticism and iterative error-fixing directly into the CoT procedure. AutoCrit integrates a reasoning agent, a critique agent, and an execution monitor in an active feedback loop to detect and correct inconsistency proactively step by step. On mathematical reasoning benchmarks (GSM8K), commonsense inference (CSQA2), and interactive planning (ALFWorld) benchmarks, AutoCrit achieves accuracy improvements of 12−18% over baseline CoT and reduces error propagation rates by half. Theoretical analysis of AutoCrit as an iterative fixed-point system formally establishes it rigorously and provides error-propagation limits that demonstrate its scalability. This work advances LLM reliability by showing that incorporating critique into reasoning outperforms post-hoc validation, the foundation for future reasoning-intensive applications in AI-assisted decision-making.
Theory of Mind (ToM)-an understanding of the mental states of others-is a key aspect of human social intelligence, yet, chatbots and LLM-based social agents do not typically integrate it. In this work, we demonstrate that LLMs that explicitly use ToM get better at dialogue, achieving goals more effectively. After showing that simply prompting models to generate mental states between dialogue turns already provides significant benefit, we further introduce ToMAgent (ToMA), a ToM-focused dialogue agent. ToMA is trained by pairing ToM with dialogue lookahead to produce mental states that are maximally useful for achieving dialogue goals. Experiments on the Sotopia interactive social evaluation benchmark demonstrate the effectiveness of our method over a range of baselines. Comprehensive analysis shows that ToMA exhibits more strategic, goal-oriented reasoning behaviors, which enable long-horizon adaptation, while maintaining better relationships with their partners. Our results suggest a step forward in integrating ToM for building socially intelligent LLM agents.
Effective prompt engineering is critical to realizing the promised productivity gains of large language models (LLMs) in knowledge-intensive tasks. Yet, many users struggle to craft prompts that yield high-quality outputs, limiting the practical benefits of LLMs. Existing approaches, such as prompt handbooks or automated optimization pipelines, either require substantial effort, expert knowledge, or lack interactive guidance. To address this gap, we design and evaluate PromptPilot, an interactive prompting assistant grounded in four empirically derived design objectives for LLM-enhanced prompt engineering. We conducted a randomized controlled experiment with 80 participants completing three realistic, work-related writing tasks. Participants supported by PromptPilot achieved significantly higher performance (median: 78.3 vs. 61.7; p = .045, d = 0.56), and reported enhanced efficiency, ease-of-use, and autonomy during interaction. These findings empirically validate the effectiveness of our proposed design objectives, establishing LLM-enhanced prompt engineering as a viable technique for improving human-AI collaboration.
Text-based applications and chatbots are increasingly popular for delivering banking services and educational tools, offering convenient and efficient solutions for users. Whereas, personalized assistants have transformed user engagement in the digital banking space by utilizing Large Language Models (LLMs) in conjunction with autonomous agents. This study proposes the development of an intelligent personalized assistant for digital banking, utilizing a multi-agent framework based on the LangGraph and Chain of Thoughts (COT) prompting. While COT guarantees context-aware replies, the LangGraph design maps characteristics to nodes to improve user interactions. The objectives of this system are to enhance task efficiency and elevate the capabilities of digital banking assistants. We present a customizable digital banking system powered by LLM-based models, designed to deliver an interactive and personalized banking experience. The system supports a range of services, including adding money, transferring funds, paying bills, accessing telco services like mobile recharge, managing savings interest rates, DPS schemes, fixed deposits, and answering FAQs related to banking information. Therefore, integrating COT for logical reasoning enhances the effectiveness of multi-agent systems, as each single agent benefits from the structured reasoning process. In addition, LangGraph is employed for structured data management, enabling the assistant to support and accelerate various digital banking processes efficiently. The code implementation of this work is available for public access at: https://github.com/srv-sh/digital_agent.
With the proliferation of large language model (LLM) applications since 2022, their use in education has sparked both excitement and concern. Recent studies consistently highlight students'(mis)use of LLMs can hinder learning outcomes. This work aims to teach students how to effectively prompt LLMs to improve their learning. We first proposed pedagogical prompting, a theoretically-grounded new concept to elicit learning-oriented responses from LLMs. To move from concept design to a proof-of-concept learning intervention in real educational settings, we selected early undergraduate CS education (CS1/CS2) as the example context. We began with a formative survey study with instructors (N=36) teaching early-stage undergraduate-level CS courses to inform the instructional design based on classroom needs. Based on their insights, we designed and developed a learning intervention through an interactive system with scenario-based instruction to train pedagogical prompting skills. Finally, we evaluated its instructional effectiveness through a user study with CS novice students (N=22) using pre/post-tests. Through mixed methods analyses, our results indicate significant improvements in learners'LLM-based pedagogical help-seeking skills, along with positive attitudes toward the system and increased willingness to use pedagogical prompts in the future. Our contributions include (1) a theoretical framework of pedagogical prompting; (2) empirical insights into current instructor attitudes toward pedagogical prompting; and (3) a learning intervention design with an interactive learning tool and scenario-based instruction leading to promising results on teaching LLM-based help-seeking. Our approach is scalable for broader implementation in classrooms and has the potential to be integrated into tools like ChatGPT as an on-boarding experience to encourage learning-oriented use of generative AI.
LLM-IE: a python package for biomedical generative information extraction with large language models
Abstract Objectives Despite the recent adoption of large language models (LLMs) for biomedical information extraction (IE), challenges in prompt engineering and algorithms persist, with no dedicated software available. To address this, we developed LLM-IE: a Python package for building complete IE pipelines. Materials and Methods The LLM-IE supports named entity recognition, entity attribute extraction, and relation extraction tasks. We benchmarked it on the i2b2 clinical datasets. Results The sentence-based prompting algorithm resulted in the best 8-shot performance of over 70% strict F1 for entity extraction and about 60% F1 for entity attribute extraction. Discussion We developed a Python package, LLM-IE, highlighting (1) an interactive LLM agent to support schema definition and prompt design, (2) state-of-the-art prompting algorithms, and (3) visualization features. Conclusion The LLM-IE provides essential building blocks for developing robust information extraction pipelines. Future work will aim to expand its features and further optimize computational efficiency.
Machine learning algorithms play a pivotal role in a wide range of Artificial Intelligence (AI) applications. Explaining the results and behavior of a machine learning model, however, remains a challenge. In this paper, we present a new approach to the explanation of machine learning models using a large language model (LLM). In this work, we seek natural language descriptions of the behavioral patterns of a machine learning model by a combination of prompting and model sampling. A subspace sampling technique is developed to generate ML model outputs using partial features in a user defined space. A projective visualization method is employed to guide the sampling process, including user-directed interactive sampling and feature-based sampling, so that an optimal amount of information can be provided to the LLM to ensure accurate and concise natural language explanations. Two public datasets, a student performance dataset and a weather dataset, were used to test our approach under various conditions.
Maximizing the effectiveness of Large Language Models (LLMs) requires prompt optimization, but existing approaches frequently have limited interpretability, high computational cost, and narrow generalization. We introduce RECAP, a modular, cognitively based framework for explainable and automated prompt engineering. Neurofeedback-based self-scoring, evolutionary prompt graph search, contrastive-symbolic rule induction, Pareto-based cost-accuracy optimization, an interactive debugging interface, and a shared inter-module memory layer are the six main innovations it presents. Without the need for model fine-tuning, RECAP lowers token, latency, and memory overhead while increasing prompt quality and LLM accuracy. It offers a scalable and interpretable substitute for conventional tuning pipelines and can be used in a variety of fields, including conversational AI and search.
Large language models (LLMs) have recently soared in popularity due to their ease of access and the unprecedented ability to synthesize text responses to diverse user questions. However, LLMs like ChatGPT present significant limitations in supporting complex information tasks due to the insufficient affordances of the text-based medium and linear conversational structure. Through a formative study with ten participants, we found that LLM interfaces often present long-winded responses, making it difficult for people to quickly comprehend and interact flexibly with various pieces of information, particularly during more complex tasks. We present Graphologue, an interactive system that converts text-based responses from LLMs into graphical diagrams to facilitate information-seeking and question-answering tasks. Graphologue employs novel prompting strategies and interface designs to extract entities and relationships from LLM responses and constructs node-link diagrams in real-time. Further, users can interact with the diagrams to flexibly adjust the graphical presentation and to submit context-specific prompts to obtain more information. Utilizing diagrams, Graphologue enables graphical, non-linear dialogues between humans and LLMs, facilitating information exploration, organization, and comprehension.
Implementing automation testing is difficult and as a consequence there is a growing desire for semi-automated software testing systems with humans in the loop. Leveraging the growth of LLMs, recent research has demonstrated LLMs’ potential to improve performance on test generation, reporting, and bug triaging. However, relatively little work has explored the interactivity issues that emerge in semi-automated LLM-assisted software test case development. To fill this gap, we present two user studies (N1 = 16, N2 = 24) that investigate productivity, creativity, and user attention in three semi-automated LLM-assisted interaction strategies: (1) pre-emptive prompting; (2) buffered response; and (3) guided input. We find that pre-emptively prompting the user significantly enhances branch coverage and task creativity by more than 30% while reducing user’s off-task idle time by up to 48.7%. We conclude by suggesting concrete research directions applying mixed-initiative principles for LLM-based interactive systems for semi-automated software testing.
Computer Use Agents (CUAs), autonomous systems that interact with software interfaces via browsers or virtual machines, are rapidly being deployed in consumer and enterprise environments. These agents introduce novel attack surfaces and trust boundaries that are not captured by traditional threat models. Despite their growing capabilities, the security boundaries of CUAs remain poorly understood. In this paper, we conduct a systematic threat analysis and testing of real-world CUAs under adversarial conditions. We identify seven classes of risks unique to the CUA paradigm, and analyze three concrete exploit scenarios in depth: (1) clickjacking via visual overlays that mislead interface-level reasoning, (2) indirect prompt injection that enables Remote Code Execution (RCE) through chained tool use, and (3) CoT exposure attacks that manipulate implicit interface framing to hijack multi-step reasoning. These case studies reveal deeper architectural flaws across current CUA implementations. Namely, a lack of input provenance tracking, weak interface-action binding, and insufficient control over agent memory and delegation. We conclude by proposing a CUA-specific security evaluation framework and design principles for safe deployment in adversarial and high-stakes settings.
Large language models (LLMs) have gained widespread popularity due to their ability to perform ad-hoc natural language processing (NLP) tasks with simple natural language prompts. Part of the appeal for LLMs is their approachability to the general public, including individuals with little technical expertise in NLP. However, prompts can vary significantly in terms of their linguistic structure, context, and other semantics, and modifying one or more of these aspects can result in significant differences in task performance. Non-expert users may find it challenging to identify the changes needed to improve a prompt, especially when they lack domain-specific knowledge and appropriate feedback. To address this challenge, we present PromptAid, a visual analytics system designed to interactively create, refine, and test prompts through exploration, perturbation, testing, and iteration. PromptAid uses coordinated visualizations which allow users to improve prompts via three strategies: keyword perturbations, paraphrasing perturbations, and obtaining the best set of in-context few-shot examples. PromptAid was designed through a pre-study involving NLP experts, and evaluated via a robust mixed-methods user study. Our findings indicate that PromptAid helps users to iterate over prompts with less cognitive overhead, generate diverse prompts with the help of recommendations, and analyze the performance of the generated prompts while surpassing existing state-of-the-art prompting interfaces in performance.
Computer-Use Agents (CUAs) with full system access enable powerful task automation but pose significant security and privacy risks due to their ability to manipulate files, access user data, and execute arbitrary commands. While prior work has focused on browser-based agents and HTML-level attacks, the vulnerabilities of CUAs remain underexplored. In this paper, we investigate Visual Prompt Injection (VPI) attacks, where malicious instructions are visually embedded within rendered user interfaces, and examine their impact on both CUAs and Browser-Use Agents (BUAs). We propose VPI-Bench, a benchmark of 306 test cases across five widely used platforms, to evaluate agent robustness under VPI threats. Each test case is a variant of a web platform, designed to be interactive, deployed in a realistic environment, and containing a visually embedded malicious prompt. Our empirical study shows that current CUAs and BUAs can be deceived at rates of up to 51% and 100%, respectively, on certain platforms. The experimental results also indicate that system prompt defenses offer only limited improvements. These findings highlight the need for robust, context-aware defenses to ensure the safe deployment of multimodal AI agents in real-world environments. The code and dataset are available at: https://github.com/cua-framework/agents
Geospatial tasks often require the coordination of various spatial algorithms and operations, which are usually performed through tool calling guided by natural language prompts. Crafting effective prompts is challenging due to the inherent complexity and ambiguity of natural language. In this paper, we present GeoPet, a visual analytics system designed to simplify the process of prompt engineering to improve the performance of geospatial tool callings on large language models (LLMs). At its core, GeoPet is a sophisticated tool recommendation method that accepts geospatial tasks as input, decomposes them into atomic tasks, identifies relevant tools, and extracts the most relevant tool descriptions for these atomic tasks. The system is designed to support interactive prompt engineering of geospatial tool invocations, enabling users to explore the connections between geospatial tasks and established tools and to evaluate their performance. This enables crafting and refining prompts that coordinate LLMs with human expertise. GeoPet's satisfaction, practicality, usability, and visual design are validated through two case studies and a user study. These demonstrate that the system significantly eases the burden of rapid engineering and skillfully guides LLMs in geospatial tool calling capabilities. By providing a visual and interactive system for prompt engineering, GeoPet helps users navigate complex geospatial tasks and improves the overall efficiency and accuracy of tool calling for LLMs.
This study explores how Indonesian university design students use generative AI in their creative process. The use of AI in the creative process is related to the extent to which AI is utilized in the process of creating design works, the ability to compose prompts, and how AI is interpreted amidst the automation of design technology. The study used a quantitative descriptive approach with supporting qualitative insights with 110 respondents taken using a purposive sampling technique. Surveys were given to respondents, then further explored by interviewing selected informants. This study identifies seven dimensions of engagement, including prompting behaviour, prompting strategies, perceived creativity, ethical reflection, designer identity, engagement with others' AI work, and the habit of sharing AI works. The results indicate that students primarily use generative AI as a tool for idea exploration, concept visualization, and style experimentation. Most still consider themselves designers, viewing AI as a supportive partner rather than a creative substitute. They demonstrated moderate to high levels of creative self-efficacy, ethical awareness, and adaptive identity transformation. However, active engagement with others' AI work remains limited, and while rapid sharing is considered a beneficial form of knowledge exchange, it is often treated as a personal design asset. Overall, students are undergoing a transitional phase in which they learn to integrate AI into their creative practices while maintaining human-centered control and ethical reflection. This study contributes to the growing body of work on AI-assisted design by suggesting a versatile framework for understanding how students interact with AI creatively and ethically. AI occupies a multi-layered role in the design ecosystem, serving not only as a medium for idea generation but also as a space for identity reflection and a place to negotiate the evolving role of the designer. This study highlights practical considerations for integrating critical AI literacy, ethical design frameworks, and collaborative prompt practices into design education curricula, ensuring students are well-equipped for evolving creative industry practices.
We present SAID (Social Media AI-generated Interface Dataset), a systematically curated collection of 240 social media profile interfaces generated through controlled prompt engineering focused on accessibility. As AI tools reshape interface design practices, understanding how these systems interpret and implement accessibility requirements in social media interfaces becomes more and more important. Through six distinct prompt categories examining both generic and specific accessibility requirements, our dataset captures how AI systems interpret and implement accessibility features across visual and motor impairment dimensions. The dataset combines complete interface designs in multiple formats (PNG and SVG), detailed prompt engineering methodology, and comprehensive documentation of interface components such as social identity presentation, content engagement, navigation, and interactive elements. SAID enables novel research directions from understanding AI's role in shaping accessible social media experiences to examining how automated design tools can support more inclusive social interactions.
State-of-the-art neural language models can now be used to solve ad-hoc language tasks through zero-shot prompting without the need for supervised training. This approach has gained popularity in recent years, and researchers have demonstrated prompts that achieve strong accuracy on specific NLP tasks. However, finding a prompt for new tasks requires experimentation. Different prompt templates with different wording choices lead to significant accuracy differences. PromptIDE allows users to experiment with prompt variations, visualize prompt performance, and iteratively optimize prompts. We developed a workflow that allows users to first focus on model feedback using small data before moving on to a large data regime that allows empirical grounding of promising prompts using quantitative measures of the task. The tool then allows easy deployment of the newly created ad-hoc models. We demonstrate the utility of PromptIDE (demo: http://prompt.vizhub.ai) and our workflow using several real-world use cases.
Evaluating outputs of large language models (LLMs) is challenging, requiring making—and making sense of—many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.
This study presents a novel visual prompt selection framework for augmented reality (AR) applications that integrates advanced object detection and image segmentation techniques. The framework is designed to enhance user interactions and improve the accuracy of foreground–background separation in AR environments, making AR experiences more immersive and precise. We evaluated six state-of-the-art object detectors (DETR, DINO, CoDETR, YOLOv5, YOLOv8, and YOLO-NAS) in combination with a prompt segmentation model using the DAVIS 2017 validation dataset. The results show that the combination of YOLO-NAS-L and SAM achieved the best performance with a J&F score of 70%, while DINO-scale4-swin had the lowest score of 57.5%. This 12.5% performance gap highlights the significant contribution of user-provided regions of interest (ROIs) to segmentation outcomes, emphasizing the importance of interactive user input in enhancing accuracy. Our framework supports fast prompt processing and accurate mask generation, allowing users to refine digital overlays interactively, thereby improving both the quality of AR experiences and overall user satisfaction. Additionally, the framework enables the automatic detection of moving objects, providing a more efficient alternative to traditional manual selection interfaces in AR devices. This capability is particularly valuable in dynamic AR scenarios, where seamless user interaction is crucial.
We present an agentic framework, Thinker, which achieves state of art performance in challenging reasoning tasks for realistic customer service scenarios that involve complex business logic and human interactions via long horizons. On the $\tau$-bench retail dataset, Thinker achieves 82.6\% success rate with GPT-4o (version 2024-06-01) (baseline: 68.3\%), and 81.9\% success rate with Llama-3.1 405B (baseline: 49.6\%), without any fine-tuning. Thinker effectively closes the gap in reasoning capabilities between the base models by introducing proper structure. The key features of the Thinker framework are: (1) State-Machine Augmented Generation (SMAG), which represents business logic as state machines and the LLM uses state machines as tools. (2) Delegation of tasks from the main reasoning loop to LLM-powered tools. (3) Adaptive context management. Our prompting-only solution achieves signficant gains, while still maintaining a standard agentic architecture with a ReAct style reasoning loop. The key is to innovate on the tool interface design, as exemplified by SMAG and the LLM-powered tools.
BACKGROUND The labor-intensive nature of data extraction from sources like discharge summaries (DSs) poses significant obstacles to the digitization of medical records particularly for low- and middle-income countries (LMICs). In this paper we present a completely automated method, MedPromptExtract, to efficiently extract data from DS while maintaining confidentiality. METHODS The source of data were DSs from Kokilaben Dhirubhai Ambani Hospital (KDAH) of patients having acute kidney injury (AKI). A pre-existing tool, Expert-Informed Joint Learning aGgrEatioN (EIGEN), which leverages semi-supervised learning techniques for high-fidelity information extraction, was used to anonymize the DSs, and natural language processing (NLP) was used to extract data from regular fields. We used prompt engineering and a large language model (LLM) to extract custom clinical information from free-flowing text describing the patient's stay in the hospital. Twelve features associated with the occurrence of AKI were extracted. The LLM's responses were validated against clinicians' annotations. RESULTS The MedPromptExtract tool first subjected DSs to the anonymization pipeline, which took 3 seconds per summary. Successful anonymization was verified by clinicians, thereafter the NLP pipeline extracted structured text from the anonymized pdfs at the rate of 0.2 s per summary with 100% accuracy. Finally, DSs were analysed by the LLM pipeline using Gemini Pro for the 12 features. Accuracy metrics were calculated by comparing model responses to clinicians' annotations with 7 features achieving Area Under the Curve (AUC) above 0.9, indicating the high fidelity of the extraction process. CONCLUSIONS MedPromptExtract serves as an automated adaptable tool for efficient data extraction from medical records with a dynamic user interface.
While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, considering supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon a Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines like Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models like GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely"think with images".
Healthcare professionals need effective ways to use, understand, and validate AI-driven clinical decision support systems. Existing systems face two key limitations: complex visualizations and lack of grounding in scientific evidence. We present an integrated Decision Support System that combines interactive visualizations with a conversational agent for explaining diabetes risk assessments. We propose a hybrid prompt handling approach combining fine-tuned language models for analytical queries with general Large Language Models (LLMs) for broader medical questions, a methodology for grounding AI explanations in scientific evidence and a feature range analysis technique to support deeper understanding of feature contributions. We conducted a mixed-methods study with 30 healthcare professionals and found that the conversational interactions helped healthcare professionals build a clear understanding of model assessments, while the integration of scientific evidence calibrated trust in the system’s decisions. Most participants reported that the system supported both patient risk evaluation and recommendation.
Generative AI for the creation of images is becoming a staple in the toolkit of digital artists and visual designers. The interaction with these systems is mediated by prompting, a process in which users write a short text to describe the desired image’s content and style. The study of prompts offers an unprecedented opportunity to gain insight into the process of human creativity. Yet, our understanding of how people use them remains limited. We analyze more than 145,000 prompts from the logs of two Generative AI platforms (Stable Diffusion and Pick-a-Pic) to shed light on how people explore new concepts over time, and how their exploration might be influenced by different design choices in human-computer interfaces to Generative AI. We find that users exhibit a tendency towards exploration of new topics over exploitation of concepts visited previously. However, a comparative analysis of the two platforms, which differ both in scope and functionalities, reveals some stark differences. Features diverting user focus from prompting and providing instead shortcuts for quickly generating image variants are associated with a considerable reduction in both exploration of novel concepts and detail in the submitted prompts. These results carry direct implications for the design of human interfaces to Generative AI and raise new questions regarding how the process of prompting should be aided in ways that best support creativity.
Auto-regressive LLM-based software engineering (SWE) agents, henceforth SWE agents, have made tremendous progress (>60% on SWE-Bench Verified) on real-world coding challenges including GitHub issue resolution. SWE agents use a combination of reasoning, environment interaction and self-reflection to resolve issues thereby generating"trajectories". Analysis of SWE agent trajectories is difficult, not only as they exceed LLM sequence length (sometimes, greater than 128k) but also because it involves a relatively prolonged interaction between an LLM and the environment managed by the agent. In case of an agent error, it can be hard to decipher, locate and understand its scope. Similarly, it can be hard to track improvements or regression over multiple runs or experiments. While a lot of research has gone into making these SWE agents reach state-of-the-art, much less focus has been put into creating tools to help analyze and visualize agent output. We propose a novel tool called SeaView: Software Engineering Agent Visual Interface for Enhanced Workflow, with a vision to assist SWE-agent researchers to visualize and inspect their experiments. SeaView's novel mechanisms help compare experimental runs with varying hyper-parameters or LLMs, and quickly get an understanding of LLM or environment related problems. Based on our user study, experienced researchers spend between 10 and 30 minutes to gather the information provided by SeaView, while researchers with little experience can spend between 30 minutes to 1 hour to diagnose their experiment.
Large Language Models (LLMs) are increasingly used in complex knowledge work, yet linear transcript interfaces limit support for reflection. Schon's Reflective Practice distinguishes between reflection-in-action (during a task) and reflection-on-action (after a task), both benefiting from non-linear, revisitable representations of dialogue. ChatGraPhT is an interactive tool that shows dialogue as a visual map, allowing users to branch and merge ideas, edit past messages, and receive guidance that prompts deeper reflection. It supports non-linear, multi-path dialogue, while two agentic LLM assistants provide moment-to-moment and higher-level guidance. Our inquiry suggests that keeping the conversation structure visible, allowing branching and merging, and suggesting patterns or ways to combine ideas deepened user reflective engagement. Contributions are: (1) the design of a node-link, agentic LLM interface for reflective dialogue, and (2) transferable design knowledge on balancing structure and AI support to sustain reflection in complex, open-ended tasks.
Currently, the vast majority of locally deployed open-source large language models (LLMs) and some commercial model interfaces do not support stable tool calling functionality. The existing solution involves fine-tuning LLMs, which results in significant time and computational resource consumption. This paper proposes a method that enables LLMs to achieve stable tool calling capabilities using only prompt engineering and some ingenious code design. We conducted experiments on multiple LLMs that lack tool calling capabilities across various tool calling tasks, achieving a success rate of 100%.
Remote sensing visual question answering (RQA) was recently proposed with the aim of interfacing natural language and vision to ease the access of information contained in Earth Observation data for a wide audience, which is granted by simple questions in natural language. The traditional vision/language interface is an embedding obtained by fusing features from two deep models, one processing the image and another the question. Despite the success of early VQA models, it remains difficult to control the adequacy of the visual information extracted by its deep model, which should act as a context regularizing the work of the language model. We propose to extract this context information with a visual model, convert it to text and inject it, i.e. prompt it, into a language model. The language model is therefore responsible to process the question with the visual context, and extract features, which are useful to find the answer. We study the effect of prompting with respect to a black-box visual extractor and discuss the importance of training a visual model producing accurate context.
Introduction: The labour-intensive nature of data extraction from sources like discharge summaries (DS) poses significant obstacles to the digitisation of medical records particularly for low- and middle-income countries (LMICs). In this paper we present a completely automated method MedPromptExtract to efficiently extract data from DS while maintaining confidentiality. Methods: The source of data was Discharge Summaries (DS) from Kokilaben Dhirubhai Ambani Hospital (KDAH) of patients having Acute Kidney Injury (AKI). A pre-existing tool EIGEN which leverages semi-supervised learning techniques for high-fidelity information extraction was used to anonymize the DS, Natural Language Processing (NLP) was used to extract data from regular fields. We used Prompt Engineering and Large Language Model(LLM) to extract custom clinical information from free flowing text describing the patients stay in the hospital. Twelve features associated with occurrence of AKI were extracted. The LLM responses were validated against clinicians annotations. Results: The MedPromptExtracttool first subjected DS to the anonymization pipeline which took three seconds per summary. Successful anonymization was verified by clinicians, thereafter NLP pipeline extracted structured text from the anonymized pdfs at the rate of 0.2 seconds per summary with 100% accuracy.Finally DS were analysed by the LLM pipeline using Gemini Pro for the twelve features. Accuracy metrics were calculated by comparing model responses to clinicians annotations with seven features achieving AUCs above 0.9, indicating high fidelity of the extraction process. Conclusion: MedPromptExtract serves as an automated adaptable tool for efficient data extraction from medical records with a dynamic user interface. Keywords: Digitizing Medical Records, Automated Anonymisation, Information Retrieval, Large Language Models, Prompt Engineering
We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning either rely on training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
Utilizing large language models (LLMs) to compose off-the-shelf visual tools represents a promising avenue of research for developing robust visual assistants capable of addressing diverse visual tasks. However, these methods often overlook the potential for continual learning, typically by freezing the utilized tools, thus limiting their adaptation to environments requiring new knowledge. To tackle this challenge, we propose CLOVA, a Closed-LOop Visual Assistant, which operates within a framework encompassing inference, reflection, and learning phases. During the inference phase, LLMs generate programs and execute corresponding tools to complete assigned tasks. In the reflection phase, a multimodal global-local reflection scheme analyzes human feedback to determine which tools require updating. Lastly, the learning phase employs three flexible approaches to automatically gather training data and introduces a novel prompt tuning scheme to update the tools, allowing CLOVA to efficiently acquire new knowledge. Experimental findings demonstrate that CLOVA surpasses existing tool-usage methods by 5% in visual question answering and multiple-image reasoning, by 10% in knowledge tagging, and by 20% in image editing. These results under-score the significance of the continual learning capability in general visual assistants.
Online mental health communities (OMHCs) offer rich posts and comments for viewers, who do not directly participate in the communications, to seek social support from others' experience. However, viewers could face challenges in finding helpful posts and comments and digesting the content to get needed support, as revealed in our formative study (N=10). In this work, we present an interactive visual tool named ComViewer to help viewers seek social support in OMHCs. With ComViewer, viewers can filter posts of different topics and find supportive comments via a zoomable circle packing visual component that adapts to searched keywords. Powered by LLM, ComViewer supports an interactive sensemaking process by enabling viewers to interactively highlight, summarize, and question any community content. A within-subjects study (N=20) demonstrates ComViewer's strengths in providing viewers with a more simplified, more fruitful, and more engaging support-seeking experience compared to a baseline OMHC interface without ComViewer. We further discuss design implications for facilitating information-seeking and sense making in online mental health communities.
The emergence of foundation models, such as large language models (LLMs) GPT-4 and text-to-image models DALL-E, has opened up numerous possibilities across various domains. People can now use natural language (i.e., prompts) to communicate with AI to perform tasks. While people can use foundation models through chatbots (e.g., ChatGPT), chat, regardless of the capabilities of the underlying models, is not a production tool for building reusable AI services. APIs like LangChain allow for LLM-based application development but require substantial programming knowledge, thus posing a barrier. To mitigate this, we systematically review, summarise, refine and extend the concept of AI chain by incorporating the best principles and practices that have been accumulated in software engineering for decades into AI chain engineering, to systematize AI chain engineering methodology. We also develop a no-code integrated development environment, Prompt Sapper, which embodies these AI chain engineering principles and patterns naturally in the process of building AI chains, thereby improving the performance and quality of AI chains. With Prompt Sapper, AI chain engineers can compose prompt-based AI services on top of foundation models through chat-based requirement analysis and visual programming. Our user study evaluated and demonstrated the efficiency and correctness of Prompt Sapper.
Large Language Models (LLMs) have gained widespread popularity due to their ability to perform ad-hoc Natural Language Processing (NLP) tasks with a simple natural language prompt. Part of the appeal for LLMs is their approachability to the general public, including individuals with no prior technical experience in NLP techniques. However, natural language prompts can vary significantly in terms of their linguistic structure, context, and other semantics. Modifying one or more of these aspects can result in significant differences in task performance. Non-expert users may find it challenging to identify the changes needed to improve a prompt, especially when they lack domain-specific knowledge and lack appropriate feedback. To address this challenge, we present PromptAid, a visual analytics system designed to interactively create, refine, and test prompts through exploration, perturbation, testing, and iteration. PromptAid uses multiple, coordinated visualizations which allow users to improve prompts by using the three strategies: keyword perturbations, paraphrasing perturbations, and obtaining the best set of in-context few-shot examples. PromptAid was designed through an iterative prototyping process involving NLP experts and was evaluated through quantitative and qualitative assessments for LLMs. Our findings indicate that PromptAid helps users to iterate over prompt template alterations with less cognitive overhead, generate diverse prompts with help of recommendations, and analyze the performance of the generated prompts while surpassing existing state-of-the-art prompting interfaces in performance.
No abstract available
Cooking process visualization is a promising task in the intersection of image generation and food analysis, which aims to generate an image for each cooking step of a recipe. However, most existing works focus on generating images of finished foods based on the given recipes, and face two challenges in visualizing the cooking process. First, the appearance of ingredients changes variously across cooking steps, it is difficult to generate the correct appearances of foods that match the textual description, leading to semantic inconsistency. Second, the current step might depend on the operations of previous step, it is crucial to maintain the contextual coherence of images in sequential order. In this work, we present a cooking process visualization model, called Chain-of-Cooking. Specifically, to generate correct appearances of ingredients, we present a Dynamic Patch Selection Module to retrieve previously generated image patches as references, which are most related to current textual contents. Furthermore, to enhance the coherence and keep the rational order of generated images, we propose a Semantic Evolution Module and a Bidirectional Chain-of-Thought (CoT) Guidance. To better utilize the semantics of previous texts, the Semantic Evolution Module establishes the semantical association between latent prompts and current cooking step, and merges it with the latent features. Then the CoT Guidance updates the merged features to guide the current cooking step remain coherent with the previous step. Moreover, we construct a dataset named CookViz, consisting of intermediate image-text pairs for the cooking process. Quantitative and qualitative experiments show that our method outperforms existing methods in generating coherent and semantic consistent cooking process.
Computational Fluid Dynamics (CFD) is widely used in aerospace, energy, and biology to model fluid flow, heat transfer, and chemical reactions. While Large Language Models (LLMs) have transformed various domains, their application in CFD remains limited, particularly for complex tasks like post-processing. To bridge this gap, we introduce MetaOpenFOAM 2.0, which leverages Chain of Thought (COT) decomposition and iterative verification to enhance accessibility for non-expert users through natural language inputs. Tested on a new benchmark covering simulation (fluid flow, heat transfer, combustion) and post-processing (extraction, visualization), MetaOpenFOAM 2.0 achieved an Executability score of 6.3/7 and a pass rate of 86.9%, significantly outperforming MetaOpenFOAM 1.0 (2.1/7, 0%). Additionally, it proved cost-efficient, averaging $0.15 per case. An ablation study confirmed that COT-driven decomposition and iterative refinement substantially improved task performance. Furthermore, scaling laws showed that increasing COT steps enhanced accuracy while raising token usage, aligning with LLM post-training scaling trends. These results highlight the transformative potential of LLMs in automating CFD workflows for industrial and research applications. Code is available at https://github.com/Terry-cyx/MetaOpenFOAM
Recent advances in chain-of-thought (CoT) prompting have enabled large language models (LLMs) to perform multi-step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token-level attribution, such that the high-level semantic roles of reasoning steps and their transitions remain underexplored. In this paper, we introduce a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model their progression as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
Data visualization serves as a critical means for presenting data and mining its valuable insights. The task of chart summarization, through natural language processing techniques, facilitates in-depth data analysis of charts. However, there still are notable deficiencies in terms of visual-language matching and reasoning ability for existing approaches. To address these limitations, this study constructs a large-scale dataset of comprehensive chart-caption pairs and fine-tuning instructions on each chart. Thanks to the broad coverage of various topics and visual styles within this dataset, better matching degree can be achieved from the view of training data. Moreover, we propose an innovative chart summarization method, ChartThinker, which synthesizes deep analysis based on chains of thought and strategies of context retrieval, aiming to improve the logical coherence and accuracy of the generated summaries. Built upon the curated datasets, our trained model consistently exhibits superior performance in chart summarization tasks, surpassing 8 state-of-the-art models over 7 evaluation metrics. Our dataset and codes are publicly accessible.
This Visualization Viewpoints article explores how visualization helps uncover and communicate the internal chain-of-thought trajectories and generative pathways of large language models (LLMs) in reasoning tasks. As LLMs become increasingly powerful and widespread, a key challenge is understanding how their reasoning dynamics unfold, particularly in natural language processing (NLP) applications. Their outputs may appear coherent, yet the multistep inference pathways behind them remain largely hidden. We argue that visualization offers an effective avenue to illuminate these internal mechanisms. Moving beyond attention weights or token saliency, we advocate for richer visual tools that expose model uncertainty, highlight alternative reasoning paths, and reveal what the model omits or overlooks. We discuss examples, such as prompt trajectory visualizations, counterfactual response maps, and semantic drift flows, to illustrate how these techniques foster trust, identify failure modes, and support deeper human interaction with these systems. In doing so, visualizing the chain of thought in LLMs lays critical groundwork for transparent, interpretable, and truly collaborative human–AI reasoning.
No abstract available
Although data visualization is powerful for revealing patterns and communicating insights, creating effective visualizations requires familiarity with authoring tools and often disrupts the analysis flow. While large language models show promise for automatically converting analysis intent into visualizations, existing methods function as black boxes without transparent reasoning processes, which prevents users from understanding design rationales and refining suboptimal outputs. To bridge this gap, we propose integrating Chain-of-Thought (CoT) reasoning into the Natural Language to Visualization (NL2VIS) pipeline. First, we design a comprehensive CoT reasoning process for NL2VIS and develop an automatic pipeline to equip existing datasets with structured reasoning steps. Second, we introduce nvBench-CoT, a specialized dataset capturing detailed step-by-step reasoning from ambiguous natural language descriptions to finalized visualizations, which enables state-of-the-art performance when used for model fine-tuning. Third, we develop DeepVIS, an interactive visual interface that tightly integrates with the CoT reasoning process, allowing users to inspect reasoning steps, identify errors, and make targeted adjustments to improve visualization outcomes. Quantitative benchmark evaluations, two use cases, and a user study collectively demonstrate that our CoT framework effectively enhances NL2VIS quality while providing insightful reasoning steps to users.
Natural Language Interfaces (NLIs) backed by Large Language Models (LLMs) are used to interact with visualizations through natural language queries. Using the specific example of 2.5D treemaps, the Delphi tool was recently presented, introducing an interactive 2.5D visualization with an accompanying chat interface, where the LLM can react to user input and adapt the visualization at its own discretion. While Delphi has demonstrated effectiveness, the authors have not included an evaluation of the LLM's performance with respect to its prompt and specific task types. In this study, we systematically evaluate the impact of prompt engineering on Delphi's ability to answer factual questions related to data and visualization. Specifically, we investigate the effect of the Chain-of-Thought prompting technique by employing a questionnaire comprising 40 questions across ten low-level analytic tasks. Our findings aim to refine prompt design methodologies and enhance the usability and effectiveness of NLIs in advanced visualization systems.
Current multimodal large language models (MLLMs), while effective in natural image understanding, struggle with visualization understanding due to their inability to decode the data-to-visual mapping and extract structured information. To address these challenges, we propose SimVec, a novel simplified vector format that encodes chart elements such as mark type, position, and size. The effectiveness of SimVec is demonstrated by using MLLMs to reconstruct chart information from SimVec formats. Then, we build a new visualization dataset, SimVecVis, to enhance the performance of MLLMs in visualization understanding, which consists of three key dimensions: bitmap images of charts, their SimVec representations, and corresponding data-centric question-answering (QA) pairs with explanatory chain-of-thought (CoT) descriptions. We fine-tune state-of-the-art MLLMs (e.g., MiniCPM and Qwen-VL), using SimVecVis with different dataset dimensions. The experimental results show that it leads to substantial performance improvements of MLLMs with good spatial perception capabilities (e.g., MiniCPM) in data-centric QA tasks. Our dataset and source code are available at: https://github.com/VIDA-Lab/SimVecVis.
Procedural documents in power plants play a critical role in ensuring standardized operations and maintenance, especially in fault detection. However, manually consulting these documents during real-time troubleshooting is often inefficient. To address this challenge, we propose a visual analytics approach powered by Large Language Models (LLMs) to automatically extract faultrelated entities from power plant documentation and construct structured knowledge graphs for efficient fault diagnosis. Specifically, we utilize MinerU for PDF parsing, train a corpus classification model to filter key pages, and develop a multi-stage prompt construction method with Retrieval-Augmented Generation and Chain-of-Thought strategies to enhance LLMs’ reasoning abilities. Additionally, we introduce an interactive visualization system that presents extraction results intuitively and enables domain experts to validate, refine, and interpret the reasoning process. Quantitative evaluation and two case studies demonstrate the effectiveness and usability of our approach.
In the data analysis process, visualized data can help users gain better insights. To make it easier and faster for users to obtain visual charts from data, natural language interfaces for data visualization have emerged. Users only need to provide the visualization model with the data to be visualized and a description of their visualization needs, and the model will return a visual chart(NL2VIS). In real-world scenarios, most data is stored in relational databases. To visualize this data, it is first necessary to generate a structured query statements based on the user’s visualization requirements(NL2SQL), and then proceed with the subsequent visualization operations. This study breaks down the task of automatic visualization of tabular data in relational databases into three main steps: generating SQL, determining the chart type, and mapping data to visual channels. We utilize the Chain-of-Thought(CoT) technique of generative large language models to address the task of automatic visualization of tabular data. Finally, we evaluated our approach on the nvBench dataset, and the results show that CoT-based automatic visualization of tabular data performs well.
Recent advances in AI-driven storytelling have enhanced video generation and story visualization. However, translating dialogue-centric scripts into coherent storyboards remains a significant challenge due to limited script detail, inadequate physical context understanding, and the complexity of integrating cinematic principles. To address these challenges, we propose Dialogue Visualization, a novel task that transforms dialogue scripts into dynamic, multi-view storyboards. We introduce Dialogue Director, a training-free multimodal framework comprising three agents–Script Director, Cinematographer, and Storyboard Maker. This framework leverages large multimodal models and diffusion-based architectures, employing techniques such as Chain-of-Thought reasoning, Retrieval-Augmented Generation, and multi-view synthesis to improve script understanding, physical context comprehension, and cinematic knowledge integration. Experimental results demonstrate that Dialogue Director outperforms state-of-the-art methods in script interpretation, physical world understanding, and cinematic principle application, significantly advancing the quality and controllability of dialogue-based story visualization.
Recently, techniques such as explicit structured reasoning have demonstrated strong test-time scaling behavior by enforcing a separation between the model's internal"thinking"process and the final response. A key factor influencing answer quality in this setting is the length of the thinking stage. When the reasoning is too short, the model may fail to capture the complexity of the task. Conversely, when it is too long, the model may overthink, leading to unnecessary computation and degraded performance. This paper explores and exploits the underlying mechanisms by which LLMs understand and regulate the length of their reasoning during explicit thought processes. First, we show that LLMs encode their progress through the reasoning process and introduce an interactive progress bar visualization, which is then used to reveal insights on the model's planning dynamics. Second, we manipulate the internal progress encoding during inference to reduce unnecessary steps and generate a more concise and decisive chain of thoughts. Our empirical results demonstrate that this"overclocking"method mitigates overthinking, improves answer accuracy, and reduces inference latency. Our code is publicly available.
Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronics design, as large language models (LLMs) frequently hallucinate in granular details, violate electrical constraints, and produce non-machine-readable outputs. We present CircuitLM, a novel multi-agent LLM-aided circuit design pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics through five sequential stages: (i) LLM-based component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning by an electronics expert agent, (iv) JSON schematic synthesis, and (v) force-directed SVG visualization. Anchored by a curated, embedding-powered component knowledge base. While LLMs often violate electrical constraints, CircuitLM bridges this gap by grounding generation in a verified and dynamically extensible component database, initially comprising 50 components. To ensure safety, we incorporate a hybrid evaluation framework, namely Dual-Metric Circuit Validation (DMCV), validated against human-expert assessments, which achieves high fidelity in microcontroller-centric designs. We evaluate the system on 100 diverse embedded-systems prompts across six LLMs and introduce DMCV to assess both structural and electrical validity. This work bridges natural language input to deployable hardware designs, enabling reliable circuit prototyping by non-experts. Our code and data will be made public upon acceptance.
Large language models (LLMs) are rapidly increasing in capability, but they still struggle with highly specialized programming tasks such as scientific visualization. We present an LLM agent, ChatVis, that aids the LLM to generate Python code for ParaView scientific visualization tasks, without the need for retraining or fine-tuning the LLM. ChatVis employs chain-of-thought prompt simplification, retrieval-augmented prompt generation using a vector database of documentation and code examples, and error checking with iterative prompt feedback to correct errors until a visualization is produced. An integral part of our approach is a benchmark suite of canonical visualization tasks, ParaView regression tests, and scientific use cases that includes comprehensive evaluation metrics. We evaluate our visualization agent by comparing results with a variety of top-performing unassisted LLMs. We find that all the metrics are significantly improved with ChatVis.
Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level'steps'and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
Scalable Vector Graphics (SVG) is a code structure used to represent visual information, and with the powerful capabilities of large language models, it holds significant research potential. Current text-to-SVG generation methods lack generalization capabilities and struggle with accurately adhering to input generation instructions. In this paper, we propose a novel approach for generating SVG using large language models, named SVGThinker, which incorporates a reasoning process to align the generation of SVG code with the visualization process, while supporting all SVG primitives. Through sequential rendering of SVG primitives, we first use a multimodal model to annotate the SVG, followed by sequential updates corresponding to the incremental additions of primitives. We then employ a supervised training framework based on Chain-of-Thought reasoning, which enhances the model's robustness and reduces the risk of errors or hallucinations. Through comparisons with state-of-the-art baseline models, our experiments show that our model generates more stable, high-quality, and editable SVG code. In contrast to image-based methods, our approach preserves the structural advantages of SVG and supports precise, hierarchical editing. We believe our work opens new directions for SVG generation, with potential applications in design, content creation, and automated SVG-based graphic generation.
While the field of natural language to SQL(NL2SQL) has made significant advancements in translating natural language instructions into executable SQL scripts for data querying and processing, achieving full automation within the broader data science pipeline–encompassing data querying, analysis, visualization, and reporting–remains a complex challenge. This study introduces SageCopilot, an advanced, industry-grade system that automates the data science pipeline by integrating Large Language Models (LLMs), Autonomous Agents (AutoAgents), and Language User Interfaces (LUIs). Designed with a two-phase architecture, SageCopilot uses an offline phase to generate high-quality demonstrations supporting In-Context Learning (ICL), which powers the online phase to transform user inputs into executable scripts for database queries, analysis, and visualization tasks. Leveraging specialized components such as NL2SQL, Text2Analyze, and Text2Viz, as well as chain-of-thought prompting for multi-turn interactions, SageCopilot achieves superior end-to-end automation. Rigorous experimentation with real-world datasets demonstrates the system’s ability to minimize human intervention while ensuring correctness and user-friendly operation.
Towards Automated Data Sciences with Natural Language and SageCopilot: Practices and Lessons Learned
While the field of NL2SQL has made significant advancements in translating natural language instructions into executable SQL scripts for data querying and processing, achieving full automation within the broader data science pipeline - encompassing data querying, analysis, visualization, and reporting - remains a complex challenge. This study introduces SageCopilot, an advanced, industry-grade system system that automates the data science pipeline by integrating Large Language Models (LLMs), Autonomous Agents (AutoAgents), and Language User Interfaces (LUIs). Specifically, SageCopilot incorporates a two-phase design: an online component refining users' inputs into executable scripts through In-Context Learning (ICL) and running the scripts for results reporting&visualization, and an offline preparing demonstrations requested by ICL in the online phase. A list of trending strategies such as Chain-of-Thought and prompt-tuning have been used to augment SageCopilot for enhanced performance. Through rigorous testing and comparative analysis against prompt-based solutions, SageCopilot has been empirically validated to achieve superior end-to-end performance in generating or executing scripts and offering results with visualization, backed by real-world datasets. Our in-depth ablation studies highlight the individual contributions of various components and strategies used by SageCopilot to the end-to-end correctness for data sciences.
In recent years, the critical role of green buildings in addressing energy consumption and environmental issues has become widely acknowledged. Research indicates that over 40% of potential energy savings can be achieved during the early design stage. Therefore, decision-making in green building design (DGBD), which is based on modeling and performance simulation, is crucial for reducing building energy costs. However, the field of green building encompasses a broad range of specialized knowledge, which involves significant learning costs and results in low decision-making efficiency. Many studies have already applied artificial intelligence (AI) methods to this field. Based on previous research, this study innovatively integrates large language models with DGBD, creating GreenQA, a question answering framework for multimodal data reasoning. Utilizing Retrieval Augmented Generation, Chain of Thought, and Function Call methods, GreenQA enables multimodal question answering, including weather data analysis and visualization, retrieval of green building cases, and knowledge query. Additionally, this study conducted a user survey using the GreenQA web platform. The results showed that 96% of users believed the platform helped improve design efficiency. This study not only effectively supports DGBD but also provides inspiration for AI-assisted design.
This paper aims to address the difficulties faced by novice programmers in grasping code structure and execution flow, improving programming thinking, and pinpointing code errors with accuracy. It proposes providing students with program behavior diagrams based on large language models (LLMs) and visualization techniques to achieve personalized guidance. Specifically, these program behavior diagrams include programming thinking visualization diagrams and code vulnerability visualization diagrams. A programming thinking visualization diagram employs static code analysis to gather code structure information, combined with the structured chain-of-thought method to collectively optimize the LLM. This enables the LLM to explain each interpretable part of the code from top to bottom, detailing the programming concepts, and displaying them on a modularized code structure diagram. The code vulnerability visualization diagram primarily utilizes the fine-tuned LLM, optimizing it based on program analysis and clustering analysis methods to accurately identify vulnerabilities in student code and display them on a code flow diagram. Its feature is to visually display to students the error location, error information, and the impact of errors on program flow, rather than providing the programming answers. Lastly, through experiments and statistical analysis of actual teaching data, this paper serves a demonstration that the enhanced models used in the visualization diagram generation process have a noticeable effect on mainstream LLMs, and that visualization diagrams hold significant value for students at different stages of learning.
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption. We identify a core latent instability issue when scaling the computational budget of implicit CoT: as the number of reasoning tokens increases, training often becomes unstable and collapses. Our analysis shows that this instability arises from latent representations becoming homogeneous and losing semantic diversity, caused by insufficient step-level supervision in current implicit CoT methods. To address this, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information. The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead. It also provides interpretability by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization and diagnosis. SIM-CoT significantly improves both in-domain accuracy and out-of-domain stability of implicit CoT methods, boosting Coconut by +8.2\% on GPT-2 and CODI by +3.0\% on LLaMA-3.1 8B. It further surpasses the explicit CoT baseline on GPT-2 by 2.1\% with 2.3$\times$ greater token efficiency, while closing the performance gap on larger models like LLaMA-3.1 8B. Code: https://github.com/InternLM/SIM-CoT
The output quality of large language models (LLMs) can be improved via “reasoning”: generating segments of chain-of-thought (CoT) content to further condition the model prior to producing user-facing output. While these chains contain valuable information, they are verbose and lack explicit organization, making them tedious to review. Moreover, they lack opportunities for user feedback, such as removing unwanted considerations, adding desired ones, or clarifying unclear assumptions. We introduce Interactive Reasoning, an interaction design that visualizes chain-of-thought outputs as a hierarchy of topics and enables user review and modification. We implement interactive reasoning in Hippo, a prototype for AI-assisted decision making in the face of uncertain trade-offs. In a user study with 16 participants, we find that interactive reasoning in Hippo allows users to quickly identify and interrupt erroneous generations, efficiently steer the model towards customized responses, and better understand both model reasoning and model outputs. Our work contributes to a new paradigm that incorporates user oversight into LLM reasoning processes.
No abstract available
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through"drawing to think". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high-quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
Large language models (LLMs) are becoming increasingly popular in the field of psychological counseling. However, when human therapists work with LLMs in therapy sessions, it is hard to understand how the model gives the answers. To address this, we have constructed Psy-COT, a graph designed to visualize the thought processes of LLMs during therapy sessions. The Psy-COT graph presents semi-structured counseling conversations alongside step-by-step annotations that capture the reasoning and insights of therapists. Moreover, we have developed Psy-Copilot, which is a conversational AI assistant designed to assist human psychological therapists in their consultations. It can offer traceable psycho-information based on retrieval, including response candidates, similar dialogue sessions, related strategies, and visual traces of results. We have also built an interactive platform for AI-assisted counseling. It has an interface that displays the relevant parts of the retrieval sub-graph. The Psy-Copilot is designed not to replace psychotherapists but to foster collaboration between AI and human therapists, thereby promoting mental health development. Our code and demo are both open-sourced and available for use.
WebAssembly enables near-native execution in web applications and is increasingly adopted for tasks that demand high performance and robust security. However, its assembly-like syntax, implicit stack machine, and low-level data types make it extremely difficult for human developers to understand, spurring the need for effective WebAssembly reverse engineering techniques. In this paper, we propose StackSight, a novel neurosymbolic approach that combines Large Language Models (LLMs) with advanced program analysis to decompile complex WebAssembly code into readable C++ snippets. StackSight visualizes and tracks virtual stack alterations via a static analysis algorithm and then applies chain-of-thought prompting to harness LLM's complex reasoning capabilities. Evaluation results show that StackSight significantly improves WebAssembly decompilation. Our user study also demonstrates that code snippets generated by StackSight have significantly higher win rates and enable a better grasp of code semantics.
Current frontier large-language models rely on reasoning to achieve state-of-the-art performance. Many existing interpretability are limited in this area, as standard methods have been designed to study single forward passes of a model rather than the multi-token computational steps that unfold during reasoning. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We introduce a black-box method that measures each sentence's counterfactual importance by repeatedly sampling replacement sentences from the model, filtering for semantically different ones, and continuing the chain of thought from that point onwards to quantify the sentence's impact on the distribution of final answers. We discover that certain sentences can have an outsized impact on the trajectory of the reasoning trace and final answer. We term these sentences \textit{thought anchors}. These are generally planning or uncertainty management sentences, and specialized attention heads consistently attend from subsequent sentences to thought anchors. We further show that examining sentence-sentence causal links within a reasoning trace gives insight into a model's behavior. Such information can be used to predict a problem's difficulty and the extent different question domains involve sequential or diffuse reasoning. As a proof-of-concept, we demonstrate that our techniques together provide a practical toolkit for analyzing reasoning models by conducting a detailed case study of how the model solves a difficult math problem, finding that our techniques yield a consistent picture of the reasoning trace's structure. We provide an open-source tool (thought-anchors.com) for visualizing the outputs of our methods on further problems. The convergence across our methods shows the potential of sentence-level analysis for a deeper understanding of reasoning models.
The Natural Language to Visualization (NL2Vis) task aims to transform natural-language descriptions into visual representations for a grounded table, enabling users to gain insights from vast amounts of data. Recently, many deep learning-based approaches have been developed for NL2Vis. Despite the considerable efforts made by these approaches, challenges persist in visualizing data sourced from unseen databases or spanning multiple tables. Taking inspiration from the remarkable generation capabilities of Large Language Models (LLMs), this paper conducts an empirical study to evaluate their potential in generating visualizations, and explore the effectiveness of in-context learning prompts for enhancing this task. In particular, we first explore the ways of transforming structured tabular data into sequential text prompts, as to feed them into LLMs and analyze which table content contributes most to the NL2Vis. Our findings suggest that transforming structured tabular data into programs is effective, and it is essential to consider the table schema when formulating prompts. Furthermore, we evaluate two types of LLMs: finetuned models (e.g., T5-Small) and inference-only models (e.g., GPT-3.5), against state-of-the-art methods, using the NL2Vis benchmarks (i.e., nvBench). The experimental results reveal that LLMs outperform baselines, with inference-only models consistently exhibiting performance improvements, at times even surpassing fine-tuned models when provided with certain few-shot demonstrations through in-context learning. Finally, we analyze when the LLMs fail in NL2Vis, and propose to iteratively update the results using strategies such as chain-of-thought, role-playing, and code-interpreter. The experimental results confirm the efficacy of iterative updates and hold great potential for future study.
Large Language Models (LLMs) are increasingly integrated into software applications, giving rise to a broad class of prompt-enabled systems, in which prompts serve as the primary 'programming' interface for guiding system behavior. Building on this trend, a new software paradigm, promptware, has emerged, which treats natural language prompts as first-class software artifacts for interacting with LLMs. Unlike traditional software, which relies on formal programming languages and deterministic runtime environments, promptware is based on ambiguous, unstructured, and context-dependent natural language and operates on LLMs as runtime environments, which are probabilistic and non-deterministic. These fundamental differences introduce unique challenges in prompt development. In practice, prompt development remains largely ad hoc and relies heavily on time-consuming trial-and-error, a challenge we term the promptware crisis. To address this, we propose promptware engineering, a new methodology that adapts established Software Engineering (SE) principles to prompt development. Drawing on decades of success in traditional SE, we envision a systematic framework encompassing prompt requirements engineering, design, implementation, testing, debugging, evolution, deployment, and monitoring. Our framework re-contextualizes emerging prompt-related challenges within the SE lifecycle, providing principled guidance beyond ad-hoc practices. Without the SE discipline, prompt development is likely to remain mired in trial-and-error. This paper outlines a comprehensive roadmap for promptware engineering, identifying key research directions and offering actionable insights to advance the development of prompt-enabled systems.
The rapid emergence of generative AI models like Large Language Models (LLMs) has demonstrated its utility across various activities, including within Requirements Engineering (RE). Ensuring the quality and accuracy of LLM-generated output is critical, with prompt engineering serving as a key technique to guide model responses. However, existing literature provides limited guidance on how prompt engineering can be leveraged, specifically for RE activities. The objective of this study is to explore the applicability of existing prompt engineering guidelines for the effective usage of LLMs within RE. To achieve this goal, we began by conducting a systematic review of primary literature to compile a non-exhaustive list of prompt engineering guidelines. Then, we conducted interviews with RE experts to present the extracted guidelines and gain insights on the advantages and limitations of their application within RE. Our literature review indicates a shortage of prompt engineering guidelines for domain-specific activities, specifically for RE. Our proposed mapping contributes to addressing this shortage. We conclude our study by identifying an important future line of research within this field.
Large Language Models are transforming software engineering, yet prompt management in practice remains ad hoc, hindering reliability, reuse, and integration into industrial workflows. We present Prompt-with-Me, a practical solution for structured prompt management embedded directly in the development environment. The system automatically classifies prompts using a four-dimensional taxonomy encompassing intent, author role, software development lifecycle stage, and prompt type. To enhance prompt reuse and quality, Prompt-with-Me suggests language refinements, masks sensitive information, and extracts reusable templates from a developer's prompt library. Our taxonomy study of 1108 real-world prompts demonstrates that modern LLMs can accurately classify software engineering prompts. Furthermore, our user study with 11 participants shows strong developer acceptance, with high usability (Mean SUS=73), low cognitive load (Mean NASA-TLX=21), and reported gains in prompt quality and efficiency through reduced repetitive effort. Lastly, we offer actionable insights for building the next generation of prompt management and maintenance tools for software engineering workflows.
While generative artificial intelligence (Gen AI) increasingly transforms academic environments, a critical gap exists in understanding and mitigating human biases in AI interactions, such as anchoring and confirmation bias. This position paper advocates for metacognitive AI literacy interventions to help university students critically engage with AI and address biases across the Human-AI interaction workflows. The paper presents the importance of considering (1) metacognitive support with deliberate friction focusing on human bias; (2) bi-directional Human-AI interaction intervention addressing both input formulation and output interpretation; and (3) adaptive scaffolding that responds to diverse user engagement patterns. These frameworks are illustrated through ongoing work on "DeBiasMe," AIED (AI in Education) interventions designed to enhance awareness of cognitive biases while empowering user agency in AI interactions. The paper invites multiple stakeholders to engage in discussions on design and evaluation methods for scaffolding mechanisms, bias visualization, and analysis frameworks. This position contributes to the emerging field of AI-augmented learning by emphasizing the critical role of metacognition in helping students navigate the complex interaction between human, statistical, and systemic biases in AI use while highlighting how cognitive adaptation to AI systems must be explicitly integrated into comprehensive AI literacy frameworks.
Large Language Models, particularly decoder-only generative models such as GPT, are increasingly used to automate Software Engineering tasks. These models are primarily guided through natural language prompts, making prompt engineering a critical factor in system performance and behavior. Despite their growing role in SE research, prompt-related decisions are rarely documented in a systematic or transparent manner, hindering reproducibility and comparability across studies. To address this gap, we conducted a two-phase empirical study. First, we analyzed nearly 300 papers published at the top-3 SE conferences since 2022 to assess how prompt design, testing, and optimization are currently reported. Second, we surveyed 105 program committee members from these conferences to capture their expectations for prompt reporting in LLM-driven research. Based on the findings, we derived a structured guideline that distinguishes essential, desirable, and exceptional reporting elements. Our results reveal significant misalignment between current practices and reviewer expectations, particularly regarding version disclosure, prompt justification, and threats to validity. We present our guideline as a step toward improving transparency, reproducibility, and methodological rigor in LLM-based SE research.
We present Universal Conditional Logic (UCL), a mathematical framework for prompt optimization that transforms prompt engineering from heuristic practice into systematic optimization. Through systematic evaluation (N=305, 11 models, 4 iterations), we demonstrate significant token reduction (29.8%, t(10)=6.36, p < 0.001, Cohen's d = 2.01) with corresponding cost savings. UCL's structural overhead function O_s(A) explains version-specific performance differences through the Over-Specification Paradox: beyond threshold S* = 0.509, additional specification degrades performance quadratically. Core mechanisms -- indicator functions (I_i in {0,1}), structural overhead (O_s = gamma * sum(ln C_k)), early binding -- are validated. Notably, optimal UCL configuration varies by model architecture -- certain models (e.g., Llama 4 Scout) require version-specific adaptations (V4.1). This work establishes UCL as a calibratable framework for efficient LLM interaction, with model-family-specific optimization as a key research direction.
Generative text-to-image models have gained great popularity among the public for their powerful capability to generate high-quality images based on natural language prompts. However, developing effective prompts for desired images can be challenging due to the complexity and ambiguity of natural language. This research proposes PromptMagician, a visual analysis system that helps users explore the image results and refine the input prompts. The backbone of our system is a prompt recommendation model that takes user prompts as input, retrieves similar prompt-image pairs from DiffusionDB, and identifies special (important and relevant) prompt keywords. To facilitate interactive prompt refinement, PromptMagician introduces a multi-level visualization for the cross-modal embedding of the retrieved images and recommended keywords, and supports users in specifying multiple criteria for personalized exploration. Two usage scenarios, a user study, and expert interviews demonstrate the effectiveness and usability of our system, suggesting it facilitates prompt engineering and improves the creativity support of the generative text-to-image model.
Human cognition is constrained by processing limitations, leading to cognitive overload and inefficiencies in knowledge synthesis and decision-making. Large Language Models (LLMs) present an opportunity for cognitive augmentation, but their current reactive nature limits their real-world applicability. This position paper explores the potential of context-aware cognitive augmentation, where LLMs dynamically adapt to users' cognitive states and task environments to provide appropriate support. Through a think-aloud study in an exhibition setting, we examine how individuals interact with multi-modal information and identify key cognitive challenges in structuring, retrieving, and applying knowledge. Our findings highlight the need for AI-driven cognitive support systems that integrate real-time contextual awareness, personalized reasoning assistance, and socially adaptive interactions. We propose a framework for AI augmentation that seamlessly transitions between real-time cognitive support and post-experience knowledge organization, contributing to the design of more effective human-centered AI systems.
Artificial intelligence (AI), including large language models and generative AI, is emerging as a significant force in software development, offering developers powerful tools that span the entire development lifecycle. Although software engineering research has extensively studied AI tools in software development, the specific types of interactions between developers and these AI-powered tools have only recently begun to receive attention. Understanding and improving these interactions has the potential to enhance productivity, trust, and efficiency in AI-driven workflows. In this paper, we propose a taxonomy of interaction types between developers and AI tools, identifying eleven distinct interaction types, such as auto-complete code suggestions, command-driven actions, and conversational assistance. Building on this taxonomy, we outline a research agenda focused on optimizing AI interactions, improving developer control, and addressing trust and usability challenges in AI-assisted development. By establishing a structured foundation for studying developer-AI interactions, this paper aims to stimulate research on creating more effective, adaptive AI tools for software development.
By treating data and models as the source code, Foundation Models (FMs) become a new type of software. Mirroring the concept of software crisis, the increasing complexity of FMs making FM crisis a tangible concern in the coming decade, appealing for new theories and methodologies from the field of software engineering. In this paper, we outline our vision of introducing Foundation Model (FM) engineering, a strategic response to the anticipated FM crisis with principled engineering methodologies. FM engineering aims to mitigate potential issues in FM development and application through the introduction of declarative, automated, and unified programming interfaces for both data and model management, reducing the complexities involved in working with FMs by providing a more structured and intuitive process for developers. Through the establishment of FM engineering, we aim to provide a robust, automated, and extensible framework that addresses the imminent challenges, and discovering new research opportunities for the software engineering field.
Responsible prompt engineering has emerged as a critical framework for ensuring that generative artificial intelligence (AI) systems serve society's needs while minimizing potential harms. As generative AI applications become increasingly powerful and ubiquitous, the way we instruct and interact with them through prompts has profound implications for fairness, accountability, and transparency. This article examines how strategic prompt engineering can embed ethical and legal considerations and societal values directly into AI interactions, moving beyond mere technical optimization for functionality. This article proposes a comprehensive framework for responsible prompt engineering that encompasses five interconnected components: prompt design, system selection, system configuration, performance evaluation, and prompt management. Drawing from empirical evidence, the paper demonstrates how each component can be leveraged to promote improved societal outcomes while mitigating potential risks. The analysis reveals that effective prompt engineering requires a delicate balance between technical precision and ethical consciousness, combining the systematic rigor and focus on functionality with the nuanced understanding of social impact. Through examination of real-world and emerging practices, the article illustrates how responsible prompt engineering serves as a crucial bridge between AI development and deployment, enabling organizations to fine-tune AI outputs without modifying underlying model architectures. This approach aligns with broader "Responsibility by Design" principles, embedding ethical considerations directly into the implementation process rather than treating them as post-hoc additions. The article concludes by identifying key research directions and practical guidelines for advancing the field of responsible prompt engineering.
Text-to-image generation has advanced rapidly with large-scale multimodal training, yet fine-grained controllability remains a critical challenge. Counterfactual controllability, defined as the capacity to deliberately generate images that contradict common-sense patterns, remains a major challenge but plays a crucial role in enabling creativity and exploratory applications. In this work, we address this gap with a focus on counterfactual size (e.g., generating a tiny walrus beside a giant button) and propose an automatic prompt engineering framework that adapts base prompts into revised prompts for counterfactual images. The framework comprises three components: an image evaluator that guides dataset construction by identifying successful image generations, a supervised prompt rewriter that produces revised prompts, and a DPO-trained ranker that selects the optimal revised prompt. We construct the first counterfactual size text-image dataset and enhance the image evaluator by extending Grounded SAM with refinements, achieving a 114 percent improvement over its backbone. Experiments demonstrate that our method outperforms state-of-the-art baselines and ChatGPT-4o, establishing a foundation for future research on counterfactual controllability.
Existing multi-agent learning approaches have developed interactive training environments to explicitly promote collaboration among multiple Large Language Models (LLMs), thereby constructing stronger multi-agent systems (MAS). However, during inference, they require re-executing the MAS to obtain final solutions, which diverges from human cognition that individuals can enhance their reasoning capabilities through interactions with others and resolve questions independently in the future. To investigate whether multi-agent interaction can enhance LLMs' independent problem-solving ability, we introduce ILR, a novel co-learning framework for MAS that integrates two key components: Dynamic Interaction and Perception Calibration. Specifically, Dynamic Interaction first adaptively selects either cooperative or competitive strategies depending on question difficulty and model ability. LLMs then exchange information through Idea3 (Idea Sharing, Idea Analysis, and Idea Fusion), an innovative interaction paradigm designed to mimic human discussion, before deriving their respective final answers. In Perception Calibration, ILR employs Group Relative Policy Optimization (GRPO) to train LLMs while integrating one LLM's reward distribution characteristics into another's reward function, thereby enhancing the cohesion of multi-agent interactions. We validate ILR on three LLMs across two model families of varying scales, evaluating performance on five mathematical benchmarks and one coding benchmark. Experimental results show that ILR consistently outperforms single-agent learning, yielding an improvement of up to 5% over the strongest baseline. We further discover that Idea3 can enhance the robustness of stronger LLMs during multi-agent inference, and dynamic interaction types can boost multi-agent learning compared to pure cooperative or competitive strategies.
We introduce Meta Prompting (MP), a framework that elevates the reasoning capabilities of large language models (LLMs) by focusing on the formal structure of a task rather than content-specific examples. We establish a theoretical foundation for this paradigm, formalizing MP as a functor that maps a category of tasks to a category of structured prompts, thereby guaranteeing that compositional problem-solving strategies can be systematically decomposed into modular prompt structures. We extend this concept to Recursive Meta Prompting (RMP), an automated process where an LLM can generate and refine its own prompts. We model this self-improvement loop formally as a monad, providing a principled framework for automated prompt engineering. Our claims are validated through extensive experiments demonstrating that a Qwen-72B base model, guided by a single, example-agnostic meta-prompt, achieves state-of-the-art results on MATH, GSM8K, and Game of 24. These results are achieved with substantial token efficiency gains over traditional few-shot methods. Project Page: https://github.com/meta-prompting/meta-prompting.
LLM-as-a-Judge has been widely applied to evaluate and compare different LLM alignmnet approaches (e.g., RLHF and DPO). However, concerns regarding its reliability have emerged, due to LLM judges' biases and inconsistent decision-making. Previous research has developed evaluation frameworks to assess reliability of LLM judges and their alignment with human preferences. However, the employed evaluation metrics often lack adequate explainability and fail to address LLM internal inconsistency. Additionally, existing studies inadequately explore the impact of various prompt templates when applying LLM-as-a-Judge methods, leading to potentially inconsistent comparisons between different alignment algorithms. In this work, we systematically evaluate LLM-as-a-Judge on alignment tasks by defining more theoretically interpretable evaluation metrics and explicitly mitigating LLM internal inconsistency from reliability metrics. We develop an open-source framework to evaluate, compare, and visualize the reliability and alignment of LLM judges, which facilitates practitioners to choose LLM judges for alignment tasks. In the experiments, we examine effects of diverse prompt templates on LLM-judge reliability and also demonstrate our developed framework by comparing various LLM judges on two common alignment datasets (i.e., TL;DR Summarization and HH-RLHF-Helpfulness). Our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre alignment level between the tested LLM judges and human evaluators.
Generative AI tools can provide people with the ability to create virtual environments and scenes with natural language prompts. Yet, how people will formulate such prompts is unclear -- particularly when they inhabit the environment that they are designing. For instance, it is likely that a person might say, "Put a chair here", while pointing at a location. If such linguistic features are common to people's prompts, we need to tune models to accommodate them. In this work, we present a wizard-of-oz elicitation study with 22 participants, where we studied people's implicit expectations when verbally prompting such programming agents to create interactive VR scenes. Our findings show that people prompt with several implicit expectations: (1) that agents have an embodied knowledge of the environment; (2) that agents understand embodied prompts by users; (3) that the agents can recall previous states of the scene and the conversation, and that (4) agents have a commonsense understanding of objects in the scene. Further, we found that participants prompt differently when they are prompting in situ (i.e. within the VR environment) versus ex situ (i.e. viewing the VR environment from the outside). To explore how our could be applied, we designed and built Oastaad, a conversational programming agent that allows non-programmers to design interactive VR experiences that they inhabit. Based on these explorations, we outline new opportunities and challenges for conversational programming agents that create VR environments.
Compared with Large Language Models (LLMs), Large Vision-Language Models (LVLMs) can also accept images as input, thus showcasing more interesting emergent capabilities and demonstrating impressive performance on various vision-language tasks. Motivated by text prompting in LLMs, visual prompting has been explored to enhance LVLMs' capabilities of perceiving visual information. However, previous visual prompting techniques solely process visual inputs without considering text queries, limiting the models' ability to follow text instructions to complete tasks. To fill this gap, in this work, we propose a new prompting technique named Attention Prompting on Image, which just simply overlays a text-query-guided attention heatmap on the original input image and effectively enhances LVLM on various tasks. Specifically, we generate an attention heatmap for the input image dependent on the text query with an auxiliary model like CLIP. Then the heatmap simply multiplies the pixel values of the original image to obtain the actual input image for the LVLM. Extensive experiments on various vison-language benchmarks verify the effectiveness of our technique. For example, Attention Prompting on Image improves LLaVA-1.5 by 3.8% and 2.9% on MM-Vet and LLaVA-Wild benchmarks, respectively.
Modern scientific discovery increasingly relies on workflows that process data across the Edge, Cloud, and High Performance Computing (HPC) continuum. Comprehensive and in-depth analyses of these data are critical for hypothesis validation, anomaly detection, reproducibility, and impactful findings. Although workflow provenance techniques support such analyses, at large scale, the provenance data become complex and difficult to analyze. Existing systems depend on custom scripts, structured queries, or static dashboards, limiting data interaction. In this work, we introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful LLM agent responses beyond recorded provenance.
This study investigates whether diagnostic prompting can improve Multimodal Large Language Model (MLLM) reliability for visual complexity assessment of Amazon Search Results Pages (SRP). We compare diagnostic prompting with standard gestalt principles-based prompting using 200 Amazon SRP pages and human expert annotations. Diagnostic prompting showed notable improvements in predicting human complexity judgments, with F1-score increasing from 0.031 to 0.297 (+858\% relative improvement), though absolute performance remains modest (Cohen's $κ$ = 0.071). The decision tree revealed that models prioritize visual design elements (badge clutter: 38.6\% importance) while humans emphasize content similarity, suggesting partial alignment in reasoning patterns. Failure case analysis reveals persistent challenges in MLLM visual perception, particularly for product similarity and color intensity assessment. Our findings indicate that diagnostic prompting represents a promising initial step toward human-aligned MLLM-based evaluation, though failure cases with consistent human-MLLM disagreement require continued research and refinement in prompting approaches with larger ground truth datasets for reliable practical deployment.
Query Reformulation (QR) is a set of techniques used to transform a user's original search query to a text that better aligns with the user's intent and improves their search experience. Recently, zero-shot QR has been a promising approach due to its ability to exploit knowledge inherent in large language models. Inspired by the success of ensemble prompting strategies which have benefited other tasks, we investigate if they can improve query reformulation. In this context, we propose two ensemble-based prompting techniques, GenQREnsemble and GenQRFusion which leverage paraphrases of a zero-shot instruction to generate multiple sets of keywords to improve retrieval performance ultimately. We further introduce their post-retrieval variants to incorporate relevance feedback from a variety of sources, including an oracle simulating a human user and a "critic" LLM. We demonstrate that an ensemble of query reformulations can improve retrieval effectiveness by up to 18% on nDCG@10 in pre-retrieval settings and 9% on post-retrieval settings on multiple benchmarks, outperforming all previously reported SOTA results. We perform subsequent analyses to investigate the effects of feedback documents, incorporate domain-specific instructions, filter reformulations, and generate fluent reformulations that might be more beneficial to human searchers. Together, the techniques and the results presented in this paper establish a new state of the art in automated query reformulation for retrieval and suggest promising directions for future research.
Our interest is in the design of software systems involving a human-expert interacting -- using natural language -- with a large language model (LLM) on data analysis tasks. For complex problems, it is possible that LLMs can harness human expertise and creativity to find solutions that were otherwise elusive. On one level, this interaction takes place through multiple turns of prompts from the human and responses from the LLM. Here we investigate a more structured approach based on an abstract protocol described in [3] for interaction between agents. The protocol is motivated by a notion of "two-way intelligibility" and is modelled by a pair of communicating finite-state machines. We provide an implementation of the protocol, and provide empirical evidence of using the implementation to mediate interactions between an LLM and a human-agent in two areas of scientific interest (radiology and drug design). We conduct controlled experiments with a human proxy (a database), and uncontrolled experiments with human subjects. The results provide evidence in support of the protocol's capability of capturing one- and two-way intelligibility in human-LLM interaction; and for the utility of two-way intelligibility in the design of human-machine systems. Our code is available at https://github.com/karannb/interact.
Although Large Language Models (LLMs) have demonstrated extraordinary capabilities in many domains, they still have a tendency to hallucinate and generate fictitious responses to user requests. This problem can be alleviated by augmenting LLMs with information retrieval (IR) systems (also known as retrieval-augmented LLMs). Applying this strategy, LLMs can generate more factual texts in response to user input according to the relevant content retrieved by IR systems from external corpora as references. In addition, by incorporating external knowledge, retrieval-augmented LLMs can answer in-domain questions that cannot be answered by solely relying on the world knowledge stored in parameters. To support research in this area and facilitate the development of retrieval-augmented LLM systems, we develop RETA-LLM, a {RET}reival-{A}ugmented LLM toolkit. In RETA-LLM, we create a complete pipeline to help researchers and users build their customized in-domain LLM-based systems. Compared with previous retrieval-augmented LLM systems, RETA-LLM provides more plug-and-play modules to support better interaction between IR systems and LLMs, including {request rewriting, document retrieval, passage extraction, answer generation, and fact checking} modules. Our toolkit is publicly available at https://github.com/RUC-GSAI/YuLan-IR/tree/main/RETA-LLM.
Ambiguity in natural language instructions poses significant risks in safety-critical human-robot interaction, particularly in domains such as surgery. To address this, we propose a framework that uses Large Language Models (LLMs) for ambiguity detection specifically designed for collaborative surgical scenarios. Our method employs an ensemble of LLM evaluators, each configured with distinct prompting techniques to identify linguistic, contextual, procedural, and critical ambiguities. A chain-of-thought evaluator is included to systematically analyze instruction structure for potential issues. Individual evaluator assessments are synthesized through conformal prediction, which yields non-conformity scores based on comparison to a labeled calibration dataset. Evaluating Llama 3.2 11B and Gemma 3 12B, we observed classification accuracy exceeding 60% in differentiating ambiguous from unambiguous surgical instructions. Our approach improves the safety and reliability of human-robot collaboration in surgery by offering a mechanism to identify potentially ambiguous instructions before robot action.
Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks. However, the understanding of their capability to process structured data like tables remains an under-explored area. While tables can be serialized as input for LLMs, there is a lack of comprehensive studies on whether LLMs genuinely comprehend this data. In this paper, we try to understand this by designing a benchmark to evaluate the structural understanding capabilities of LLMs through seven distinct tasks, e.g., cell lookup, row retrieval and size detection. Specially, we perform a series of evaluations on the recent most advanced LLM models, GPT-3.5 and GPT-4 and observe that performance varied with different input choices, including table input format, content order, role prompting, and partition marks. Drawing from the insights gained through the benchmark evaluations, we propose $\textit{self-augmentation}$ for effective structural prompting, such as critical value / range identification using internal knowledge of LLMs. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, e.g., TabFact($\uparrow2.31\%$), HybridQA($\uparrow2.13\%$), SQA($\uparrow2.72\%$), Feverous($\uparrow0.84\%$), and ToTTo($\uparrow5.68\%$). We believe that our open source benchmark and proposed prompting methods can serve as a simple yet generic selection for future research. The code and data of this paper will be temporality released at https://anonymous.4open.science/r/StructuredLLM-76F3/README.md and will be replaced with an official one at https://github.com/microsoft/TableProvider later.
Large Language Models (LLMs) have transformed human-computer interaction by enabling natural language-based communication with AI-powered chatbots. These models are designed to be intuitive and user-friendly, allowing users to articulate requests with minimal effort. However, despite their accessibility, studies reveal that users often struggle with effective prompting, resulting in inefficient responses. Existing research has highlighted both the limitations of LLMs in interpreting vague or poorly structured prompts and the difficulties users face in crafting precise queries. This study investigates learner-AI interactions through an educational experiment in which participants receive structured guidance on effective prompting. We introduce and compare three types of prompting guidelines: a task-specific framework developed through a structured methodology and two baseline approaches. To assess user behavior and prompting efficacy, we analyze a dataset of 642 interactions from 107 users. Using Von NeuMidas, an extended pragmatic annotation schema for LLM interaction analysis, we categorize common prompting errors and identify recurring behavioral patterns. We then evaluate the impact of different guidelines by examining changes in user behavior, adherence to prompting strategies, and the overall quality of AI-generated responses. Our findings provide a deeper understanding of how users engage with LLMs and the role of structured prompting guidance in enhancing AI-assisted communication. By comparing different instructional frameworks, we offer insights into more effective approaches for improving user competency in AI interactions, with implications for AI literacy, chatbot usability, and the design of more responsive AI systems.
We introduce FathomGPT, an open source system for the interactive investigation of ocean science data via a natural language interface. FathomGPT was developed in close collaboration with marine scientists to enable researchers to explore and analyze the FathomNet image database. FathomGPT provides a custom information retrieval pipeline that leverages OpenAI's large language models to enable: the creation of complex queries to retrieve images, taxonomic information, and scientific measurements; mapping common names and morphological features to scientific names; generating interactive charts on demand; and searching by image or specified patterns within an image. In designing FathomGPT, particular emphasis was placed on enhancing the user's experience by facilitating free-form exploration and optimizing response times. We present an architectural overview and implementation details of FathomGPT, along with a series of ablation studies that demonstrate the effectiveness of our approach to name resolution, fine tuning, and prompt modification. We also present usage scenarios of interactive data exploration sessions and document feedback from ocean scientists and machine learning experts.
Despite advances in the multilingual capabilities of Large Language Models (LLMs), their performance varies substantially across different languages and tasks. In multilingual retrieval-augmented generation (RAG)-based systems, knowledge bases (KB) are often shared from high-resource languages (such as English) to low-resource ones, resulting in retrieved information from the KB being in a different language than the rest of the context. In such scenarios, two common practices are pre-translation to create a mono-lingual prompt and cross-lingual prompting for direct inference. However, the impact of these choices remains unclear. In this paper, we systematically evaluate the impact of different prompt translation strategies for classification tasks with RAG-enhanced LLMs in multilingual systems. Experimental results show that an optimized prompting strategy can significantly improve knowledge sharing across languages, therefore improve the performance on the downstream classification task. The findings advocate for a broader utilization of multilingual resource sharing and cross-lingual prompt optimization for non-English languages, especially the low-resource ones.
A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, are still less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves +29.3 F1% / +25.9% EM gains on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.
Prompting has shown impressive success in enabling large pretrained language models (LMs) to perform diverse NLP tasks, especially when only few downstream data are available. Automatically finding the optimal prompt for each task, however, is challenging. Most existing work resorts to tuning soft prompt (e.g., embeddings) which falls short of interpretability, reusability across LMs, and applicability when gradients are not accessible. Discrete prompt, on the other hand, is difficult to optimize, and is often created by "enumeration (e.g., paraphrasing)-then-selection" heuristics that do not explore the prompt space systematically. This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL). RLPrompt formulates a parameter-efficient policy network that generates the desired discrete prompt after training with reward. To overcome the complexity and stochasticity of reward signals by the large LM environment, we incorporate effective reward stabilization that substantially enhances the training efficiency. RLPrompt is flexibly applicable to different types of LMs, such as masked (e.g., BERT) and left-to-right models (e.g., GPTs), for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods. Interestingly, the resulting optimized prompts are often ungrammatical gibberish text; and surprisingly, those gibberish prompts are transferrable between different LMs to retain significant performance, indicating LM prompting may not follow human language patterns.
Prompt compression is an innovative method for efficiently condensing input prompts while preserving essential information. To facilitate quick-start services, user-friendly interfaces, and compatibility with common datasets and metrics, we present the Prompt Compression Toolkit (PCToolkit). This toolkit is a unified plug-and-play solution for compressing prompts in Large Language Models (LLMs), featuring cutting-edge prompt compressors, diverse datasets, and metrics for comprehensive performance evaluation. PCToolkit boasts a modular design, allowing for easy integration of new datasets and metrics through portable and user-friendly interfaces. In this paper, we outline the key components and functionalities of PCToolkit. We conducted evaluations of the compressors within PCToolkit across various natural language tasks, including reconstruction, summarization, mathematical problem-solving, question answering, few-shot learning, synthetic tasks, code completion, boolean expressions, multiple choice questions, and lies recognition.
We define "visual story-writing" as using visual representations of story elements to support writing and revising narrative texts. To demonstrate this approach, we developed a text editor that automatically visualizes a graph of entity interactions, movement between locations, and a timeline of story events. Interacting with these visualizations results in suggested text edits: for example, connecting two characters in the graph creates an interaction between them, moving an entity updates their described location, and rearranging events on the timeline reorganizes the narrative sequence. Through two user studies on narrative text editing and writing, we found that visuals supported participants in planning high-level revisions, tracking story elements, and exploring story variations in ways that encourage creativity. Broadly, our work lays the foundation for writing support, not just through words, but also visuals.
Visual explanation (attention)-guided learning uses not only labels but also explanations to guide model reasoning process. While visual attention-guided learning has shown promising results, it requires a large number of explanation annotations that are time-consuming to prepare. However, in many real-world situations, it is usually desired to prompt the model with visual attention without model retraining. For example, when doing AI-assisted cancer classification on a medical image, users (e.g., clinicians) can provide the AI model with visual attention prompt on which areas are indispensable and which are precluded. Despite its promising objectives, achieving visual attention-prompted prediction presents several major challenges: 1) How can the visual prompt be effectively integrated into the model's reasoning process? 2) How should the model handle samples that lack visual prompts? 3) What is the impact on the model's performance when a visual prompt is imperfect? This paper introduces a novel framework for attention-prompted prediction and learning, utilizing visual prompts to steer the model's reasoning process. To improve performance in non-prompted situations and align it with prompted scenarios, we propose a co-training approach for both non-prompted and prompted models, ensuring they share similar parameters and activations. Additionally, for instances where the visual prompt does not encompass the entire input image, we have developed innovative attention prompt refinement methods. These methods interpolate the incomplete prompts while maintaining alignment with the model's explanations. Extensive experiments on four datasets demonstrate the effectiveness of our proposed framework in enhancing predictions for samples both with and without prompt.
Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions covering categories, attributes, and knowledge. Thanks to the structural and textual formats, visual tables offer unique advantages over mere visual embeddings, such as interpretability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning. To create visual tables, we develop a generator trained on the dataset with collected, small-scale annotations. Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consistently enhance state-of-the-art multimodal large language models across diverse benchmarks, showcasing their potential for advancing visual reasoning tasks. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT proposes a prompt generation network with the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge for tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once the visual prompt is refined, it can better highlight potential target locations, thereby reducing irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps through the guidance of the visual prompt, thus effectively reducing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pretrained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method can suppress distracting objects and enhance the tracker.
Counterfactuals -- expressing what might have been true under different circumstances -- have been widely applied in statistics and machine learning to help understand causal relationships. More recently, counterfactuals have begun to emerge as a technique being applied within visualization research. However, it remains unclear to what extent counterfactuals can aid with visual data communication. In this paper, we primarily focus on assessing the quality of users' understanding of data when provided with counterfactual visualizations. We propose a preliminary model of causality comprehension by connecting theories from causal inference and visual data communication. Leveraging this model, we conducted an empirical study to explore how counterfactuals can improve users' understanding of data in static visualizations. Our results indicate that visualizing counterfactuals had a positive impact on participants' interpretations of causal relations within datasets. These results motivate a discussion of how to more effectively incorporate counterfactuals into data visualizations.
We develop NL2INTERFACE to explore the potential of generating usable interactive multi-visualization interfaces from natural language queries. With NL2INTERFACE, users can directly write natural language queries to automatically generate a fully interactive multi-visualization interface without any extra effort of learning a tool or programming language. Further, users can interact with the interfaces to easily transform the data and quickly see the results in the visualizations.
Large Language Models have demonstrated remarkable abilities across various tasks, with Chain-of-Thought (CoT) prompting emerging as a key technique to enhance reasoning capabilities. However, existing research primarily focuses on improving performance, lacking a comprehensive framework to explain and understand the fundamental factors behind CoT's success. To bridge this gap, we introduce a novel perspective grounded in the Hopfieldian view of cognition in cognitive neuroscience. We establish a connection between CoT reasoning and key cognitive elements such as stimuli, actions, neural populations, and representation spaces. From our view, we can understand the reasoning process as the movement between these representation spaces. Building on this insight, we develop a method for localizing reasoning errors in the response of CoTs. Moreover, we propose the Representation-of-Thought (RoT) framework, which leverages the robustness of low-dimensional representation spaces to enhance the robustness of the reasoning process in CoTs. Experimental results demonstrate that RoT improves the robustness and interpretability of CoT reasoning while offering fine-grained control over the reasoning process.
Chain-of-Thought (CoT) reasoning improves multi-step mathematical problem solving in large language models but remains vulnerable to exposure bias and error accumulation, as early mistakes propagate irreversibly through autoregressive decoding. In this work, we propose DiffCoT, a diffusion-styled CoT framework that reformulates CoT reasoning as an iterative denoising process. DiffCoT integrates diffusion principles at the reasoning-step level via a sliding-window mechanism, enabling unified generation and retrospective correction of intermediate steps while preserving token-level autoregression. To maintain causal consistency, we further introduce a causal diffusion noise schedule that respects the temporal structure of reasoning chains. Extensive experiments on three multi-step CoT reasoning benchmarks across diverse model backbones demonstrate that DiffCoT consistently outperforms existing CoT preference optimization methods, yielding improved robustness and error-correction capability in CoT reasoning.
This paper evaluates the visualization literacy of modern Large Language Models (LLMs) and introduces a novel prompting technique called Charts-of-Thought. We tested three state-of-the-art LLMs (Claude-3.7-sonnet, GPT-4.5 preview, and Gemini-2.0-pro) on the Visualization Literacy Assessment Test (VLAT) using standard prompts and our structured approach. The Charts-of-Thought method guides LLMs through a systematic data extraction, verification, and analysis process before answering visualization questions. Our results show Claude-3.7-sonnet achieved a score of 50.17 using this method, far exceeding the human baseline of 28.82. This approach improved performance across all models, with score increases of 21.8% for GPT-4.5, 9.4% for Gemini-2.0, and 13.5% for Claude-3.7 compared to standard prompting. The performance gains were consistent across original and modified VLAT charts, with Claude correctly answering 100% of questions for several chart types that previously challenged LLMs. Our study reveals that modern multimodal LLMs can surpass human performance on visualization literacy tasks when given the proper analytical framework. These findings establish a new benchmark for LLM visualization literacy and demonstrate the importance of structured prompting strategies for complex visual interpretation tasks. Beyond improving LLM visualization literacy, Charts-of-Thought could also enhance the accessibility of visualizations, potentially benefiting individuals with visual impairments or lower visualization literacy.
As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite on the multi-step reasoning capabilities of large language models. We are interested in this setting for two reasons: (1) from the behavior of GPT and PaLM model family, we observe that complex reasoning is likely to be a key differentiator between weaker and stronger LLMs; (2) we envisage large language models to become the next-generation computational platform and foster an ecosystem of LLM-based new applications, this naturally requires the foundation models to perform complex tasks that often involve the composition of linguistic and logical operations. Our approach is to compile a suite of challenging reasoning benchmarks to track the progress of LLMs. Our current results show that: (1) model scale clearly correlates with reasoning capabilities; (2) As of May 2023, Claude-v1.3 and PaLM-2 are the only two models that are comparable with GPT-4, while open-sourced models still lag behind; (3) LLaMA-65B performs closely to code-davinci-002, indicating that with successful further development such as reinforcement learning from human feedback (RLHF), it has great potential to be close to GPT-3.5-Turbo. Our results also suggest that for the open-source efforts to catch up, the community may focus more on building better base models and exploring RLHF.
Misleading visualizations pose a significant challenge to accurate data interpretation. While recent research has explored the use of Large Language Models (LLMs) for detecting such misinformation, practical tools that also support explanation and correction remain limited. We present MisVisFix, an interactive dashboard that leverages both Claude and GPT models to support the full workflow of detecting, explaining, and correcting misleading visualizations. MisVisFix correctly identifies 96% of visualization issues and addresses all 74 known visualization misinformation types, classifying them as major, minor, or potential concerns. It provides detailed explanations, actionable suggestions, and automatically generates corrected charts. An interactive chat interface allows users to ask about specific chart elements or request modifications. The dashboard adapts to newly emerging misinformation strategies through targeted user interactions. User studies with visualization experts and developers of fact-checking tools show that MisVisFix accurately identifies issues and offers useful suggestions for improvement. By transforming LLM-based detection into an accessible, interactive platform, MisVisFix advances visualization literacy and supports more trustworthy data communication.
Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing performance and interpretability. Recent MCoT methods fall into two categories: (i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved image-text outputs. Despite advances in both approaches, the mechanisms driving these improvements are not fully understood. To fill this gap, we first reveal that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format, depending only on clarity and conciseness of expression. Furthermore, to explore visual thoughts systematically, we define four distinct forms of visual thought expressions and analyze them comprehensively. Our findings demonstrate that these forms differ in clarity and conciseness, yielding varying levels of MCoT improvement. Additionally, we explore the internal nature of visual thoughts, finding that visual thoughts serve as intermediaries between the input image and reasoning to deeper transformer layers, enabling more advanced visual information transmission. We hope that the visual thoughts can inspire further breakthroughs for future MCoT research.
Recent advances in large language models elicit reasoning in a chain-of-thought that allows models to decompose problems in a human-like fashion. Though this paradigm improves multi-step reasoning ability in language models, it is limited by being unimodal and applied mainly to question-answering tasks. We claim that incorporating visual augmentation into reasoning is essential, especially for complex, imaginative tasks. Consequently, we introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to recursively bridge the logical gaps within sequential data. Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks that can benefit from temporal reasoning, as well as provide interpretability into models' multi-step reasoning. We apply VCoT to the Visual Storytelling and WikiHow summarization datasets and demonstrate through human evaluation that VCoT offers novel and consistent synthetic data augmentation beating chain-of-thought baselines, which can be used to enhance downstream performance.
Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that SV-CoT supervision significantly improves interpretability, grounding fidelity, and robustness. Beyond benchmarking, we study its synergy with retrieval-augmented generation, revealing how domain knowledge and visual grounding interact during autoregressive reasoning. Finally, we propose a new mechanism that strengthens the alignment between visual evidence and reasoning, improving both reliability and efficiency. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical VLMs.
Chain-of-Thought (CoT) prompting has proven remarkably effective for eliciting complex reasoning in large language models (LLMs). Yet, its potential in multimodal large language models (MLLMs) remains largely untapped, hindered by the absence of large-scale datasets that capture the rich, spatially grounded reasoning intrinsic to visual understanding. Existing visual-CoT resources are typically small, domain-specific, or lack the human-like stepwise structure necessary for compositional visual reasoning. In this paper, we introduce VisReason, a large-scale dataset designed to advance visual Chain-of-Thought reasoning. VisReason comprises 489K annotated examples spanning four diverse domains, each featuring multi-round, human-like rationales that guide MLLMs through interpretable visual reasoning steps. Building upon this, we curate VisReason-Pro, a 165K subset produced with a stronger expert-level GPT annotator, enriched with detailed reasoning traces and 3D spatial grounding via depth-informed annotations. Fine-tuning the state-of-the-art Qwen2.5-VL model on VisReason and VisReason-Pro yields substantial improvements in step-by-step visual reasoning accuracy, interpretability, and cross-benchmark generalization. These results demonstrate that VisReason equips MLLMs with more systematic and generalizable reasoning capabilities. We envision VisReason as a cornerstone for cultivating human-like visual reasoning, paving the way toward the next generation of multimodal intelligence.
While chain-of-thought (CoT) prompting improves reasoning in large language models, its effectiveness in vision-language models (VLMs) remains limited due to over-reliance on textual cues and memorized knowledge. To investigate the visual reasoning capabilities of VLMs in complex real-world scenarios, we introduce DrivingVQA, a visual question answering dataset derived from driving theory exams, which contains 3,931 multiple-choice problems with expert-written explanations and grounded entities relevant to the reasoning process. Leveraging this dataset, we propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables VLMs to reason using visual crops corresponding to these relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting. Furthermore, we demonstrate that our method effectively scales to the larger A-OKVQA reasoning dataset by leveraging automatically generated pseudo-labels, outperforming CoT prompting.
Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions via either textified coordinates-causing modality mismatch and semantic fragmentation or fixed-granularity patches that both limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy, while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available in https://github.com/kesenzhao/NV-CoT.
Multi-modal reasoning requires the seamless integration of visual and linguistic cues, yet existing Chain-of-Thought methods suffer from two critical limitations in cross-modal scenarios: (1) over-reliance on single coarse-grained image regions, and (2) semantic fragmentation between successive reasoning steps. To address these issues, we propose the CoCoT (Collaborative Coross-modal Thought) framework, built upon two key innovations: a) Dynamic Multi-Region Grounding to adaptively detect the most relevant image regions based on the question, and b) Relation-Aware Reasoning to enable multi-region collaboration by iteratively aligning visual cues to form a coherent and logical chain of thought. Through this approach, we construct the CoCoT-70K dataset, comprising 74,691 high-quality samples with multi-region annotations and structured reasoning chains. Extensive experiments demonstrate that CoCoT significantly enhances complex visual reasoning, achieving an average accuracy improvement of 15.4% on LLaVA-1.5 and 4.0% on Qwen2-VL across six challenging benchmarks. The data and code are available at: https://github.com/deer-echo/CoCoT.
Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but it still remains challenging for extending it to multimodal domains. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shapes within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for effective visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT
Automated chart summarization is crucial for enhancing data accessibility and enabling efficient information extraction from visual data. While recent advances in visual-language models (VLMs) have demonstrated promise, existing methods often suffer from limitations in matching the generated summary to the chart data and in reasoning about complex chart patterns. This paper introduces End-to-End Visual Chain-of-Thought (V-CoT) for chart summarization, a novel approach optimized for Large Vision-Language Models (LVLMs). Our method directly trains an LVLM to process chart images and generate textual summaries in an end-to-end fashion, eliminating the need for explicit chart parsing modules. We incorporate a visual Chain-of-Thought mechanism through instruction fine-tuning, implicitly guiding the LVLM to perform visual reasoning steps during summary generation. Evaluated on the large-scale Chart-Sum-QA dataset, our V-CoT method significantly outperforms state-of-the-art baselines across a range of automatic metrics, including BLEU, BLEURT, CIDEr, and CS, and demonstrates superior matching degree and reasoning correctness in human evaluations. Ablation studies and detailed analyses further validate the effectiveness and robustness of our proposed approach, establishing a new benchmark for end-to-end chart summarization.
Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visual-driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypotheses-driven region proposals. Multiple Med-LLMs evaluators rank and assign scores to each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based margin-aware optimization strategy that incorporates both preference ranking and score difference to refine region-level reasoning trajectories. To maintain alignment as the model's policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data. Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
本报告综合了从底层提示词工程方法论到高层可视化交互系统的全方位研究。研究脉络呈现出明显的演进趋势:首先,提示词工程正从“炼金术”走向体系化的软件工程管理;其次,推理机制通过思维链(CoT)及其可视化手段实现了从黑盒到透明的跨越;第三,多模态技术的介入使得“视觉思维”成为增强模型逻辑能力的关键;第四,交互式可视化界面与自然语言的融合,极大地降低了垂直领域(如医疗、工业、创意设计)应用AI的门槛;最后,研究视角回归“以人为本”,深入探讨了交互安全性、认知负荷及人机协作的伦理边界。整体而言,该领域正致力于构建一个逻辑透明、交互直观且行业深耕的智能交互生态。