AI-Driven On-the-Fly Interface Reconstruction
UI Perception Modeling and Multimodal Foundational Understanding
This group of papers focuses on building foundation models that can understand, represent, and recognize user interfaces. Using multimodal large language models (MLLMs), vision-language models (VLMs), and self-supervised learning, they achieve precise detection, semantic grouping, and navigation planning over UI elements, providing the structured input that on-the-fly reconstruction depends on (a minimal illustrative sketch follows the reference list below).
- UI-UG: A Unified MLLM for UI Understanding and Generation(Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, Hai Rao, 2025, ArXiv Preprint)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs(Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan, 2024, ArXiv Preprint)
- Lexi: Self-Supervised Learning of the UI Language(Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva, 2023, ArXiv Preprint)
- UI Semantic Group Detection: Grouping UI Elements with Similar Semantics in Mobile Graphical User Interface(Shuhong Xiao, Yunnong Chen, Yaxuan Song, Liuqing Chen, Lingyun Sun, Yankun Zhen, Yanfang Chang, 2024, ArXiv Preprint)
- ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations(Yue Jiang, Eldon Schoop, Amanda Swearngin, Jeffrey Nichols, 2023, ArXiv Preprint)
- UI-Venus Technical Report: Building High-performance UI Agents with RFT(Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo, Liang Chen, Weiqiang Wang, 2025, ArXiv Preprint)
- MUD: Towards a Large-Scale and Noise-Filtered UI Dataset for Modern Style UI Modeling(Sidong Feng, Suyu Ma, Han Wang, David Kong, Chunyang Chen, 2024, ArXiv Preprint)
- Integrating Optical Characteristic Recognition with Conversational AI: A Multimodal Chatbot Featuring Speech and Poster Generation(C. Sai, Raguraman Purushothaman, Chinthaparthy Reddy Dhanush Reddy, Peddi Reddy Gangothri, V. Dhanush, Chintamanipeta Bhavana, 2025, 2025 Third International Conference on Augmented Intelligence and Sustainable Systems (ICAISS))
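To make the structured output these perception models target more concrete, the sketch below shows one plausible way to turn a VLM's screen analysis into typed UI elements with semantic group IDs. The `query_vlm` helper, the JSON schema, and the element roles are illustrative assumptions, not the interface of any cited model.

```python
# Minimal sketch: turning a VLM's screen analysis into structured UI elements
# that downstream re-layout steps can consume. query_vlm() is a placeholder.
from dataclasses import dataclass
import json

@dataclass
class UIElement:
    role: str        # e.g. "button", "text_field", "icon" (assumed vocabulary)
    label: str       # visible text or accessibility label
    bbox: tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)
    group_id: int    # semantic group assigned by the model

def query_vlm(screenshot_png: bytes, prompt: str) -> str:
    """Placeholder for a call to a multimodal model (MLLM/VLM)."""
    raise NotImplementedError

def detect_ui_elements(screenshot_png: bytes) -> list[UIElement]:
    prompt = (
        "List every interactive element on this screen as JSON objects with "
        "fields: role, label, bbox (normalized), and group_id shared by "
        "elements that belong to the same semantic group."
    )
    raw = query_vlm(screenshot_png, prompt)
    return [UIElement(**item) for item in json.loads(raw)]
```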
Generative Layout Optimization and Dynamic Synthesis Techniques
These studies explore automatic interface generation and real-time adjustment using VAEs, GANs, diffusion models, and reinforcement learning. The emphasis is on interface malleability, automatic code synthesis (e.g., React components), and on-the-fly interface reconstruction driven by user demonstrations or instructions (see the sketch after the reference list below).
- Dynamic User Interface Generation for Enhanced Human-Computer Interaction Using Variational Autoencoders(Runsheng Zhang, Shixiao Wang, Tianfang Xie, Shiyu Duan, Mengmeng Chen, 2024, ArXiv Preprint)
- Adaptive User Interface Generation Through Reinforcement Learning: A Data-Driven Approach to Personalization and Optimization(Qi Sun, Yayun Xue, Zhijun Song, 2024, ArXiv Preprint)
- UI Layout Generation with LLMs Guided by UI Grammar(Yuwen Lu, Ziang Tong, Qinyi Zhao, Chengzhi Zhang, Toby Jia-Jun Li, 2023, ArXiv Preprint)
- Generative User Interface for the Mobile Apps: Image Synthesis with VAEs, GANs, and Stable Diffusion(Konstantin V. Kostinich, Dmitry Vidmanov, 2025, 2025 7th International Youth Conference on Radio Electronics, Electrical and Power Engineering (REEPE))
- Innovative Application of Generative Adversarial Networks in User Interface Design(Kaixi Huang, Tao Wang, 2025, Proceedings of the 2025 International Conference on Artificial Intelligence, Virtual Reality and Interaction Design)
- ReDemon UI: Reactive Synthesis by Demonstration for Web UI(Jay Lee, Gyuhyeok Oh, Joongwon Ahn, Xiaokang Qiu, 2025, ArXiv Preprint)
- Generative and Malleable User Interfaces with Generative and Evolving Task-Driven Data Model(Yining Cao, Peiling Jiang, Haijun Xia, 2025, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems)
- Gradual Generation of User Interfaces as a Design Method for Malleable Software(Bryan Min, Peiling Jiang, Zhicheng Huang, Haijun Xia, 2026, ArXiv Preprint)
- A Reference Architecture Based on Reflection for Self-Adaptive Software: A Second Release(F. J. Affonso, Gabriel Nagassaki Campos, Guilherme Guiguer Menaldo, 2024, IEEE Access)
- Automatic Generation of Conversational Interfaces for Tabular Data Analysis(Marcos Gomez-Vazquez, Jordi Cabot, Robert Clarisó, 2023, Proceedings of the 6th ACM Conference on Conversational User Interfaces)
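A recurring mechanism in this group is constraining a generative model with an explicit layout structure, in the spirit of "UI Layout Generation with LLMs Guided by UI Grammar": an LLM proposes a layout tree, and a small production-rule grammar rejects ill-formed trees before any code (e.g., a React component) is synthesized from them. The grammar contents, node names, and `propose_layout` helper below are illustrative assumptions, not any cited system's implementation.

```python
# Toy UI grammar: allowed children per container type.
GRAMMAR: dict[str, set[str]] = {
    "screen":  {"navbar", "list", "form", "footer"},
    "navbar":  {"button", "title"},
    "list":    {"card"},
    "card":    {"image", "text", "button"},
    "form":    {"text_field", "button"},
    "footer":  {"text"},
}
LEAVES = {"button", "title", "image", "text", "text_field"}

def is_valid(node: dict) -> bool:
    """Check a layout tree {'type': str, 'children': [...]} against the grammar."""
    ntype, children = node["type"], node.get("children", [])
    if ntype in LEAVES:
        return not children
    allowed = GRAMMAR.get(ntype, set())
    return all(c["type"] in allowed and is_valid(c) for c in children)

def propose_layout(instruction: str) -> dict:
    """Placeholder for an LLM call that returns a layout tree as a dict."""
    raise NotImplementedError

def generate_layout(instruction: str, retries: int = 3) -> dict:
    for _ in range(retries):
        tree = propose_layout(instruction)
        if is_valid(tree):
            return tree
    raise ValueError("no grammar-conformant layout produced")
```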
Multimodal Conversational Interaction and Augmentation Paradigms
This group studies how traditional static graphical user interfaces (GUIs) can be turned into conversational user interfaces (CUIs) driven by natural language. By integrating LLMs, retrieval-augmented generation (RAG), and multimodal input (speech, images, eye gaze), these works improve the naturalness of interaction and the efficiency of task handling (a minimal RAG sketch follows the reference list below).
- A Large Language Model Enhanced Conversational Recommender System(Yue Feng, Shuchang Liu, Zhenghai Xue, Qingpeng Cai, Lantao Hu, Peng Jiang, Kun Gai, Fei Sun, 2023, ArXiv)
- FAVOR-GPT: a generative natural language interface to whole genome variant functional annotations(T. Li, Hufeng Zhou, Vineet Verma, Xiangru Tang, Yanjun Shao, Eric Van Buren, Zhiping Weng, Mark Gerstein, B. Neale, S. Sunyaev, Xihong Lin, 2024, Bioinformatics Advances)
- Athena: A Conversational Book Discovery Interface Combining LLM-Powered Retrieval-Augmented Generation and Interactive Graph Visualization(Matt Murtagh White, Yunkai Xu, Nicole León, Frank E. Ritter, 2025, Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology)
- A Conceptual Framework for Conversational Search and Recommendation: Conceptualizing Agent-Human Interactions During the Conversational Search Process(Leif Azzopardi, Mateusz Dubiel, Martin Halvey, Jeffery Dalton, 2024, ArXiv Preprint)
- MuDoC: An Interactive Multimodal Document-grounded Conversational AI System(Karan Taneja, Ashok K. Goel, 2025, ArXiv)
- TalkPhoto: A Versatile Training-Free Conversational Assistant for Intelligent Image Editing(Yujie Hu, Zecheng Tang, Xu Jiang, Weiqi Li, Jian Zhang, 2026, ArXiv Preprint)
- Feedstack: Layering Structured Representations over Unstructured Feedback to Scaffold Human AI Conversation(Hannah Vy Nguyen, Yu-Chun Grace Yen, Omar Shakir, Hang Huynh, Sebastian Gutierrez, June A. Smith, Sheila Jimenez, Salma Abdelgelil, Stephen MacNeil, 2025, ArXiv Preprint)
- Integrating Conversational AI, Image Generation, and Code Generation: A Unified Platform(Joylin Priya Pinto, M. Aqib, Osama Shakeel, Emad Habibi, G. S, 2025, Proceedings of the 3rd International Conference on Futuristic Technology)
- A Developed Graphical User Interface-Based on Different Generative Pre-trained Transformers Models(Ekrem Küçük, İpek Balıkçı Çiçek, Zeynep Küçükakçalı, Cihan Yetiş, Cemil Çolak, 2024, ODÜ Tıp Dergisi)
- SLM: Bridge the Thin Gap Between Speech and Text Foundation Models(Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, H. Soltau, P. Rubenstein, Lukás Zilka, Dian Yu, Zhong Meng, G. Pundak, Nikhil Siddhartha, J. Schalkwyk, Yonghui Wu, 2023, 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU))
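Several systems in this group (e.g., FAVOR-GPT, Athena) follow the retrieval-augmented generation pattern; the sketch below shows that loop in its simplest form. The `embed` and `call_llm` placeholders and the in-memory corpus are assumptions, not any cited system's implementation.

```python
# Minimal RAG loop: retrieve the most relevant documents for a user utterance
# and condition the LLM response on them.
import math

def embed(text: str) -> list[float]:
    """Placeholder for a text-embedding model."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer(query: str, corpus: list[tuple[str, list[float]]], k: int = 3) -> str:
    """corpus holds (document_text, document_embedding) pairs."""
    q = embed(query)
    top = sorted(corpus, key=lambda doc: cosine(q, doc[1]), reverse=True)[:k]
    context = "\n\n".join(text for text, _ in top)
    prompt = (f"Answer the user's question using only the context below.\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return call_llm(prompt)
```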
Agent-Driven Automation and Human-AI Collaborative Development
This group examines the role of AI agents in interface operation and software engineering: agents that autonomously execute UI tasks, AI acting as an interface reviewer or judge, and the ways AI changes how developers interact with interfaces during collaborative development (an illustrative agent-loop sketch follows the reference list below).
- Computer-Use Agents as Judges for Generative User Interface(Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou, 2025, ArXiv)
- LLM-Powered AI Agent Systems and Their Applications in Industry(Guannan Liang, Qianqian Tong, 2025, 2025 IEEE World AI IoT Congress (AIIoT))
- Morae: Proactively Pausing UI Agents for User Choices(Yi-Hao Peng, Dingzeyu Li, Jeffrey P. Bigham, Amy Pavel, 2025, ArXiv Preprint)
- How Developers Interact with AI: A Taxonomy of Human-AI Collaboration in Software Engineering(Christoph Treude, M. Gerosa, 2025, 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge))
- Generative AI and Empirical Software Engineering: A Paradigm Shift(Christoph Treude, M. Storey, 2025, 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware))
- Memolet: Reifying the Reuse of User-AI Conversational Memories(Ryan Yen, Jian Zhao, 2024, Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology)
- Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API(Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu, 2023, ArXiv Preprint)
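The UI-operating agents in this group share a perceive-ground-act loop: capture the screen, ground the instruction into one concrete action, execute it, and repeat until the model reports completion. The sketch below illustrates that loop under stated assumptions; the action schema and the `take_screenshot`, `ground_action`, and `execute` placeholders are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done" (assumed action set)
    target: str = ""   # element label or identifier
    text: str = ""     # text to type, if any

def take_screenshot() -> bytes:
    """Placeholder: capture the current screen."""
    raise NotImplementedError

def ground_action(screenshot: bytes, instruction: str, history: list[Action]) -> Action:
    """Placeholder for an instruction-grounding model (cf. Reinforced UI Instruction Grounding)."""
    raise NotImplementedError

def execute(action: Action) -> None:
    """Placeholder: dispatch the action to the device or browser."""
    raise NotImplementedError

def run_task(instruction: str, max_steps: int = 20) -> list[Action]:
    history: list[Action] = []
    for _ in range(max_steps):
        action = ground_action(take_screenshot(), instruction, history)
        if action.kind == "done":
            break
        execute(action)
        history.append(action)
    return history
```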
Mixed Reality (MR) and Spatially Situated Adaptive Reconstruction
Targeting 3D, AR/VR, and mixed reality environments specifically, these works study how to optimize UI layout in three-dimensional space on the fly and reduce cognitive load, taking into account the physical surroundings, social cues, and user preferences (a toy placement-optimization sketch follows the reference list below).
- Preference-Guided Multi-Objective UI Adaptation(Yao Song, Christoph Gebhardt, Yi-Chi Liao, Christian Holz, 2025, ArXiv Preprint)
- SituationAdapt: Contextual UI Optimization in Mixed Reality with Situation Awareness via LLM Reasoning(Zhipeng Li, Christoph Gebhardt, Yves Inglin, Nicolas Steck, Paul Streli, Christian Holz, 2024, Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology)
- Mixed Reality UI Adaptations with Inaccurate and Incomplete Objectives(Christoph Albert Johns, João Marcelo Evangelista Belo, 2023, ArXiv Preprint)
- Cognitive-Unburdening Surveillance: Real-Time 3D Reconstruction for Distributed Spatial Awareness(Dong Yoon Kim, Rocky Kim, Jinwoo Park, Jihoon Park, Beomgeun Seo, 2025, Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology)
- Bridging Industrial Expertise and XR with LLM-Powered Conversational Agents(Despina Tomkou, George Fatouros, Andreas Andreou, Georgios Makridis, F. Liarokapis, Dimitrios Dardanis, Athanasios Kiourtis, John Soldatos, D. Kyriazis, 2025, 2025 21st International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT))
- A Wearable Real-Time 2D/3D Eye-Gaze Interface to Realize Robot Assistance for Quadriplegics(Yongpeng Cao, Shouren Huang, S. Sørensen, Y. Yamakawa, Masatoshi Ishikawa, 2025, IEEE Access)
- AR-Classroom: Integrating Conversational Artificial Intelligence with Augmented Reality Technology for Learning Spatial Transformations and Their Matrix Representation(Uttamasha Monjoree, Samantha D. Aguilar, Chengyuan Qian, Carl Van Huyck, Shu-Hao Yeh, Preston Tranbarger, Luke Duane-Tessier, Leo Solitare-Renaldo, Heather Burte, Philip Yasskin, Jeffrey Liew, Dezhen Song, Francis K. H. Quek, Wei Yan, 2024, 2024 IEEE Frontiers in Education Conference (FIE))
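Many of the MR adaptation papers above frame placement as minimizing a cost that trades off visibility against environmental and social constraints. The sketch below illustrates that idea with a toy two-term objective and a brute-force grid search; the weights, keep-out radius, and candidate grid are illustrative assumptions, not any cited system's objective.

```python
import itertools
import math

def placement_cost(pos, gaze, blocked, w_gaze=1.0, w_block=4.0):
    """pos and gaze are (x, y, z); blocked is a list of keep-out centers."""
    gaze_term = math.dist(pos, gaze)                      # prefer staying near the view
    block_term = sum(max(0.0, 0.5 - math.dist(pos, b))    # penalize entering 0.5 m keep-out zones
                     for b in blocked)
    return w_gaze * gaze_term + w_block * block_term

def best_placement(gaze, blocked, step=0.25):
    """Exhaustive search over a coarse 3D grid in front of the user."""
    xs = [i * step for i in range(-4, 5)]
    ys = [i * step for i in range(-2, 3)]
    zs = [0.5 + i * step for i in range(0, 7)]
    return min(itertools.product(xs, ys, zs),
               key=lambda p: placement_cost(p, gaze, blocked))

# Example: keep a panel near the line of sight but away from a bystander.
panel = best_placement(gaze=(0.0, 0.0, 1.0), blocked=[(0.3, 0.0, 1.0)])
```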
User Cognition Theory, Trust, and Ethical Safety
From an HCI perspective, these works analyze how AI-driven interfaces affect users' mental models and cognitive flow states, and examine dark patterns in generative UI, bias mitigation, trust building, and blueprints for responsible AI design.
- User Preferences on a Generative AI User Interface Through a Choice Experiment(Jesun Yeon, Youchan Jung, Yongki Baek, Daeho Lee, Jungwoo Shin, W. Chung, 2024, International Journal of Human–Computer Interaction)
- Understanding Mental Models of Generative Conversational Search and The Effect of Interface Transparency(Chadha Degachi, Samuel Kernan Freire, E. Niforatos, Gerd Kortuem, 2025, ArXiv)
- Navigating the State of Cognitive Flow: Context-Aware AI Interventions for Effective Reasoning Support(Dinithi Dissanayake, Suranga Nanayakkara, 2025, ArXiv Preprint)
- Distributed Cognition for AI-supported Remote Operations: Challenges and Research Directions(Rune Møberg Jacobsen, Joel Wester, Helena Bøjer Djernæs, Niels van Berkel, 2025, ArXiv Preprint)
- Emergent Dark Patterns in AI-Generated User Interfaces(Daksh Pandey, 2026, ArXiv Preprint)
- DeBiasMe: De-biasing Human-AI Interactions with Metacognitive AIED (AI in Education) Interventions(Chaeyeon Lim, 2025, ArXiv Preprint)
- Towards responsible AI: an implementable blueprint for integrating explainability and social-cognitive frameworks in AI systems(Rittika Shamsuddin, H. B. Tabrizi, Pavan R. Gottimukkula, 2025, AI Perspectives & Advances)
- Promoting Real-Time Reflection in Synchronous Communication with Generative AI(Yi Wen, Meng Xia, 2025, ArXiv Preprint)
- UI Remix: Supporting UI Design Through Interactive Example Retrieval and Remixing(Junling Wang, Hongyi Lan, Xiaotian Su, Mustafa Doga Dogan, April Yi Wang, 2026, ArXiv Preprint)
- Comparing Interface Structures of Generative AI Tools : Focusing on Designer’s Creative Experience and Interaction Flow(Hyeontaek Hwang, Boram Lee, 2025, Institute of Art and Design Research)
Personalized Applications in Vertical Domains
These papers demonstrate AI-driven interface reconstruction in specific domains such as healthcare, finance, industry, autonomous driving, and education. The emphasis is on dynamically adjusting the interface according to domain knowledge, user emotion, and real-time data to support complex decision making (an illustrative adaptation sketch follows the reference list below).
- Visual-Conversational Interface for Evidence-Based Explanation of Diabetes Risk Prediction(Reza Samimi, Aditya Bhattacharya, Lucija Gosak, Gregor Stiglic, Katrien Verbert, 2025, ArXiv Preprint)
- Intellifinance - An AI Powered Assistant for Bank Statement Parsing and Conversational Financial Inquiry(G. Pugalendhi, D. R, S. K, Srinaath S S, 2025, 2025 IEEE First International Conference on Innovations in Engineering and Next-Generation Technologies for Sustainability (ICINVENTS))
- Face2Feel: Emotion-Aware Adaptive User Interface(Ismail Alihan Hadimlioglu, Siddharth Linga, 2025, ArXiv Preprint)
- A Spatially-Grounded Conversational Planner for Personalized Urban Itineraries(Chiara Pugliese, Maddalena Amendola, Raffale Perego, Chiara Renso, 2025, Proceedings of the 33rd ACM International Conference on Advances in Geographic Information Systems)
- Real Time Inventory Management System powered by Generative User Interface(Omkar Patil, 2024, International Journal of Scientific Research in Engineering and Management)
- Supporting Data-Frame Dynamics in AI-assisted Decision Making(Chengbo Zheng, Tim Miller, Alina Bialkowski, H Peter Soyer, Monika Janda, 2025, ArXiv Preprint)
- Exploring utilization of generative AI for research and education in data-driven materials science(Takahiro Misawa, Ai Koizumi, Ryo Tamura, Kazuyoshi Yoshimi, 2025, ArXiv Preprint)
- Demonstration of a Continuously Updated, Radio-Compatible Digital Twin for Robotic Integrated Sensing and Communications(Vlad-Costin Andrei, Praneeth Susarla, Aladin Djuhera, Niklas Vaara, Janne Mustaniemi, Constantino Álvarez Casado, Xinyan Li, U. Mönich, Holger Boche, Miguel Bordallo López, 2025, 2025 IEEE 5th International Symposium on Joint Communications & Sensing (JC&S))
- Generative AI Interface Design Considerations for Private Equity(Shirley Anderson, Yuanfei Zhao, 2025, Companion Proceedings of the 30th International Conference on Intelligent User Interfaces)
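A recurring pattern in this group (e.g., Face2Feel) is mapping a sensed user state plus task context onto concrete interface adjustments. The sketch below illustrates one such rule-based mapping; the emotion labels, `classify_emotion` placeholder, and specific adjustments are illustrative assumptions, not the design of any cited system.

```python
def classify_emotion(face_frame: bytes) -> str:
    """Placeholder for an emotion-recognition model; returns e.g. 'stressed', 'neutral', 'engaged'."""
    raise NotImplementedError

def adapt_interface(face_frame: bytes, task_complexity: float) -> dict:
    emotion = classify_emotion(face_frame)
    settings = {"density": "normal", "guidance": "off", "notifications": "on"}
    if emotion == "stressed" or task_complexity > 0.8:
        # Reduce visual density and surface step-by-step guidance for hard decisions.
        settings.update(density="minimal", guidance="step_by_step", notifications="muted")
    elif emotion == "engaged":
        # Expose denser data views to focused users.
        settings.update(density="dense")
    return settings
```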
The final grouping forms a complete research map spanning foundational techniques to high-level ethical concerns. The surveyed work covers UI perception modeling centered on MLLMs, on-the-fly layout synthesis with generative algorithms, and the evolution toward conversational, multimodal interaction paradigms. It also examines the role of agents in automated reconstruction, as well as adaptive optimization in complex spatial environments such as mixed reality. Finally, by integrating cognitive theory, ethical safety, and vertical-domain practice, the survey argues that AI-driven interface reconstruction is moving toward a context-aware, human-centered, domain-augmented intelligent ecosystem.
A total of 84 related references.
Abstracts for a subset of these references follow, reproduced from the source papers.
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUIs remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: can CUAs, acting as judges, assist Coders in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
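The Coder-CUA collaboration described in this abstract can be pictured as a simple generate-evaluate-revise loop. The sketch below is a rough rendering of that loop, not the AUI-Gym implementation; the function names, feedback format, and success threshold are assumptions.

```python
def coder_generate(spec: str, feedback: str = "") -> str:
    """Placeholder: a code LLM returns website source for the given spec."""
    raise NotImplementedError

def cua_attempt(site_source: str, task: str) -> tuple[bool, str]:
    """Placeholder: a computer-use agent tries the task, returns (success, trace summary)."""
    raise NotImplementedError

def design_loop(spec: str, tasks: list[str], rounds: int = 3, target: float = 0.9) -> str:
    site, feedback = coder_generate(spec), ""
    for _ in range(rounds):
        results = [cua_attempt(site, t) for t in tasks]
        success_rate = sum(ok for ok, _ in results) / len(results)
        if success_rate >= target:
            break
        # Compress failed navigation traces into guidance for the next revision.
        feedback = "\n".join(trace for ok, trace in results if not ok)
        site = coder_generate(spec, feedback)
    return site
```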
Generative User Interface for the Mobile Apps: Image Synthesis with VAEs, GANs, and Stable Diffusion
Introduction of the neural network models in creating intelligent mobile user interfaces poses a new challenge for researchers and the application developers. Each year, using the generative models in automatic image generation becomes more and more important. Developing interfaces based on the element images and the entire screen is considered an innovation solution in design. The paper presents results of testing the Variational Autoencoder (VAE), Generative Adversarial Network (GAN) and Stable Diffusion models aimed at studying these approaches potential in creating the user interfaces. To compare the approaches, it analyzes the obtained images as the user interfaces based on the usability metrics. The analysis makes it possible to formulate recommendations for selecting suitable models for various applications and highlights the areas for further research.
This research explores the developing a real time inventory management system powered by a generative user interface. We are leveraging large language models like GPT4, Claude 3, and Google Gemini that support tool calling or function calling, and integrating it with the modern frontend frameworks like Next js that support streaming React Server Component (RSC), the proposed system enables interaction with the inventory through natural language prompts. We are using PostgreSQL as a choice of database and server actions are used to interact with the database in real time. The system composes and renders appropriate react components based on user prompt, providing a personalized user experience. The research discusses the system's architecture, implementation, and potential impact on inventory management systems. It showcases the potential of Large Language Models (LLMs) and conversational interfaces in enhancing enterprise software user experiences. Key Words: Inventory Management System, Generative User Interface, Generative AI, Large Language Models, Conversational Interface, Natural Language Processing
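The core mechanism in this abstract is LLM tool calling that decides which UI component to render; the sketch below illustrates that pattern. The tool schema, the `call_llm_with_tools` and `query_inventory` placeholders, and the `StockCard` component name are illustrative assumptions, not the system's actual code.

```python
INVENTORY_TOOL = {
    "name": "query_inventory",
    "description": "Look up current stock levels for a product.",
    "parameters": {
        "type": "object",
        "properties": {"product_name": {"type": "string"}},
        "required": ["product_name"],
    },
}

def call_llm_with_tools(prompt: str, tools: list[dict]) -> dict:
    """Placeholder: returns either {'text': ...} or {'tool': name, 'args': {...}}."""
    raise NotImplementedError

def query_inventory(product_name: str) -> dict:
    """Placeholder for a database lookup against the stock table."""
    raise NotImplementedError

def handle_prompt(user_prompt: str) -> dict:
    """Return a UI description: which component to render and with what props."""
    reply = call_llm_with_tools(user_prompt, [INVENTORY_TOOL])
    if reply.get("tool") == "query_inventory":
        stock = query_inventory(**reply["args"])
        return {"component": "StockCard", "props": stock}   # rendered client-side
    return {"component": "ChatMessage", "props": {"text": reply["text"]}}
```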
To enhance the efficiency and innovation of user interface design, this study explores the application of Generative Adversarial Networks through adversarial training between the generator and discriminator models. The research covers areas such as layout generation, color schemes, and interactive element design. Results indicate that UI-GAN can rapidly generate diverse design schemes, with layout module counts ranging from 5 to 8 on different devices, color scheme generation times between 1.8 and 2.3 seconds, and interactive element response times of 100 to 180 milliseconds. This significantly improves design efficiency and optimizes user experience.
Since the launch of ChatGPT in November 2022, the number of services providing generative AI has been steadily increasing. As different services enter the market, the generative AI interfaces users experience become more diverse. However, none of the services has yet established itself as the dominant tool, and which interface component affects user preferences the most has yet to be identified. This study investigates user preferences based on the interface of generative AI services currently on the market. We investigated users’ preferred interface components by setting the output data type, generative style, output variation, reference style, and generation history provided by the current generative AI service as properties. We collected data from 500 users through a survey and conducted conjoint analysis. Users preferred the provision of 10 generation histories the most, and the second most preferred the provision of reference style in footnote format. In addition, it was found that there was no preference for the creative generative style, which can be interpreted as users being aware of the problem of hallucination in generative AI. The results of this study will help future generative AI services design interfaces that consider user experience.
Generative UI is transforming interface design by facilitating AI-driven collaborative workflows between designers and computational systems. This study establishes a working definition of Generative UI through a multi-method qualitative approach, integrating insights from a systematic literature review of 127 publications, expert interviews with 18 participants, and analyses of 12 case studies. Our findings identify five core themes that position Generative UI as an iterative and co-creative process. We highlight emerging design models, including hybrid creation, curation-based workflows, and AI-assisted refinement strategies. Additionally, we examine ethical challenges, evaluation criteria, and interaction models that shape the field. By proposing a conceptual foundation, this study advances both theoretical discourse and practical implementation, guiding future HCI research toward responsible and effective generative UI design practices.
Unlike static and rigid user interfaces, generative and malleable user interfaces offer the potential to respond to diverse users’ goals and tasks. However, current approaches primarily rely on generating code, making it difficult for end-users to iteratively tailor the generated interface to their evolving needs. We propose employing task-driven data models—representing the essential information entities, relationships, and data within information tasks—as the foundation for UI generation. We leverage AI to interpret users’ prompts and generate the data models that describe users’ intended tasks, and by mapping the data models with UI specifications, we can create generative user interfaces. End-users can easily modify and extend the interfaces via natural language and direct manipulation, with these interactions translated into changes in the underlying model. The technical evaluation of our approach and user evaluation of the developed system demonstrate the feasibility and effectiveness of the proposed generative and malleable UIs.
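The key idea in this abstract is that a task-driven data model, rather than code, is the artifact the AI generates and the user edits; UI specifications are then derived from it. The sketch below shows one plausible entity-to-widget mapping; the entity structure, widget vocabulary, and example task are illustrative assumptions, not the cited system's design.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    attributes: dict[str, str]                            # attribute -> type ("text", "date", "number", "enum")
    relations: list[str] = field(default_factory=list)    # names of related entities

def widget_for(attr_type: str) -> str:
    return {"text": "TextField", "date": "DatePicker",
            "number": "NumberInput", "enum": "Dropdown"}.get(attr_type, "TextField")

def ui_spec(model: list[Entity]) -> list[dict]:
    """One form section per entity; one widget per attribute; a link per relation."""
    return [{
        "section": e.name,
        "widgets": [{"label": a, "type": widget_for(t)} for a, t in e.attributes.items()],
        "links": e.relations,
    } for e in model]

# Example: a travel-planning task model produces a two-section interface.
trip = [Entity("Trip", {"destination": "text", "start": "date", "budget": "number"}, ["Activity"]),
        Entity("Activity", {"name": "text", "category": "enum"})]
spec = ui_spec(trip)
```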
We present a case study of using generative user interfaces, or "vibe coding," a method leveraging large language models (LLMs) for generating code via natural language prompts, to support rapid prototyping in user-centered design (UCD). Extending traditional UCD practices, we propose an AI-in-the-loop ideate-prototyping process. We share insights from an empirical experience integrating this process to develop an interactive data analytics interface for highway traffic engineers to effectively retrieve and analyze historical traffic data. With generative UIs, the team was able to elicit rich user feedback and test multiple alternative design ideas from user evaluation interviews and real-time collaborative sessions with domain experts. We discuss the advantages and pitfalls of vibe coding for bridging the gaps between design expertise and domain-specific expertise.
Commonly used methods in User-Centered Design (UCD) can face challenges in incorporating user feedback during early design stages, often resulting in extended iteration cycles. To address this, we explore the following question: “How can generative artificial intelligence (AI) be utilized to enable prototyping within user studies to facilitate immediate user feedback integration and validation?” We introduce a conceptual framework for live-prototyping, where designers modify AI-generated components of a prototype in real time through a separate control interface during user testing. This approach invites more immediate interaction between feedback and design decisions. To explore our concept, we engaged in a case study with experienced prototyping practitioners, examining how real-time prototyping might shape design processes. Participants highlighted the framework’s potential to support spontaneous insight generation and enhance collaborative dynamics. However, they also highlighted important considerations, including the need for a certain level of AI knowledge and challenges around planning and reliability. By integrating generative AI into the UCD process, our conceptual framework contributes to ongoing conversations around evolving user-centered methodologies.
BlendScape: Enabling End-User Customization of Video-Conferencing Environments through Generative AI
Today’s video-conferencing tools support a rich range of professional and social activities, but their generic meeting environments cannot be dynamically adapted to align with distributed collaborators’ needs. To enable end-user customization, we developed BlendScape, a rendering and composition system for video-conferencing participants to tailor environments to their meeting context by leveraging AI image generation techniques. BlendScape supports flexible representations of task spaces by blending users’ physical or digital backgrounds into unified environments and implements multimodal interaction techniques to steer the generation. Through an exploratory study with 15 end-users, we investigated whether and how they would find value in using generative AI to customize video-conferencing environments. Participants envisioned using a system like BlendScape to facilitate collaborative activities in the future, but required further controls to mitigate distracting or unrealistic visual elements. We implemented scenarios to demonstrate BlendScape’s expressiveness for supporting environment design strategies from prior work and propose composition techniques to improve the quality of environments.
As generative AI technologies integrate into financial workflows, understanding user interactions in the private equity domain is vital for optimizing information retrieval and user experience. This study uses a mixed-methods approach to explore interaction patterns within a generative AI chatbot prototype. Key themes, sentiments, and search behaviors were identified by analyzing 12 user interviews and 825 generative AI query inputs. Findings reveal predominant user intents—company analysis and sector-specific dynamics/market research—highlighting the importance of expertise in shaping interaction dynamics. A comprehensive product design framework tailored to private equity-specific needs is proposed, including delivering robust prompting support, offering high-level market insights and in-depth company analyses, clearly citing sources, and transparently communicating timeframes to users. This research provides actionable insights into designing intuitive, effective generative AI interfaces, advancing their application in private equity and broader financial sectors such as venture capital, asset management, hedge funds, investing banking, and M&A.
The experience and adoption of conversational search is tied to the accuracy and completeness of users' mental models -- their internal frameworks for understanding and predicting system behaviour. Thus, understanding these models can reveal areas for design interventions. Transparency is one such intervention which can improve system interpretability and enable mental model alignment. While past research has explored mental models of search engines, those of generative conversational search remain underexplored, even while the popularity of these systems soars. To address this, we conducted a study with 16 participants, who performed 4 search tasks using 4 conversational interfaces of varying transparency levels. Our analysis revealed that most user mental models were too abstract to support users in explaining individual search instances. These results suggest that 1) mental models may pose a barrier to appropriate trust in conversational search, and 2) hybrid web-conversational search is a promising novel direction for future search interface design.
As the boundaries of human computer interaction expand, Generative AI emerges as a key driver in reshaping user interfaces, introducing new possibilities for personalized, multimodal and cross-platform interactions. This integration reflects a growing demand for more adaptive and intuitive user interfaces that can accommodate diverse input types such as text, voice and video, and deliver seamless experiences across devices. This paper explores the integration of generative AI in modern user interfaces, examining historical developments and focusing on multimodal interaction, cross-platform adaptability and dynamic personalization. A central theme is the interface dilemma, which addresses the challenge of designing effective interactions for multimodal large language models, assessing the trade-offs between graphical, voice-based and immersive interfaces. The paper further evaluates lightweight frameworks tailored for mobile platforms, spotlighting the role of mobile hardware in enabling scalable multimodal AI. Technical and ethical challenges, including context retention, privacy concerns and balancing cloud and on-device processing are thoroughly examined. Finally, the paper outlines future directions such as emotionally adaptive interfaces, predictive AI driven user interfaces and real-time collaborative systems, underscoring generative AI's potential to redefine adaptive user-centric interfaces across platforms.
This study analyzes how the UI structures of key generative AI tools influence designers' creative processes. Using four criteria input, intervention, feedback, and interaction flow seven tools were examined through official documents, user reviews, and interface observations. The results show structural differences affecting creative flow, control, and output. The study offers insights for developing generative AI tools that better support designer-centered workflows.
Developing user-centred applications that address diverse user needs requires rigorous user research. This is time, effort and cost-consuming. With the recent rise of generative AI techniques based on Large Language Models (LLMs), there is a possibility that these powerful tools can be used to develop adaptive interfaces. This paper presents a novel approach to develop user personas and adaptive interface candidates for a specific domain using ChatGPT. We develop user personas and adaptive interfaces using both ChatGPT and a traditional manual process and compare these outcomes. To obtain data for the personas we collected data from 37 survey participants and 4 interviews in collaboration with a not-for-profit organisation. The comparison of ChatGPT generated content and manual content indicates promising results that encourage using LLMs in the adaptive interfaces design process.
Motivation: Functional Annotation of genomic Variants Online Resources (FAVOR) offers multi-faceted, whole genome variant functional annotations, which is essential for Whole Genome and Exome Sequencing (WGS/WES) analysis and the functional prioritization of disease-associated variants. A versatile chatbot designed to facilitate informative interpretation and interactive, user-centric summary of the whole genome variant functional annotation data in the FAVOR database is needed. Results: We have developed FAVOR-GPT, a generative natural language interface powered by integrating large language models (LLMs) and FAVOR. It is developed based on the Retrieval Augmented Generation (RAG) approach, and complements the original FAVOR portal, enhancing usability for users, especially those without specialized expertise. FAVOR-GPT simplifies raw annotations by providing interpretable explanations and result summaries in response to the user’s prompt. It shows high accuracy when cross-referencing with the FAVOR database, underscoring the robustness of the retrieval framework. Availability and implementation: Researchers can access FAVOR-GPT at FAVOR’s main website (https://favor.genohub.org).
For people with severe sensory-based motor disorders or musculoskeletal disorders, robotic assistance offers a promising solution to improve their daily living standards. In this paper, we present a wearable real-time 2-Dimensional/3-Dimensional (2D/3D) eye-gaze control interface for people with quadriplegia to enable robot-assisted locomotion, sensing, and manipulation. Compared to other modalities in human-robot interaction, gaze point information in Cartesian space has the advantage of being directly feasible for robotic control. To achieve accurate 3D gaze point estimation, we propose a method that leverages a commercially available 2D wearable eye tracker and an off-the-shelf stereo camera. Unlike traditional stereoscopic depth computation or 3D eye reconstruction approaches, which often rely on user-specific eye model calibration, our method reformulates the 2D-to-3D mapping as an online Newton-Raphson search problem that does not depend on individual eye model parameters for gaze depth estimation, allowing the system to operate effectively across varied environments and depth ranges. This results in a solution that is easy to implement, robust to individual variability, and computationally efficient for accurate 3D gaze point estimation. At the same time, 2D gaze is utilized to interact with a screen displaying robotic sensing of specific environments that are not directly visible to the user, thereby enabling extended sensing through robotic assistance. The feasibility of the proposed method is verified through quantitative evaluations of the 3D gaze point estimation. On average, the 3D gaze point estimation yields a mean Euclidean distance error of approximately 1.88 cm across the 0.5–4.0 m distance (corresponding to a 0.9 % percentage error), outperforming comparable methods. A proof-of-concept study further demonstrates successful robot-assisted locomotion, sensing, and manipulation using the proposed gaze interface. The implementation of the core method is available at https://github.com/SavickTso/ros2-3d-gaze-mapping.git.
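The depth-estimation step in this abstract is framed as an online Newton-Raphson search. The sketch below shows a generic, calibration-free 1D Newton iteration on an abstract residual along the gaze ray; it is only an illustration of the numerical idea, not the paper's actual residual function or implementation.

```python
def newton_depth(residual, d0: float = 1.0, tol: float = 1e-4, max_iter: int = 20) -> float:
    """Solve residual(d) = 0 for depth d (in meters) using a numerical derivative."""
    d, h = d0, 1e-3
    for _ in range(max_iter):
        r = residual(d)
        if abs(r) < tol:
            break
        dr = (residual(d + h) - residual(d - h)) / (2 * h)   # central difference
        if dr == 0:
            break
        d -= r / dr
        d = max(d, 0.05)   # keep the estimate in front of the user
    return d

# Toy usage: a disparity-like residual (proportional to inverse depth) whose root is 2.3 m.
depth = newton_depth(lambda d: 1.0 / d - 1.0 / 2.3)
```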
Current surveillance systems impose significant cognitive burden on operators who must monitor multiple 2D camera feeds simultaneously, leading to mental fatigue and degraded spatial awareness. We address this issue with a system that transforms disparate 2D surveillance feeds into a unified 3D spatial representation, offloading the mental mapping task from operators to the interface itself. Our approach integrates real-time 3D reconstruction with multi-camera feeds to create an intuitive spatial environment that aligns with natural human spatial cognition. Preliminary user studies with 14 participants demonstrate improved task performance and reduced cognitive load compared to traditional multi-view displays, supporting the hypothesis that externalizing spatial integration reduces vigilance decrement in monitoring tasks.
Integrated sensing and communications (ISAC) is essential for future 6G, bridging physical and digital worlds by enabling wireless systems to sense and respond to their environment. Digital twins (DTs) enhance ISAC by providing real-time, data-driven models for applications like localization and autonomous navigation. However, existing DT frameworks lack real-time modeling and multimodal sensing, limiting their use for high-resolution ISAC. This demo paper extends our recently proposed real-time DT framework for indoor ISAC-enabled robotics and introduces an LLM-driven speech interface for control and planning, a continuous 3D reconstruction scheme using RGB-D data, and a novel ray tracing approach for wireless channel modeling from point clouds. These innovations address key limitations, supporting advanced ISAC applications and immersive human-computer interaction.
The emergence of Large Language Models (LLMs) has reshaped agent systems. Unlike traditional rule-based agents with limited task scope, LLM-powered agents offer greater flexibility, cross-domain reasoning, and natural language interaction. Moreover, with the integration of multi-modal LLMs, current agent systems are highly capable of processing diverse data modalities, including text, images, audio, and structured tabular data, enabling richer and more adaptive real-world behavior. This paper comprehensively examines the evolution of agent systems from the pre-LLM era to current LLM-powered architectures. We categorize agent systems into software-based, physical, and adaptive hybrid systems, highlighting applications across customer service, software development, manufacturing automation, personalized education, financial trading, and healthcare. We further discuss the primary challenges posed by LLM-powered agents, including high inference latency, output uncertainty, lack of evaluation metrics, and security vulnerabilities, and propose potential solutions to mitigate these concerns.
Artificial intelligence (AI), including large language models and generative AI, is emerging as a significant force in software development, offering developers powerful tools that span the entire development lifecycle. Although software engineering research has extensively studied AI tools in software development, the specific types of interactions between developers and these AI-powered tools have only recently begun to receive attention. Understanding and improving these interactions has the potential to enhance productivity, trust, and efficiency in AI-driven workflows. In this paper, we propose a taxonomy of interaction types between developers and AI tools, identifying eleven distinct interaction types, such as auto-complete code suggestions, command-driven actions, and conversational assistance. Building on this taxonomy, we outline a research agenda focused on optimizing AI interactions, improving developer control, and addressing trust and usability challenges in AI-assisted development. By establishing a structured foundation for studying developer-AI interactions, this paper aims to stimulate research on creating more effective, adaptive AI tools for software development.
The adoption of large language models (LLMs) and autonomous agents in software engineering marks an enduring paradigm shift. These systems create new opportunities for tool design, workflow orchestration, and empirical observation, while fundamentally reshaping the roles of developers and the artifacts they produce. Although traditional empirical methods remain central to software engineering research, the rapid evolution of AI introduces new data modalities, alters causal assumptions, and challenges foundational constructs such as “developer”, “artifact”, and “interaction”. As humans and AI agents increasingly co-create, the boundaries between social and technical actors blur, and the reproducibility of findings becomes contingent on model updates and prompt contexts. This vision paper examines how the integration of LLMs into software engineering disrupts established research paradigms. We discuss how it transforms the phenomena we study, the methods and theories we rely on, the data we analyze, and the threats to validity that arise in dynamic AI-mediated environments. Our aim is to help the empirical software engineering community adapt its questions, instruments, and validation standards to a future in which AI systems are not merely tools, but active collaborators shaping software engineering and its study.
Automation increasingly shapes modern society, requiring artificial intelligence (AI) systems to not only perform complex tasks but also provide clear, actionable explanations of their decisions, especially in high-stakes domains. However, most contemporary AI systems struggle to explain their runtime operations in specific instances, limiting their applicability in contexts demanding stringent outcome justification. Existing approaches have attempted to address this challenge but often fall short in terms of contextual relevance, human cognitive alignment, or scalability. This paper introduces System-of-Systems Machine Learning (SoS-ML) as a novel framework to advance explainable artificial intelligence (XAI) by addressing the limitations of current methods. Drawing from insights in philosophy, cognitive science, and social sciences, SoS-ML seeks to integrate human-like reasoning processes into AI, framing explanations as contextual inferences and justifications. The research demonstrates how SoS-ML addresses key challenges in XAI, such as enhancing explanation accuracy and aligning AI reasoning with human cognition. By leveraging a multi-agent, modular design, SoS-ML encourages collaboration among machine learning models, leading to more transparent, context-aware systems. The framework’s ability to generalize across domains is demonstrated through experiments on the Pima Indian Diabetes dataset and pie chart image-to-text interpretation, showcasing its transformative potential in improving both model accuracy and explainability. The findings emphasize SoS-ML’s role in advancing responsible AI, particularly in high-stakes environments where interpretability and social accountability are paramount.
This study explores perceptions of artificial intelligence (AI) in the higher education workplace through innovative use of fiction writing workshops. Twenty-three participants took part in three workshops, imagining the application of AI assistants and chatbots to their roles. Key themes were identified, including perceived benefits and challenges of AI implementation, interface design implications, and factors influencing task delegation to AI. Participants envisioned AI primarily as a tool to enhance task efficiency rather than fundamentally transform job roles. This research contributes insights into the desires and concerns of educational users regarding AI adoption, highlighting potential barriers such as value alignment.
Today’s demand for customized service-based systems requires that industry understands the context and the particular needs of their customers. Service Oriented Dynamic Software Product Line practices enable companies to create individual products for every customer by providing an interdependent set of features presenting web services that are automatically activated and deactivated depending on the running situation. Such product lines are designed to support their self-adaptation to new contexts and requirements. Users configure personalized products by selecting desired features based on their needs. However, with large feature models, users must understand the functionalities of features and the impact of their gradual selections and their current context in order to make appropriate decisions. Thus, users need to be guided in configuring their product. To tackle this challenge, users can express their product requirements by textual language and a recommended product will be generated with respect to the described requirements. In this paper, we propose a deep neural network based recommendation approach that provides personalized recommendations to users which ease the configuration process. In detail, our proposed recommender system is based on a deep neural network that predicts to the user relevant features of the recommended product with the consideration of their requirements, contextual data and previous recommended products. In order to demonstrate the performance of our approach, we compared six different recommendation algorithms in a smart home case study.
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserves their capabilities, and only trains a simple adapter with just 1% (156M) of the foundation models’ parameters. This adaptation not only leads SLM to achieve strong performance on conventional tasks such as automatic speech recognition (ASR) and automatic speech translation (AST), but also unlocks the novel capability of zero-shot instruction-following for more diverse tasks. Given a speech input and a text instruction, SLM is able to perform unseen generation tasks including contextual biasing ASR using real-time context, dialog generation, speech continuation, and question answering. Our approach demonstrates that the representational gap between pretrained speech and language models is narrower than one would expect, and can be bridged by a simple adaptation mechanism. As a result, SLM is not only efficient to train, but also inherits strong capabilities already present in foundation models of different modalities.
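The adaptation mechanism in this abstract keeps both foundation models frozen and trains only a small bridging module. The sketch below illustrates that setup with a simple projection adapter in PyTorch; the layer sizes and forward pass are illustrative assumptions, not the SLM architecture.

```python
import torch
import torch.nn as nn

class SpeechToLMAdapter(nn.Module):
    """Maps frozen speech-encoder features into the language model's embedding space."""
    def __init__(self, speech_dim: int = 1024, lm_dim: int = 4096, hidden: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, hidden), nn.GELU(), nn.Linear(hidden, lm_dim))

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, speech_dim) from a frozen speech encoder.
        return self.proj(speech_feats)          # (batch, frames, lm_dim)

def trainable_parameters(speech_encoder, language_model, adapter):
    """Freeze both foundation models; optimize only the adapter."""
    for p in speech_encoder.parameters():
        p.requires_grad = False
    for p in language_model.parameters():
        p.requires_grad = False
    return [p for p in adapter.parameters() if p.requires_grad]
```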
With the rapid advancement of artificial intelligence (AI), the field of language service studies has ushered in a paradigm shift and holds broad development prospects in the AI era. This paper first systematically reviews the global and domestic research progress in AI-driven language services: internationally, scholars focus on the integration of AI technologies with language service workflows, efficiency optimization, and quality evaluation; domestically, research leans toward addressing practical needs such as cross-cultural communication under national strategies and the localization of AI language tools. Subsequently, it examines the current applications of AI in the language service domain, covering key technologies including neural machine translation (NMT) with enhanced contextual adaptation, speech recognition and synthesis supporting real-time multilingual interaction, and large language models (LLMs) enabling intelligent content creation and multi-modal language services. Finally, the paper envisions future research directions such as cross-disciplinary integration of AI, linguistics, and communication, ethical governance of AI language services, and personalized service innovation. It further puts forward pertinent suggestions, including strengthening the construction of multilingual corpus resources, improving the evaluation system for AI-driven language services, and cultivating interdisciplinary talents, so as to promote the high-quality development of the global language service industry.
Self-adaptive systems have traditionally relied on the MAPE-K loop. It consists of a centralized, reactive, and sequential loop for monitoring, analyzing, planning, and executing system adaptations. However, the increasing complexity and dynamic nature of modern systems have exposed the limitations of MAPE-K loops, including their lack of proactivity, scalability challenges, and difficulty integrating continuous learning or distributed decision-making. We introduce AWARE (Assess, Weigh, Act, Reflect, Enrich), a distributed, goal-driven framework that addresses these limitations. AWARE employs autonomous AI agents capable of proactive adaptation, collaboration, and continuous learning to enhance decision-making and system resilience. The modular design of our framework supports dynamic agent integration and optimized resource utilization, enabling seamless scalability and adaptability. AWARE not only anticipates changes and optimizes responses but also iteratively refines its strategies based on contextual insights. Through a comparison with MAPE-K and a real-world use case, we demonstrate how AWARE distributed intelligence redefines the capabilities of self-adaptive systems, offering a solution better aligned with the demands of complex real-world systems.
The development of Self-adaptive Software (SaS) is not a trivial task because this type of software has specific features compared to traditional ones. In short, SaS can reflect on its internal and external states and propose structural, behavioral, and contextual changes that can be incorporated at runtime. Manual adaptation tasks, even if very well executed, normally become onerous in time and effort, besides being error prone because of the involuntary injection of errors by the developers. Automated processes have been used as a feasible solution to conduct software adaptation at runtime by minimizing human involvement (e.g., software engineers and developers) and quickening up the execution of tasks. In parallel, Reference Architectures (RA) have been used to aggregate knowledge and architectural artifacts, capturing the systems’ essence in specific domains. Therefore, it can be said that this type of architecture is an important way to support the development, standardization, and evolution of software systems. Considering this context, the main contribution of this paper is to present the second release of a reference architecture called RA4SaS (Reference Architecture for SaS). This architecture is based on reflection, a controlled adaptation approach, and a set of automated processes that support the development of SaS in both design and runtime. To show the applicability of our RA, we conducted a case study that explored three adaptation scenarios. As a result, we observe our RA has good potential to efficiently contribute to the SaS domain.
Mixed Reality is increasingly used in mobile settings beyond controlled home and office spaces. This mobility introduces the need for user interface layouts that adapt to varying contexts. However, existing adaptive systems are designed only for static environments. In this paper, we introduce SituationAdapt, a system that adjusts Mixed Reality UIs to real-world surroundings by considering environmental and social cues in shared settings. Our system consists of perception, reasoning, and optimization modules for UI adaptation. Our perception module identifies objects and individuals around the user, while our reasoning module leverages a Vision-and-Language Model to assess the placement of interactive UI elements. This ensures that adapted layouts do not obstruct relevant environmental cues or interfere with social norms. Our optimization module then generates Mixed Reality interfaces that account for these considerations as well as temporal constraints. For evaluation, we first validate our reasoning module’s capability of assessing UI contexts in comparison to human expert users. In an online user study, we then establish SituationAdapt’s capability of producing context-aware layouts for Mixed Reality, where it outperformed previous adaptive layout methods. We conclude with a series of applications and scenarios to demonstrate SituationAdapt’s versatility.
Despite advances in digital libraries, keyword-based search remains rigid, offering limited support for exploratory and sense-making tasks. We introduce Athena, a book discovery system that integrates LLM-powered Retrieval-Augmented Generation (RAG) with interactive graph visualization. This hybrid system allows users to engage in natural language dialogue, navigate relational graphs of retrieved books, and generate cross-book summaries, offering an alternative to static keyword search. A preliminary user study found that Athena reduced cognitive load, improved usability, and encouraged exploratory behavior, although user trust in AI-generated content varied. We outline future directions focused on scaling to larger, more diverse user studies and systematically analyzing how conversational and visual features influence trust, satisfaction, and external validation behaviors.
The new AI platform is designed and implemented by the software team at full-stack AI SaaS, which includes conversational AI, image generation, and code generation. Powering for both conversation and code uses OpenAI’s GPT-3.5-turbo and the platform will use DALL·E to generate images, thus offering a unified user experience of these different AI villages through a single cohesive interface. Consumer-side routing is seen in a number of technologies e.g. modern web technologies such as Next.js 13 and its App Router, thereby it becomes possible to have smooth, fast and responsive user experience. The era in which we live is characterized by user expectations that are always demanding, and this endeavor is related to vital user-experience, developer-usability, and scalability tasks. An AI platform will only be useful if it involves the whole process - integrating the listed elements to a solution that empowers users comprehensively across all the various domains of activities will make it one. The discussion of the layout and those that use our platform is talked about in this particular context which includes the scalability key aspect, the user interface design concern, and the development treatment to the users. Next.js 13 together with API-s as well as the machine algorithms of OpenAI presents the interlaid milieu of a technologic-creative realm which deals with today’s issues. Consequently, we verify the efficiency of our system by load testing and collecting user reviews. These results not only reaffirm the infallibility of our tool but also bring new areas for investigation and enhancement. In this paper, we hope to stimulate some subtle debates on the probable of AI to make significant changes in the user experience and enhance the productivity of the user, moving towards a scenario in which everyone has easy access to AI tools covering a large variety of walks of life.
Artificial Intelligence (AI) has progressed so far in human computer interaction that it is much more natural and interesting. Optical Character Recognition (OCR) conjointly with Conversational AI is capable of processing visual alongside the textual input and generating intelligent and context aware responses, and therefore the work on a multimodal chatbot system is introduced in this paper. The proposed system extracts text from images, Natural Language Processing (NLP) processes user queries, and enhances the interaction by speech output through text to speech synthesis. In particular, this chatbot doesn’t accept speech as input modality but tries to translate text response to speech to make the interface more accessible for visually impaired users. Additionally, there is a poster generation module in the system for visual summarization of the conversations and the extracted content. The chatbot uses state of the art deep learning models and language frameworks to handle real time processing, grammatical accuracy in real time and also across different scenarios. From education, assistive technologies, customer support and all the possibilities in between, the applications take advantage of multimodal, voice enriched and visually enhanced communication to include users.
Tabular data is the most common format to publish and exchange structured data online. A clear example is the growing number of open data portals published by public administrations. However, exploitation of these data sources is currently limited to technical people able to programmatically manipulate and digest such data. As an alternative, we propose the use of chatbots to offer a conversational interface to facilitate the exploration of tabular data sources, including support for data analytics questions that are responded via charts rendered by the chatbot. Moreover, our chatbots are automatically generated from the data source itself thanks to the instantiation of a configurable collection of conversation patterns matched to the chatbot intents and entities.
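This abstract describes chatbots generated automatically from the tabular source itself by instantiating conversation patterns as intents and entities. The sketch below shows one plausible schema-to-intent derivation; the column-typing heuristic and intent templates are illustrative assumptions, not the cited generator's actual patterns.

```python
import csv

def infer_schema(path: str) -> dict[str, str]:
    """Rough column typing: 'numeric' if every sampled value parses as a number, else 'categorical'."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))[:50]
    schema = {}
    for col in rows[0]:
        values = [r[col] for r in rows if r[col]]
        numeric = all(v.replace(".", "", 1).lstrip("-").isdigit() for v in values)
        schema[col] = "numeric" if numeric else "categorical"
    return schema

def generate_intents(schema: dict[str, str]) -> list[dict]:
    """Instantiate one analytics or filtering intent per column."""
    intents = []
    for col, kind in schema.items():
        if kind == "numeric":
            intents.append({"intent": f"stats_{col}",
                            "utterances": [f"what is the average {col}",
                                           f"show the maximum {col}"]})
        else:
            intents.append({"intent": f"filter_by_{col}",
                            "utterances": [f"show rows where {col} is <value>"]})
    return intents
```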
This study explores strategies to optimize the emotional experience of voice user interfaces (VUIs) in AI assistants, focusing on Generation Z single households in their 20s. The research proceeded in four stages: user survey, case analysis, prototype design with usability testing, and emotion recognition technology analysis, applying the Person–Artifact–Task (P-A-T) model to derive emotion-based VUI design strategies. In UI, visual feedback elements were emphasized, while in UX, conversational naturalness, emotional responsiveness, and adaptability to user states were considered. Usability testing revealed that users perceived personalized dialogue and emotional reactions as key satisfaction factors, with visual feedback enhancing immersion and emotional engagement. These findings suggest that AI assistants can evolve beyond functional tools into emotionally connected digital companions, and the study provides methodological foundations for future personalized VUI design.
As users engage more frequently with AI conversational agents, conversations may exceed their “memory” capacity, leading to failures in correctly leveraging certain memories for tailored responses. However, in finding past memories that can be reused or referenced, users need to retrieve relevant information in various conversations and articulate to the AI their intention to reuse these memories. To support this process, we introduce Memolet, an interactive object that reifies memory reuse. Users can directly manipulate Memolet to specify which memories to reuse and how to use them. We developed a system demonstrating Memolet’s interaction across various memory reuse stages, including memory extraction, organization, prompt articulation, and generation refinement. We examine the system’s usefulness with an N=12 within-subject study and provide design implications for future systems that support user-AI conversational memory reusing.
Multimodal AI is an important step towards building effective tools to leverage multiple modalities in human-AI communication. Building a multimodal document-grounded AI system to interact with long documents remains a challenge. Our work aims to fill the research gap of directly leveraging grounded visuals from documents alongside textual content in documents for response generation. We present an interactive conversational AI agent 'MuDoC' based on GPT-4o to generate document-grounded responses with interleaved text and figures. MuDoC's intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to source text and figures in the documents. We also discuss qualitative observations based on MuDoC responses highlighting its strengths and limitations.
This paper introduces a novel integration of Retrieval-Augmented Generation (RAG) enhanced Large Language Models (LLMs) with Extended Reality (XR) technologies to address knowledge transfer challenges in industrial environments. The proposed system embeds domain-specific industrial knowledge into XR environments through a natural language interface, enabling hands-free, context-aware expert guidance for workers. We present the architecture of the proposed system consisting of an LLM Chat Engine with dynamic tool orchestration and an XR application featuring voice-driven interaction. Performance evaluation of various chunking strategies, embedding models, and vector databases reveals that semantic chunking, balanced embedding models, and efficient vector stores deliver optimal performance for industrial knowledge retrieval. The system's potential is demonstrated through early implementation in multiple industrial use cases, including robotic assembly, smart infrastructure maintenance, and aerospace component servicing. Results indicate potential for enhancing training efficiency, remote assistance capabilities, and operational guidance in alignment with Industry 5.0's human-centric and resilient approach to industrial development.
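To make the retrieval side of such a pipeline concrete, the sketch below shows chunking plus vector search, assuming a toy hashed bag-of-words embedding in place of a real embedding model and a flat NumPy array in place of a vector database; the paper itself evaluates real chunking strategies, embedding models, and vector stores.

```python
# Minimal retrieval sketch for industrial-manual style text. The embedding and "index"
# are deliberately simplistic stand-ins.
import re
import numpy as np

DIM = 256

def embed(text: str) -> np.ndarray:
    """Toy embedding: hash each token into a fixed-size bag-of-words vector, then L2-normalize."""
    vec = np.zeros(DIM)
    for token in re.findall(r"\w+", text.lower()):
        vec[hash(token) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def chunk(text: str, size=60, overlap=15):
    """Sliding-window chunking over words (semantic chunking would split on topic boundaries instead)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def build_index(chunks):
    return np.stack([embed(c) for c in chunks])

def retrieve(query, chunks, index, k=3):
    scores = index @ embed(query)                 # cosine similarity (vectors are unit length)
    top = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

manual = ("Torque the M6 fasteners to 9 Nm before mounting the bracket. " * 5
          + "If the sensor reports a fault, power-cycle the controller and re-run calibration. " * 5)
chunks = chunk(manual)
index = build_index(chunks)
for text, score in retrieve("how do I clear a sensor fault?", chunks, index):
    print(round(score, 3), text[:60], "...")
```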
Mixed reality (MR) environments offer embodied spatial interaction, providing intuitive 3D manipulation capabilities that enhance the conceptual design process. Parametric modeling, a powerful and advanced architectural design method, enables the generation of complex, optimized geometries. However, its integration into MR environments remains limited due to precision constraints and unsuitable input modalities. Existing MR tools prioritize spatial interaction but lack the control and expressiveness required for parametric workflows, particularly for designers without formal programming backgrounds. We address this gap by introducing a novel conversational MR interface that combines speech input, gesture recognition, and a multi-agent large language model (LLM) system to support intuitive parametric modeling. Our system dynamically manages parameter states, resolves ambiguous commands through conversation and contextual prompting, and enables real-time model manipulation within immersive environments. We demonstrate how this approach reduces cognitive and operational barriers in early-stage design tasks, allowing users to refine and explore their design space. This work expands the role of MR to a generative design platform, supporting programmatic thinking in design tasks through natural, embodied interaction.
We present a demo of RAGTrip, a modular conversational system that integrates Large Language Models (LLMs), spatial reasoning, and information retrieval to generate personalized walking itineraries in urban environments. Unlike traditional route planners or closed-book LLMs, RAGTrip interprets nuanced user preferences, avoids hallucinations, and grounds its suggestions in real-world geographic and factual data. The system features an interactive conversational interface that engages users in refining both the itinerary and the attractions to visit. Through dynamic map visualizations and contextual responses, users can explore and iteratively customize their routes. The demo includes a toggle to enable or disable Retrieval-Augmented Generation (RAG), allowing direct comparison between RAG-enhanced and closed-book LLM responses. This highlights the value of combining spatial and semantic grounding in conversational itinerary recommendation.
Proactive conversational agents (CAs) are often underutilized in e-Commerce due to misalignment with user expectations and integration challenges. In this demo, we present a hybrid e-Commerce interface that combines a browsing window with a proactive conversational agent, leveraging context-aware interactions to enhance the user’s experience. The interface dynamically adapts to user actions, repositioning the CA to centralize its recommendations and foster engagement through visual design nudges. By integrating a graph-based context model with a large language model (LLM) for intent detection and response generation, the system provides precise, multi-turn recommendations and action-oriented dialogues. A formative user study (N=10) demonstrated the hybrid interface’s effectiveness, achieving higher user engagement and satisfaction compared to standalone browsing or conversational interfaces.
The increasing volume and complexity of digital financial transactions make it challenging for individuals to manually track and analyze their spending. This paper introduces Intelli-Finance, an AI-powered financial assistant designed to automate and simplify personal finance management. Our system integrates a multi-stage pipeline that begins with parsing unstructured bank statements from PDF and CSV formats. It then employs a novel hybrid classification model, combining weak supervision with a Support Vector Machine (SVM), to accurately categorize transactions without requiring large manually-labeled datasets. The classified, structured data is then made accessible through a dynamic, natural-language query interface powered by a Large Language Model (LLM) agent built with LangChain. This allows users to seamlessly ask complex questions about their finances and receive personalized, data-driven insights. Our approach demonstrates the power of combining classical machine learning for robust classification with the advanced reasoning capabilities of LLMs to create a comprehensive, intuitive, and powerful tool for personal financial management.
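A hedged sketch of the weak-supervision idea follows: keyword rules label a seed set and an SVM generalizes to unseen merchants. The rules, categories, and transactions are invented for illustration and are not the Intelli-Finance pipeline.

```python
# Weak supervision + SVM sketch using scikit-learn; character n-grams cope with noisy
# merchant strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

RULES = {  # hypothetical labeling rules
    "groceries": ["supermarket", "grocery", "mart"],
    "transport": ["uber", "metro", "fuel"],
    "dining":    ["restaurant", "cafe", "pizza"],
}

def weak_label(description: str):
    d = description.lower()
    for category, keywords in RULES.items():
        if any(k in d for k in keywords):
            return category
    return None  # abstain when no rule fires

transactions = [
    "UBER TRIP 1234", "CITY SUPERMARKET", "PIZZA PALACE", "METRO CARD RELOAD",
    "GREEN GROCERY", "CORNER CAFE", "FUEL STATION 77", "FRESH MART",
]
seed = [(t, weak_label(t)) for t in transactions if weak_label(t) is not None]

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)), LinearSVC())
clf.fit([t for t, _ in seed], [y for _, y in seed])

print(clf.predict(["DOWNTOWN PIZZERIA", "NIGHT BUS TICKET"]))
```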
Conversational recommender systems (CRSs) aim to recommend high-quality items to users through a dialogue interface. A CRS usually involves multiple sub-tasks, such as user preference elicitation, recommendation, explanation, and item information search. Developing effective CRSs raises several challenges: 1) how to properly manage sub-tasks; 2) how to effectively solve different sub-tasks; and 3) how to correctly generate responses that interact with users. Recently, Large Language Models (LLMs) have exhibited an unprecedented ability to reason and generate, presenting a new opportunity to develop more powerful CRSs. In this work, we propose a new LLM-based CRS, referred to as LLMCRS, to address the above challenges. For sub-task management, we leverage the reasoning ability of the LLM to manage sub-tasks effectively. For sub-task solving, we pair the LLM with expert models for the different sub-tasks to achieve enhanced performance. For response generation, we utilize the generation ability of the LLM as a language interface to better interact with users. Specifically, LLMCRS divides the workflow into four stages: sub-task detection, model matching, sub-task execution, and response generation. LLMCRS also designs schema-based instruction, demonstration-based instruction, dynamic sub-task and model matching, and summary-based generation to instruct the LLM to generate the desired results in the workflow. Finally, to adapt the LLM to conversational recommendation, we also propose fine-tuning the LLM with reinforcement learning from CRS performance feedback, referred to as RLPF. Experimental results on benchmark datasets show that LLMCRS with RLPF outperforms existing methods.
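A schematic sketch of the four-stage workflow described above follows; the keyword-based detector, toy expert models, and templated reply are stand-ins for the LLM reasoning and expert models the paper actually uses.

```python
# Dispatcher-style sketch: sub-task detection -> model matching -> execution -> response.
from typing import Callable, Dict

def detect_subtask(utterance: str) -> str:
    """Stage 1: naive keyword routing where the paper uses LLM reasoning."""
    if "recommend" in utterance or "suggest" in utterance:
        return "recommendation"
    if "why" in utterance:
        return "explanation"
    return "item_search"

EXPERTS: Dict[str, Callable[[str], str]] = {     # Stage 2: sub-task -> expert model (placeholders)
    "recommendation": lambda u: "candidate items: [Movie A, Movie B]",
    "explanation":    lambda u: "Movie A matches your preference for sci-fi.",
    "item_search":    lambda u: "Movie A (2019), directed by ...",
}

def generate_response(utterance: str, expert_output: str) -> str:
    """Stage 4: an LLM would turn the expert output into a conversational reply."""
    return f"Based on what you asked ('{utterance}'): {expert_output}"

def converse(utterance: str) -> str:
    subtask = detect_subtask(utterance)          # Stage 1
    expert = EXPERTS[subtask]                    # Stage 2
    result = expert(utterance)                   # Stage 3
    return generate_response(utterance, result)  # Stage 4

print(converse("Can you recommend a sci-fi movie?"))
```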
This research full paper describes the AR-Classroom application that utilizes augmented reality (AR) and physical and virtual manipulatives to enable undergraduate students to build intuition about the relation between spatial transformations and their mathematical representations. To further build on the app's usability and functionality, additional features are being prototyped to continue improving the user-app interaction with the AR-Classroom. Some of the challenges the students faced when using AR-Classroom were recalling basic matrix operations without geometric context, basic trigonometric functions and their applications in the two-dimensional space, loss of AR registration for not understanding the AR environment, and User Interface (UI) issues. To address these issues, a conversational Artificial Intelligence (AI)-based multi-sensory and interactive assistance has been added to the AR-Classroom. Integrating sophisticated language processing and response generation of AI with immersive three-dimensional capabilities of AR can create a more engaging learning experience than the previous versions of the app. This integration focuses on creating a symbiosis between AR and AI. It creates an elevated user experience by offering real-time, personalized assistance to students dealing with issues related to understanding mathematical concepts and functionalities of the app. A qualitative exploratory usability study was done to assess the user's interaction with the AI implemented in the AR-Classroom, aiming to explore the AI's ability to guide students in using AR technology and aid in introductory matrix algebra learning, to effectively serve the students' learning. Based on the thematic analysis of the user experiment we found four main themes related to users' perceptions of AR-Classroom AI features usability: (1) AI chatbot ease-of-use, (2) Need for answer elaboration from AI, (3) Desire for visual information, and (4) Increased understanding of the content area. The scores of ease of use indicate AI's ability to guide complex tasks in an AR environment using AI features with less concern for the cognitive load. The overall result suggests the need for further investigation on incorporating AI-guided visual cues in an AR environment.
The advancement of artificial intelligence has transformed user interface design by enabling adaptive and personalized systems. Alongside these benefits, AI driven interfaces have also enabled the emergence of dark patterns, which are manipulative design strategies that influence user behavior for financial or business gain. As AI systems learn from data that already contains deceptive practices, they can replicate and optimize these patterns in increasingly subtle and personalized ways. This paper examines AI generated dark patterns, their psychological foundations, technical mechanisms, and regulatory implications in India. We introduce DarkPatternDetector, an automated system that crawls and analyzes websites to detect dark patterns using a combination of UI heuristics, natural language processing, and temporal behavioral signals. The system is evaluated on a curated dataset of dark and benign webpages and achieves strong precision and recall. By aligning detection results with India's Digital Personal Data Protection Act, 2023, this work provides a technical and regulatory framework for identifying and mitigating deceptive interface practices. The goal is to support ethical AI design, regulatory enforcement, and greater transparency in modern digital systems.
To facilitate high quality interaction during the regular use of computing systems, it is essential that the user interface (UI) deliver content and components in an appropriate manner. Although extended reality (XR) is emerging as a new computing platform, we still have a limited understanding of how best to design and present interactive content to users in such immersive environments. Adaptive UIs offer a promising approach for optimal presentation in XR as the user's environment, tasks, capabilities, and preferences vary under changing context. In this position paper, we present a design framework for adapting various characteristics of content presented in XR. We frame these as five considerations that need to be taken into account for adaptive XR UIs: What?, How Much?, Where?, How?, and When?. With this framework, we review literature on UI design and adaptation to reflect on approaches that have been adopted or developed in the past towards identifying current gaps and challenges, and opportunities for applying such approaches in XR. Using our framework, future work could identify and develop novel computational approaches for achieving successful adaptive user interfaces in such immersive environments.
As the automotive world moves toward higher levels of driving automation, Level 3 automated driving represents a critical juncture. In Level 3 driving, vehicles can drive themselves under limited conditions, but drivers are expected to be ready to take over when the system requests. Helping the driver maintain an appropriate level of Situation Awareness (SA) in such contexts becomes a critical task. This position paper explores the potential of Attentive User Interfaces (AUIs) powered by generative Artificial Intelligence (AI) to address this need. Rather than relying on overt notifications, we argue that AUIs based on novel AI technologies such as large language models or diffusion models can be used to improve SA in an unconscious and subtle way without negative effects on drivers' overall workload. Accordingly, we propose five strategies for how generative AI can be used to improve the quality of takeovers and, ultimately, road safety.
This study presents a novel approach for intelligent user interaction interface generation and optimization, grounded in the variational autoencoder (VAE) model. With the rapid advancement of intelligent technologies, traditional interface design methods struggle to meet the evolving demands for diversity and personalization, often lacking flexibility in real-time adjustments to enhance the user experience. Human-Computer Interaction (HCI) plays a critical role in addressing these challenges by focusing on creating interfaces that are functional, intuitive, and responsive to user needs. This research leverages the RICO dataset to train the VAE model, enabling the simulation and creation of user interfaces that align with user aesthetics and interaction habits. By integrating real-time user behavior data, the system dynamically refines and optimizes the interface, improving usability and underscoring the importance of HCI in achieving a seamless user experience. Experimental findings indicate that the VAE-based approach significantly enhances the quality and precision of interface generation compared to other methods, including autoencoders (AE), generative adversarial networks (GAN), conditional GANs (cGAN), deep belief networks (DBN), and VAE-GAN. This work contributes valuable insights into HCI, providing robust technical solutions for automated interface generation and enhanced user experience optimization.
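As a rough illustration of the underlying model class, the following is a minimal PyTorch VAE over flat layout vectors; the dimensions, architecture, and random data are placeholders, not the paper's model or its RICO preprocessing.

```python
# Minimal VAE sketch: encoder -> (mu, logvar) -> reparameterized latent -> decoder,
# trained with reconstruction + KL loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutVAE(nn.Module):
    def __init__(self, x_dim=80, h_dim=128, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = LayoutVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 80)                       # stand-in for a batch of normalized layout vectors
opt.zero_grad()
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
loss.backward()
opt.step()
print(float(loss))
```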
This study introduces an adaptive user interface generation technology, emphasizing the role of Human-Computer Interaction (HCI) in optimizing user experience. By focusing on enhancing the interaction between users and intelligent systems, this approach aims to automatically adjust interface layouts and configurations based on user feedback, streamlining the design process. Traditional interface design involves significant manual effort and struggles to meet the evolving personalized needs of users. Our proposed system integrates adaptive interface generation with reinforcement learning and intelligent feedback mechanisms to dynamically adjust the user interface, better accommodating individual usage patterns. In the experiment, the OpenAI CLIP Interactions dataset was utilized to verify the adaptability of the proposed method, using click-through rate (CTR) and user retention rate (RR) as evaluation metrics. The findings highlight the system's ability to deliver flexible and personalized interface solutions, providing a novel and effective approach for user interaction design and ultimately enhancing HCI through continuous learning and adaptation.
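Stripped to its simplest form, such an adaptation loop can be viewed as a bandit over layout variants rewarded by clicks. The sketch below is an epsilon-greedy illustration with a simulated click-through rate, not the paper's system, which uses richer state and retention signals.

```python
# Epsilon-greedy bandit over layout variants; reward = click (1) or no click (0).
import random

LAYOUTS = ["compact", "card_grid", "list_detail"]
TRUE_CTR = {"compact": 0.05, "card_grid": 0.12, "list_detail": 0.08}   # hidden simulator

counts = {l: 0 for l in LAYOUTS}
values = {l: 0.0 for l in LAYOUTS}            # running mean reward per layout

def choose(epsilon=0.1):
    if random.random() < epsilon:             # explore
        return random.choice(LAYOUTS)
    return max(LAYOUTS, key=lambda l: values[l])   # exploit

for _ in range(5000):
    layout = choose()
    click = 1.0 if random.random() < TRUE_CTR[layout] else 0.0   # reward = click-through
    counts[layout] += 1
    values[layout] += (click - values[layout]) / counts[layout]  # incremental mean update

print({l: round(values[l], 3) for l in LAYOUTS})  # estimates approach the true CTRs
```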
AI is growing increasingly capable of automatically generating user interfaces (GenUI) from user prompts. However, designing GenUI applications that enable users to discover diverse customizations while preserving GenUI's expressiveness remains challenging. Current design methods -- presenting prompt boxes and leveraging context -- lack affordances for customization discovery, while traditional menu-based approaches become overly complex given GenUI's vast customization space. We propose Gradually Generating User Interfaces -- a design method that structures customizations into intermediate UI layers that AI gradually loads during interface generation. These intermediate stages expose different customization features along specific dimensions, making them discoverable to users. Users can wind back the generation process to access customizations. We demonstrate this approach through three prototype websites, showing how designers can support GenUI's expanded customization capabilities while maintaining visual simplicity and discoverability. Our work offers a practical method for integrating customization features into GenUI applications, contributing an approach to designing malleable software.
Robots often need to convey information to human users. For example, robots can leverage visual, auditory, and haptic interfaces to display their intent or express their internal state. In some scenarios there are socially agreed upon conventions for what these signals mean: e.g., a red light indicates an autonomous car is slowing down. But as robots develop new capabilities and seek to convey more complex data, the meaning behind their signals is not always mutually understood: one user might think a flashing light indicates the autonomous car is an aggressive driver, while another user might think the same signal means the autonomous car is defensive. In this paper we enable robots to adapt their interfaces to the current user so that the human's personalized interpretation is aligned with the robot's meaning. We start with an information theoretic end-to-end approach, which automatically tunes the interface policy to optimize the correlation between human and robot. But to ensure that this learning policy is intuitive -- and to accelerate how quickly the interface adapts to the human -- we recognize that humans have priors over how interfaces should function. For instance, humans expect interface signals to be proportional and convex. Our approach biases the robot's interface towards these priors, resulting in signals that are adapted to the current user while still following social expectations. Our simulations and user study results across 15 participants suggest that these priors improve robot-to-human communication. See videos here: https://youtu.be/Re3OLg57hp8
This paper presents Face2Feel, a novel user interface (UI) model that dynamically adapts to user emotions and preferences captured through computer vision. This adaptive UI framework addresses the limitations of traditional static interfaces by integrating digital image processing, face recognition, and emotion detection techniques. Face2Feel analyzes user expressions utilizing a webcam or pre-installed camera as the primary data source to personalize the UI in real-time. Although dynamically changing user interfaces based on emotional states are not yet widely implemented, their advantages and the demand for such systems are evident. This research contributes to the development of emotion-aware applications, particularly in recommendation systems and feedback mechanisms. A case study, "Shresta: Emotion-Based Book Recommendation System," demonstrates the practical implementation of this framework, the technologies employed, and the system's usefulness. Furthermore, a user survey conducted after presenting the working model reveals a strong demand for such adaptive interfaces, emphasizing the importance of user satisfaction and comfort in human-computer interaction. The results showed that nearly 85.7% of the users found these systems to be very engaging and user-friendly. This study underscores the potential for emotion-driven UI adaptation to improve user experiences across various applications.
The chapter discusses the foundational impact of modern generative AI models on information access (IA) systems. In contrast to traditional AI, the large-scale training and superior data modeling of generative AI models enable them to produce high-quality, human-like responses, which brings brand new opportunities for the development of IA paradigms. In this chapter, we identify and introduce two of them in details, i.e., information generation and information synthesis. Information generation allows AI to create tailored content addressing user needs directly, enhancing user experience with immediate, relevant outputs. Information synthesis leverages the ability of generative AI to integrate and reorganize existing information, providing grounded responses and mitigating issues like model hallucination, which is particularly valuable in scenarios requiring precision and external knowledge. This chapter delves into the foundational aspects of generative models, including architecture, scaling, and training, and discusses their applications in multi-modal scenarios. Additionally, it examines the retrieval-augmented generation paradigm and other methods for corpus modeling and understanding, demonstrating how generative AI can enhance information access systems. It also summarizes potential challenges and fruitful directions for future studies.
While generative artificial intelligence (Gen AI) increasingly transforms academic environments, a critical gap exists in understanding and mitigating human biases in AI interactions, such as anchoring and confirmation bias. This position paper advocates for metacognitive AI literacy interventions to help university students critically engage with AI and address biases across the Human-AI interaction workflows. The paper presents the importance of considering (1) metacognitive support with deliberate friction focusing on human bias; (2) bi-directional Human-AI interaction intervention addressing both input formulation and output interpretation; and (3) adaptive scaffolding that responds to diverse user engagement patterns. These frameworks are illustrated through ongoing work on "DeBiasMe," AIED (AI in Education) interventions designed to enhance awareness of cognitive biases while empowering user agency in AI interactions. The paper invites multiple stakeholders to engage in discussions on design and evaluation methods for scaffolding mechanisms, bias visualization, and analysis frameworks. This position contributes to the emerging field of AI-augmented learning by emphasizing the critical role of metacognition in helping students navigate the complex interaction between human, statistical, and systemic biases in AI use while highlighting how cognitive adaptation to AI systems must be explicitly integrated into comprehensive AI literacy frameworks.
Generative AI has recently had a profound impact on various fields, including daily life, research, and education. To explore its efficient utilization in data-driven materials science, we organized a hackathon -- AIMHack2024 -- in July 2024. In this hackathon, researchers from fields such as materials science, information science, bioinformatics, and condensed matter physics worked together to explore how generative AI can facilitate research and education. Based on the results of the hackathon, this paper presents topics related to (1) conducting AI-assisted software trials, (2) building AI tutors for software, and (3) developing GUI applications for software. While generative AI continues to evolve rapidly, this paper provides an early record of its application in data-driven materials science and highlights strategies for integrating AI into research and education.
High stakes decision-making often requires a continuous interplay between evolving evidence and shifting hypotheses, a dynamic that is not well supported by current AI decision support systems. In this paper, we introduce a mixed-initiative framework for AI assisted decision making that is grounded in the data-frame theory of sensemaking and the evaluative AI paradigm. Our approach enables both humans and AI to collaboratively construct, validate, and adapt hypotheses. We demonstrate our framework with an AI-assisted skin cancer diagnosis prototype that leverages a concept bottleneck model to facilitate interpretable interactions and dynamic updates to diagnostic hypotheses.
This paper investigates the impact of artificial intelligence integration on remote operations, emphasising its influence on both distributed and team cognition. As remote operations increasingly rely on digital interfaces, sensors, and networked communication, AI-driven systems transform decision-making processes across domains such as air traffic control, industrial automation, and intelligent ports. However, the integration of AI introduces significant challenges, including the reconfiguration of human-AI team cognition, the need for adaptive AI memory that aligns with human distributed cognition, and the design of AI fallback operators to maintain continuity during communication disruptions. Drawing on theories of distributed and team cognition, we analyse how cognitive overload, loss of situational awareness, and impaired team coordination may arise in AI-supported environments. Based on real-world intelligent port scenarios, we propose research directions that aim to safeguard human reasoning and enhance collaborative decision-making in AI-augmented remote operations.
Real-time reflection plays a vital role in synchronous communication. It enables users to adjust their communication strategies dynamically, thereby improving the effectiveness of their communication. Generative AI holds significant potential to enhance real-time reflection due to its ability to comprehensively understand the current context and generate personalized and nuanced content. However, it is challenging to design the way of interaction and information presentation to support the real-time workflow rather than disrupt it. In this position paper, we present a review of existing research on systems designed for reflection in different synchronous communication scenarios. Based on that, we discuss design implications on how to design human-AI interaction to support reflection in real time.
Flow theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation when a task's difficulty aligns with their skill level. In AI-augmented reasoning, interventions that disrupt the state of cognitive flow can hinder rather than enhance decision-making. This paper proposes a context-aware cognitive augmentation framework that adapts interventions based on three key contextual factors: type, timing, and scale. By leveraging multimodal behavioral cues (e.g., gaze behavior, typing hesitation, interaction speed), AI can dynamically adjust cognitive support to maintain or restore flow. We introduce the concept of cognitive flow, an extension of flow theory in AI-augmented reasoning, where interventions are personalized, adaptive, and minimally intrusive. By shifting from static interventions to context-aware augmentation, our approach ensures that AI systems support deep engagement in complex decision-making and reasoning without disrupting cognitive immersion.
Adapting the User Interface (UI) of software systems to user requirements and the context of use is challenging. The main difficulty consists of suggesting the right adaptation at the right time in the right place in order to make it valuable for end-users. We believe that recent progress in Machine Learning techniques provides useful ways in which to support adaptation more effectively. In particular, Reinforcement learning (RL) can be used to personalise interfaces for each context of use in order to improve the user experience (UX). However, determining the reward of each adaptation alternative is a challenge in RL for UI adaptation. Recent research has explored the use of reward models to address this challenge, but there is currently no empirical evidence on this type of model. In this paper, we propose a confirmatory study design that aims to investigate the effectiveness of two different approaches for the generation of reward models in the context of UI adaptation using RL: (1) by employing a reward model derived exclusively from predictive Human-Computer Interaction (HCI) models (HCI), and (2) by employing predictive HCI models augmented by Human Feedback (HCI&HF). The controlled experiment will use an AB/BA crossover design with two treatments: HCI and HCI&HF. We shall determine how the manipulation of these two treatments will affect the UX when interacting with adaptive user interfaces (AUI). The UX will be measured in terms of user engagement and user satisfaction, which will be operationalized by means of predictive HCI models and the Questionnaire for User Interaction Satisfaction (QUIS), respectively. By comparing the performance of two reward models in terms of their ability to adapt to user preferences with the purpose of improving the UX, our study contributes to the understanding of how reward modelling can facilitate UI adaptation using RL.
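To make the two treatments concrete, here is a hedged sketch contrasting a reward computed from a predictive HCI model alone (HCI) with one blended with explicit human feedback (HCI&HF); the predictive model, rating scale, and blending weight are assumptions, not the study's instruments.

```python
# Two reward-model variants for RL-driven UI adaptation.
def hci_reward(adaptation, context):
    """Stand-in for a predictive HCI model, e.g. an estimated engagement score in [0, 1]."""
    score = 0.5
    if adaptation["font_scale"] >= 1.2 and context["user_age"] > 60:
        score += 0.3   # larger text predicted to help older users
    if adaptation["menu_depth"] > 2:
        score -= 0.2   # deep menus predicted to hurt engagement
    return max(0.0, min(1.0, score))

def hci_hf_reward(adaptation, context, human_rating, alpha=0.5):
    """HCI&HF: convex combination of the model estimate and a normalized human rating (1-5)."""
    return (1 - alpha) * hci_reward(adaptation, context) + alpha * (human_rating - 1) / 4

context = {"user_age": 67}
adaptation = {"font_scale": 1.3, "menu_depth": 2}
print(hci_reward(adaptation, context))                      # reward model 1 (HCI)
print(hci_hf_reward(adaptation, context, human_rating=4))   # reward model 2 (HCI&HF)
```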
3D Mixed Reality interfaces have nearly unlimited space for layout placement, making automatic UI adaptation crucial for enhancing the user experience. Such adaptation is often formulated as a multi-objective optimization (MOO) problem, where multiple, potentially conflicting design objectives must be balanced. However, selecting a final layout is challenging since MOO typically yields a set of trade-offs along a Pareto frontier. Prior approaches often required users to manually explore and evaluate these trade-offs, a time-consuming process that disrupts the fluidity of interaction. To eliminate this manual and laborious step, we propose a novel optimization approach that efficiently determines user preferences from a minimal number of UI element adjustments. The resulting rankings are translated into priority levels, which then drive our priority-based MOO algorithm. By focusing the search on user-preferred solutions, our method not only identifies UIs that are more aligned with user preferences, but also automatically selects the final design from the Pareto frontier; ultimately, it minimizes user effort while ensuring personalized layouts. Our user study in a Mixed Reality setting demonstrates that our preference-guided approach significantly reduces manual adjustments compared to traditional methods, including fully manual design and exhaustive Pareto front searches, while maintaining high user satisfaction. We believe this work opens the door for more efficient MOO by seamlessly incorporating user preferences.
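For intuition, a small sketch of the selection step follows: an inferred priority order over objectives drives a lexicographic pick from a precomputed Pareto set. The objectives, candidate layouts, tolerance, and the way priorities are derived from adjustments are illustrative assumptions, not the paper's optimizer.

```python
# Lexicographic selection from a Pareto set under user-derived objective priorities.
OBJECTIVES = ["reachability", "visibility", "consistency"]   # all to be maximized

pareto_set = [  # Pareto-optimal candidates with objective scores in [0, 1]
    {"reachability": 0.9, "visibility": 0.6, "consistency": 0.7},
    {"reachability": 0.7, "visibility": 0.9, "consistency": 0.6},
    {"reachability": 0.6, "visibility": 0.7, "consistency": 0.95},
]

# Suppose a few manual adjustments revealed the user cares most about visibility.
priority_order = ["visibility", "reachability", "consistency"]

def lexicographic_pick(candidates, order, tolerance=0.05):
    """Filter by each objective in priority order, keeping candidates within `tolerance`
    of the best remaining score before moving on to the next objective."""
    remaining = list(candidates)
    for obj in order:
        best = max(c[obj] for c in remaining)
        remaining = [c for c in remaining if c[obj] >= best - tolerance]
        if len(remaining) == 1:
            break
    return remaining[0]

print(lexicographic_pick(pareto_set, priority_order))
```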
This position paper outlines a new approach to adapting 3D user interface (UI) layouts given the complex nature of end-user preferences. Current optimization techniques, which mainly rely on weighted sum methods, can be inflexible and result in unsatisfactory adaptations. We propose using multi-objective optimization and interactive preference elicitation to provide semi-automated, flexible, and effective adaptations of 3D UIs. Our approach is demonstrated using an example of single-element 3D layout adaptation with ergonomic objectives. Future work is needed to address questions around the presentation and selection of optimal solutions, the impact on cognitive load, and the integration of preference learning. We conclude that, to make adaptive 3D UIs truly effective, we must acknowledge the limitations of our optimization objectives and techniques and emphasize the importance of user control.
Front-end personalization has traditionally relied on static designs or rule-based adaptations, which fail to fully capture user behavior patterns. This paper presents an AI driven approach for dynamic front-end personalization, where UI layouts, content, and features adapt in real-time based on predicted user behavior. We propose three strategies: dynamic layout adaptation using user path prediction, content prioritization through reinforcement learning, and a comparative analysis of AI-driven vs. rule-based personalization. Technical implementation details, algorithms, system architecture, and evaluation methods are provided to illustrate feasibility and performance gains.
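As an example of the first strategy, the sketch below uses a first-order Markov model over page transitions to predict the next view so its layout can be prepared in advance; the pages and sessions are invented, and the paper's predictor may be richer.

```python
# First-order Markov next-page prediction from session logs.
from collections import defaultdict, Counter

sessions = [
    ["home", "catalog", "product", "cart"],
    ["home", "search", "product", "cart", "checkout"],
    ["home", "catalog", "product", "reviews"],
]

transitions = defaultdict(Counter)
for session in sessions:
    for current, nxt in zip(session, session[1:]):
        transitions[current][nxt] += 1

def predict_next(page):
    """Most likely next page, used to pre-render or prioritize that layout."""
    if not transitions[page]:
        return None
    return transitions[page].most_common(1)[0][0]

print(predict_next("product"))   # -> 'cart', so the cart layout can be prepared ahead of time
```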
Designing user interfaces (UIs) is a critical step when launching products, building portfolios, or personalizing projects, yet end users without design expertise often struggle to articulate their intent and to trust design choices. Existing example-based tools either promote broad exploration, which can cause overwhelm and design drift, or require adapting a single example, risking design fixation. We present UI Remix, an interactive system that supports mobile UI design through an example-driven design workflow. Powered by a multimodal retrieval-augmented generation (MMRAG) model, UI Remix enables iterative search, selection, and adaptation of examples at both the global (whole interface) and local (component) level. To foster trust, it presents source transparency cues such as ratings, download counts, and developer information. In an empirical study with 24 end users, UI Remix significantly improved participants' ability to achieve their design goals, facilitated effective iteration, and encouraged exploration of alternative designs. Participants also reported that source transparency cues enhanced their confidence in adapting examples. Our findings suggest new directions for AI-assisted, example-driven systems that empower end users to design with greater control, trust, and openness to exploration.
ReDemon UI synthesizes React applications from user demonstrations, enabling designers and non-expert programmers to create UIs that integrate with standard UI prototyping workflows. Users provide a static mockup sketch with event handler holes and demonstrate desired runtime behaviors by interacting with the rendered mockup and editing the sketch. ReDemon UI identifies reactive data and synthesizes a React program with correct state update logic. We utilize enumerative synthesis for simple UIs and LLMs for more complex UIs.
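A toy illustration of the enumerative-synthesis idea (not ReDemon UI's actual algorithm): enumerate small candidate update expressions for a counter's click handler and keep the one consistent with the demonstrated state trace.

```python
# Enumerative synthesis over a tiny expression grammar, checked against a demonstration.
# Demonstration: starting state 0, after three clicks the user showed states 1, 2, 3.
demonstrated_trace = [0, 1, 2, 3]

CANDIDATES = {  # candidate update expressions over the previous state
    "state + 1": lambda s: s + 1,
    "state - 1": lambda s: s - 1,
    "state * 2": lambda s: s * 2,
    "0":         lambda s: 0,
}

def consistent(update, trace):
    """True if applying the update to each state reproduces the next demonstrated state."""
    return all(update(prev) == nxt for prev, nxt in zip(trace, trace[1:]))

for expr, fn in CANDIDATES.items():
    if consistent(fn, demonstrated_trace):
        print(f"synthesized handler: setState(state => {expr})")
        break
```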
Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they are still facing challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding on the modern complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly-sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks can improve accuracy and quality for both tasks. Code and Model: https://github.com/neovateai/UI-UG
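For reference, the standard DPO objective mentioned above can be sketched on placeholder sequence log-probabilities; the UI-UG training data, model, and hyperparameters are not reproduced here.

```python
# Standard DPO loss: prefer the human-preferred UI over the rejected one, relative to a
# frozen reference policy.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Placeholder sequence log-probs for a batch of (preferred UI DSL, rejected UI DSL) pairs.
logp_c, logp_r = torch.tensor([-32.0, -41.0]), torch.tensor([-35.0, -40.0])
ref_c, ref_r = torch.tensor([-33.0, -42.0]), torch.tensor([-34.0, -41.0])
print(float(dpo_loss(logp_c, logp_r, ref_c, ref_r)))
```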
The importance of computational modeling of mobile user interfaces (UIs) is undeniable. However, these require a high-quality UI dataset. Existing datasets are often outdated, collected years ago, and are frequently noisy with mismatches in their visual representation. This presents challenges in modeling UI understanding in the wild. This paper introduces a novel approach to automatically mine UI data from Android apps, leveraging Large Language Models (LLMs) to mimic human-like exploration. To ensure dataset quality, we employ the best practices in UI noise filtering and incorporate human annotation as a final validation step. Our results demonstrate the effectiveness of LLMs-enhanced app exploration in mining more meaningful UIs, resulting in a large dataset MUD of 18k human-annotated UIs from 3.3k apps. We highlight the usefulness of MUD in two common UI modeling tasks: element detection and UI retrieval, showcasing its potential to establish a foundation for future research into high-quality, modern UIs.
Grasp User Interfaces (grasp UIs) enable dual-tasking in XR by allowing interaction with digital content while holding physical objects. However, current grasp UI design practices face a fundamental challenge: existing approaches either capture user preferences through labor-intensive elicitation studies that are difficult to scale or rely on biomechanical models that overlook subjective factors. We introduce GraspR, the first computational model that predicts user preferences for single-finger microgestures in grasp UIs. Our data-driven approach combines the scalability of computational methods with human preference modeling, trained on 1,520 preferences collected via a two-alternative forced choice paradigm across eight participants and four frequently used grasp variations. We demonstrate GraspR's effectiveness through a working prototype that dynamically adjusts interface layouts across four everyday tasks. We release both the dataset and code to support future research in adaptive grasp UIs.
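A hedged sketch of how a preference model can be fit from two-alternative forced-choice data: a Bradley-Terry style logistic regression on feature differences between the two presented options. The features and data are synthetic placeholders, not the GraspR model or dataset.

```python
# Fit preference weights from simulated 2AFC trials, then rank new candidates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 400, 4                       # trials, gesture features (e.g., finger travel, comfort proxy)
true_w = np.array([1.5, -2.0, 0.5, 0.0])

A = rng.normal(size=(n, d))         # features of option A in each trial
B = rng.normal(size=(n, d))         # features of option B
diff = A - B
p_prefers_A = 1 / (1 + np.exp(-diff @ true_w))
y = (rng.random(n) < p_prefers_A).astype(int)   # 1 = participant chose A

model = LogisticRegression(fit_intercept=False).fit(diff, y)
print(np.round(model.coef_[0], 2))  # recovered weights, roughly proportional to true_w

# Rank new candidate gestures by predicted utility (higher = more preferred).
candidates = rng.normal(size=(3, d))
print(np.argsort(candidates @ model.coef_[0])[::-1])
```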
User interface (UI) agents promise to make inaccessible or complex UIs easier to access for blind and low-vision (BLV) users. However, current UI agents typically perform tasks end-to-end without involving users in critical choices or making them aware of important contextual information, thus reducing user agency. For example, in our field study, a BLV participant asked to buy the cheapest available sparkling water, and the agent automatically chose one from several equally priced options, without mentioning alternative products with different flavors or better ratings. To address this problem, we introduce Morae, a UI agent that automatically identifies decision points during task execution and pauses so that users can make choices. Morae uses large multimodal models to interpret user queries alongside UI code and screenshots, and prompt users for clarification when there is a choice to be made. In a study over real-world web tasks with BLV participants, Morae helped users complete more tasks and select options that better matched their preferences, as compared to baseline agents, including OpenAI Operator. More broadly, this work exemplifies a mixed-initiative approach in which users benefit from the automation of UI agents while being able to express their preferences.
We present UI-Venus, a native UI agent that takes only screenshots as input based on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summarization and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which we hope will encourage further research and development in the community. Code is available at https://github.com/inclusionAI/UI-Venus.
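As a generic illustration of a grounding reward of the kind used in reinforcement fine-tuning (the report's actual reward functions are more carefully designed and are not reproduced here), a sketch combining a format term with a hit term:

```python
# Toy grounding reward: small credit for emitting well-formed coordinates, larger credit
# when the predicted click point lands inside the target bounding box.
import re

def parse_point(model_output: str):
    """Expect coordinates like '(x=0.42, y=0.81)' in normalized screen space."""
    m = re.search(r"x=([\d.]+).*?y=([\d.]+)", model_output)
    return (float(m.group(1)), float(m.group(2))) if m else None

def grounding_reward(model_output: str, target_box):
    """target_box = (x0, y0, x1, y1), normalized. Format reward 0.2, hit reward 0.8."""
    point = parse_point(model_output)
    if point is None:
        return 0.0
    x0, y0, x1, y1 = target_box
    hit = x0 <= point[0] <= x1 and y0 <= point[1] <= y1
    return 0.2 + (0.8 if hit else 0.0)

print(grounding_reward("click at (x=0.42, y=0.81)", (0.35, 0.75, 0.55, 0.90)))   # 1.0
print(grounding_reward("click somewhere on the left", (0.35, 0.75, 0.55, 0.90)))  # 0.0
```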
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
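The aspect-ratio-based division described above is straightforward to sketch; the following uses Pillow and a blank placeholder image in place of a real screenshot.

```python
# Split a screenshot into two sub-images along its longer axis before encoding.
from PIL import Image

def split_screen(img: Image.Image):
    w, h = img.size
    if h >= w:   # portrait: horizontal division into top and bottom halves
        return [img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))]
    else:        # landscape: vertical division into left and right halves
        return [img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))]

screenshot = Image.new("RGB", (1080, 2340), "white")   # stand-in for a real portrait screenshot
subs = split_screen(screenshot)
print([s.size for s in subs])   # [(1080, 1170), (1080, 1170)]
```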
The recent advances in Large Language Models (LLMs) have stimulated interest among researchers and industry professionals, particularly in their application to tasks concerning mobile user interfaces (UIs). This position paper investigates the use of LLMs for UI layout generation. Central to our exploration is the introduction of UI grammar -- a novel approach we proposed to represent the hierarchical structure inherent in UI screens. The aim of this approach is to guide the generative capacities of LLMs more effectively and improve the explainability and controllability of the process. Initial experiments conducted with GPT-4 showed the promising capability of LLMs to produce high-quality user interfaces via in-context learning. Furthermore, our preliminary comparative study suggested the potential of the grammar-based approach in improving the quality of generative results in specific aspects.
Texts, widgets, and images on a UI page do not work in isolation. Instead, they are partitioned into groups that deliver particular interaction functions or convey visual information. Existing studies on grouping UI elements mainly focus on a specific single UI-related software engineering task, and their groups vary in appearance and function. To address this, we propose semantic component groups that pack adjacent text and non-text elements with similar semantics. In contrast to those task-oriented grouping methods, our semantic component groups can be adopted for multiple UI-related software tasks, such as retrieving UI perceptual groups, improving code structure for automatic UI-to-code generation, and generating accessibility data for screen readers. To recognize semantic component groups on a UI page, we propose a robust, deep learning-based vision detector, UISCGD, which extends the SOTA deformable-DETR by incorporating UI element color representation and a learned prior on group distribution. The model is trained on our dataset of 1,988 mobile GUI screenshots from more than 200 apps on both iOS and Android platforms. The evaluation shows that UISCGD performs 6.1% better than the best baseline algorithm and 5.4% better than deformable-DETR, on which it is based.
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate response quality, and showcase its applicability to multi-step UI navigation and planning.
Humans can learn to operate the user interface (UI) of an application by reading an instruction manual or how-to guide. Along with text, these resources include visual content such as UI screenshots and images of application icons referenced in the text. We explore how to leverage this data to learn generic visio-linguistic representations of UI screens and their components. These representations are useful in many real applications, such as accessibility, voice navigation, and task automation. Prior UI representation models rely on UI metadata (UI trees and accessibility labels), which is often missing, incompletely defined, or not accessible. We avoid such a dependency, and propose Lexi, a pre-trained vision and language model designed to handle the unique features of UI screens, including their text richness and context sensitivity. To train Lexi we curate the UICaption dataset consisting of 114k UI images paired with descriptions of their functionality. We evaluate Lexi on four tasks: UI action entailment, instruction-based UI image retrieval, grounding referring expressions, and UI entity recognition.
Recent popularity of Large Language Models (LLMs) has opened countless possibilities in automating numerous AI tasks by connecting LLMs to various domain-specific models or APIs, where LLMs serve as dispatchers while domain-specific models or APIs are action executors. Despite the vast numbers of domain-specific models/APIs, they still struggle to comprehensively cover super diverse automation demands in the interaction between human and User Interfaces (UIs). In this work, we build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor. This metadata-free grounding model, consisting of a visual encoder and a language decoder, is first pretrained on well studied document understanding tasks and then learns to decode spatial information from UI screenshots in a promptable way. To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm to predict geometric coordinates in a sequence of tokens using a language decoder. We further propose an innovative Reinforcement Learning (RL) based algorithm to supervise the tokens in such sequence jointly with visually semantic metrics, which effectively strengthens the spatial decoding capability of the pixel-to-sequence paradigm. Extensive experiments demonstrate our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin and shows the potential as a generic UI task automation API.
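A small sketch of the pixel-to-sequence idea: quantize normalized box coordinates into a vocabulary of location tokens that a language decoder can emit, and map them back. The bin count and token naming are illustrative choices, not the paper's tokenizer.

```python
# Quantize normalized coordinates to discrete location tokens and decode them back.
N_BINS = 1000

def coords_to_tokens(box):
    """box = (x0, y0, x1, y1) in [0, 1] -> token strings such as '<loc_120>'."""
    return [f"<loc_{min(int(round(c * (N_BINS - 1))), N_BINS - 1)}>" for c in box]

def tokens_to_coords(tokens):
    return [int(t[5:-1]) / (N_BINS - 1) for t in tokens]

box = (0.12, 0.30, 0.47, 0.38)
tokens = coords_to_tokens(box)
print(tokens)                       # ['<loc_120>', '<loc_300>', '<loc_470>', '<loc_380>']
print(tokens_to_coords(tokens))     # close to the original box up to quantization error
```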
Healthcare professionals need effective ways to use, understand, and validate AI-driven clinical decision support systems. Existing systems face two key limitations: complex visualizations and a lack of grounding in scientific evidence. We present an integrated decision support system that combines interactive visualizations with a conversational agent to explain diabetes risk assessments. We propose a hybrid prompt handling approach combining fine-tuned language models for analytical queries with general Large Language Models (LLMs) for broader medical questions, a methodology for grounding AI explanations in scientific evidence, and a feature range analysis technique to support deeper understanding of feature contributions. We conducted a mixed-methods study with 30 healthcare professionals and found that the conversational interactions helped healthcare professionals build a clear understanding of model assessments, while the integration of scientific evidence calibrated trust in the system's decisions. Most participants reported that the system supported both patient risk evaluation and recommendation.
Many conversational user interfaces facilitate linear conversations with turn-based dialogue, similar to face-to-face conversations between people. However, digital conversations can afford more than simple back-and-forth; they can be layered with interaction techniques and structured representations that scaffold exploration, reflection, and shared understanding between users and AI systems. We introduce Feedstack, a speculative interface that augments feedback conversations with layered affordances for organizing, navigating, and externalizing feedback. These layered structures serve as a shared representation of the conversation that can surface user intent and reveal underlying design principles. This work represents an early exploration of this vision using a research-through-design approach. We describe system features and design rationale, and present insights from two formative (n=8, n=8) studies to examine how novice designers engage with these layered supports. Rather than presenting a conclusive evaluation, we reflect on Feedstack as a design probe that opens up new directions for conversational feedback systems.
The conversational search task aims to enable a user to resolve information needs via natural language dialogue with an agent. In this paper, we aim to develop a conceptual framework of the actions and intents of users and agents explaining how these actions enable the user to explore the search space and resolve their information need. We outline the different actions and intents, before discussing key decision points in the conversation where the agent needs to decide how to steer the conversational search process to a successful and/or satisfactory conclusion. Essentially, this paper provides a conceptualization of the conversational search process between an agent and user, which provides a framework and a starting point for research, development and evaluation of conversational search agents.
Thanks to the powerful language comprehension capabilities of Large Language Models (LLMs), existing instruction-based image editing methods have introduced Multimodal Large Language Models (MLLMs) to promote information exchange between instructions and images, ensuring the controllability and flexibility of image editing. However, these frameworks often build a multi-instruction dataset to train the model to handle multiple editing tasks, which is not only time-consuming and labor-intensive but also fails to achieve satisfactory results. In this paper, we present TalkPhoto, a versatile training-free image editing framework that facilitates precise image manipulation through conversational interaction. We instruct the open-source LLM with a specially designed prompt template to analyze user needs after receiving instructions and hierarchically invoke existing advanced editing methods, all without additional training. Moreover, we implement a plug-and-play and efficient invocation of image editing methods, allowing complex and unseen editing tasks to be integrated into the current framework, achieving stable and high-quality editing results. Extensive experiments demonstrate that our method not only provides more accurate invocation with lower token consumption but also achieves higher editing quality across various image editing tasks.
The advent of LLMs means that CUIs are cool again, but what isn't so cool is that we're doomed to use them alone. The one user, one account, one device paradigm has dominated the design of CUIs and is not going away as new conversational technologies emerge. In this provocation we explore some of the technical, legal, and design difficulties that seem to make multi-user CUIs so difficult to implement. Drawing inspiration from the ways that people manage messy group discussions, such as parliamentary and consensus-based paradigms, we show how LLM-based CUIs might be well suited to bridging the gap. With any luck, this might even result in everyone having to sit through fewer poorly run meetings and agonising group discussions - truly a laudable goal!
We introduce Brain-Artificial Intelligence Interfaces (BAIs) as a new class of Brain-Computer Interfaces (BCIs). Unlike conventional BCIs, which rely on intact cognitive capabilities, BAIs leverage the power of artificial intelligence to replace parts of the neuro-cognitive processing pipeline. BAIs allow users to accomplish complex tasks by providing high-level intentions, while a pre-trained AI agent determines low-level details. This approach enlarges the target audience of BCIs to individuals with cognitive impairments, a population often excluded from the benefits of conventional BCIs. We present the general concept of BAIs and illustrate the potential of this new approach with a Conversational BAI based on EEG. In particular, we show in an experiment with simulated phone conversations that the Conversational BAI enables complex communication without the need to generate language. Our work thus demonstrates, for the first time, the ability of a speech neuroprosthesis to enable fluent communication in realistic scenarios with non-invasive technologies.
Objective: The article investigates the integration of advanced Generative Pretrained Transformer (GPT) models into a user-friendly Graphical User Interface (GUI). The primary objective of this work is to simplify access to complex Natural Language Processing (NLP) tasks for a diverse range of users, including those with limited technical background. Methods: The development process of the GUI was comprehensive and systematic:
- Needs Assessment: understanding the requirements and expectations of potential users to ensure the GUI effectively addresses their needs.
- Preliminary Design and Development: initial designs were created and developed into a functional GUI, emphasizing the integration of features supporting various NLP tasks such as text summarization, translation, and question answering.
- Iterative Refinement: continuous improvements were made based on user feedback, focusing on enhancing user experience, ease of navigation, and customization capabilities.
Results: The developed GUI successfully integrated GPT models, including GPT-4 Turbo and GPT-3.5, resulting in an intuitive and adaptable interface. It demonstrated efficiency in performing various NLP tasks, thereby making these advanced language processing tools accessible to a broader audience. The GUI's design, emphasizing user-friendliness and adaptability, was particularly noted for its ability to cater to both technical and non-technical users. Conclusion: The article illustrates the significant impact of combining advanced GPT models with a Graphical User Interface to democratize the use of NLP tools. This integration not only makes complex language processing more accessible but also marks a pivotal step in the inclusive application of AI technology across various domains. The successful implementation of the GUI highlights the potential of AI in enhancing user interaction and broadening the scope of technology usage in everyday tasks.
The final grouping constructs a complete research map spanning low-level technology to high-level ethics. It covers MLLM-centered UI perception modeling, instant layout synthesis with generative algorithms, and the evolution toward conversational, multimodal interaction paradigms. The report also examines the role of agents in automated reconstruction, as well as adaptive optimization in complex spatial environments such as mixed reality. Finally, by integrating cognitive theory, ethics and safety, and vertical-domain practice, it emphasizes that AI-driven interface reconstruction is moving toward a context-aware, human-centered, and domain-augmented intelligent ecosystem.