Digital Human Motion Generation
Speech-Driven Facial Expression and Lip-Sync Generation
This group of publications focuses on driving digital human facial motion from audio signals, covering lip synchronization, emotional expression, and approaches for generating talking-head videos from a single image.
- Data-Driven Expressive 3D Facial Animation Synthesis for Digital Humans(Kazi Injamamul Haque, 2023, SIGGRAPH Asia 2023 Doctoral Consortium)
- VQ-VAE Based Audio-Driven Talking Face Generation from a Single Image(Yixin Li, Xizhong Shen, 2025, 2025 5th International Conference on Artificial Intelligence, Virtual Reality and Visualization (AIVRV))
- Audio-driven single image talking face animation with transformers(Yixin Li, Xizhong Shen, 2026, Scientific Reports)
- Virtual conversation with a real talking head(O. Gambino, A. Augello, A. Caronia, G. Pilato, R. Pirrone, S. Gaglio, 2008, 2008 Conference on Human System Interactions)
- Comparative Study of Digital Sibling Video AI Platform(Leonard Mars Kurniaputra, R. Ferdiana, L. Nugroho, 2025, 2025 International Conference on Metaverse Computing, Networking and Applications (MetaCom))
- Svara Rachana - Audio Driven Facial Expression Synthesis(Karan Khandelwal, Krishiv Pandita, Kshitij Priyankar, Kumar Parakram, T. K, 2024, International Journal for Research in Applied Science and Engineering Technology)
Full-Body Motion, Gesture, and Complex Behavior Synthesis
These papers explore full-body motion generation for digital humans, including semantics-driven motion stitching, dance motion control, grasping behavior simulation, and gesture generation during conversation, with an emphasis on naturalness and coherence of motion.
- Semantic-Driven 2D Pose Stitching for Low-Cost and Controllable Digital Human Animation(Ge Cheng, Yun Zhang, Pengyuan Xie, 2025, Proceedings of the 2025 International Conference on Generative Artificial Intelligence for Business)
- 3D Human Animation Synthesis based on a Temporal Diffusion Generative Model(Baoping Cheng, Wenke Feng, Qinghang Wu, Jie Chen, Zhibiao Cai, Yemeng Zhang, Sheng Wang, Bin Che, 2024, 2024 2nd International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms (PRMVIA))
- Emotion control of unstructured dance movements(A. Aristidou, Qiong Zeng, E. Stavrakis, KangKang Yin, D. Cohen-Or, Y. Chrysanthou, Baoquan Chen, 2017, Proceedings of the ACM SIGGRAPH / Eurographics Symposium on Computer Animation)
- Target Pose Guided Whole-body Grasping Motion Generation for Digital Humans(Quanquan Shao, Yi Fang, 2024, 2024 International Conference on Advanced Robotics and Mechatronics (ICARM))
- SSGesture: Multimodal Gesture Generation Framework for Human Animation Synthesis(Xinyi Wang, Shiguang Liu, Xu Yang, 2025, IEEE Computer Graphics and Applications)
- Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss(Qifan Fu, Xiaohang Yang, Muhammad Asad, Changjae Oh, Shanxin Yuan, Gregory G. Slabaugh, 2024, No journal)
- A Virtual Modeling Method of Digital Media Image Synchronization Based on Motion Hybrid Algorithm(Yanyan Yang, Zhiping Wang, Caixia Yang, H. Zhu, 2021, Journal of Physics: Conference Series)
Interactive Systems Driven by Large Language Models and Multimodal Signals
This group studies how to combine large language models (e.g., ChatGPT/GPT-4) with digital human motion generation to build intelligent interactive digital humans with emotional responsiveness, real-time dialogue, and contextual understanding.
- Application of ChatGPT-Based Digital Human in Animation Creation(Chong-yu Lan, Yongsheng Wang, Chengze Wang, Shirong Song, Zheng Gong, 2023, Future Internet)
- Development of an Interactive Digital Human with Context-Sensitive Facial Expressions(Fan Yang, Lei Fang, R. Suo, Jing Zhang, Mincheol Whang, 2025, Sensors)
- Digital Human in an Integrated Physical-Digital World (IPhD)(Zhengyou Zhang, 2021, Proceedings of the 29th ACM International Conference on Multimedia)
- GenAI-Powered Multilingual Digital Human: An Intelligent Conversational Companion for Enhancing Elderly Mental and Emotional Well-being(Sanika Deshpande, Supriya Kelkar, 2025, 2025 International Conference on Sustainable Technologies for Humanity and Smart World (HSWTech))
- Toward Industry 5.0: Evaluating Multimodal Virtual Human Interaction for Smart Healthcare in Simulated VR Environments(Han Yang, Qiuyu Tian, Xiaowen Gu, 2025, Internet Technology Letters)
Animatable Digital Human Modeling and Dynamic Appearance Reconstruction
This group focuses on building digital human assets, including reconstructing animatable 3D human models from a single image or monocular video, handling garment deformation, and generating cartoon-stylized faces.
- Creative Cartoon Face Synthesis System for Mobile Entertainment(Junfa Liu, Yiqiang Chen, Wen Gao, Rong Fu, Renqin Zhou, 2005, Lecture Notes in Computer Science)
- Design of Virtual Digital Human Image and Interaction for Elementary School Students' Ecological Education(Mingyue Wang, Nahua Shi, Yuanrong Zhao, Han Li, Qian Liu, 2025, 2025 6th International Conference on Information Science and Education (ICISE-IE))
- Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis(Tuanfeng Y. Wang, Duygu Ceylan, Krishna Kumar Singh, N. Mitra, 2021, 2021 International Conference on 3D Vision (3DV))
- Photo2Avatar: Single-Image to Animatable 3D Human Avatar via Multi-View Synthesis and Face-Aware Consistency Enhancement(Wengang Zhong, Yu Ni, Weimin Lei, Wei Zhang, 2025, 2025 International Conference on Virtual Reality and Visualization (ICVRV))
- DFGA: Digital Human Faces Generation and Animation from the RGB Video using Modern Deep Learning Technology(Diqiong Jiang, Li You, Jian Chang, Ruofeng Tong, 2022, Pacific Graphics Short Papers, Posters, and Work-in-Progress Papers)
- D3-Human: Dynamic Disentangled Digital Human from Monocular Video(Honghu Chen, Bo Peng, Yunfan Tao, Juyong Zhang, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
Motion Generation Frameworks, Evaluation Methodology, and Industry Applications
These publications address optimization of the underlying algorithmic frameworks for digital human motion generation (e.g., diffusion models and VAEs), performance evaluation standards, and applied practice in specific domains such as education and healthcare.
- Efficient multi-constrained optimization for example-based synthesis(Stefan Hartmann, E. Trunz, Björn Krüger, Reinhard Klein, M. Hullin, 2015, The Visual Computer)
- Combining heterogeneous digital human simulations: presenting a novel co-simulation approach for incorporating different character animation technologies(Felix Gaisbauer, Eva Lampen, Philipp Agethen, E. Rukzio, 2020, The Visual Computer)
- Deep Learning-Driven Animation: Enhancing Real-Time Character Motion Synthesis(Qi Li, Tianyi Sun, Meidi Zhang, 2025, IEEE Access)
- “Wild West” of Evaluating Speech‐Driven 3D Facial Animation Synthesis: A Benchmark Study(Kazi Injamamul Haque, Alkiviadis Pavlou, Zerrin Yumak, 2025, Computer Graphics Forum)
- P‐3.10: Research on Key Technologies of Virtual Digital Human(Songzhen Sang, Wanlin Li, 2025, SID Symposium Digest of Technical Papers)
- Digital Human Technology in E-Learning: Custom Content Solutions(Sinan Chen, Liuyi Yang, Yue Zhang, Miao Zhang, Yangmei Xie, Zhiyi Zhu, Jialong Li, 2025, Applied Sciences)
Research on digital human motion generation is evolving from isolated lip synchronization toward deep multimodal integration. Current directions concentrate on: 1) using generative AI (e.g., diffusion models and Transformers) to improve the realism of body and facial motion; 2) combining large language models to build embodied, perceptive, and interactive intelligent digital humans; 3) exploring low-cost, high-quality monocular video/image 3D reconstruction and motion-driving techniques. At the same time, establishing standardized objective metrics and perceptual evaluation protocols has become a key requirement for further progress in the field.
A total of 30 related publications.
This doctoral research focuses on generating expressive 3D facial animation for digital humans by studying and employing data-driven techniques. The face is the first point of interest during human interaction, and interacting with digital humans is no different. Even minor inconsistencies in facial animation can disrupt user immersion. Traditional animation workflows produce realistic results but are time-consuming and labor-intensive, and cannot meet the ever-increasing demand for 3D content in recent years. Moreover, recent data-driven approaches focus on speech-driven lip synchrony, leaving out facial expressiveness that resides throughout the face. To address the emerging demand and reduce production effort, we explore data-driven deep learning techniques for generating controllable, emotionally expressive facial animation. We evaluate the proposed models against state-of-the-art methods and ground truth, quantitatively, qualitatively, and perceptually. We also emphasize the need for non-deterministic approaches in addition to deterministic methods in order to ensure natural randomness in the non-verbal cues of facial animation.
No abstract available
In recent years, the demand for realistic and responsive digital characters has grown rapidly, especially in areas like gaming, virtual reality, and interactive media. However, traditional animation methods often fail to balance realism, flexibility, and efficiency, particularly when generating complex human motion in real-time environments. To address these challenges, we introduce a comprehensive framework that synergizes deep learning architectures with domain-specific strategies to enhance real-time character motion synthesis. Our approach comprises three core components: a formalized problem setup that encapsulates the temporal and stylistic intricacies of animation sequences; the development of the Temporal-Stylistic Latent Animator (TSLA), a novel architecture that integrates variational latent inference with attention-enhanced recurrent dynamics and style-adaptive normalization to ensure high-fidelity synthesis; and the implementation of the Domain-Informed Animation Realignment Strategy (DIARS), which incorporates narrative graph embeddings and character role anchoring to maintain semantic consistency and stylistic coherence across sequences. Empirical evaluations demonstrate that the framework significantly outperforms existing methods in tasks such as animation completion, style transfer, and semantic editing, thereby contributing to the advancement of computational animation research within the realms of computer graphics and visualization.
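As a rough illustration of the style-adaptive normalization idea mentioned in the abstract above, the sketch below shows a generic AdaIN-style layer: features are normalized per channel and then re-scaled and re-shifted by parameters predicted from a style code. This is an assumption about the general mechanism, not the paper's exact layer.

```python
import numpy as np

def style_adaptive_norm(features, style_scale, style_shift, eps=1e-5):
    """AdaIN-style normalization: per-channel normalize, then apply style.

    features: (T, C) frame features; style_scale / style_shift: (C,) vectors,
    typically predicted from a style embedding by a small network (assumed here).
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    normalized = (features - mean) / (std + eps)
    return style_scale * normalized + style_shift
```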
Reconstructing animatable 3D human avatars from minimal visual input is a challenging task in digital human modeling and virtual content creation. Existing methods predominantly rely on multi-view observations or monocular videos, limiting their applicability in image-sparse scenarios. We introduce Photo2Avatar, a unified pipeline that reconstructs a 3D human avatar from a single image while supporting SMPL-X-driven animation. The method first synthesizes a dense set of multi-view images guided by parametric body priors, enabling spatially consistent supervision from a monocular portrait. A face-aware consistency enhancement module is then applied to improve identity preservation and cross-view coherence, particularly in facial regions. These refined views are used to supervise an animatable avatar learner under a differentiable rendering objective, allowing motion-conditioned geometry and appearance learning with minimal input. Extensive experiments demonstrate that Photo2Avatar achieves superior identity consistency, visual quality, and animation controllability compared to existing single-image baselines. The proposed method offers a practical solution for one-shot digital human reconstruction and bridges the gap between static image perception and dynamic 3D avatar animation. Project page: https://photo2avatar.github.io/.
With the rapid development of digital technologies such as VR, AR, XR, and more importantly the almost ubiquitous mobile broadband coverage, we are entering an Integrated Physical-Digital World (IPhD), the tight integration of the virtual world with the physical world. The IPhD is characterized by four key technologies: virtualization of the physical world, realization of the virtual world, holographic internet, and intelligent agents. The internet will continue its development with faster speeds and broader bandwidth, and will eventually be able to communicate holographic content including 3D shape, appearance, spatial audio, touch sensing, and smell. Intelligent agents, such as digital humans and digital/physical robots, travel between the digital and physical worlds. In this talk, we will describe our work on digital humans for this IPhD world. This includes: computer vision techniques for building digital humans, multimodal text-to-speech synthesis (voice and lip shapes), speech-driven face animation, neural-network-based body motion control, human-digital-human interaction, and an emotional video game anchor.
Audio-driven talking-head video generation is a critical task in cross-modal expressive synthesis, with applications in virtual humans, digital content creation, and human-computer interaction. Existing methods, however, often suffer from unnatural lip movements and distortions in non-speech facial regions, especially under exaggerated expressions or emotional variations. These issues arise due to the entanglement of linguistic content, prosodic emotion, and speaker-specific attributes within the audio signal. To address these challenges, we propose ExpNet, a Transformer-based expression regression framework that decouples global head motion from local facial expressions using 3DMM coefficients. The method employs a conditional VAE for robust head pose coefficient generation, while a CNN-Transformer architecture regresses expression coefficients. ExpNet introduces ALiBi-based relative positional bias in the self-attention mechanism, which captures long-range dependencies while focusing on local temporal context. It also conditions on the first-frame expression coefficient to preserve identity and emotion consistency throughout the video. Experimental evaluations on multiple datasets, including HDTF, MEAD, and LRS3, demonstrate that the method presented in this paper outperforms existing methods in terms of expression realism, lip synchronization, and video quality. Ablation studies reveal that key components such as ALiBi, landmark supervision, and the Transformer module are crucial for improving temporal stability, reducing lip jitter, and enhancing overall facial animation consistency.
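The ALiBi-style relative positional bias described above can be illustrated with a small, generic sketch: a distance-proportional penalty is added to the self-attention logits so that nearby frames dominate while long-range context remains reachable. The symmetric bias, slope value, and tensor shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def alibi_self_attention(q, k, v, slope=0.5):
    """Scaled dot-product attention with an ALiBi-style linear distance penalty.

    q, k, v: arrays of shape (T, d). The bias penalizes attention between
    distant frames in proportion to |i - j|, keeping the focus on local
    temporal context while still allowing long-range dependencies.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                          # (T, T) raw logits
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    scores = scores - slope * dist                         # linear distance penalty
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: 10 frames of 16-dimensional expression features.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))
print(alibi_self_attention(x, x, x).shape)  # (10, 16)
```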
To support the human‐centric goals of Industry 5.0, this paper proposes a modular framework for constructing low‐cost, high‐efficiency digital humans by combining retrieval‐augmented generation (RAG), large language models (LLMs), and AIGC (AI‐generated content). The framework enables embodied agents capable of reliable reasoning, contextual alignment, and expressive interaction across industrial environments. As a representative application in Industry 5.0 smart healthcare, we deploy three variants—scripted, LLM‐only, and LLM + RAG—in a VR‐based hospital triage simulation, integrating automatic speech recognition, semantic retrieval, neural speech synthesis, facial animation, and gesture generation. A within‐subject user study (n = 45) evaluates task accuracy, perceived naturalness, and response latency. Results show that the LLM + RAG agent significantly outperforms others in both task success (95.1%) and naturalness rating (4.61/5), as assessed via expert consensus and standardized Likert‐based user ratings. These findings demonstrate that retrieval‐enhanced digital humans can combine factual precision, real‐time responsiveness, and multimodal expressiveness—key requirements in high‐stakes, affect‐sensitive domains. While healthcare is the tested use case, the architecture and evaluation protocol offer a reusable foundation for Industry 5.0 applications more broadly, including frontline services, education, and multilingual teleconsultation. The study contributes both a validated design pathway and a repeatable evaluation method for deploying scalable, trustworthy virtual agents in real‐world industrial systems.
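A minimal sketch of the retrieval-augmented generation pattern the framework builds on: embed the user query, rank stored passages by similarity, and prepend the top matches to the prompt handed to the LLM. The `embed` function and the triage knowledge snippets are hypothetical placeholders, not APIs or data from the paper.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical text encoder; stands in for any sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank passages by cosine similarity to the query and keep the top k."""
    q = embed(query)
    sims = [float(q @ embed(p)) for p in passages]
    order = np.argsort(sims)[::-1][:k]
    return [passages[i] for i in order]

def build_prompt(query: str, passages: list[str]) -> str:
    """Ground the LLM response in the retrieved triage knowledge."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages))
    return f"Context:\n{context}\n\nPatient says: {query}\nRespond as a triage assistant."

knowledge = ["Chest pain with shortness of breath is triaged as urgent.",
             "Minor abrasions can be handled at the nursing station.",
             "Fever above 39C in adults warrants a physician visit."]
print(build_prompt("I have chest pain and I can't breathe well", knowledge))
```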
Portrait animation is a hyper-realistic video synthesis technique that produces a talking human video from a static image and driving audio. Portrait animation technology can be implemented as the visualization for Digital Sibling AI, an AI-based platform that can replicate a person as a digital representation able to communicate like an actual human being. Realistic portrait animation is important for enhancing the level of immersion of a digital sibling, hence criteria need to be identified to grade a "good" portrait animation generator. This portrait animation generator is then applied as an integration design for the digital sibling platform. This research defines basic provisions, credibility, performance, and visualization as the main criteria that decide the feasibility of a portrait animation model as a digital-representation visualization. Based on the testing conducted and the analysis using Multi-Criteria Decision Making (MCDM), the SadTalker model is found to be the best portrait animation model according to the defined criteria. The model is then developed into an API and deployed as a container using Azure Container Apps services. The model is invoked via API requests, where the API returns a URL containing the portrait animation video output. Finally, this research suggests an ideal VM specification with an NVIDIA A100 GPU, 1 vCPU core, and 7 GB of memory.
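The Multi-Criteria Decision Making step can be illustrated with a simple weighted-sum ranking; the criterion weights, competitor models, and raw scores below are illustrative assumptions, not the paper's actual evaluation data.

```python
# Weighted-sum MCDM: normalize per-criterion scores, weight them, and rank candidates.
criteria_weights = {"credibility": 0.3, "performance": 0.3,
                    "visual_quality": 0.25, "basic_provisions": 0.15}

candidates = {   # illustrative raw scores per criterion (higher is better)
    "SadTalker":  {"credibility": 8, "performance": 7, "visual_quality": 9, "basic_provisions": 8},
    "Wav2Lip":    {"credibility": 7, "performance": 9, "visual_quality": 6, "basic_provisions": 7},
    "MakeItTalk": {"credibility": 6, "performance": 6, "visual_quality": 7, "basic_provisions": 7},
}

def weighted_score(scores: dict) -> float:
    """Normalize each criterion by the best candidate, then apply the weights."""
    best = {c: max(m[c] for m in candidates.values()) for c in criteria_weights}
    return sum(criteria_weights[c] * scores[c] / best[c] for c in criteria_weights)

ranking = sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)
print(ranking)  # candidate with the highest weighted score first
```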
Svara Rachana is a fusion of artificial intelligence and facial animation that aims to revolutionize the field of digital communication. Harnessing the ever-evolving power of neural networks in the form of a Long Short-Term Memory (LSTM) model, Svara Rachana offers a cutting-edge, interactive web application designed to synchronize human speech with realistic 3D facial animation. Users can record or upload an audio file containing human speech to the web interface, with the core functionality being the generation of synchronized lip movements on a 3D avatar. The system places special emphasis on accuracy in order to generate reliable facial animation. By providing an interactive, human-like 3D model, Svara Rachana aims to make machine-to-human interaction a more impactful experience by blurring the lines between humans and machines.
Grasping manipulation is a fundamental mode of human interaction with everyday objects. The synthesis of grasping motion is also in great demand in many applications such as animation and robotics. In the object-grasping research field, most works focus on generating the final static grasping pose with a parallel gripper or dexterous hand. Grasping motion generation for the full arm, and especially for a full humanlike intelligent agent, is still under-explored. In this work, we propose a grasping motion generation framework for digital humans, anthropomorphic intelligent agents with high degrees of freedom in the virtual world. Given an object's known initial pose in 3D space, we first generate a target pose for the whole-body digital human based on off-the-shelf target grasping pose generation methods. With an initial pose and this generated target pose, a transformer-based neural network is used to generate the whole grasping trajectory, which connects the initial and target poses smoothly and naturally. Additionally, two post-optimization components are designed to mitigate foot skating and hand-object interpenetration, respectively. Experiments are conducted on the GRAB dataset to demonstrate the effectiveness of the proposed method for whole-body grasping motion generation with randomly placed unknown objects.
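The foot-skating mitigation can be illustrated with a generic contact-locking pass (an assumption about the typical approach, not the paper's exact post-optimization): frames where a foot joint is near the ground and nearly stationary are treated as contacts, and the foot is pinned in place for the duration of each contact run.

```python
import numpy as np

def remove_foot_skating(foot_pos, height_thresh=0.05, vel_thresh=0.01):
    """Pin the foot in place during detected contact phases.

    foot_pos: (T, 3) world-space trajectory of one foot joint (y is up).
    A frame is a contact frame when the foot is close to the ground and
    barely moving; within each contact run the position is replaced by the
    position at the start of the run, which removes visible sliding.
    """
    fixed = foot_pos.copy()
    vel = np.linalg.norm(np.diff(foot_pos, axis=0, prepend=foot_pos[:1]), axis=1)
    contact = (foot_pos[:, 1] < height_thresh) & (vel < vel_thresh)
    anchor = None
    for t in range(len(foot_pos)):
        if contact[t]:
            if anchor is None:
                anchor = foot_pos[t].copy()   # start of a new contact run
            fixed[t] = anchor
        else:
            anchor = None
    return fixed
```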
Audio-driven facial animation holds broad promise for applications in virtual avatars, human–computer interaction, and digital media production. This work introduces a new method: a single-image audio-driven facial animation model built upon a Vector Quantized Variational Autoencoder. The model maps voice features to expression parameters, enabling the synthesis of natural and smooth talking-face videos from a single static portrait. Specifically, the VQ-VAE discretizes continuous latent representations into a compact and expressive codebook, thereby improving the audio-to-expression mapping. To enhance temporal coherence, we introduce a temporal smoothing loss that explicitly constrains abrupt expression changes between consecutive frames, while a reconstruction loss is employed to ensure accurate recovery of expression parameters. Furthermore, a conditional VAE framework is adopted to generate diverse and stable head movements by mapping 3D motion coefficients to unsupervised keypoints, which are then used to produce dynamic facial animations. We conducted extensive experiments, and the results indicate that the proposed approach outperforms conventional models regarding expression naturalness and temporal stability, highlighting its potential for improving both the naturalness and controllability of speaker-driven facial animation.
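The reconstruction and temporal smoothing terms described above can be written compactly; a minimal sketch, assuming expression parameters are stacked as a (T, D) array and the weighting factor is a free hyperparameter:

```python
import numpy as np

def talking_face_loss(pred_expr, gt_expr, smooth_weight=0.1):
    """Reconstruction loss plus a penalty on frame-to-frame expression jumps.

    pred_expr, gt_expr: (T, D) expression parameters per frame.
    The smoothing term discourages abrupt changes between consecutive
    frames, which reduces jitter in the rendered talking face.
    """
    recon = np.mean((pred_expr - gt_expr) ** 2)
    smooth = np.mean((pred_expr[1:] - pred_expr[:-1]) ** 2)
    return recon + smooth_weight * smooth
```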
In virtual reality digital media, a motion capture system is used to build a basic motion library, and the captured motions are then processed with motion editing techniques. Motion blending is one of the most practical and complex of these techniques. This paper proposes a real-time synchronization algorithm based on motion blending, so as to blend motion dynamically, avoid unexpected effects, and create complex virtual digital media animation. The paper adopts a hybrid data- and model-driven strategy. Motion generation and control for virtual humans is studied from the following aspects: modeling and simulation of the motion system, virtual human grasping that accounts for changes in whole-body posture, fast generation of operational actions, automatic interactive generation of key-frame postures, and multi-priority editing, synthesis, and interactive control of whole-body motion. Corresponding control strategies and models are proposed, and the traditional algorithm is improved. Experimental results show that the method improves the efficiency and accuracy of digital media generation, providing a reference for research on motion-blending-based synchronous virtual modeling of digital media imagery.
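A minimal sketch of frame-wise motion blending: cross-fade two overlapping pose clips with a smooth weight curve so neither clip cuts in abruptly. The smoothstep weighting and positional (rather than rotational) blending are assumptions for illustration, not the paper's specific algorithm.

```python
import numpy as np

def blend_motions(clip_a, clip_b):
    """Cross-fade two pose clips of equal length, shape (T, J, 3).

    The blend weight rises from 0 to 1 along a smoothstep curve, so clip_a
    dominates at the start and clip_b at the end, avoiding a hard cut.
    Joint positions are blended linearly; rotations would typically use slerp.
    """
    T = clip_a.shape[0]
    t = np.linspace(0.0, 1.0, T)
    w = t * t * (3.0 - 2.0 * t)            # smoothstep easing
    w = w[:, None, None]                   # broadcast over joints and xyz
    return (1.0 - w) * clip_a + w * clip_b
```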
No abstract available
No abstract available
No abstract available
No abstract available
This paper proposes a semantic-driven 2D pose stitching method for low-cost and highly reusable digital human animation generation. The method converts natural language scripts into semantic action labels, retrieves and stitches action segments from a 2D pose fragment library (PoseDB), and uses transition frame interpolation and structural regularization to generate coherent and natural animation sequences. The method features low computational cost, structured control, and strong scalability, suitable for applications in education, advertising, and product demonstrations. Experimental results show that this method outperforms traditional approaches in naturalness, continuity, and controllability, providing an efficient and easily integrable solution for digital human animation generation.
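The transition-frame interpolation step can be sketched as follows: given the last pose of one retrieved segment and the first pose of the next, insert a few linearly interpolated in-between frames so the stitched sequence stays continuous. The number of transition frames and the plain linear interpolation are illustrative assumptions.

```python
import numpy as np

def stitch_segments(segments, n_transition=5):
    """Concatenate 2D pose segments with interpolated transition frames.

    segments: list of arrays, each of shape (T_i, K, 2) holding 2D keypoints.
    Between consecutive segments, n_transition in-between poses are generated
    by linear interpolation to avoid visible jumps at the segment boundaries.
    """
    out = [segments[0]]
    for nxt in segments[1:]:
        a, b = out[-1][-1], nxt[0]                        # boundary poses
        alphas = np.linspace(0.0, 1.0, n_transition + 2)[1:-1]
        transition = np.stack([(1 - t) * a + t * b for t in alphas])
        out.extend([transition, nxt])
    return np.concatenate(out, axis=0)
```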
Technological innovations are reshaping the development of animation production. As virtual characters are increasingly used in animation creation and smart assistants, a key challenge is how to automatically generate dialogue gestures. However, current approaches often overlook a wide range of modalities and their interactions, resulting in gestures that have low contextual variation and noticeable jitter. To address these issues, we propose SSGesture, a novel diffusion-based framework that effectively captures cross-modal associations. Our three-layer attention structure enhances multimodal processing. We propose the first method to automatically resolve style conflicts through interpolation-based gesture style control, while implementing a unified unmarked style prompt structure via the PAAN layer. Our framework is practically applied in the field of intelligent virtual assistants to generate gestures in human animation synthesis and to realize various new applications. Extensive experiments and user studies have demonstrated that our proposed framework provides substantial assistance in enhancing the efficiency of human animation production.
Three-dimensional human motion generation is an important branch of computer graphics and has broad application prospects. Traditional human animation synthesis technologies rely on professional simulation platforms with high labor and time costs. Existing learning-based methods usually generate human animations by giving the prior motion seed, which lacks generative ability and cannot generate a wide variety of human motions. On the other hand, established generative methods rely on a given prior sample distribution, and their creation capabilities are relatively limited. To that end, we propose a distribution-free human motion synthesis workflow based on a temporal diffusion model. By specifying the high-level semantic motion info, our method is able to generate a wide variety of human motions with diverse styles. Firstly, we construct our human motion dataset by selecting the human motion sequences that cover different motion types and labeling them with the corresponding motion semantics. Secondly, we use the temporal network Transformer to extract the motion semantic features of different kinds of human motion sequences and introduce the self-attention mechanism to ensure temporal continuity between adjacent motion frames. Then, we use the diffusion model to denoise the extracted motion semantic features to generate visually continuous, realistic, and delicate motion sequences. Finally, we conduct a series of experiments on HumanAct12 and UESTC datasets. The experimental results demonstrate that our method achieves better performance in motion reconstruction and generation, and has greater improvements on a few metrics including RMSE, STED, FID, diversity, etc.
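The diffusion component of such a workflow can be illustrated with the standard DDPM forward-noising step (a generic sketch, not the paper's exact schedule or network): x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε, with a denoising network trained to predict ε.

```python
import numpy as np

T_STEPS = 1000
betas = np.linspace(1e-4, 0.02, T_STEPS)          # linear noise schedule (assumed)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    """Forward diffusion: noise a clean motion sequence x0 to timestep t.

    x0: (T_frames, D) motion features. Returns the noised sample and the
    noise, which a denoising network would be trained to predict.
    """
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(60, 72))                    # 60 frames of pose features (toy data)
xt, eps = q_sample(x0, t=500, rng=rng)
```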
Traditional 3D animation creation involves a process of motion acquisition, dubbing, and mouth movement data binding for each character. To streamline animation creation, we propose combining artificial intelligence (AI) with a motion capture system. This integration aims to reduce the time, workload, and cost associated with animation creation. By utilizing AI and natural language processing, the characters can engage in independent learning, generating their own responses and interactions, thus moving away from the traditional method of creating digital characters with pre-defined behaviors. In this paper, we present an approach that employs a digital person’s animation environment. We utilized Unity plug-ins to drive the character’s mouth Blendshape, synchronize the character’s voice and mouth movements in Unity, and connect the digital person to an AI system. This integration enables AI-driven language interactions within animation production. Through experimentation, we evaluated the correctness of the natural language interaction of the digital human in the animated scene, the real-time synchronization of mouth movements, the potential for singularity in guiding users during digital human animation creation, and its ability to guide user interactions through its own thought process.
Diffusion models have shown their remarkable ability to synthesize images, including the generation of humans in specific poses. However, current models face challenges in adequately expressing conditional control for detailed hand pose generation, leading to significant distortion in the hand regions. To tackle this problem, we first curate the How2Sign dataset to provide richer and more accurate hand pose annotations. In addition, we introduce adaptive, multi-modal fusion to integrate characters' physical features expressed in different modalities such as skeleton, depth, and surface normal. Furthermore, we propose a novel Region-Aware Cycle Loss (RACL) that enables the diffusion model training to focus on improving the hand region, resulting in improved quality of generated hand gestures. More specifically, the proposed RACL computes a weighted keypoint distance between the full-body pose keypoints from the generated image and the ground truth, to generate higher-quality hand poses while balancing overall pose accuracy. Moreover, we use two hand region metrics, named hand-PSNR and hand-Distance for hand pose generation evaluations. Our experimental evaluations demonstrate the effectiveness of our proposed approach in improving the quality of digital human pose generation using diffusion models, especially the quality of the hand region. The source code is available at https://github.com/fuqifan/Region-Aware-Cycle-Loss.
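The weighted keypoint distance at the heart of the Region-Aware Cycle Loss can be sketched as below; the keypoint indexing and the hand-weight value are illustrative assumptions.

```python
import numpy as np

def region_weighted_keypoint_loss(pred_kpts, gt_kpts, hand_idx, hand_weight=5.0):
    """Mean keypoint distance with extra weight on hand keypoints.

    pred_kpts, gt_kpts: (K, 2) full-body 2D keypoints from the generated image
    and the ground truth. hand_idx: indices of keypoints in the hand regions.
    Up-weighting the hand terms pushes training toward fixing hand distortion
    while the remaining keypoints keep the overall pose accurate.
    """
    weights = np.ones(len(pred_kpts))
    weights[hand_idx] = hand_weight
    dists = np.linalg.norm(pred_kpts - gt_kpts, axis=1)
    return float(np.sum(weights * dists) / np.sum(weights))
```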
Recent advancements in the field of audio-driven 3D facial animation have accelerated rapidly, with numerous papers being published in a short span of time. This surge in research has garnered significant attention from both academia and industry given its potential applications to digital humans. Various approaches, both deterministic and non-deterministic, have been explored based on foundational advancements in deep learning algorithms. However, there remains no consensus among researchers on standardized methods for evaluating these techniques. Additionally, rather than converging on a common set of datasets and objective metrics suited for specific methods, recent works exhibit considerable variation in experimental setups. This inconsistency complicates the research landscape, making it difficult to establish a streamlined evaluation process and rendering many cross-paper comparisons challenging. Moreover, the common practice of A/B testing in perceptual studies focuses on only two common metrics and is not sufficient for non-deterministic and emotion-enabled approaches. The lack of correlation between subjective and objective metrics points out that there is a need for critical analysis in this space. In this study, we address these issues by benchmarking state-of-the-art deterministic and non-deterministic models, utilizing a consistent experimental setup across a carefully curated set of objective metrics and datasets. We also conduct a perceptual user study to assess whether subjective perceptual metrics align with the objective metrics. Our findings indicate that model rankings do not necessarily generalize across datasets, and subjective metric ratings are not always consistent with their corresponding objective metrics. The supplementary video, edited code scripts for training on different datasets, and documentation related to this benchmark study are made publicly available: https://galib360.github.io/face-benchmark-project/.
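As one concrete example of the kind of objective metric such benchmarks rely on, lip vertex error is often computed as the maximal per-vertex L2 error over the lip region in each frame, averaged over frames. The sketch below assumes meshes are given as vertex arrays and that lip vertex indices are known; it illustrates a commonly used metric, not this study's specific protocol.

```python
import numpy as np

def lip_vertex_error(pred_verts, gt_verts, lip_idx):
    """Lip vertex error: max lip-vertex L2 error per frame, averaged over frames.

    pred_verts, gt_verts: (T, V, 3) animated mesh vertex positions.
    lip_idx: indices of the vertices belonging to the lip region.
    """
    diff = pred_verts[:, lip_idx] - gt_verts[:, lip_idx]      # (T, L, 3)
    per_vertex = np.linalg.norm(diff, axis=-1)                 # (T, L)
    return float(per_vertex.max(axis=1).mean())
```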
Synthesizing dynamic appearances of humans in motion plays a central role in applications such as AR/VR and video editing. While many recent methods have been proposed to tackle this problem, handling loose garments with complex textures and highly dynamic motion still remains challenging. In this paper, we propose a video-based appearance synthesis method that tackles such challenges and demonstrates high-quality results for in-the-wild videos that have not been shown before. Specifically, we adapt a StyleGAN-based architecture to the task of person-specific, video-based motion retargeting. We introduce a novel motion signature that is used to modulate the generator weights to capture dynamic appearance changes, as well as regularizing the single-frame pose estimates to improve temporal coherency. We evaluate our method on a set of challenging videos and show that our approach achieves state-of-the-art performance both qualitatively and quantitatively.
No abstract available
We introduce D3-Human, a method for reconstructing Dynamic Disentangled Digital Human geometry from monocular videos. Past monocular video human reconstruction primarily focuses on reconstructing undecoupled clothed human bodies or only reconstructing clothing, making it difficult to apply directly in applications such as animation production. The challenge in reconstructing decoupled clothing and body lies in the occlusion caused by clothing over the body. To this end, the details of the visible area and the plausibility of the invisible area must be ensured during the reconstruction process. Our proposed method combines explicit and implicit representations to model the decoupled clothed human body, leveraging the robustness of explicit representations and the flexibility of implicit representations. Specifically, we reconstruct the visible region as SDF and propose a novel human manifold signed distance field (hmSDF) to segment the visible clothing and visible body, and then merge the visible and invisible body. Extensive experimental results demonstrate that, compared with existing reconstruction schemes, D3-Human can achieve high-quality decoupled reconstruction of the human body wearing different clothing, and can be directly applied to clothing transfer and animation production. Code is available at https://ustc3dv.github.io/D3Human/.
At present, ecological education for primary school students faces an urgent need to enhance immersion and interaction in teaching methods and media. Traditional methods are insufficient for presenting the complexity of ecosystems and guiding students' emotional investment, and virtual digital human technology provides a new way to address this. This research aims to construct a complete technical framework from automatic character-image generation to intelligent interaction. For image generation, by optimizing the generator and classifier of CycleGAN, high-quality, identity-preserving translation from real faces to a specific archaic cartoon style is achieved. At the core of interaction driving, a hybrid driving model based on skeletal animation and blendshapes is proposed, and a low-latency phoneme-visual real-time synchronization algorithm is designed to keep the digital human's speech and behavior consistent and natural. For cross-platform deployment, a lightweight rendering engine based on WebGL is developed, which ensures smooth operation on mobile devices through level-of-detail (LOD), partial redrawing, and other optimizations. Finally, through system integration, an application prototype of "ecological intelligence" with knowledge explanation, intelligent question answering, and multimodal interaction functions was developed. Evaluation shows that the scheme is superior to traditional methods in image generation quality (FID improved by about 36%), real-time synchronization (latency ≤ 85 ms), and user acceptance (preference rate of 68.5%), providing a feasible technical scheme and practical example for building a highly interactive, approachable digital assistant for ecological education.
This paper introduces an innovative Generative AI (GenAI)-based digital human specifically built to mitigate the challenges of social isolation and mental health among the elderly population, making use of large language models and 3D animation technologies. The system has the capacity to combine multi-modal GenAI intelligence with real-time, photorealistic digital facial avatars to create emotionally responsive and context-aware conversational companions. A powerful natural language processing engine based on the Mistral 8x7B model lies at the fundamental level. A key innovation in this work is the realistic rendering of digital human faces, featuring synchronized lip movements that lead to emotionally engaging conversation. The smooth integration significantly enhances the user experience, especially for elderly users who may struggle with traditional interfaces. This research lays a strong foundation for the future of emotionally intelligent digital companions, which in turn emphasizes their role not just as assistive technologies, but as proactive agents in promoting psychosocial well-being and digital inclusion for aging populations.
With advances in digital transformation (DX) in education and digital technologies becoming more deeply integrated into educational settings, global demand for video-based learning materials continues to rise, resulting in substantial effort being required from teachers to create e-learning videos. Furthermore, while many existing services offer visual content, they primarily rely on templates, making it challenging to design custom content that addresses specific needs. In this study, we develop a web service that facilitates e-learning video creation through integrated artificial intelligence (AI) and digital human technology. This service enhances educational content by integrating digital human characters and voice synthesis technologies, aiming to create comprehensive e-learning videos by incorporating visual motion and synchronized audio into educational content. In addition, this service also aims to enable the creation of engaging content through advanced visuals and animations, effectively maintaining learner interest.
With the increasing complexity of human–computer interaction scenarios, conventional digital human facial expression systems show notable limitations in handling multi-emotion co-occurrence, dynamic expression, and semantic responsiveness. This paper proposes a digital human system framework that integrates multimodal emotion recognition and compound facial expression generation. The system establishes a complete pipeline for real-time interaction and compound emotional expression, following a sequence of “speech semantic parsing—multimodal emotion recognition—Action Unit (AU)-level 3D facial expression control.” First, a ResNet18-based model is employed for robust emotion classification using the AffectNet dataset. Then, an AU motion curve driving module is constructed on the Unreal Engine platform, where dynamic synthesis of basic emotions is achieved via a state-machine mechanism. Finally, Generative Pre-trained Transformer (GPT) is utilized for semantic analysis, generating structured emotional weight vectors that are mapped to the AU layer to enable language-driven facial responses. Experimental results demonstrate that the proposed system significantly improves facial animation quality, with naturalness increasing from 3.54 to 3.94 and semantic congruence from 3.44 to 3.80. These results validate the system’s capability to generate realistic and emotionally coherent expressions in real time. This research provides a complete technical framework and practical foundation for high-fidelity digital humans with affective interaction capabilities.
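The mapping from an emotion weight vector to AU-level controls can be illustrated as a simple matrix blend; the emotion set, AU selection, and activation values below are illustrative assumptions, not the paper's calibrated tables.

```python
import numpy as np

EMOTIONS = ["happiness", "sadness", "surprise"]
ACTION_UNITS = ["AU6_cheek_raiser", "AU12_lip_corner_puller",
                "AU1_inner_brow_raiser", "AU15_lip_corner_depressor",
                "AU26_jaw_drop"]

# Rows: emotions, columns: AUs (hypothetical activation strengths in [0, 1]).
EMOTION_TO_AU = np.array([
    [0.9, 1.0, 0.0, 0.0, 0.1],   # happiness -> raised cheeks, lip corners up
    [0.0, 0.0, 0.6, 0.9, 0.0],   # sadness   -> inner brows up, lip corners down
    [0.0, 0.0, 0.8, 0.0, 0.9],   # surprise  -> brows up, jaw drop
])

def emotion_weights_to_au(weights):
    """Blend per-emotion AU patterns by the emotion weight vector."""
    weights = np.asarray(weights, dtype=float)
    au = weights @ EMOTION_TO_AU
    return dict(zip(ACTION_UNITS, np.clip(au, 0.0, 1.0)))

print(emotion_weights_to_au([0.7, 0.0, 0.3]))   # mostly happy, slightly surprised
```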
Virtual digital humans can not only express images and sounds, but also simulate the emotions and reactions of real humans through interaction with users. Their applications have spread widely into fields such as education, medical care, entertainment, and customer service. This paper focuses on two core technologies in digital human technology, speech generation (TTS, Text-to-Speech) and image generation and processing, and explores their development history, technical challenges, and future trends. First, the paper analyzes the evolution of TTS speech generation. Deep learning models such as Tacotron 2 and FastSpeech have significantly improved the naturalness, fluency, and emotional expression of speech synthesis by optimizing the model architecture. At the same time, the rise of multi-emotional speech synthesis and personalized voice customization has promoted the application of virtual digital humans in different scenarios, enabling them to show richer emotional levels and personality characteristics and further enhancing the user's immersion and interactive experience. Second, with the advancement of deep learning technologies such as generative adversarial networks (GANs) and deep convolutional neural networks (CNNs), the facial expressions, body movements, and detail processing of virtual digital humans have reached a high level. Through facial motion capture and posture estimation, the dynamic performance and real-time interactive capabilities of virtual digital humans have been greatly improved, making them more realistic and natural.