Diffusion Models for Fingerprint Generation
Direct Applications of Diffusion Models in Fingerprint Image Synthesis and Data Augmentation
This group of papers directly investigates how diffusion models (e.g., DDPMs) can generate high-quality, diverse fingerprint images to address core problems in fingerprint recognition, including data scarcity, privacy protection, and data augmentation. The studies range from full-fingerprint generation to the completion of partial fingerprints, as well as end-to-end synthesis of latent fingerprints.
- Data augmentation-based enhanced fingerprint recognition using deep convolutional generative adversarial network and diffusion models (Yukai Liu, 2024, Applied and Computational Engineering)
- DiffFinger: Advancing Synthetic Fingerprint Generation through Denoising Diffusion Probabilistic Models (Fred M. Grabovski, Lior Yasur, Yaniv Hacmon, Lior Nisimov, Stav Nimrod, 2024, ArXiv)
- Enhancing Fingerprint Image Synthesis with GANs, Diffusion Models, and Style Transfer Techniques (W. Tang, D. Figueroa, D. Liu, K. Johnsson, A. Sopasakis, 2024, ArXiv)
- Denoising Diffusion Probabilistic Model with Wavelet Packet Transform for Fingerprint Generation (Li Chen, Yong Chan, 2024, Jordanian Journal of Computers and Information Technology)
- Inpainting Diffusion Synthetic and Data Augment With Feature Keypoints for Tiny Partial Fingerprints (Mao-Hsiu Hsu, Yung-Ching Hsu, Ching-Te Chiu, 2025, IEEE Transactions on Biometrics, Behavior, and Identity Science)
- Diffusion Probabilistic Model Based End-to-End Latent Fingerprint Synthesis (Kejian Li, Xiao Yang, 2023, 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML))
- Multiresolution synthetic fingerprint generation (André Brasil Vieira Wyzykowski, M. P. Segundo, R. Lemes, 2022, IET Biom.)
- Fingerprint Synthesis from Diffusion Models and Generative Adversarial Networks (Weizhong Tang, Diego Andre Figueroa Llamosas, Donglin Liu, K. Johnsson, A. Sopasakis, 2025, No journal)
Controllable Generation and Identity-Preservation Techniques for Diffusion Models
This group of papers focuses on controllability in the diffusion generation process, in particular how external signals (e.g., text, sketches, identity IDs) can guide generation. For fingerprint synthesis, this involves preserving intra-class variations and identity consistency for impressions of the same finger, as well as precise control over attributes such as fingerprint class and sensor type.
- T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models (Chong Mou, Xintao Wang, Liangbin Xie, Jing Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie, 2023, No journal)
- Composer: Creative and Controllable Image Synthesis with Composable Conditions (Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, Jingren Zhou, 2023, ArXiv)
- Adding Conditional Control to Text-to-Image Diffusion Models (Lvmin Zhang, Anyi Rao, Maneesh Agrawala, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- Universal Fingerprint Generation: Controllable Diffusion Model With Multimodal Conditions (Steven A. Grosz, Anil K. Jain, 2024, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing (Dongxu Li, Junnan Li, Steven C. H. Hoi, 2023, ArXiv)
- UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild (Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Haiquan Wang, Juan Carlos Niebles, Caiming Xiong, S. Savarese, Stefano Ermon, Yun Fu, Ran Xu, 2023, ArXiv)
- ICAS: IP-Adapter and ControlNet-Based Attention Structure for Multi-Subject Style Transfer Optimization (Fuwei Liu, 2025, IEEE Access)
- MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation (Omer Bar-Tal, Lior Yariv, Y. Lipman, Tali Dekel, 2023, No journal)
Latent Diffusion Model (LDM) Architecture Optimization and Training-Efficiency Improvements
This group of papers focuses on improvements to the underlying architecture of diffusion models, in particular the optimization of latent diffusion models (LDMs) and their autoencoders (VAEs). These technical advances (e.g., accelerated convergence, latent-space alignment, end-to-end training) provide the algorithmic foundation for generating ultra-high-resolution, higher-fidelity fingerprint images.
- High-Resolution Image Synthesis with Latent Diffusion Models (Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, B. Ommer, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (Dustin Podell, Zion English, Kyle Lacey, A. Blattmann, Tim Dockhorn, Jonas Muller, Joe Penna, Robin Rombach, 2023, ArXiv)
- DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space (Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, Han Cai, 2025, ArXiv)
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models (Jingfeng Yao, Xinggang Wang, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Latent Diffusion Model without Variational Autoencoder (Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu, 2025, ArXiv)
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers (Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng, 2025, ArXiv)
Extensions of Diffusion Models to Fingerprint Presentation Attack Detection and Constrained Tasks
This group of papers showcases applications of diffusion models beyond fingerprint generation, including presentation attack detection (PAD) based on reconstruction error, generation under physical constraints, 4K ultra-high-resolution synthesis, and attention-based image editing.
- Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing (Bingyan Liu, Chengyu Wang, Tingfeng Cao, Kui Jia, Jun Huang, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing (Seokhyeon Hong, Chaelin Kim, Serin Yoon, Junghyun Nam, Sihun Cha, Jun-yong Noh, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models (A. Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, S. Fidler, Karsten Kreis, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Training-Free Constrained Generation With Stable Diffusion Models (S. Zampini, Jacob K. Christopher, Luca Oneto, D. Anguita, Ferdinando Fioretto, 2025, ArXiv)
- Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models (Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Unsupervised Fingerphoto Presentation Attack Detection With Diffusion Models (Hailin Li, Raghavendra Ramachandra, Mohamed Ragab, Soumik Mondal, Yong Kiam Tan, Khin Mi Mi Aung, 2024, 2024 IEEE International Joint Conference on Biometrics (IJCB))
Taken together, these papers demonstrate the full breadth of diffusion-model applications in fingerprint recognition: from foundational fingerprint image synthesis and data augmentation (e.g., DiffFinger, GenPrint), to fine-grained control over fingerprint identity and style via techniques such as ControlNet and T2I-Adapter; the research also covers the latest breakthroughs in latent diffusion model (LDM) architecture optimization and training efficiency; finally, diffusion models have been extended to specialized tasks such as fingerprint presentation attack detection (PAD) and ultra-high-resolution generation, reflecting the technology's enormous potential in biometrics.
A total of 28 related papers.
The utilization of synthetic data for fingerprint recognition has garnered increased attention due to its potential to alleviate privacy concerns surrounding sensitive biometric data. However, current methods for generating fingerprints have limitations in creating impressions of the same finger with useful intra-class variations. To tackle this challenge, we present GenPrint, a framework to produce fingerprint images of various types while maintaining identity and offering humanly understandable control over different appearance factors, such as fingerprint class, acquisition type, sensor device, and quality level. Unlike previous fingerprint generation approaches, GenPrint is not confined to replicating style characteristics from the training dataset alone: it enables the generation of novel styles from unseen devices without requiring additional fine-tuning. To accomplish these objectives, we developed GenPrint using latent diffusion models with multimodal conditions (text and image) for consistent generation of style and identity. Our experiments leverage a variety of publicly available datasets for training and evaluation. Results demonstrate the benefits of GenPrint in terms of identity preservation, explainable control, and universality of generated images. Importantly, the GenPrint-generated images yield comparable or even superior accuracy to models trained solely on real data and further enhance performance when augmenting the diversity of existing real fingerprint datasets.
The majority of contemporary fingerprint synthesis is based on the Generative Adversarial Network (GAN). Recently, the Denoising Diffusion Probabilistic Model (DDPM) has been demonstrated to be more effective than GAN in numerous scenarios, particularly in terms of diversity and fidelity. This research develops a model based on the enhanced DDPM for fingerprint generation. Specifically, the image is decomposed into sub-images of varying frequency sub-bands through the use of a wavelet packet transform (WPT). This method enables DDPM to operate at a more local and detailed level, thereby accurately obtaining the characteristics of the data. Furthermore, a polynomial noise schedule has been designed to replace the linear noise strategy, which can result in a smoother noise addition process. Experiments based on multiple metrics on the datasets SOCOFing and NIST4 demonstrate that the proposed model is superior to existing models.
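The polynomial noise schedule is easy to state concretely. Below is a minimal sketch contrasting the standard linear beta schedule with a polynomial one; the exponent and endpoint values are illustrative assumptions, since the paper's exact coefficients are not reproduced here.

```python
import torch

def linear_beta_schedule(T: int, beta_1: float = 1e-4, beta_T: float = 0.02) -> torch.Tensor:
    """Standard linear noise schedule from the original DDPM formulation."""
    return torch.linspace(beta_1, beta_T, T)

def polynomial_beta_schedule(T: int, beta_1: float = 1e-4, beta_T: float = 0.02,
                             power: float = 2.0) -> torch.Tensor:
    """Polynomial schedule: betas grow slowly at first, giving a smoother
    noise-addition process. The exponent `power` is an assumed value,
    not the one published in the paper."""
    t = torch.linspace(0.0, 1.0, T)
    return beta_1 + (beta_T - beta_1) * t.pow(power)

# alpha_bar_t = prod_{s <= t} (1 - beta_s): the fraction of signal retained at step t.
betas = polynomial_beta_schedule(T=1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
```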
We present novel approaches involving generative adversarial networks and diffusion models in order to synthesize high quality, live and spoof fingerprint images while preserving features such as uniqueness and diversity. We generate live fingerprints from noise with a variety of methods, and we use image translation techniques to translate live fingerprint images to spoof. To generate different types of spoof images based on limited training data we incorporate style transfer techniques through a cycle autoencoder equipped with a Wasserstein metric along with Gradient Penalty (CycleWGAN-GP) in order to avoid mode collapse and instability. We find that when the spoof training data includes distinct spoof characteristics, it leads to improved live-to-spoof translation. We assess the diversity and realism of the generated live fingerprint images mainly through the Fréchet Inception Distance (FID) and the False Acceptance Rate (FAR). Our best diffusion model achieved an FID of 15.78. The comparable WGAN-GP model achieved a slightly higher FID while performing better in the uniqueness assessment due to a slightly lower FAR when matched against the training data, indicating better creativity. Moreover, we give example images showing that a DDPM model clearly can generate realistic fingerprint images.
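For reference, the FID figures quoted above are the standard Fréchet distance between Gaussian fits to Inception features of the real and generated image sets (lower is better):

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature means and covariances of the real and generated images, respectively.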
The progress of fingerprint recognition applications encounters substantial hurdles due to privacy and security concerns, leading to limited fingerprint data availability and stringent data quality requirements. This article endeavors to tackle the challenges of data scarcity and data quality in fingerprint recognition by implementing data augmentation techniques. Specifically, this research employed two state-of-the-art generative models in the domain of deep learning, namely Deep Convolutional Generative Adversarial Network (DCGAN) and the Diffusion model, for fingerprint data augmentation. Generative Adversarial Network (GAN), as a popular generative model, effectively captures the features of sample images and learns the diversity of the sample images, thereby generating realistic and diverse images. DCGAN, as a variant model of traditional GAN, inherits the advantages of GAN while alleviating issues such as blurry images and mode collapse, resulting in improved performance. On the other hand, Diffusion, as one of the most popular generative models in recent years, exhibits outstanding image generation capabilities and surpasses traditional GAN in some image generation tasks. The experimental results demonstrate that both DCGAN and Diffusion can generate clear, high-quality fingerprint images, fulfilling the requirements of fingerprint data augmentation. Furthermore, through the comparison between DCGAN and Diffusion, it is concluded that the quality of fingerprint images generated by DCGAN is superior to the results of Diffusion, and DCGAN exhibits higher efficiency in both training and generating images compared to Diffusion.
Fingerprints have been crucial evidence for law enforcement agencies for a long time. Though rapid advances in deep learning have dramatically improved the performance of latent fingerprint recognition algorithms, a fully automated latent fingerprint identification system is still far from meeting actual needs. One major issue is the lack of publicly available latent fingerprint databases. Recently, diffusion probabilistic models have emerged as state-of-the-art deep generative methods for image synthesis. These models have better distribution coverage and less mode collapse than the popular Generative Adversarial Networks. In this paper, we propose an end-to-end latent fingerprint synthetic approach based on the improved denoising diffusion probabilistic model. The proposed approach could simultaneously generate latent, rolled, and plain fingerprints of high visual realism. Several primary degradation factors, such as various background textures, limited area of ridge patterns, and structural noise, can be directly generated without any postprocessing, unlike existing methods. We conduct NFIQ2 and perceptual analysis in the experiments to evaluate the proposed approach. The results indicate that the quality and visual realism of the proposed synthetic fingerprints are similar to those of natural fingerprints, demonstrating the effectiveness of our approach.
Smartphone-based contactless fingerphoto authentication has become a reliable alternative to traditional contact-based fingerprint biometric systems owing to rapid advances in smartphone camera technology. Despite its convenience, fingerprint authentication through fingerphotos is more vulnerable to presentation attacks, which has motivated recent research efforts towards developing fingerphoto Presentation Attack Detection (PAD) techniques. However, prior PAD approaches utilized supervised learning methods that require labeled training data for both bona fide and attack samples. This can suffer from two key issues, namely (i) generalization—the detection of novel presentation attack instruments (PAIs) unseen in the training data, and (ii) scalability—the collection of a large dataset of attack samples using different PAIs. To address these challenges, we propose a novel unsupervised approach based on a state-of-the-art deep-learning-based diffusion model, the Denoising Diffusion Probabilistic Model (DDPM), which is trained solely on bona fide samples. The proposed approach detects Presentation Attacks (PA) by calculating the reconstruction similarity between the input and output pairs of the DDPM. We present extensive experiments across three PAI datasets to test the accuracy and generalization capability of our approach. The results show that the proposed DDPM-based PAD method achieves significantly better detection error rates on several PAI classes compared to other baseline unsupervised approaches.
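The detection principle (a DDPM trained only on bona fide samples reconstructs bona fide inputs faithfully, while attack samples drift off-manifold) can be sketched in a few lines. This is a conceptual sketch: `ddpm.q_sample` and `ddpm.denoise_from` are hypothetical interfaces, and the noising depth `t_star` is an assumed hyperparameter, not the paper's published setting.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pad_score(x: torch.Tensor, ddpm, t_star: int = 250) -> float:
    """Unsupervised presentation-attack score via diffusion reconstruction.

    The DDPM is trained only on bona fide fingerphotos, so it reconstructs
    bona fide inputs well; a high reconstruction error flags a likely
    presentation attack. Both `ddpm` methods are hypothetical interfaces.
    """
    x_noisy = ddpm.q_sample(x, t=t_star)          # forward-diffuse the input to step t_star
    x_rec = ddpm.denoise_from(x_noisy, t=t_star)  # run the reverse process back to t = 0
    return F.mse_loss(x_rec, x).item()            # higher score -> more likely an attack
```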
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve new state of the art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
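The core LDM recipe, running diffusion in the latent space of a frozen pretrained autoencoder, reduces to a short training step. A minimal epsilon-prediction sketch follows; `vae.encode` and the `unet(z_t, t)` signature are assumed interfaces, and real LDMs additionally condition the UNet via cross-attention on text or other inputs.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(vae, unet, x, betas):
    """One latent-diffusion training step (conceptual sketch)."""
    with torch.no_grad():
        z = vae.encode(x)                 # frozen autoencoder: image -> compact latent
    t = torch.randint(0, betas.shape[0], (z.shape[0],), device=z.device)
    alpha_bar = torch.cumprod(1.0 - betas.to(z.device), dim=0)[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(z)
    z_t = alpha_bar.sqrt() * z + (1.0 - alpha_bar).sqrt() * noise  # sample q(z_t | z_0)
    # Epsilon-prediction objective, computed entirely in latent space.
    return F.mse_loss(unet(z_t, t), noise)
```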
We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models
Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for suboptimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pretrained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256×256 generation with an FID score of 1.35 while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs – representing an over 21× convergence speedup compared to the original DiT. Models and codes are available at https://github.com/hustvl/LightningDiT.
In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training of both the VAE and the diffusion model using the standard diffusion loss is ineffective, even causing a degradation in final performance. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the VAE and the diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself, leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state of the art, achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
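Both VA-VAE and REPA-E hinge on the same ingredient: a representation-alignment term that pulls learned latents toward frozen vision-foundation-model features. A minimal sketch under assumptions follows; `proj` is an assumed learnable projection head, the frozen features are DINO-style patch embeddings, and the two papers differ in where this loss attaches and which modules are trained.

```python
import torch
import torch.nn.functional as F

def alignment_loss(z_latent: torch.Tensor, f_frozen: torch.Tensor, proj) -> torch.Tensor:
    """REPA-style alignment: negative cosine similarity between projected
    latents and frozen foundation-model features.

    z_latent: (B, C, H, W) tokenizer latents; f_frozen: (B, N, D) patch
    features from a frozen encoder (e.g., DINOv2) with N == H * W.
    `proj` (an assumed nn.Linear(C, D)) maps latents into feature space.
    """
    z = proj(z_latent.flatten(2).transpose(1, 2))  # (B, N, D) projected tokens
    return -F.cosine_similarity(z, f_frozen, dim=-1).mean()
```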
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and finetuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution $512 \times 1024$, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pretrained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to $1280 \times 2048$. We show that the temporal layers trained in this way generalize to different finetuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://nv-tlabs.github.io/VideoLDM/
In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis. Code is available at https://github.com/zhang0jhon/diffusion-4k.
Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce a skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available at project page.
Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.
We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder's latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the employment of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space with front latent channels capturing object structures and latter latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. Code: https://github.com/dc-ai-projects/DC-Gen.
Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative text-to-image generation. However, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers. This approach modifies objects or object properties in images by manipulating feature components in attention layers during the generation process. Nevertheless, little is known about the semantic meanings that these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper, we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information, which can result in editing failures. In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the source image during the transformation to the target image. Our analysis offers valuable insights into understanding cross and self-attention mechanisms in diffusion models. Furthermore, based on our findings, we propose a simplified, yet more stable and efficient, tuning-free procedure that modifies only the self-attention maps of specified attention layers during the denoising process. Experimental results show that our simplified method consistently surpasses the performance of popular approaches on multiple datasets. Source code and datasets are available at https://github.com/alibaba/EasyNLP/tree/master/diffusion/FreePromptEditing.
Stable diffusion models represent the state-of-the-art in data synthesis across diverse domains and hold transformative potential for applications in science and engineering, e.g., by facilitating the discovery of novel solutions and simulating systems that are computationally intractable to model explicitly. While there is increasing effort to incorporate physics-based constraints into generative models, existing techniques are either limited in their applicability to latent diffusion frameworks or lack the capability to strictly enforce domain-specific constraints. To address this limitation this paper proposes a novel integration of stable diffusion models with constrained optimization frameworks, enabling the generation of outputs satisfying stringent physical and functional requirements. The effectiveness of this approach is demonstrated through material design experiments requiring adherence to precise morphometric properties, challenging inverse design tasks involving the generation of materials inducing specific stress-strain responses, and copyright-constrained content generation tasks. All code has been released at https://github.com/RAISELab-atUVA/Constrained-Stable-Diffusion.
This study explores the generation of synthesized fingerprint images using Denoising Diffusion Probabilistic Models (DDPMs). The significant obstacles in collecting real biometric data, such as privacy concerns and the demand for diverse datasets, underscore the imperative for synthetic biometric alternatives that are both realistic and varied. Despite the strides made with Generative Adversarial Networks (GANs) in producing realistic fingerprint images, their limitations prompt us to propose DDPMs as a promising alternative. DDPMs are capable of generating images with increasing clarity and realism while maintaining diversity. Our results reveal that DiffFinger not only competes with authentic training set data in quality but also provides a richer set of biometric data, reflecting true-to-life variability. These findings mark a promising stride in biometric synthesis, showcasing the potential of DDPMs to advance the landscape of fingerprint identification and authentication systems.
The advancement of fingerprint research within public academic circles has been trailing behind facial recognition, primarily due to the scarcity of extensive publicly available datasets, despite fingerprints being widely used across various domains. Recent progress has seen the application of deep learning techniques to synthesize fingerprints, predominantly focusing on large-area fingerprints within existing datasets. However, with the emergence of AIoT and edge devices, the importance of tiny partial fingerprints has been underscored for their faster and more cost-effective properties. Yet, there remains a lack of publicly accessible datasets for such fingerprints. To address this issue, we introduce publicly available datasets tailored for tiny partial fingerprints. Using advanced generative deep learning, we pioneer diffusion methods for fingerprint synthesis. By combining random sampling with inpainting diffusion guided by feature keypoints masks, we enhance data augmentation while preserving key features, achieving up to 99.1% recognition matching rate. To demonstrate the usefulness of our fingerprint images generated using our approach, we conducted experiments involving model training for various tasks, including denoising, deblurring, and deep forgery detection. The results showed that models trained with our generated datasets outperformed those trained without our datasets or with other synthetic datasets. This indicates that our approach not only produces diverse fingerprints but also improves the model's generalization capabilities. Furthermore, our approach ensures confidentiality without compromise by partially transforming randomly sampled synthetic fingerprints, which reduces the likelihood of real fingerprints being leaked. The total number of generated fingerprints published in this article amounts to 818,077. Moving forward, we will continue to provide updates and releases to contribute to the advancement of the tiny partial fingerprint field. The code and our generated tiny partial fingerprint dataset can be accessed at https://github.com/Hsu0623/Inpainting-Diffusion-Synthetic-and-Data-Augment-with-Feature-Keypoints-for-Tiny-Partial-Fingerprints.git
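The keypoint-masked inpainting described above follows the general mask-guided diffusion-inpainting pattern (in the spirit of RePaint): at each reverse step, regions to be preserved are re-noised from the reference image while the model synthesizes the rest. A sketch under assumed interfaces (`ddpm.p_sample`, `ddpm.q_sample`); the paper's exact keypoint-mask construction is not reproduced here.

```python
import torch

@torch.no_grad()
def inpaint_step(x_t, x_known, mask, t, ddpm):
    """One mask-guided inpainting step for t > 0 (conceptual sketch).

    mask == 1 where content must be preserved (e.g., around feature
    keypoints), mask == 0 where the model may synthesize freely. Both
    `ddpm` methods are hypothetical interfaces: `p_sample` performs one
    reverse step, `q_sample` forward-noises the known image to level t-1.
    """
    x_free = ddpm.p_sample(x_t, t)           # model proposal for step t-1
    x_keep = ddpm.q_sample(x_known, t - 1)   # known pixels, noised to t-1
    return mask * x_keep + (1.0 - mask) * x_free
```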
No abstract available
The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., structure and color) is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn low-cost T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications. Our code is available at https://github.com/TencentARC/T2I-Adapter.
Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Code and models will be released at https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion. Project page at https://dxli94.github.io/BLIP-Diffusion-website/.
Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.
Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. Project webpage: https://multidiffusion.github.io
Recent large-scale generative models learned on big data are capable of synthesizing incredible images yet suffer from limited controllability. This work offers a new generation paradigm that allows flexible control of the output image, such as spatial layout and palette, while maintaining the synthesis quality and model creativity. With compositionality as the core idea, we first decompose an image into representative factors, and then train a diffusion model with all these factors as the conditions to recompose the input. At the inference stage, the rich intermediate representations work as composable elements, leading to a huge design space (i.e., exponentially proportional to the number of decomposed factors) for customizable content creation. It is noteworthy that our approach, which we call Composer, supports various levels of conditions, such as text description as the global information, depth map and sketch as the local guidance, color histogram for low-level details, etc. Besides improving controllability, we confirm that Composer serves as a general framework and facilitates a wide range of classical generative tasks without retraining. Code and models will be made available.
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1M) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.
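The "zero convolution" trick is simple enough to show directly: a 1x1 convolution whose weights and bias start at zero, so the control branch contributes nothing at initialization and the frozen backbone's behavior is preserved until training grows the parameters away from zero. A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution initialized to all zeros, as used to connect a
    ControlNet branch to the frozen backbone: its output is exactly zero
    at step 0, so finetuning starts from the pretrained model's
    unmodified behavior."""
    def __init__(self, channels: int):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

# Inside a block, h = backbone_block(h) + zero_conv(control_block(c))
# leaves the backbone's output unchanged at initialization.
```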
Generating multi-subject stylized images remains a significant challenge due to the ambiguity in defining style attributes (e.g., color, texture, atmosphere, and structure) and the difficulty in consistently applying them across multiple subjects. Although recent diffusion-based text-to-image models have achieved remarkable progress, existing methods typically rely on computationally expensive inversion procedures or large-scale stylized datasets. Moreover, these methods often struggle with maintaining multi-subject semantic fidelity and are limited by high inference costs. To address these limitations, this paper proposes ICAS (IP-Adapter and ControlNet-based Attention Structure), a novel framework for efficient and controllable multi-subject style transfer. Instead of full-model tuning, ICAS adaptively fine-tunes only the content injection branch of a pre-trained diffusion model, thereby preserving identity-specific semantics while enhancing style controllability. By combining IP-Adapter for adaptive style injection with ControlNet for structural conditioning, our framework ensures faithful global layout preservation alongside accurate local style synthesis. Furthermore, ICAS introduces a cyclic multi-subject content embedding mechanism, which enables effective style transfer under limited-data settings without the need for extensive stylized corpora. Extensive experiments show that ICAS achieves superior performance in structure preservation, style consistency, and inference efficiency. In particular, in a comprehensive user study, the proposed method was rated superior to baselines, achieving an average score of 4.32 out of 5.0 for Style Fidelity and 4.28 for Subject Clarity. This performance establishes a new paradigm for multi-subject style transfer in real-world applications.
No abstract available