Diffusion Models for Fingerprint Generation
Foundations and Core Architectures of Diffusion Models
These works lay the theoretical groundwork for diffusion models, covering the architectural evolution from the early DDPM to latent-space diffusion (LDM) and Transformer-based designs.
- Denoising Diffusion Probabilistic Models(Jonathan Ho, Ajay Jain, P. Abbeel, 2020, Neural Information Processing Systems)
- Improved Denoising Diffusion Probabilistic Models(Alex Nichol, Prafulla Dhariwal, 2021, International Conference on Machine Learning)
- Denoising Diffusion Implicit Models(Jiaming Song, Chenlin Meng, Stefano Ermon, 2020, International Conference on Learning Representations)
- High-Resolution Image Synthesis with Latent Diffusion Models(Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, B. Ommer, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Scalable Diffusion Models with Transformers(William S. Peebles, Saining Xie, 2022, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers(Nanye Ma, Mark Goldstein, M. S. Albergo, N. Boffi, Eric Vanden-Eijnden, Saining Xie, 2024, European Conference on Computer Vision)
- Simplified and Generalized Masked Diffusion for Discrete Data(Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, Michalis K. Titsias, 2024, Neural Information Processing Systems)
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think(Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, Saining Xie, 2024, International Conference on Learning Representations)
Fingerprint and Biometric Image Generation Applications
Works that apply diffusion models specifically to fingerprint and palmprint generation, targeting data scarcity, privacy protection, and improved recognition accuracy in biometrics.
- DiffFinger: Advancing Synthetic Fingerprint Generation through Denoising Diffusion Probabilistic Models(Fred M. Grabovski, Lior Yasur, Yaniv Hacmon, Lior Nisimov, Stav Nimrod, 2024, arXiv.org)
- Denoising Diffusion Probabilistic Model with Wavelet Packet Transform for Fingerprint Generation(Li Chen, Yong Chan, 2024, Jordanian Journal of Computers and Information Technology)
- Diffusion Probabilistic Model Based End-to-End Latent Fingerprint Synthesis(Kejian Li, Xiao Yang, 2023, 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML))
- Fingerprint Synthesis from Diffusion Models and Generative Adversarial Networks(Weizhong Tang, Diego Andre Figueroa Llamosas, Donglin Liu, K. Johnsson, A. Sopasakis, 2025, Lecture Notes in Networks and Systems)
- Data augmentation-based enhanced fingerprint recognition using deep convolutional generative adversarial network and diffusion models(Yukai Liu, 2024, Applied and Computational Engineering)
- PalmDiff: When Palmprint Generation Meets Controllable Diffusion Model(Long Tang, Tingting Chai, Zheng Zhang, Miao Zhang, Xiangqian Wu, 2025, IEEE Transactions on Image Processing)
- Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models(Jianlong Jin, Chenglong Zhao, Ruixin Zhang, Sheng Shang, Jianqing Xu, Jingyu Zhang, Shaoming Wang, Yang Zhao, Shouhong Ding, Wei Jia, Yunsheng Wu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Enhancing Fingerprint Image Synthesis with GANs, Diffusion Models, and Style Transfer Techniques(W. Tang, D. Figueroa, D. Liu, K. Johnsson, A. Sopasakis, 2024, arXiv.org)
Identity Preservation and Controllable Generation
Focuses on precise conditional control (e.g., layout, identity/ID information, pose), which is essential for generating different samples of the same fingerprint (intra-class variation).
- Adding Conditional Control to Text-to-Image Diffusion Models(Lvmin Zhang, Anyi Rao, Maneesh Agrawala, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- FPGAN-Control: A Controllable Fingerprint Generator for Training with Synthetic Data(Alon Shoshan, Nadav Bhonker, Emanuel Ben Baruch, Ori Nizan, Igor Kviatkovsky, Joshua Engelsma, Manoj Aggarwal, Gérard G. Medioni, 2023, 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))
- Universal Fingerprint Generation: Controllable Diffusion Model With Multimodal Conditions(Steven A. Grosz, Anil K. Jain, 2024, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- DCFace: Synthetic Face Generation with Dual Condition Diffusion Model(Minchul Kim, Feng Liu, Anil Jain, Xiaoming Liu, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild(Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Haiquan Wang, Juan Carlos Niebles, Caiming Xiong, S. Savarese, Stefano Ermon, Yun Fu, Ran Xu, 2023, Neural Information Processing Systems)
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models(Hu Ye, Jun Zhang, Siyi Liu, Xiao Han, Wei Yang, 2023, arXiv.org)
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation(Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Y. Pritch, Michael Rubinstein, Kfir Aberman, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers(Sida Huang, Siqi Huang, Ping Luo, Hongyuan Zhang, 2025, arXiv.org)
- Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs(Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui, 2024, International Conference on Machine Learning)
- Data-Driven Fingerprint Reconstruction from Minutiae Based on Real and Synthetic Training Data(A. Makrushin, V. Mannam, J. Dittmann, 2023, Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications)
Multimodal Alignment, Guidance Mechanisms, and Large-Scale Models
Studies on improving image quality through classifier or classifier-free guidance, and on deep alignment and reasoning mechanisms for multimodal (text/image) information.
- Diffusion Models Beat GANs on Image Synthesis(Prafulla Dhariwal, Alex Nichol, 2021, Neural Information Processing Systems)
- Classifier-Free Diffusion Guidance(Jonathan Ho, 2022, arXiv.org)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models(Alex Nichol, Prafulla Dhariwal, A. Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, I. Sutskever, Mark Chen, 2021, International Conference on Machine Learning)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding(Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. S. Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi, 2022, Neural Information Processing Systems)
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis(Dustin Podell, Zion English, Kyle Lacey, A. Blattmann, Tim Dockhorn, Jonas Muller, Joe Penna, Robin Rombach, 2023, International Conference on Learning Representations)
- I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models(Zhenxing Mi, K. Wang, G. Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu, 2025, International Conference on Machine Learning)
- MMaDA: Multimodal Large Diffusion Language Models(Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang, 2025, arXiv.org)
- Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers(Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong, 2025, arXiv.org)
- LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer(Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang, 2025, arXiv.org)
- Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces(Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X.-F. Ye, Molei Tao, 2025, International Conference on Machine Learning)
- Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing(Bingyan Liu, Chengyu Wang, Tingfeng Cao, Kui Jia, Jun Huang, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
Inference Acceleration and Distillation for Diffusion Models
Efficient sampling and distillation schemes that address the many-step, slow inference of diffusion models and enable fast generation.
- SDXL-Lightning: Progressive Adversarial Diffusion Distillation(Shanchuan Lin, Anran Wang, Xiao Yang, 2024, arXiv.org)
- Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation(Clément Chadebec, O. Tasar, Eyal Benaroche, Benjamin Aubin, 2024, AAAI Conference on Artificial Intelligence)
This collection of papers traces the full development path of diffusion models, from foundational generative theory to specific biometric domains such as fingerprint recognition. The central research question is how improved architectures (e.g., moving from U-Net to Transformer backbones) and control mechanisms (e.g., ControlNet, ID losses) can produce synthetic fingerprints that are both highly realistic and identity-consistent. In addition, to overcome data scarcity and privacy constraints, researchers use these models for large-scale data augmentation, while also tackling the challenges of modality-alignment efficiency and inference acceleration during generation.
39 related references in total
The utilization of synthetic data for fingerprint recognition has garnered increased attention due to its potential to alleviate privacy concerns surrounding sensitive biometric data. However, current methods for generating fingerprints have limitations in creating impressions of the same finger with useful intra-class variations. To tackle this challenge, we present GenPrint, a framework to produce fingerprint images of various types while maintaining identity and offering humanly understandable control over different appearance factors, such as fingerprint class, acquisition type, sensor device, and quality level. Unlike previous fingerprint generation approaches, GenPrint is not confined to replicating style characteristics from the training dataset alone: it enables the generation of novel styles from unseen devices without requiring additional fine-tuning. To accomplish these objectives, we developed GenPrint using latent diffusion models with multimodal conditions (text and image) for consistent generation of style and identity. Our experiments leverage a variety of publicly available datasets for training and evaluation. Results demonstrate the benefits of GenPrint in terms of identity preservation, explainable control, and universality of generated images. Importantly, training with GenPrint-generated images yields comparable or even superior accuracy to training solely on real data and further enhances performance when used to augment the diversity of existing real fingerprint datasets.
The majority of contemporary fingerprint synthesis is based on the Generative Adversarial Network (GAN). Recently, the Denoising Diffusion Probabilistic Model (DDPM) has been demonstrated to be more effective than GAN in numerous scenarios, particularly in terms of diversity and fidelity. This research develops a model based on the enhanced DDPM for fingerprint generation. Specifically, the image is decomposed into sub-images of varying frequency sub-bands through the use of a wavelet packet transform (WPT). This method enables DDPM to operate at a more local and detailed level, thereby accurately obtaining the characteristics of the data. Furthermore, a polynomial noise schedule has been designed to replace the linear noise strategy, which can result in a smoother noise addition process. Experiments based on multiple metrics on the datasets SOCOFing and NIST4 demonstrate that the proposed model is superior to existing models.
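For readers unfamiliar with the polynomial noise schedule mentioned above, the sketch below shows a generic polynomial interpolation between the usual beta endpoints; the endpoint values and the exponent are illustrative assumptions, not the settings used in the paper.

```python
import torch

def polynomial_beta_schedule(timesteps: int, beta_start: float = 1e-4,
                             beta_end: float = 0.02, power: float = 2.0) -> torch.Tensor:
    """Illustrative polynomial noise schedule: interpolate beta_t between the two
    endpoints along a polynomial curve instead of a straight line, giving a
    smoother ramp-up of noise early in the forward process."""
    t = torch.linspace(0.0, 1.0, timesteps)
    return beta_start + (beta_end - beta_start) * t.pow(power)

betas = polynomial_beta_schedule(1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t used in q(x_t | x_0)
```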
This study explores the generation of synthesized fingerprint images using Denoising Diffusion Probabilistic Models (DDPMs). The significant obstacles in collecting real biometric data, such as privacy concerns and the demand for diverse datasets, underscore the imperative for synthetic biometric alternatives that are both realistic and varied. Despite the strides made with Generative Adversarial Networks (GANs) in producing realistic fingerprint images, their limitations prompt us to propose DDPMs as a promising alternative. DDPMs are capable of generating images with increasing clarity and realism while maintaining diversity. Our results reveal that DiffFinger not only competes with authentic training set data in quality but also provides a richer set of biometric data, reflecting true-to-life variability. These findings mark a promising stride in biometric synthesis, showcasing the potential of DDPMs to advance the landscape of fingerprint identification and authentication systems.
In this paper, we propose an efficient, fast, and versatile distillation method to accelerate the generation of pre-trained diffusion models. The method reaches state-of-the-art performance in terms of FID and CLIP-Score for few-step image generation on the COCO2014 and COCO2017 datasets, while requiring only a few GPU hours of training and fewer trainable parameters than existing methods. In addition to its efficiency, the versatility of the method is also demonstrated across several tasks such as *text-to-image*, *inpainting*, *face-swapping*, *super-resolution* and using different backbones such as UNet-based denoisers (SD1.5, SDXL), DiT (Pixart) and MMDiT (SD3), as well as adapters. In all cases, the method drastically reduces the number of sampling steps while maintaining very high-quality image generation.
Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.
Generating synthetic datasets for training face recognition models is challenging because dataset generation entails more than creating high fidelity images. It involves generating multiple images of the same subjects under different factors (e.g., variations in pose, illumination, expression, aging and occlusion) that follow the real image conditional distribution. Previous works have studied the generation of synthetic datasets using GAN or 3D models. In this work, we approach the problem from the aspect of combining subject appearance (ID) and external factor (style) conditions. These two conditions provide a direct way to control the inter-class and intra-class variations. To this end, we propose a Dual Condition Face Generator (DCFace) based on a diffusion model. Our novel patch-wise style extractor and time-step dependent ID loss enable DCFace to consistently produce face images of the same subject under different styles with precise control. Face recognition models trained on synthetic images from the proposed DCFace provide higher verification accuracies compared to previous works by 6.11% on average in 4 out of 5 test datasets: LFW, CFP-FP, CPLFW, AgeDB and CALFW.
Fingerprints have been crucial evidence for law enforcement agencies for a long time. Though rapidly developing deep learning has dramatically improved the performance of latent fingerprint recognition algorithms, a fully automated latent fingerprint identification system is still far from meeting actual needs. One major issue is the lack of publicly available latent fingerprint databases. Recently, diffusion probabilistic models have emerged as state-of-the-art deep generative methods for image synthesis. These models have better distribution coverage and less mode collapse than the popular Generative Adversarial Networks. In this paper, we propose an end-to-end latent fingerprint synthesis approach based on the improved denoising diffusion probabilistic model. The proposed approach can simultaneously generate latent, rolled, and plain fingerprints of high visual realism. Several primary degradation factors, such as various background textures, limited area of ridge patterns, and structural noise, can be directly generated without any postprocessing, unlike existing methods. We conduct NFIQ2 and perceptual analysis in the experiments to evaluate the proposed approach. The results indicate that the quality and visual realism of the proposed synthetic fingerprints are similar to those of natural ones, demonstrating the effectiveness of our approach.
We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: the increase in model parameters is mainly due to more attention blocks and a larger cross-attention context, as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve new state of the art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
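To make the latent-space formulation concrete, here is a minimal, library-agnostic sampling sketch: the reverse diffusion runs entirely in the latent space of a pretrained autoencoder, and the decoder is called once at the end. The `decoder`, `denoiser`, and `betas` names are placeholders for this sketch, not a specific API.

```python
import torch

@torch.no_grad()
def sample_ldm(decoder, denoiser, betas, cond, latent_shape, device="cpu"):
    """Minimal latent-diffusion sampling sketch with DDPM-style ancestral updates."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    z = torch.randn(latent_shape, device=device)             # start from Gaussian latent noise
    for t in reversed(range(len(betas))):
        eps = denoiser(z, t, cond)                            # e.g. cross-attention on text features
        a_t, a_bar = alphas[t], alphas_cumprod[t]
        z = (z - (1 - a_t) / (1 - a_bar).sqrt() * eps) / a_t.sqrt()  # posterior mean
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)     # add noise except at the last step
    return decoder(z)                                         # map the latent back to pixel space
```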
No abstract available
Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative text-to-image generation. However, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers. This approach modifies objects or object properties in images by manipulating feature components in attention layers during the generation process. Nevertheless, little is known about the semantic meanings that these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper, we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information, which can result in editing failures. In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the source image during the transformation to the target image. Our analysis offers valuable insights into understanding cross and self-attention mechanisms in diffusion models. Furthermore, based on our findings, we propose a simplified, yet more stable and efficient, tuning-free procedure that modifies only the self-attention maps of specified attention layers during the denoising process. Experimental results show that our simplified method consistently surpasses the performance of popular approaches on multiple datasets. Source code and datasets are available at https://github.com/alibaba/EasyNLP/tree/master/diffusion/FreePromptEditing.
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.
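A minimal sketch of the zero-convolution idea described above; the channel sizes and the exact merge point are simplifications assumed for illustration.

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution whose weights and bias start at zero, so a newly attached
    control branch initially contributes nothing and the frozen backbone's
    behavior is preserved at the start of fine-tuning."""
    def __init__(self, channels: int):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

# Illustrative merge of a control branch into a frozen encoder block:
#   feature = frozen_block(x) + zero_conv(control_block(x, control_image))
```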
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at this https URL
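For reference, the simplified noise-prediction objective behind this abstract can be sketched as follows, assuming an epsilon-prediction network `model(x_t, t)` and a precomputed cumulative-product schedule `alphas_cumprod`:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alphas_cumprod):
    """Simplified DDPM training step: add noise at a random timestep and
    regress the network output onto that noise."""
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward process q(x_t | x_0)
    return F.mse_loss(model(x_t, t), noise)                  # the "simple" weighted bound
```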
Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
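A single deterministic DDIM update (the eta = 0 case) can be sketched as below; the `model(x_t, t)` epsilon-prediction interface is an assumption of this sketch.

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alphas_cumprod):
    """One deterministic DDIM update: predict x_0 from the current noise estimate,
    then jump directly to the earlier timestep without injecting fresh noise."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else x_t.new_tensor(1.0)
    eps = model(x_t, t)
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
```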
Classifier guidance is a recently introduced method to trade off mode coverage and sample fidelity in conditional diffusion models post training, in the same spirit as low temperature sampling or truncation in other types of generative models. Classifier guidance combines the score estimate of a diffusion model with the gradient of an image classifier and thereby requires training an image classifier separate from the diffusion model. It also raises the question of whether guidance can be performed without a classifier. We show that guidance can be indeed performed by a pure generative model without such a classifier: in what we call classifier-free guidance, we jointly train a conditional and an unconditional diffusion model, and we combine the resulting conditional and unconditional score estimates to attain a trade-off between sample quality and diversity similar to that obtained using classifier guidance.
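The classifier-free guidance combination itself is a one-liner; the sketch below assumes a conditional model that accepts `cond=None` for the unconditional branch (conditioning dropped during training):

```python
def cfg_epsilon(model, x_t, t, cond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional toward the
    conditional noise estimate by the guidance scale."""
    eps_uncond = model(x_t, t, None)
    eps_cond = model(x_t, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```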
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128$\times$128, 4.59 on ImageNet 256$\times$256, and 7.72 on ImageNet 512$\times$512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256$\times$256 and 3.85 on ImageNet 512$\times$512. We release our code at https://github.com/openai/guided-diffusion
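For comparison with classifier-free guidance, the classifier-guided variant described here shifts the noise estimate using the gradient of a separately trained, noise-aware classifier; a hedged sketch (the `classifier(x_t, t)` interface is an assumption) follows.

```python
import torch

def classifier_guided_epsilon(eps, x_t, t, y, classifier, alphas_cumprod, scale=1.0):
    """Shift an epsilon estimate with the gradient of log p(y | x_t) from a classifier
    trained on noisy images; the scale trades diversity for fidelity."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[range(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]
    return eps - scale * (1 - alphas_cumprod[t]).sqrt() * grad
```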
Large text-to-image models achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts. In this work, we present a new approach for “personalization” of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views and lighting conditions that do not appear in the reference images. We apply our technique to several previously-unassailable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features. We also provide a new dataset and evaluation protocol for this new task of subject-driven generation. Project page: https://dreambooth.github.io/
We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops—through increased transformer depth/width or increased number of input tokens—consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.
Denoising diffusion probabilistic models (DDPM) are a class of generative models which have recently been shown to produce excellent samples. We show that with a few simple modifications, DDPMs can also achieve competitive log-likelihoods while maintaining high sample quality. Additionally, we find that learning variances of the reverse diffusion process allows sampling with an order of magnitude fewer forward passes with a negligible difference in sample quality, which is important for the practical deployment of these models. We additionally use precision and recall to compare how well DDPMs and GANs cover the target distribution. Finally, we show that the sample quality and likelihood of these models scale smoothly with model capacity and training compute, making them easily scalable. We release our code at https://github.com/openai/improved-diffusion
Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.
Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes:"an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter to achieve image prompt capability for the pretrained text-to-image diffusion models. The key design of our IP-Adapter is decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve comparable or even better performance to a fully fine-tuned image prompt model. As we freeze the pretrained diffusion model, the proposed IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt can also work well with the text prompt to achieve multimodal image generation. The project page is available at \url{https://ip-adapter.github.io}.
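The decoupled cross-attention described above can be sketched as a module with a second, trainable key/value path for the image features whose output is added to the text attention output; head splitting and exact dimensions are omitted and are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Single-head sketch: the original text cross-attention path plus a separate,
    trainable key/value projection for image-prompt features."""
    def __init__(self, dim, text_dim, image_dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k_text = nn.Linear(text_dim, dim)
        self.to_v_text = nn.Linear(text_dim, dim)
        self.to_k_img = nn.Linear(image_dim, dim)    # new, trainable
        self.to_v_img = nn.Linear(image_dim, dim)    # new, trainable

    def forward(self, x, text_feats, image_feats, scale=1.0):
        q = self.to_q(x)
        out_text = F.scaled_dot_product_attention(q, self.to_k_text(text_feats),
                                                  self.to_v_text(text_feats))
        out_img = F.scaled_dot_product_attention(q, self.to_k_img(image_feats),
                                                 self.to_v_img(image_feats))
        return out_text + scale * out_img            # sum the two attention outputs
```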
We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: learning in discrete or continuous time, the objective function, the interpolant that connects the distributions, and deterministic or stochastic sampling. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 and 512x512 benchmark using the exact same model structure, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06 and 2.62, respectively.
Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5$\times$, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.
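The REPA-style regularizer reduces to a projected-feature similarity term; a hedged sketch follows, where the projector architecture, the choice of layer, and the loss weighting are assumptions of this sketch rather than the paper's exact recipe.

```python
import torch.nn.functional as F

def representation_alignment_loss(hidden_states, target_feats, projector):
    """Align projected denoiser hidden states with features of the clean image
    from a frozen pretrained encoder (negative cosine similarity)."""
    pred = F.normalize(projector(hidden_states), dim=-1)
    target = F.normalize(target_feats, dim=-1)
    return -(pred * target).sum(dim=-1).mean()
```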
Masked (or absorbing) diffusion is actively explored as an alternative to autoregressive models for generative modeling of discrete data. However, existing work in this area has been hindered by unnecessarily complex model formulations and unclear relationships between different perspectives, leading to suboptimal parameterization, training objectives, and ad hoc adjustments to counteract these issues. In this work, we aim to provide a simple and general framework that unlocks the full potential of masked diffusion models. We show that the continuous-time variational objective of masked diffusion models is a simple weighted integral of cross-entropy losses. Our framework also enables training generalized masked diffusion models with state-dependent masking schedules. When evaluated by perplexity, our models trained on OpenWebText surpass prior diffusion language models at GPT-2 scale and demonstrate superior performance on 4 out of 5 zero-shot language modeling tasks. Furthermore, our models vastly outperform previous discrete diffusion models on pixel-level image modeling, achieving 2.75 (CIFAR-10) and 3.40 (ImageNet 64x64) bits per dimension that are better than autoregressive models of similar sizes. Our code is available at https://github.com/google-deepmind/md4.
We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.
We present novel approaches involving generative adversarial networks and diffusion models in order to synthesize high quality, live and spoof fingerprint images while preserving features such as uniqueness and diversity. We generate live fingerprints from noise with a variety of methods, and we use image translation techniques to translate live fingerprint images to spoof. To generate different types of spoof images based on limited training data we incorporate style transfer techniques through a cycle autoencoder equipped with a Wasserstein metric along with Gradient Penalty (CycleWGAN-GP) in order to avoid mode collapse and instability. We find that when the spoof training data includes distinct spoof characteristics, it leads to improved live-to-spoof translation. We assess the diversity and realism of the generated live fingerprint images mainly through the Fréchet Inception Distance (FID) and the False Acceptance Rate (FAR). Our best diffusion model achieved a FID of 15.78. The comparable WGAN-GP model achieved slightly higher FID while performing better in the uniqueness assessment due to a slightly lower FAR when matched against the training data, indicating better creativity. Moreover, we give example images showing that a DDPM model clearly can generate realistic fingerprint images.
The progress of fingerprint recognition applications encounters substantial hurdles due to privacy and security concerns, leading to limited fingerprint data availability and stringent data quality requirements. This article endeavors to tackle the challenges of data scarcity and data quality in fingerprint recognition by implementing data augmentation techniques. Specifically, this research employed two state-of-the-art generative models in the domain of deep learning, namely Deep Convolutional Generative Adversarial Network (DCGAN) and the Diffusion model, for fingerprint data augmentation. Generative Adversarial Network (GAN), as a popular generative model, effectively captures the features of sample images and learns the diversity of the sample images, thereby generating realistic and diverse images. DCGAN, as a variant model of traditional GAN, inherits the advantages of GAN while alleviating issues such as blurry images and mode collapse, resulting in improved performance. On the other hand, Diffusion, as one of the most popular generative models in recent years, exhibits outstanding image generation capabilities and surpasses traditional GAN in some image generation tasks. The experimental results demonstrate that both DCGAN and Diffusion can generate clear, high-quality fingerprint images, fulfilling the requirements of fingerprint data augmentation. Furthermore, through the comparison between DCGAN and Diffusion, it is concluded that the quality of fingerprint images generated by DCGAN is superior to the results of Diffusion, and DCGAN exhibits higher efficiency in both training and generating images compared to Diffusion.
Fingerprint reconstruction from minutiae performed by model-based approaches often leads to fingerprint patterns that lack realism. In contrast, data-driven reconstruction leads to realistic fingerprints, but the reproduction of a fingerprint’s identity remains a challenging problem. In this paper, we examine the pix2pix network as a fit for the reconstruction of realistic high-quality fingerprint images from minutiae maps. For encoding minutiae in minutiae maps we propose directed line and pointing minutiae approaches. We extend the pix2pix architecture to process complete plain fingerprints at their native resolution. Although our focus is on biometric fingerprints, the same concept fits the synthesis of latent fingerprints. We train models based on real and synthetic datasets and compare their performances regarding realistic appearance of generated fingerprints and reconstruction success. Our experiments establish pix2pix to be a valid and scalable solution. Reconstruction from minutiae enables identity-aware generation of synthetic fingerprints, which in turn enables compilation of large-scale privacy-friendly synthetic fingerprint datasets including mated impressions.
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
Diffusion models have demonstrated remarkable performance in generating unimodal data across various tasks, including image, video, and text generation. On the contrary, the joint generation of multimodal data through diffusion models is still in the early stages of exploration. Existing approaches heavily rely on external preprocessing protocols, such as tokenizers and variational autoencoders, to harmonize varied data representations into a unified, unimodal format. This process heavily demands the high accuracy of encoders and decoders, which can be problematic for applications with limited data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously. We empirically validate our approach for text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose \textbf{Temperature-Adjusted Cross-modal Attention (TACA)}, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our codes are publicly available at \href{https://github.com/Vchitect/TACA}
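Without access to the paper's exact formulation, a purely illustrative sketch of temperature-scaled, timestep-dependent cross-modal attention might look like the following; the schedule, constants, and single-head layout are hypothetical.

```python
import torch

def temperature_scaled_cross_attention(q, k_text, v_text, t, base_temp=1.2, decay=0.5):
    """Hypothetical sketch: rebalance attention toward text tokens by scaling the
    logits with a temperature; here the boost is strongest at low-noise timesteps."""
    temp = 1.0 + (base_temp - 1.0) * torch.exp(torch.tensor(-decay * float(t)))
    logits = q @ k_text.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    attn = torch.softmax(temp * logits, dim=-1)
    return attn @ v_text
```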
Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster
In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We propose LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly.
With the development of diffusion models, enhancing spatial controllability in text-to-image generation has become a vital challenge. As a representative task for addressing this challenge, layout-to-image generation aims to generate images that are spatially consistent with the given layout condition. Existing layout-to-image methods typically introduce the layout condition by integrating adapter modules into the base generative model. However, the generated images often exhibit low visual quality and stylistic inconsistency with the base model, indicating a loss of pretrained knowledge. To alleviate this issue, we construct the Layout Synthesis (LaySyn) dataset, which leverages images synthesized by the base model itself to mitigate the distribution shift from the pretraining data. Moreover, we propose the Layout Control (Laytrol) Network, in which parameters are inherited from MM-DiT to preserve the pretrained knowledge of the base model. To effectively activate the copied parameters and avoid disturbance from unstable control conditions, we adopt a dedicated initialization scheme for Laytrol. In this scheme, the layout encoder is initialized as a pure text encoder to ensure that its output tokens remain within the data domain of MM-DiT. Meanwhile, the outputs of the layout control network are initialized to zero. In addition, we apply Object-level Rotary Position Embedding to the layout tokens to provide coarse positional information. Qualitative and quantitative experiments demonstrate the effectiveness of our method.
This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the $\textbf{LLM decoder}$ shares the same input feature space with $\textbf{diffusion decoders}$ that use the corresponding $\textbf{LLM encoder}$ for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.
Due to its distinctive texture and intricate details, palmprint has emerged as a critical modality in biometric identity recognition. The absence of large-scale public palmprint datasets has substantially impeded the advancement of palmprint research, resulting in inadequate accuracy in commercial palmprint recognition systems. Moreover, existing generative methods exhibit insufficient generalization, as the images they generate differ in specific ways from the conditional images. This paper proposes a method for generating palmprint images using a controllable diffusion model (PalmDiff), which addresses the issue of insufficient datasets by generating palmprint data, improving the accuracy of palmprint recognition. We introduce a diffusion process that effectively tackles the problems of excessive noise and loss of texture details commonly encountered in diffusion models. A linear attention mechanism is employed to enhance the backbone’s expressive capacity and reduce the computational complexity. In addition, we propose an ID loss function that enables the diffusion model to consistently generate palmprint images within the same identity space. PalmDiff is compared with other generation methods in terms of both image quality and the enhancement of palmprint recognition performance. Experiments show that PalmDiff performs well in image generation, with an FID score of 13.311 on MPD and 18.434 on Tongji. Moreover, PalmDiff significantly improves various backbones for palmprint recognition compared with other generation methods.
Palmprint recognition is significantly limited by the lack of large-scale publicly available datasets. Previous methods have adopted Bézier curves to simulate the palm creases, which then serve as input for conditional GANs to generate realistic palmprints. However, without employing real data fine-tuning, the performance of the recognition model trained on these synthetic datasets would drastically decline, indicating a large gap between generated and real palmprints. This is primarily due to the utilization of an inaccurate palm crease representation and challenges in balancing intra-class variation with identity consistency. To address this, we introduce a polynomial-based palm crease representation that provides a new palm crease generation mechanism more closely aligned with the real distribution. We also propose the palm creases conditioned diffusion model with a novel intra-class variation control method. By applying our proposed K-step noise-sharing sampling, we are able to synthesize palmprint datasets with large intra-class variation and high identity consistency. Experimental results show that, for the first time, recognition models trained solely on our synthetic datasets, without any fine-tuning, outperform those trained on real datasets. Furthermore, our approach achieves superior recognition performance as the number of generated identities increases.
Training fingerprint recognition models using synthetic data has recently gained increased attention in the biometric community as it alleviates the dependency on sensitive personal data. Existing approaches for fingerprint generation are limited in their ability to generate diverse impressions of the same finger, a key property for providing effective data for training recognition models. To address this gap, we present FPGAN-Control, an identity preserving image generation framework which enables control over the fingerprint’s image appearance (e.g., fingerprint type, acquisition device, pressure level) of generated fingerprints. We introduce a novel appearance loss that encourages disentanglement between the fingerprint’s identity and appearance properties. In our experiments, we used the publicly available NIST SD302 (N2N) dataset for training the FPGAN-Control model. We demonstrate the merits of FPGAN-Control, both quantitatively and qualitatively, in terms of identity preservation level, degree of appearance control, and low synthetic-to-real domain gap. Finally, training recognition models using only synthetic datasets generated by FPGAN-Control leads to recognition accuracies that are on par with or even surpass models trained using real data. To the best of our knowledge, this is the first work to demonstrate this.