[1] GUAN Fengxu, ZHANG Hanyu, LU Siqi, et al. Research status of diffusion models in computer vision[J]. CAAI Transactions on Intelligent Systems, 2025, 20(2): 265-282. [doi:10.11992/tis.202312041]

Research status of diffusion models in computer vision

CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP] Volume: 20 Issue: 2025, No. 2 Pages: 265-282 Column: Review Publication date: 2025-03-05

Title: Research status of diffusion models in computer vision

Author(s): GUAN Fengxu; ZHANG Hanyu; LU Siqi; LAI Haitao; DU Xue; ZHENG Yan (College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China)

Keywords: diffusion model; denoising diffusion probabilistic model; score-based generative model; deep learning; computer vision; image generation; generative model; generative adversarial network

CLC: TP18

DOI: 10.11992/tis.202312041

Abstract: The diffusion model is a recent class of generative model inspired by molecular thermodynamics. It trains stably and depends little on model settings, which has made it a popular benchmark method in computer vision. In recent years, diffusion models have been applied to a wide range of tasks, producing more diverse and higher-quality results than traditional generative models, and they are now a prominent method in the field. This paper provides a comprehensive overview of diffusion models to further stimulate their development in this domain. First, it compares the advantages and disadvantages of diffusion models with those of other generative models and introduces the underlying mathematical principles. It then reviews recent efforts to improve diffusion models, organized around common challenges, and highlights application examples in various visual tasks. Finally, it discusses open issues with diffusion models and outlines potential future development trends.

Last Update: 2025-03-05