Abstract
This paper addresses the object repetition issue in patch-wise higher-resolution image generation. We propose AccDiffusion, an accurate method for patch-wise higher-resolution image generation without training. An in-depth analysis in this paper reveals that an identical text prompt for different patches causes repeated object generation, while using no prompt compromises image details. Therefore, our AccDiffusion, for the first time, proposes to decouple the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of an image patch. Besides, AccDiffusion also introduces dilated sampling with window interaction for better global consistency in higher-resolution image generation. Experimental comparison with existing methods demonstrates that our AccDiffusion effectively addresses the issue of repeated object generation and leads to better performance in higher-resolution image generation. Our code is released at https://github.com/lzhxmu/AccDiffusion.
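The idea of decoupling one global prompt into patch-content-aware prompts can be sketched in a few lines: for each overlapping patch, keep an object word in that patch's prompt only if the object is actually visible there (e.g. judged from a cross-attention or segmentation mask). The sketch below is illustrative only and is not the paper's implementation; the function names, the binary mask, and the 5% coverage threshold are all assumptions made for this toy example.

```python
import numpy as np

def split_into_patches(h, w, patch, stride):
    """Return (top, left) coords of overlapping patches covering an h x w latent."""
    tops = list(range(0, max(h - patch, 0) + 1, stride))
    lefts = list(range(0, max(w - patch, 0) + 1, stride))
    if tops[-1] != h - patch:          # ensure the bottom edge is covered
        tops.append(h - patch)
    if lefts[-1] != w - patch:         # ensure the right edge is covered
        lefts.append(w - patch)
    return [(t, l) for t in tops for l in lefts]

def patch_prompt(mask, top, left, patch, word, base_prompt, thresh=0.05):
    """Keep `word` in this patch's prompt only if its mask covers enough of
    the patch; otherwise drop it to avoid generating the object again."""
    coverage = mask[top:top + patch, left:left + patch].mean()
    if coverage >= thresh:
        return base_prompt             # object visible here: keep full prompt
    return base_prompt.replace(word, "").replace("  ", " ").strip()

# Toy example: a 16x16 mask where an "astronaut" occupies the top-left quarter.
mask = np.zeros((16, 16))
mask[:8, :8] = 1.0
prompts = {
    (t, l): patch_prompt(mask, t, l, 8, "astronaut", "astronaut riding a horse")
    for (t, l) in split_into_patches(16, 16, 8, 8)
}
```

Only the top-left patch retains "astronaut" in its prompt; the other three patches fall back to the residual description, which is the behavior that suppresses repeated objects in the abstract's analysis.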
Acknowledgements
This work was supported by the National Science and Technology Major Project (No. 2022ZD0118202), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. U22B2051, No. U23A20383, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, No. 62002305 and No. 62272401), and the Natural Science Foundation of Fujian Province of China (No. 2022J06001).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lin, Z., Lin, M., Zhao, M., Ji, R. (2025). AccDiffusion: An Accurate Method for Higher-Resolution Image Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15064. Springer, Cham. https://doi.org/10.1007/978-3-031-72658-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72657-6
Online ISBN: 978-3-031-72658-3
eBook Packages: Computer Science, Computer Science (R0)