Abstract
This paper addresses the object repetition issue in patch-wise higher-resolution image generation. We propose AccDiffusion, an accurate method for patch-wise higher-resolution image generation without training. An in-depth analysis in this paper reveals that an identical text prompt for different patches causes repeated object generation, while using no prompt compromises image details. Therefore, our AccDiffusion, for the first time, proposes to decouple the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of an image patch. Besides, AccDiffusion also introduces dilated sampling with window interaction for better global consistency in higher-resolution image generation. Experimental comparison with existing methods demonstrates that our AccDiffusion effectively addresses the issue of repeated object generation and leads to better performance in higher-resolution image generation. Our code is released at https://github.com/lzhxmu/AccDiffusion.
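The idea of decoupling one global prompt into patch-content-aware prompts can be sketched in a few lines: for each overlapping patch, keep an object word in that patch's prompt only if the object is actually visible there (e.g. judged from a cross-attention or segmentation mask). The sketch below is illustrative only and is not the paper's implementation; the function names, the binary mask, and the 5% coverage threshold are all assumptions made for this toy example.

```python
import numpy as np

def split_into_patches(h, w, patch, stride):
    """Return (top, left) coords of overlapping patches covering an h x w latent."""
    tops = list(range(0, max(h - patch, 0) + 1, stride))
    lefts = list(range(0, max(w - patch, 0) + 1, stride))
    if tops[-1] != h - patch:          # ensure the bottom edge is covered
        tops.append(h - patch)
    if lefts[-1] != w - patch:         # ensure the right edge is covered
        lefts.append(w - patch)
    return [(t, l) for t in tops for l in lefts]

def patch_prompt(mask, top, left, patch, word, base_prompt, thresh=0.05):
    """Keep `word` in this patch's prompt only if its mask covers enough of
    the patch; otherwise drop it to avoid generating the object again."""
    coverage = mask[top:top + patch, left:left + patch].mean()
    if coverage >= thresh:
        return base_prompt             # object visible here: keep full prompt
    return base_prompt.replace(word, "").replace("  ", " ").strip()

# Toy example: a 16x16 mask where an "astronaut" occupies the top-left quarter.
mask = np.zeros((16, 16))
mask[:8, :8] = 1.0
prompts = {
    (t, l): patch_prompt(mask, t, l, 8, "astronaut", "astronaut riding a horse")
    for (t, l) in split_into_patches(16, 16, 8, 8)
}
```

Only the top-left patch retains "astronaut" in its prompt; the other three patches fall back to the residual description, which is the behavior that suppresses repeated objects in the abstract's analysis.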
Acknowledgements
This work was supported by the National Science and Technology Major Project (No. 2022ZD0118202), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. U22B2051, No. U23A20383, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, No. 62002305 and No. 62272401), and the Natural Science Foundation of Fujian Province of China (No. 2022J06001).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lin, Z., Lin, M., Zhao, M., Ji, R. (2025). AccDiffusion: An Accurate Method for Higher-Resolution Image Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15064. Springer, Cham. https://doi.org/10.1007/978-3-031-72658-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72657-6
Online ISBN: 978-3-031-72658-3
eBook Packages: Computer Science, Computer Science (R0)