Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Text-to-image diffusion models are typically trained to optimize a log-likelihood objective, which makes it difficult to meet specific requirements of downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net with human rewards through reinforcement learning or direct backpropagation. However, many of these approaches overlook the importance of the text encoder, which is typically pretrained and kept fixed during training. In this paper, we demonstrate that fine-tuning the text encoder through reinforcement learning enhances the text-image alignment of the results, thereby improving visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While fine-tuning the U-Net can partially improve performance, it still suffers from the suboptimal text encoder. Therefore, we propose to use reinforcement learning with low-rank adaptation to fine-tune the text encoder based on task-specific rewards, a method we refer to as TexForce. We first show that fine-tuning the text encoder can improve the performance of diffusion models. Then, we illustrate that TexForce can simply be combined with existing U-Net-finetuned models to obtain much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images.
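
The recipe sketched in the abstract, attaching low-rank adapters to the text encoder and updating only those adapters with a task-specific reward, can be pictured with a short snippet. The code below is a minimal sketch under assumptions, not the authors' released implementation: it uses the standard Hugging Face diffusers and peft libraries to wrap the CLIP text encoder of Stable Diffusion v1.5 with LoRA while the U-Net and VAE stay frozen; the rank, target modules, and learning rate are illustrative values, and the reward model and policy-gradient loop are only indicated in comments.

    # Minimal sketch (assumptions noted in comments), not the authors' code.
    import torch
    from diffusers import StableDiffusionPipeline
    from peft import LoraConfig, get_peft_model

    # Checkpoint referenced in the paper's notes.
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

    # Attach LoRA adapters to the CLIP text encoder only; the U-Net and VAE stay frozen.
    lora_cfg = LoraConfig(
        r=8,            # low-rank dimension (assumed value)
        lora_alpha=16,  # LoRA scaling factor (assumed value)
        target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # CLIP attention projections
    )
    pipe.text_encoder = get_peft_model(pipe.text_encoder, lora_cfg)
    pipe.text_encoder.print_trainable_parameters()  # only adapter weights require gradients

    # A reinforcement-learning loop in the spirit of DDPO/DPOK would then repeatedly
    # (1) sample images for a batch of prompts, (2) score them with a task-specific
    # reward model (aesthetics, text-image alignment, ...), and (3) take a
    # policy-gradient step that updates only these LoRA parameters via the optimizer below.
    optimizer = torch.optim.AdamW(
        (p for p in pipe.text_encoder.parameters() if p.requires_grad),
        lr=1e-5,  # assumed learning rate
    )

Because only the text-encoder adapters are trained, they can in principle be loaded alongside a separately finetuned U-Net, which is how the abstract describes combining TexForce with existing U-Net-finetuned models without additional training.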

C. Chen and A. Wang: These authors contributed equally to this work.

Notes

  1. https://huggingface.co/runwayml/stable-diffusion-v1-5.

  2. https://huggingface.co/stabilityai/stable-diffusion-2-1.

Acknowledgments

This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

Author information

Corresponding author

Correspondence to Weisi Lin.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 101722 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, C. et al. (2025). Enhancing Diffusion Models with Text-Encoder Reinforcement Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_11

  • DOI: https://doi.org/10.1007/978-3-031-72698-9_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72697-2

  • Online ISBN: 978-3-031-72698-9

  • eBook Packages: Computer Science, Computer Science (R0)
