Abstract
Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which makes it difficult to meet specific requirements of downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net with human rewards through reinforcement learning or direct backpropagation. However, most of these methods overlook the importance of the text encoder, which is typically pretrained and kept fixed during training. In this paper, we demonstrate that finetuning the text encoder through reinforcement learning enhances the text-image alignment of the results and thereby improves visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While finetuning the U-Net can partially improve performance, it still suffers from the suboptimal text encoder. We therefore propose TexForce, which uses reinforcement learning with low-rank adaptation to finetune the text encoder based on task-specific rewards. We first show that finetuning the text encoder can improve the performance of diffusion models. We then illustrate that TexForce can be simply combined with existing U-Net finetuned models to obtain much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images.
C. Chen and A. Wang: These authors contributed equally to this work.
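To make the setup concrete, the sketch below illustrates the moving parts the abstract describes: low-rank adapters attached to the CLIP text encoder of a latent diffusion pipeline, image generation, and a task-specific reward computed on the results. This is a minimal illustration, not the authors' implementation: the Stable Diffusion v1.5 checkpoint, the use of CLIP image-text similarity as a stand-in reward, and the LoRA hyperparameters are all assumptions, and the PPO-style update that would adjust the LoRA parameters against the reward is omitted.

```python
# Minimal sketch (not the authors' code) of the components TexForce operates on:
# LoRA adapters on the text encoder of a Stable Diffusion pipeline plus a task
# reward on the generated images. The RL update over the adapters is omitted.
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load a base text-to-image pipeline (assumed checkpoint: SD v1.5).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
).to(device)

# 2. Attach low-rank adapters to the text encoder only; the U-Net stays frozen.
lora_cfg = LoraConfig(r=4, lora_alpha=4, target_modules=["q_proj", "v_proj"])
pipe.text_encoder = get_peft_model(pipe.text_encoder, lora_cfg)

# 3. Generate images for a prompt; only the LoRA weights would be trainable.
prompt = "a photo of a red book on a green table"
images = pipe(prompt, num_inference_steps=30, num_images_per_prompt=2).images

# 4. Score the images with a task-specific reward. Here, CLIP image-text
#    similarity serves as a simple proxy for text-image alignment.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = proc(text=[prompt], images=images, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    rewards = clip(**inputs).logits_per_image.squeeze(-1)  # higher = better alignment
print(rewards)
```

In the same spirit, the abstract's claim that TexForce combines with existing U-Net finetuned models without additional training would, in a sketch like this, correspond to loading a separately trained U-Net adapter into the same pipeline alongside the text-encoder adapters.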
Acknowledgments
This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, C. et al. (2025). Enhancing Diffusion Models with Text-Encoder Reinforcement Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72697-2
Online ISBN: 978-3-031-72698-9
eBook Packages: Computer Science, Computer Science (R0)