Research Article | Free Access | Just Accepted

Prompt-Based Modality Bridging for Unified Text-to-Face Generation and Manipulation

Online AM: 24 September 2024

Abstract

Text-driven face image generation and manipulation are significant tasks, yet both remain challenging due to the gap between the text and image modalities. Existing methods are difficult to apply to both problems because they are usually designed for a single task, which limits their use in real-world scenarios. To address both problems in one framework, we propose a Unified Prompt-based Cross-Modal Framework (UPCM-Frame) that bridges the gap between the text and image modalities using two large-scale pre-trained models, CLIP and StyleGAN. The proposed framework consists of two main modules: a Text Embedding-to-Image Embedding projection module built on a special prompt embedding pair, and a projection module that maps Image Embeddings to semantically aligned StyleGAN Embeddings, which can be used for both image generation and manipulation. Thanks to the large-scale pre-trained models, the framework can handle complicated descriptions and produce high-quality results. To demonstrate the effectiveness of the proposed method on the two tasks, we evaluate its results both quantitatively and qualitatively.
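
The paper itself specifies the two projection modules; as a rough illustration of how such a prompt-based bridge could be wired together, the sketch below pairs a CLIP-style text-to-image embedding translation with a projection into a StyleGAN W+ latent. All module names, dimensions, and the offset-based translation rule are assumptions made for illustration, not the authors' released code.

```python
# Illustrative sketch only: module names, dimensions, and the prompt-pair
# translation rule are assumptions, not the authors' released implementation.
import torch
import torch.nn as nn

CLIP_DIM = 512        # CLIP ViT-B/32 embedding size (assumed)
STYLE_LAYERS = 18     # StyleGAN2 W+ layers at 1024x1024 resolution
STYLE_DIM = 512       # StyleGAN2 per-layer latent size


class TextToImageEmbedding(nn.Module):
    """Projects a CLIP text embedding toward the CLIP image-embedding space,
    guided by a fixed prompt embedding pair (t0, i0): the same concept encoded
    once by the text encoder and once by the image encoder. The offset i0 - t0
    acts as a modality bridge; a small MLP refines the shifted embedding
    (our reading of the module, hypothetical design)."""

    def __init__(self, t0: torch.Tensor, i0: torch.Tensor):
        super().__init__()
        self.register_buffer("offset", i0 - t0)
        self.refine = nn.Sequential(
            nn.Linear(CLIP_DIM, CLIP_DIM), nn.LeakyReLU(0.2),
            nn.Linear(CLIP_DIM, CLIP_DIM),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.refine(text_emb + self.offset)


class ImageEmbeddingToStyle(nn.Module):
    """Maps a CLIP image embedding to a semantically aligned StyleGAN W+
    code: one 512-d vector per generator layer."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(CLIP_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, STYLE_LAYERS * STYLE_DIM),
        )

    def forward(self, img_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(img_emb).view(-1, STYLE_LAYERS, STYLE_DIM)


# Usage sketch with random stand-ins for the frozen CLIP encoders' outputs.
t0, i0 = torch.randn(CLIP_DIM), torch.randn(CLIP_DIM)
bridge, to_style = TextToImageEmbedding(t0, i0), ImageEmbeddingToStyle()
text_emb = torch.randn(1, CLIP_DIM)    # CLIP text encoder output (stand-in)
w_plus = to_style(bridge(text_emb))    # would feed StyleGAN2 synthesis
print(w_plus.shape)                    # torch.Size([1, 18, 512])
```

For manipulation rather than generation, the same W+ projection could in principle be applied to an edited image embedding, so that one latent interface serves both tasks.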

      Published In

ACM Transactions on Multimedia Computing, Communications, and Applications (Just Accepted)
EISSN: 1551-6865
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Online AM: 24 September 2024
      Accepted: 31 July 2024
      Revised: 24 May 2024
      Received: 27 December 2023
