Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3469877.3490567acmconferencesArticle/Chapter ViewAbstractPublication PagesmmConference Proceedingsconference-collections
short-paper

Multi-branch Semantic Learning Network for Text-to-Image Synthesis

Published: 10 January 2022 Publication History

Abstract

In this paper, we propose a multi-branch semantic learning network (MSLN) to generate image according to textual description by taking into account global and local textual semantics, which consists of two stages. The first stage generates a coarse-grained image based on the sentence features. In the second stage, a multi-branch fine-grained generation model is constructed to inject the sentence-level and word-level semantics into two coarse-grained images by global and local attention modules, which generate global and local fine-grained image textures, respectively. In particular, we devise a channel fusion module (CFM) to fuse the global and local fine-grained features in the multi-branch fine-grained stage and generate the output image. Extensive experiments conducted on the CUB-200 dataset and Oxford-102 dataset demonstrate the superior performance of the proposed method. (e.g., FID is reduced from 16.09 to 14.43 on CUB-200).

References

[1]
Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C Platt, 2015. From captions to visual concepts and back. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1473–1482.
[2]
Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. 2017. Semantic compositional networks for visual captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5630–5639.
[3]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017).
[4]
Sergey Ioffe and Christian Szegedy Batch Normalization. [n.d.]. Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167([n. d.]).
[5]
Justin Johnson, Agrim Gupta, and Li Fei-Fei. 2018. Image generation from scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1219–1228.
[6]
Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. 2019. Controllable text-to-image generation. arXiv preprint arXiv:1909.07083(2019).
[7]
Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Icml.
[8]
Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 722–729.
[9]
Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. 2019. Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1505–1514.
[10]
Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. 2016. Learning what and where to draw. Advances in neural information processing systems 29 (2016), 217–225.
[11]
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. Advances in neural information processing systems 29 (2016), 2234–2242.
[12]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1–9.
[13]
Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. 2020. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865(2020).
[14]
Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. 2011. The caltech-ucsd birds-200-2011 dataset. (2011).
[15]
Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1316–1324.
[16]
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. 2017. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision. 5907–5915.
[17]
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. 2018. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE transactions on pattern analysis and machine intelligence 41, 8(2018), 1947–1962.
[18]
Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. 2019. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5802–5810.

Index Terms

  1. Multi-branch Semantic Learning Network for Text-to-Image Synthesis
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia
      December 2021
      508 pages
      ISBN:9781450386074
      DOI:10.1145/3469877
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 10 January 2022

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. Feature fusion
      2. Global and local semantic
      3. Multi-branch networks
      4. Text-to-image Synthesis

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Conference

      MMAsia '21
      Sponsor:
      MMAsia '21: ACM Multimedia Asia
      December 1 - 3, 2021
      Gold Coast, Australia

      Acceptance Rates

      Overall Acceptance Rate 59 of 204 submissions, 29%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • 0
        Total Citations
      • 62
        Total Downloads
      • Downloads (Last 12 months)4
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 09 Nov 2024

      Other Metrics

      Citations

      View Options

      Get Access

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      HTML Format

      View this article in HTML Format.

      HTML Format

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media