Abstract
Synthesizing a complex scene image with multiple objects and a background from a text description is a challenging problem: it requires solving several difficult tasks spanning natural language processing and computer vision. We model it as a combination of semantic entity recognition, object retrieval and recombination, and optimization of the objects' states (positions and sizes). To reach a satisfactory result, we propose a comprehensive pipeline that converts the input text into its visual counterpart. The pipeline comprises text processing, foreground object and background scene retrieval, image synthesis using constrained MCMC, and post-processing. First, we divide the entities parsed from the input text into foreground objects and background scenes. Second, we retrieve the required foreground objects from a foreground object dataset segmented from the Microsoft COCO dataset, and an appropriate background scene image from a background image dataset collected from the Internet. Third, to ensure plausible positions and sizes of the foreground objects in the image synthesis step, we design a cost function and use the Markov chain Monte Carlo (MCMC) method as the optimizer to solve this constrained layout problem. Finally, to make the composite look natural and harmonious, we blend the foreground objects into the background scene image with Poisson-based and relighting-based methods in the post-processing step. Synthesis and comparison results on the Microsoft COCO dataset show that our method outperforms several state-of-the-art methods based on generative adversarial networks (GANs) in the visual quality of the generated scene images.
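The constrained layout step is the algorithmic core of the pipeline, and the abstract describes it only at a high level. As a minimal sketch of the general idea, the Python snippet below runs a Metropolis-Hastings sampler over object bounding boxes; the cost terms (pairwise overlap and out-of-canvas penalties) and all parameters are illustrative assumptions, not the paper's actual cost function, which also encodes constraints derived from the text.

```python
import math
import random

def layout_cost(boxes, canvas_w, canvas_h):
    """Toy cost over axis-aligned boxes (x, y, w, h): penalize pairwise
    overlap and any part of a box leaving the canvas. A stand-in for the
    paper's full cost function."""
    cost = 0.0
    for i, (x, y, w, h) in enumerate(boxes):
        # Out-of-canvas penalty.
        cost += max(0.0, -x) + max(0.0, x + w - canvas_w)
        cost += max(0.0, -y) + max(0.0, y + h - canvas_h)
        # Pairwise overlap penalty (overlap area).
        for x2, y2, w2, h2 in boxes[i + 1:]:
            ox = max(0.0, min(x + w, x2 + w2) - max(x, x2))
            oy = max(0.0, min(y + h, y2 + h2) - max(y, y2))
            cost += ox * oy
    return cost

def mcmc_layout(boxes, canvas_w, canvas_h, iters=5000, temp=50.0, step=10.0):
    """Metropolis-Hastings over box positions: perturb one box at a time,
    always accept downhill moves, and accept uphill moves with
    probability exp(-delta / temp)."""
    current = [list(b) for b in boxes]
    cost = layout_cost(current, canvas_w, canvas_h)
    for _ in range(iters):
        proposal = [b[:] for b in current]
        k = random.randrange(len(proposal))
        proposal[k][0] += random.gauss(0.0, step)  # perturb x
        proposal[k][1] += random.gauss(0.0, step)  # perturb y
        new_cost = layout_cost(proposal, canvas_w, canvas_h)
        delta = new_cost - cost
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current, cost = proposal, new_cost
    return current, cost

# Example: three boxes that initially overlap near the top-left corner.
boxes = [(50, 60, 120, 80), (40, 70, 100, 150), (200, 30, 90, 90)]
final_boxes, final_cost = mcmc_layout(boxes, canvas_w=640, canvas_h=480)
```

For the Poisson-based blending used in the post-processing step, OpenCV's cv2.seamlessClone provides an off-the-shelf implementation of Poisson image editing that could fill a similar role in practice.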
Cite this article
Fang, F., Luo, F., Zhang, HP. et al. A Comprehensive Pipeline for Complex Text-to-Image Synthesis. J. Comput. Sci. Technol. 35, 522–537 (2020). https://doi.org/10.1007/s11390-020-0305-9