DOI: 10.1145/3372278.3390735

Visual Story Ordering with a Bidirectional Writer

Published: 08 June 2020


This paper introduces visual story ordering, a challenging task in which images and text are ordered in a visual story jointly. We propose a neural network model based on the reader-processor-writer architecture with a self-attention mechanism. A novel bidirectional decoder is further proposed with bidirectional beam search. Experimental results show the effectiveness of the approach. The information gained from multimodal learning is presented and discussed. We also find that the proposed embedding narrows the distance between images and their corresponding story sentences, even though we do not align the two modalities explicitly. As it addresses a general issue in generative models, the proposed bidirectional inference mechanism applies to a variety of applications.
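
The abstract does not say how the writer scores candidates or how the two decoding directions are combined, so the following is only a toy sketch of the general idea of decoding an ordering from both ends, not the paper's model. It assumes a hypothetical pairwise matrix `score[i][j]` (how plausibly item j directly follows item i), runs one beam search left to right and one right to left, and keeps the higher-scoring ordering. The function names `beam_order` and `bidirectional_order` are illustrative.

```python
from typing import List, Sequence, Tuple


def beam_order(score: Sequence[Sequence[float]], beam: int,
               reverse: bool = False) -> Tuple[float, List[int]]:
    """Beam search over permutations of the items 0..n-1.

    score[i][j] is a hypothetical score for item j directly following item i.
    With reverse=False the sequence grows left to right; with reverse=True
    it grows right to left (new items are prepended).
    """
    n = len(score)
    # Every item is a possible starting point (or ending point, if reverse).
    beams = [(0.0, [i]) for i in range(n)]
    for _ in range(n - 1):
        candidates = []
        for total, seq in beams:
            for k in range(n):
                if k in seq:
                    continue  # each item appears exactly once
                if reverse:
                    candidates.append((total + score[k][seq[0]], [k] + seq))
                else:
                    candidates.append((total + score[seq[-1]][k], seq + [k]))
        # Keep only the `beam` best partial orderings.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam]
    return max(beams, key=lambda c: c[0])


def bidirectional_order(score: Sequence[Sequence[float]],
                        beam: int = 2) -> List[int]:
    """Decode in both directions and return the higher-scoring ordering."""
    forward = beam_order(score, beam, reverse=False)
    backward = beam_order(score, beam, reverse=True)
    return max(forward, backward, key=lambda c: c[0])[1]


# Toy scores (assumed): the coherent story order is 2 -> 0 -> 3 -> 1.
toy_score = [
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
order = bidirectional_order(toy_score, beam=2)
```

In this toy setting both directions recover the same ordering. With noisier scores the two searches can prune different prefixes, which is one reason to compare a forward and a backward pass.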







Published In

ICMR '20: Proceedings of the 2020 International Conference on Multimedia Retrieval
June 2020
605 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]



Association for Computing Machinery

New York, NY, United States


Author Tags

  1. multimodal modeling
  2. sentence ordering
  3. temporal information ordering
  4. visual-semantic representation


  • Short-paper

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%


Article Metrics

  • Total Citations: 0
  • Total Downloads: 67
  • Downloads (Last 12 months): 1
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 19 Feb 2025
