Combining Multiple Cues for Visual Madlibs Question Answering

Published in the International Journal of Computer Vision

Abstract

This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with candidate answers, to a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scores from nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering questions from a wide range of types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction.
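To make the pipeline concrete, the sketch below illustrates, under stated assumptions rather than as the authors' released implementation, how candidate answers can be scored against a single image cue in a normalized CCA (nCCA) embedding space and how per-cue scores can be fused. It assumes the cue-specific CCA projections `Wx`, `Wy` and canonical correlations `corr` have already been estimated, re-weights each embedding dimension by the canonical correlation raised to a power `p` (the normalization of Gong et al. 2014), ranks candidates by cosine similarity, and combines cues with a fixed weight vector standing in for the learned one. All names, dimensionalities, and weights are illustrative.

```python
import numpy as np

def ncca_score(img_feat, ans_feats, Wx, Wy, corr, p=4.0):
    """Score candidate answers against one image cue in a joint nCCA space.

    Wx, Wy : CCA projection matrices for the visual and textual views
             (assumed to be pre-learned for this cue).
    corr   : canonical correlations; each joint dimension is re-weighted
             by corr**p, following normalized CCA.
    """
    scale = corr ** p
    u = (img_feat @ Wx) * scale                       # embedded image cue
    V = (ans_feats @ Wy) * scale                      # one row per candidate answer
    u = u / (np.linalg.norm(u) + 1e-12)
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    return V @ u                                      # cosine similarity per candidate

def answer_question(cue_scores, cue_weights):
    """Pick the candidate maximizing the weighted sum of per-cue nCCA scores."""
    total = sum(w * s for w, s in zip(cue_weights, cue_scores))
    return int(np.argmax(total))

# Toy usage with random data (2 cues, 4 candidate answers).
rng = np.random.default_rng(0)
d_img, d_txt, d_joint, n_cand = 512, 300, 128, 4
cue_scores = []
for _ in range(2):
    Wx = rng.standard_normal((d_img, d_joint))
    Wy = rng.standard_normal((d_txt, d_joint))
    corr = rng.uniform(0.1, 1.0, d_joint)
    img_feat = rng.standard_normal(d_img)             # e.g. a scene or activity feature
    ans_feats = rng.standard_normal((n_cand, d_txt))  # embedded candidate answers
    cue_scores.append(ncca_score(img_feat, ans_feats, Wx, Wy, corr))
weights = [0.6, 0.4]                                  # hypothetical cue weights
print("selected answer index:", answer_question(cue_scores, weights))
```

In the paper the cue weights are themselves learned by solving a constrained optimization problem (with projections onto the simplex, cf. Duchi et al. 2008); the hard-coded weights above merely take the place of that learning step.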


Notes

  1. Note that the images of the Visual Madlibs dataset are sampled from the MSCOCO dataset (Lin et al. 2014) to contain at least one person.

  2. The Madlibs training set contains only the correct image descriptions, not the incorrect distractor choices.

References

  • Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016a). Deep compositional question answering with neural module networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016b). Neural module networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual Question Answering. In IEEE International Conference on Computer Vision (ICCV).

  • Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). DBpedia: A Nucleus for a Web of Open Data. In International Semantic Web Conference, Asian Semantic Web Conference (ISWC + ASWC).

  • Bourdev, L., Maji, S., & Malik, J. (2011). Describing people: Poselet-based attribute classification. In IEEE International Conference on Computer Vision (ICCV).

  • Chao, Y. W., Wang, Z., He, Y., Wang, J., & Deng, J. (2015). HICO: A benchmark for recognizing human-object interactions in images. In IEEE International Conference on Computer Vision (ICCV).

  • Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the l1-ball for learning in high dimensions. In International Conference on Machine Learning (ICML).

  • Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., & Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

  • Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., & Xu, W. (2015). Are you talking to a machine? dataset and methods for multilingual image question answering. In Neural Information Processing Systems (NIPS).

  • Geman, D., Geman, S., Hallonquist, N., & Younes, L. (2015). Visual Turing test for computer vision systems. PNAS, 112(12), 3618–3623.

  • Girshick, R. (2015). Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV).

  • Gong, Y., Ke, Q., Isard, M., & Lazebnik, S. (2014). A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 106(2), 210–233.

  • Hardoon, D., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639–2664.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778.

  • Hotelling, H. (1936). Relations between two sets of variables. Biometrika, 28, 312–377.

  • Ilievski, I., Yan, S., & Feng, J. (2016). A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485.

  • Lassila, O., & Swick, R. R. (1999). Resource Description Framework (RDF) Model and Syntax Specification. Tech. rep., W3C, http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/.

  • Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV).

  • Liu, H., & Singh, P. (2004). ConceptNet: A practical commonsense reasoning tool-kit. BT Technology Journal, 22(4), 211–226.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2015). SSD: Single shot multibox detector. arXiv preprint arXiv:1512.02325.

  • Lyu, S. (2005). Mercer kernels for object recognition with local features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Malinowski, M., & Fritz, M. (2014). A multi-world approach to question answering about real-world scenes based on uncertain input. In Neural Information Processing Systems (NIPS).

  • Malinowski, M., Rohrbach, M., & Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In Neural Information Processing Systems (NIPS).

  • Mallya, A., & Lazebnik, S. (2016). Learning models for actions and person-object interactions with transfer to question answering. In European Conference on Computer Vision (ECCV).

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems (NIPS).

  • Mokarian, A., Malinowski, M., & Fritz, M. (2016). Mean box pooling: A rich image representation and output embedding for the visual madlibs task. In British Machine Vision Conference (BMVC).

  • Pishchulin, L., Andriluka, M., & Schiele, B. (2014). Fine-grained activity recognition with holistic and pose based features. In German Conference on Pattern Recognition (GCPR).

  • Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazebnik, S. (2017). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision, 123(1), 74–93.

  • Ren, M., Kiros, R., & Zemel, R. (2015a). Exploring models and data for image question answering. In Neural Information Processing Systems (NIPS).

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015b). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS).

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. IJCV, 115(3), 211–252.

  • Saito, K., Shin, A., Ushiku, Y., & Harada, T. (2017). Dualnet: Domain-invariant network for visual question answering. In IEEE International Conference on Multimedia and Expo (ICME), pp 829–834.

  • Shih, K. J., Singh, S., & Hoiem, D. (2016). Where to look: Focus regions for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Socher, R., Bauer, J., Manning, C. D., & Ng, A. Y. (2013). Parsing with compositional vector grammars. In Annual Meeting of the Association for Computational Linguistics (ACL).

  • Sudowe, P., Spitzer, H., & Leibe, B. (2015). Person attribute recognition with a jointly-trained holistic CNN model. In ICCV'15 ChaLearn Looking at People Workshop.

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Tandon, N., de Melo, G., Suchanek, F., & Weikum, G. (2014). Webchild: Harvesting and organizing commonsense knowledge from the web. In ACM International Conference on Web Search and Data Mining.

  • Tommasi, T., Mallya, A., Plummer, B., Lazebnik, S., Berg, A. C., & Berg, T. L. (2016). Solving Visual Madlibs with multiple cues. In British Machine Vision Conference (BMVC).

  • Wang, P., Wu, Q., Shen, C., Dick, A., & van den Hengel, A. (2017a). FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

  • Wang, P., Wu, Q., Shen, C., & van den Hengel, A. (2017b). The VQA-machine: Learning how to use existing vision algorithms to answer new questions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Wu, Q., Shen, C., van den Hengel, A., Wang, P., & Dick, A. (2016a). Image captioning and visual question answering based on attributes and their related external knowledge. arXiv preprint arXiv:1603.02814.

  • Wu, Q., Wang, P., Shen, C., Dick, A. R., & van den Hengel, A. (2016b). Ask me anything: Free-form visual question answering based on knowledge from external sources. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 4622–4630.

  • Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Xu, H., & Saenko, K. (2015). Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. arXiv preprint arXiv:1511.05234.

  • Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. J. (2016). Stacked attention networks for image question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Yu, D., Fu, J., Mei, T., & Rui, Y. (2017). Multi-level attention networks for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Yu, L., Park, E., Berg, A. C., & Berg, T. L. (2015). Visual Madlibs: Fill in the blank image generation and question answering. In IEEE International Conference on Computer Vision (ICCV).

  • Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Neural Information Processing Systems (NIPS).

  • Zhu, Y., Zhang, C., Ré, C., & Fei-Fei, L. (2015). Building a large-scale multimodal knowledge base for visual question answering. arXiv preprint arXiv:1507.05670.

  • Zhu, Y., Groth, O., Bernstein, M., & Fei-Fei, L. (2016). Visual7W: Grounded Question Answering in Images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grants 1302438, 1563727, 1405822, 1444234, 1562098, 1633295, 1452851, Xerox UAC, Microsoft Research Faculty Fellowship, and the Sloan Foundation Fellowship. T.T. was partially supported by the ERC Grant 637076 - RoboExNovo.

Author information

Corresponding author

Correspondence to Tatiana Tommasi.

Additional information

Communicated by Xiaoou Tang.

About this article

Cite this article

Tommasi, T., Mallya, A., Plummer, B. et al. Combining Multiple Cues for Visual Madlibs Question Answering. Int J Comput Vis 127, 38–60 (2019). https://doi.org/10.1007/s11263-018-1096-0
