
Resolving vision and language ambiguities together

Published: 01 October 2017

Highlights

  • We perform semantic segmentation and prepositional phrase attachment resolution.
  • Multiple hypotheses are shown to be crucial to improved multiple-module reasoning.
  • Joint reasoning produces more accurate results than any module operating in isolation.

Abstract

We present an approach that simultaneously performs semantic segmentation and prepositional phrase attachment resolution for captioned images. Some ambiguities in language cannot be resolved without reasoning about an associated image. Consider the sentence "I shot an elephant in my pajamas": from the language alone (and without common sense), it is unclear whether the person, the elephant, or both are wearing the pajamas. Our approach produces a diverse set of plausible hypotheses for both semantic segmentation and prepositional phrase attachment resolution, which are then jointly re-ranked to select the most consistent pair. We show that our semantic segmentation and prepositional phrase attachment resolution modules have complementary strengths, and that joint reasoning produces more accurate results than any module operating in isolation. Multiple hypotheses are also shown to be crucial to improved multiple-module reasoning. Our vision and language approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 17.91% (28.69% relative) and 12.83% (25.28% relative) in two different experiments. We also make small improvements over DeepLab-CRF (Chen et al., 2015).
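To make the joint re-ranking idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation: each module proposes a small, diverse set of hypotheses, and every (segmentation, parse) pair is scored by a weighted sum of the modules' own confidences plus a cross-modal consistency term. The data structures, scoring functions, weights, and toy scores below are hypothetical placeholders chosen only to illustrate the selection step.

```python
# Hypothetical sketch of joint re-ranking over diverse hypothesis pairs.
from itertools import product

def segmentation_score(seg):
    """Placeholder: module-internal confidence for a segmentation hypothesis."""
    return seg["score"]

def parse_score(parse):
    """Placeholder: module-internal confidence for a PP-attachment hypothesis."""
    return parse["score"]

def consistency(seg, parse):
    """Placeholder cross-modal agreement: does the proposed attachment target
    correspond to a region actually present in the segmentation?"""
    return 1.0 if parse["attach_to"] in seg["regions"] else 0.0

def joint_rerank(seg_hyps, parse_hyps, w_seg=1.0, w_parse=1.0, w_joint=2.0):
    """Score every (segmentation, parse) pair and return the best-scoring pair."""
    def pair_score(pair):
        seg, parse = pair
        return (w_seg * segmentation_score(seg)
                + w_parse * parse_score(parse)
                + w_joint * consistency(seg, parse))
    return max(product(seg_hyps, parse_hyps), key=pair_score)

# Toy example for "I shot an elephant in my pajamas".
seg_hyps = [
    {"regions": {"person", "pajamas"}, "score": 0.9},
    {"regions": {"person", "elephant", "pajamas"}, "score": 0.7},
]
parse_hyps = [
    {"attach_to": "elephant", "score": 0.6},  # pajamas attach to the elephant
    {"attach_to": "person", "score": 0.5},    # pajamas attach to the speaker
]

best_seg, best_parse = joint_rerank(seg_hyps, parse_hyps)
print(best_parse["attach_to"])  # -> "person" under these toy scores
```

Under these made-up scores, the pair combining the person-plus-pajamas segmentation with the person attachment wins because the consistency term outweighs the small per-module differences, which is the point of re-ranking pairs rather than trusting either module alone.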

References

[1]
F. Ahmed, D. Tarlow, D. Batra, Optimizing expected intersection-over-union with candidate-constrained CRFs, 2015.
[2]
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C.L. Zitnick, D. Parikh, VQA: visual question answering, ICCV (2015).
[3]
K. Bach, Routledge Encyclopedia of Philosophy entry, 2016. http://online.sfsu.edu/kbach/ambguity.html.
[4]
K. Barnard, M. Johnson, D. Forsyth, Word sense disambiguation with pictures, Association for Computational Linguistics, 2003.
[5]
D. Batra, An efficient message-passing algorithm for the M-Best MAP problem, 2012.
[6]
D. Batra, P. Yadollahpour, A. Guzman-Rivera, G. Shakhnarovich, Diverse M-Best solutions in Markov random fields, 2012.
[7]
Y. Berzak, A. Barbu, D. Harari, B. Katz, S. Ullman, Do you see what I mean? Visual resolution of linguistic ambiguities, arXiv preprint arXiv:1603.08079, 2016.
[8]
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs, 2015.
[9]
E. Davis, Notes on ambiguity, 2016. http://cs.nyu.edu/faculty/davise/ai/ambiguity.html.
[10]
M.-C. De Marneffe, B. MacCartney, C.D. Manning, Generating typed dependency parses from phrase structure parses, 2006.
[11]
M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, IJCV, 88 (2010) 303-338.
[12]
H. Fang, S. Gupta, F.N. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J.C. Platt, C.L. Zitnick, G. Zweig, From captions to visual concepts and back, 2015.
[13]
S. Fidler, A. Sharma, R. Urtasun, A sentence is worth a thousand pixels, 2013.
[14]
S. Gella, M. Lapata, F. Keller, Unsupervised visual sense disambiguation for verbs using multimodal embeddings, arXiv preprint arXiv:1603.09188, 2016.
[15]
D. Geman, S. Geman, N. Hallonquist, L. Younes, A visual turing test for computer vision systems, 2014.
[16]
K. Gimpel, D. Batra, C. Dyer, G. Shakhnarovich, A systematic exploration of diversity in machine translation, 2013.
[17]
A. Guzman-Rivera, P. Kohli, D. Batra, DivMCuts: faster training of structural SVMs with diverse M-Best cutting-planes, 2013.
[18]
G. Heitz, S. Gould, A. Saxena, D. Koller, Cascaded classification models: combining models for holistic scene understanding, 2008.
[19]
L. Huang, D. Chiang, Better k-best parsing, 2005.
[20]
J.H. Kappes, B. Andres, F.A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B.X. Kausler, J. Lellmann, N. Komodakis, C. Rother, A comparative study of modern inference techniques for discrete energy minimization problems, 2013.
[21]
C. Kong, D. Lin, M. Bansal, R. Urtasun, S. Fidler, What are you talking about? text-to-image coreference, 2014.
[22]
T. Lan, W. Yang, Y. Wang, G. Mori, Image retrieval with structured object queries using latent ranking SVM, 2012.
[23]
M. Malinowski, M. Fritz, A pooling approach to modelling spatial relations for image retrieval and annotation, arXiv preprint arXiv:1411.5190, 2014.
[24]
M. Malinowski, M. Rohrbach, M. Fritz, Ask your neurons: a neural-based approach to answering questions about images, 2015.
[25]
T. Meltzer, C. Yanover, Y. Weiss, Globally optimal solutions for energy minimization in stereo vision using reweighted belief propagation, 2005.
[26]
T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013.
[27]
R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, A. Yuille, The role of context for object detection and semantic segmentation in the wild, 2014.
[28]
M. Poesio, R. Artstein, Annotating (anaphoric) ambiguity, 2005.
[29]
A. Prasad, S. Jegelka, D. Batra, Submodular meets structured: finding diverse subsets in exponentially-large structured item sets, 2014.
[30]
V. Premachandran, D. Tarlow, D. Batra, Empirical minimum Bayes risk prediction: how to extract an extra few% performance from vision models with just three more parameters, 2014.
[31]
C. Rashtchian, P. Young, M. Hodosh, J. Hockenmaier, Collecting image annotations using Amazon's Mechanical Turk, 2010.
[32]
A. Ratnaparkhi, J. Reynar, S. Roukos, A maximum entropy model for prepositional phrase attachment, 1994.
[33]
Q. Sun, A. Laddha, D. Batra, Active learning for structured probabilistic models with histogram approximation, 2015.
[34]
R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, C. Rother, A comparative study of energy minimization methods for Markov random fields with smoothness-based priors, PAMI, 30 (2008) 1068-1080.
[35]
R. Vedantam, C.L. Zitnick, D. Parikh, CIDEr: consensus-based image description evaluation, 2014.
[36]
O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: a neural image caption generator, 2015.
[37]
P. Yadollahpour, D. Batra, G. Shakhnarovich, Discriminative re-ranking of diverse segmentations, 2013.
[38]
M. Yatskar, M. Galley, L. Vanderwende, L. Zettlemoyer, See no evil, say no evil: description generation from densely labeled images, 2014.
[39]
L. Yu, E. Park, A.C. Berg, T.L. Berg, Visual Madlibs: fill in the blank description generation and question answering, 2015.
[40]
C.L. Zitnick, D. Parikh, Bringing semantics into focus using visual abstraction, 2013.

Cited By

  • (2024) Extraction and Analysis of Semantic Features of English Texts under Intelligent Algorithms, Automatic Control and Computer Sciences, 58:1, 109-115. DOI: 10.3103/S0146411624010115. Online publication date: 1-Feb-2024.
  • (2021) A novel automatic image caption generation using bidirectional long-short term memory framework, Multimedia Tools and Applications, 80:17, 25557-25582. DOI: 10.1007/s11042-021-10632-6. Online publication date: 1-Jul-2021.
  • (2020) Grounded language interpretation of robotic commands through structured learning, Artificial Intelligence, 278:C. DOI: 10.1016/j.artint.2019.103181. Online publication date: 1-Jan-2020.


    Published In

    Computer Vision and Image Understanding, Volume 163, Issue C
    October 2017
    89 pages

    Publisher

    Elsevier Science Inc.

    United States

    Publication History

    Published: 01 October 2017

    Author Tags

    1. Prepositional phrase ambiguity resolution
    2. Semantic segmentation

    Qualifiers

    • Research-article
